From 1a4a6cb6d2aa2c8512e9637dc5dd95997321c444 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Sun, 4 Jan 2026 17:57:39 +0800 Subject: Fix the search engine post. --- _log/site-search.md | 120 ++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 93 insertions(+), 27 deletions(-) (limited to '_log/site-search.md') diff --git a/_log/site-search.md b/_log/site-search.md index 6af9f63..d178d07 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,25 +1,49 @@ --- -title: Perl + FastCGI + SA search engine -date: 2026-01-02 +title: Search engine (Perl + FastCGI + SA) +date: 2026-01-03 layout: post --- -Number of articles growing. Need search. +Article count on the website is growing. Need a way to search. -Requirements: substring match, case-insensitive, fast, secure. No JavaScript. +Requirements: matches substrings, case-insensitive, fast, secure. No +JavaScript. -Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script. +Architecture: browser → httpd → slowcgi → Perl CGI script. -Data structure: suffix array. Three files: corpus.bin (articles), sa.bin -(sorted byte offsets), file_map.dat (metadata). +httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access +governed by file system permissions--no secrets to manage. Secure by default. -Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null -byte sentinel for document boundaries. Sort lexicographically:: +2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files +using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast +enough. Start to feel the O(N) pull at higher file counts. + +Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can +be mitigated. Tempted to stop here. + +2026-01-03: Suffix array (SA) based index lookup. + +Inefficiency of scanning every file on each request troubles me. Regex search +depends almost entirely on hardware for speed. + +Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted +byte offsets), file_map.dat (metadata). Index created with the site: + +``` +$ JEKYLL_ENV=production bundle exec jekyll build +$ cd cgi-bin/ +$ perl indexer.pl +``` + +Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and +saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix +array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the +lexicographical order: ``` -# Use a block that forces byte-level comparison +my @sa = 0 .. (length($corpus) - 1); { - use bytes; + use bytes; # Force compare 8-bit Unicode value comparisons @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || @@ -27,31 +51,73 @@ byte sentinel for document boundaries. Sort lexicographically:: (substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } + +CORPUS: a s c i \0 i m x +OFFSET: 0 1 2 3 4 5 6 7 + +SORTED SUFFIXES: + +[4] \0imx +[0] asci\0imx +[2] ci\0imx +[3] i\0imx <-- "i" from "asci" +[5] imx <-- "i" from "imx" +[6] mx +[1] sci\0imx +[7] x +``` + +Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a +cache line), reducing complexity to O(N log N). + +Search: Textbook range query with two binary searches. Random access to offsets +and the text is possible via the fixed-width offsets: + +``` +seek($fh_sa, $mid * 4, 0); +read($fh_sa, my $bin_off, 4); +my $off = unpack("L", $bin_off); +seek($fh_cp, $off, 0); +read($fh_cp, my $text, $query_len); ``` -Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte -length targets cache lines. +Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy +to implement. Note to self: mmap. -Search: binary search for range query. Cap at 20 results--define limits or be -surprised by them. +Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). -File IO and memory: many seek/read small chunks beat one large allocation (see -benchmarks for find_one_file.cgi). +500 files: + - Index size: 204.94 KB + - Indexing time: 0.1475 s + - Peak RAM (SA): 8828 KB + - Peak RAM (Regex): 9136 KB + - Search (SA): 0.0012 s + - Search (Regex): 0.0407 s -Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles): +1,000 files: + - Index size: 410.51 KB + - Indexing time: 0.3101 s + - Peak RAM (SA): 8980 KB + - Peak RAM (Regex): 9460 KB + - Search (SA): 0.0019 s + - Search (Regex): 0.0795 s -1,000 files: 0.31s indexing, 410 KB total index. -10,000 files: 10.97s indexing, 4.16 MB total index. +10,000 files: + - Index size: 4163.44 KB + - Indexing time: 10.9661 s + - Peak RAM (SA): 12504 KB + - Peak RAM (Regex): 12804 KB + - Search (SA): 0.0161 s + - Search (Regex): 0.9120 s -Search 'arduino' (0 matches): -1,000 files: 0.002s (SA) vs 0.080s (naive regex). -10,000 files: 0.016s (SA) vs 0.912s (naive regex). +Security: Much of it comes from its architectural simplicity. No dependencies +to manage, no secrets to hide, no assets for lateral movement. Runs in chroot. -Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). -Sanitize input--strip non-printables, limit length, and quote metacharacters -(ReDOS). No exec/system (command injection). Chroot. +Resource exhaustion and XSS attacks are inherent. The former is mitigated by +limiting concurrent searches via lock-file semaphores and capping query length +(64B) and result sets (20). All output is HTML-escaped to prevent XSS. -Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies. +Verdict: Fast SA lookup; Works on every conceivable web browser. Commit: [6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e) -- cgit v1.2.3