diff options
| author | Sadeep Madurange <sadeep@asciimx.com> | 2026-04-23 22:39:28 +0800 |
|---|---|---|
| committer | Sadeep Madurange <sadeep@asciimx.com> | 2026-04-23 22:49:25 +0800 |
| commit | bc38c10bfdd23fd51c7c3c320de1d84ac98b8dfd (patch) | |
| tree | 4f219e23de1da90c710f5463fb3371e610685a0b | |
| parent | 5bf4df90566d83542d17c7d29edfe4584eb79db7 (diff) | |
| download | www-bc38c10bfdd23fd51c7c3c320de1d84ac98b8dfd.tar.gz | |
Tuen site search post into a journal entry.
| -rw-r--r-- | _log/site-search.md | 71 | ||||
| -rw-r--r-- | _log/vcs-1.md | 2 |
2 files changed, 29 insertions, 44 deletions
diff --git a/_log/site-search.md b/_log/site-search.md index 0848dce..50de7b2 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,32 +1,17 @@ --- -title: Built a search engine for website based on suffix arrays +title: Overengineered search for my website date: 2026-01-03 layout: post --- -Built search. +Developed a suffix-array-based search engine for the site today. While a simple +regex search was enough, I couldn't resist the technical elegance of a proper +index. -Requirements: match substrings, ignore case, handle 50 concurrent queries with -1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript, -text browsers. - -Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine. - -Server-side regex is viable for a personal site. But an index has clear -advantages. Not that much harder to implement. - -Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin, -file_map.dat—built with site: - -``` -$ JEKYLL_ENV=production bundle exec jekyll build -$ cd cgi-bin/ -$ perl indexer.pl -``` - -Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel -marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned -integers, sorted lexicographically: +Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, +and encode it into UTF-8 bytes. Used a null byte sentinel to mark document +boundaries and stored the lexicographically sorted 32-bit unsigned integer +offsets to sa.bin: ``` my @sa = 0 .. (length($corpus) - 1); @@ -41,14 +26,16 @@ my @sa = 0 .. (length($corpus) - 1); } ``` -32-bit offsets limit index size to 4GB—more than sufficient and, if necessary, -easily expanded. +32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but +comforting to have. -Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes -(length of a typical cache line). +It takes about 50ms to index my 12-entry website on a T490. As the site grows, +the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path +that caps L at 64 bytes—roughly the size of a cache line on common hardware. -Search: Textbook range query with two binary searches. Fixed-width offsets -enable random access to index: +Search: Implemented the search in a FastCGI script as a textbook range query +with two binary searches. Leveraged the fixed-width offsets for fast random +access to the index: ``` seek($fh_sa, $mid * 4, 0); @@ -58,10 +45,12 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` -Small seek/reads are fast on modern SSDs; keeps memory footprint small. +Chose seek + read over mmap because it outperformed mmap for <1k files. At +10k, mmap was occasionally faster (~200 µs), but used more memory—possibly +due to OpenBSD’s VM security trade-offs. Results may vary by OS. -Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear -regex search: +Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against +linear regex search: <pre class="pre-no-style"> ============================================================= @@ -100,19 +89,15 @@ Index size | 4163.44 KB | N/A ------------------------------------------------------------- </pre> -Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was -occasionally faster (~200 µs), but used more memory—possibly OpenBSD's VM -security trade-offs. Results may vary by OS. - -Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions -govern access. Runs in chroot. +Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system +permissions to govern access. Hardened the system by running it in chroot. -Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit -concurrent searches. Query length (64B) and result set (20) are capped. All -output is HTML-escaped to prevent XSS. +Resource exhaustion and XSS attacks are inherent. Limited concurrent searches +using lock-file semaphores, and capped the query length (64 B) and the result +set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape. -Warranty: 10,000 / 12 → 833 years. <br> -Next release: inverted index—Anno Domini 2859. +At my current pace of twelve articles a year, the search should work without +intervention for the next 833 years. Penciled in the next release for 2859 AD. Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e" diff --git a/_log/vcs-1.md b/_log/vcs-1.md index c0a7504..d6ae0b2 100644 --- a/_log/vcs-1.md +++ b/_log/vcs-1.md @@ -46,7 +46,7 @@ Status and add commands scan the work tree, sort entries by path, and perform a two-finger walk with the index. Linear index access trades random-access speed for sequential IO. -Operations are performed in memory — often using text streams and pipes. +Operations are performed in memory—often using text streams and pipes. MEM_LIMIT can be used to fall back on the disk for large repositories: ``` |
