From bc38c10bfdd23fd51c7c3c320de1d84ac98b8dfd Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Thu, 23 Apr 2026 22:39:28 +0800 Subject: Tuen site search post into a journal entry. --- _log/site-search.md | 71 +++++++++++++++++++++-------------------------------- _log/vcs-1.md | 2 +- 2 files changed, 29 insertions(+), 44 deletions(-) diff --git a/_log/site-search.md b/_log/site-search.md index 0848dce..50de7b2 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,32 +1,17 @@ --- -title: Built a search engine for website based on suffix arrays +title: Overengineered search for my website date: 2026-01-03 layout: post --- -Built search. +Developed a suffix-array-based search engine for the site today. While a simple +regex search was enough, I couldn't resist the technical elegance of a proper +index. -Requirements: match substrings, ignore case, handle 50 concurrent queries with -1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript, -text browsers. - -Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine. - -Server-side regex is viable for a personal site. But an index has clear -advantages. Not that much harder to implement. - -Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin, -file_map.dat—built with site: - -``` -$ JEKYLL_ENV=production bundle exec jekyll build -$ cd cgi-bin/ -$ perl indexer.pl -``` - -Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel -marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned -integers, sorted lexicographically: +Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, +and encode it into UTF-8 bytes. Used a null byte sentinel to mark document +boundaries and stored the lexicographically sorted 32-bit unsigned integer +offsets to sa.bin: ``` my @sa = 0 .. (length($corpus) - 1); @@ -41,14 +26,16 @@ my @sa = 0 .. (length($corpus) - 1); } ``` -32-bit offsets limit index size to 4GB—more than sufficient and, if necessary, -easily expanded. +32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but +comforting to have. -Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes -(length of a typical cache line). +It takes about 50ms to index my 12-entry website on a T490. As the site grows, +the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path +that caps L at 64 bytes—roughly the size of a cache line on common hardware. -Search: Textbook range query with two binary searches. Fixed-width offsets -enable random access to index: +Search: Implemented the search in a FastCGI script as a textbook range query +with two binary searches. Leveraged the fixed-width offsets for fast random +access to the index: ``` seek($fh_sa, $mid * 4, 0); @@ -58,10 +45,12 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` -Small seek/reads are fast on modern SSDs; keeps memory footprint small. +Chose seek + read over mmap because it outperformed mmap for <1k files. At +10k, mmap was occasionally faster (~200 µs), but used more memory—possibly +due to OpenBSD’s VM security trade-offs. Results may vary by OS. -Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear -regex search: +Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against +linear regex search:
 =============================================================
@@ -100,19 +89,15 @@ Index size      |           4163.44 KB |                 N/A
 -------------------------------------------------------------
 
-Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was -occasionally faster (~200 µs), but used more memory—possibly OpenBSD's VM -security trade-offs. Results may vary by OS. - -Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions -govern access. Runs in chroot. +Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system +permissions to govern access. Hardened the system by running it in chroot. -Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit -concurrent searches. Query length (64B) and result set (20) are capped. All -output is HTML-escaped to prevent XSS. +Resource exhaustion and XSS attacks are inherent. Limited concurrent searches +using lock-file semaphores, and capped the query length (64 B) and the result +set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape. -Warranty: 10,000 / 12 → 833 years.
-Next release: inverted index—Anno Domini 2859. +At my current pace of twelve articles a year, the search should work without +intervention for the next 833 years. Penciled in the next release for 2859 AD. Commit: