From 95a9897b028318e26d68f94e26efc7c6e06e67f8 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Sun, 8 Feb 2026 23:14:31 +0800 Subject: Rewrite site search post. --- _log/site-search.md | 44 +++++++++++++++++++++++++------------------- 1 file changed, 25 insertions(+), 19 deletions(-) (limited to '_log') diff --git a/_log/site-search.md b/_log/site-search.md index 62059e0..f87f6e3 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,17 +1,22 @@ --- -title: 'Search: suffix array + FastCGI' +title: 'Site search: suffix array + FastCGI' date: 2026-01-03 layout: post --- -Needed search for site. +Built search. -Requirements: match substrings, case-insensitive, fast, secure. No JavaScript. +Requirements: match substrings, ignore case, handle 50 concurrent queries with +1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript, +text browsers. -Architecture: browser ↔ httpd ↔ slowcgi ↔ Perl (SA). +Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine. -Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index -built with site: +Server-side regex is viable for a personal site. But an index has clear +advantages. Not much harder to implement. + +Index: suffix array (SA) implemented in Perl. Three files: corpus.bin, sa.bin, +file_map.dat. Built with site: ``` $ JEKYLL_ENV=production bundle exec jekyll build @@ -19,9 +24,9 @@ $ cd cgi-bin/ $ perl indexer.pl ``` -Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte -sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit -unsigned integers, sorted lexicographically: +Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel +marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned +integers, sorted lexicographically: ``` my @sa = 0 .. (length($corpus) - 1); @@ -36,13 +41,14 @@ my @sa = 0 .. (length($corpus) - 1); } ``` -Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64 -bytes (length of a cache line) → O(N log N). +32-bit offsets limit index size to 4GB—more than sufficient and, if necessary, +easily expanded. -32-bit offsets limit index size to 4GB. +Sort is the real bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at +64 bytes (length of a typical cache line). Search: Textbook range query with twin binary searches. Fixed-width offsets -enable random access: +enable random access to index: ``` seek($fh_sa, $mid * 4, 0); @@ -52,7 +58,7 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` -Small seek/reads are fast on modern SSDs. Keeps memory usage low. +Small seek/reads are fast on modern SSDs; keeps memory footprint small. NOTE: mmap. @@ -81,18 +87,18 @@ regex search: - Peak RAM (SA): 12504 KB - Peak RAM (Regex): 12804 KB - Search (SA): 0.0161 s - - Search (Regex): 0.9120 S + - Search (Regex): 0.9120 s -Security: httpd, slowcgi, Perl are in the base system--no dependencies. File -system permissions govern access. Runs in chroot. +Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions +govern access. Runs in chroot. Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit -concurrent searches. Query length (64B) and result set (20) are capped. All +concurrent searches; query length (64B) and result set (20) are capped. All output is HTML-escaped to prevent XSS. Warranty: 10,000 / 12 → 833 years. -Next release: inverted index; Anno Domini 2859. +Next: inverted index, release date: Anno Domini 2859. Commit: