From bc38c10bfdd23fd51c7c3c320de1d84ac98b8dfd Mon Sep 17 00:00:00 2001
From: Sadeep Madurange <sadeep@asciimx.com>
Date: Thu, 23 Apr 2026 22:39:28 +0800
Subject: Tuen site search post into a journal entry.

---
 _log/site-search.md | 71 +++++++++++++++++++++--------------------------------
 _log/vcs-1.md       |  2 +-
 2 files changed, 29 insertions(+), 44 deletions(-)
diff --git a/_log/site-search.md b/_log/site-search.md
index 0848dce..50de7b2 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,32 +1,17 @@
 ---
-title: Built a search engine for website based on suffix arrays
+title: Overengineered search for my website
 date: 2026-01-03
 layout: post
 ---
 
-Built search.
+Developed a suffix-array-based search engine for the site today. While a simple
+regex search was enough, I couldn't resist the technical elegance of a proper
+index.
 
-Requirements: match substrings, ignore case, handle 50 concurrent queries with
-1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript,
-text browsers.
-
-Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine.
-
-Server-side regex is viable for a personal site. But an index has clear
-advantages. Not that much harder to implement.
-
-Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin,
-file_map.dat—built with site:
-
-```
-$ JEKYLL_ENV=production bundle exec jekyll build
-$ cd cgi-bin/
-$ perl indexer.pl
-```
-
-Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel
-marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned
-integers, sorted lexicographically:
+Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text,
+and encode it into UTF-8 bytes. Used a null byte sentinel to mark document
+boundaries and stored the lexicographically sorted 32-bit unsigned integer
+offsets to sa.bin:
 
 ```
 my @sa = 0 .. (length($corpus) - 1);
@@ -41,14 +26,16 @@ my @sa = 0 .. (length($corpus) - 1);
 }
 ```
 
-32-bit offsets limit index size to 4GB—more than sufficient and, if necessary,
-easily expanded. 
+32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
+comforting to have.
 
-Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes
-(length of a typical cache line).
+It takes about 50ms to index my 12-entry website on a T490. As the site grows,
+the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path
+that caps L at 64 bytes—roughly the size of a cache line on common hardware.
 
-Search: Textbook range query with two binary searches. Fixed-width offsets
-enable random access to index:
+Search: Implemented the search in a FastCGI script as a textbook range query
+with two binary searches. Leveraged the fixed-width offsets for fast random
+access to the index:
 
 ```
 seek($fh_sa, $mid * 4, 0);
@@ -58,10 +45,12 @@ seek($fh_cp, $off, 0);
 read($fh_cp, my $text, $query_len);
 ```
 
-Small seek/reads are fast on modern SSDs; keeps memory footprint small. 
+Chose seek + read over mmap because it outperformed mmap for <1k files. At
+10k, mmap was occasionally faster (~200 µs), but used more memory—possibly
+due to OpenBSD’s VM security trade-offs. Results may vary by OS.
 
-Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear
-regex search:
+Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against
+linear regex search:
 
 <pre class="pre-no-style">
 =============================================================
@@ -100,19 +89,15 @@ Index size      |           4163.44 KB |                 N/A
 -------------------------------------------------------------
 </pre>
 
-Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was
-occasionally faster (~200 µs), but used more memory—possibly OpenBSD's VM
-security trade-offs. Results may vary by OS.
-
-Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions
-govern access. Runs in chroot. 
+Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
+permissions to govern access. Hardened the system by running it in chroot.
 
-Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
-concurrent searches. Query length (64B) and result set (20) are capped. All
-output is HTML-escaped to prevent XSS.
+Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
+using lock-file semaphores, and capped the query length (64 B) and the result
+set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
 
-Warranty: 10,000 / 12 → 833 years. <br>
-Next release: inverted index—Anno Domini 2859.
+At my current pace of twelve articles a year, the search should work without
+intervention for the next 833 years. Penciled in the next release for 2859 AD.
 
 Commit: <a
 href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
diff --git a/_log/vcs-1.md b/_log/vcs-1.md
index c0a7504..d6ae0b2 100644
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ -46,7 +46,7 @@ Status and add commands scan the work tree, sort entries by path, and perform a
 two-finger walk with the index. Linear index access trades random-access speed
 for sequential IO.
 
-Operations are performed in memory — often using text streams and pipes.
+Operations are performed in memory—often using text streams and pipes.
 MEM_LIMIT can be used to fall back on the disk for large repositories:
 
 ```
-- 
cgit v1.2.3