From 95a9897b028318e26d68f94e26efc7c6e06e67f8 Mon Sep 17 00:00:00 2001
From: Sadeep Madurange <sadeep@asciimx.com>
Date: Sun, 8 Feb 2026 23:14:31 +0800
Subject: Rewrite site search post.

---
 _log/site-search.md | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

(limited to '_log')

diff --git a/_log/site-search.md b/_log/site-search.md
index 62059e0..f87f6e3 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,17 +1,22 @@
 ---
-title: 'Search: suffix array + FastCGI'
+title: 'Site search: suffix array + FastCGI'
 date: 2026-01-03
 layout: post
 ---
 
-Needed search for site.
+Built search.
 
-Requirements: match substrings, case-insensitive, fast, secure. No JavaScript.
+Requirements: match substrings, ignore case, handle 50 concurrent queries with
+1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript,
+text browsers.
 
-Architecture: browser ↔ httpd ↔ slowcgi ↔ Perl (SA).
+Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine.
 
-Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
-built with site:
+Server-side regex is viable for a personal site. But an index has clear
+advantages. Not much harder to implement.
+
+Index: suffix array (SA) implemented in Perl. Three files: corpus.bin, sa.bin,
+file_map.dat. Built with site:
 
 ```
 $ JEKYLL_ENV=production bundle exec jekyll build
@@ -19,9 +24,9 @@ $ cd cgi-bin/
 $ perl indexer.pl
 ```
 
-Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte
-sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit
-unsigned integers, sorted lexicographically:
+Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel
+marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned
+integers, sorted lexicographically:
 
 ```
 my @sa = 0 .. (length($corpus) - 1);
@@ -36,13 +41,14 @@ my @sa = 0 .. (length($corpus) - 1);
 }
 ```
 
-Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64
-bytes (length of a cache line) → O(N log N).
+32-bit offsets limit index size to 4GB—more than sufficient and, if necessary,
+easily expanded. 
 
-32-bit offsets limit index size to 4GB.
+Sort is the real bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at
+64 bytes (length of a typical cache line).
 
 Search: Textbook range query with twin binary searches. Fixed-width offsets
-enable random access:
+enable random access to index:
 
 ```
 seek($fh_sa, $mid * 4, 0);
@@ -52,7 +58,7 @@ seek($fh_cp, $off, 0);
 read($fh_cp, my $text, $query_len);
 ```
 
-Small seek/reads are fast on modern SSDs. Keeps memory usage low. 
+Small seek/reads are fast on modern SSDs; keeps memory footprint small. 
 
 NOTE: mmap.
 
@@ -81,18 +87,18 @@ regex search:
  - Peak RAM (SA): 12504 KB
  - Peak RAM (Regex): 12804 KB
  - Search (SA): 0.0161 s
- - Search (Regex): 0.9120 S
+ - Search (Regex): 0.9120 s
 
-Security: httpd, slowcgi, Perl are in the base system--no dependencies. File
-system permissions govern access. Runs in chroot. 
+Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions
+govern access. Runs in chroot. 
 
 Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
-concurrent searches. Query length (64B) and result set (20) are capped. All
+concurrent searches; query length (64B) and result set (20) are capped. All
 output is HTML-escaped to prevent XSS.
 
 Warranty: 10,000 / 12 → 833 years. 
 
-Next release: inverted index; Anno Domini 2859. 
+Next: inverted index, release date: Anno Domini 2859. 
 
 Commit: <a
 href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
-- 
cgit v1.2.3