Tighten prose.

author: Sadeep Madurange <sadeep@asciimx.com> 2026-01-08 22:28:37 +0800
committer: Sadeep Madurange <sadeep@asciimx.com> 2026-01-08 22:28:37 +0800
commit: e836c4b9e78cc3892cdebf8126cb650f1b91ed37 (patch)
tree: 4e52155aef0105cc9b888a42b3e760455a2bcb36 /_log/site-search.md
parent: 57ff09d2eefefa2462a2af0175e3e8164c7bc828 (diff)
download: www-e836c4b9e78cc3892cdebf8126cb650f1b91ed37.tar.gz
1 files changed, 33 insertions, 30 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index d178d07..0cfbf99 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,33 +1,34 @@
 ---
-title: Search engine (Perl + FastCGI + SA)
+title: Site search with suffix array
 date: 2026-01-03
 layout: post
 ---
 
-Article count on the website is growing. Need a way to search.
+Article count is growing. Need a way to search.
 
 Requirements: matches substrings, case-insensitive, fast, secure. No
 JavaScript.
 
 Architecture: browser → httpd → slowcgi → Perl CGI script.
 
-httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
-governed by file system permissions--no secrets to manage. Secure by default.
+Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
+system permissions govern access.
 
-2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
-using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
-enough. Start to feel the O(N) pull at higher file counts.
+2025-12-30: Regex search.
 
-Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
-be mitigated. Tempted to stop here.
+140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
+higher file counts.
 
-2026-01-03: Suffix array (SA) based index lookup.
+Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
+stop here.
 
-Inefficiency of scanning every file on each request troubles me. Regex search
-depends almost entirely on hardware for speed.
+2026-01-03: Suffix Array (SA) based index lookup.
 
-Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
-byte offsets), file_map.dat (metadata). Index created with the site:
+Slurping files on every request bothers me. Regex search depends almost
+entirely on hardware for speed.
+
+Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
+built with site:
 
 ```
 $ JEKYLL_ENV=production bundle exec jekyll build
@@ -35,10 +36,9 @@ $ cd cgi-bin/
 $ perl indexer.pl
 ```
 
-Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
-saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
-array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
-lexicographical order:
+Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
+byte sentinel for document boundaries. sa.bin stores suffix offsets as
+32-bit unsigned integers, sorted by lexicographical order:
 
 ```
 my @sa = 0 .. (length($corpus) - 1);
@@ -67,11 +67,11 @@ SORTED SUFFIXES:
 [7] x
 ```
 
-Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
-cache line), reducing complexity to O(N log N).
+Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
+line), reducing complexity to O(N log N).
 
-Search: Textbook range query with two binary searches. Random access to offsets
-and the text is possible via the fixed-width offsets:
+Search: Textbook range query with twin binary searches. Random access to
+suffixes and the text made possible by the fixed-width offsets:
 
 ```
 seek($fh_sa, $mid * 4, 0);
@@ -81,8 +81,9 @@ seek($fh_cp, $off, 0);
 read($fh_cp, my $text, $query_len);
 ```
 
-Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
-to implement. Note to self: mmap.
+Small seek/reads are fast on modern SSDs. Keeps memory usage low. 
+
+NOTE: mmap.
 
 Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
 
@@ -110,14 +111,16 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
  - Search (SA): 0.0161 s
  - Search (Regex): 0.9120 s
 
-Security: Much of it comes from its architectural simplicity. No dependencies
-to manage, no secrets to hide, no assets for lateral movement. Runs in chroot. 
+Security: Derived from architectural simplicity. Zero dependencies to manage,
+no secrets to hide, no targets for lateral movement. 
+
+Runs in chroot. 
 
-Resource exhaustion and XSS attacks are inherent. The former is mitigated by
-limiting concurrent searches via lock-file semaphores and capping query length
-(64B) and result sets (20). All output is HTML-escaped to prevent XSS.
+Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
+concurrent searches via lock-file semaphores. Query length (64B) and result set
+(20) capped. All output is HTML-escaped to prevent XSS.
 
-Verdict: Fast SA lookup; Works on every conceivable web browser.
+Secure by default. Fast. Durable.
 
 Commit:
 [6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
author	Sadeep Madurange <sadeep@asciimx.com>	2026-01-08 22:28:37 +0800
committer	Sadeep Madurange <sadeep@asciimx.com>	2026-01-08 22:28:37 +0800
commit	e836c4b9e78cc3892cdebf8126cb650f1b91ed37 (patch)
tree	4e52155aef0105cc9b888a42b3e760455a2bcb36 /_log/site-search.md
parent	57ff09d2eefefa2462a2af0175e3e8164c7bc828 (diff)
download	www-e836c4b9e78cc3892cdebf8126cb650f1b91ed37.tar.gz