From e836c4b9e78cc3892cdebf8126cb650f1b91ed37 Mon Sep 17 00:00:00 2001
From: Sadeep Madurange <sadeep@asciimx.com>
Date: Thu, 8 Jan 2026 22:28:37 +0800
Subject: Tighten prose.

---
 _site/log/site-search/index.html | 65 +++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 31 deletions(-)

(limited to '_site/log/site-search')
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html
index 9170002..e5b9f15 100644
--- a/_site/log/site-search/index.html
+++ b/_site/log/site-search/index.html
@@ -3,7 +3,7 @@
     <head>
   <meta charset="utf-8">
   <meta name="viewport" content="width=device-width, initial-scale=1">
-  <title>Search engine (Perl + FastCGI + SA)</title>
+  <title>Site search with suffix array</title>
   <link rel="stylesheet" href="/assets/css/main.css">
   <link rel="stylesheet" href="/assets/css/skeleton.css">
 </head>
@@ -37,43 +37,43 @@
     <main>
       <div class="container">
         <div class="container-2">
-          <h2 class="center" id="title">SEARCH ENGINE (PERL + FASTCGI + SA)</h2>
+          <h2 class="center" id="title">SITE SEARCH WITH SUFFIX ARRAY</h2>
           <h5 class="center">03 JANUARY 2026</h5>
           <br>
-          <div class="twocol justify"><p>Article count on the website is growing. Need a way to search.</p>
+          <div class="twocol justify"><p>Article count is growing. Need a way to search.</p>
 
 <p>Requirements: matches substrings, case-insensitive, fast, secure. No
 JavaScript.</p>
 
 <p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p>
 
-<p>httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
-governed by file system permissions–no secrets to manage. Secure by default.</p>
+<p>Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
+system permissions govern access.</p>
 
-<p>2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
-using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
-enough. Start to feel the O(N) pull at higher file counts.</p>
+<p>2025-12-30: Regex search.</p>
 
-<p>Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
-be mitigated. Tempted to stop here.</p>
+<p>140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
+higher file counts.</p>
 
-<p>2026-01-03: Suffix array (SA) based index lookup.</p>
+<p>Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
+stop here.</p>
 
-<p>Inefficiency of scanning every file on each request troubles me. Regex search
-depends almost entirely on hardware for speed.</p>
+<p>2026-01-03: Suffix Array (SA) based index lookup.</p>
 
-<p>Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
-byte offsets), file_map.dat (metadata). Index created with the site:</p>
+<p>Slurping files on every request bothers me. Regex search depends almost
+entirely on hardware for speed.</p>
+
+<p>Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
+built with site:</p>
 
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build
 $ cd cgi-bin/
 $ perl indexer.pl
 </code></pre></div></div>
 
-<p>Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
-saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
-array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
-lexicographical order:</p>
+<p>Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
+byte sentinel for document boundaries. sa.bin stores suffix offsets as
+32-bit unsigned integers, sorted by lexicographical order:</p>
 
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1);
 {
@@ -101,11 +101,11 @@ SORTED SUFFIXES:
 [7] x
 </code></pre></div></div>
 
-<p>Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
-cache line), reducing complexity to O(N log N).</p>
+<p>Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
+line), reducing complexity to O(N log N).</p>
 
-<p>Search: Textbook range query with two binary searches. Random access to offsets
-and the text is possible via the fixed-width offsets:</p>
+<p>Search: Textbook range query with twin binary searches. Random access to
+suffixes and the text made possible by the fixed-width offsets:</p>
 
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0);
 read($fh_sa, my $bin_off, 4);
@@ -114,8 +114,9 @@ seek($fh_cp, $off, 0);
 read($fh_cp, my $text, $query_len);
 </code></pre></div></div>
 
-<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
-to implement. Note to self: mmap.</p>
+<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low.</p>
+
+<p>NOTE: mmap.</p>
 
 <p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p>
 
@@ -149,14 +150,16 @@ to implement. Note to self: mmap.</p>
   <li>Search (Regex): 0.9120 s</li>
 </ul>
 
-<p>Security: Much of it comes from its architectural simplicity. No dependencies
-to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.</p>
+<p>Security: Derived from architectural simplicity. Zero dependencies to manage,
+no secrets to hide, no targets for lateral movement.</p>
+
+<p>Runs in chroot.</p>
 
-<p>Resource exhaustion and XSS attacks are inherent. The former is mitigated by
-limiting concurrent searches via lock-file semaphores and capping query length
-(64B) and result sets (20). All output is HTML-escaped to prevent XSS.</p>
+<p>Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
+concurrent searches via lock-file semaphores. Query length (64B) and result set
+(20) capped. All output is HTML-escaped to prevent XSS.</p>
 
-<p>Verdict: Fast SA lookup; Works on every conceivable web browser.</p>
+<p>Secure by default. Fast. Durable.</p>
 
 <p>Commit:
 <a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
-- 
cgit v1.2.3