1 files changed, 100 insertions, 30 deletions
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html
index 04f98e9..9170002 100644
--- a/_site/log/site-search/index.html
+++ b/_site/log/site-search/index.html
@@ -3,7 +3,7 @@
     <head>
   <meta charset="utf-8">
   <meta name="viewport" content="width=device-width, initial-scale=1">
-  <title>Perl + FastCGI + SA search engine</title>
+  <title>Search engine (Perl + FastCGI + SA)</title>
   <link rel="stylesheet" href="/assets/css/main.css">
   <link rel="stylesheet" href="/assets/css/skeleton.css">
 </head>
@@ -37,24 +37,47 @@
     <main>
       <div class="container">
         <div class="container-2">
-          <h2 class="center" id="title">PERL + FASTCGI + SA SEARCH ENGINE</h2>
-          <h5 class="center">02 JANUARY 2026</h5>
+          <h2 class="center" id="title">SEARCH ENGINE (PERL + FASTCGI + SA)</h2>
+          <h5 class="center">03 JANUARY 2026</h5>
           <br>
-          <div class="twocol justify"><p>Number of articles growing. Need search.</p>
+          <div class="twocol justify"><p>Article count on the website is growing. Need a way to search.</p>
 
-<p>Requirements: substring match, case-insensitive, fast, secure. No JavaScript.</p>
+<p>Requirements: matches substrings, case-insensitive, fast, secure. No
+JavaScript.</p>
 
-<p>Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.</p>
+<p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p>
 
-<p>Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
-(sorted byte offsets), file_map.dat (metadata).</p>
+<p>httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
+governed by file system permissions–no secrets to manage. Secure by default.</p>
 
-<p>Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
-byte sentinel for document boundaries. Sort lexicographically::</p>
+<p>2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
+using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
+enough. Start to feel the O(N) pull at higher file counts.</p>
 
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison
+<p>Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
+be mitigated. Tempted to stop here.</p>
+
+<p>2026-01-03: Suffix array (SA) based index lookup.</p>
+
+<p>Inefficiency of scanning every file on each request troubles me. Regex search
+depends almost entirely on hardware for speed.</p>
+
+<p>Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
+byte offsets), file_map.dat (metadata). Index created with the site:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build
+$ cd cgi-bin/
+$ perl indexer.pl
+</code></pre></div></div>
+
+<p>Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
+saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
+array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
+lexicographical order:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1);
 {
-    use bytes; 
+    use bytes; # Force compare 8-bit Unicode value comparisons
     @sa = sort { 
         # First 64 bytes check (fast path)
         (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
@@ -62,31 +85,78 @@ byte sentinel for document boundaries. Sort lexicographically::</p>
         (substr($corpus, $a) cmp substr($corpus, $b))
     } @sa;
 }
-</code></pre></div></div>
-
-<p>Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte
-length targets cache lines.</p>
 
-<p>Search: binary search for range query. Cap at 20 results–define limits or be
-surprised by them.</p>
+CORPUS:   a  s  c  i \0  i  m  x
+OFFSET:   0  1  2  3  4  5  6  7
 
-<p>File IO and memory: many seek/read small chunks beat one large allocation (see
-benchmarks for find_one_file.cgi).</p>
+SORTED SUFFIXES:
 
-<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):</p>
+[4] \0imx
+[0] asci\0imx
+[2] ci\0imx
+[3] i\0imx       &lt;-- "i" from "asci"
+[5] imx          &lt;-- "i" from "imx"
+[6] mx
+[1] sci\0imx
+[7] x
+</code></pre></div></div>
 
-<p>1,000 files: 0.31s indexing, 410 KB total index.<br />
-10,000 files: 10.97s indexing, 4.16 MB total index.</p>
+<p>Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
+cache line), reducing complexity to O(N log N).</p>
 
-<p>Search ‘arduino’ (0 matches):<br />
-1,000 files: 0.002s (SA) vs 0.080s (naive regex).<br />
-10,000 files: 0.016s (SA) vs 0.912s (naive regex).</p>
+<p>Search: Textbook range query with two binary searches. Random access to offsets
+and the text is possible via the fixed-width offsets:</p>
 
-<p>Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
-Sanitize input–strip non-printables, limit length, and quote metacharacters
-(ReDOS). No exec/system (command injection). Chroot.</p>
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0);
+read($fh_sa, my $bin_off, 4);
+my $off = unpack("L", $bin_off);
+seek($fh_cp, $off, 0);
+read($fh_cp, my $text, $query_len);
+</code></pre></div></div>
 
-<p>Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.</p>
+<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
+to implement. Note to self: mmap.</p>
+
+<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p>
+
+<p>500 files:</p>
+<ul>
+  <li>Index size: 204.94 KB</li>
+  <li>Indexing time: 0.1475 s</li>
+  <li>Peak RAM (SA): 8828 KB</li>
+  <li>Peak RAM (Regex): 9136 KB</li>
+  <li>Search (SA): 0.0012 s</li>
+  <li>Search (Regex): 0.0407 s</li>
+</ul>
+
+<p>1,000 files:</p>
+<ul>
+  <li>Index size: 410.51 KB</li>
+  <li>Indexing time: 0.3101 s</li>
+  <li>Peak RAM (SA): 8980 KB</li>
+  <li>Peak RAM (Regex): 9460 KB</li>
+  <li>Search (SA): 0.0019 s</li>
+  <li>Search (Regex): 0.0795 s</li>
+</ul>
+
+<p>10,000 files:</p>
+<ul>
+  <li>Index size: 4163.44 KB</li>
+  <li>Indexing time: 10.9661 s</li>
+  <li>Peak RAM (SA): 12504 KB</li>
+  <li>Peak RAM (Regex): 12804 KB</li>
+  <li>Search (SA): 0.0161 s</li>
+  <li>Search (Regex): 0.9120 s</li>
+</ul>
+
+<p>Security: Much of it comes from its architectural simplicity. No dependencies
+to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.</p>
+
+<p>Resource exhaustion and XSS attacks are inherent. The former is mitigated by
+limiting concurrent searches via lock-file semaphores and capping query length
+(64B) and result sets (20). All output is HTML-escaped to prevent XSS.</p>
+
+<p>Verdict: Fast SA lookup; Works on every conceivable web browser.</p>
 
 <p>Commit:
 <a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>