CGI search post.

author: Sadeep Madurange <sadeep@asciimx.com> 2026-01-03 13:06:08 +0800
committer: Sadeep Madurange <sadeep@asciimx.com> 2026-01-03 14:42:43 +0800
commit: 15205d0cf770058b59be07e00f6dbc6523b9cede (patch)
tree: 43ba2ab53add863286e0ef24f1e9aa0e94bb5334
parent: 6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e (diff)
download: www-15205d0cf770058b59be07e00f6dbc6523b9cede.tar.gz
9 files changed, 678 insertions, 29 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
new file mode 100644
index 0000000..61e4bba
--- /dev/null
+++ b/_log/site-search.md
@@ -0,0 +1,287 @@
+---
+title: Site search with Perl + CGI + SA
+date: 2026-01-02
+layout: post
+---
+
+Number of articles on the site is growing. Need a way to find them.
+
+Search must match substrings, be case-insensitive, fast, and secure. Site must
+continue to work without JavaScript, on text web browsers--so no client-side
+search. Page rank not applicable.
+
+Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base
+system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
+results back over FastCGI.
+
+Data structure. Suffix array. Build-time indexer generates three files:
+corpus.bin--articles, sa.bin-suffix array, file_map.dat--article metadata
+(title, path).
+
+## Indexer
+
+Crawl the directory containing posts, extract HTML with regex
+(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
+the text (for case-insensitive search), convert to raw octets, concatenate.
+
+Sort the text lexicographically, and store their byte offsets in sa.bin:
+
+```
+# Use a block that forces byte-level comparison
+{
+    use bytes; 
+    @sa = sort { 
+        # First 64 bytes check (fast path)
+        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
+        # Full string fallback (required for correctness)
+        (substr($corpus, $a) cmp substr($corpus, $b))
+    } @sa;
+}
+```
+
+This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the
+suffix count, L the median length of suffixes. The fast path caps L at 64 bytes
+for most cases, bringing complexity down to O(NlogN)--64-byte length chosen to
+fit in cache lines.
+
+```
+CORPUS:   a  s  c  i \0  i  m  x
+OFFSET:   0  1  2  3  4  5  6  7
+
+SORTED SUFFIXES:
+[4] \0imx
+[0] asci\0imx
+[2] ci\0imx
+[3] i\0imx       <-- "i" from "asci"
+[5] imx          <-- "i" from "imx"
+[6] mx
+[1] sci\0imx
+[7] x
+```
+
+Sentinel (null byte) must be lexicographically smaller than every other
+character in the alphabet (offset 3 and 5 handled correctly at the document
+boundary).
+
+Benchmarks.
+
+Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM)
+i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.
+
+All articles are 16 KB (2x my avg post size) documents generated with seed.sh.
+
+Total Time:      0.1475 seconds<br>
+Files Processed: 500 <br>
+File Sizes:<br>
+&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;33.59 KB<br>
+&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;134.34 KB<br>
+&nbsp;&nbsp; file_map.dat &nbsp;&nbsp;&nbsp; 37.01 KB<br>
+&nbsp;&nbsp; Total Index: &nbsp;&nbsp;&nbsp;204.94 KB<br>
+
+Total Time:      0.3101 seconds<br>
+Files Processed: 1000<br>
+File Sizes:<br>
+&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;67.28 KB<br>
+&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;269.11 KB<br>
+&nbsp;&nbsp; file_map.dat &nbsp;&nbsp;&nbsp; 74.12 KB<br>
+&nbsp;&nbsp; Total Index: &nbsp;&nbsp;&nbsp;410.51 KB<br>
+
+Total Time:      10.9661 seconds<br>
+Files Processed: 10000<br>
+File Sizes:<br>
+&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;682.51 KB<br>
+&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2730.05 KB<br>
+&nbsp;&nbsp; file_map.dat &nbsp;&nbsp; 750.88 KB<br>
+&nbsp;&nbsp; Total Index: &nbsp;&nbsp;4163.44 KB<br>
+
+There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the
+index is very manageable.
+
+## Search
+
+Need to perform a range query to find offsets of all matches. Suffix
+array is sorted--use binary search.
+
+```
+# left bound
+while ($low <= $high) {
+    # Int overflow: Perl handles by promoting to float.
+    my $mid = int(($low + $high) / 2);
+    seek($fh_sa, $mid * 4, 0);
+    read($fh_sa, my $bin_off, 4);
+    my $off = unpack("L", $bin_off);
+    seek($fh_cp, $off, 0);
+    read($fh_cp, my $text, $query_len);
+        
+    my $cmp = $text cmp $query;
+    if ($cmp >= 0) {
+        $first_hit = $mid if $cmp == 0;
+        $high = $mid - 1;
+    } else {
+        $low = $mid + 1;
+    }
+}
+
+# Continue with right bound...
+```
+
+Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to
+seek/read exact chunks into memory. Since the FastCGI script is invoked per
+query, small, frequent allocations are better than less frequent large
+allocations (see bm_*.txt).
+
+Once the range is found through binary search, the results are collected using
+a simple for loop. Max result count is capped at 20.
+
+Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index
+search. Right: Naive search using regex.
+
+500 articles:
+
+<table>
+  <tr>
+      <td style="padding: 0">Files Read:</td>
+      <td style="padding: 0; text-align: right;">3</td>
+      <td style="padding: 0; text-align: right;">500</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">User CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0100</td>
+      <td style="padding: 0; text-align: right;">0.0600</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">System CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0200</td>
+      <td style="padding: 0; text-align: right;">0.0200</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Total Time (s):</td>
+      <td style="padding: 0; text-align: right;">0.0012</td>
+      <td style="padding: 0; text-align: right;">0.0407</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Peak RAM (KB):</td>
+      <td style="padding: 0; text-align: right;">8828</td>
+      <td style="padding: 0; text-align: right;">9136</td>
+   </tr>
+</table>
+
+1,000 articles:
+
+<table>
+  <tr>
+      <td style="padding: 0">Files Read:</td>
+      <td style="padding: 0; text-align: right;">3</td>
+      <td style="padding: 0; text-align: right;">1000</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">User CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0200</td>
+      <td style="padding: 0; text-align: right;">0.0500</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">System CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0100</td>
+      <td style="padding: 0; text-align: right;">0.0500</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Total Time (s):</td>
+      <td style="padding: 0; text-align: right;">0.0019</td>
+      <td style="padding: 0; text-align: right;">0.0795</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Peak RAM (KB):</td>
+      <td style="padding: 0; text-align: right;">8980</td>
+      <td style="padding: 0; text-align: right;">9460</td>
+   </tr>
+</table>
+
+10,000 articles:
+
+<table>
+  <tr>
+      <td style="padding: 0">Files Read:</td>
+      <td style="padding: 0; text-align: right;">3</td>
+      <td style="padding: 0; text-align: right;">10000</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">User CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0300</td>
+      <td style="padding: 0; text-align: right;">0.4000</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">System CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0300</td>
+      <td style="padding: 0; text-align: right;">0.5400</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Total Time (s):</td>
+      <td style="padding: 0; text-align: right;">0.0161</td>
+      <td style="padding: 0; text-align: right;">0.9120</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Peak RAM (KB):</td>
+      <td style="padding: 0; text-align: right;">12504</td>
+      <td style="padding: 0; text-align: right;">12804</td>
+   </tr>
+</table>
+
+Note: Total time < User + System CPU time is likely a measurement artifact--the
+script finishes faster than the OS tick interval.
+
+## Security
+
+Memory usage remains a concern. Parallel queries could exhaust memory. Limit 
+parallel searches using a semaphore implemented with lock files:
+
+```
+my $max_parallel   = 50;
+
+mkdir $lock_dir, 0777 unless -d $lock_dir;
+my $active_count = 0;
+my $now = time();
+
+opendir(my $dh, $lock_dir);
+while (my $file = readdir($dh)) {
+    next unless $file =~ /\.lock$/;
+    my $path = "$lock_dir/$file";
+    my $mtime = (stat($path))[9] || 0;
+    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
+}
+closedir($dh);
+```
+
+XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).
+
+ReDOS: Sanitize search text by tossing non-printable characters, limit length,
+work within utf-8. Use `\Q...\E` in searches.
+
+```
+if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
+    $search_text = $1;
+    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
+    $search_text = decode_utf8($search_text // "");
+    $search_text =~ s/\P{Print}//g; 
+    $search_text = substr($search_text, 0, 64);
+    $search_text =~ s/^\s+|\s+$//g;
+}
+```
+
+Command injection: no exec()/system() calls. Non-privileged user.
+Read-only directory.
+
+Symlink attacks: paths are hard-coded. Script can't be modified (554).
+
+No API keys, passwords involved.
+
+Last line of defence: chroot.
+
+Can't think of anything else.
+
+Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
+dependencies. Works on every conceivable device.
+
+Commit:
+[9889aca](https://git.asciimx.com/www/commit/?h=term&id=9889acaac984817731ad9f9355f067f9acd02efd)
+| Benchmarks:
+[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)
diff --git a/_site/assets/css/main.css b/_site/assets/css/main.css
index 091baf3..1b4341b 100644
--- a/_site/assets/css/main.css
+++ b/_site/assets/css/main.css
@@ -62,7 +62,7 @@ table {
 }
 
 table, tr, td {
-  border: none;
+  border: none !important;
 }
 
 td {
diff --git a/_site/cgi-bin/find.cgi b/_site/cgi-bin/find.cgi
index 5f95e3a..ab066dd 100644
--- a/_site/cgi-bin/find.cgi
+++ b/_site/cgi-bin/find.cgi
@@ -7,7 +7,7 @@ use Encode qw(decode_utf8 encode_utf8);
 use URI::Escape qw(uri_unescape);
 use HTML::Escape qw(escape_html);
 
-# --- Configuration ---
+# Configuration
 my $max_parallel   = 50;                    # Max parallel search requests
 my $lock_timeout   = 30;                    # Seconds before dropping stale locks
 my $max_results    = 20;                    # Max search results to display
@@ -16,7 +16,7 @@ my $cp_file        = 'corpus.bin';          # Raw text corpus
 my $map_file       = 'file_map.dat';        # File metadata
 my $lock_dir       = '/tmp/search_locks';   # Semaphore directory
 
-# --- Concurrency Control ---
+# Concurrency control
 mkdir $lock_dir, 0777 unless -d $lock_dir;
 my $active_count = 0;
 my $now = time();
@@ -45,7 +45,7 @@ if ($active_count >= $max_parallel) {
 my $lock_file = "$lock_dir/$$.lock";
 open(my $fh_lock, '>', $lock_file);
 
-# --- Query Decoding ---
+# Query decoding
 if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
     my $raw_q = $1;
     $raw_q =~ tr/+/ /;
@@ -64,7 +64,7 @@ if ($search_text eq '') {
     final_output("<p>Please enter a search term above.</p>");
 }
 
-# --- Binary Search Logic ---
+# Binary search 
 my @results;
 my $query = encode_utf8(lc($search_text));
 my $query_len = length($query);
@@ -130,38 +130,44 @@ if (-f $sa_file && -f $cp_file) {
             foreach my $m (@$file_map) {
                 if ($offset >= $m->{start} && $offset < $m->{end}) {
                     if (!$seen{$m->{path}}++) {
-                        # 1. Capture slightly more than 50 chars for trimming
+                        # Capture more than 50 chars for trimming
                         my $snip_start = ($offset - 30 < $m->{start}) ? $m->{start} : $offset - 30;
+                        my $max_len = $m->{end} - $snip_start;
+                        my $read_len = ($max_len > 120) ? 120 : $max_len;
                         seek($fh_cp, $snip_start, 0);
-                        read($fh_cp, my $raw_snip, 120);
+                        read($fh_cp, my $raw_snip, $read_len);
                         
                         my $snippet = decode_utf8($raw_snip, Encode::FB_QUIET) // $raw_snip;
                         $snippet =~ s/\s+/ /g; # Normalize whitespace
 
-                        # 2. Trim Start: Partial word removal
+                        # Trim start: Partial word removal
                         if ($snip_start > $m->{start}) {
                             $snippet =~ s/^[^\s]*\s//; 
                         }
 
-                        # 3. Trim End: Length limit and partial word removal
+                        # Trim end: Length limit and partial word removal
                         my $has_more = 0;
                         if (length($snippet) > 50) {
                             $snippet = substr($snippet, 0, 50);
                             $has_more = 1 if $snippet =~ s/\s+[^\s]*$//;
                         }
+                        elsif ($snip_start + $read_len < $m->{end}) {
+                            # This check handles snippets that are naturally short but 
+                            # there's still more text in the article we didn't read
+                            $has_more = 1; 
+                        }
 
-                        # 4. Cleanup & Capitalize
+                        # Cleanup & capitalize
                         $snippet = ucfirst($snippet);
+                        $snippet = escape_html($snippet) . ($has_more ? "..." : "");
+
                         my $clean_path = $m->{path};
                         $clean_path =~ s|^\.\./_site/||;
 
-                        # 5. Build Final Snippet
-                        my $display_snippet = escape_html($snippet) . ($has_more ? "..." : "");
-
                         push @results, {
                             path    => $clean_path,
-                            title   => (split('/', $m->{path}))[-2],
-                            snippet => $display_snippet
+                            title   => $m->{title},,
+                            snippet => $snippet
                         };
                     }
                     last;
diff --git a/_site/index.html b/_site/index.html
index 90cbbfd..57ba207 100644
--- a/_site/index.html
+++ b/_site/index.html
@@ -61,6 +61,19 @@
   
       <tr>
         <td class="posts-td posts-td-link">
+          <a href="/log/site-search/" class="link-decor-none">Site search with Perl + CGI + SA</a>
+        </td>
+        <td class="posts-td posts-td-time">
+          <span class="post-meta">
+            <time datetime="2026-01-02 00:00:00 +0800">2026-01-02</time>
+          </span>
+        </td>
+      </tr>
+  
+    
+  
+      <tr>
+        <td class="posts-td posts-td-link">
           <a href="/log/matrix-digital-rain/" class="link-decor-none">Matrix Rain: 2025 refactor</a>
         </td>
         <td class="posts-td posts-td-time">
@@ -176,19 +189,6 @@
   
     
   
-      <tr>
-        <td class="posts-td posts-td-link">
-          <a href="/log/neo4j-a-star-search/" class="link-decor-none">Neo4j shortest path optimization</a>
-        </td>
-        <td class="posts-td posts-td-time">
-          <span class="post-meta">
-            <time datetime="2018-03-06 00:00:00 +0800">2018-03-06</time>
-          </span>
-        </td>
-      </tr>
-  
-    
-  
     </table>
   </div>
 
diff --git a/_site/log/index.html b/_site/log/index.html
index 32fbf8d..813832d 100644
--- a/_site/log/index.html
+++ b/_site/log/index.html
@@ -51,6 +51,19 @@
   
       <tr>
         <td class="posts-td posts-td-link">
+          <a href="/log/site-search/" class="link-decor-none">Site search with Perl + CGI + SA</a>
+        </td>
+        <td class="posts-td posts-td-time">
+          <span class="post-meta">
+            <time datetime="2026-01-02 00:00:00 +0800">2026-01-02</time>
+          </span>
+        </td>
+      </tr>
+  
+    
+  
+      <tr>
+        <td class="posts-td posts-td-link">
           <a href="/log/matrix-digital-rain/" class="link-decor-none">Matrix Rain: 2025 refactor</a>
         </td>
         <td class="posts-td posts-td-time">
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html
new file mode 100644
index 0000000..3c32ba3
--- /dev/null
+++ b/_site/log/site-search/index.html
@@ -0,0 +1,338 @@
+<!DOCTYPE html>
+<html>
+    <head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Site search with Perl + CGI + SA</title>
+  <link rel="stylesheet" href="/assets/css/main.css">
+  <link rel="stylesheet" href="/assets/css/skeleton.css">
+</head>
+
+
+  <body>
+
+    <div id="nav-container" class="container">
+  <ul id="navlist" class="left">
+    
+    <li >
+      <a href="/" class="link-decor-none">hme</a>
+    </li>
+    <li class="active">
+      <a href="/log/" class="link-decor-none">log</a>
+    </li>
+    <li >
+      <a href="/projects/" class="link-decor-none">poc</a>
+    </li>
+    <li >
+      <a href="/about/" class="link-decor-none">abt</a>
+    </li>
+    <li>
+      <a href="/cgi-bin/find.cgi" class="link-decor-none">sws</a>
+    </li>
+    <li>
+      <a href="/feed.xml" class="link-decor-none">rss</a>
+    </li>
+  </ul>
+</div>
+
+
+
+    <main>
+      <div class="container">
+        <div class="container-2">
+          <h2 class="center" id="title">SITE SEARCH WITH PERL + CGI + SA</h2>
+          <h5 class="center">02 JANUARY 2026</h5>
+          <br>
+          <div class="twocol justify"><p>Number of articles on the site is growing. Need a way to find them.</p>
+
+<p>Search must match substrings, be case-insensitive, fast, and secure. Site must
+continue to work without JavaScript, on text web browsers–so no client-side
+search. Page rank not applicable.</p>
+
+<p>Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base
+system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
+results back over FastCGI.</p>
+
+<p>Data structure. Suffix array. Build-time indexer generates three files:
+corpus.bin–articles, sa.bin-suffix array, file_map.dat–article metadata
+(title, path).</p>
+
+<h2 id="indexer">Indexer</h2>
+
+<p>Crawl the directory containing posts, extract HTML with regex
+(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
+the text (for case-insensitive search), convert to raw octets, concatenate.</p>
+
+<p>Sort the text lexicographically, and store their byte offsets in sa.bin:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison
+{
+    use bytes; 
+    @sa = sort { 
+        # First 64 bytes check (fast path)
+        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
+        # Full string fallback (required for correctness)
+        (substr($corpus, $a) cmp substr($corpus, $b))
+    } @sa;
+}
+</code></pre></div></div>
+
+<p>This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the
+suffix count, L the median length of suffixes. The fast path caps L at 64 bytes
+for most cases, bringing complexity down to O(NlogN)–64-byte length chosen to
+fit in cache lines.</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CORPUS:   a  s  c  i \0  i  m  x
+OFFSET:   0  1  2  3  4  5  6  7
+
+SORTED SUFFIXES:
+[4] \0imx
+[0] asci\0imx
+[2] ci\0imx
+[3] i\0imx       &lt;-- "i" from "asci"
+[5] imx          &lt;-- "i" from "imx"
+[6] mx
+[1] sci\0imx
+[7] x
+</code></pre></div></div>
+
+<p>Sentinel (null byte) must be lexicographically smaller than every other
+character in the alphabet (offset 3 and 5 handled correctly at the document
+boundary).</p>
+
+<p>Benchmarks.</p>
+
+<p>Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM)
+i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.</p>
+
+<p>All articles are 16 KB (2x my avg post size) documents generated with seed.sh.</p>
+
+<p>Total Time:      0.1475 seconds<br />
+Files Processed: 500 <br />
+File Sizes:<br />
+   corpus.bin       33.59 KB<br />
+   sa.bin          134.34 KB<br />
+   file_map.dat     37.01 KB<br />
+   Total Index:    204.94 KB<br /></p>
+
+<p>Total Time:      0.3101 seconds<br />
+Files Processed: 1000<br />
+File Sizes:<br />
+   corpus.bin       67.28 KB<br />
+   sa.bin          269.11 KB<br />
+   file_map.dat     74.12 KB<br />
+   Total Index:    410.51 KB<br /></p>
+
+<p>Total Time:      10.9661 seconds<br />
+Files Processed: 10000<br />
+File Sizes:<br />
+   corpus.bin      682.51 KB<br />
+   sa.bin         2730.05 KB<br />
+   file_map.dat    750.88 KB<br />
+   Total Index:   4163.44 KB<br /></p>
+
+<p>There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the
+index is very manageable.</p>
+
+<h2 id="search">Search</h2>
+
+<p>Need to perform a range query to find offsets of all matches. Suffix
+array is sorted–use binary search.</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># left bound
+while ($low &lt;= $high) {
+    # Int overflow: Perl handles by promoting to float.
+    my $mid = int(($low + $high) / 2);
+    seek($fh_sa, $mid * 4, 0);
+    read($fh_sa, my $bin_off, 4);
+    my $off = unpack("L", $bin_off);
+    seek($fh_cp, $off, 0);
+    read($fh_cp, my $text, $query_len);
+        
+    my $cmp = $text cmp $query;
+    if ($cmp &gt;= 0) {
+        $first_hit = $mid if $cmp == 0;
+        $high = $mid - 1;
+    } else {
+        $low = $mid + 1;
+    }
+}
+
+# Continue with right bound...
+</code></pre></div></div>
+
+<p>Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to
+seek/read exact chunks into memory. Since the FastCGI script is invoked per
+query, small, frequent allocations are better than less frequent large
+allocations (see bm_*.txt).</p>
+
+<p>Once the range is found through binary search, the results are collected using
+a simple for loop. Max result count is capped at 20.</p>
+
+<p>Benchmarks. Search for ‘arduino’ (expect 0 matches). Left: SA-based index
+search. Right: Naive search using regex.</p>
+
+<p>500 articles:</p>
+
+<table>
+  <tr>
+      <td style="padding: 0">Files Read:</td>
+      <td style="padding: 0; text-align: right;">3</td>
+      <td style="padding: 0; text-align: right;">500</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">User CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0100</td>
+      <td style="padding: 0; text-align: right;">0.0600</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">System CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0200</td>
+      <td style="padding: 0; text-align: right;">0.0200</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Total Time (s):</td>
+      <td style="padding: 0; text-align: right;">0.0012</td>
+      <td style="padding: 0; text-align: right;">0.0407</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Peak RAM (KB):</td>
+      <td style="padding: 0; text-align: right;">8828</td>
+      <td style="padding: 0; text-align: right;">9136</td>
+   </tr>
+</table>
+
+<p>1,000 articles:</p>
+
+<table>
+  <tr>
+      <td style="padding: 0">Files Read:</td>
+      <td style="padding: 0; text-align: right;">3</td>
+      <td style="padding: 0; text-align: right;">1000</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">User CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0200</td>
+      <td style="padding: 0; text-align: right;">0.0500</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">System CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0100</td>
+      <td style="padding: 0; text-align: right;">0.0500</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Total Time (s):</td>
+      <td style="padding: 0; text-align: right;">0.0019</td>
+      <td style="padding: 0; text-align: right;">0.0795</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Peak RAM (KB):</td>
+      <td style="padding: 0; text-align: right;">8980</td>
+      <td style="padding: 0; text-align: right;">9460</td>
+   </tr>
+</table>
+
+<p>10,000 articles:</p>
+
+<table>
+  <tr>
+      <td style="padding: 0">Files Read:</td>
+      <td style="padding: 0; text-align: right;">3</td>
+      <td style="padding: 0; text-align: right;">10000</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">User CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0300</td>
+      <td style="padding: 0; text-align: right;">0.4000</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">System CPU (s):</td>
+      <td style="padding: 0; text-align: right;">0.0300</td>
+      <td style="padding: 0; text-align: right;">0.5400</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Total Time (s):</td>
+      <td style="padding: 0; text-align: right;">0.0161</td>
+      <td style="padding: 0; text-align: right;">0.9120</td>
+  </tr>
+  <tr>
+      <td style="padding: 0">Peak RAM (KB):</td>
+      <td style="padding: 0; text-align: right;">12504</td>
+      <td style="padding: 0; text-align: right;">12804</td>
+   </tr>
+</table>
+
+<p>Note: Total time &lt; User + System CPU time is likely a measurement artifact–the
+script finishes faster than the OS tick interval.</p>
+
+<h2 id="security">Security</h2>
+
+<p>Memory usage remains a concern. Parallel queries could exhaust memory. Limit 
+parallel searches using a semaphore implemented with lock files:</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my $max_parallel   = 50;
+
+mkdir $lock_dir, 0777 unless -d $lock_dir;
+my $active_count = 0;
+my $now = time();
+
+opendir(my $dh, $lock_dir);
+while (my $file = readdir($dh)) {
+    next unless $file =~ /\.lock$/;
+    my $path = "$lock_dir/$file";
+    my $mtime = (stat($path))[9] || 0;
+    ($now - $mtime &gt; $lock_timeout) ? unlink($path) : $active_count++;
+}
+closedir($dh);
+</code></pre></div></div>
+
+<p>XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).</p>
+
+<p>ReDOS: Sanitize search text by tossing non-printable characters, limit length,
+work within utf-8. Use <code class="language-plaintext highlighter-rouge">\Q...\E</code> in searches.</p>
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (($ENV{QUERY_STRING} || '') =~ /^q=([^&amp;]*)/) {
+    $search_text = $1;
+    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
+    $search_text = decode_utf8($search_text // "");
+    $search_text =~ s/\P{Print}//g; 
+    $search_text = substr($search_text, 0, 64);
+    $search_text =~ s/^\s+|\s+$//g;
+}
+</code></pre></div></div>
+
+<p>Command injection: no exec()/system() calls. Non-privileged user.
+Read-only directory.</p>
+
+<p>Symlink attacks: paths are hard-coded. Script can’t be modified (554).</p>
+
+<p>No API keys, passwords involved.</p>
+
+<p>Last line of defence: chroot.</p>
+
+<p>Can’t think of anything else.</p>
+
+<p>Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
+dependencies. Works on every conceivable device.</p>
+
+<p>Commit:
+<a href="https://git.asciimx.com/www/commit/?h=term&amp;id=9889acaac984817731ad9f9355f067f9acd02efd">9889aca</a>
+| Benchmarks:
+<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p>
+</div>
+          <p class="post-author right">by W. D. Sadeep Madurange</p>
+        </div>
+      </div>
+    </main>
+
+    <div class="footer">
+  <div class="container">
+    <div class="twelve columns right container-2">
+      <p id="footer-text">&copy; ASCIIMX - 2026</p>
+    </div>
+  </div>
+</div>
+
+
+  </body>
+</html>
diff --git a/_site/posts.xml b/_site/posts.xml
index d327a7b..0136478 100644
--- a/_site/posts.xml
+++ b/_site/posts.xml
@@ -1 +1,2 @@
 <?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/posts.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-02T19:06:26+08:00</updated><id>/posts.xml</id><title type="html">ASCIIMX</title><author><name>W. D. Sadeep Madurange</name></author></feed>
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/posts.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T12:59:44+08:00</updated><id>/posts.xml</id><title type="html">ASCIIMX</title><author><name>W. D. Sadeep Madurange</name></author></feed>
diff --git a/_site/sitemap.xml b/_site/sitemap.xml
index b799547..c5fe837 100644
--- a/_site/sitemap.xml
+++ b/_site/sitemap.xml
@@ -41,6 +41,10 @@
 <lastmod>2025-12-21T00:00:00+08:00</lastmod>
 </url>
 <url>
+<loc>/log/site-search/</loc>
+<lastmod>2026-01-02T00:00:00+08:00</lastmod>
+</url>
+<url>
 <loc>/about/</loc>
 </url>
 <url>
diff --git a/assets/css/main.css b/assets/css/main.css
index 091baf3..1b4341b 100644
--- a/assets/css/main.css
+++ b/assets/css/main.css
@@ -62,7 +62,7 @@ table {
 }
 
 table, tr, td {
-  border: none;
+  border: none !important;
 }
 
 td {
author	Sadeep Madurange <sadeep@asciimx.com>	2026-01-03 13:06:08 +0800
committer	Sadeep Madurange <sadeep@asciimx.com>	2026-01-03 14:42:43 +0800
commit	15205d0cf770058b59be07e00f6dbc6523b9cede (patch)
tree	43ba2ab53add863286e0ef24f1e9aa0e94bb5334
parent	6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e (diff)
download	www-15205d0cf770058b59be07e00f6dbc6523b9cede.tar.gz