summaryrefslogtreecommitdiffstats
path: root/_log
diff options
context:
space:
mode:
authorSadeep Madurange <sadeep@asciimx.com>2026-01-03 13:06:08 +0800
committerSadeep Madurange <sadeep@asciimx.com>2026-01-03 14:42:43 +0800
commit15205d0cf770058b59be07e00f6dbc6523b9cede (patch)
tree43ba2ab53add863286e0ef24f1e9aa0e94bb5334 /_log
parent6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e (diff)
downloadwww-15205d0cf770058b59be07e00f6dbc6523b9cede.tar.gz
CGI search post.
Diffstat (limited to '_log')
-rw-r--r--_log/site-search.md287
1 files changed, 287 insertions, 0 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
new file mode 100644
index 0000000..61e4bba
--- /dev/null
+++ b/_log/site-search.md
@@ -0,0 +1,287 @@
+---
+title: Site search with Perl + CGI + SA
+date: 2026-01-02
+layout: post
+---
+
+Number of articles on the site is growing. Need a way to find them.
+
+Search must match substrings, be case-insensitive, fast, and secure. Site must
+continue to work without JavaScript, on text web browsers--so no client-side
+search. Page rank not applicable.
+
+Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base
+system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
+results back over FastCGI.
+
+Data structure. Suffix array. Build-time indexer generates three files:
+corpus.bin--articles, sa.bin-suffix array, file_map.dat--article metadata
+(title, path).
+
+## Indexer
+
+Crawl the directory containing posts, extract HTML with regex
+(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
+the text (for case-insensitive search), convert to raw octets, concatenate.
+
+Sort the text lexicographically, and store their byte offsets in sa.bin:
+
+```
+# Use a block that forces byte-level comparison
+{
+ use bytes;
+ @sa = sort {
+ # First 64 bytes check (fast path)
+ (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
+ # Full string fallback (required for correctness)
+ (substr($corpus, $a) cmp substr($corpus, $b))
+ } @sa;
+}
+```
+
+This is the slowest part of the indexer running at ~ O(Lâ‹…NlogN), N is the
+suffix count, L the median length of suffixes. The fast path caps L at 64 bytes
+for most cases, bringing complexity down to O(NlogN)--64-byte length chosen to
+fit in cache lines.
+
+```
+CORPUS: a s c i \0 i m x
+OFFSET: 0 1 2 3 4 5 6 7
+
+SORTED SUFFIXES:
+[4] \0imx
+[0] asci\0imx
+[2] ci\0imx
+[3] i\0imx <-- "i" from "asci"
+[5] imx <-- "i" from "imx"
+[6] mx
+[1] sci\0imx
+[7] x
+```
+
+Sentinel (null byte) must be lexicographically smaller than every other
+character in the alphabet (offset 3 and 5 handled correctly at the document
+boundary).
+
+Benchmarks.
+
+Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM)
+i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.
+
+All articles are 16 KB (2x my avg post size) documents generated with seed.sh.
+
+Total Time: 0.1475 seconds<br>
+Files Processed: 500 <br>
+File Sizes:<br>
+&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;33.59 KB<br>
+&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;134.34 KB<br>
+&nbsp;&nbsp; file_map.dat &nbsp;&nbsp;&nbsp; 37.01 KB<br>
+&nbsp;&nbsp; Total Index: &nbsp;&nbsp;&nbsp;204.94 KB<br>
+
+Total Time: 0.3101 seconds<br>
+Files Processed: 1000<br>
+File Sizes:<br>
+&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;67.28 KB<br>
+&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;269.11 KB<br>
+&nbsp;&nbsp; file_map.dat &nbsp;&nbsp;&nbsp; 74.12 KB<br>
+&nbsp;&nbsp; Total Index: &nbsp;&nbsp;&nbsp;410.51 KB<br>
+
+Total Time: 10.9661 seconds<br>
+Files Processed: 10000<br>
+File Sizes:<br>
+&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;682.51 KB<br>
+&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2730.05 KB<br>
+&nbsp;&nbsp; file_map.dat &nbsp;&nbsp; 750.88 KB<br>
+&nbsp;&nbsp; Total Index: &nbsp;&nbsp;4163.44 KB<br>
+
+There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the
+index is very manageable.
+
+## Search
+
+Need to perform a range query to find offsets of all matches. Suffix
+array is sorted--use binary search.
+
+```
+# left bound
+while ($low <= $high) {
+ # Int overflow: Perl handles by promoting to float.
+ my $mid = int(($low + $high) / 2);
+ seek($fh_sa, $mid * 4, 0);
+ read($fh_sa, my $bin_off, 4);
+ my $off = unpack("L", $bin_off);
+ seek($fh_cp, $off, 0);
+ read($fh_cp, my $text, $query_len);
+
+ my $cmp = $text cmp $query;
+ if ($cmp >= 0) {
+ $first_hit = $mid if $cmp == 0;
+ $high = $mid - 1;
+ } else {
+ $low = $mid + 1;
+ }
+}
+
+# Continue with right bound...
+```
+
+Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to
+seek/read exact chunks into memory. Since the FastCGI script is invoked per
+query, small, frequent allocations are better than less frequent large
+allocations (see bm_*.txt).
+
+Once the range is found through binary search, the results are collected using
+a simple for loop. Max result count is capped at 20.
+
+Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index
+search. Right: Naive search using regex.
+
+500 articles:
+
+<table>
+ <tr>
+ <td style="padding: 0">Files Read:</td>
+ <td style="padding: 0; text-align: right;">3</td>
+ <td style="padding: 0; text-align: right;">500</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">User CPU (s):</td>
+ <td style="padding: 0; text-align: right;">0.0100</td>
+ <td style="padding: 0; text-align: right;">0.0600</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">System CPU (s):</td>
+ <td style="padding: 0; text-align: right;">0.0200</td>
+ <td style="padding: 0; text-align: right;">0.0200</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">Total Time (s):</td>
+ <td style="padding: 0; text-align: right;">0.0012</td>
+ <td style="padding: 0; text-align: right;">0.0407</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">Peak RAM (KB):</td>
+ <td style="padding: 0; text-align: right;">8828</td>
+ <td style="padding: 0; text-align: right;">9136</td>
+ </tr>
+</table>
+
+1,000 articles:
+
+<table>
+ <tr>
+ <td style="padding: 0">Files Read:</td>
+ <td style="padding: 0; text-align: right;">3</td>
+ <td style="padding: 0; text-align: right;">1000</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">User CPU (s):</td>
+ <td style="padding: 0; text-align: right;">0.0200</td>
+ <td style="padding: 0; text-align: right;">0.0500</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">System CPU (s):</td>
+ <td style="padding: 0; text-align: right;">0.0100</td>
+ <td style="padding: 0; text-align: right;">0.0500</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">Total Time (s):</td>
+ <td style="padding: 0; text-align: right;">0.0019</td>
+ <td style="padding: 0; text-align: right;">0.0795</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">Peak RAM (KB):</td>
+ <td style="padding: 0; text-align: right;">8980</td>
+ <td style="padding: 0; text-align: right;">9460</td>
+ </tr>
+</table>
+
+10,000 articles:
+
+<table>
+ <tr>
+ <td style="padding: 0">Files Read:</td>
+ <td style="padding: 0; text-align: right;">3</td>
+ <td style="padding: 0; text-align: right;">10000</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">User CPU (s):</td>
+ <td style="padding: 0; text-align: right;">0.0300</td>
+ <td style="padding: 0; text-align: right;">0.4000</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">System CPU (s):</td>
+ <td style="padding: 0; text-align: right;">0.0300</td>
+ <td style="padding: 0; text-align: right;">0.5400</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">Total Time (s):</td>
+ <td style="padding: 0; text-align: right;">0.0161</td>
+ <td style="padding: 0; text-align: right;">0.9120</td>
+ </tr>
+ <tr>
+ <td style="padding: 0">Peak RAM (KB):</td>
+ <td style="padding: 0; text-align: right;">12504</td>
+ <td style="padding: 0; text-align: right;">12804</td>
+ </tr>
+</table>
+
+Note: Total time < User + System CPU time is likely a measurement artifact--the
+script finishes faster than the OS tick interval.
+
+## Security
+
+Memory usage remains a concern. Parallel queries could exhaust memory. Limit
+parallel searches using a semaphore implemented with lock files:
+
+```
+my $max_parallel = 50;
+
+mkdir $lock_dir, 0777 unless -d $lock_dir;
+my $active_count = 0;
+my $now = time();
+
+opendir(my $dh, $lock_dir);
+while (my $file = readdir($dh)) {
+ next unless $file =~ /\.lock$/;
+ my $path = "$lock_dir/$file";
+ my $mtime = (stat($path))[9] || 0;
+ ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
+}
+closedir($dh);
+```
+
+XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).
+
+ReDOS: Sanitize search text by tossing non-printable characters, limit length,
+work within utf-8. Use `\Q...\E` in searches.
+
+```
+if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
+ $search_text = $1;
+ $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
+ $search_text = decode_utf8($search_text // "");
+ $search_text =~ s/\P{Print}//g;
+ $search_text = substr($search_text, 0, 64);
+ $search_text =~ s/^\s+|\s+$//g;
+}
+```
+
+Command injection: no exec()/system() calls. Non-privileged user.
+Read-only directory.
+
+Symlink attacks: paths are hard-coded. Script can't be modified (554).
+
+No API keys, passwords involved.
+
+Last line of defence: chroot.
+
+Can't think of anything else.
+
+Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
+dependencies. Works on every conceivable device.
+
+Commit:
+[9889aca](https://git.asciimx.com/www/commit/?h=term&id=9889acaac984817731ad9f9355f067f9acd02efd)
+| Benchmarks:
+[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)