---
title: Site search with Perl + CGI + SA
date: 2026-01-02
layout: post
---

The number of articles on the site is growing, and I need a way to find them. Search must match substrings, be case-insensitive, fast, and secure. The site must keep working without JavaScript, on text web browsers--so no client-side search. Page rank is not applicable.

Architecture. OpenBSD's httpd speaks FastCGI, and Perl and slowcgi are in the base system. Search with a Perl script, invoke it from httpd via slowcgi, and send the results back over FastCGI.

Data structure. Suffix array. A build-time indexer generates three files: corpus.bin--the articles, sa.bin--the suffix array, file_map.dat--article metadata (title, path).

## Indexer

Crawl the directory containing posts, extract the text from the HTML with regex (no-regex-for-HTML doctrinaires are getting a seizure right about now), lower-case it (for case-insensitive search), convert it to raw octets, and concatenate. Sort the suffixes lexicographically and store their starting byte offsets in sa.bin:

```
# Use a block that forces byte-level comparison
{
    use bytes;
    @sa = sort {
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64))
            ||
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```

This is the slowest part of the indexer, running at roughly O(L⋅N log N), where N is the suffix count and L the median suffix length. The fast path caps L at 64 bytes in most cases, bringing the typical cost down to O(N log N)--the 64-byte length is chosen to fit in a cache line.

```
CORPUS:  a  s  c  i \0  i  m  x
OFFSET:  0  1  2  3  4  5  6  7

SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx      <-- "i" from "asci"
[5] imx         <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
```

The sentinel (null byte) must be lexicographically smaller than every other character in the alphabet, so that offsets 3 and 5 are handled correctly at the document boundary.
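For reference, here is a minimal sketch of how corpus.bin and sa.bin might be serialized, reusing the toy corpus from the diagram above. This is an illustration rather than the indexer's actual code: the variable names, the `:raw` layer, and the native-endian 32-bit records are assumptions chosen to match the `unpack("L", ...)` read in the search script below (file_map.dat is omitted since its exact layout isn't shown).

```
#!/usr/bin/perl
use strict;
use warnings;

# Toy inputs standing in for the real indexer state (assumed names):
# $corpus is the lower-cased, \0-separated octet string; @sa holds the
# sorted suffix offsets (here, the order from the diagram above).
my $corpus = "asci\0imx";
my @sa     = (4, 0, 2, 3, 5, 6, 1, 7);

# corpus.bin: the raw concatenated text.
open(my $fh, '>:raw', 'corpus.bin') or die "corpus.bin: $!";
print $fh $corpus;
close($fh);

# sa.bin: one 4-byte native-endian unsigned int per suffix offset, so the
# search script can seek straight to record $i at byte offset $i * 4.
open($fh, '>:raw', 'sa.bin') or die "sa.bin: $!";
print $fh pack("L*", @sa);
close($fh);
```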
Benchmarks. Performed on a T490 running OpenBSD 7.8: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz, amd64, 15 GB memory. All articles are 16 KB documents (2x my average post size) generated with seed.sh.

| Files Processed | corpus.bin | sa.bin | file_map.dat | Total Index | Total Time |
|----------------:|-----------:|-----------:|-------------:|------------:|-----------:|
| 500 | 33.59 KB | 134.34 KB | 37.01 KB | 204.94 KB | 0.1475 s |
| 1000 | 67.28 KB | 269.11 KB | 74.12 KB | 410.51 KB | 0.3101 s |
| 10000 | 682.51 KB | 2730.05 KB | 750.88 KB | 4163.44 KB | 10.9661 s |
There are 11 posts today. It's unlikely the site will ever grow to 10K posts, and even if it did, the index would still be very manageable.

## Search

Need to perform a range query to find the offsets of all matches. The suffix array is sorted--use binary search.

```
# left bound
while ($low <= $high) {
    # Int overflow: Perl handles it by promoting to float.
    my $mid = int(($low + $high) / 2);

    # Fetch the suffix offset stored at this position in the array
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);

    # Compare only the first $query_len bytes of the suffix
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);

    my $cmp = $text cmp $query;
    if ($cmp >= 0) {
        $first_hit = $mid if $cmp == 0;
        $high = $mid - 1;
    } else {
        $low = $mid + 1;
    }
}
# Continue with right bound...
```

Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to seek/read exact chunks into memory. Since the FastCGI script is invoked per query, small, frequent allocations beat less frequent, large ones (see bm_*.txt). Once the range is found through binary search, the results are collected with a simple for loop; the result count is capped at 20.
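A sketch of that collection step, continuing the variables from the snippet above and assuming the (elided) right-bound search has produced a `$last_hit` mirroring `$first_hit`; the remaining names are assumptions:

```
# Collect at most 20 matching corpus offsets from the [$first_hit, $last_hit] range.
my @match_offsets;
if (defined $first_hit) {
    my $max_results = 20;
    for my $i ($first_hit .. $last_hit) {
        last if @match_offsets >= $max_results;
        seek($fh_sa, $i * 4, 0);               # 4-byte, word-aligned records
        read($fh_sa, my $bin_off, 4);
        push @match_offsets, unpack("L", $bin_off);
    }
}
# Each corpus offset is then resolved to an article (title, path) via file_map.dat.
```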
Benchmarks. Search for 'arduino' (expect 0 matches), comparing the SA-based index search against a naive regex search over every article.

| Articles | Method | Files Read | User CPU (s) | System CPU (s) | Total Time (s) | Peak RAM (KB) |
|---------:|--------|-----------:|-------------:|---------------:|---------------:|--------------:|
| 500 | SA index | 3 | 0.0100 | 0.0200 | 0.0012 | 8828 |
| 500 | Naive regex | 500 | 0.0600 | 0.0200 | 0.0407 | 9136 |
| 1,000 | SA index | 3 | 0.0200 | 0.0100 | 0.0019 | 8980 |
| 1,000 | Naive regex | 1000 | 0.0500 | 0.0500 | 0.0795 | 9460 |
| 10,000 | SA index | 3 | 0.0300 | 0.0300 | 0.0161 | 12504 |
| 10,000 | Naive regex | 10000 | 0.4000 | 0.5400 | 0.9120 | 12804 |
Note: total wall-clock time coming in below user + system CPU time is likely a measurement artifact--the script finishes faster than the OS tick interval.

## Security

Memory usage remains a concern: parallel queries could exhaust memory. Limit parallel searches using a semaphore implemented with lock files:

```
my $max_parallel = 50;
mkdir $lock_dir, 0777 unless -d $lock_dir;

my $active_count = 0;
my $now = time();

# Count live lock files, reaping any that have gone stale
opendir(my $dh, $lock_dir);
while (my $file = readdir($dh)) {
    next unless $file =~ /\.lock$/;
    my $path  = "$lock_dir/$file";
    my $mtime = (stat($path))[9] || 0;
    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
}
closedir($dh);
# If $active_count >= $max_parallel, refuse the query;
# otherwise drop a lock file for this request and proceed.
```

XSS attacks: escape all HTML output using HTML::Escape (escape_html). ReDoS: sanitize the search text by tossing non-printable characters, limiting its length, and working within UTF-8. Use `\Q...\E` in searches.

```
# Requires: use Encode qw(decode_utf8);
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
    $search_text = $1;
    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;  # percent-decode
    $search_text = decode_utf8($search_text // "");       # interpret as UTF-8
    $search_text =~ s/\P{Print}//g;                       # drop non-printable characters
    $search_text = substr($search_text, 0, 64);           # cap the query length
    $search_text =~ s/^\s+|\s+$//g;                       # trim surrounding whitespace
}
```

Command injection: no exec()/system() calls. Non-privileged user. Read-only directory. Symlink attacks: paths are hard-coded. The script itself can't be modified (mode 554). No API keys or passwords involved. Last line of defence: chroot. Can't think of anything else.

Verdict: fast SA index lookup. Primary attack vectors mitigated. No dependencies. Works on every conceivable device.

Commit: [9889aca](https://git.asciimx.com/www/commit/?h=term&id=9889acaac984817731ad9f9355f067f9acd02efd) | Benchmarks: [8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)