SITE SEARCH WITH PERL + CGI + SA
02 JANUARY 2026
Number of articles on the site is growing. Need a way to find them.
Search must match substrings, be case-insensitive, fast, and secure. Site must continue to work without JavaScript, on text-mode browsers–so no client-side search. Page rank not applicable.
Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the results back over FastCGI.
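The httpd side is a few lines of config. A minimal sketch, assuming the script sits in /cgi-bin inside the default /var/www chroot–server name and paths are illustrative:

# Hypothetical httpd.conf fragment: hand /cgi-bin/* to slowcgi.
server "example.org" {
    listen on * port 80
    location "/cgi-bin/*" {
        fastcgi socket "/run/slowcgi.sock"
        root "/"
    }
}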
Data structure. Suffix array. Build-time indexer generates three files: corpus.bin–the articles, sa.bin–the suffix array, file_map.dat–article metadata (title, path).
Indexer
Crawl the directory containing posts, strip the HTML with regex (no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case the text (for case-insensitive search), convert to raw octets, concatenate.
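A sketch of that pass, under a few assumptions: posts live as one HTML file each under a hypothetical posts/ directory, and the tag-stripping regex is deliberately crude, as advertised:

use strict;
use warnings;
use Encode qw(encode_utf8);

my $corpus = "";
my @docs;    # [byte offset, path] pairs destined for file_map.dat
for my $path (sort glob("posts/*.html")) {
    open(my $fh, '<:encoding(UTF-8)', $path) or die "$path: $!";
    my $html = do { local $/; <$fh> };    # slurp
    close($fh);
    $html =~ s/<[^>]*>//gs;               # crude tag strip
    my $text = encode_utf8(lc $html);     # lower-case, then raw octets
    push @docs, [ length($corpus), $path ];
    $corpus .= $text . "\0";              # sentinel at each boundary
}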
Sort the suffixes lexicographically and store their byte offsets in sa.bin:
# Initialize with every suffix start, then sort byte-wise.
my @sa = (0 .. length($corpus) - 1);
{
    use bytes;    # forces byte-level comparison
    @sa = sort {
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
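The sorted array then goes to disk as fixed-width records, one 32-bit unsigned offset per suffix, matching the unpack("L", ...) on the search side:

# Serialize the suffix array. "L*" packs each offset as a native
# 32-bit unsigned int; indexer and search run on the same machine.
open(my $out, '>:raw', 'sa.bin') or die "sa.bin: $!";
print $out pack("L*", @sa);
close($out);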
This is the slowest part of the indexer, running at roughly O(L·N log N), where N is the suffix count and L the median suffix length. The fast path caps L at 64 bytes for most comparisons, bringing the effective cost down to O(N log N)–the 64-byte length was chosen to fit in a cache line.
CORPUS: a s c i \0 i m x
OFFSET: 0 1 2 3 4 5 6 7
SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx <-- "i" from "asci"
[5] imx <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
The sentinel (null byte) must be lexicographically smaller than every other character in the alphabet; that is what makes the suffixes at offsets 3 and 5 sort correctly across the document boundary.
Benchmarks.
Performed on a ThinkPad T490 running OpenBSD 7.8. CPU: Intel Core i7-10510U @ 1.80 GHz, amd64. 15 GB memory.
All articles are 16 KB (2x my avg post size) documents generated with seed.sh.
| Files Processed | Total Time (s) | corpus.bin | sa.bin     | file_map.dat | Total Index |
| 500             | 0.1475         | 33.59 KB   | 134.34 KB  | 37.01 KB     | 204.94 KB   |
| 1000            | 0.3101         | 67.28 KB   | 269.11 KB  | 74.12 KB     | 410.51 KB   |
| 10000           | 10.9661        | 682.51 KB  | 2730.05 KB | 750.88 KB    | 4163.44 KB  |
There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the index is very manageable.
Search
Need to perform a range query to find offsets of all matches. Suffix array is sorted–use binary search.
# Left bound: first suffix with $query as a prefix. $fh_sa and $fh_cp
# are open handles on sa.bin and corpus.bin; $sa_count is the number
# of 4-byte entries in sa.bin.
my ($low, $high) = (0, $sa_count - 1);
my $first_hit = -1;
while ($low <= $high) {
    # Int overflow: Perl handles it by promoting to float, so
    # ($low + $high) can't wrap.
    my $mid = int(($low + $high) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);
    my $cmp = $text cmp $query;
    if ($cmp >= 0) {
        $first_hit = $mid if $cmp == 0;
        $high = $mid - 1;
    } else {
        $low = $mid + 1;
    }
}
# Continue with right bound...
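The right bound is the mirror image. A sketch, not the author's exact code:

# Right bound: last suffix with $query as a prefix.
my ($lo, $hi) = (0, $sa_count - 1);
my $last_hit = -1;
while ($lo <= $hi) {
    my $mid = int(($lo + $hi) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);
    if (($text cmp $query) <= 0) {
        $last_hit = $mid if $text eq $query;
        $lo = $mid + 1;
    } else {
        $hi = $mid - 1;
    }
}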
Each offset in sa.bin is a 32-bit unsigned int. The 4-byte-aligned records let the script seek/read exact chunks into memory. Since the FastCGI script is invoked per query, small, frequent allocations beat less frequent, large ones (see bm_*.txt).
Once the range is found through binary search, the results are collected using a simple for loop. Max result count is capped at 20.
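A sketch of the collection loop; lookup_article() is a hypothetical helper standing in for the file_map.dat lookup:

my $max_results = 20;
my @results;
if ($first_hit >= 0) {    # -1 means no match
    for my $i ($first_hit .. $last_hit) {
        last if @results >= $max_results;
        seek($fh_sa, $i * 4, 0);
        read($fh_sa, my $bin_off, 4);
        my $off = unpack("L", $bin_off);
        # lookup_article() maps a corpus offset to (title, path)
        # via file_map.dat.
        push @results, lookup_article($off);
    }
}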
Benchmarks. Search for ‘arduino’ (expected to return 0 matches). SA-based index search vs. a naive regex scan over every file.
500 articles:
|                | SA index | Naive regex |
| Files Read     | 3        | 500         |
| User CPU (s)   | 0.0100   | 0.0600      |
| System CPU (s) | 0.0200   | 0.0200      |
| Total Time (s) | 0.0012   | 0.0407      |
| Peak RAM (KB)  | 8828     | 9136        |
1,000 articles:
|                | SA index | Naive regex |
| Files Read     | 3        | 1000        |
| User CPU (s)   | 0.0200   | 0.0500      |
| System CPU (s) | 0.0100   | 0.0500      |
| Total Time (s) | 0.0019   | 0.0795      |
| Peak RAM (KB)  | 8980     | 9460        |
10,000 articles:
|                | SA index | Naive regex |
| Files Read     | 3        | 10000       |
| User CPU (s)   | 0.0300   | 0.4000      |
| System CPU (s) | 0.0300   | 0.5400      |
| Total Time (s) | 0.0161   | 0.9120      |
| Peak RAM (KB)  | 12504    | 12804       |
Note: total time coming in below user + system CPU time is likely a measurement artifact–the script finishes faster than the OS tick interval.
Security
Memory usage remains a concern. Parallel queries could exhaust memory. Limit parallel searches using a semaphore implemented with lock files:
# Example values; tune per deployment. The lock dir path is illustrative.
my $max_parallel = 50;
my $lock_timeout = 5;                      # seconds until a lock is stale
my $lock_dir     = "/tmp/search-locks";
mkdir $lock_dir, 0777 unless -d $lock_dir;

# Count live locks, reaping any that outlived the timeout.
my $active_count = 0;
my $now = time();
opendir(my $dh, $lock_dir) or die "$lock_dir: $!";
while (my $file = readdir($dh)) {
    next unless $file =~ /\.lock$/;
    my $path = "$lock_dir/$file";
    my $mtime = (stat($path))[9] || 0;
    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
}
closedir($dh);
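Acquiring a slot is then refusal over the cap, lock file otherwise. A sketch; the PID-based lock name and the 503 response are assumptions, not the author's exact code:

if ($active_count >= $max_parallel) {
    # Over the cap: turn the request away cheaply.
    print "Status: 503 Service Unavailable\r\n";
    print "Content-Type: text/plain; charset=utf-8\r\n\r\n";
    print "Search is busy. Try again in a moment.\n";
    exit 0;
}
my $my_lock = "$lock_dir/$$.lock";    # PID keeps names unique
open(my $lfh, '>', $my_lock) or exit 0;
close($lfh);
END { unlink $my_lock if defined $my_lock; }    # release on any exit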
XSS attacks: escape all HTML output with escape_html from HTML::Escape.
ReDoS: sanitize the search text by tossing non-printable characters, limiting its length, and staying within UTF-8. Use \Q...\E wherever the query reaches a regex; see the sketch after the sanitizer below.
use Encode qw(decode_utf8);

my $search_text = "";
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
    $search_text = $1;
    # Percent-decode, then decode the raw octets as UTF-8.
    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    $search_text = decode_utf8($search_text // "");
    $search_text =~ s/\P{Print}//g;                 # drop non-printables
    $search_text = substr($search_text, 0, 64);     # cap length
    $search_text =~ s/^\s+|\s+$//g;                 # trim whitespace
}
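Where the sanitized query does reach a regex (snippet highlighting, say), \Q...\E keeps it literal; $snippet here is hypothetical:

# \Q...\E quotes metacharacters, so input like "a.b*" matches
# literally instead of being compiled into the pattern.
my $is_hit = ($snippet =~ /\Q$search_text\E/) ? 1 : 0;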
Command injection: no exec()/system() calls. Non-privileged user. Read-only directory.
Symlink attacks: paths are hard-coded. Script can’t be modified (mode 0554).
No API keys, passwords involved.
Last line of defence: chroot.
Can’t think of anything else.
Verdict: Fast SA index lookup. Mitigated primary attack vectors. No dependencies. Works on every conceivable device.