---
title: Site search with Perl + CGI + SA
date: 2026-01-02
layout: post
---
Number of articles on the site is growing. Need a way to find them.
Search must match substrings, be case-insensitive, fast, and secure. Site must
continue to work without JavaScript, in text-mode browsers--so no client-side
search. Page rank not applicable.
Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base
system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
results back over FastCGI.
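slowcgi translates each FastCGI request from httpd into a plain CGI execution, so the search program is an ordinary CGI script. A minimal sketch of that contract (the script path and the output are placeholders, not the real search script):
```
#!/usr/bin/perl
# /var/www/cgi-bin/search.pl -- invoked by slowcgi, one process per request
use strict;
use warnings;

my $query_string = $ENV{QUERY_STRING} // '';   # httpd passes the query here

# A CGI response is headers, a blank line, then the body
print "Content-Type: text/html; charset=utf-8\r\n\r\n";
print "<!doctype html>\n<title>Search</title>\n";
print "<p>Received ", length($query_string), " bytes of query string.</p>\n";
```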
Data structure. Suffix array. Build-time indexer generates three files:
corpus.bin--the concatenated article text, sa.bin--the suffix array,
file_map.dat--article metadata (title, path).
## Indexer
Crawl the directory containing posts, extract the text from the HTML with regex
(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
it (for case-insensitive search), convert to raw octets, and concatenate.
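A sketch of that pass; the posts/*.html layout and the tag-stripping regex are illustrative, and title extraction into the file map is omitted:
```
use strict;
use warnings;
use Encode qw(encode_utf8);

my (@docs, @file_map);
my $offset = 0;
for my $file (sort glob("posts/*.html")) {
    open(my $fh, '<:encoding(UTF-8)', $file) or die "$file: $!";
    my $html = do { local $/; <$fh> };
    close($fh);

    (my $text = $html) =~ s/<[^>]*>//gs;          # strip tags with a regex
    my $octets = encode_utf8(lc $text);           # lower-case, raw octets
    push @file_map, { path => $file, offset => $offset };
    push @docs, $octets;
    $offset += length($octets) + 1;               # +1 for the \0 separator
}
my $corpus = join("\0", @docs);                   # sentinel between documents
```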
Sort the suffixes lexicographically and store their byte offsets in sa.bin:
```
# @sa holds one entry per suffix: its byte offset into $corpus
my @sa = (0 .. length($corpus) - 1);

# Use a block that forces byte-level comparison
{
    use bytes;
    @sa = sort {
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
        # Full suffix fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```
This is the slowest part of the indexer, running at roughly O(L·N log N), where
N is the suffix count and L the median suffix length. The fast path caps L at
64 bytes in most cases, bringing the effective cost down to O(N log N)--64
bytes chosen to match a cache line.
```
CORPUS: a s c i \0 i m x
OFFSET: 0 1 2 3 4 5 6 7
SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx <-- "i" from "asci"
[5] imx <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
```
The sentinel (null byte) must be lexicographically smaller than every other
character in the alphabet, so the suffixes at offsets 3 and 5 sort correctly at
the document boundary.
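Finally, the offsets and the corpus get written out. A sketch: the file names are the ones described above, pack "L*" matches the 4-byte unsigned reads the search script does later, and file_map.dat's exact format is omitted.
```
# sa.bin: each suffix offset as a 32-bit unsigned int, back to back
open(my $out, '>:raw', 'sa.bin') or die "sa.bin: $!";
print $out pack("L*", @sa);
close($out);

# corpus.bin: the lower-cased, concatenated octets the offsets point into
open($out, '>:raw', 'corpus.bin') or die "corpus.bin: $!";
print $out $corpus;
close($out);
```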
Benchmarks. Performed on a T490 running OpenBSD 7.8: Intel(R) Core(TM)
i7-10510U @ 1.80GHz, amd64, 15 GB memory.
All articles are 16 KB (2x my avg post size) documents generated with seed.sh.

| Files Processed | Total Time (s) | corpus.bin | sa.bin | file_map.dat | Total Index |
|----------------:|---------------:|-----------:|-----------:|-------------:|------------:|
| 500 | 0.1475 | 33.59 KB | 134.34 KB | 37.01 KB | 204.94 KB |
| 1000 | 0.3101 | 67.28 KB | 269.11 KB | 74.12 KB | 410.51 KB |
| 10000 | 10.9661 | 682.51 KB | 2730.05 KB | 750.88 KB | 4163.44 KB |

There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the
index is very manageable.
## Search
Need to perform a range query to find offsets of all matches. Suffix
array is sorted--use binary search.
```
# $fh_sa, $fh_cp: binmode'd handles on sa.bin and corpus.bin
# $query: lower-cased raw octets, $query_len its length
my $n_suffixes = (-s $fh_sa) / 4;       # one 32-bit offset per suffix
my ($low, $high) = (0, $n_suffixes - 1);
my $first_hit = -1;

# left bound
while ($low <= $high) {
    # Int overflow: Perl handles it by promoting to float.
    my $mid = int(($low + $high) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);    # offset into corpus.bin
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);
    my $cmp = $text cmp $query;
    if ($cmp >= 0) {
        $first_hit = $mid if $cmp == 0;
        $high = $mid - 1;
    } else {
        $low = $mid + 1;
    }
}
# Continue with right bound...
```
Each offset in sa.bin is a 32-bit unsigned int. Use the word-aligned offsets to
seek and read exact chunks into memory. Since the FastCGI script is invoked per
query, small, frequent allocations beat fewer, larger ones (see bm_*.txt).
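The right bound is the mirror image; a sketch reusing the handles and variables from the left-bound loop above:
```
# right bound: last suffix whose prefix equals the query
($low, $high) = (0, $n_suffixes - 1);   # reset the bounds
my $last_hit = -1;
while ($low <= $high) {
    my $mid = int(($low + $high) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);
    if (($text cmp $query) <= 0) {
        $last_hit = $mid if $text eq $query;
        $low = $mid + 1;
    } else {
        $high = $mid - 1;
    }
}
```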
Once the range is found through binary search, the results are collected using
a simple for loop. Max result count is capped at 20.
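A sketch of that loop, with $first_hit and $last_hit being the bounds from the two binary searches:
```
my $max_results = 20;
my @hits;
if ($first_hit >= 0) {                       # -1 means no match at all
    for my $i ($first_hit .. $last_hit) {
        last if @hits >= $max_results;       # cap the result count
        seek($fh_sa, $i * 4, 0);             # 4 bytes per suffix-array entry
        read($fh_sa, my $bin_off, 4);
        push @hits, unpack("L", $bin_off);   # corpus offset of this match
    }
}
# Each offset is then resolved to a title/path via file_map.dat.
```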
Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index
search. Right: Naive search using regex.
500 articles:

| Metric         | SA index | Naive regex |
|----------------|---------:|------------:|
| Files Read     |        3 |         500 |
| User CPU (s)   |   0.0100 |      0.0600 |
| System CPU (s) |   0.0200 |      0.0200 |
| Total Time (s) |   0.0012 |      0.0407 |
| Peak RAM (KB)  |     8828 |        9136 |

1,000 articles:

| Metric         | SA index | Naive regex |
|----------------|---------:|------------:|
| Files Read     |        3 |        1000 |
| User CPU (s)   |   0.0200 |      0.0500 |
| System CPU (s) |   0.0100 |      0.0500 |
| Total Time (s) |   0.0019 |      0.0795 |
| Peak RAM (KB)  |     8980 |        9460 |

10,000 articles:

| Metric         | SA index | Naive regex |
|----------------|---------:|------------:|
| Files Read     |        3 |       10000 |
| User CPU (s)   |   0.0300 |      0.4000 |
| System CPU (s) |   0.0300 |      0.5400 |
| Total Time (s) |   0.0161 |      0.9120 |
| Peak RAM (KB)  |    12504 |       12804 |

Note: total time coming in below user + system CPU time is likely a measurement
artifact--the script finishes faster than the OS tick interval.
## Security
Memory usage remains a concern. Parallel queries could exhaust memory. Limit
parallel searches using a semaphore implemented with lock files:
```
my $max_parallel = 50;                 # cap on concurrent searches
# $lock_dir and $lock_timeout are set elsewhere (not shown here)
mkdir $lock_dir, 0777 unless -d $lock_dir;

# Count live lock files, reaping stale ones as we go
my $active_count = 0;
my $now = time();
opendir(my $dh, $lock_dir) or die "opendir $lock_dir: $!";
while (my $file = readdir($dh)) {
    next unless $file =~ /\.lock$/;
    my $path  = "$lock_dir/$file";
    my $mtime = (stat($path))[9] || 0;
    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
}
closedir($dh);
```
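The gate itself is a few more lines. A sketch of one way to take and release a slot; the per-PID lock name and the 503 response are illustrative:
```
# Refuse the query when the semaphore is full; otherwise take a slot.
if ($active_count >= $max_parallel) {
    print "Status: 503 Service Unavailable\r\n";
    print "Content-Type: text/plain\r\n\r\n";
    print "Busy, try again shortly.\n";
    exit;
}
my $lock_file = "$lock_dir/$$.lock";          # one lock per process ID
open(my $lk, '>', $lock_file) and close($lk); # touch it; mtime marks liveness
# ... run the search ...
unlink($lock_file);                           # release the slot
```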
XSS attacks. Escape all HTML output using `HTML::Escape qw(escape_html)`.
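For example (the echoed-query markup is illustrative):
```
use HTML::Escape qw(escape_html);

# Anything user-supplied or file-derived is escaped before it hits the page
printf '<p>Results for "%s":</p>', escape_html($search_text);
```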
ReDoS: sanitize the search text by tossing non-printable characters, limiting
the length, and working within UTF-8. Use `\Q...\E` in searches.
```
use Encode qw(decode_utf8);

my $search_text = '';
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
    $search_text = $1;
    $search_text =~ tr/+/ /;                              # '+' encodes a space
    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;  # percent-decode
    $search_text = decode_utf8($search_text);             # octets -> characters
    $search_text =~ s/\P{Print}//g;                       # drop non-printables
    $search_text = substr($search_text, 0, 64);           # cap the length
    $search_text =~ s/^\s+|\s+$//g;                       # trim whitespace
}
```
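If the sanitized text ever meets a regex (say, for highlighting matches in a snippet--an assumption here, as is $snippet), `\Q...\E` keeps it a literal string rather than a pattern:
```
# \Q...\E quotes regex metacharacters, so a query like "c++ (beta)" matches literally
my $needle = qr/\Q$search_text\E/i;
my $is_hit = ($snippet =~ $needle) ? 1 : 0;
```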
Command injection: no exec()/system() calls. Non-privileged user.
Read-only directory.
Symlink attacks: paths are hard-coded. Script can't be modified (mode 554).
No API keys or passwords involved.
Last line of defence: chroot.
Can't think of anything else.
Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
dependencies. Works on every conceivable device.
Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
| Benchmarks:
[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)