SITE SEARCH WITH PERL + CGI + SA
02 JANUARY 2026
Number of articles on the site is growing. Need a way to find them.
Search must match substrings, be case-insensitive, fast, and secure. Site must continue to work without JavaScript, on text-mode browsers–so no client-side search. Page rank not applicable.
Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the results back over FastCGI.
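The httpd side is a few lines of config. A minimal sketch, assuming the script sits in /cgi-bin inside the default /var/www chroot–server name and paths are illustrative:

# Hypothetical httpd.conf fragment: hand /cgi-bin/* to slowcgi.
server "example.org" {
    listen on * port 80
    location "/cgi-bin/*" {
        fastcgi socket "/run/slowcgi.sock"
        root "/"
    }
}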
Data structure. Suffix array. Build-time indexer generates three files: corpus.bin–the articles, sa.bin–the suffix array, file_map.dat–article metadata (title, path).
Indexer
Crawl the directory containing posts, strip the HTML with regex (no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case the text (for case-insensitive search), convert to raw octets, concatenate.
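A sketch of that pass, under a few assumptions: posts live as one HTML file each under a hypothetical posts/ directory, and the tag-stripping regex is deliberately crude, as advertised:

use strict;
use warnings;
use Encode qw(encode_utf8);

my $corpus = "";
my @docs;    # [byte offset, path] pairs destined for file_map.dat
for my $path (sort glob("posts/*.html")) {
    open(my $fh, '<:encoding(UTF-8)', $path) or die "$path: $!";
    my $html = do { local $/; <$fh> };    # slurp
    close($fh);
    $html =~ s/<[^>]*>//gs;               # crude tag strip
    my $text = encode_utf8(lc $html);     # lower-case, then raw octets
    push @docs, [ length($corpus), $path ];
    $corpus .= $text . "\0";              # sentinel at each boundary
}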
Sort the suffixes lexicographically and store their byte offsets in sa.bin:
# Initialize with every suffix start, then sort byte-wise.
my @sa = (0 .. length($corpus) - 1);
{
    use bytes;    # forces byte-level comparison
    @sa = sort {
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
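The sorted array then goes to disk as fixed-width records, one 32-bit unsigned offset per suffix, matching the unpack("L", ...) on the search side:

# Serialize the suffix array. "L*" packs each offset as a native
# 32-bit unsigned int; indexer and search run on the same machine.
open(my $out, '>:raw', 'sa.bin') or die "sa.bin: $!";
print $out pack("L*", @sa);
close($out);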
This is the slowest part of the indexer, running at roughly O(L·N log N), where N is the suffix count and L the median suffix length. The fast path caps L at 64 bytes for most comparisons, bringing the effective cost down to O(N log N)–the 64-byte length was chosen to fit in a cache line.
CORPUS: a s c i \0 i m x
OFFSET: 0 1 2 3 4 5 6 7
SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx <-- "i" from "asci"
[5] imx <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
The sentinel (null byte) must be lexicographically smaller than every other character in the alphabet; that is what makes the suffixes at offsets 3 and 5 sort correctly across the document boundary.
Benchmarks.
Performed on a ThinkPad T490 running OpenBSD 7.8. CPU: Intel Core i7-10510U @ 1.80 GHz, amd64. 15 GB memory.
All articles are 16 KB (2x my avg post size) documents generated with seed.sh.
| Files Processed | Total Time (s) | corpus.bin | sa.bin     | file_map.dat | Total Index |
| 500             | 0.1475         | 33.59 KB   | 134.34 KB  | 37.01 KB     | 204.94 KB   |
| 1000            | 0.3101         | 67.28 KB   | 269.11 KB  | 74.12 KB     | 410.51 KB   |
| 10000           | 10.9661        | 682.51 KB  | 2730.05 KB | 750.88 KB    | 4163.44 KB  |
There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the index is very manageable.
Search
Need to perform a range query to find offsets of all matches. Suffix array is sorted–use binary search.
# Left bound: first suffix with $query as a prefix. $fh_sa and $fh_cp
# are open handles on sa.bin and corpus.bin; $sa_count is the number
# of 4-byte entries in sa.bin.
my ($low, $high) = (0, $sa_count - 1);
my $first_hit = -1;
while ($low <= $high) {
    # Int overflow: Perl handles it by promoting to float, so
    # ($low + $high) can't wrap.
    my $mid = int(($low + $high) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);
    my $cmp = $text cmp $query;
    if ($cmp >= 0) {
        $first_hit = $mid if $cmp == 0;
        $high = $mid - 1;
    } else {
        $low = $mid + 1;
    }
}
# Continue with right bound...
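The right bound is the mirror image. A sketch, not the author's exact code:

# Right bound: last suffix with $query as a prefix.
my ($lo, $hi) = (0, $sa_count - 1);
my $last_hit = -1;
while ($lo <= $hi) {
    my $mid = int(($lo + $hi) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);
    if (($text cmp $query) <= 0) {
        $last_hit = $mid if $text eq $query;
        $lo = $mid + 1;
    } else {
        $hi = $mid - 1;
    }
}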
Each offset in sa.bin is a 32-bit unsigned int. The 4-byte-aligned records let the script seek/read exact chunks into memory. Since the FastCGI script is invoked per query, small, frequent allocations beat less frequent, large ones (see bm_*.txt).
Once the range is found through binary search, the results are collected using a simple for loop. Max result count is capped at 20.
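A sketch of the collection loop; lookup_article() is a hypothetical helper standing in for the file_map.dat lookup:

my $max_results = 20;
my @results;
if ($first_hit >= 0) {    # -1 means no match
    for my $i ($first_hit .. $last_hit) {
        last if @results >= $max_results;
        seek($fh_sa, $i * 4, 0);
        read($fh_sa, my $bin_off, 4);
        my $off = unpack("L", $bin_off);
        # lookup_article() maps a corpus offset to (title, path)
        # via file_map.dat.
        push @results, lookup_article($off);
    }
}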
Benchmarks. Search for ‘arduino’ (expected to return 0 matches). SA-based index search vs. a naive regex scan over every file.
500 articles:
|                | SA index | Naive regex |
| Files Read     | 3        | 500         |
| User CPU (s)   | 0.0100   | 0.0600      |
| System CPU (s) | 0.0200   | 0.0200      |
| Total Time (s) | 0.0012   | 0.0407      |
| Peak RAM (KB)  | 8828     | 9136        |
1,000 articles:
|                | SA index | Naive regex |
| Files Read     | 3        | 1000        |
| User CPU (s)   | 0.0200   | 0.0500      |
| System CPU (s) | 0.0100   | 0.0500      |
| Total Time (s) | 0.0019   | 0.0795      |
| Peak RAM (KB)  | 8980     | 9460        |
10,000 articles:
|                | SA index | Naive regex |
| Files Read     | 3        | 10000       |
| User CPU (s)   | 0.0300   | 0.4000      |
| System CPU (s) | 0.0300   | 0.5400      |
| Total Time (s) | 0.0161   | 0.9120      |
| Peak RAM (KB)  | 12504    | 12804       |
Note: total time coming in below user + system CPU time is likely a measurement artifact–the script finishes faster than the OS tick interval.
Security
Memory usage remains a concern. Parallel queries could exhaust memory. Limit parallel searches using a semaphore implemented with lock files:
# Example values; tune per deployment. The lock dir path is illustrative.
my $max_parallel = 50;
my $lock_timeout = 5;                      # seconds until a lock is stale
my $lock_dir     = "/tmp/search-locks";
mkdir $lock_dir, 0777 unless -d $lock_dir;

# Count live locks, reaping any that outlived the timeout.
my $active_count = 0;
my $now = time();
opendir(my $dh, $lock_dir) or die "$lock_dir: $!";
while (my $file = readdir($dh)) {
    next unless $file =~ /\.lock$/;
    my $path = "$lock_dir/$file";
    my $mtime = (stat($path))[9] || 0;
    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
}
closedir($dh);
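Acquiring a slot is then refusal over the cap, lock file otherwise. A sketch; the PID-based lock name and the 503 response are assumptions, not the author's exact code:

if ($active_count >= $max_parallel) {
    # Over the cap: turn the request away cheaply.
    print "Status: 503 Service Unavailable\r\n";
    print "Content-Type: text/plain; charset=utf-8\r\n\r\n";
    print "Search is busy. Try again in a moment.\n";
    exit 0;
}
my $my_lock = "$lock_dir/$$.lock";    # PID keeps names unique
open(my $lfh, '>', $my_lock) or exit 0;
close($lfh);
END { unlink $my_lock if defined $my_lock; }    # release on any exit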
XSS attacks: escape all HTML output with escape_html from HTML::Escape.
ReDoS: sanitize the search text by tossing non-printable characters, limiting its length, and staying within UTF-8. Use \Q...\E wherever the query reaches a regex; see the sketch after the sanitizer below.
use Encode qw(decode_utf8);

my $search_text = "";
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
    $search_text = $1;
    # Percent-decode, then decode the raw octets as UTF-8.
    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    $search_text = decode_utf8($search_text // "");
    $search_text =~ s/\P{Print}//g;                 # drop non-printables
    $search_text = substr($search_text, 0, 64);     # cap length
    $search_text =~ s/^\s+|\s+$//g;                 # trim whitespace
}
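Where the sanitized query does reach a regex (snippet highlighting, say), \Q...\E keeps it literal; $snippet here is hypothetical:

# \Q...\E quotes metacharacters, so input like "a.b*" matches
# literally instead of being compiled into the pattern.
my $is_hit = ($snippet =~ /\Q$search_text\E/) ? 1 : 0;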
Command injection: no exec()/system() calls. Non-privileged user. Read-only directory.
Symlink attacks: paths are hard-coded. Script can’t be modified (mode 0554).
No API keys, passwords involved.
Last line of defence: chroot.
Can’t think of anything else.
Verdict: Fast SA index lookup. Mitigated primary attack vectors. No dependencies. Works on every conceivable device.