PERL + FASTCGI + SA SEARCH ENGINE

02 JANUARY 2026

Number of articles growing. Need search.

Requirements: substring match, case-insensitive, fast, secure. No JavaScript.

Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.
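slowcgi runs the script as an ordinary CGI program, so the request arrives through the environment. A minimal sketch of the entry point, assuming the search term comes in a query-string parameter named `q` (a hypothetical name; `query_param` is also illustrative, not the post's code):

```perl
use strict;
use warnings;

# Extract and percent-decode one parameter from a CGI query string.
# Hypothetical helper; returns undef when the parameter is absent.
sub query_param {
    my ($query_string, $name) = @_;
    for my $pair (split /[&;]/, $query_string // '') {
        my ($k, $v) = split /=/, $pair, 2;
        next unless defined $k && $k eq $name;
        $v //= '';
        $v =~ tr/+/ /;                            # '+' encodes a space
        $v =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;  # %XX -> byte
        return $v;
    }
    return;
}

# The web server side passes QUERY_STRING via the CGI environment.
my $q = query_param($ENV{QUERY_STRING}, 'q');
print "Content-Type: text/html; charset=utf-8\r\n\r\n";
print defined $q ? "<p>query received</p>\n" : "<p>no query</p>\n";
```

The decoded value is still untrusted at this point and needs sanitizing before use.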

Data structure: suffix array. Three files: corpus.bin (articles), sa.bin (sorted byte offsets), file_map.dat (metadata).

Indexer crawls posts, extracts text from HTML with regex, lowercases, concatenates. Null byte sentinel marks document boundaries. Sort suffixes lexicographically::

# Use a block that forces byte-level comparison
{
    use bytes; 
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

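Slow path: O(L⋅N log N), where N is the corpus length and L the compared suffix length. Fast path caps L at 64 bytes → O(N log N) whenever the first 64 bytes differ; ties fall back to the full comparison. 64-byte length matches the typical cache line size.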
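The indexing steps can be sketched end to end on a toy input. This is an illustration under assumptions, not the indexer itself: `html_to_text`, `build_corpus` and `build_sa` are hypothetical names, the tag stripper is deliberately crude, and the input is assumed to be raw (undecoded) bytes so that `cmp` compares bytewise.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce one HTML document to lowercase text.
sub html_to_text {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;  # crude tag removal
    $text =~ tr/\0//d;                    # NUL is reserved as the sentinel
    $text =~ s/\s+/ /g;                   # collapse whitespace runs
    $text =~ s/^ | $//g;                  # trim
    return lc $text;
}

# Concatenate documents, terminating each with a NUL sentinel.
# Returns the corpus and an array ref of each document's start offset.
sub build_corpus {
    my (@docs) = @_;
    my $corpus = '';
    my @starts;
    for my $doc (@docs) {
        push @starts, length $corpus;
        $corpus .= html_to_text($doc) . "\0";
    }
    return ($corpus, \@starts);
}

# Sort every suffix offset, 64-byte fast path first, full compare on ties.
sub build_sa {
    my ($corpus) = @_;
    my @sa = sort {
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64))
            || (substr($corpus, $a) cmp substr($corpus, $b))
    } 0 .. length($corpus) - 1;
    return \@sa;
}

my ($corpus, $starts) = build_corpus('<p>Ab</p>', '<b>ba</b>');
my $sa = build_sa($corpus);
```

The document start offsets are what a metadata file would need in order to map a match offset back to an article.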

Search: binary search over the suffix array for the range of suffixes that start with the query. Cap at 20 results: define limits or be surprised by them.
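A sketch of the range query, assuming the corpus and suffix array are held in memory (the real script may read them from disk instead). `sa_search` is a hypothetical name. Two binary searches find the first and last suffix whose prefix equals the query; everything between them is a match.

```perl
use strict;
use warnings;

# Return up to $limit corpus offsets at which $query occurs.
# $corpus: lowercased text; $sa: array ref of sorted suffix offsets.
sub sa_search {
    my ($corpus, $sa, $query, $limit) = @_;
    $limit //= 20;
    $query = lc $query;
    my $len = length $query;
    return () if $len == 0;

    # Lower bound: first suffix whose $len-byte prefix is >= the query.
    my ($lo, $hi) = (0, scalar @$sa);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if (substr($corpus, $sa->[$mid], $len) lt $query) { $lo = $mid + 1 }
        else                                              { $hi = $mid }
    }
    my $first = $lo;

    # Upper bound: first suffix whose $len-byte prefix is > the query.
    $hi = @$sa;
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if (substr($corpus, $sa->[$mid], $len) le $query) { $lo = $mid + 1 }
        else                                              { $hi = $mid }
    }
    my $last = $lo - 1;

    return () if $last < $first;
    $last = $first + $limit - 1 if $last - $first + 1 > $limit;
    return @{$sa}[$first .. $last];
}
```

Hits come back in suffix order, not document order, so the cap keeps an arbitrary but deterministic subset.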

File I/O and memory: many small seek/read calls beat one large allocation (see benchmarks for find_one_file.cgi).
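A sketch of the seek/read approach. The on-disk layout of sa.bin is not specified here, so this assumes fixed-width 32-bit big-endian offsets (`pack 'N'`); `sa_entry` and `corpus_chunk` are hypothetical names.

```perl
use strict;
use warnings;

use constant ENTRY_SIZE => 4;   # assumption: one 32-bit offset per entry

# Read the $i-th suffix offset from an open, binary-mode sa.bin handle.
sub sa_entry {
    my ($sa_fh, $i) = @_;
    seek($sa_fh, $i * ENTRY_SIZE, 0) or die "seek: $!";
    my $n = read($sa_fh, my $buf, ENTRY_SIZE);
    die "short read on suffix array" unless defined $n && $n == ENTRY_SIZE;
    return unpack 'N', $buf;
}

# Read up to $len bytes of corpus.bin starting at $offset.
sub corpus_chunk {
    my ($corpus_fh, $offset, $len) = @_;
    seek($corpus_fh, $offset, 0) or die "seek: $!";
    my $n = read($corpus_fh, my $buf, $len);
    die "read: $!" unless defined $n;
    return $buf;   # may be shorter than $len near end of file
}
```

Each binary-search probe then costs one `sa_entry` call and one `corpus_chunk` call of query length, so memory use stays flat regardless of corpus size.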

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16 KB articles):

1,000 files: 0.31s indexing, 410 KB total index.
10,000 files: 10.97s indexing, 4.16 MB total index.

Search ‘arduino’ (0 matches):
1,000 files: 0.002s (SA) vs 0.080s (naive regex).
10,000 files: 0.016s (SA) vs 0.912s (naive regex).

Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). Sanitize input: strip non-printables, limit length, and quote metacharacters (ReDoS). No exec/system (command injection). Chroot.
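The input and output handling can be sketched as below. The 64-character limit and the restriction to printable ASCII are assumptions made for the example, and `sanitize_query` and `escape_html` are hypothetical names.

```perl
use strict;
use warnings;

use constant MAX_QUERY_LEN => 64;   # assumption: pick a limit that fits the site

# Strip non-printables, cap the length, lowercase to match the corpus.
sub sanitize_query {
    my ($raw) = @_;
    $raw //= '';
    (my $q = $raw) =~ s/[^\x20-\x7E]//g;   # keep printable ASCII only
    $q = substr($q, 0, MAX_QUERY_LEN);
    return lc $q;
}

# Escape the five characters that matter in HTML text and attributes.
sub escape_html {
    my ($s) = @_;
    my %ent = (
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&#39;',
    );
    $s =~ s/([&<>"'])/$ent{$1}/g;
    return $s;
}

# If the query ever reaches a regex, quote it first so it matches literally.
my $q       = sanitize_query("Ar\x00du<i>.*\n");
my $literal = quotemeta $q;       # metacharacters are backslash-escaped
my $shown   = escape_html($q);    # safe to echo back in the results page
```

`quotemeta` is the built-in behind `\Q...\E`; with every metacharacter escaped, the pattern has no quantifiers left to backtrack on.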

Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.

Commit: 6da102d | Benchmarks: 8a4da68