---
title: Perl + FastCGI + SA search engine
date: 2026-01-02
layout: post
---

Number of articles growing. Need search.

Requirements: substring match, case-insensitive, fast, secure. No JavaScript.

Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.

Data structure: suffix array. Three files: corpus.bin (articles), sa.bin (sorted byte offsets), file_map.dat (metadata).

Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null byte sentinel for document boundaries. Sort lexicographically:

```
# Use a block that forces byte-level comparison
{
    use bytes;
    @sa = sort {
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```

Slow path: O(L·N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte length targets cache lines.

Search: binary search for range query. Cap at 20 results--define limits or be surprised by them.

File IO and memory: many seek/read small chunks beat one large allocation (see benchmarks for find_one_file.cgi).

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):

1,000 files: 0.31s indexing, 410 KB total index.

10,000 files: 10.97s indexing, 4.16 MB total index.

Search 'arduino' (0 matches):

1,000 files: 0.002s (SA) vs 0.080s (naive regex).

10,000 files: 0.016s (SA) vs 0.912s (naive regex).

Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). Sanitize input--strip non-printables, limit length, and quote metacharacters (ReDoS). No exec/system (command injection). Chroot.

Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.

Commit: [6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e) | Benchmarks: [8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)
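
Appendix sketch: the binary-search range query over the suffix array might look roughly like this. It is a minimal in-memory illustration, not the code from the linked commit. The toy `$corpus`, the helper names `bound` and `search`, and the in-memory `@sa` are assumptions made for the example. The real script works against corpus.bin and sa.bin on disk with seek/read.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for corpus.bin and sa.bin:
# lowercased documents joined with a null byte sentinel.
my $corpus = "perl fastcgi\0openbsd httpd\0suffix array search\0";
my @sa = sort { substr($corpus, $a) cmp substr($corpus, $b) }
         0 .. length($corpus) - 1;

# Binary search for a boundary in the suffix array.
# Returns the first index whose suffix, truncated to the query length,
# compares >= the query (or > the query when $upper is true).
sub bound {
    my ($corpus, $sa, $query, $upper) = @_;
    my ($lo, $hi) = (0, scalar @$sa);
    my $len = length $query;
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = substr($$corpus, $sa->[$mid], $len) cmp $query;
        if ($cmp < 0 || ($upper && $cmp == 0)) { $lo = $mid + 1 }
        else                                   { $hi = $mid }
    }
    return $lo;
}

# Range query: every suffix between the two bounds starts with the query.
# Returns at most $limit byte offsets into the corpus.
sub search {
    my ($corpus, $sa, $query, $limit) = @_;
    $query = lc $query;
    return () if $query eq '';
    my $first = bound($corpus, $sa, $query, 0);
    my $last  = bound($corpus, $sa, $query, 1);   # one past the final match
    my @hits  = @{$sa}[$first .. $last - 1];
    splice(@hits, $limit) if @hits > $limit;      # cap the result count
    return @hits;
}

print join(",", sort { $a <=> $b } search(\$corpus, \@sa, "ar", 20)), "\n";
# prints 34,42
```

Two binary searches give the whole match range in O(log N) comparisons, so a query with zero matches costs about the same as one with many. Mapping each offset back to its article would go through file_map.dat, which this sketch leaves out.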
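
Appendix sketch: the input handling listed under Security could be approximated as below. This is an illustration under assumptions, not the code from the linked commit. The helper names `sanitize_query` and `escape_html` are hypothetical, and so are the 64-character cap and the ASCII-only filter. `quotemeta` is Perl's built-in for escaping regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical length cap; the post does not state the real limit.
my $MAX_QUERY_LEN = 64;

# Strip non-printable bytes, cap the length, lowercase to match the corpus.
sub sanitize_query {
    my ($raw) = @_;
    my $q = defined $raw ? $raw : '';
    $q =~ s/[^\x20-\x7e]//g;              # keep printable ASCII only (assumption)
    $q = substr($q, 0, $MAX_QUERY_LEN);   # limit length
    return lc $q;
}

# Escape the characters that matter in HTML text and attribute values.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;                    # must run first
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    $s =~ s/'/&#39;/g;
    return $s;
}

my $query = sanitize_query("Ard\x00uino.*\n");
my $regex = quotemeta $query;             # safe to interpolate into a pattern
print escape_html("Results for <$query>"), "\n";
```

Quoting metacharacters matters only where the query reaches a regex, for example when highlighting matches. The suffix array lookup itself compares raw bytes.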