--- title: 'Full-text search: SA lookup' date: 2026-01-03 layout: post --- Needed search for site. Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript. Architecture: browser → httpd → slowcgi → Perl script. Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index built with site: ``` $ JEKYLL_ENV=production bundle exec jekyll build $ cd cgi-bin/ $ perl indexer.pl ``` Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned integers, sorted lexicographically: ``` my @sa = 0 .. (length($corpus) - 1); { use bytes; # Force compare 8-bit Unicode value comparisons @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || # Full string fallback (required for correctness) (substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } ``` Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a cache line) → O(N log N). 32-bit offsets limit index size to 4GB. Search: Textbook range query with twin binary searches. Fixed-width offsets enable random access: ``` seek($fh_sa, $mid * 4, 0); read($fh_sa, my $bin_off, 4); my $off = unpack("L", $bin_off); seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` Small seek/reads are fast on modern SSDs. Keeps memory usage low. NOTE: mmap. Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search: 500 files: - Index size: 204.94 KB - Indexing time: 0.1475 s - Peak RAM (SA): 8828 KB - Peak RAM (Regex): 9136 KB - Search (SA): 0.0012 s - Search (Regex): 0.0407 s 1,000 files: - Index size: 410.51 KB - Indexing time: 0.3101 s - Peak RAM (SA): 8980 KB - Peak RAM (Regex): 9460 KB - Search (SA): 0.0019 s - Search (Regex): 0.0795 s 10,000 files: - Index size: 4163.44 KB - Indexing time: 10.9661 s - Peak RAM (SA): 12504 KB - Peak RAM (Regex): 12804 KB - Search (SA): 0.0161 s - Search (Regex): 0.9120 S Security: httpd, slowcgi, Perl are in the base system--no dependencies. File system permissions govern access. Runs in chroot. Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit concurrent searches. Query length (64B) and result set (20) are capped. All output is HTML-escaped to prevent XSS. Warranty: 10,000 / 12 → 833 years. Next release: inverted index; Anno Domini 2859. Commit: 6da102d | Benchmarks: 8a4da68