--- title: 'Full-text search: SA lookup' date: 2026-01-03 layout: post --- Article count is growing. Need search. Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript. Architecture: browser → httpd → slowcgi → Perl CGI script. Perl, httpd, slowcgi are in the base system. Search is anonymous; file system permissions govern access. 2025-12-30: Regex search. Threw together a 140-line Perl script; searches 500 files in 40ms. Fast enough. O(N) drag felt at higher file counts. Crawling files introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to stop here. 2026-01-03: Suffix array (SA) search. Regex search depends almost entirely on hardware for speed. Slurping files on every request bothers me. Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index built with site: ``` $ JEKYLL_ENV=production bundle exec jekyll build $ cd cgi-bin/ $ perl indexer.pl ``` Indexer: Extracts HTML, lowercases, and encodes into UTF-8 binary sequences. Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned integers, sorted by lexicographical order: ``` my @sa = 0 .. (length($corpus) - 1); { use bytes; # Force compare 8-bit Unicode value comparisons @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || # Full string fallback (required for correctness) (substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } CORPUS: a s c i \0 i m x OFFSET: 0 1 2 3 4 5 6 7 SORTED SUFFIXES: [4] \0imx [0] asci\0imx [2] ci\0imx [3] i\0imx <-- "i" from "asci" [5] imx <-- "i" from "imx" [6] mx [1] sci\0imx [7] x ``` Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a cache line) → O(N log N). Search: Textbook range query with twin binary searches. Random access to suffixes and the text made possible by the fixed-width offsets: ``` seek($fh_sa, $mid * 4, 0); read($fh_sa, my $bin_off, 4); my $off = unpack("L", $bin_off); seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` Small seek/reads are fast on modern SSDs. Keeps memory usage low. NOTE: mmap. Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). 500 files: - Index size: 204.94 KB - Indexing time: 0.1475 s - Peak RAM (SA): 8828 KB - Peak RAM (Regex): 9136 KB - Search (SA): 0.0012 s - Search (Regex): 0.0407 s 1,000 files: - Index size: 410.51 KB - Indexing time: 0.3101 s - Peak RAM (SA): 8980 KB - Peak RAM (Regex): 9460 KB - Search (SA): 0.0019 s - Search (Regex): 0.0795 s 10,000 files: - Index size: 4163.44 KB - Indexing time: 10.9661 s - Peak RAM (SA): 12504 KB - Peak RAM (Regex): 12804 KB - Search (SA): 0.0161 s - Search (Regex): 0.9120 s Security: Derived from architectural simplicity. Zero dependencies to manage. No secrets to hide. No targets for lateral movement. Runs in chroot. Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting concurrent searches via lock-file semaphores. Query length (64B) and result set (20) capped. All output is HTML-escaped to prevent XSS. O(N) drag gone. Architecture durable; secure by default. Commit: 6da102d | Benchmarks: 8a4da68