--- title: Built a search engine for website based on suffix arrays date: 2026-01-03 layout: post --- Built search. Requirements: match substrings, ignore case, handle 50 concurrent queries with 1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript, text browsers. Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine. Server-side regex is viable for a personal site. But an index has clear advantages. Not that much harder to implement. Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin, file_map.dat—built with site: ``` $ JEKYLL_ENV=production bundle exec jekyll build $ cd cgi-bin/ $ perl indexer.pl ``` Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned integers, sorted lexicographically: ``` my @sa = 0 .. (length($corpus) - 1); { use bytes; # Force compare 8-bit Unicode value comparisons @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || # Full string fallback (required for correctness) (substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } ``` 32-bit offsets limit index size to 4GB—more than sufficient and, if necessary, easily expanded. Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes (length of a typical cache line). Search: Textbook range query with two binary searches. Fixed-width offsets enable random access to index: ``` seek($fh_sa, $mid * 4, 0); read($fh_sa, my $bin_off, 4); my $off = unpack("L", $bin_off); seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` Small seek/reads are fast on modern SSDs; keeps memory footprint small. Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search:
=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
ARTICLE SIZE: 16 KB
=============================================================

500 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0012s  |             0.0407s
Peak RAM        |              8828 KB |              9136 KB
Indexing time   |             0.1475s  |                 N/A
Index size      |            204.94 KB |                 N/A
-------------------------------------------------------------

1,000 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0019s  |             0.0795s
Peak RAM        |              8980 KB |              9460 KB
Indexing time   |             0.3101s  |                 N/A
Index size      |            410.51 KB |                 N/A
-------------------------------------------------------------

10,000 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0161s  |             0.9120s
Peak RAM        |             12504 KB |             12804 KB
Indexing time   |            10.9661s  |                 N/A
Index size      |           4163.44 KB |                 N/A
-------------------------------------------------------------
Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was occasionally faster (~200 µs), but used more memory—possibly OpenBSD's VM security trade-offs. Results may vary by OS. Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions govern access. Runs in chroot. Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit concurrent searches. Query length (64B) and result set (20) are capped. All output is HTML-escaped to prevent XSS. Warranty: 10,000 / 12 → 833 years.
Next release: inverted index—Anno Domini 2859. Commit: 6da102d | Benchmarks: 8a4da68