From e836c4b9e78cc3892cdebf8126cb650f1b91ed37 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Thu, 8 Jan 2026 22:28:37 +0800 Subject: Tighten prose. --- _log/site-search.md | 63 ++++++++++++++++++++++++++++------------------------- 1 file changed, 33 insertions(+), 30 deletions(-) (limited to '_log/site-search.md') diff --git a/_log/site-search.md b/_log/site-search.md index d178d07..0cfbf99 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,33 +1,34 @@ --- -title: Search engine (Perl + FastCGI + SA) +title: Site search with suffix array date: 2026-01-03 layout: post --- -Article count on the website is growing. Need a way to search. +Article count is growing. Need a way to search. Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript. Architecture: browser → httpd → slowcgi → Perl CGI script. -httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access -governed by file system permissions--no secrets to manage. Secure by default. +Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file +system permissions govern access. -2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files -using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast -enough. Start to feel the O(N) pull at higher file counts. +2025-12-30: Regex search. -Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can -be mitigated. Tempted to stop here. +140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at +higher file counts. -2026-01-03: Suffix array (SA) based index lookup. +Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to +stop here. -Inefficiency of scanning every file on each request troubles me. Regex search -depends almost entirely on hardware for speed. +2026-01-03: Suffix Array (SA) based index lookup. -Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted -byte offsets), file_map.dat (metadata). Index created with the site: +Slurping files on every request bothers me. Regex search depends almost +entirely on hardware for speed. + +Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index +built with site: ``` $ JEKYLL_ENV=production bundle exec jekyll build @@ -35,10 +36,9 @@ $ cd cgi-bin/ $ perl indexer.pl ``` -Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and -saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix -array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the -lexicographical order: +Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null +byte sentinel for document boundaries. sa.bin stores suffix offsets as +32-bit unsigned integers, sorted by lexicographical order: ``` my @sa = 0 .. (length($corpus) - 1); @@ -67,11 +67,11 @@ SORTED SUFFIXES: [7] x ``` -Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a -cache line), reducing complexity to O(N log N). +Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache +line), reducing complexity to O(N log N). -Search: Textbook range query with two binary searches. Random access to offsets -and the text is possible via the fixed-width offsets: +Search: Textbook range query with twin binary searches. Random access to +suffixes and the text made possible by the fixed-width offsets: ``` seek($fh_sa, $mid * 4, 0); @@ -81,8 +81,9 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` -Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy -to implement. Note to self: mmap. +Small seek/reads are fast on modern SSDs. Keeps memory usage low. + +NOTE: mmap. Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). @@ -110,14 +111,16 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). - Search (SA): 0.0161 s - Search (Regex): 0.9120 s -Security: Much of it comes from its architectural simplicity. No dependencies -to manage, no secrets to hide, no assets for lateral movement. Runs in chroot. +Security: Derived from architectural simplicity. Zero dependencies to manage, +no secrets to hide, no targets for lateral movement. + +Runs in chroot. -Resource exhaustion and XSS attacks are inherent. The former is mitigated by -limiting concurrent searches via lock-file semaphores and capping query length -(64B) and result sets (20). All output is HTML-escaped to prevent XSS. +Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting +concurrent searches via lock-file semaphores. Query length (64B) and result set +(20) capped. All output is HTML-escaped to prevent XSS. -Verdict: Fast SA lookup; Works on every conceivable web browser. +Secure by default. Fast. Durable. Commit: [6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e) -- cgit v1.2.3