summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
authorSadeep Madurange <sadeep@asciimx.com>2026-01-08 22:28:37 +0800
committerSadeep Madurange <sadeep@asciimx.com>2026-01-08 22:28:37 +0800
commite836c4b9e78cc3892cdebf8126cb650f1b91ed37 (patch)
tree4e52155aef0105cc9b888a42b3e760455a2bcb36 /_log/site-search.md
parent57ff09d2eefefa2462a2af0175e3e8164c7bc828 (diff)
downloadwww-e836c4b9e78cc3892cdebf8126cb650f1b91ed37.tar.gz
Tighten prose.
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md63
1 files changed, 33 insertions, 30 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index d178d07..0cfbf99 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,33 +1,34 @@
---
-title: Search engine (Perl + FastCGI + SA)
+title: Site search with suffix array
date: 2026-01-03
layout: post
---
-Article count on the website is growing. Need a way to search.
+Article count is growing. Need a way to search.
Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.
Architecture: browser → httpd → slowcgi → Perl CGI script.
-httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
-governed by file system permissions--no secrets to manage. Secure by default.
+Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
+system permissions govern access.
-2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
-using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
-enough. Start to feel the O(N) pull at higher file counts.
+2025-12-30: Regex search.
-Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
-be mitigated. Tempted to stop here.
+140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
+higher file counts.
-2026-01-03: Suffix array (SA) based index lookup.
+Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
+stop here.
-Inefficiency of scanning every file on each request troubles me. Regex search
-depends almost entirely on hardware for speed.
+2026-01-03: Suffix Array (SA) based index lookup.
-Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
-byte offsets), file_map.dat (metadata). Index created with the site:
+Slurping files on every request bothers me. Regex search depends almost
+entirely on hardware for speed.
+
+Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
+built with site:
```
$ JEKYLL_ENV=production bundle exec jekyll build
@@ -35,10 +36,9 @@ $ cd cgi-bin/
$ perl indexer.pl
```
-Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
-saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
-array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
-lexicographical order:
+Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
+byte sentinel for document boundaries. sa.bin stores suffix offsets as
+32-bit unsigned integers, sorted by lexicographical order:
```
my @sa = 0 .. (length($corpus) - 1);
@@ -67,11 +67,11 @@ SORTED SUFFIXES:
[7] x
```
-Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
-cache line), reducing complexity to O(N log N).
+Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
+line), reducing complexity to O(N log N).
-Search: Textbook range query with two binary searches. Random access to offsets
-and the text is possible via the fixed-width offsets:
+Search: Textbook range query with twin binary searches. Random access to
+suffixes and the text made possible by the fixed-width offsets:
```
seek($fh_sa, $mid * 4, 0);
@@ -81,8 +81,9 @@ seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```
-Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
-to implement. Note to self: mmap.
+Small seek/reads are fast on modern SSDs. Keeps memory usage low.
+
+NOTE: mmap.
Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
@@ -110,14 +111,16 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
- Search (SA): 0.0161 s
- Search (Regex): 0.9120 s
-Security: Much of it comes from its architectural simplicity. No dependencies
-to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.
+Security: Derived from architectural simplicity. Zero dependencies to manage,
+no secrets to hide, no targets for lateral movement.
+
+Runs in chroot.
-Resource exhaustion and XSS attacks are inherent. The former is mitigated by
-limiting concurrent searches via lock-file semaphores and capping query length
-(64B) and result sets (20). All output is HTML-escaped to prevent XSS.
+Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
+concurrent searches via lock-file semaphores. Query length (64B) and result set
+(20) capped. All output is HTML-escaped to prevent XSS.
-Verdict: Fast SA lookup; Works on every conceivable web browser.
+Secure by default. Fast. Durable.
Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)