summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md120
1 files changed, 93 insertions, 27 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index 6af9f63..d178d07 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,25 +1,49 @@
---
-title: Perl + FastCGI + SA search engine
-date: 2026-01-02
+title: Search engine (Perl + FastCGI + SA)
+date: 2026-01-03
layout: post
---
-Number of articles growing. Need search.
+Article count on the website is growing. Need a way to search.
-Requirements: substring match, case-insensitive, fast, secure. No JavaScript.
+Requirements: matches substrings, case-insensitive, fast, secure. No
+JavaScript.
-Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.
+Architecture: browser → httpd → slowcgi → Perl CGI script.
-Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
-(sorted byte offsets), file_map.dat (metadata).
+httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
+governed by file system permissions--no secrets to manage. Secure by default.
-Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
-byte sentinel for document boundaries. Sort lexicographically::
+2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
+using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
+enough. Start to feel the O(N) pull at higher file counts.
+
+Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
+be mitigated. Tempted to stop here.
+
+2026-01-03: Suffix array (SA) based index lookup.
+
+Inefficiency of scanning every file on each request troubles me. Regex search
+depends almost entirely on hardware for speed.
+
+Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
+byte offsets), file_map.dat (metadata). Index created with the site:
+
+```
+$ JEKYLL_ENV=production bundle exec jekyll build
+$ cd cgi-bin/
+$ perl indexer.pl
+```
+
+Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
+saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
+array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
+lexicographical order:
```
-# Use a block that forces byte-level comparison
+my @sa = 0 .. (length($corpus) - 1);
{
- use bytes;
+ use bytes; # Force compare 8-bit Unicode value comparisons
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
@@ -27,31 +51,73 @@ byte sentinel for document boundaries. Sort lexicographically::
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
+
+CORPUS: a s c i \0 i m x
+OFFSET: 0 1 2 3 4 5 6 7
+
+SORTED SUFFIXES:
+
+[4] \0imx
+[0] asci\0imx
+[2] ci\0imx
+[3] i\0imx <-- "i" from "asci"
+[5] imx <-- "i" from "imx"
+[6] mx
+[1] sci\0imx
+[7] x
+```
+
+Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
+cache line), reducing complexity to O(N log N).
+
+Search: Textbook range query with two binary searches. Random access to offsets
+and the text is possible via the fixed-width offsets:
+
+```
+seek($fh_sa, $mid * 4, 0);
+read($fh_sa, my $bin_off, 4);
+my $off = unpack("L", $bin_off);
+seek($fh_cp, $off, 0);
+read($fh_cp, my $text, $query_len);
```
-Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte
-length targets cache lines.
+Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
+to implement. Note to self: mmap.
-Search: binary search for range query. Cap at 20 results--define limits or be
-surprised by them.
+Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
-File IO and memory: many seek/read small chunks beat one large allocation (see
-benchmarks for find_one_file.cgi).
+500 files:
+ - Index size: 204.94 KB
+ - Indexing time: 0.1475 s
+ - Peak RAM (SA): 8828 KB
+ - Peak RAM (Regex): 9136 KB
+ - Search (SA): 0.0012 s
+ - Search (Regex): 0.0407 s
-Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):
+1,000 files:
+ - Index size: 410.51 KB
+ - Indexing time: 0.3101 s
+ - Peak RAM (SA): 8980 KB
+ - Peak RAM (Regex): 9460 KB
+ - Search (SA): 0.0019 s
+ - Search (Regex): 0.0795 s
-1,000 files: 0.31s indexing, 410 KB total index.
-10,000 files: 10.97s indexing, 4.16 MB total index.
+10,000 files:
+ - Index size: 4163.44 KB
+ - Indexing time: 10.9661 s
+ - Peak RAM (SA): 12504 KB
+ - Peak RAM (Regex): 12804 KB
+ - Search (SA): 0.0161 s
+ - Search (Regex): 0.9120 s
-Search 'arduino' (0 matches):
-1,000 files: 0.002s (SA) vs 0.080s (naive regex).
-10,000 files: 0.016s (SA) vs 0.912s (naive regex).
+Security: Much of it comes from its architectural simplicity. No dependencies
+to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.
-Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
-Sanitize input--strip non-printables, limit length, and quote metacharacters
-(ReDOS). No exec/system (command injection). Chroot.
+Resource exhaustion and XSS attacks are inherent. The former is mitigated by
+limiting concurrent searches via lock-file semaphores and capping query length
+(64B) and result sets (20). All output is HTML-escaped to prevent XSS.
-Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.
+Verdict: Fast SA lookup; Works on every conceivable web browser.
Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)