diff options
Diffstat (limited to '_log/site-search.md')
| -rw-r--r-- | _log/site-search.md | 34 |
1 files changed, 17 insertions, 17 deletions
diff --git a/_log/site-search.md b/_log/site-search.md index 4ccae3c..58146ad 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -11,21 +11,21 @@ JavaScript. Architecture: browser → httpd → slowcgi → Perl CGI script. -Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file -system permissions govern access. +Perl, httpd, slowcgi are in the base system. Search is anonymous; file system +permissions govern access. -2025-12-30: Regex. +2025-12-30: Regex search. -140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at -higher file counts. +Threw together a 140-line Perl script; searches 500 files in 40ms. Fast enough. +O(N) drag felt at higher file counts. -Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to -stop here. +Crawling files introduces ReDoS and symlink attack vectors. Both can be +mitigated. Tempted to stop here. -2026-01-03: Suffix array (SA) based index lookup. +2026-01-03: Suffix array (SA) search. -Slurping files on every request bothers me. Regex search depends almost -entirely on hardware for speed. +Regex search depends almost entirely on hardware for speed. Slurping files on +every request bothers me. Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index built with site: @@ -36,8 +36,8 @@ $ cd cgi-bin/ $ perl indexer.pl ``` -Indexer extracts HTML, lowercases, and encodes into UTF-8 binary sequences. -Null byte sentinel for document boundaries. sa.bin stores suffix offsets as +Indexer: Extracts HTML, lowercases, and encodes into UTF-8 binary sequences. +Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned integers, sorted by lexicographical order: ``` @@ -67,8 +67,8 @@ SORTED SUFFIXES: [7] x ``` -Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache -line), reducing complexity to O(N log N). +Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64 +bytes (length of a cache line) → O(N log N). Search: Textbook range query with twin binary searches. Random access to suffixes and the text made possible by the fixed-width offsets: @@ -111,8 +111,8 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). - Search (SA): 0.0161 s - Search (Regex): 0.9120 s -Security: Derived from architectural simplicity. Zero dependencies to manage, -no secrets to hide, no targets for lateral movement. +Security: Derived from architectural simplicity. Zero dependencies to manage. +No secrets to hide. No targets for lateral movement. Runs in chroot. @@ -120,7 +120,7 @@ Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting concurrent searches via lock-file semaphores. Query length (64B) and result set (20) capped. All output is HTML-escaped to prevent XSS. -Verdict: Fast. Durable. Secure by default. +O(N) drag gone. Architecture durable; secure by default. Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e" |
