summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md34
1 files changed, 17 insertions, 17 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index 4ccae3c..58146ad 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -11,21 +11,21 @@ JavaScript.
Architecture: browser → httpd → slowcgi → Perl CGI script.
-Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
-system permissions govern access.
+Perl, httpd, slowcgi are in the base system. Search is anonymous; file system
+permissions govern access.
-2025-12-30: Regex.
+2025-12-30: Regex search.
-140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
-higher file counts.
+Threw together a 140-line Perl script; searches 500 files in 40ms. Fast enough.
+O(N) drag felt at higher file counts.
-Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
-stop here.
+Crawling files introduces ReDoS and symlink attack vectors. Both can be
+mitigated. Tempted to stop here.
-2026-01-03: Suffix array (SA) based index lookup.
+2026-01-03: Suffix array (SA) search.
-Slurping files on every request bothers me. Regex search depends almost
-entirely on hardware for speed.
+Regex search depends almost entirely on hardware for speed. Slurping files on
+every request bothers me.
Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
built with site:
@@ -36,8 +36,8 @@ $ cd cgi-bin/
$ perl indexer.pl
```
-Indexer extracts HTML, lowercases, and encodes into UTF-8 binary sequences.
-Null byte sentinel for document boundaries. sa.bin stores suffix offsets as
+Indexer: Extracts HTML, lowercases, and encodes into UTF-8 binary sequences.
+Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as
32-bit unsigned integers, sorted by lexicographical order:
```
@@ -67,8 +67,8 @@ SORTED SUFFIXES:
[7] x
```
-Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
-line), reducing complexity to O(N log N).
+Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64
+bytes (length of a cache line) → O(N log N).
Search: Textbook range query with twin binary searches. Random access to
suffixes and the text made possible by the fixed-width offsets:
@@ -111,8 +111,8 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
- Search (SA): 0.0161 s
- Search (Regex): 0.9120 s
-Security: Derived from architectural simplicity. Zero dependencies to manage,
-no secrets to hide, no targets for lateral movement.
+Security: Derived from architectural simplicity. Zero dependencies to manage.
+No secrets to hide. No targets for lateral movement.
Runs in chroot.
@@ -120,7 +120,7 @@ Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
concurrent searches via lock-file semaphores. Query length (64B) and result set
(20) capped. All output is HTML-escaped to prevent XSS.
-Verdict: Fast. Durable. Secure by default.
+O(N) drag gone. Architecture durable; secure by default.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"