summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
authorSadeep Madurange <sadeep@asciimx.com>2026-01-15 22:44:09 +0800
committerSadeep Madurange <sadeep@asciimx.com>2026-01-15 22:44:09 +0800
commit891a38b49615833d8dfd87a56b72f5e2dbde1d48 (patch)
treeba94c06b1018445d6625d2467091c166ac8f4d22 /_log/site-search.md
parent429a7e85c29e2776f3a4293517513a5d523f6388 (diff)
downloadwww-891a38b49615833d8dfd87a56b72f5e2dbde1d48.tar.gz
Prune site search.
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md62
1 files changed, 16 insertions, 46 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index 58146ad..0ff97fc 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -4,30 +4,14 @@ date: 2026-01-03
layout: post
---
-Article count is growing. Need search.
+Needed search for site.
Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.
Architecture: browser → httpd → slowcgi → Perl CGI script.
-Perl, httpd, slowcgi are in the base system. Search is anonymous; file system
-permissions govern access.
-
-2025-12-30: Regex search.
-
-Threw together a 140-line Perl script; searches 500 files in 40ms. Fast enough.
-O(N) drag felt at higher file counts.
-
-Crawling files introduces ReDoS and symlink attack vectors. Both can be
-mitigated. Tempted to stop here.
-
-2026-01-03: Suffix array (SA) search.
-
-Regex search depends almost entirely on hardware for speed. Slurping files on
-every request bothers me.
-
-Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
+SA index implemented with three files: corpus.bin, sa.bin, file_map.dat. Index
built with site:
```
@@ -36,9 +20,9 @@ $ cd cgi-bin/
$ perl indexer.pl
```
-Indexer: Extracts HTML, lowercases, and encodes into UTF-8 binary sequences.
-Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as
-32-bit unsigned integers, sorted by lexicographical order:
+Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte
+sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit
+unsigned integers, sorted lexicographically:
```
my @sa = 0 .. (length($corpus) - 1);
@@ -51,27 +35,13 @@ my @sa = 0 .. (length($corpus) - 1);
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
-
-CORPUS: a s c i \0 i m x
-OFFSET: 0 1 2 3 4 5 6 7
-
-SORTED SUFFIXES:
-
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx <-- "i" from "asci"
-[5] imx <-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
```
Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64
bytes (length of a cache line) → O(N log N).
-Search: Textbook range query with twin binary searches. Random access to
-suffixes and the text made possible by the fixed-width offsets:
+Search: Textbook range query with twin binary searches. Uses fixed-width
+offsets for random access:
```
seek($fh_sa, $mid * 4, 0);
@@ -85,7 +55,8 @@ Small seek/reads are fast on modern SSDs. Keeps memory usage low.
NOTE: mmap.
-Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
+Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear
+regex search:
500 files:
- Index size: 204.94 KB
@@ -111,16 +82,15 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).
- Search (SA): 0.0161 s
- Search (Regex): 0.9120 s
-Security: Derived from architectural simplicity. Zero dependencies to manage.
-No secrets to hide. No targets for lateral movement.
-
-Runs in chroot.
+Security: httpd, slowcgi, Perl in base system--no dependencies. File system
+permissions govern access. Runs in chroot.
-Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
-concurrent searches via lock-file semaphores. Query length (64B) and result set
-(20) capped. All output is HTML-escaped to prevent XSS.
+Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
+concurrent searches. Query length (64B) and result set (20) capped. All output
+is HTML-escaped to prevent XSS.
-O(N) drag gone. Architecture durable; secure by default.
+Warranty: 10,000 / 12 → 833 years. Next release: inverted index, year of our
+Lord 2859.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"