diff options
| author | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-15 22:44:09 +0800 |
|---|---|---|
| committer | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-15 22:44:09 +0800 |
| commit | 891a38b49615833d8dfd87a56b72f5e2dbde1d48 (patch) | |
| tree | ba94c06b1018445d6625d2467091c166ac8f4d22 /_log | |
| parent | 429a7e85c29e2776f3a4293517513a5d523f6388 (diff) | |
| download | www-891a38b49615833d8dfd87a56b72f5e2dbde1d48.tar.gz | |
Prune site search.
Diffstat (limited to '_log')
| -rw-r--r-- | _log/site-search.md | 62 |
1 files changed, 16 insertions, 46 deletions
diff --git a/_log/site-search.md b/_log/site-search.md index 58146ad..0ff97fc 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -4,30 +4,14 @@ date: 2026-01-03 layout: post --- -Article count is growing. Need search. +Needed search for site. Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript. Architecture: browser → httpd → slowcgi → Perl CGI script. -Perl, httpd, slowcgi are in the base system. Search is anonymous; file system -permissions govern access. - -2025-12-30: Regex search. - -Threw together a 140-line Perl script; searches 500 files in 40ms. Fast enough. -O(N) drag felt at higher file counts. - -Crawling files introduces ReDoS and symlink attack vectors. Both can be -mitigated. Tempted to stop here. - -2026-01-03: Suffix array (SA) search. - -Regex search depends almost entirely on hardware for speed. Slurping files on -every request bothers me. - -Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index +SA index implemented with three files: corpus.bin, sa.bin, file_map.dat. Index built with site: ``` @@ -36,9 +20,9 @@ $ cd cgi-bin/ $ perl indexer.pl ``` -Indexer: Extracts HTML, lowercases, and encodes into UTF-8 binary sequences. -Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as -32-bit unsigned integers, sorted by lexicographical order: +Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte +sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit +unsigned integers, sorted lexicographically: ``` my @sa = 0 .. (length($corpus) - 1); @@ -51,27 +35,13 @@ my @sa = 0 .. (length($corpus) - 1); (substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } - -CORPUS: a s c i \0 i m x -OFFSET: 0 1 2 3 4 5 6 7 - -SORTED SUFFIXES: - -[4] \0imx -[0] asci\0imx -[2] ci\0imx -[3] i\0imx <-- "i" from "asci" -[5] imx <-- "i" from "imx" -[6] mx -[1] sci\0imx -[7] x ``` Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a cache line) → O(N log N). -Search: Textbook range query with twin binary searches. Random access to -suffixes and the text made possible by the fixed-width offsets: +Search: Textbook range query with twin binary searches. Uses fixed-width +offsets for random access: ``` seek($fh_sa, $mid * 4, 0); @@ -85,7 +55,8 @@ Small seek/reads are fast on modern SSDs. Keeps memory usage low. NOTE: mmap. -Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). +Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear +regex search: 500 files: - Index size: 204.94 KB @@ -111,16 +82,15 @@ Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB). - Search (SA): 0.0161 s - Search (Regex): 0.9120 s -Security: Derived from architectural simplicity. Zero dependencies to manage. -No secrets to hide. No targets for lateral movement. - -Runs in chroot. +Security: httpd, slowcgi, Perl in base system--no dependencies. File system +permissions govern access. Runs in chroot. -Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting -concurrent searches via lock-file semaphores. Query length (64B) and result set -(20) capped. All output is HTML-escaped to prevent XSS. +Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit +concurrent searches. Query length (64B) and result set (20) capped. All output +is HTML-escaped to prevent XSS. -O(N) drag gone. Architecture durable; secure by default. +Warranty: 10,000 / 12 → 833 years. Next release: inverted index, year of our +Lord 2859. Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e" |
