diff options
Diffstat (limited to '_site/log/site-search/index.html')
| -rw-r--r-- | _site/log/site-search/index.html | 184 |
1 files changed, 0 insertions, 184 deletions
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html deleted file mode 100644 index 77fe19e..0000000 --- a/_site/log/site-search/index.html +++ /dev/null @@ -1,184 +0,0 @@ -<!DOCTYPE html> -<html> - <head> - <meta charset="utf-8"> - <meta name="viewport" content="width=device-width, initial-scale=1"> - <title>Full-text search: SA lookup</title> - <link rel="stylesheet" href="/assets/css/main.css"> - <link rel="stylesheet" href="/assets/css/skeleton.css"> -</head> - - - <body> - - <div id="nav-container" class="container"> - <ul id="navlist" class="left"> - - <li > - <a href="/" class="link-decor-none">hme</a> - </li> - <li > - <a href="/projects/" class="link-decor-none">poc</a> - </li> - <li > - <a href="/about/" class="link-decor-none">abt</a> - </li> - <li> - <a href="/cgi-bin/find.cgi" class="link-decor-none">lup</a> - </li> - <li> - <a href="/feed.xml" class="link-decor-none">rss</a> - </li> - </ul> -</div> - - - - <main> - <div class="container"> - <div class="container-2"> - <h2 class="center" id="title">FULL-TEXT SEARCH: SA LOOKUP</h2> - <h5 class="center">03 JANUARY 2026</h5> - <br> - <div class="twocol justify"><p>Article count is growing. Need a way to search.</p> - -<p>Requirements: matches substrings, case-insensitive, fast, secure. No -JavaScript.</p> - -<p>Architecture: browser ā httpd ā slowcgi ā Perl CGI script.</p> - -<p>Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file -system permissions govern access.</p> - -<p>2025-12-30: Regex search.</p> - -<p>140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at -higher file counts.</p> - -<p>Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to -stop here.</p> - -<p>2026-01-03: Suffix Array (SA) based index lookup.</p> - -<p>Slurping files on every request bothers me. Regex search depends almost -entirely on hardware for speed.</p> - -<p>Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index -built with site:</p> - -<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build -$ cd cgi-bin/ -$ perl indexer.pl -</code></pre></div></div> - -<p>Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null -byte sentinel for document boundaries. sa.bin stores suffix offsets as -32-bit unsigned integers, sorted by lexicographical order:</p> - -<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1); -{ - use bytes; # Force compare 8-bit Unicode value comparisons - @sa = sort { - # First 64 bytes check (fast path) - (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || - # Full string fallback (required for correctness) - (substr($corpus, $a) cmp substr($corpus, $b)) - } @sa; -} - -CORPUS: a s c i \0 i m x -OFFSET: 0 1 2 3 4 5 6 7 - -SORTED SUFFIXES: - -[4] \0imx -[0] asci\0imx -[2] ci\0imx -[3] i\0imx <-- "i" from "asci" -[5] imx <-- "i" from "imx" -[6] mx -[1] sci\0imx -[7] x -</code></pre></div></div> - -<p>Time complexity: O(Lā
N log N). Fast path caps L at 64 bytes (length of cache -line), reducing complexity to O(N log N).</p> - -<p>Search: Textbook range query with twin binary searches. Random access to -suffixes and the text made possible by the fixed-width offsets:</p> - -<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0); -read($fh_sa, my $bin_off, 4); -my $off = unpack("L", $bin_off); -seek($fh_cp, $off, 0); -read($fh_cp, my $text, $query_len); -</code></pre></div></div> - -<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low.</p> - -<p>NOTE: mmap.</p> - -<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p> - -<p>500 files:</p> -<ul> - <li>Index size: 204.94 KB</li> - <li>Indexing time: 0.1475 s</li> - <li>Peak RAM (SA): 8828 KB</li> - <li>Peak RAM (Regex): 9136 KB</li> - <li>Search (SA): 0.0012 s</li> - <li>Search (Regex): 0.0407 s</li> -</ul> - -<p>1,000 files:</p> -<ul> - <li>Index size: 410.51 KB</li> - <li>Indexing time: 0.3101 s</li> - <li>Peak RAM (SA): 8980 KB</li> - <li>Peak RAM (Regex): 9460 KB</li> - <li>Search (SA): 0.0019 s</li> - <li>Search (Regex): 0.0795 s</li> -</ul> - -<p>10,000 files:</p> -<ul> - <li>Index size: 4163.44 KB</li> - <li>Indexing time: 10.9661 s</li> - <li>Peak RAM (SA): 12504 KB</li> - <li>Peak RAM (Regex): 12804 KB</li> - <li>Search (SA): 0.0161 s</li> - <li>Search (Regex): 0.9120 s</li> -</ul> - -<p>Security: Derived from architectural simplicity. Zero dependencies to manage, -no secrets to hide, no targets for lateral movement.</p> - -<p>Runs in chroot.</p> - -<p>Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting -concurrent searches via lock-file semaphores. Query length (64B) and result set -(20) capped. All output is HTML-escaped to prevent XSS.</p> - -<p>Secure by default. Fast. Durable.</p> - -<p>Commit: -<a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a> -| Benchmarks: -<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p> -</div> - <p class="post-author right">by W. D. Sadeep Madurange</p> - </div> - </div> - </main> - - <div class="footer"> - <div class="container"> - <div class="twelve columns right container-2"> - <p id="footer-text">© ASCIIMX - 2026</p> - </div> - </div> -</div> - - - </body> -</html> |
