summaryrefslogtreecommitdiffstats
path: root/_site/log/site-search/index.html
diff options
context:
space:
mode:
authorSadeep Madurange <sadeep@asciimx.com>2026-01-09 16:45:56 +0800
committerSadeep Madurange <sadeep@asciimx.com>2026-01-09 16:45:56 +0800
commita851a2d646f439f7126c232ba1524c55a8990872 (patch)
tree0d8870fae739b72e37fc3ce1a672388ce7592a9c /_site/log/site-search/index.html
parent3b0f6bd6879d8a32d89dcccc739a73e0e9e823a7 (diff)
downloadwww-a851a2d646f439f7126c232ba1524c55a8990872.tar.gz
Remove _site from git.
Diffstat (limited to '_site/log/site-search/index.html')
-rw-r--r--_site/log/site-search/index.html184
1 files changed, 0 insertions, 184 deletions
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html
deleted file mode 100644
index 77fe19e..0000000
--- a/_site/log/site-search/index.html
+++ /dev/null
@@ -1,184 +0,0 @@
-<!DOCTYPE html>
-<html>
- <head>
- <meta charset="utf-8">
- <meta name="viewport" content="width=device-width, initial-scale=1">
- <title>Full-text search: SA lookup</title>
- <link rel="stylesheet" href="/assets/css/main.css">
- <link rel="stylesheet" href="/assets/css/skeleton.css">
-</head>
-
-
- <body>
-
- <div id="nav-container" class="container">
- <ul id="navlist" class="left">
-
- <li >
- <a href="/" class="link-decor-none">hme</a>
- </li>
- <li >
- <a href="/projects/" class="link-decor-none">poc</a>
- </li>
- <li >
- <a href="/about/" class="link-decor-none">abt</a>
- </li>
- <li>
- <a href="/cgi-bin/find.cgi" class="link-decor-none">lup</a>
- </li>
- <li>
- <a href="/feed.xml" class="link-decor-none">rss</a>
- </li>
- </ul>
-</div>
-
-
-
- <main>
- <div class="container">
- <div class="container-2">
- <h2 class="center" id="title">FULL-TEXT SEARCH: SA LOOKUP</h2>
- <h5 class="center">03 JANUARY 2026</h5>
- <br>
- <div class="twocol justify"><p>Article count is growing. Need a way to search.</p>
-
-<p>Requirements: matches substrings, case-insensitive, fast, secure. No
-JavaScript.</p>
-
-<p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p>
-
-<p>Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
-system permissions govern access.</p>
-
-<p>2025-12-30: Regex search.</p>
-
-<p>140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
-higher file counts.</p>
-
-<p>Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
-stop here.</p>
-
-<p>2026-01-03: Suffix Array (SA) based index lookup.</p>
-
-<p>Slurping files on every request bothers me. Regex search depends almost
-entirely on hardware for speed.</p>
-
-<p>Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
-built with site:</p>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build
-$ cd cgi-bin/
-$ perl indexer.pl
-</code></pre></div></div>
-
-<p>Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
-byte sentinel for document boundaries. sa.bin stores suffix offsets as
-32-bit unsigned integers, sorted by lexicographical order:</p>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1);
-{
- use bytes; # Force compare 8-bit Unicode value comparisons
- @sa = sort {
- # First 64 bytes check (fast path)
- (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
- # Full string fallback (required for correctness)
- (substr($corpus, $a) cmp substr($corpus, $b))
- } @sa;
-}
-
-CORPUS: a s c i \0 i m x
-OFFSET: 0 1 2 3 4 5 6 7
-
-SORTED SUFFIXES:
-
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx &lt;-- "i" from "asci"
-[5] imx &lt;-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
-</code></pre></div></div>
-
-<p>Time complexity: O(Lā‹…N log N). Fast path caps L at 64 bytes (length of cache
-line), reducing complexity to O(N log N).</p>
-
-<p>Search: Textbook range query with twin binary searches. Random access to
-suffixes and the text made possible by the fixed-width offsets:</p>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0);
-read($fh_sa, my $bin_off, 4);
-my $off = unpack("L", $bin_off);
-seek($fh_cp, $off, 0);
-read($fh_cp, my $text, $query_len);
-</code></pre></div></div>
-
-<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low.</p>
-
-<p>NOTE: mmap.</p>
-
-<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p>
-
-<p>500 files:</p>
-<ul>
- <li>Index size: 204.94 KB</li>
- <li>Indexing time: 0.1475 s</li>
- <li>Peak RAM (SA): 8828 KB</li>
- <li>Peak RAM (Regex): 9136 KB</li>
- <li>Search (SA): 0.0012 s</li>
- <li>Search (Regex): 0.0407 s</li>
-</ul>
-
-<p>1,000 files:</p>
-<ul>
- <li>Index size: 410.51 KB</li>
- <li>Indexing time: 0.3101 s</li>
- <li>Peak RAM (SA): 8980 KB</li>
- <li>Peak RAM (Regex): 9460 KB</li>
- <li>Search (SA): 0.0019 s</li>
- <li>Search (Regex): 0.0795 s</li>
-</ul>
-
-<p>10,000 files:</p>
-<ul>
- <li>Index size: 4163.44 KB</li>
- <li>Indexing time: 10.9661 s</li>
- <li>Peak RAM (SA): 12504 KB</li>
- <li>Peak RAM (Regex): 12804 KB</li>
- <li>Search (SA): 0.0161 s</li>
- <li>Search (Regex): 0.9120 s</li>
-</ul>
-
-<p>Security: Derived from architectural simplicity. Zero dependencies to manage,
-no secrets to hide, no targets for lateral movement.</p>
-
-<p>Runs in chroot.</p>
-
-<p>Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
-concurrent searches via lock-file semaphores. Query length (64B) and result set
-(20) capped. All output is HTML-escaped to prevent XSS.</p>
-
-<p>Secure by default. Fast. Durable.</p>
-
-<p>Commit:
-<a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
-| Benchmarks:
-<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p>
-</div>
- <p class="post-author right">by W. D. Sadeep Madurange</p>
- </div>
- </div>
- </main>
-
- <div class="footer">
- <div class="container">
- <div class="twelve columns right container-2">
- <p id="footer-text">&copy; ASCIIMX - 2026</p>
- </div>
- </div>
-</div>
-
-
- </body>
-</html>