diff options
| author | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-08 22:28:37 +0800 |
|---|---|---|
| committer | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-08 22:28:37 +0800 |
| commit | e836c4b9e78cc3892cdebf8126cb650f1b91ed37 (patch) | |
| tree | 4e52155aef0105cc9b888a42b3e760455a2bcb36 /_site/log/site-search/index.html | |
| parent | 57ff09d2eefefa2462a2af0175e3e8164c7bc828 (diff) | |
| download | www-e836c4b9e78cc3892cdebf8126cb650f1b91ed37.tar.gz | |
Tighten prose.
Diffstat (limited to '_site/log/site-search/index.html')
| -rw-r--r-- | _site/log/site-search/index.html | 65 |
1 files changed, 34 insertions, 31 deletions
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html index 9170002..e5b9f15 100644 --- a/_site/log/site-search/index.html +++ b/_site/log/site-search/index.html @@ -3,7 +3,7 @@ <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> - <title>Search engine (Perl + FastCGI + SA)</title> + <title>Site search with suffix array</title> <link rel="stylesheet" href="/assets/css/main.css"> <link rel="stylesheet" href="/assets/css/skeleton.css"> </head> @@ -37,43 +37,43 @@ <main> <div class="container"> <div class="container-2"> - <h2 class="center" id="title">SEARCH ENGINE (PERL + FASTCGI + SA)</h2> + <h2 class="center" id="title">SITE SEARCH WITH SUFFIX ARRAY</h2> <h5 class="center">03 JANUARY 2026</h5> <br> - <div class="twocol justify"><p>Article count on the website is growing. Need a way to search.</p> + <div class="twocol justify"><p>Article count is growing. Need a way to search.</p> <p>Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript.</p> <p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p> -<p>httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access -governed by file system permissions–no secrets to manage. Secure by default.</p> +<p>Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file +system permissions govern access.</p> -<p>2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files -using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast -enough. Start to feel the O(N) pull at higher file counts.</p> +<p>2025-12-30: Regex search.</p> -<p>Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can -be mitigated. Tempted to stop here.</p> +<p>140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at +higher file counts.</p> -<p>2026-01-03: Suffix array (SA) based index lookup.</p> +<p>Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to +stop here.</p> -<p>Inefficiency of scanning every file on each request troubles me. Regex search -depends almost entirely on hardware for speed.</p> +<p>2026-01-03: Suffix Array (SA) based index lookup.</p> -<p>Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted -byte offsets), file_map.dat (metadata). Index created with the site:</p> +<p>Slurping files on every request bothers me. Regex search depends almost +entirely on hardware for speed.</p> + +<p>Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index +built with site:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build $ cd cgi-bin/ $ perl indexer.pl </code></pre></div></div> -<p>Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and -saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix -array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the -lexicographical order:</p> +<p>Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null +byte sentinel for document boundaries. sa.bin stores suffix offsets as +32-bit unsigned integers, sorted by lexicographical order:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1); { @@ -101,11 +101,11 @@ SORTED SUFFIXES: [7] x </code></pre></div></div> -<p>Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a -cache line), reducing complexity to O(N log N).</p> +<p>Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache +line), reducing complexity to O(N log N).</p> -<p>Search: Textbook range query with two binary searches. Random access to offsets -and the text is possible via the fixed-width offsets:</p> +<p>Search: Textbook range query with twin binary searches. Random access to +suffixes and the text made possible by the fixed-width offsets:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0); read($fh_sa, my $bin_off, 4); @@ -114,8 +114,9 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); </code></pre></div></div> -<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy -to implement. Note to self: mmap.</p> +<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low.</p> + +<p>NOTE: mmap.</p> <p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p> @@ -149,14 +150,16 @@ to implement. Note to self: mmap.</p> <li>Search (Regex): 0.9120 s</li> </ul> -<p>Security: Much of it comes from its architectural simplicity. No dependencies -to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.</p> +<p>Security: Derived from architectural simplicity. Zero dependencies to manage, +no secrets to hide, no targets for lateral movement.</p> + +<p>Runs in chroot.</p> -<p>Resource exhaustion and XSS attacks are inherent. The former is mitigated by -limiting concurrent searches via lock-file semaphores and capping query length -(64B) and result sets (20). All output is HTML-escaped to prevent XSS.</p> +<p>Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting +concurrent searches via lock-file semaphores. Query length (64B) and result set +(20) capped. All output is HTML-escaped to prevent XSS.</p> -<p>Verdict: Fast SA lookup; Works on every conceivable web browser.</p> +<p>Secure by default. Fast. Durable.</p> <p>Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a> |
