summaryrefslogtreecommitdiffstats
path: root/_site/log/site-search/index.html
diff options
context:
space:
mode:
Diffstat (limited to '_site/log/site-search/index.html')
-rw-r--r--_site/log/site-search/index.html65
1 files changed, 34 insertions, 31 deletions
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html
index 9170002..e5b9f15 100644
--- a/_site/log/site-search/index.html
+++ b/_site/log/site-search/index.html
@@ -3,7 +3,7 @@
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
- <title>Search engine (Perl + FastCGI + SA)</title>
+ <title>Site search with suffix array</title>
<link rel="stylesheet" href="/assets/css/main.css">
<link rel="stylesheet" href="/assets/css/skeleton.css">
</head>
@@ -37,43 +37,43 @@
<main>
<div class="container">
<div class="container-2">
- <h2 class="center" id="title">SEARCH ENGINE (PERL + FASTCGI + SA)</h2>
+ <h2 class="center" id="title">SITE SEARCH WITH SUFFIX ARRAY</h2>
<h5 class="center">03 JANUARY 2026</h5>
<br>
- <div class="twocol justify"><p>Article count on the website is growing. Need a way to search.</p>
+ <div class="twocol justify"><p>Article count is growing. Need a way to search.</p>
<p>Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.</p>
<p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p>
-<p>httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
-governed by file system permissions–no secrets to manage. Secure by default.</p>
+<p>Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
+system permissions govern access.</p>
-<p>2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
-using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
-enough. Start to feel the O(N) pull at higher file counts.</p>
+<p>2025-12-30: Regex search.</p>
-<p>Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
-be mitigated. Tempted to stop here.</p>
+<p>140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
+higher file counts.</p>
-<p>2026-01-03: Suffix array (SA) based index lookup.</p>
+<p>Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
+stop here.</p>
-<p>Inefficiency of scanning every file on each request troubles me. Regex search
-depends almost entirely on hardware for speed.</p>
+<p>2026-01-03: Suffix Array (SA) based index lookup.</p>
-<p>Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
-byte offsets), file_map.dat (metadata). Index created with the site:</p>
+<p>Slurping files on every request bothers me. Regex search depends almost
+entirely on hardware for speed.</p>
+
+<p>Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
+built with site:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
</code></pre></div></div>
-<p>Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
-saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
-array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
-lexicographical order:</p>
+<p>Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
+byte sentinel for document boundaries. sa.bin stores suffix offsets as
+32-bit unsigned integers, sorted by lexicographical order:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1);
{
@@ -101,11 +101,11 @@ SORTED SUFFIXES:
[7] x
</code></pre></div></div>
-<p>Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
-cache line), reducing complexity to O(N log N).</p>
+<p>Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
+line), reducing complexity to O(N log N).</p>
-<p>Search: Textbook range query with two binary searches. Random access to offsets
-and the text is possible via the fixed-width offsets:</p>
+<p>Search: Textbook range query with twin binary searches. Random access to
+suffixes and the text made possible by the fixed-width offsets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
@@ -114,8 +114,9 @@ seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
</code></pre></div></div>
-<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
-to implement. Note to self: mmap.</p>
+<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low.</p>
+
+<p>NOTE: mmap.</p>
<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p>
@@ -149,14 +150,16 @@ to implement. Note to self: mmap.</p>
<li>Search (Regex): 0.9120 s</li>
</ul>
-<p>Security: Much of it comes from its architectural simplicity. No dependencies
-to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.</p>
+<p>Security: Derived from architectural simplicity. Zero dependencies to manage,
+no secrets to hide, no targets for lateral movement.</p>
+
+<p>Runs in chroot.</p>
-<p>Resource exhaustion and XSS attacks are inherent. The former is mitigated by
-limiting concurrent searches via lock-file semaphores and capping query length
-(64B) and result sets (20). All output is HTML-escaped to prevent XSS.</p>
+<p>Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
+concurrent searches via lock-file semaphores. Query length (64B) and result set
+(20) capped. All output is HTML-escaped to prevent XSS.</p>
-<p>Verdict: Fast SA lookup; Works on every conceivable web browser.</p>
+<p>Secure by default. Fast. Durable.</p>
<p>Commit:
<a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>