summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md71
1 files changed, 28 insertions, 43 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index 0848dce..50de7b2 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,32 +1,17 @@
---
-title: Built a search engine for website based on suffix arrays
+title: Overengineered search for my website
date: 2026-01-03
layout: post
---
-Built search.
+Developed a suffix-array-based search engine for the site today. While a simple
+regex search was enough, I couldn't resist the technical elegance of a proper
+index.
-Requirements: match substrings, ignore case, handle 50 concurrent queries with
-1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript,
-text browsers.
-
-Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine.
-
-Server-side regex is viable for a personal site. But an index has clear
-advantages. Not that much harder to implement.
-
-Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin,
-file_map.dat—built with site:
-
-```
-$ JEKYLL_ENV=production bundle exec jekyll build
-$ cd cgi-bin/
-$ perl indexer.pl
-```
-
-Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel
-marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned
-integers, sorted lexicographically:
+Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text,
+and encode it into UTF-8 bytes. Used a null byte sentinel to mark document
+boundaries and stored the lexicographically sorted 32-bit unsigned integer
+offsets to sa.bin:
```
my @sa = 0 .. (length($corpus) - 1);
@@ -41,14 +26,16 @@ my @sa = 0 .. (length($corpus) - 1);
}
```
-32-bit offsets limit index size to 4GB—more than sufficient and, if necessary,
-easily expanded.
+32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
+comforting to have.
-Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes
-(length of a typical cache line).
+It takes about 50ms to index my 12-entry website on a T490. As the site grows,
+the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path
+that caps L at 64 bytes—roughly the size of a cache line on common hardware.
-Search: Textbook range query with two binary searches. Fixed-width offsets
-enable random access to index:
+Search: Implemented the search in a FastCGI script as a textbook range query
+with two binary searches. Leveraged the fixed-width offsets for fast random
+access to the index:
```
seek($fh_sa, $mid * 4, 0);
@@ -58,10 +45,12 @@ seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```
-Small seek/reads are fast on modern SSDs; keeps memory footprint small.
+Chose seek + read over mmap because it outperformed mmap for <1k files. At
+10k, mmap was occasionally faster (~200 µs), but used more memory—possibly
+due to OpenBSD’s VM security trade-offs. Results may vary by OS.
-Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear
-regex search:
+Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against
+linear regex search:
<pre class="pre-no-style">
=============================================================
@@ -100,19 +89,15 @@ Index size | 4163.44 KB | N/A
-------------------------------------------------------------
</pre>
-Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was
-occasionally faster (~200 µs), but used more memory—possibly OpenBSD's VM
-security trade-offs. Results may vary by OS.
-
-Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions
-govern access. Runs in chroot.
+Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
+permissions to govern access. Hardened the system by running it in chroot.
-Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
-concurrent searches. Query length (64B) and result set (20) are capped. All
-output is HTML-escaped to prevent XSS.
+Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
+using lock-file semaphores, and capped the query length (64 B) and the result
+set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
-Warranty: 10,000 / 12 → 833 years. <br>
-Next release: inverted index—Anno Domini 2859.
+At my current pace of twelve articles a year, the search should work without
+intervention for the next 833 years. Penciled in the next release for 2859 AD.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"