summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md39
1 files changed, 22 insertions, 17 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index e25c0fc..d4d7fe4 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -4,13 +4,13 @@ date: 2026-01-03
layout: post
---
-Developed a suffix-array-based search engine my personal site. While a simple
-regex search was enough, couldn't resist the technical elegance of a proper
+Developed a suffix-array-based search engine. While a simple regex search
+would've been sufficient, couldn't resist the technical elegance of a proper
index.
-Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into
-UTF-8 bytes. Null byte sentinel marks the document boundaries;
-Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin:
+Indexer crawls the HTML, lowercases the text, and encodes it into UTF-8 bytes.
+Null byte sentinels mark document boundaries; sa.bin stores lexicographically
+sorted 32-bit unsigned integer offsets:
```
my @sa = 0 .. (length($corpus) - 1);
@@ -28,12 +28,14 @@ my @sa = 0 .. (length($corpus) - 1);
32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
comforting to have.
-O(L⋅N log N) sort is slow. 100 4.1 KB articles took 97.9s to index. L=64 fast
-path reduces that to 1.31s (L=16, 32, 64: 1.29-1.31s; 128, 256: 1.33-1.35s).
-Even with fast path optimization, indexer is unusable beyond 300 articles.
+O(L⋅N log N) sort is the bottleneck. 100 4.1 KB articles took 97.9s to index.
+L=64 fast path reduces that to 1.31s. Experimented with 16, 32, 128, and 256
+bytes; 64 was the sweet spot—lower values were marginally faster, higher ones
+marginally slower.
-Search: Textbook range query with two binary searches, hosted in a FastCGI
-process. Fixed-width offsets allow fast random access to the index:
+Implemented search using a textbook range query with two binary searches,
+hosted in a FastCGI process. Fixed-width offsets allow fast random access to
+the index:
```
seek($fh_sa, $mid * 4, 0);
@@ -47,6 +49,13 @@ Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
security trade-offs. Results may vary by OS.
+Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
+permissions to govern access. Hardened the system by running it in chroot.
+
+Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
+using lock-file semaphores, and capped the query length (64 B) and the result
+set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
+
Benchmarks: My articles have a 3.42 KB median, 3.43 KB mean, and 5.39 KB max.
Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 4.1 KB) against
linear regex search:
@@ -88,14 +97,10 @@ Index size | 103557.18 KB | N/A
----------------+----------------------+---------------------
</pre>
-Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
-permissions to govern access. Hardened the system by running it in chroot.
-
-Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
-using lock-file semaphores, and capped the query length (64 B) and the result
-set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
+Search scales well—0.9 ms at 100 files, 8.8 ms at 5000. Indexing doesn't. 4.5s
+at 300 files is tolerable; 138s at 5000 is impractical.
-Next release: Incremental indexing + SA-IS, Anno Domini 2076.
+Warranty: 300 / 6 → 50 years.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"