summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md92
1 files changed, 45 insertions, 47 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index 7b916fe..c1d4c12 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,5 +1,5 @@
---
-title: Overengineered search
+title: Under-engineered search
date: 2026-01-03
layout: post
---
@@ -8,15 +8,14 @@ Developed a suffix-array-based search engine for the site today. While a simple
regex search was enough, couldn't resist the technical elegance of a proper
index.
-Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text,
-and encode it into UTF-8 bytes. Used a null byte sentinel to mark document
-boundaries and stored the lexicographically sorted 32-bit unsigned integer
-offsets to sa.bin:
+Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into
+UTF-8 bytes. Null byte sentinel marks the document boundaries;
+Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin:
```
my @sa = 0 .. (length($corpus) - 1);
{
- use bytes; # Force compare 8-bit Unicode value comparisons
+ use bytes; # Force compare raw bytes
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
@@ -29,13 +28,8 @@ my @sa = 0 .. (length($corpus) - 1);
32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
comforting to have.
-It takes about 50ms to index my 12-entry website on a T490. As the site grows,
-the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path
-that caps L at 64 bytes—roughly the size of a cache line on common hardware.
-
-Search: Implemented the search in a FastCGI script as a textbook range query
-with two binary searches. Leveraged the fixed-width offsets for fast random
-access to the index:
+Search: Textbook range query with two binary searches hosted in a FastCGI
+process. Fixed-width offsets enable fast random access to the index:
```
seek($fh_sa, $mid * 4, 0);
@@ -45,9 +39,9 @@ seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```
-Chose seek + read over mmap because it outperformed mmap for <1k files. At
-10k, mmap was occasionally faster (~200 µs), but used more memory—possibly
-due to OpenBSD’s VM security trade-offs. Results may vary by OS.
+Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
+faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
+security trade-offs. Results may vary by OS.
Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against
linear regex search:
@@ -55,38 +49,38 @@ linear regex search:
<pre class="pre-no-style">
=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
-ARTICLE SIZE: 16 KB
+ARTICLE SIZE: 8 KB
=============================================================
-500 files:
--------------------------------------------------------------
-METRIC | SA | REGEX
+500 files (Targeting: keyword_-1):
+----------------+----------------------+---------------------
+METRIC | SA | REGEX
+----------------+----------------------+---------------------
+Search time | 0.0014s | 0.0451s
+Peak RAM | 8124 KB | 9612 KB
+Indexing time | 18.1865s | N/A
+Index size | 19610.39 KB | N/A
+----------------+----------------------+---------------------
+
+1000 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
-Search time | 0.0012s | 0.0407s
-Peak RAM | 8828 KB | 9136 KB
-Indexing time | 0.1475s | N/A
-Index size | 204.94 KB | N/A
--------------------------------------------------------------
-
-1,000 files:
--------------------------------------------------------------
-METRIC | SA | REGEX
+METRIC | SA | REGEX
----------------+----------------------+---------------------
-Search time | 0.0019s | 0.0795s
-Peak RAM | 8980 KB | 9460 KB
-Indexing time | 0.3101s | N/A
-Index size | 410.51 KB | N/A
--------------------------------------------------------------
-
-10,000 files:
--------------------------------------------------------------
-METRIC | SA | REGEX
+Search time | 0.0021s | 0.0918s
+Peak RAM | 8280 KB | 9960 KB
+Indexing time | 43.1748s | N/A
+Index size | 39225.06 KB | N/A
+----------------+----------------------+---------------------
+
+10000 files (Targeting: keyword_-1):
+----------------+----------------------+---------------------
+METRIC | SA | REGEX
+----------------+----------------------+---------------------
+Search time | 0.0173s | 1.1275s
+Peak RAM | 11848 KB | 13392 KB
+Indexing time | 663.3909s | N/A
+Index size | 392263.01 KB | N/A
----------------+----------------------+---------------------
-Search time | 0.0161s | 0.9120s
-Peak RAM | 12504 KB | 12804 KB
-Indexing time | 10.9661s | N/A
-Index size | 4163.44 KB | N/A
--------------------------------------------------------------
</pre>
Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
@@ -96,13 +90,17 @@ Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
using lock-file semaphores, and capped the query length (64 B) and the result
set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
-At six articles a year, this should work for the next 1600 years. Penciled in
-the next release for Anno Domini 3626.
+Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100
+8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69
+seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to
+index 500 articles is untenable.
+
+I under-engineered search.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
-href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9"
-class="external" target="_blank" rel="noopener noreferrer">8a4da68</a>
+href="https://git.asciimx.com/site-search-bm/commit/?id=de9d82e8074c9b67a04989f9b6be62890b7c95bb"
+class="external" target="_blank" rel="noopener noreferrer">de9d82e</a>