summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
diff options
context:
space:
mode:
Diffstat (limited to '_log/site-search.md')
-rw-r--r--_log/site-search.md58
1 files changed, 29 insertions, 29 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index c1d4c12..e25c0fc 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,10 +1,10 @@
---
-title: Under-engineered search
+title: Overengineered search
date: 2026-01-03
layout: post
---
-Developed a suffix-array-based search engine for the site today. While a simple
+Developed a suffix-array-based search engine my personal site. While a simple
regex search was enough, couldn't resist the technical elegance of a proper
index.
@@ -28,8 +28,12 @@ my @sa = 0 .. (length($corpus) - 1);
32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
comforting to have.
-Search: Textbook range query with two binary searches hosted in a FastCGI
-process. Fixed-width offsets enable fast random access to the index:
+O(L⋅N log N) sort is slow. 100 4.1 KB articles took 97.9s to index. L=64 fast
+path reduces that to 1.31s (L=16, 32, 64: 1.29-1.31s; 128, 256: 1.33-1.35s).
+Even with fast path optimization, indexer is unusable beyond 300 articles.
+
+Search: Textbook range query with two binary searches, hosted in a FastCGI
+process. Fixed-width offsets allow fast random access to the index:
```
seek($fh_sa, $mid * 4, 0);
@@ -43,43 +47,44 @@ Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
security trade-offs. Results may vary by OS.
-Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against
+Benchmarks: My articles have a 3.42 KB median, 3.43 KB mean, and 5.39 KB max.
+Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 4.1 KB) against
linear regex search:
<pre class="pre-no-style">
=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
-ARTICLE SIZE: 8 KB
+ARTICLE SIZE: 4.1 KB
=============================================================
-500 files (Targeting: keyword_-1):
+100 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC | SA | REGEX
----------------+----------------------+---------------------
-Search time | 0.0014s | 0.0451s
-Peak RAM | 8124 KB | 9612 KB
-Indexing time | 18.1865s | N/A
-Index size | 19610.39 KB | N/A
+Search time | 0.0009s | 0.0084s
+Peak RAM | 7968 KB | 9676 KB
+Indexing time | 1.3332s | N/A
+Index size | 2070.38 KB | N/A
----------------+----------------------+---------------------
-1000 files (Targeting: keyword_-1):
+300 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC | SA | REGEX
----------------+----------------------+---------------------
-Search time | 0.0021s | 0.0918s
-Peak RAM | 8280 KB | 9960 KB
-Indexing time | 43.1748s | N/A
-Index size | 39225.06 KB | N/A
+Search time | 0.0009s | 0.0242s
+Peak RAM | 8024 KB | 9680 KB
+Indexing time | 4.5658s | N/A
+Index size | 6211.76 KB | N/A
----------------+----------------------+---------------------
-10000 files (Targeting: keyword_-1):
+5000 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC | SA | REGEX
----------------+----------------------+---------------------
-Search time | 0.0173s | 1.1275s
-Peak RAM | 11848 KB | 13392 KB
-Indexing time | 663.3909s | N/A
-Index size | 392263.01 KB | N/A
+Search time | 0.0088s | 0.4937s
+Peak RAM | 9948 KB | 11436 KB
+Indexing time | 138.7510s | N/A
+Index size | 103557.18 KB | N/A
----------------+----------------------+---------------------
</pre>
@@ -90,17 +95,12 @@ Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
using lock-file semaphores, and capped the query length (64 B) and the result
set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
-Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100
-8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69
-seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to
-index 500 articles is untenable.
-
-I under-engineered search.
+Next release: Incremental indexing + SA-IS, Anno Domini 2076.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
-href="https://git.asciimx.com/site-search-bm/commit/?id=de9d82e8074c9b67a04989f9b6be62890b7c95bb"
-class="external" target="_blank" rel="noopener noreferrer">de9d82e</a>
+href="https://git.asciimx.com/site-search-bm/commit/?id=f6d7c3fdbecbcb880c0c02fdffefa1f467c46b03"
+class="external" target="_blank" rel="noopener noreferrer">f6d7c3f</a>