From 0e0d336d7ec80ff00e0e4d9acf00267ad3faa214 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Thu, 7 May 2026 16:04:43 +0800 Subject: Deployment script and replace regex search with SA search. --- _log/site-search.md | 58 ++++++++++++++++++++++++++--------------------------- 1 file changed, 29 insertions(+), 29 deletions(-) (limited to '_log/site-search.md') diff --git a/_log/site-search.md b/_log/site-search.md index c1d4c12..e25c0fc 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,10 +1,10 @@ --- -title: Under-engineered search +title: Overengineered search date: 2026-01-03 layout: post --- -Developed a suffix-array-based search engine for the site today. While a simple +Developed a suffix-array-based search engine my personal site. While a simple regex search was enough, couldn't resist the technical elegance of a proper index. @@ -28,8 +28,12 @@ my @sa = 0 .. (length($corpus) - 1); 32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but comforting to have. -Search: Textbook range query with two binary searches hosted in a FastCGI -process. Fixed-width offsets enable fast random access to the index: +O(L⋅N log N) sort is slow. 100 4.1 KB articles took 97.9s to index. L=64 fast +path reduces that to 1.31s (L=16, 32, 64: 1.29-1.31s; 128, 256: 1.33-1.35s). +Even with fast path optimization, indexer is unusable beyond 300 articles. + +Search: Textbook range query with two binary searches, hosted in a FastCGI +process. Fixed-width offsets allow fast random access to the index: ``` seek($fh_sa, $mid * 4, 0); @@ -43,43 +47,44 @@ Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM security trade-offs. Results may vary by OS. -Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against +Benchmarks: My articles have a 3.42 KB median, 3.43 KB mean, and 5.39 KB max. +Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 4.1 KB) against linear regex search:
 =============================================================
 SEARCH BENCHMARK: Suffix array vs. Linear regex
-ARTICLE SIZE: 8 KB
+ARTICLE SIZE: 4.1 KB
 =============================================================
 
-500 files (Targeting: keyword_-1):
+100 files (Targeting: keyword_-1):
 ----------------+----------------------+---------------------
 METRIC          | SA                   | REGEX               
 ----------------+----------------------+---------------------
-Search time     | 0.0014s              | 0.0451s             
-Peak RAM        | 8124 KB              | 9612 KB             
-Indexing time   | 18.1865s             | N/A                 
-Index size      | 19610.39 KB          | N/A                 
+Search time     | 0.0009s              | 0.0084s             
+Peak RAM        | 7968 KB              | 9676 KB             
+Indexing time   | 1.3332s              | N/A                 
+Index size      | 2070.38 KB           | N/A                 
 ----------------+----------------------+---------------------
 
-1000 files (Targeting: keyword_-1):
+300 files (Targeting: keyword_-1):
 ----------------+----------------------+---------------------
 METRIC          | SA                   | REGEX               
 ----------------+----------------------+---------------------
-Search time     | 0.0021s              | 0.0918s             
-Peak RAM        | 8280 KB              | 9960 KB             
-Indexing time   | 43.1748s             | N/A                 
-Index size      | 39225.06 KB          | N/A                 
+Search time     | 0.0009s              | 0.0242s             
+Peak RAM        | 8024 KB              | 9680 KB             
+Indexing time   | 4.5658s              | N/A                 
+Index size      | 6211.76 KB           | N/A                 
 ----------------+----------------------+---------------------
 
-10000 files (Targeting: keyword_-1):
+5000 files (Targeting: keyword_-1):
 ----------------+----------------------+---------------------
 METRIC          | SA                   | REGEX               
 ----------------+----------------------+---------------------
-Search time     | 0.0173s              | 1.1275s             
-Peak RAM        | 11848 KB             | 13392 KB            
-Indexing time   | 663.3909s            | N/A                 
-Index size      | 392263.01 KB         | N/A                 
+Search time     | 0.0088s              | 0.4937s             
+Peak RAM        | 9948 KB              | 11436 KB            
+Indexing time   | 138.7510s            | N/A                 
+Index size      | 103557.18 KB         | N/A                 
 ----------------+----------------------+---------------------
 
@@ -90,17 +95,12 @@ Resource exhaustion and XSS attacks are inherent. Limited concurrent searches using lock-file semaphores, and capped the query length (64 B) and the result set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape. -Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100 -8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69 -seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to -index 500 articles is untenable. - -I under-engineered search. +Next release: Incremental indexing + SA-IS, Anno Domini 2076. Commit: 6da102d | Benchmarks: de9d82e +href="https://git.asciimx.com/site-search-bm/commit/?id=f6d7c3fdbecbcb880c0c02fdffefa1f467c46b03" +class="external" target="_blank" rel="noopener noreferrer">f6d7c3f -- cgit v1.2.3