From 35c999ea426a5f0b32e2b5d5bab5fe010ea9ce97 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Mon, 4 May 2026 09:01:42 +0800 Subject: Improve accuracy of framing. --- _log/site-search.md | 92 ++++++++++++++++++++++++++--------------------------- 1 file changed, 45 insertions(+), 47 deletions(-) (limited to '_log/site-search.md') diff --git a/_log/site-search.md b/_log/site-search.md index 7b916fe..c1d4c12 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,5 +1,5 @@ --- -title: Overengineered search +title: Under-engineered search date: 2026-01-03 layout: post --- @@ -8,15 +8,14 @@ Developed a suffix-array-based search engine for the site today. While a simple regex search was enough, couldn't resist the technical elegance of a proper index. -Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, -and encode it into UTF-8 bytes. Used a null byte sentinel to mark document -boundaries and stored the lexicographically sorted 32-bit unsigned integer -offsets to sa.bin: +Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into +UTF-8 bytes. Null byte sentinel marks the document boundaries; +Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin: ``` my @sa = 0 .. (length($corpus) - 1); { - use bytes; # Force compare 8-bit Unicode value comparisons + use bytes; # Force compare raw bytes @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || @@ -29,13 +28,8 @@ my @sa = 0 .. (length($corpus) - 1); 32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but comforting to have. -It takes about 50ms to index my 12-entry website on a T490. As the site grows, -the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path -that caps L at 64 bytes—roughly the size of a cache line on common hardware. - -Search: Implemented the search in a FastCGI script as a textbook range query -with two binary searches. Leveraged the fixed-width offsets for fast random -access to the index: +Search: Textbook range query with two binary searches hosted in a FastCGI +process. Fixed-width offsets enable fast random access to the index: ``` seek($fh_sa, $mid * 4, 0); @@ -45,9 +39,9 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` -Chose seek + read over mmap because it outperformed mmap for <1k files. At -10k, mmap was occasionally faster (~200 µs), but used more memory—possibly -due to OpenBSD’s VM security trade-offs. Results may vary by OS. +Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally +faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM +security trade-offs. Results may vary by OS. Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search: @@ -55,38 +49,38 @@ linear regex search:
 =============================================================
 SEARCH BENCHMARK: Suffix array vs. Linear regex
-ARTICLE SIZE: 16 KB
+ARTICLE SIZE: 8 KB
 =============================================================
 
-500 files:
--------------------------------------------------------------
-METRIC          | SA                   | REGEX
+500 files (Targeting: keyword_-1):
+----------------+----------------------+---------------------
+METRIC          | SA                   | REGEX               
+----------------+----------------------+---------------------
+Search time     | 0.0014s              | 0.0451s             
+Peak RAM        | 8124 KB              | 9612 KB             
+Indexing time   | 18.1865s             | N/A                 
+Index size      | 19610.39 KB          | N/A                 
+----------------+----------------------+---------------------
+
+1000 files (Targeting: keyword_-1):
 ----------------+----------------------+---------------------
-Search time     |             0.0012s  |             0.0407s
-Peak RAM        |              8828 KB |              9136 KB
-Indexing time   |             0.1475s  |                 N/A
-Index size      |            204.94 KB |                 N/A
--------------------------------------------------------------
-
-1,000 files:
--------------------------------------------------------------
-METRIC          | SA                   | REGEX
+METRIC          | SA                   | REGEX               
 ----------------+----------------------+---------------------
-Search time     |             0.0019s  |             0.0795s
-Peak RAM        |              8980 KB |              9460 KB
-Indexing time   |             0.3101s  |                 N/A
-Index size      |            410.51 KB |                 N/A
--------------------------------------------------------------
-
-10,000 files:
--------------------------------------------------------------
-METRIC          | SA                   | REGEX
+Search time     | 0.0021s              | 0.0918s             
+Peak RAM        | 8280 KB              | 9960 KB             
+Indexing time   | 43.1748s             | N/A                 
+Index size      | 39225.06 KB          | N/A                 
+----------------+----------------------+---------------------
+
+10000 files (Targeting: keyword_-1):
+----------------+----------------------+---------------------
+METRIC          | SA                   | REGEX               
+----------------+----------------------+---------------------
+Search time     | 0.0173s              | 1.1275s             
+Peak RAM        | 11848 KB             | 13392 KB            
+Indexing time   | 663.3909s            | N/A                 
+Index size      | 392263.01 KB         | N/A                 
 ----------------+----------------------+---------------------
-Search time     |             0.0161s  |             0.9120s
-Peak RAM        |             12504 KB |             12804 KB
-Indexing time   |            10.9661s  |                 N/A
-Index size      |           4163.44 KB |                 N/A
--------------------------------------------------------------
 
Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system @@ -96,13 +90,17 @@ Resource exhaustion and XSS attacks are inherent. Limited concurrent searches using lock-file semaphores, and capped the query length (64 B) and the result set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape. -At six articles a year, this should work for the next 1600 years. Penciled in -the next release for Anno Domini 3626. +Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100 +8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69 +seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to +index 500 articles is untenable. + +I under-engineered search. Commit: 6da102d | Benchmarks: 8a4da68 +href="https://git.asciimx.com/site-search-bm/commit/?id=de9d82e8074c9b67a04989f9b6be62890b7c95bb" +class="external" target="_blank" rel="noopener noreferrer">de9d82e -- cgit v1.2.3