diff options
Diffstat (limited to '_log/site-search.md')
| -rw-r--r-- | _log/site-search.md | 92 |
1 files changed, 45 insertions, 47 deletions
diff --git a/_log/site-search.md b/_log/site-search.md index 7b916fe..c1d4c12 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,5 +1,5 @@ --- -title: Overengineered search +title: Under-engineered search date: 2026-01-03 layout: post --- @@ -8,15 +8,14 @@ Developed a suffix-array-based search engine for the site today. While a simple regex search was enough, couldn't resist the technical elegance of a proper index. -Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, -and encode it into UTF-8 bytes. Used a null byte sentinel to mark document -boundaries and stored the lexicographically sorted 32-bit unsigned integer -offsets to sa.bin: +Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into +UTF-8 bytes. Null byte sentinel marks the document boundaries; +Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin: ``` my @sa = 0 .. (length($corpus) - 1); { - use bytes; # Force compare 8-bit Unicode value comparisons + use bytes; # Force compare raw bytes @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || @@ -29,13 +28,8 @@ my @sa = 0 .. (length($corpus) - 1); 32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but comforting to have. -It takes about 50ms to index my 12-entry website on a T490. As the site grows, -the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path -that caps L at 64 bytes—roughly the size of a cache line on common hardware. - -Search: Implemented the search in a FastCGI script as a textbook range query -with two binary searches. Leveraged the fixed-width offsets for fast random -access to the index: +Search: Textbook range query with two binary searches hosted in a FastCGI +process. Fixed-width offsets enable fast random access to the index: ``` seek($fh_sa, $mid * 4, 0); @@ -45,9 +39,9 @@ seek($fh_cp, $off, 0); read($fh_cp, my $text, $query_len); ``` -Chose seek + read over mmap because it outperformed mmap for <1k files. At -10k, mmap was occasionally faster (~200 µs), but used more memory—possibly -due to OpenBSD’s VM security trade-offs. Results may vary by OS. +Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally +faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM +security trade-offs. Results may vary by OS. Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search: @@ -55,38 +49,38 @@ linear regex search: <pre class="pre-no-style"> ============================================================= SEARCH BENCHMARK: Suffix array vs. Linear regex -ARTICLE SIZE: 16 KB +ARTICLE SIZE: 8 KB ============================================================= -500 files: -------------------------------------------------------------- -METRIC | SA | REGEX +500 files (Targeting: keyword_-1): +----------------+----------------------+--------------------- +METRIC | SA | REGEX +----------------+----------------------+--------------------- +Search time | 0.0014s | 0.0451s +Peak RAM | 8124 KB | 9612 KB +Indexing time | 18.1865s | N/A +Index size | 19610.39 KB | N/A +----------------+----------------------+--------------------- + +1000 files (Targeting: keyword_-1): ----------------+----------------------+--------------------- -Search time | 0.0012s | 0.0407s -Peak RAM | 8828 KB | 9136 KB -Indexing time | 0.1475s | N/A -Index size | 204.94 KB | N/A -------------------------------------------------------------- - -1,000 files: -------------------------------------------------------------- -METRIC | SA | REGEX +METRIC | SA | REGEX ----------------+----------------------+--------------------- -Search time | 0.0019s | 0.0795s -Peak RAM | 8980 KB | 9460 KB -Indexing time | 0.3101s | N/A -Index size | 410.51 KB | N/A -------------------------------------------------------------- - -10,000 files: -------------------------------------------------------------- -METRIC | SA | REGEX +Search time | 0.0021s | 0.0918s +Peak RAM | 8280 KB | 9960 KB +Indexing time | 43.1748s | N/A +Index size | 39225.06 KB | N/A +----------------+----------------------+--------------------- + +10000 files (Targeting: keyword_-1): +----------------+----------------------+--------------------- +METRIC | SA | REGEX +----------------+----------------------+--------------------- +Search time | 0.0173s | 1.1275s +Peak RAM | 11848 KB | 13392 KB +Indexing time | 663.3909s | N/A +Index size | 392263.01 KB | N/A ----------------+----------------------+--------------------- -Search time | 0.0161s | 0.9120s -Peak RAM | 12504 KB | 12804 KB -Indexing time | 10.9661s | N/A -Index size | 4163.44 KB | N/A -------------------------------------------------------------- </pre> Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system @@ -96,13 +90,17 @@ Resource exhaustion and XSS attacks are inherent. Limited concurrent searches using lock-file semaphores, and capped the query length (64 B) and the result set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape. -At six articles a year, this should work for the next 1600 years. Penciled in -the next release for Anno Domini 3626. +Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100 +8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69 +seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to +index 500 articles is untenable. + +I under-engineered search. Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e" class="external" target="_blank" rel="noopener noreferrer">6da102d</a> | Benchmarks: <a -href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9" -class="external" target="_blank" rel="noopener noreferrer">8a4da68</a> +href="https://git.asciimx.com/site-search-bm/commit/?id=de9d82e8074c9b67a04989f9b6be62890b7c95bb" +class="external" target="_blank" rel="noopener noreferrer">de9d82e</a> |
