diff options
| author | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-03 19:24:25 +0800 |
|---|---|---|
| committer | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-03 19:24:25 +0800 |
| commit | 16fe66dd83cbffa18af31676a380660ebce4e827 (patch) | |
| tree | 2e3c655bac69430998296a0ab8dbfcc4987090bb /_log | |
| parent | d3aff7d14e66f19a8e4e9a3315494ad612ccc1f5 (diff) | |
| download | www-16fe66dd83cbffa18af31676a380660ebce4e827.tar.gz | |
Polish the search engine post.
Diffstat (limited to '_log')
| -rw-r--r-- | _log/site-search.md | 276 |
1 files changed, 24 insertions, 252 deletions
diff --git a/_log/site-search.md b/_log/site-search.md index 2a552f6..6af9f63 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,30 +1,20 @@ --- -title: Site search with Perl + CGI + SA +title: Perl + FastCGI + SA search engine date: 2026-01-02 layout: post --- -Number of articles on the site is growing. Need a way to find them. +Number of articles growing. Need search. -Search must match substrings, be case-insensitive, fast, and secure. Site must -continue to work without JavaScript, on text web browsers--so no client-side -search. Page rank not applicable. +Requirements: substring match, case-insensitive, fast, secure. No JavaScript. -Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base -system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the -results back over FastCGI. +Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script. -Data structure. Suffix array. Build-time indexer generates three files: -corpus.bin--articles, sa.bin-suffix array, file_map.dat--article metadata -(title, path). +Data structure: suffix array. Three files: corpus.bin (articles), sa.bin +(sorted byte offsets), file_map.dat (metadata). -## Indexer - -Crawl the directory containing posts, extract HTML with regex -(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case -the text (for case-insensitive search), convert to raw octets, concatenate. - -Sort the text lexicographically, and store their byte offsets in sa.bin: +Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null +byte sentinel for document boundaries. Sort lexicographically:: ``` # Use a block that forces byte-level comparison @@ -39,247 +29,29 @@ Sort the text lexicographically, and store their byte offsets in sa.bin: } ``` -This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the -suffix count, L the median length of suffixes. The fast path caps L at 64 bytes -for most cases, bringing complexity down to O(NlogN)--64-byte length chosen to -fit in cache lines. - -``` -CORPUS: a s c i \0 i m x -OFFSET: 0 1 2 3 4 5 6 7 - -SORTED SUFFIXES: -[4] \0imx -[0] asci\0imx -[2] ci\0imx -[3] i\0imx <-- "i" from "asci" -[5] imx <-- "i" from "imx" -[6] mx -[1] sci\0imx -[7] x -``` - -Sentinel (null byte) must be lexicographically smaller than every other -character in the alphabet (offset 3 and 5 handled correctly at the document -boundary). - -Benchmarks. - -Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM) -i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory. - -All articles are 16 KB (2x my avg post size) documents generated with seed.sh. - -Total Time: 0.1475 seconds<br> -Files Processed: 500 <br> -File Sizes:<br> - corpus.bin 33.59 KB<br> - sa.bin 134.34 KB<br> - file_map.dat 37.01 KB<br> - Total Index: 204.94 KB<br> - -Total Time: 0.3101 seconds<br> -Files Processed: 1000<br> -File Sizes:<br> - corpus.bin 67.28 KB<br> - sa.bin 269.11 KB<br> - file_map.dat 74.12 KB<br> - Total Index: 410.51 KB<br> - -Total Time: 10.9661 seconds<br> -Files Processed: 10000<br> -File Sizes:<br> - corpus.bin 682.51 KB<br> - sa.bin 2730.05 KB<br> - file_map.dat 750.88 KB<br> - Total Index: 4163.44 KB<br> - -There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the -index is very manageable. - -## Search - -Need to perform a range query to find offsets of all matches. Suffix -array is sorted--use binary search. - -``` -# left bound -while ($low <= $high) { - # Int overflow: Perl handles by promoting to float. - my $mid = int(($low + $high) / 2); - seek($fh_sa, $mid * 4, 0); - read($fh_sa, my $bin_off, 4); - my $off = unpack("L", $bin_off); - seek($fh_cp, $off, 0); - read($fh_cp, my $text, $query_len); - - my $cmp = $text cmp $query; - if ($cmp >= 0) { - $first_hit = $mid if $cmp == 0; - $high = $mid - 1; - } else { - $low = $mid + 1; - } -} - -# Continue with right bound... -``` - -Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to -seek/read exact chunks into memory. Since the FastCGI script is invoked per -query, small, frequent allocations are better than less frequent large -allocations (see bm_*.txt). - -Once the range is found through binary search, the results are collected using -a simple for loop. Max result count is capped at 20. - -Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index -search. Right: Naive search using regex. - -500 articles: - -<table> - <tr> - <td style="padding: 0">Files Read:</td> - <td style="padding: 0; text-align: right;">3</td> - <td style="padding: 0; text-align: right;">500</td> - </tr> - <tr> - <td style="padding: 0">User CPU (s):</td> - <td style="padding: 0; text-align: right;">0.0100</td> - <td style="padding: 0; text-align: right;">0.0600</td> - </tr> - <tr> - <td style="padding: 0">System CPU (s):</td> - <td style="padding: 0; text-align: right;">0.0200</td> - <td style="padding: 0; text-align: right;">0.0200</td> - </tr> - <tr> - <td style="padding: 0">Total Time (s):</td> - <td style="padding: 0; text-align: right;">0.0012</td> - <td style="padding: 0; text-align: right;">0.0407</td> - </tr> - <tr> - <td style="padding: 0">Peak RAM (KB):</td> - <td style="padding: 0; text-align: right;">8828</td> - <td style="padding: 0; text-align: right;">9136</td> - </tr> -</table> - -1,000 articles: - -<table> - <tr> - <td style="padding: 0">Files Read:</td> - <td style="padding: 0; text-align: right;">3</td> - <td style="padding: 0; text-align: right;">1000</td> - </tr> - <tr> - <td style="padding: 0">User CPU (s):</td> - <td style="padding: 0; text-align: right;">0.0200</td> - <td style="padding: 0; text-align: right;">0.0500</td> - </tr> - <tr> - <td style="padding: 0">System CPU (s):</td> - <td style="padding: 0; text-align: right;">0.0100</td> - <td style="padding: 0; text-align: right;">0.0500</td> - </tr> - <tr> - <td style="padding: 0">Total Time (s):</td> - <td style="padding: 0; text-align: right;">0.0019</td> - <td style="padding: 0; text-align: right;">0.0795</td> - </tr> - <tr> - <td style="padding: 0">Peak RAM (KB):</td> - <td style="padding: 0; text-align: right;">8980</td> - <td style="padding: 0; text-align: right;">9460</td> - </tr> -</table> - -10,000 articles: - -<table> - <tr> - <td style="padding: 0">Files Read:</td> - <td style="padding: 0; text-align: right;">3</td> - <td style="padding: 0; text-align: right;">10000</td> - </tr> - <tr> - <td style="padding: 0">User CPU (s):</td> - <td style="padding: 0; text-align: right;">0.0300</td> - <td style="padding: 0; text-align: right;">0.4000</td> - </tr> - <tr> - <td style="padding: 0">System CPU (s):</td> - <td style="padding: 0; text-align: right;">0.0300</td> - <td style="padding: 0; text-align: right;">0.5400</td> - </tr> - <tr> - <td style="padding: 0">Total Time (s):</td> - <td style="padding: 0; text-align: right;">0.0161</td> - <td style="padding: 0; text-align: right;">0.9120</td> - </tr> - <tr> - <td style="padding: 0">Peak RAM (KB):</td> - <td style="padding: 0; text-align: right;">12504</td> - <td style="padding: 0; text-align: right;">12804</td> - </tr> -</table> - -Note: Total time < User + System CPU time is likely a measurement artifact--the -script finishes faster than the OS tick interval. - -## Security - -Memory usage remains a concern. Parallel queries could exhaust memory. Limit -parallel searches using a semaphore implemented with lock files: - -``` -my $max_parallel = 50; - -mkdir $lock_dir, 0777 unless -d $lock_dir; -my $active_count = 0; -my $now = time(); - -opendir(my $dh, $lock_dir); -while (my $file = readdir($dh)) { - next unless $file =~ /\.lock$/; - my $path = "$lock_dir/$file"; - my $mtime = (stat($path))[9] || 0; - ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++; -} -closedir($dh); -``` - -XSS attacks. Escape all HTML using HTML::Escape qw(escape_html). - -ReDOS: Sanitize search text by tossing non-printable characters, limit length, -work within utf-8. Use `\Q...\E` in searches. +Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte +length targets cache lines. -``` -if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) { - $search_text = $1; - $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; - $search_text = decode_utf8($search_text // ""); - $search_text =~ s/\P{Print}//g; - $search_text = substr($search_text, 0, 64); - $search_text =~ s/^\s+|\s+$//g; -} -``` +Search: binary search for range query. Cap at 20 results--define limits or be +surprised by them. -Command injection: no exec()/system() calls. Non-privileged user. -Read-only directory. +File IO and memory: many seek/read small chunks beat one large allocation (see +benchmarks for find_one_file.cgi). -Symlink attacks: paths are hard-coded. Script can't be modified (554). +Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles): -No API keys, passwords involved. +1,000 files: 0.31s indexing, 410 KB total index. +10,000 files: 10.97s indexing, 4.16 MB total index. -Last line of defence: chroot. +Search 'arduino' (0 matches): +1,000 files: 0.002s (SA) vs 0.080s (naive regex). +10,000 files: 0.016s (SA) vs 0.912s (naive regex). -Can't think of anything else. +Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). +Sanitize input--strip non-printables, limit length, and quote metacharacters +(ReDOS). No exec/system (command injection). Chroot. -Verdict: Fast SA index lookup. Mitigated primary attack vectors. No -dependencies. Works on every conceivable device. +Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies. Commit: [6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e) |
