diff options
Diffstat (limited to '_site/log/site-search')
| -rw-r--r-- | _site/log/site-search/index.html | 338 |
1 files changed, 338 insertions, 0 deletions
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html new file mode 100644 index 0000000..3c32ba3 --- /dev/null +++ b/_site/log/site-search/index.html @@ -0,0 +1,338 @@ +<!DOCTYPE html> +<html> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Site search with Perl + CGI + SA</title> + <link rel="stylesheet" href="/assets/css/main.css"> + <link rel="stylesheet" href="/assets/css/skeleton.css"> +</head> + + + <body> + + <div id="nav-container" class="container"> + <ul id="navlist" class="left"> + + <li > + <a href="/" class="link-decor-none">hme</a> + </li> + <li class="active"> + <a href="/log/" class="link-decor-none">log</a> + </li> + <li > + <a href="/projects/" class="link-decor-none">poc</a> + </li> + <li > + <a href="/about/" class="link-decor-none">abt</a> + </li> + <li> + <a href="/cgi-bin/find.cgi" class="link-decor-none">sws</a> + </li> + <li> + <a href="/feed.xml" class="link-decor-none">rss</a> + </li> + </ul> +</div> + + + + <main> + <div class="container"> + <div class="container-2"> + <h2 class="center" id="title">SITE SEARCH WITH PERL + CGI + SA</h2> + <h5 class="center">02 JANUARY 2026</h5> + <br> + <div class="twocol justify"><p>Number of articles on the site is growing. Need a way to find them.</p> + +<p>Search must match substrings, be case-insensitive, fast, and secure. Site must +continue to work without JavaScript, on text web browsers–so no client-side +search. Page rank not applicable.</p> + +<p>Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base +system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the +results back over FastCGI.</p> + +<p>Data structure. Suffix array. Build-time indexer generates three files: +corpus.bin–articles, sa.bin-suffix array, file_map.dat–article metadata +(title, path).</p> + +<h2 id="indexer">Indexer</h2> + +<p>Crawl the directory containing posts, extract HTML with regex +(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case +the text (for case-insensitive search), convert to raw octets, concatenate.</p> + +<p>Sort the text lexicographically, and store their byte offsets in sa.bin:</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison +{ + use bytes; + @sa = sort { + # First 64 bytes check (fast path) + (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || + # Full string fallback (required for correctness) + (substr($corpus, $a) cmp substr($corpus, $b)) + } @sa; +} +</code></pre></div></div> + +<p>This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the +suffix count, L the median length of suffixes. The fast path caps L at 64 bytes +for most cases, bringing complexity down to O(NlogN)–64-byte length chosen to +fit in cache lines.</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CORPUS: a s c i \0 i m x +OFFSET: 0 1 2 3 4 5 6 7 + +SORTED SUFFIXES: +[4] \0imx +[0] asci\0imx +[2] ci\0imx +[3] i\0imx <-- "i" from "asci" +[5] imx <-- "i" from "imx" +[6] mx +[1] sci\0imx +[7] x +</code></pre></div></div> + +<p>Sentinel (null byte) must be lexicographically smaller than every other +character in the alphabet (offset 3 and 5 handled correctly at the document +boundary).</p> + +<p>Benchmarks.</p> + +<p>Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM) +i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.</p> + +<p>All articles are 16 KB (2x my avg post size) documents generated with seed.sh.</p> + +<p>Total Time: 0.1475 seconds<br /> +Files Processed: 500 <br /> +File Sizes:<br /> + corpus.bin 33.59 KB<br /> + sa.bin 134.34 KB<br /> + file_map.dat 37.01 KB<br /> + Total Index: 204.94 KB<br /></p> + +<p>Total Time: 0.3101 seconds<br /> +Files Processed: 1000<br /> +File Sizes:<br /> + corpus.bin 67.28 KB<br /> + sa.bin 269.11 KB<br /> + file_map.dat 74.12 KB<br /> + Total Index: 410.51 KB<br /></p> + +<p>Total Time: 10.9661 seconds<br /> +Files Processed: 10000<br /> +File Sizes:<br /> + corpus.bin 682.51 KB<br /> + sa.bin 2730.05 KB<br /> + file_map.dat 750.88 KB<br /> + Total Index: 4163.44 KB<br /></p> + +<p>There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the +index is very manageable.</p> + +<h2 id="search">Search</h2> + +<p>Need to perform a range query to find offsets of all matches. Suffix +array is sorted–use binary search.</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># left bound +while ($low <= $high) { + # Int overflow: Perl handles by promoting to float. + my $mid = int(($low + $high) / 2); + seek($fh_sa, $mid * 4, 0); + read($fh_sa, my $bin_off, 4); + my $off = unpack("L", $bin_off); + seek($fh_cp, $off, 0); + read($fh_cp, my $text, $query_len); + + my $cmp = $text cmp $query; + if ($cmp >= 0) { + $first_hit = $mid if $cmp == 0; + $high = $mid - 1; + } else { + $low = $mid + 1; + } +} + +# Continue with right bound... +</code></pre></div></div> + +<p>Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to +seek/read exact chunks into memory. Since the FastCGI script is invoked per +query, small, frequent allocations are better than less frequent large +allocations (see bm_*.txt).</p> + +<p>Once the range is found through binary search, the results are collected using +a simple for loop. Max result count is capped at 20.</p> + +<p>Benchmarks. Search for ‘arduino’ (expect 0 matches). Left: SA-based index +search. Right: Naive search using regex.</p> + +<p>500 articles:</p> + +<table> + <tr> + <td style="padding: 0">Files Read:</td> + <td style="padding: 0; text-align: right;">3</td> + <td style="padding: 0; text-align: right;">500</td> + </tr> + <tr> + <td style="padding: 0">User CPU (s):</td> + <td style="padding: 0; text-align: right;">0.0100</td> + <td style="padding: 0; text-align: right;">0.0600</td> + </tr> + <tr> + <td style="padding: 0">System CPU (s):</td> + <td style="padding: 0; text-align: right;">0.0200</td> + <td style="padding: 0; text-align: right;">0.0200</td> + </tr> + <tr> + <td style="padding: 0">Total Time (s):</td> + <td style="padding: 0; text-align: right;">0.0012</td> + <td style="padding: 0; text-align: right;">0.0407</td> + </tr> + <tr> + <td style="padding: 0">Peak RAM (KB):</td> + <td style="padding: 0; text-align: right;">8828</td> + <td style="padding: 0; text-align: right;">9136</td> + </tr> +</table> + +<p>1,000 articles:</p> + +<table> + <tr> + <td style="padding: 0">Files Read:</td> + <td style="padding: 0; text-align: right;">3</td> + <td style="padding: 0; text-align: right;">1000</td> + </tr> + <tr> + <td style="padding: 0">User CPU (s):</td> + <td style="padding: 0; text-align: right;">0.0200</td> + <td style="padding: 0; text-align: right;">0.0500</td> + </tr> + <tr> + <td style="padding: 0">System CPU (s):</td> + <td style="padding: 0; text-align: right;">0.0100</td> + <td style="padding: 0; text-align: right;">0.0500</td> + </tr> + <tr> + <td style="padding: 0">Total Time (s):</td> + <td style="padding: 0; text-align: right;">0.0019</td> + <td style="padding: 0; text-align: right;">0.0795</td> + </tr> + <tr> + <td style="padding: 0">Peak RAM (KB):</td> + <td style="padding: 0; text-align: right;">8980</td> + <td style="padding: 0; text-align: right;">9460</td> + </tr> +</table> + +<p>10,000 articles:</p> + +<table> + <tr> + <td style="padding: 0">Files Read:</td> + <td style="padding: 0; text-align: right;">3</td> + <td style="padding: 0; text-align: right;">10000</td> + </tr> + <tr> + <td style="padding: 0">User CPU (s):</td> + <td style="padding: 0; text-align: right;">0.0300</td> + <td style="padding: 0; text-align: right;">0.4000</td> + </tr> + <tr> + <td style="padding: 0">System CPU (s):</td> + <td style="padding: 0; text-align: right;">0.0300</td> + <td style="padding: 0; text-align: right;">0.5400</td> + </tr> + <tr> + <td style="padding: 0">Total Time (s):</td> + <td style="padding: 0; text-align: right;">0.0161</td> + <td style="padding: 0; text-align: right;">0.9120</td> + </tr> + <tr> + <td style="padding: 0">Peak RAM (KB):</td> + <td style="padding: 0; text-align: right;">12504</td> + <td style="padding: 0; text-align: right;">12804</td> + </tr> +</table> + +<p>Note: Total time < User + System CPU time is likely a measurement artifact–the +script finishes faster than the OS tick interval.</p> + +<h2 id="security">Security</h2> + +<p>Memory usage remains a concern. Parallel queries could exhaust memory. Limit +parallel searches using a semaphore implemented with lock files:</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my $max_parallel = 50; + +mkdir $lock_dir, 0777 unless -d $lock_dir; +my $active_count = 0; +my $now = time(); + +opendir(my $dh, $lock_dir); +while (my $file = readdir($dh)) { + next unless $file =~ /\.lock$/; + my $path = "$lock_dir/$file"; + my $mtime = (stat($path))[9] || 0; + ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++; +} +closedir($dh); +</code></pre></div></div> + +<p>XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).</p> + +<p>ReDOS: Sanitize search text by tossing non-printable characters, limit length, +work within utf-8. Use <code class="language-plaintext highlighter-rouge">\Q...\E</code> in searches.</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) { + $search_text = $1; + $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; + $search_text = decode_utf8($search_text // ""); + $search_text =~ s/\P{Print}//g; + $search_text = substr($search_text, 0, 64); + $search_text =~ s/^\s+|\s+$//g; +} +</code></pre></div></div> + +<p>Command injection: no exec()/system() calls. Non-privileged user. +Read-only directory.</p> + +<p>Symlink attacks: paths are hard-coded. Script can’t be modified (554).</p> + +<p>No API keys, passwords involved.</p> + +<p>Last line of defence: chroot.</p> + +<p>Can’t think of anything else.</p> + +<p>Verdict: Fast SA index lookup. Mitigated primary attack vectors. No +dependencies. Works on every conceivable device.</p> + +<p>Commit: +<a href="https://git.asciimx.com/www/commit/?h=term&id=9889acaac984817731ad9f9355f067f9acd02efd">9889aca</a> +| Benchmarks: +<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p> +</div> + <p class="post-author right">by W. D. Sadeep Madurange</p> + </div> + </div> + </main> + + <div class="footer"> + <div class="container"> + <div class="twelve columns right container-2"> + <p id="footer-text">© ASCIIMX - 2026</p> + </div> + </div> +</div> + + + </body> +</html> |
