From 15205d0cf770058b59be07e00f6dbc6523b9cede Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Sat, 3 Jan 2026 13:06:08 +0800 Subject: CGI search post. --- _site/log/site-search/index.html | 338 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 338 insertions(+) create mode 100644 _site/log/site-search/index.html (limited to '_site/log/site-search/index.html') diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html new file mode 100644 index 0000000..3c32ba3 --- /dev/null +++ b/_site/log/site-search/index.html @@ -0,0 +1,338 @@ + + + + + + Site search with Perl + CGI + SA + + + + + + + + + + + +
+
+
+

SITE SEARCH WITH PERL + CGI + SA

+
02 JANUARY 2026
+
+

Number of articles on the site is growing. Need a way to find them.

+ +

Search must match substrings, be case-insensitive, fast, and secure. Site must +continue to work without JavaScript, on text web browsers–so no client-side +search. Page rank not applicable.

+ +

Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base +system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the +results back over FastCGI.

+ +

Data structure. Suffix array. Build-time indexer generates three files: +corpus.bin–articles, sa.bin-suffix array, file_map.dat–article metadata +(title, path).

+ +

Indexer

+ +

Crawl the directory containing posts, extract HTML with regex +(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case +the text (for case-insensitive search), convert to raw octets, concatenate.

+ +

Sort the text lexicographically, and store their byte offsets in sa.bin:

+ +
# Use a block that forces byte-level comparison
+{
+    use bytes; 
+    @sa = sort { 
+        # First 64 bytes check (fast path)
+        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
+        # Full string fallback (required for correctness)
+        (substr($corpus, $a) cmp substr($corpus, $b))
+    } @sa;
+}
+
+ +

This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the +suffix count, L the median length of suffixes. The fast path caps L at 64 bytes +for most cases, bringing complexity down to O(NlogN)–64-byte length chosen to +fit in cache lines.

+ +
CORPUS:   a  s  c  i \0  i  m  x
+OFFSET:   0  1  2  3  4  5  6  7
+
+SORTED SUFFIXES:
+[4] \0imx
+[0] asci\0imx
+[2] ci\0imx
+[3] i\0imx       <-- "i" from "asci"
+[5] imx          <-- "i" from "imx"
+[6] mx
+[1] sci\0imx
+[7] x
+
+ +

Sentinel (null byte) must be lexicographically smaller than every other +character in the alphabet (offset 3 and 5 handled correctly at the document +boundary).

+ +

Benchmarks.

+ +

Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM) +i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.

+ +

All articles are 16 KB (2x my avg post size) documents generated with seed.sh.

+ +

Total Time: 0.1475 seconds
+Files Processed: 500
+File Sizes:
+   corpus.bin       33.59 KB
+   sa.bin          134.34 KB
+   file_map.dat     37.01 KB
+   Total Index:    204.94 KB

+ +

Total Time: 0.3101 seconds
+Files Processed: 1000
+File Sizes:
+   corpus.bin       67.28 KB
+   sa.bin          269.11 KB
+   file_map.dat     74.12 KB
+   Total Index:    410.51 KB

+ +

Total Time: 10.9661 seconds
+Files Processed: 10000
+File Sizes:
+   corpus.bin      682.51 KB
+   sa.bin         2730.05 KB
+   file_map.dat    750.88 KB
+   Total Index:   4163.44 KB

+ +

There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the +index is very manageable.

+ + + +

Need to perform a range query to find offsets of all matches. Suffix +array is sorted–use binary search.

+ +
# left bound
+while ($low <= $high) {
+    # Int overflow: Perl handles by promoting to float.
+    my $mid = int(($low + $high) / 2);
+    seek($fh_sa, $mid * 4, 0);
+    read($fh_sa, my $bin_off, 4);
+    my $off = unpack("L", $bin_off);
+    seek($fh_cp, $off, 0);
+    read($fh_cp, my $text, $query_len);
+        
+    my $cmp = $text cmp $query;
+    if ($cmp >= 0) {
+        $first_hit = $mid if $cmp == 0;
+        $high = $mid - 1;
+    } else {
+        $low = $mid + 1;
+    }
+}
+
+# Continue with right bound...
+
+ +

Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to +seek/read exact chunks into memory. Since the FastCGI script is invoked per +query, small, frequent allocations are better than less frequent large +allocations (see bm_*.txt).

+ +

Once the range is found through binary search, the results are collected using +a simple for loop. Max result count is capped at 20.

+ +

Benchmarks. Search for ‘arduino’ (expect 0 matches). Left: SA-based index +search. Right: Naive search using regex.

+ +

500 articles:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
Files Read:3500
User CPU (s):0.01000.0600
System CPU (s):0.02000.0200
Total Time (s):0.00120.0407
Peak RAM (KB):88289136
+ +

1,000 articles:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
Files Read:31000
User CPU (s):0.02000.0500
System CPU (s):0.01000.0500
Total Time (s):0.00190.0795
Peak RAM (KB):89809460
+ +

10,000 articles:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
Files Read:310000
User CPU (s):0.03000.4000
System CPU (s):0.03000.5400
Total Time (s):0.01610.9120
Peak RAM (KB):1250412804
+ +

Note: Total time < User + System CPU time is likely a measurement artifact–the +script finishes faster than the OS tick interval.

+ +

Security

+ +

Memory usage remains a concern. Parallel queries could exhaust memory. Limit +parallel searches using a semaphore implemented with lock files:

+ +
my $max_parallel   = 50;
+
+mkdir $lock_dir, 0777 unless -d $lock_dir;
+my $active_count = 0;
+my $now = time();
+
+opendir(my $dh, $lock_dir);
+while (my $file = readdir($dh)) {
+    next unless $file =~ /\.lock$/;
+    my $path = "$lock_dir/$file";
+    my $mtime = (stat($path))[9] || 0;
+    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
+}
+closedir($dh);
+
+ +

XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).

+ +

ReDOS: Sanitize search text by tossing non-printable characters, limit length, +work within utf-8. Use \Q...\E in searches.

+ +
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
+    $search_text = $1;
+    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
+    $search_text = decode_utf8($search_text // "");
+    $search_text =~ s/\P{Print}//g; 
+    $search_text = substr($search_text, 0, 64);
+    $search_text =~ s/^\s+|\s+$//g;
+}
+
+ +

Command injection: no exec()/system() calls. Non-privileged user. +Read-only directory.

+ +

Symlink attacks: paths are hard-coded. Script can’t be modified (554).

+ +

No API keys, passwords involved.

+ +

Last line of defence: chroot.

+ +

Can’t think of anything else.

+ +

Verdict: Fast SA index lookup. Mitigated primary attack vectors. No +dependencies. Works on every conceivable device.

+ +

Commit: +9889aca +| Benchmarks: +8a4da68

+
+ +
+
+
+ + + + + + -- cgit v1.2.3