From 16fe66dd83cbffa18af31676a380660ebce4e827 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Sat, 3 Jan 2026 19:24:25 +0800 Subject: Polish the search engine post. --- _site/log/site-search/index.html | 274 ++++----------------------------------- 1 file changed, 25 insertions(+), 249 deletions(-) (limited to '_site/log/site-search/index.html') diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html index 65e69e0..04f98e9 100644 --- a/_site/log/site-search/index.html +++ b/_site/log/site-search/index.html @@ -3,7 +3,7 @@ - Site search with Perl + CGI + SA + Perl + FastCGI + SA search engine @@ -37,30 +37,20 @@
-

SITE SEARCH WITH PERL + CGI + SA

+

PERL + FASTCGI + SA SEARCH ENGINE

02 JANUARY 2026

-

Number of articles on the site is growing. Need a way to find them.

+

Number of articles growing. Need search.

-

Search must match substrings, be case-insensitive, fast, and secure. Site must -continue to work without JavaScript, on text web browsers–so no client-side -search. Page rank not applicable.

+

Requirements: substring match, case-insensitive, fast, secure. No JavaScript.

-

Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base -system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the -results back over FastCGI.

+

Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.

-

Data structure. Suffix array. Build-time indexer generates three files: -corpus.bin–articles, sa.bin-suffix array, file_map.dat–article metadata -(title, path).

+

Data structure: suffix array. Three files: corpus.bin (articles), sa.bin +(sorted byte offsets), file_map.dat (metadata).

-

Indexer

- -

Crawl the directory containing posts, extract HTML with regex -(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case -the text (for case-insensitive search), convert to raw octets, concatenate.

- -

Sort the text lexicographically, and store their byte offsets in sa.bin:

+

Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null +byte sentinel for document boundaries. Sort lexicographically::

# Use a block that forces byte-level comparison
 {
@@ -74,243 +64,29 @@ the text (for case-insensitive search), convert to raw octets, concatenate.

}
-

This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the -suffix count, L the median length of suffixes. The fast path caps L at 64 bytes -for most cases, bringing complexity down to O(NlogN)–64-byte length chosen to -fit in cache lines.

- -
CORPUS:   a  s  c  i \0  i  m  x
-OFFSET:   0  1  2  3  4  5  6  7
-
-SORTED SUFFIXES:
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx       <-- "i" from "asci"
-[5] imx          <-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
-
- -

Sentinel (null byte) must be lexicographically smaller than every other -character in the alphabet (offset 3 and 5 handled correctly at the document -boundary).

- -

Benchmarks.

- -

Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM) -i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.

- -

All articles are 16 KB (2x my avg post size) documents generated with seed.sh.

- -

Total Time: 0.1475 seconds
-Files Processed: 500
-File Sizes:
-   corpus.bin       33.59 KB
-   sa.bin          134.34 KB
-   file_map.dat     37.01 KB
-   Total Index:    204.94 KB

- -

Total Time: 0.3101 seconds
-Files Processed: 1000
-File Sizes:
-   corpus.bin       67.28 KB
-   sa.bin          269.11 KB
-   file_map.dat     74.12 KB
-   Total Index:    410.51 KB

- -

Total Time: 10.9661 seconds
-Files Processed: 10000
-File Sizes:
-   corpus.bin      682.51 KB
-   sa.bin         2730.05 KB
-   file_map.dat    750.88 KB
-   Total Index:   4163.44 KB

- -

There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the -index is very manageable.

- - - -

Need to perform a range query to find offsets of all matches. Suffix -array is sorted–use binary search.

- -
# left bound
-while ($low <= $high) {
-    # Int overflow: Perl handles by promoting to float.
-    my $mid = int(($low + $high) / 2);
-    seek($fh_sa, $mid * 4, 0);
-    read($fh_sa, my $bin_off, 4);
-    my $off = unpack("L", $bin_off);
-    seek($fh_cp, $off, 0);
-    read($fh_cp, my $text, $query_len);
-        
-    my $cmp = $text cmp $query;
-    if ($cmp >= 0) {
-        $first_hit = $mid if $cmp == 0;
-        $high = $mid - 1;
-    } else {
-        $low = $mid + 1;
-    }
-}
-
-# Continue with right bound...
-
- -

Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to -seek/read exact chunks into memory. Since the FastCGI script is invoked per -query, small, frequent allocations are better than less frequent large -allocations (see bm_*.txt).

- -

Once the range is found through binary search, the results are collected using -a simple for loop. Max result count is capped at 20.

- -

Benchmarks. Search for ‘arduino’ (expect 0 matches). Left: SA-based index -search. Right: Naive search using regex.

- -

500 articles:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:3500
User CPU (s):0.01000.0600
System CPU (s):0.02000.0200
Total Time (s):0.00120.0407
Peak RAM (KB):88289136
- -

1,000 articles:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:31000
User CPU (s):0.02000.0500
System CPU (s):0.01000.0500
Total Time (s):0.00190.0795
Peak RAM (KB):89809460
- -

10,000 articles:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:310000
User CPU (s):0.03000.4000
System CPU (s):0.03000.5400
Total Time (s):0.01610.9120
Peak RAM (KB):1250412804
- -

Note: Total time < User + System CPU time is likely a measurement artifact–the -script finishes faster than the OS tick interval.

- -

Security

- -

Memory usage remains a concern. Parallel queries could exhaust memory. Limit -parallel searches using a semaphore implemented with lock files:

- -
my $max_parallel   = 50;
-
-mkdir $lock_dir, 0777 unless -d $lock_dir;
-my $active_count = 0;
-my $now = time();
-
-opendir(my $dh, $lock_dir);
-while (my $file = readdir($dh)) {
-    next unless $file =~ /\.lock$/;
-    my $path = "$lock_dir/$file";
-    my $mtime = (stat($path))[9] || 0;
-    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
-}
-closedir($dh);
-
- -

XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).

- -

ReDOS: Sanitize search text by tossing non-printable characters, limit length, -work within utf-8. Use \Q...\E in searches.

+

Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte +length targets cache lines.

-
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
-    $search_text = $1;
-    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
-    $search_text = decode_utf8($search_text // "");
-    $search_text =~ s/\P{Print}//g; 
-    $search_text = substr($search_text, 0, 64);
-    $search_text =~ s/^\s+|\s+$//g;
-}
-
+

Search: binary search for range query. Cap at 20 results–define limits or be +surprised by them.

-

Command injection: no exec()/system() calls. Non-privileged user. -Read-only directory.

+

File IO and memory: many seek/read small chunks beat one large allocation (see +benchmarks for find_one_file.cgi).

-

Symlink attacks: paths are hard-coded. Script can’t be modified (554).

+

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):

-

No API keys, passwords involved.

+

1,000 files: 0.31s indexing, 410 KB total index.
+10,000 files: 10.97s indexing, 4.16 MB total index.

-

Last line of defence: chroot.

+

Search ‘arduino’ (0 matches):
+1,000 files: 0.002s (SA) vs 0.080s (naive regex).
+10,000 files: 0.016s (SA) vs 0.912s (naive regex).

-

Can’t think of anything else.

+

Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). +Sanitize input–strip non-printables, limit length, and quote metacharacters +(ReDOS). No exec/system (command injection). Chroot.

-

Verdict: Fast SA index lookup. Mitigated primary attack vectors. No -dependencies. Works on every conceivable device.

+

Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.

Commit: 6da102d -- cgit v1.2.3