From 15205d0cf770058b59be07e00f6dbc6523b9cede Mon Sep 17 00:00:00 2001
From: Sadeep Madurange
Date: Sat, 3 Jan 2026 13:06:08 +0800
Subject: CGI search post.
---
 _log/site-search.md              | 287 +++++++++++++++++++++++++++++++++
 _site/assets/css/main.css        |   2 +-
 _site/cgi-bin/find.cgi           |  34 ++--
 _site/index.html                 |  26 +--
 _site/log/index.html             |  13 ++
 _site/log/site-search/index.html | 338 +++++++++++++++++++++++++++++++++++++++
 _site/posts.xml                  |   1 +
 _site/sitemap.xml                |   4 +
 assets/css/main.css              |   2 +-
 9 files changed, 678 insertions(+), 29 deletions(-)
 create mode 100644 _log/site-search.md
 create mode 100644 _site/log/site-search/index.html

diff --git a/_log/site-search.md b/_log/site-search.md
new file mode 100644
index 0000000..61e4bba
--- /dev/null
+++ b/_log/site-search.md
@@ -0,0 +1,287 @@
---
title: Site search with Perl + CGI + SA
date: 2026-01-02
layout: post
---

Number of articles on the site is growing. Need a way to find them.

Search must match substrings, be case-insensitive, fast, and secure. Site must
continue to work without JavaScript, on text web browsers--so no client-side
search. Page rank not applicable.

Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base
system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
results back over FastCGI.

Data structure. Suffix array. A build-time indexer generates three files:
corpus.bin--articles, sa.bin--suffix array, file_map.dat--article metadata
(title, path).

## Indexer

Crawl the directory containing posts, extract the text from HTML with regex
(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
the text (for case-insensitive search), convert to raw octets, concatenate.
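The crawl-strip-fold-concatenate step can be sketched as a small routine. This is a hedged illustration, not the commit's indexer: `build_corpus` is a hypothetical helper that takes strings, while the real script reads files from disk.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# build_corpus: strip tags with a regex, case-fold, concatenate into one
# octet string with a NUL sentinel after each document. Returns the corpus
# plus the per-article offsets that end up in file_map.dat.
sub build_corpus {
    my (@docs) = @_;            # each: { path => ..., html => ... }
    my $corpus = '';
    my @file_map;
    for my $d (@docs) {
        my $html = $d->{html};
        $html =~ s/<[^>]+>/ /g;              # strip tags (regex, as promised)
        $html =~ s/\s+/ /g;                  # normalize whitespace
        $html =~ s/^\s+|\s+$//g;
        my $text  = encode_utf8(lc $html);   # lower-cased raw octets
        my $start = length $corpus;
        $corpus  .= $text . "\0";            # sentinel at the document boundary
        push @file_map, { path  => $d->{path},
                          start => $start,
                          end   => length $corpus };
    }
    return ($corpus, \@file_map);
}
```

Fed the two toy documents "Asci" and "IMX", this produces the corpus "asci\0imx\0" used in the worked example.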
Sort the suffixes lexicographically, and store their byte offsets in sa.bin:

```
# Use a block that forces byte-level comparison
{
    use bytes;
    @sa = sort {
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```

This is the slowest part of the indexer, running at ~O(L⋅N log N), where N is
the suffix count and L the median suffix length. The fast path caps L at 64
bytes in most cases, bringing the complexity down to ~O(N log N)--64 bytes
chosen to fit a cache line.

```
CORPUS:   a  s  c  i \0  i  m  x
OFFSET:   0  1  2  3  4  5  6  7

SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx       <-- "i" from "asci"
[5] imx          <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
```

The sentinel (null byte) must be lexicographically smaller than every other
character in the alphabet, so offsets 3 and 5 sort correctly at the document
boundary.

Benchmarks.

Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM)
i7-10510U @ 1.80GHz. AMD64. 15 GB memory.

All articles are 16 KB (2x my avg post size) documents generated with seed.sh.

Total Time: 0.1475 seconds
Files Processed: 500
File Sizes:
   corpus.bin       33.59 KB
   sa.bin          134.34 KB
   file_map.dat     37.01 KB
   Total Index:    204.94 KB

Total Time: 0.3101 seconds
Files Processed: 1000
File Sizes:
   corpus.bin       67.28 KB
   sa.bin          269.11 KB
   file_map.dat     74.12 KB
   Total Index:    410.51 KB

Total Time: 10.9661 seconds
Files Processed: 10000
File Sizes:
   corpus.bin      682.51 KB
   sa.bin         2730.05 KB
   file_map.dat    750.88 KB
   Total Index:   4163.44 KB
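The index files themselves are plain packed binary. sa.bin in particular is the sorted offsets written back-to-back as 32-bit unsigned ints, which is what lets the search script seek straight to $i * 4. A sketch of a writer--`write_sa` is a made-up name, and the commit's indexer may differ:

```perl
use strict;
use warnings;

# write_sa: serialize sorted suffix offsets as native 32-bit unsigned
# ints. The "L" template matches the unpack("L", ...) used at search
# time, so entry $i lives at byte offset $i * 4.
sub write_sa {
    my ($path, @sa) = @_;
    open(my $fh, '>:raw', $path) or die "open $path: $!";
    print $fh pack("L*", @sa);
    close($fh) or die "close $path: $!";
}
```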
There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the
index is very manageable.

## Search

Need to perform a range query to find the offsets of all matches. The suffix
array is sorted--use binary search.

```
# left bound
while ($low <= $high) {
    # Int overflow: Perl handles it by promoting to float.
    my $mid = int(($low + $high) / 2);
    seek($fh_sa, $mid * 4, 0);
    read($fh_sa, my $bin_off, 4);
    my $off = unpack("L", $bin_off);
    seek($fh_cp, $off, 0);
    read($fh_cp, my $text, $query_len);

    my $cmp = $text cmp $query;
    if ($cmp >= 0) {
        $first_hit = $mid if $cmp == 0;
        $high = $mid - 1;
    } else {
        $low = $mid + 1;
    }
}

# Continue with right bound...
```

Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to
seek/read exact chunks into memory. Since the FastCGI script is invoked per
query, small, frequent allocations are better than less frequent large
allocations (see bm_*.txt).

Once the range is found through binary search, the results are collected with a
simple for loop. Max result count is capped at 20.

Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index
search. Right: Naive search using regex.

500 articles:

<table>
  <tr><td>Files Read:</td><td>3</td><td>500</td></tr>
  <tr><td>User CPU (s):</td><td>0.0100</td><td>0.0600</td></tr>
  <tr><td>System CPU (s):</td><td>0.0200</td><td>0.0200</td></tr>
  <tr><td>Total Time (s):</td><td>0.0012</td><td>0.0407</td></tr>
  <tr><td>Peak RAM (KB):</td><td>8828</td><td>9136</td></tr>
</table>
1,000 articles:

<table>
  <tr><td>Files Read:</td><td>3</td><td>1000</td></tr>
  <tr><td>User CPU (s):</td><td>0.0200</td><td>0.0500</td></tr>
  <tr><td>System CPU (s):</td><td>0.0100</td><td>0.0500</td></tr>
  <tr><td>Total Time (s):</td><td>0.0019</td><td>0.0795</td></tr>
  <tr><td>Peak RAM (KB):</td><td>8980</td><td>9460</td></tr>
</table>
10,000 articles:

<table>
  <tr><td>Files Read:</td><td>3</td><td>10000</td></tr>
  <tr><td>User CPU (s):</td><td>0.0300</td><td>0.4000</td></tr>
  <tr><td>System CPU (s):</td><td>0.0300</td><td>0.5400</td></tr>
  <tr><td>Total Time (s):</td><td>0.0161</td><td>0.9120</td></tr>
  <tr><td>Peak RAM (KB):</td><td>12504</td><td>12804</td></tr>
</table>
Note: Total time < User + System CPU time is likely a measurement artifact--the
script finishes faster than the OS tick interval.

## Security

Memory usage remains a concern. Parallel queries could exhaust memory. Limit
parallel searches using a semaphore implemented with lock files:

```
my $max_parallel = 50;

mkdir $lock_dir, 0777 unless -d $lock_dir;
my $active_count = 0;
my $now = time();

opendir(my $dh, $lock_dir);
while (my $file = readdir($dh)) {
    next unless $file =~ /\.lock$/;
    my $path = "$lock_dir/$file";
    my $mtime = (stat($path))[9] || 0;
    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
}
closedir($dh);
```

XSS attacks. Escape all HTML output using HTML::Escape qw(escape_html).

ReDoS: Sanitize the search text by tossing non-printable characters, limit its
length, work within UTF-8. Use `\Q...\E` in searches.

```
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
    $search_text = $1;
    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    $search_text = decode_utf8($search_text // "");
    $search_text =~ s/\P{Print}//g;
    $search_text = substr($search_text, 0, 64);
    $search_text =~ s/^\s+|\s+$//g;
}
```

Command injection: no exec()/system() calls. Non-privileged user.
Read-only directory.

Symlink attacks: paths are hard-coded. Script can't be modified (mode 554).

No API keys, passwords involved.

Last line of defence: chroot.

Can't think of anything else.

Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
dependencies. Works on every conceivable device.
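The right-bound half of the range query elided above ("Continue with right bound...") is symmetric to the left bound. Here is an in-memory sketch of the whole range query--a hypothetical `sa_range` helper over a corpus string and an offset array, not find.cgi's seek/read version:

```perl
use strict;
use warnings;

# sa_range: return the [first, last] suffix-array indices whose suffixes
# start with $query, or (-1, -1) when there is no match.
sub sa_range {
    my ($corpus, $sa, $query) = @_;
    my $qlen = length $query;
    my ($first, $last) = (-1, -1);

    # Left bound: smallest index whose prefix compares >= query.
    my ($low, $high) = (0, $#$sa);
    while ($low <= $high) {
        my $mid  = int(($low + $high) / 2);
        my $text = substr($corpus, $sa->[$mid], $qlen);
        if (($text cmp $query) >= 0) {
            $first = $mid if $text eq $query;
            $high = $mid - 1;
        } else {
            $low = $mid + 1;
        }
    }

    # Right bound: largest index whose prefix compares <= query.
    ($low, $high) = (0, $#$sa);
    while ($low <= $high) {
        my $mid  = int(($low + $high) / 2);
        my $text = substr($corpus, $sa->[$mid], $qlen);
        if (($text cmp $query) <= 0) {
            $last = $mid if $text eq $query;
            $low = $mid + 1;
        } else {
            $high = $mid - 1;
        }
    }
    return ($first, $last);
}
```

On the example corpus "asci\0imx" with suffix array (4, 0, 2, 3, 5, 6, 1, 7), the query "i" yields the range [3, 4]--the two suffixes "i\0imx" and "imx".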
+ +Commit: +[9889aca](https://git.asciimx.com/www/commit/?h=term&id=9889acaac984817731ad9f9355f067f9acd02efd) +| Benchmarks: +[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9) diff --git a/_site/assets/css/main.css b/_site/assets/css/main.css index 091baf3..1b4341b 100644 --- a/_site/assets/css/main.css +++ b/_site/assets/css/main.css @@ -62,7 +62,7 @@ table { } table, tr, td { - border: none; + border: none !important; } td { diff --git a/_site/cgi-bin/find.cgi b/_site/cgi-bin/find.cgi index 5f95e3a..ab066dd 100644 --- a/_site/cgi-bin/find.cgi +++ b/_site/cgi-bin/find.cgi @@ -7,7 +7,7 @@ use Encode qw(decode_utf8 encode_utf8); use URI::Escape qw(uri_unescape); use HTML::Escape qw(escape_html); -# --- Configuration --- +# Configuration my $max_parallel = 50; # Max parallel search requests my $lock_timeout = 30; # Seconds before dropping stale locks my $max_results = 20; # Max search results to display @@ -16,7 +16,7 @@ my $cp_file = 'corpus.bin'; # Raw text corpus my $map_file = 'file_map.dat'; # File metadata my $lock_dir = '/tmp/search_locks'; # Semaphore directory -# --- Concurrency Control --- +# Concurrency control mkdir $lock_dir, 0777 unless -d $lock_dir; my $active_count = 0; my $now = time(); @@ -45,7 +45,7 @@ if ($active_count >= $max_parallel) { my $lock_file = "$lock_dir/$$.lock"; open(my $fh_lock, '>', $lock_file); -# --- Query Decoding --- +# Query decoding if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) { my $raw_q = $1; $raw_q =~ tr/+/ /; @@ -64,7 +64,7 @@ if ($search_text eq '') { final_output("

Please enter a search term above.

"); } -# --- Binary Search Logic --- +# Binary search my @results; my $query = encode_utf8(lc($search_text)); my $query_len = length($query); @@ -130,38 +130,44 @@ if (-f $sa_file && -f $cp_file) { foreach my $m (@$file_map) { if ($offset >= $m->{start} && $offset < $m->{end}) { if (!$seen{$m->{path}}++) { - # 1. Capture slightly more than 50 chars for trimming + # Capture more than 50 chars for trimming my $snip_start = ($offset - 30 < $m->{start}) ? $m->{start} : $offset - 30; + my $max_len = $m->{end} - $snip_start; + my $read_len = ($max_len > 120) ? 120 : $max_len; seek($fh_cp, $snip_start, 0); - read($fh_cp, my $raw_snip, 120); + read($fh_cp, my $raw_snip, $read_len); my $snippet = decode_utf8($raw_snip, Encode::FB_QUIET) // $raw_snip; $snippet =~ s/\s+/ /g; # Normalize whitespace - # 2. Trim Start: Partial word removal + # Trim start: Partial word removal if ($snip_start > $m->{start}) { $snippet =~ s/^[^\s]*\s//; } - # 3. Trim End: Length limit and partial word removal + # Trim end: Length limit and partial word removal my $has_more = 0; if (length($snippet) > 50) { $snippet = substr($snippet, 0, 50); $has_more = 1 if $snippet =~ s/\s+[^\s]*$//; } + elsif ($snip_start + $read_len < $m->{end}) { + # This check handles snippets that are naturally short but + # there's still more text in the article we didn't read + $has_more = 1; + } - # 4. Cleanup & Capitalize + # Cleanup & capitalize $snippet = ucfirst($snippet); + $snippet = escape_html($snippet) . ($has_more ? "..." : ""); + my $clean_path = $m->{path}; $clean_path =~ s|^\.\./_site/||; - # 5. Build Final Snippet - my $display_snippet = escape_html($snippet) . ($has_more ? "..." 
: ""); - push @results, { path => $clean_path, - title => (split('/', $m->{path}))[-2], - snippet => $display_snippet + title => $m->{title},, + snippet => $snippet }; } last; diff --git a/_site/index.html b/_site/index.html index 90cbbfd..57ba207 100644 --- a/_site/index.html +++ b/_site/index.html @@ -59,6 +59,19 @@ + + + Site search with Perl + CGI + SA + + + + + + + + + + Matrix Rain: 2025 refactor @@ -176,19 +189,6 @@ - - - Neo4j shortest path optimization - - - - - - - - - - diff --git a/_site/log/index.html b/_site/log/index.html index 32fbf8d..813832d 100644 --- a/_site/log/index.html +++ b/_site/log/index.html @@ -49,6 +49,19 @@ + + + Site search with Perl + CGI + SA + + + + + + + + + + Matrix Rain: 2025 refactor diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html new file mode 100644 index 0000000..3c32ba3 --- /dev/null +++ b/_site/log/site-search/index.html @@ -0,0 +1,338 @@ + + + + + + Site search with Perl + CGI + SA + + + + + + + + + + + +
+ + + + + + diff --git a/_site/posts.xml b/_site/posts.xml index d327a7b..0136478 100644 --- a/_site/posts.xml +++ b/_site/posts.xml @@ -1 +1,2 @@ Jekyll2026-01-02T19:06:26+08:00/posts.xmlASCIIMXW. D. Sadeep Madurange +Jekyll2026-01-03T12:59:44+08:00/posts.xmlASCIIMXW. D. Sadeep Madurange diff --git a/_site/sitemap.xml b/_site/sitemap.xml index b799547..c5fe837 100644 --- a/_site/sitemap.xml +++ b/_site/sitemap.xml @@ -41,6 +41,10 @@ 2025-12-21T00:00:00+08:00 +/log/site-search/ +2026-01-02T00:00:00+08:00 + + /about/ diff --git a/assets/css/main.css b/assets/css/main.css index 091baf3..1b4341b 100644 --- a/assets/css/main.css +++ b/assets/css/main.css @@ -62,7 +62,7 @@ table { } table, tr, td { - border: none; + border: none !important; } td { -- cgit v1.2.3