From 16fe66dd83cbffa18af31676a380660ebce4e827 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Sat, 3 Jan 2026 19:24:25 +0800 Subject: Polish the search engine post. --- _log/site-search.md | 276 ++++----------------------------------- _site/feed.xml | 2 +- _site/index.html | 2 +- _site/log/site-search/index.html | 274 ++++---------------------------------- _site/posts.xml | 2 +- 5 files changed, 52 insertions(+), 504 deletions(-) diff --git a/_log/site-search.md b/_log/site-search.md index 2a552f6..6af9f63 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -1,30 +1,20 @@ --- -title: Site search with Perl + CGI + SA +title: Perl + FastCGI + SA search engine date: 2026-01-02 layout: post --- -Number of articles on the site is growing. Need a way to find them. +Number of articles growing. Need search. -Search must match substrings, be case-insensitive, fast, and secure. Site must -continue to work without JavaScript, on text web browsers--so no client-side -search. Page rank not applicable. +Requirements: substring match, case-insensitive, fast, secure. No JavaScript. -Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base -system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the -results back over FastCGI. +Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script. -Data structure. Suffix array. Build-time indexer generates three files: -corpus.bin--articles, sa.bin-suffix array, file_map.dat--article metadata -(title, path). +Data structure: suffix array. Three files: corpus.bin (articles), sa.bin +(sorted byte offsets), file_map.dat (metadata). -## Indexer - -Crawl the directory containing posts, extract HTML with regex -(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case -the text (for case-insensitive search), convert to raw octets, concatenate. - -Sort the text lexicographically, and store their byte offsets in sa.bin: +Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null +byte sentinel for document boundaries. Sort lexicographically:: ``` # Use a block that forces byte-level comparison @@ -39,247 +29,29 @@ Sort the text lexicographically, and store their byte offsets in sa.bin: } ``` -This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the -suffix count, L the median length of suffixes. The fast path caps L at 64 bytes -for most cases, bringing complexity down to O(NlogN)--64-byte length chosen to -fit in cache lines. - -``` -CORPUS: a s c i \0 i m x -OFFSET: 0 1 2 3 4 5 6 7 - -SORTED SUFFIXES: -[4] \0imx -[0] asci\0imx -[2] ci\0imx -[3] i\0imx <-- "i" from "asci" -[5] imx <-- "i" from "imx" -[6] mx -[1] sci\0imx -[7] x -``` - -Sentinel (null byte) must be lexicographically smaller than every other -character in the alphabet (offset 3 and 5 handled correctly at the document -boundary). - -Benchmarks. - -Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM) -i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory. - -All articles are 16 KB (2x my avg post size) documents generated with seed.sh. - -Total Time: 0.1475 seconds
-Files Processed: 500
-File Sizes:
-   corpus.bin       33.59 KB
-   sa.bin          134.34 KB
-   file_map.dat     37.01 KB
-   Total Index:    204.94 KB
- -Total Time: 0.3101 seconds
-Files Processed: 1000
-File Sizes:
-   corpus.bin       67.28 KB
-   sa.bin          269.11 KB
-   file_map.dat     74.12 KB
-   Total Index:    410.51 KB
- -Total Time: 10.9661 seconds
-Files Processed: 10000
-File Sizes:
-   corpus.bin      682.51 KB
-   sa.bin         2730.05 KB
-   file_map.dat    750.88 KB
-   Total Index:   4163.44 KB
- -There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the -index is very manageable. - -## Search - -Need to perform a range query to find offsets of all matches. Suffix -array is sorted--use binary search. - -``` -# left bound -while ($low <= $high) { - # Int overflow: Perl handles by promoting to float. - my $mid = int(($low + $high) / 2); - seek($fh_sa, $mid * 4, 0); - read($fh_sa, my $bin_off, 4); - my $off = unpack("L", $bin_off); - seek($fh_cp, $off, 0); - read($fh_cp, my $text, $query_len); - - my $cmp = $text cmp $query; - if ($cmp >= 0) { - $first_hit = $mid if $cmp == 0; - $high = $mid - 1; - } else { - $low = $mid + 1; - } -} - -# Continue with right bound... -``` - -Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to -seek/read exact chunks into memory. Since the FastCGI script is invoked per -query, small, frequent allocations are better than less frequent large -allocations (see bm_*.txt). - -Once the range is found through binary search, the results are collected using -a simple for loop. Max result count is capped at 20. - -Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index -search. Right: Naive search using regex. - -500 articles: - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:3500
User CPU (s):0.01000.0600
System CPU (s):0.02000.0200
Total Time (s):0.00120.0407
Peak RAM (KB):88289136
- -1,000 articles: - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:31000
User CPU (s):0.02000.0500
System CPU (s):0.01000.0500
Total Time (s):0.00190.0795
Peak RAM (KB):89809460
- -10,000 articles: - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:310000
User CPU (s):0.03000.4000
System CPU (s):0.03000.5400
Total Time (s):0.01610.9120
Peak RAM (KB):1250412804
- -Note: Total time < User + System CPU time is likely a measurement artifact--the -script finishes faster than the OS tick interval. - -## Security - -Memory usage remains a concern. Parallel queries could exhaust memory. Limit -parallel searches using a semaphore implemented with lock files: - -``` -my $max_parallel = 50; - -mkdir $lock_dir, 0777 unless -d $lock_dir; -my $active_count = 0; -my $now = time(); - -opendir(my $dh, $lock_dir); -while (my $file = readdir($dh)) { - next unless $file =~ /\.lock$/; - my $path = "$lock_dir/$file"; - my $mtime = (stat($path))[9] || 0; - ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++; -} -closedir($dh); -``` - -XSS attacks. Escape all HTML using HTML::Escape qw(escape_html). - -ReDOS: Sanitize search text by tossing non-printable characters, limit length, -work within utf-8. Use `\Q...\E` in searches. +Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte +length targets cache lines. -``` -if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) { - $search_text = $1; - $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; - $search_text = decode_utf8($search_text // ""); - $search_text =~ s/\P{Print}//g; - $search_text = substr($search_text, 0, 64); - $search_text =~ s/^\s+|\s+$//g; -} -``` +Search: binary search for range query. Cap at 20 results--define limits or be +surprised by them. -Command injection: no exec()/system() calls. Non-privileged user. -Read-only directory. +File IO and memory: many seek/read small chunks beat one large allocation (see +benchmarks for find_one_file.cgi). -Symlink attacks: paths are hard-coded. Script can't be modified (554). +Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles): -No API keys, passwords involved. +1,000 files: 0.31s indexing, 410 KB total index. +10,000 files: 10.97s indexing, 4.16 MB total index. -Last line of defence: chroot. +Search 'arduino' (0 matches): +1,000 files: 0.002s (SA) vs 0.080s (naive regex). +10,000 files: 0.016s (SA) vs 0.912s (naive regex). -Can't think of anything else. +Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). +Sanitize input--strip non-printables, limit length, and quote metacharacters +(ReDOS). No exec/system (command injection). Chroot. -Verdict: Fast SA index lookup. Mitigated primary attack vectors. No -dependencies. Works on every conceivable device. +Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies. Commit: [6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e) diff --git a/_site/feed.xml b/_site/feed.xml index 5026409..4070b39 100644 --- a/_site/feed.xml +++ b/_site/feed.xml @@ -1 +1 @@ -Jekyll2026-01-03T16:04:13+08:00/feed.xmlASCIIMX | LogW. D. Sadeep MadurangeSite search with Perl + CGI + SA2026-01-02T00:00:00+08:002026-01-02T00:00:00+08:00/log/site-searchW. D. Sadeep MadurangeMatrix Rain: 2025 refactor2025-12-21T00:00:00+08:002025-12-21T00:00:00+08:00/log/matrix-digital-rainW. D. Sadeep MadurangeFingerprint door lock (LP)2025-08-18T00:00:00+08:002025-08-18T00:00:00+08:00/log/fpm-door-lock-lpW. D. Sadeep MadurangeHigh-side MOSFET switching2025-06-22T00:00:00+08:002025-06-22T00:00:00+08:00/log/mosfet-switchesW. D. Sadeep MadurangeATmega328P at 3.3V and 5V2025-06-10T00:00:00+08:002025-06-10T00:00:00+08:00/log/arduino-unoW. D. Sadeep MadurangeFingerprint door lock (RF)2025-06-05T00:00:00+08:002025-06-05T00:00:00+08:00/log/fpm-door-lock-rfW. D. Sadeep MadurangeBumblebee: browser automation2025-04-02T00:00:00+08:002025-04-02T00:00:00+08:00/log/bumblebeeW. D. Sadeep MadurangeBare-metal ATSAM3X8E2024-09-16T00:00:00+08:002024-09-16T00:00:00+08:00/log/arduino-dueW. D. Sadeep MadurangeEtlas: e-paper dashboard2024-09-05T00:00:00+08:002024-09-05T00:00:00+08:00/log/etlasW. D. Sadeep MadurangeESP32 e-reader prototype2023-10-24T00:00:00+08:002023-10-24T00:00:00+08:00/log/e-readerW. D. Sadeep Madurange \ No newline at end of file +Jekyll2026-01-03T19:18:57+08:00/feed.xmlASCIIMX | LogW. D. Sadeep MadurangePerl + FastCGI + SA search engine2026-01-02T00:00:00+08:002026-01-02T00:00:00+08:00/log/site-searchW. D. Sadeep MadurangeMatrix Rain: 2025 refactor2025-12-21T00:00:00+08:002025-12-21T00:00:00+08:00/log/matrix-digital-rainW. D. Sadeep MadurangeFingerprint door lock (LP)2025-08-18T00:00:00+08:002025-08-18T00:00:00+08:00/log/fpm-door-lock-lpW. D. Sadeep MadurangeHigh-side MOSFET switching2025-06-22T00:00:00+08:002025-06-22T00:00:00+08:00/log/mosfet-switchesW. D. Sadeep MadurangeATmega328P at 3.3V and 5V2025-06-10T00:00:00+08:002025-06-10T00:00:00+08:00/log/arduino-unoW. D. Sadeep MadurangeFingerprint door lock (RF)2025-06-05T00:00:00+08:002025-06-05T00:00:00+08:00/log/fpm-door-lock-rfW. D. Sadeep MadurangeBumblebee: browser automation2025-04-02T00:00:00+08:002025-04-02T00:00:00+08:00/log/bumblebeeW. D. Sadeep MadurangeBare-metal ATSAM3X8E2024-09-16T00:00:00+08:002024-09-16T00:00:00+08:00/log/arduino-dueW. D. Sadeep MadurangeEtlas: e-paper dashboard2024-09-05T00:00:00+08:002024-09-05T00:00:00+08:00/log/etlasW. D. Sadeep MadurangeESP32 e-reader prototype2023-10-24T00:00:00+08:002023-10-24T00:00:00+08:00/log/e-readerW. D. Sadeep Madurange \ No newline at end of file diff --git a/_site/index.html b/_site/index.html index 9a10f2d..39a51a5 100644 --- a/_site/index.html +++ b/_site/index.html @@ -58,7 +58,7 @@ - Site search with Perl + CGI + SA + Perl + FastCGI + SA search engine diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html index 65e69e0..04f98e9 100644 --- a/_site/log/site-search/index.html +++ b/_site/log/site-search/index.html @@ -3,7 +3,7 @@ - Site search with Perl + CGI + SA + Perl + FastCGI + SA search engine @@ -37,30 +37,20 @@
-

SITE SEARCH WITH PERL + CGI + SA

+

PERL + FASTCGI + SA SEARCH ENGINE

02 JANUARY 2026

-

Number of articles on the site is growing. Need a way to find them.

+

Number of articles growing. Need search.

-

Search must match substrings, be case-insensitive, fast, and secure. Site must -continue to work without JavaScript, on text web browsers–so no client-side -search. Page rank not applicable.

+

Requirements: substring match, case-insensitive, fast, secure. No JavaScript.

-

Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base -system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the -results back over FastCGI.

+

Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.

-

Data structure. Suffix array. Build-time indexer generates three files: -corpus.bin–articles, sa.bin-suffix array, file_map.dat–article metadata -(title, path).

+

Data structure: suffix array. Three files: corpus.bin (articles), sa.bin +(sorted byte offsets), file_map.dat (metadata).

-

Indexer

- -

Crawl the directory containing posts, extract HTML with regex -(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case -the text (for case-insensitive search), convert to raw octets, concatenate.

- -

Sort the text lexicographically, and store their byte offsets in sa.bin:

+

Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null +byte sentinel for document boundaries. Sort lexicographically::

# Use a block that forces byte-level comparison
 {
@@ -74,243 +64,29 @@ the text (for case-insensitive search), convert to raw octets, concatenate.

}
-

This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the -suffix count, L the median length of suffixes. The fast path caps L at 64 bytes -for most cases, bringing complexity down to O(NlogN)–64-byte length chosen to -fit in cache lines.

- -
CORPUS:   a  s  c  i \0  i  m  x
-OFFSET:   0  1  2  3  4  5  6  7
-
-SORTED SUFFIXES:
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx       <-- "i" from "asci"
-[5] imx          <-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
-
- -

Sentinel (null byte) must be lexicographically smaller than every other -character in the alphabet (offset 3 and 5 handled correctly at the document -boundary).

- -

Benchmarks.

- -

Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM) -i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.

- -

All articles are 16 KB (2x my avg post size) documents generated with seed.sh.

- -

Total Time: 0.1475 seconds
-Files Processed: 500
-File Sizes:
-   corpus.bin       33.59 KB
-   sa.bin          134.34 KB
-   file_map.dat     37.01 KB
-   Total Index:    204.94 KB

- -

Total Time: 0.3101 seconds
-Files Processed: 1000
-File Sizes:
-   corpus.bin       67.28 KB
-   sa.bin          269.11 KB
-   file_map.dat     74.12 KB
-   Total Index:    410.51 KB

- -

Total Time: 10.9661 seconds
-Files Processed: 10000
-File Sizes:
-   corpus.bin      682.51 KB
-   sa.bin         2730.05 KB
-   file_map.dat    750.88 KB
-   Total Index:   4163.44 KB

- -

There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the -index is very manageable.

- - - -

Need to perform a range query to find offsets of all matches. Suffix -array is sorted–use binary search.

- -
# left bound
-while ($low <= $high) {
-    # Int overflow: Perl handles by promoting to float.
-    my $mid = int(($low + $high) / 2);
-    seek($fh_sa, $mid * 4, 0);
-    read($fh_sa, my $bin_off, 4);
-    my $off = unpack("L", $bin_off);
-    seek($fh_cp, $off, 0);
-    read($fh_cp, my $text, $query_len);
-        
-    my $cmp = $text cmp $query;
-    if ($cmp >= 0) {
-        $first_hit = $mid if $cmp == 0;
-        $high = $mid - 1;
-    } else {
-        $low = $mid + 1;
-    }
-}
-
-# Continue with right bound...
-
- -

Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to -seek/read exact chunks into memory. Since the FastCGI script is invoked per -query, small, frequent allocations are better than less frequent large -allocations (see bm_*.txt).

- -

Once the range is found through binary search, the results are collected using -a simple for loop. Max result count is capped at 20.

- -

Benchmarks. Search for ‘arduino’ (expect 0 matches). Left: SA-based index -search. Right: Naive search using regex.

- -

500 articles:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:3500
User CPU (s):0.01000.0600
System CPU (s):0.02000.0200
Total Time (s):0.00120.0407
Peak RAM (KB):88289136
- -

1,000 articles:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:31000
User CPU (s):0.02000.0500
System CPU (s):0.01000.0500
Total Time (s):0.00190.0795
Peak RAM (KB):89809460
- -

10,000 articles:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Files Read:310000
User CPU (s):0.03000.4000
System CPU (s):0.03000.5400
Total Time (s):0.01610.9120
Peak RAM (KB):1250412804
- -

Note: Total time < User + System CPU time is likely a measurement artifact–the -script finishes faster than the OS tick interval.

- -

Security

- -

Memory usage remains a concern. Parallel queries could exhaust memory. Limit -parallel searches using a semaphore implemented with lock files:

- -
my $max_parallel   = 50;
-
-mkdir $lock_dir, 0777 unless -d $lock_dir;
-my $active_count = 0;
-my $now = time();
-
-opendir(my $dh, $lock_dir);
-while (my $file = readdir($dh)) {
-    next unless $file =~ /\.lock$/;
-    my $path = "$lock_dir/$file";
-    my $mtime = (stat($path))[9] || 0;
-    ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
-}
-closedir($dh);
-
- -

XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).

- -

ReDOS: Sanitize search text by tossing non-printable characters, limit length, -work within utf-8. Use \Q...\E in searches.

+

Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte +length targets cache lines.

-
if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
-    $search_text = $1;
-    $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
-    $search_text = decode_utf8($search_text // "");
-    $search_text =~ s/\P{Print}//g; 
-    $search_text = substr($search_text, 0, 64);
-    $search_text =~ s/^\s+|\s+$//g;
-}
-
+

Search: binary search for range query. Cap at 20 results–define limits or be +surprised by them.

-

Command injection: no exec()/system() calls. Non-privileged user. -Read-only directory.

+

File IO and memory: many seek/read small chunks beat one large allocation (see +benchmarks for find_one_file.cgi).

-

Symlink attacks: paths are hard-coded. Script can’t be modified (554).

+

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):

-

No API keys, passwords involved.

+

1,000 files: 0.31s indexing, 410 KB total index.
+10,000 files: 10.97s indexing, 4.16 MB total index.

-

Last line of defence: chroot.

+

Search ‘arduino’ (0 matches):
+1,000 files: 0.002s (SA) vs 0.080s (naive regex).
+10,000 files: 0.016s (SA) vs 0.912s (naive regex).

-

Can’t think of anything else.

+

Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). +Sanitize input–strip non-printables, limit length, and quote metacharacters +(ReDOS). No exec/system (command injection). Chroot.

-

Verdict: Fast SA index lookup. Mitigated primary attack vectors. No -dependencies. Works on every conceivable device.

+

Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.

Commit: 6da102d diff --git a/_site/posts.xml b/_site/posts.xml index 64fd4e6..02e7149 100644 --- a/_site/posts.xml +++ b/_site/posts.xml @@ -1 +1 @@ -Jekyll2026-01-03T16:04:13+08:00/posts.xmlASCIIMXW. D. Sadeep Madurange \ No newline at end of file +Jekyll2026-01-03T19:18:57+08:00/posts.xmlASCIIMXW. D. Sadeep Madurange \ No newline at end of file -- cgit v1.2.3