summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorSadeep Madurange <sadeep@asciimx.com>2026-01-03 19:24:25 +0800
committerSadeep Madurange <sadeep@asciimx.com>2026-01-03 19:24:25 +0800
commit16fe66dd83cbffa18af31676a380660ebce4e827 (patch)
tree2e3c655bac69430998296a0ab8dbfcc4987090bb
parentd3aff7d14e66f19a8e4e9a3315494ad612ccc1f5 (diff)
downloadwww-16fe66dd83cbffa18af31676a380660ebce4e827.tar.gz
Polish the search engine post.
-rw-r--r--_log/site-search.md276
-rw-r--r--_site/feed.xml2
-rw-r--r--_site/index.html2
-rw-r--r--_site/log/site-search/index.html274
-rw-r--r--_site/posts.xml2
5 files changed, 52 insertions, 504 deletions
diff --git a/_log/site-search.md b/_log/site-search.md
index 2a552f6..6af9f63 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,30 +1,20 @@
---
-title: Site search with Perl + CGI + SA
+title: Perl + FastCGI + SA search engine
date: 2026-01-02
layout: post
---
-Number of articles on the site is growing. Need a way to find them.
+Number of articles growing. Need search.
-Search must match substrings, be case-insensitive, fast, and secure. Site must
-continue to work without JavaScript, on text web browsers--so no client-side
-search. Page rank not applicable.
+Requirements: substring match, case-insensitive, fast, secure. No JavaScript.
-Architecture. OpenBSD's httpd speaks FastCGI. Perl and slowcgi are in the base
-system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
-results back over FastCGI.
+Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.
-Data structure. Suffix array. Build-time indexer generates three files:
-corpus.bin--articles, sa.bin-suffix array, file_map.dat--article metadata
-(title, path).
+Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
+(sorted byte offsets), file_map.dat (metadata).
-## Indexer
-
-Crawl the directory containing posts, extract HTML with regex
-(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
-the text (for case-insensitive search), convert to raw octets, concatenate.
-
-Sort the text lexicographically, and store their byte offsets in sa.bin:
+Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
+byte sentinel for document boundaries. Sort lexicographically::
```
# Use a block that forces byte-level comparison
@@ -39,247 +29,29 @@ Sort the text lexicographically, and store their byte offsets in sa.bin:
}
```
-This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the
-suffix count, L the median length of suffixes. The fast path caps L at 64 bytes
-for most cases, bringing complexity down to O(NlogN)--64-byte length chosen to
-fit in cache lines.
-
-```
-CORPUS: a s c i \0 i m x
-OFFSET: 0 1 2 3 4 5 6 7
-
-SORTED SUFFIXES:
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx <-- "i" from "asci"
-[5] imx <-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
-```
-
-Sentinel (null byte) must be lexicographically smaller than every other
-character in the alphabet (offset 3 and 5 handled correctly at the document
-boundary).
-
-Benchmarks.
-
-Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM)
-i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.
-
-All articles are 16 KB (2x my avg post size) documents generated with seed.sh.
-
-Total Time: 0.1475 seconds<br>
-Files Processed: 500 <br>
-File Sizes:<br>
-&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;33.59 KB<br>
-&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;134.34 KB<br>
-&nbsp;&nbsp; file_map.dat &nbsp;&nbsp;&nbsp; 37.01 KB<br>
-&nbsp;&nbsp; Total Index: &nbsp;&nbsp;&nbsp;204.94 KB<br>
-
-Total Time: 0.3101 seconds<br>
-Files Processed: 1000<br>
-File Sizes:<br>
-&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;67.28 KB<br>
-&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;269.11 KB<br>
-&nbsp;&nbsp; file_map.dat &nbsp;&nbsp;&nbsp; 74.12 KB<br>
-&nbsp;&nbsp; Total Index: &nbsp;&nbsp;&nbsp;410.51 KB<br>
-
-Total Time: 10.9661 seconds<br>
-Files Processed: 10000<br>
-File Sizes:<br>
-&nbsp;&nbsp; corpus.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;682.51 KB<br>
-&nbsp;&nbsp; sa.bin &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2730.05 KB<br>
-&nbsp;&nbsp; file_map.dat &nbsp;&nbsp; 750.88 KB<br>
-&nbsp;&nbsp; Total Index: &nbsp;&nbsp;4163.44 KB<br>
-
-There are 11 posts today. Unlikely that it'd grow to 10K. Even if it did, the
-index is very manageable.
-
-## Search
-
-Need to perform a range query to find offsets of all matches. Suffix
-array is sorted--use binary search.
-
-```
-# left bound
-while ($low <= $high) {
- # Int overflow: Perl handles by promoting to float.
- my $mid = int(($low + $high) / 2);
- seek($fh_sa, $mid * 4, 0);
- read($fh_sa, my $bin_off, 4);
- my $off = unpack("L", $bin_off);
- seek($fh_cp, $off, 0);
- read($fh_cp, my $text, $query_len);
-
- my $cmp = $text cmp $query;
- if ($cmp >= 0) {
- $first_hit = $mid if $cmp == 0;
- $high = $mid - 1;
- } else {
- $low = $mid + 1;
- }
-}
-
-# Continue with right bound...
-```
-
-Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to
-seek/read exact chunks into memory. Since the FastCGI script is invoked per
-query, small, frequent allocations are better than less frequent large
-allocations (see bm_*.txt).
-
-Once the range is found through binary search, the results are collected using
-a simple for loop. Max result count is capped at 20.
-
-Benchmarks. Search for 'arduino' (expect 0 matches). Left: SA-based index
-search. Right: Naive search using regex.
-
-500 articles:
-
-<table>
- <tr>
- <td style="padding: 0">Files Read:</td>
- <td style="padding: 0; text-align: right;">3</td>
- <td style="padding: 0; text-align: right;">500</td>
- </tr>
- <tr>
- <td style="padding: 0">User CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0100</td>
- <td style="padding: 0; text-align: right;">0.0600</td>
- </tr>
- <tr>
- <td style="padding: 0">System CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0200</td>
- <td style="padding: 0; text-align: right;">0.0200</td>
- </tr>
- <tr>
- <td style="padding: 0">Total Time (s):</td>
- <td style="padding: 0; text-align: right;">0.0012</td>
- <td style="padding: 0; text-align: right;">0.0407</td>
- </tr>
- <tr>
- <td style="padding: 0">Peak RAM (KB):</td>
- <td style="padding: 0; text-align: right;">8828</td>
- <td style="padding: 0; text-align: right;">9136</td>
- </tr>
-</table>
-
-1,000 articles:
-
-<table>
- <tr>
- <td style="padding: 0">Files Read:</td>
- <td style="padding: 0; text-align: right;">3</td>
- <td style="padding: 0; text-align: right;">1000</td>
- </tr>
- <tr>
- <td style="padding: 0">User CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0200</td>
- <td style="padding: 0; text-align: right;">0.0500</td>
- </tr>
- <tr>
- <td style="padding: 0">System CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0100</td>
- <td style="padding: 0; text-align: right;">0.0500</td>
- </tr>
- <tr>
- <td style="padding: 0">Total Time (s):</td>
- <td style="padding: 0; text-align: right;">0.0019</td>
- <td style="padding: 0; text-align: right;">0.0795</td>
- </tr>
- <tr>
- <td style="padding: 0">Peak RAM (KB):</td>
- <td style="padding: 0; text-align: right;">8980</td>
- <td style="padding: 0; text-align: right;">9460</td>
- </tr>
-</table>
-
-10,000 articles:
-
-<table>
- <tr>
- <td style="padding: 0">Files Read:</td>
- <td style="padding: 0; text-align: right;">3</td>
- <td style="padding: 0; text-align: right;">10000</td>
- </tr>
- <tr>
- <td style="padding: 0">User CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0300</td>
- <td style="padding: 0; text-align: right;">0.4000</td>
- </tr>
- <tr>
- <td style="padding: 0">System CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0300</td>
- <td style="padding: 0; text-align: right;">0.5400</td>
- </tr>
- <tr>
- <td style="padding: 0">Total Time (s):</td>
- <td style="padding: 0; text-align: right;">0.0161</td>
- <td style="padding: 0; text-align: right;">0.9120</td>
- </tr>
- <tr>
- <td style="padding: 0">Peak RAM (KB):</td>
- <td style="padding: 0; text-align: right;">12504</td>
- <td style="padding: 0; text-align: right;">12804</td>
- </tr>
-</table>
-
-Note: Total time < User + System CPU time is likely a measurement artifact--the
-script finishes faster than the OS tick interval.
-
-## Security
-
-Memory usage remains a concern. Parallel queries could exhaust memory. Limit
-parallel searches using a semaphore implemented with lock files:
-
-```
-my $max_parallel = 50;
-
-mkdir $lock_dir, 0777 unless -d $lock_dir;
-my $active_count = 0;
-my $now = time();
-
-opendir(my $dh, $lock_dir);
-while (my $file = readdir($dh)) {
- next unless $file =~ /\.lock$/;
- my $path = "$lock_dir/$file";
- my $mtime = (stat($path))[9] || 0;
- ($now - $mtime > $lock_timeout) ? unlink($path) : $active_count++;
-}
-closedir($dh);
-```
-
-XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).
-
-ReDOS: Sanitize search text by tossing non-printable characters, limit length,
-work within utf-8. Use `\Q...\E` in searches.
+Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte
+length targets cache lines.
-```
-if (($ENV{QUERY_STRING} || '') =~ /^q=([^&]*)/) {
- $search_text = $1;
- $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
- $search_text = decode_utf8($search_text // "");
- $search_text =~ s/\P{Print}//g;
- $search_text = substr($search_text, 0, 64);
- $search_text =~ s/^\s+|\s+$//g;
-}
-```
+Search: binary search for range query. Cap at 20 results--define limits or be
+surprised by them.
-Command injection: no exec()/system() calls. Non-privileged user.
-Read-only directory.
+File IO and memory: many seek/read small chunks beat one large allocation (see
+benchmarks for find_one_file.cgi).
-Symlink attacks: paths are hard-coded. Script can't be modified (554).
+Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):
-No API keys, passwords involved.
+1,000 files: 0.31s indexing, 410 KB total index.
+10,000 files: 10.97s indexing, 4.16 MB total index.
-Last line of defence: chroot.
+Search 'arduino' (0 matches):
+1,000 files: 0.002s (SA) vs 0.080s (naive regex).
+10,000 files: 0.016s (SA) vs 0.912s (naive regex).
-Can't think of anything else.
+Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
+Sanitize input--strip non-printables, limit length, and quote metacharacters
+(ReDOS). No exec/system (command injection). Chroot.
-Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
-dependencies. Works on every conceivable device.
+Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.
Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
diff --git a/_site/feed.xml b/_site/feed.xml
index 5026409..4070b39 100644
--- a/_site/feed.xml
+++ b/_site/feed.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T16:04:13+08:00</updated><id>/feed.xml</id><title type="html">ASCIIMX | Log</title><author><name>W. D. Sadeep Madurange</name></author><entry><title type="html">Site search with Perl + CGI + SA</title><link href="/log/site-search/" rel="alternate" type="text/html" title="Site search with Perl + CGI + SA" /><published>2026-01-02T00:00:00+08:00</published><updated>2026-01-02T00:00:00+08:00</updated><id>/log/site-search</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Number of articles on the site is growing. Need a way to find them.]]></summary></entry><entry><title type="html">Matrix Rain: 2025 refactor</title><link href="/log/matrix-digital-rain/" rel="alternate" type="text/html" title="Matrix Rain: 2025 refactor" /><published>2025-12-21T00:00:00+08:00</published><updated>2025-12-21T00:00:00+08:00</updated><id>/log/matrix-digital-rain</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[The 2022 version worked but had some loose ends. Unicode support was inflexible–couldn’t mix ASCII with Katakana; Phosphor decay was stored in a separate array when it should’ve been packed with RGB; Code was harder to read than it needed to be.]]></summary></entry><entry><title type="html">Fingerprint door lock (LP)</title><link href="/log/fpm-door-lock-lp/" rel="alternate" type="text/html" title="Fingerprint door lock (LP)" /><published>2025-08-18T00:00:00+08:00</published><updated>2025-08-18T00:00:00+08:00</updated><id>/log/fpm-door-lock-lp</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Second iteration of the RF door lock. Old version worked but drew too much quiescent current. Sensor and servo pulled 13.8mA and 4.6mA idle. Linear regulators were a disaster. Battery didn’t last 24 hours.]]></summary></entry><entry><title type="html">High-side MOSFET switching</title><link href="/log/mosfet-switches/" rel="alternate" type="text/html" title="High-side MOSFET switching" /><published>2025-06-22T00:00:00+08:00</published><updated>2025-06-22T00:00:00+08:00</updated><id>/log/mosfet-switches</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Needed low-power switching for the fingerprint door lock. Servo and FPM draw high quiescent current–had to cut power electronically during sleep. MOSFETs can do this.]]></summary></entry><entry><title type="html">ATmega328P at 3.3V and 5V</title><link href="/log/arduino-uno/" rel="alternate" type="text/html" title="ATmega328P at 3.3V and 5V" /><published>2025-06-10T00:00:00+08:00</published><updated>2025-06-10T00:00:00+08:00</updated><id>/log/arduino-uno</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Quick reference for wiring ATmega328P ICs at 5V and 3.3V. 5V uses 16MHz crystal, 3.3V uses 8MHz.]]></summary></entry><entry><title type="html">Fingerprint door lock (RF)</title><link href="/log/fpm-door-lock-rf/" rel="alternate" type="text/html" title="Fingerprint door lock (RF)" /><published>2025-06-05T00:00:00+08:00</published><updated>2025-06-05T00:00:00+08:00</updated><id>/log/fpm-door-lock-rf</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Wanted to unlock door with fingerprint, wirelessly to avoid drilling.]]></summary></entry><entry><title type="html">Bumblebee: browser automation</title><link href="/log/bumblebee/" rel="alternate" type="text/html" title="Bumblebee: browser automation" /><published>2025-04-02T00:00:00+08:00</published><updated>2025-04-02T00:00:00+08:00</updated><id>/log/bumblebee</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Built with Andy Zhang for an employer. Tool to automate web scraping script generation.]]></summary></entry><entry><title type="html">Bare-metal ATSAM3X8E</title><link href="/log/arduino-due/" rel="alternate" type="text/html" title="Bare-metal ATSAM3X8E" /><published>2024-09-16T00:00:00+08:00</published><updated>2024-09-16T00:00:00+08:00</updated><id>/log/arduino-due</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Notes on programming bare-metal ATSAM3X8E chips (Arduino Due) using Serial Wire Debug (SwD) protocol.]]></summary></entry><entry><title type="html">Etlas: e-paper dashboard</title><link href="/log/etlas/" rel="alternate" type="text/html" title="Etlas: e-paper dashboard" /><published>2024-09-05T00:00:00+08:00</published><updated>2024-09-05T00:00:00+08:00</updated><id>/log/etlas</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Repurposed e-reader prototype into something for regular use. News, stocks, weather dashboard. ESP32 NodeMCU D1 + 7.5” Waveshare e-paper + DHT22 sensor.]]></summary></entry><entry><title type="html">ESP32 e-reader prototype</title><link href="/log/e-reader/" rel="alternate" type="text/html" title="ESP32 e-reader prototype" /><published>2023-10-24T00:00:00+08:00</published><updated>2023-10-24T00:00:00+08:00</updated><id>/log/e-reader</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[First project with e-paper displays and ESP32.]]></summary></entry></feed> \ No newline at end of file
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T19:18:57+08:00</updated><id>/feed.xml</id><title type="html">ASCIIMX | Log</title><author><name>W. D. Sadeep Madurange</name></author><entry><title type="html">Perl + FastCGI + SA search engine</title><link href="/log/site-search/" rel="alternate" type="text/html" title="Perl + FastCGI + SA search engine" /><published>2026-01-02T00:00:00+08:00</published><updated>2026-01-02T00:00:00+08:00</updated><id>/log/site-search</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Number of articles growing. Need search.]]></summary></entry><entry><title type="html">Matrix Rain: 2025 refactor</title><link href="/log/matrix-digital-rain/" rel="alternate" type="text/html" title="Matrix Rain: 2025 refactor" /><published>2025-12-21T00:00:00+08:00</published><updated>2025-12-21T00:00:00+08:00</updated><id>/log/matrix-digital-rain</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[The 2022 version worked but had some loose ends. Unicode support was inflexible–couldn’t mix ASCII with Katakana; Phosphor decay was stored in a separate array when it should’ve been packed with RGB; Code was harder to read than it needed to be.]]></summary></entry><entry><title type="html">Fingerprint door lock (LP)</title><link href="/log/fpm-door-lock-lp/" rel="alternate" type="text/html" title="Fingerprint door lock (LP)" /><published>2025-08-18T00:00:00+08:00</published><updated>2025-08-18T00:00:00+08:00</updated><id>/log/fpm-door-lock-lp</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Second iteration of the RF door lock. Old version worked but drew too much quiescent current. Sensor and servo pulled 13.8mA and 4.6mA idle. Linear regulators were a disaster. Battery didn’t last 24 hours.]]></summary></entry><entry><title type="html">High-side MOSFET switching</title><link href="/log/mosfet-switches/" rel="alternate" type="text/html" title="High-side MOSFET switching" /><published>2025-06-22T00:00:00+08:00</published><updated>2025-06-22T00:00:00+08:00</updated><id>/log/mosfet-switches</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Needed low-power switching for the fingerprint door lock. Servo and FPM draw high quiescent current–had to cut power electronically during sleep. MOSFETs can do this.]]></summary></entry><entry><title type="html">ATmega328P at 3.3V and 5V</title><link href="/log/arduino-uno/" rel="alternate" type="text/html" title="ATmega328P at 3.3V and 5V" /><published>2025-06-10T00:00:00+08:00</published><updated>2025-06-10T00:00:00+08:00</updated><id>/log/arduino-uno</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Quick reference for wiring ATmega328P ICs at 5V and 3.3V. 5V uses 16MHz crystal, 3.3V uses 8MHz.]]></summary></entry><entry><title type="html">Fingerprint door lock (RF)</title><link href="/log/fpm-door-lock-rf/" rel="alternate" type="text/html" title="Fingerprint door lock (RF)" /><published>2025-06-05T00:00:00+08:00</published><updated>2025-06-05T00:00:00+08:00</updated><id>/log/fpm-door-lock-rf</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Wanted to unlock door with fingerprint, wirelessly to avoid drilling.]]></summary></entry><entry><title type="html">Bumblebee: browser automation</title><link href="/log/bumblebee/" rel="alternate" type="text/html" title="Bumblebee: browser automation" /><published>2025-04-02T00:00:00+08:00</published><updated>2025-04-02T00:00:00+08:00</updated><id>/log/bumblebee</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Built with Andy Zhang for an employer. Tool to automate web scraping script generation.]]></summary></entry><entry><title type="html">Bare-metal ATSAM3X8E</title><link href="/log/arduino-due/" rel="alternate" type="text/html" title="Bare-metal ATSAM3X8E" /><published>2024-09-16T00:00:00+08:00</published><updated>2024-09-16T00:00:00+08:00</updated><id>/log/arduino-due</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Notes on programming bare-metal ATSAM3X8E chips (Arduino Due) using Serial Wire Debug (SwD) protocol.]]></summary></entry><entry><title type="html">Etlas: e-paper dashboard</title><link href="/log/etlas/" rel="alternate" type="text/html" title="Etlas: e-paper dashboard" /><published>2024-09-05T00:00:00+08:00</published><updated>2024-09-05T00:00:00+08:00</updated><id>/log/etlas</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Repurposed e-reader prototype into something for regular use. News, stocks, weather dashboard. ESP32 NodeMCU D1 + 7.5” Waveshare e-paper + DHT22 sensor.]]></summary></entry><entry><title type="html">ESP32 e-reader prototype</title><link href="/log/e-reader/" rel="alternate" type="text/html" title="ESP32 e-reader prototype" /><published>2023-10-24T00:00:00+08:00</published><updated>2023-10-24T00:00:00+08:00</updated><id>/log/e-reader</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[First project with e-paper displays and ESP32.]]></summary></entry></feed> \ No newline at end of file
diff --git a/_site/index.html b/_site/index.html
index 9a10f2d..39a51a5 100644
--- a/_site/index.html
+++ b/_site/index.html
@@ -58,7 +58,7 @@
<tr>
<td class="posts-td posts-td-link">
- <a href="/log/site-search/" class="link-decor-none">Site search with Perl + CGI + SA</a>
+ <a href="/log/site-search/" class="link-decor-none">Perl + FastCGI + SA search engine</a>
</td>
<td class="posts-td posts-td-time">
<span class="post-meta">
diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html
index 65e69e0..04f98e9 100644
--- a/_site/log/site-search/index.html
+++ b/_site/log/site-search/index.html
@@ -3,7 +3,7 @@
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
- <title>Site search with Perl + CGI + SA</title>
+ <title>Perl + FastCGI + SA search engine</title>
<link rel="stylesheet" href="/assets/css/main.css">
<link rel="stylesheet" href="/assets/css/skeleton.css">
</head>
@@ -37,30 +37,20 @@
<main>
<div class="container">
<div class="container-2">
- <h2 class="center" id="title">SITE SEARCH WITH PERL + CGI + SA</h2>
+ <h2 class="center" id="title">PERL + FASTCGI + SA SEARCH ENGINE</h2>
<h5 class="center">02 JANUARY 2026</h5>
<br>
- <div class="twocol justify"><p>Number of articles on the site is growing. Need a way to find them.</p>
+ <div class="twocol justify"><p>Number of articles growing. Need search.</p>
-<p>Search must match substrings, be case-insensitive, fast, and secure. Site must
-continue to work without JavaScript, on text web browsers–so no client-side
-search. Page rank not applicable.</p>
+<p>Requirements: substring match, case-insensitive, fast, secure. No JavaScript.</p>
-<p>Architecture. OpenBSD’s httpd speaks FastCGI. Perl and slowcgi are in the base
-system. Search using a Perl script. Invoke it from httpd via slowcgi. Send the
-results back over FastCGI.</p>
+<p>Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.</p>
-<p>Data structure. Suffix array. Build-time indexer generates three files:
-corpus.bin–articles, sa.bin-suffix array, file_map.dat–article metadata
-(title, path).</p>
+<p>Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
+(sorted byte offsets), file_map.dat (metadata).</p>
-<h2 id="indexer">Indexer</h2>
-
-<p>Crawl the directory containing posts, extract HTML with regex
-(no-regex-for-HTML doctrinaires getting a seizure right about now), lower-case
-the text (for case-insensitive search), convert to raw octets, concatenate.</p>
-
-<p>Sort the text lexicographically, and store their byte offsets in sa.bin:</p>
+<p>Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
+byte sentinel for document boundaries. Sort lexicographically::</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison
{
@@ -74,243 +64,29 @@ the text (for case-insensitive search), convert to raw octets, concatenate.</p>
}
</code></pre></div></div>
-<p>This is the slowest part of the indexer running at ~ O(L⋅NlogN), N is the
-suffix count, L the median length of suffixes. The fast path caps L at 64 bytes
-for most cases, bringing complexity down to O(NlogN)–64-byte length chosen to
-fit in cache lines.</p>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CORPUS: a s c i \0 i m x
-OFFSET: 0 1 2 3 4 5 6 7
-
-SORTED SUFFIXES:
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx &lt;-- "i" from "asci"
-[5] imx &lt;-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
-</code></pre></div></div>
-
-<p>Sentinel (null byte) must be lexicographically smaller than every other
-character in the alphabet (offset 3 and 5 handled correctly at the document
-boundary).</p>
-
-<p>Benchmarks.</p>
-
-<p>Performed on a T490 running OpenBSD 7.8. CPU: Intel(R) Core(TM)
-i7-10510U CPU @ 1.80GHz. AMD64. 15 GB memory.</p>
-
-<p>All articles are 16 KB (2x my avg post size) documents generated with seed.sh.</p>
-
-<p>Total Time: 0.1475 seconds<br />
-Files Processed: 500 <br />
-File Sizes:<br />
-   corpus.bin       33.59 KB<br />
-   sa.bin          134.34 KB<br />
-   file_map.dat     37.01 KB<br />
-   Total Index:    204.94 KB<br /></p>
-
-<p>Total Time: 0.3101 seconds<br />
-Files Processed: 1000<br />
-File Sizes:<br />
-   corpus.bin       67.28 KB<br />
-   sa.bin          269.11 KB<br />
-   file_map.dat     74.12 KB<br />
-   Total Index:    410.51 KB<br /></p>
-
-<p>Total Time: 10.9661 seconds<br />
-Files Processed: 10000<br />
-File Sizes:<br />
-   corpus.bin      682.51 KB<br />
-   sa.bin         2730.05 KB<br />
-   file_map.dat    750.88 KB<br />
-   Total Index:   4163.44 KB<br /></p>
-
-<p>There are 11 posts today. Unlikely that it’d grow to 10K. Even if it did, the
-index is very manageable.</p>
-
-<h2 id="search">Search</h2>
-
-<p>Need to perform a range query to find offsets of all matches. Suffix
-array is sorted–use binary search.</p>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># left bound
-while ($low &lt;= $high) {
- # Int overflow: Perl handles by promoting to float.
- my $mid = int(($low + $high) / 2);
- seek($fh_sa, $mid * 4, 0);
- read($fh_sa, my $bin_off, 4);
- my $off = unpack("L", $bin_off);
- seek($fh_cp, $off, 0);
- read($fh_cp, my $text, $query_len);
-
- my $cmp = $text cmp $query;
- if ($cmp &gt;= 0) {
- $first_hit = $mid if $cmp == 0;
- $high = $mid - 1;
- } else {
- $low = $mid + 1;
- }
-}
-
-# Continue with right bound...
-</code></pre></div></div>
-
-<p>Each offset is a 32-bit unsigned int in sa.bin. Use the word-aligned offsets to
-seek/read exact chunks into memory. Since the FastCGI script is invoked per
-query, small, frequent allocations are better than less frequent large
-allocations (see bm_*.txt).</p>
-
-<p>Once the range is found through binary search, the results are collected using
-a simple for loop. Max result count is capped at 20.</p>
-
-<p>Benchmarks. Search for ‘arduino’ (expect 0 matches). Left: SA-based index
-search. Right: Naive search using regex.</p>
-
-<p>500 articles:</p>
-
-<table>
- <tr>
- <td style="padding: 0">Files Read:</td>
- <td style="padding: 0; text-align: right;">3</td>
- <td style="padding: 0; text-align: right;">500</td>
- </tr>
- <tr>
- <td style="padding: 0">User CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0100</td>
- <td style="padding: 0; text-align: right;">0.0600</td>
- </tr>
- <tr>
- <td style="padding: 0">System CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0200</td>
- <td style="padding: 0; text-align: right;">0.0200</td>
- </tr>
- <tr>
- <td style="padding: 0">Total Time (s):</td>
- <td style="padding: 0; text-align: right;">0.0012</td>
- <td style="padding: 0; text-align: right;">0.0407</td>
- </tr>
- <tr>
- <td style="padding: 0">Peak RAM (KB):</td>
- <td style="padding: 0; text-align: right;">8828</td>
- <td style="padding: 0; text-align: right;">9136</td>
- </tr>
-</table>
-
-<p>1,000 articles:</p>
-
-<table>
- <tr>
- <td style="padding: 0">Files Read:</td>
- <td style="padding: 0; text-align: right;">3</td>
- <td style="padding: 0; text-align: right;">1000</td>
- </tr>
- <tr>
- <td style="padding: 0">User CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0200</td>
- <td style="padding: 0; text-align: right;">0.0500</td>
- </tr>
- <tr>
- <td style="padding: 0">System CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0100</td>
- <td style="padding: 0; text-align: right;">0.0500</td>
- </tr>
- <tr>
- <td style="padding: 0">Total Time (s):</td>
- <td style="padding: 0; text-align: right;">0.0019</td>
- <td style="padding: 0; text-align: right;">0.0795</td>
- </tr>
- <tr>
- <td style="padding: 0">Peak RAM (KB):</td>
- <td style="padding: 0; text-align: right;">8980</td>
- <td style="padding: 0; text-align: right;">9460</td>
- </tr>
-</table>
-
-<p>10,000 articles:</p>
-
-<table>
- <tr>
- <td style="padding: 0">Files Read:</td>
- <td style="padding: 0; text-align: right;">3</td>
- <td style="padding: 0; text-align: right;">10000</td>
- </tr>
- <tr>
- <td style="padding: 0">User CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0300</td>
- <td style="padding: 0; text-align: right;">0.4000</td>
- </tr>
- <tr>
- <td style="padding: 0">System CPU (s):</td>
- <td style="padding: 0; text-align: right;">0.0300</td>
- <td style="padding: 0; text-align: right;">0.5400</td>
- </tr>
- <tr>
- <td style="padding: 0">Total Time (s):</td>
- <td style="padding: 0; text-align: right;">0.0161</td>
- <td style="padding: 0; text-align: right;">0.9120</td>
- </tr>
- <tr>
- <td style="padding: 0">Peak RAM (KB):</td>
- <td style="padding: 0; text-align: right;">12504</td>
- <td style="padding: 0; text-align: right;">12804</td>
- </tr>
-</table>
-
-<p>Note: Total time &lt; User + System CPU time is likely a measurement artifact–the
-script finishes faster than the OS tick interval.</p>
-
-<h2 id="security">Security</h2>
-
-<p>Memory usage remains a concern. Parallel queries could exhaust memory. Limit
-parallel searches using a semaphore implemented with lock files:</p>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my $max_parallel = 50;
-
-mkdir $lock_dir, 0777 unless -d $lock_dir;
-my $active_count = 0;
-my $now = time();
-
-opendir(my $dh, $lock_dir);
-while (my $file = readdir($dh)) {
- next unless $file =~ /\.lock$/;
- my $path = "$lock_dir/$file";
- my $mtime = (stat($path))[9] || 0;
- ($now - $mtime &gt; $lock_timeout) ? unlink($path) : $active_count++;
-}
-closedir($dh);
-</code></pre></div></div>
-
-<p>XSS attacks. Escape all HTML using HTML::Escape qw(escape_html).</p>
-
-<p>ReDOS: Sanitize search text by tossing non-printable characters, limit length,
-work within utf-8. Use <code class="language-plaintext highlighter-rouge">\Q...\E</code> in searches.</p>
+<p>Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte
+length targets cache lines.</p>
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (($ENV{QUERY_STRING} || '') =~ /^q=([^&amp;]*)/) {
- $search_text = $1;
- $search_text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
- $search_text = decode_utf8($search_text // "");
- $search_text =~ s/\P{Print}//g;
- $search_text = substr($search_text, 0, 64);
- $search_text =~ s/^\s+|\s+$//g;
-}
-</code></pre></div></div>
+<p>Search: binary search for range query. Cap at 20 results–define limits or be
+surprised by them.</p>
-<p>Command injection: no exec()/system() calls. Non-privileged user.
-Read-only directory.</p>
+<p>File IO and memory: many seek/read small chunks beat one large allocation (see
+benchmarks for find_one_file.cgi).</p>
-<p>Symlink attacks: paths are hard-coded. Script can’t be modified (554).</p>
+<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):</p>
-<p>No API keys, passwords involved.</p>
+<p>1,000 files: 0.31s indexing, 410 KB total index.<br />
+10,000 files: 10.97s indexing, 4.16 MB total index.</p>
-<p>Last line of defence: chroot.</p>
+<p>Search ‘arduino’ (0 matches):<br />
+1,000 files: 0.002s (SA) vs 0.080s (naive regex).<br />
+10,000 files: 0.016s (SA) vs 0.912s (naive regex).</p>
-<p>Can’t think of anything else.</p>
+<p>Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
+Sanitize input–strip non-printables, limit length, and quote metacharacters
+(ReDOS). No exec/system (command injection). Chroot.</p>
-<p>Verdict: Fast SA index lookup. Mitigated primary attack vectors. No
-dependencies. Works on every conceivable device.</p>
+<p>Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.</p>
<p>Commit:
<a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
diff --git a/_site/posts.xml b/_site/posts.xml
index 64fd4e6..02e7149 100644
--- a/_site/posts.xml
+++ b/_site/posts.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/posts.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T16:04:13+08:00</updated><id>/posts.xml</id><title type="html">ASCIIMX</title><author><name>W. D. Sadeep Madurange</name></author></feed> \ No newline at end of file
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/posts.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T19:18:57+08:00</updated><id>/posts.xml</id><title type="html">ASCIIMX</title><author><name>W. D. Sadeep Madurange</name></author></feed> \ No newline at end of file