diff options
| author | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-04 17:57:39 +0800 |
|---|---|---|
| committer | Sadeep Madurange <sadeep@asciimx.com> | 2026-01-04 17:57:39 +0800 |
| commit | 1a4a6cb6d2aa2c8512e9637dc5dd95997321c444 (patch) | |
| tree | 7b6b9e514b48d64dd811b75c680c1268b532aec6 /_site | |
| parent | 16fe66dd83cbffa18af31676a380660ebce4e827 (diff) | |
| download | www-1a4a6cb6d2aa2c8512e9637dc5dd95997321c444.tar.gz | |
Fix the search engine post.
Diffstat (limited to '_site')
| -rw-r--r-- | _site/feed.xml | 2 | ||||
| -rw-r--r-- | _site/index.html | 4 | ||||
| -rw-r--r-- | _site/log/site-search/index.html | 130 | ||||
| -rw-r--r-- | _site/posts.xml | 2 | ||||
| -rw-r--r-- | _site/sitemap.xml | 2 |
5 files changed, 105 insertions, 35 deletions
diff --git a/_site/feed.xml b/_site/feed.xml index 4070b39..cb9e91e 100644 --- a/_site/feed.xml +++ b/_site/feed.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T19:18:57+08:00</updated><id>/feed.xml</id><title type="html">ASCIIMX | Log</title><author><name>W. D. Sadeep Madurange</name></author><entry><title type="html">Perl + FastCGI + SA search engine</title><link href="/log/site-search/" rel="alternate" type="text/html" title="Perl + FastCGI + SA search engine" /><published>2026-01-02T00:00:00+08:00</published><updated>2026-01-02T00:00:00+08:00</updated><id>/log/site-search</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Number of articles growing. Need search.]]></summary></entry><entry><title type="html">Matrix Rain: 2025 refactor</title><link href="/log/matrix-digital-rain/" rel="alternate" type="text/html" title="Matrix Rain: 2025 refactor" /><published>2025-12-21T00:00:00+08:00</published><updated>2025-12-21T00:00:00+08:00</updated><id>/log/matrix-digital-rain</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[The 2022 version worked but had some loose ends. Unicode support was inflexible–couldn’t mix ASCII with Katakana; Phosphor decay was stored in a separate array when it should’ve been packed with RGB; Code was harder to read than it needed to be.]]></summary></entry><entry><title type="html">Fingerprint door lock (LP)</title><link href="/log/fpm-door-lock-lp/" rel="alternate" type="text/html" title="Fingerprint door lock (LP)" /><published>2025-08-18T00:00:00+08:00</published><updated>2025-08-18T00:00:00+08:00</updated><id>/log/fpm-door-lock-lp</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Second iteration of the RF door lock. Old version worked but drew too much quiescent current. Sensor and servo pulled 13.8mA and 4.6mA idle. Linear regulators were a disaster. Battery didn’t last 24 hours.]]></summary></entry><entry><title type="html">High-side MOSFET switching</title><link href="/log/mosfet-switches/" rel="alternate" type="text/html" title="High-side MOSFET switching" /><published>2025-06-22T00:00:00+08:00</published><updated>2025-06-22T00:00:00+08:00</updated><id>/log/mosfet-switches</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Needed low-power switching for the fingerprint door lock. Servo and FPM draw high quiescent current–had to cut power electronically during sleep. MOSFETs can do this.]]></summary></entry><entry><title type="html">ATmega328P at 3.3V and 5V</title><link href="/log/arduino-uno/" rel="alternate" type="text/html" title="ATmega328P at 3.3V and 5V" /><published>2025-06-10T00:00:00+08:00</published><updated>2025-06-10T00:00:00+08:00</updated><id>/log/arduino-uno</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Quick reference for wiring ATmega328P ICs at 5V and 3.3V. 5V uses 16MHz crystal, 3.3V uses 8MHz.]]></summary></entry><entry><title type="html">Fingerprint door lock (RF)</title><link href="/log/fpm-door-lock-rf/" rel="alternate" type="text/html" title="Fingerprint door lock (RF)" /><published>2025-06-05T00:00:00+08:00</published><updated>2025-06-05T00:00:00+08:00</updated><id>/log/fpm-door-lock-rf</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Wanted to unlock door with fingerprint, wirelessly to avoid drilling.]]></summary></entry><entry><title type="html">Bumblebee: browser automation</title><link href="/log/bumblebee/" rel="alternate" type="text/html" title="Bumblebee: browser automation" /><published>2025-04-02T00:00:00+08:00</published><updated>2025-04-02T00:00:00+08:00</updated><id>/log/bumblebee</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Built with Andy Zhang for an employer. Tool to automate web scraping script generation.]]></summary></entry><entry><title type="html">Bare-metal ATSAM3X8E</title><link href="/log/arduino-due/" rel="alternate" type="text/html" title="Bare-metal ATSAM3X8E" /><published>2024-09-16T00:00:00+08:00</published><updated>2024-09-16T00:00:00+08:00</updated><id>/log/arduino-due</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Notes on programming bare-metal ATSAM3X8E chips (Arduino Due) using Serial Wire Debug (SwD) protocol.]]></summary></entry><entry><title type="html">Etlas: e-paper dashboard</title><link href="/log/etlas/" rel="alternate" type="text/html" title="Etlas: e-paper dashboard" /><published>2024-09-05T00:00:00+08:00</published><updated>2024-09-05T00:00:00+08:00</updated><id>/log/etlas</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Repurposed e-reader prototype into something for regular use. News, stocks, weather dashboard. ESP32 NodeMCU D1 + 7.5” Waveshare e-paper + DHT22 sensor.]]></summary></entry><entry><title type="html">ESP32 e-reader prototype</title><link href="/log/e-reader/" rel="alternate" type="text/html" title="ESP32 e-reader prototype" /><published>2023-10-24T00:00:00+08:00</published><updated>2023-10-24T00:00:00+08:00</updated><id>/log/e-reader</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[First project with e-paper displays and ESP32.]]></summary></entry></feed>
\ No newline at end of file +<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-04T17:31:30+08:00</updated><id>/feed.xml</id><title type="html">ASCIIMX | Log</title><author><name>W. D. Sadeep Madurange</name></author><entry><title type="html">Search engine (Perl + FastCGI + SA)</title><link href="/log/site-search/" rel="alternate" type="text/html" title="Search engine (Perl + FastCGI + SA)" /><published>2026-01-03T00:00:00+08:00</published><updated>2026-01-03T00:00:00+08:00</updated><id>/log/site-search</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Article count on the website is growing. Need a way to search.]]></summary></entry><entry><title type="html">Matrix Rain: 2025 refactor</title><link href="/log/matrix-digital-rain/" rel="alternate" type="text/html" title="Matrix Rain: 2025 refactor" /><published>2025-12-21T00:00:00+08:00</published><updated>2025-12-21T00:00:00+08:00</updated><id>/log/matrix-digital-rain</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[The 2022 version worked but had some loose ends. Unicode support was inflexible–couldn’t mix ASCII with Katakana; Phosphor decay was stored in a separate array when it should’ve been packed with RGB; Code was harder to read than it needed to be.]]></summary></entry><entry><title type="html">Fingerprint door lock (LP)</title><link href="/log/fpm-door-lock-lp/" rel="alternate" type="text/html" title="Fingerprint door lock (LP)" /><published>2025-08-18T00:00:00+08:00</published><updated>2025-08-18T00:00:00+08:00</updated><id>/log/fpm-door-lock-lp</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Second iteration of the RF door lock. Old version worked but drew too much quiescent current. Sensor and servo pulled 13.8mA and 4.6mA idle. Linear regulators were a disaster. Battery didn’t last 24 hours.]]></summary></entry><entry><title type="html">High-side MOSFET switching</title><link href="/log/mosfet-switches/" rel="alternate" type="text/html" title="High-side MOSFET switching" /><published>2025-06-22T00:00:00+08:00</published><updated>2025-06-22T00:00:00+08:00</updated><id>/log/mosfet-switches</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Needed low-power switching for the fingerprint door lock. Servo and FPM draw high quiescent current–had to cut power electronically during sleep. MOSFETs can do this.]]></summary></entry><entry><title type="html">ATmega328P at 3.3V and 5V</title><link href="/log/arduino-uno/" rel="alternate" type="text/html" title="ATmega328P at 3.3V and 5V" /><published>2025-06-10T00:00:00+08:00</published><updated>2025-06-10T00:00:00+08:00</updated><id>/log/arduino-uno</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Quick reference for wiring ATmega328P ICs at 5V and 3.3V. 5V uses 16MHz crystal, 3.3V uses 8MHz.]]></summary></entry><entry><title type="html">Fingerprint door lock (RF)</title><link href="/log/fpm-door-lock-rf/" rel="alternate" type="text/html" title="Fingerprint door lock (RF)" /><published>2025-06-05T00:00:00+08:00</published><updated>2025-06-05T00:00:00+08:00</updated><id>/log/fpm-door-lock-rf</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Wanted to unlock door with fingerprint, wirelessly to avoid drilling.]]></summary></entry><entry><title type="html">Bumblebee: browser automation</title><link href="/log/bumblebee/" rel="alternate" type="text/html" title="Bumblebee: browser automation" /><published>2025-04-02T00:00:00+08:00</published><updated>2025-04-02T00:00:00+08:00</updated><id>/log/bumblebee</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Built with Andy Zhang for an employer. Tool to automate web scraping script generation.]]></summary></entry><entry><title type="html">Bare-metal ATSAM3X8E</title><link href="/log/arduino-due/" rel="alternate" type="text/html" title="Bare-metal ATSAM3X8E" /><published>2024-09-16T00:00:00+08:00</published><updated>2024-09-16T00:00:00+08:00</updated><id>/log/arduino-due</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Notes on programming bare-metal ATSAM3X8E chips (Arduino Due) using Serial Wire Debug (SwD) protocol.]]></summary></entry><entry><title type="html">Etlas: e-paper dashboard</title><link href="/log/etlas/" rel="alternate" type="text/html" title="Etlas: e-paper dashboard" /><published>2024-09-05T00:00:00+08:00</published><updated>2024-09-05T00:00:00+08:00</updated><id>/log/etlas</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[Repurposed e-reader prototype into something for regular use. News, stocks, weather dashboard. ESP32 NodeMCU D1 + 7.5” Waveshare e-paper + DHT22 sensor.]]></summary></entry><entry><title type="html">ESP32 e-reader prototype</title><link href="/log/e-reader/" rel="alternate" type="text/html" title="ESP32 e-reader prototype" /><published>2023-10-24T00:00:00+08:00</published><updated>2023-10-24T00:00:00+08:00</updated><id>/log/e-reader</id><author><name>W. D. Sadeep Madurange</name></author><summary type="html"><![CDATA[First project with e-paper displays and ESP32.]]></summary></entry></feed>
\ No newline at end of file diff --git a/_site/index.html b/_site/index.html index 39a51a5..ef782ee 100644 --- a/_site/index.html +++ b/_site/index.html @@ -58,11 +58,11 @@ <tr> <td class="posts-td posts-td-link"> - <a href="/log/site-search/" class="link-decor-none">Perl + FastCGI + SA search engine</a> + <a href="/log/site-search/" class="link-decor-none">Search engine (Perl + FastCGI + SA)</a> </td> <td class="posts-td posts-td-time"> <span class="post-meta"> - <time datetime="2026-01-02 00:00:00 +0800">2026-01-02</time> + <time datetime="2026-01-03 00:00:00 +0800">2026-01-03</time> </span> </td> </tr> diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html index 04f98e9..9170002 100644 --- a/_site/log/site-search/index.html +++ b/_site/log/site-search/index.html @@ -3,7 +3,7 @@ <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> - <title>Perl + FastCGI + SA search engine</title> + <title>Search engine (Perl + FastCGI + SA)</title> <link rel="stylesheet" href="/assets/css/main.css"> <link rel="stylesheet" href="/assets/css/skeleton.css"> </head> @@ -37,24 +37,47 @@ <main> <div class="container"> <div class="container-2"> - <h2 class="center" id="title">PERL + FASTCGI + SA SEARCH ENGINE</h2> - <h5 class="center">02 JANUARY 2026</h5> + <h2 class="center" id="title">SEARCH ENGINE (PERL + FASTCGI + SA)</h2> + <h5 class="center">03 JANUARY 2026</h5> <br> - <div class="twocol justify"><p>Number of articles growing. Need search.</p> + <div class="twocol justify"><p>Article count on the website is growing. Need a way to search.</p> -<p>Requirements: substring match, case-insensitive, fast, secure. No JavaScript.</p> +<p>Requirements: matches substrings, case-insensitive, fast, secure. No +JavaScript.</p> -<p>Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.</p> +<p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p> -<p>Data structure: suffix array. Three files: corpus.bin (articles), sa.bin -(sorted byte offsets), file_map.dat (metadata).</p> +<p>httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access +governed by file system permissions–no secrets to manage. Secure by default.</p> -<p>Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null -byte sentinel for document boundaries. Sort lexicographically::</p> +<p>2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files +using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast +enough. Start to feel the O(N) pull at higher file counts.</p> -<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison +<p>Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can +be mitigated. Tempted to stop here.</p> + +<p>2026-01-03: Suffix array (SA) based index lookup.</p> + +<p>Inefficiency of scanning every file on each request troubles me. Regex search +depends almost entirely on hardware for speed.</p> + +<p>Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted +byte offsets), file_map.dat (metadata). Index created with the site:</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build +$ cd cgi-bin/ +$ perl indexer.pl +</code></pre></div></div> + +<p>Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and +saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix +array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the +lexicographical order:</p> + +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1); { - use bytes; + use bytes; # Force compare 8-bit Unicode value comparisons @sa = sort { # First 64 bytes check (fast path) (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || @@ -62,31 +85,78 @@ byte sentinel for document boundaries. Sort lexicographically::</p> (substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } -</code></pre></div></div> - -<p>Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte -length targets cache lines.</p> -<p>Search: binary search for range query. Cap at 20 results–define limits or be -surprised by them.</p> +CORPUS: a s c i \0 i m x +OFFSET: 0 1 2 3 4 5 6 7 -<p>File IO and memory: many seek/read small chunks beat one large allocation (see -benchmarks for find_one_file.cgi).</p> +SORTED SUFFIXES: -<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):</p> +[4] \0imx +[0] asci\0imx +[2] ci\0imx +[3] i\0imx <-- "i" from "asci" +[5] imx <-- "i" from "imx" +[6] mx +[1] sci\0imx +[7] x +</code></pre></div></div> -<p>1,000 files: 0.31s indexing, 410 KB total index.<br /> -10,000 files: 10.97s indexing, 4.16 MB total index.</p> +<p>Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a +cache line), reducing complexity to O(N log N).</p> -<p>Search ‘arduino’ (0 matches):<br /> -1,000 files: 0.002s (SA) vs 0.080s (naive regex).<br /> -10,000 files: 0.016s (SA) vs 0.912s (naive regex).</p> +<p>Search: Textbook range query with two binary searches. Random access to offsets +and the text is possible via the fixed-width offsets:</p> -<p>Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). -Sanitize input–strip non-printables, limit length, and quote metacharacters -(ReDOS). No exec/system (command injection). Chroot.</p> +<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0); +read($fh_sa, my $bin_off, 4); +my $off = unpack("L", $bin_off); +seek($fh_cp, $off, 0); +read($fh_cp, my $text, $query_len); +</code></pre></div></div> -<p>Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.</p> +<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy +to implement. Note to self: mmap.</p> + +<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p> + +<p>500 files:</p> +<ul> + <li>Index size: 204.94 KB</li> + <li>Indexing time: 0.1475 s</li> + <li>Peak RAM (SA): 8828 KB</li> + <li>Peak RAM (Regex): 9136 KB</li> + <li>Search (SA): 0.0012 s</li> + <li>Search (Regex): 0.0407 s</li> +</ul> + +<p>1,000 files:</p> +<ul> + <li>Index size: 410.51 KB</li> + <li>Indexing time: 0.3101 s</li> + <li>Peak RAM (SA): 8980 KB</li> + <li>Peak RAM (Regex): 9460 KB</li> + <li>Search (SA): 0.0019 s</li> + <li>Search (Regex): 0.0795 s</li> +</ul> + +<p>10,000 files:</p> +<ul> + <li>Index size: 4163.44 KB</li> + <li>Indexing time: 10.9661 s</li> + <li>Peak RAM (SA): 12504 KB</li> + <li>Peak RAM (Regex): 12804 KB</li> + <li>Search (SA): 0.0161 s</li> + <li>Search (Regex): 0.9120 s</li> +</ul> + +<p>Security: Much of it comes from its architectural simplicity. No dependencies +to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.</p> + +<p>Resource exhaustion and XSS attacks are inherent. The former is mitigated by +limiting concurrent searches via lock-file semaphores and capping query length +(64B) and result sets (20). All output is HTML-escaped to prevent XSS.</p> + +<p>Verdict: Fast SA lookup; Works on every conceivable web browser.</p> <p>Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a> diff --git a/_site/posts.xml b/_site/posts.xml index 02e7149..b7913a1 100644 --- a/_site/posts.xml +++ b/_site/posts.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/posts.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-03T19:18:57+08:00</updated><id>/posts.xml</id><title type="html">ASCIIMX</title><author><name>W. D. Sadeep Madurange</name></author></feed>
\ No newline at end of file +<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/posts.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-01-04T17:31:30+08:00</updated><id>/posts.xml</id><title type="html">ASCIIMX</title><author><name>W. D. Sadeep Madurange</name></author></feed>
\ No newline at end of file diff --git a/_site/sitemap.xml b/_site/sitemap.xml index 3a7874c..ad3bdcd 100644 --- a/_site/sitemap.xml +++ b/_site/sitemap.xml @@ -42,7 +42,7 @@ </url> <url> <loc>/log/site-search/</loc> -<lastmod>2026-01-02T00:00:00+08:00</lastmod> +<lastmod>2026-01-03T00:00:00+08:00</lastmod> </url> <url> <loc>/about/</loc> |
