From 1a4a6cb6d2aa2c8512e9637dc5dd95997321c444 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Sun, 4 Jan 2026 17:57:39 +0800 Subject: Fix the search engine post. --- _site/feed.xml | 2 +- _site/index.html | 4 +- _site/log/site-search/index.html | 130 ++++++++++++++++++++++++++++++--------- _site/posts.xml | 2 +- _site/sitemap.xml | 2 +- 5 files changed, 105 insertions(+), 35 deletions(-) (limited to '_site') diff --git a/_site/feed.xml b/_site/feed.xml index 4070b39..cb9e91e 100644 --- a/_site/feed.xml +++ b/_site/feed.xml @@ -1 +1 @@ -Jekyll2026-01-03T19:18:57+08:00/feed.xmlASCIIMX | LogW. D. Sadeep MadurangePerl + FastCGI + SA search engine2026-01-02T00:00:00+08:002026-01-02T00:00:00+08:00/log/site-searchW. D. Sadeep MadurangeMatrix Rain: 2025 refactor2025-12-21T00:00:00+08:002025-12-21T00:00:00+08:00/log/matrix-digital-rainW. D. Sadeep MadurangeFingerprint door lock (LP)2025-08-18T00:00:00+08:002025-08-18T00:00:00+08:00/log/fpm-door-lock-lpW. D. Sadeep MadurangeHigh-side MOSFET switching2025-06-22T00:00:00+08:002025-06-22T00:00:00+08:00/log/mosfet-switchesW. D. Sadeep MadurangeATmega328P at 3.3V and 5V2025-06-10T00:00:00+08:002025-06-10T00:00:00+08:00/log/arduino-unoW. D. Sadeep MadurangeFingerprint door lock (RF)2025-06-05T00:00:00+08:002025-06-05T00:00:00+08:00/log/fpm-door-lock-rfW. D. Sadeep MadurangeBumblebee: browser automation2025-04-02T00:00:00+08:002025-04-02T00:00:00+08:00/log/bumblebeeW. D. Sadeep MadurangeBare-metal ATSAM3X8E2024-09-16T00:00:00+08:002024-09-16T00:00:00+08:00/log/arduino-dueW. D. Sadeep MadurangeEtlas: e-paper dashboard2024-09-05T00:00:00+08:002024-09-05T00:00:00+08:00/log/etlasW. D. Sadeep MadurangeESP32 e-reader prototype2023-10-24T00:00:00+08:002023-10-24T00:00:00+08:00/log/e-readerW. D. Sadeep Madurange \ No newline at end of file +Jekyll2026-01-04T17:31:30+08:00/feed.xmlASCIIMX | LogW. D. Sadeep MadurangeSearch engine (Perl + FastCGI + SA)2026-01-03T00:00:00+08:002026-01-03T00:00:00+08:00/log/site-searchW. D. Sadeep MadurangeMatrix Rain: 2025 refactor2025-12-21T00:00:00+08:002025-12-21T00:00:00+08:00/log/matrix-digital-rainW. D. Sadeep MadurangeFingerprint door lock (LP)2025-08-18T00:00:00+08:002025-08-18T00:00:00+08:00/log/fpm-door-lock-lpW. D. Sadeep MadurangeHigh-side MOSFET switching2025-06-22T00:00:00+08:002025-06-22T00:00:00+08:00/log/mosfet-switchesW. D. Sadeep MadurangeATmega328P at 3.3V and 5V2025-06-10T00:00:00+08:002025-06-10T00:00:00+08:00/log/arduino-unoW. D. Sadeep MadurangeFingerprint door lock (RF)2025-06-05T00:00:00+08:002025-06-05T00:00:00+08:00/log/fpm-door-lock-rfW. D. Sadeep MadurangeBumblebee: browser automation2025-04-02T00:00:00+08:002025-04-02T00:00:00+08:00/log/bumblebeeW. D. Sadeep MadurangeBare-metal ATSAM3X8E2024-09-16T00:00:00+08:002024-09-16T00:00:00+08:00/log/arduino-dueW. D. Sadeep MadurangeEtlas: e-paper dashboard2024-09-05T00:00:00+08:002024-09-05T00:00:00+08:00/log/etlasW. D. Sadeep MadurangeESP32 e-reader prototype2023-10-24T00:00:00+08:002023-10-24T00:00:00+08:00/log/e-readerW. D. Sadeep Madurange \ No newline at end of file diff --git a/_site/index.html b/_site/index.html index 39a51a5..ef782ee 100644 --- a/_site/index.html +++ b/_site/index.html @@ -58,11 +58,11 @@ - Perl + FastCGI + SA search engine + Search engine (Perl + FastCGI + SA) - + diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html index 04f98e9..9170002 100644 --- a/_site/log/site-search/index.html +++ b/_site/log/site-search/index.html @@ -3,7 +3,7 @@ - Perl + FastCGI + SA search engine + Search engine (Perl + FastCGI + SA) @@ -37,24 +37,47 @@
-

PERL + FASTCGI + SA SEARCH ENGINE

-
02 JANUARY 2026
+

SEARCH ENGINE (PERL + FASTCGI + SA)

+
03 JANUARY 2026

-

Number of articles growing. Need search.

+

Article count on the website is growing. Need a way to search.

-

Requirements: substring match, case-insensitive, fast, secure. No JavaScript.

+

Requirements: matches substrings, case-insensitive, fast, secure. No +JavaScript.

-

Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.

+

Architecture: browser → httpd → slowcgi → Perl CGI script.

-

Data structure: suffix array. Three files: corpus.bin (articles), sa.bin -(sorted byte offsets), file_map.dat (metadata).

+

httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access +governed by file system permissions–no secrets to manage. Secure by default.

-

Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null -byte sentinel for document boundaries. Sort lexicographically::

+

2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files +using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast +enough. Start to feel the O(N) pull at higher file counts.

-
# Use a block that forces byte-level comparison
+

Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can +be mitigated. Tempted to stop here.

+ +

2026-01-03: Suffix array (SA) based index lookup.

+ +

Inefficiency of scanning every file on each request troubles me. Regex search +depends almost entirely on hardware for speed.

+ +

Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted +byte offsets), file_map.dat (metadata). Index created with the site:

+ +
$ JEKYLL_ENV=production bundle exec jekyll build
+$ cd cgi-bin/
+$ perl indexer.pl
+
+ +

Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and +saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix +array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the +lexicographical order:

+ +
my @sa = 0 .. (length($corpus) - 1);
 {
-    use bytes; 
+    use bytes; # Force compare 8-bit Unicode value comparisons
     @sa = sort { 
         # First 64 bytes check (fast path)
         (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
@@ -62,31 +85,78 @@ byte sentinel for document boundaries. Sort lexicographically::

(substr($corpus, $a) cmp substr($corpus, $b)) } @sa; } -
- -

Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte -length targets cache lines.

-

Search: binary search for range query. Cap at 20 results–define limits or be -surprised by them.

+CORPUS: a s c i \0 i m x +OFFSET: 0 1 2 3 4 5 6 7 -

File IO and memory: many seek/read small chunks beat one large allocation (see -benchmarks for find_one_file.cgi).

+SORTED SUFFIXES: -

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):

+[4] \0imx +[0] asci\0imx +[2] ci\0imx +[3] i\0imx <-- "i" from "asci" +[5] imx <-- "i" from "imx" +[6] mx +[1] sci\0imx +[7] x +
-

1,000 files: 0.31s indexing, 410 KB total index.
-10,000 files: 10.97s indexing, 4.16 MB total index.

+

Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a +cache line), reducing complexity to O(N log N).

-

Search ‘arduino’ (0 matches):
-1,000 files: 0.002s (SA) vs 0.080s (naive regex).
-10,000 files: 0.016s (SA) vs 0.912s (naive regex).

+

Search: Textbook range query with two binary searches. Random access to offsets +and the text is possible via the fixed-width offsets:

-

Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS). -Sanitize input–strip non-printables, limit length, and quote metacharacters -(ReDOS). No exec/system (command injection). Chroot.

+
seek($fh_sa, $mid * 4, 0);
+read($fh_sa, my $bin_off, 4);
+my $off = unpack("L", $bin_off);
+seek($fh_cp, $off, 0);
+read($fh_cp, my $text, $query_len);
+
-

Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.

+

Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy +to implement. Note to self: mmap.

+ +

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).

+ +

500 files:

+
    +
  • Index size: 204.94 KB
  • +
  • Indexing time: 0.1475 s
  • +
  • Peak RAM (SA): 8828 KB
  • +
  • Peak RAM (Regex): 9136 KB
  • +
  • Search (SA): 0.0012 s
  • +
  • Search (Regex): 0.0407 s
  • +
+ +

1,000 files:

+
    +
  • Index size: 410.51 KB
  • +
  • Indexing time: 0.3101 s
  • +
  • Peak RAM (SA): 8980 KB
  • +
  • Peak RAM (Regex): 9460 KB
  • +
  • Search (SA): 0.0019 s
  • +
  • Search (Regex): 0.0795 s
  • +
+ +

10,000 files:

+
    +
  • Index size: 4163.44 KB
  • +
  • Indexing time: 10.9661 s
  • +
  • Peak RAM (SA): 12504 KB
  • +
  • Peak RAM (Regex): 12804 KB
  • +
  • Search (SA): 0.0161 s
  • +
  • Search (Regex): 0.9120 s
  • +
+ +

Security: Much of it comes from its architectural simplicity. No dependencies +to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.

+ +

Resource exhaustion and XSS attacks are inherent. The former is mitigated by +limiting concurrent searches via lock-file semaphores and capping query length +(64B) and result sets (20). All output is HTML-escaped to prevent XSS.

+ +

Verdict: Fast SA lookup; Works on every conceivable web browser.

Commit: 6da102d diff --git a/_site/posts.xml b/_site/posts.xml index 02e7149..b7913a1 100644 --- a/_site/posts.xml +++ b/_site/posts.xml @@ -1 +1 @@ -Jekyll2026-01-03T19:18:57+08:00/posts.xmlASCIIMXW. D. Sadeep Madurange \ No newline at end of file +Jekyll2026-01-04T17:31:30+08:00/posts.xmlASCIIMXW. D. Sadeep Madurange \ No newline at end of file diff --git a/_site/sitemap.xml b/_site/sitemap.xml index 3a7874c..ad3bdcd 100644 --- a/_site/sitemap.xml +++ b/_site/sitemap.xml @@ -42,7 +42,7 @@ /log/site-search/ -2026-01-02T00:00:00+08:00 +2026-01-03T00:00:00+08:00 /about/ -- cgit v1.2.3