From e836c4b9e78cc3892cdebf8126cb650f1b91ed37 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Thu, 8 Jan 2026 22:28:37 +0800 Subject: Tighten prose. --- _site/log/site-search/index.html | 65 +++++++++++++++++++++------------------- 1 file changed, 34 insertions(+), 31 deletions(-) (limited to '_site/log/site-search') diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html index 9170002..e5b9f15 100644 --- a/_site/log/site-search/index.html +++ b/_site/log/site-search/index.html @@ -3,7 +3,7 @@ - Search engine (Perl + FastCGI + SA) + Site search with suffix array @@ -37,43 +37,43 @@
-

SEARCH ENGINE (PERL + FASTCGI + SA)

+

SITE SEARCH WITH SUFFIX ARRAY

03 JANUARY 2026

-

Article count on the website is growing. Need a way to search.

+

Article count is growing. Need a way to search.

Requirements: matches substrings, case-insensitive, fast, secure. No JavaScript.

Architecture: browser → httpd → slowcgi → Perl CGI script.

-

httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access -governed by file system permissions–no secrets to manage. Secure by default.

+

Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file +system permissions govern access.

-

2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files -using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast -enough. Start to feel the O(N) pull at higher file counts.

+

2025-12-30: Regex search.

-

Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can -be mitigated. Tempted to stop here.

+

140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at +higher file counts.

-

2026-01-03: Suffix array (SA) based index lookup.

+

Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to +stop here.

-

Inefficiency of scanning every file on each request troubles me. Regex search -depends almost entirely on hardware for speed.

+

2026-01-03: Suffix Array (SA) based index lookup.

-

Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted -byte offsets), file_map.dat (metadata). Index created with the site:

+

Slurping files on every request bothers me. Regex search depends almost +entirely on hardware for speed.

+ +

Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index +built with site:

$ JEKYLL_ENV=production bundle exec jekyll build
 $ cd cgi-bin/
 $ perl indexer.pl
 
-

Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and -saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix -array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the -lexicographical order:

+

Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null +byte sentinel for document boundaries. sa.bin stores suffix offsets as +32-bit unsigned integers, sorted by lexicographical order:

my @sa = 0 .. (length($corpus) - 1);
 {
@@ -101,11 +101,11 @@ SORTED SUFFIXES:
 [7] x
 
-

Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a -cache line), reducing complexity to O(N log N).

+

Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache +line), reducing complexity to O(N log N).

-

Search: Textbook range query with two binary searches. Random access to offsets -and the text is possible via the fixed-width offsets:

+

Search: Textbook range query with twin binary searches. Random access to +suffixes and the text made possible by the fixed-width offsets:

seek($fh_sa, $mid * 4, 0);
 read($fh_sa, my $bin_off, 4);
@@ -114,8 +114,9 @@ seek($fh_cp, $off, 0);
 read($fh_cp, my $text, $query_len);
 
-

Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy -to implement. Note to self: mmap.

+

Small seek/reads are fast on modern SSDs. Keeps memory usage low.

+ +

NOTE: mmap.

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).

@@ -149,14 +150,16 @@ to implement. Note to self: mmap.

  • Search (Regex): 0.9120 s
  • -

    Security: Much of it comes from its architectural simplicity. No dependencies -to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.

    +

    Security: Derived from architectural simplicity. Zero dependencies to manage, +no secrets to hide, no targets for lateral movement.

    + +

    Runs in chroot.

    -

    Resource exhaustion and XSS attacks are inherent. The former is mitigated by -limiting concurrent searches via lock-file semaphores and capping query length -(64B) and result sets (20). All output is HTML-escaped to prevent XSS.

    +

    Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting +concurrent searches via lock-file semaphores. Query length (64B) and result set +(20) capped. All output is HTML-escaped to prevent XSS.

    -

    Verdict: Fast SA lookup; Works on every conceivable web browser.

    +

    Secure by default. Fast. Durable.

    Commit: 6da102d -- cgit v1.2.3