From a851a2d646f439f7126c232ba1524c55a8990872 Mon Sep 17 00:00:00 2001 From: Sadeep Madurange Date: Fri, 9 Jan 2026 16:45:56 +0800 Subject: Remove _site from git. --- _site/log/site-search/index.html | 184 --------------------------------------- 1 file changed, 184 deletions(-) delete mode 100644 _site/log/site-search/index.html (limited to '_site/log/site-search/index.html') diff --git a/_site/log/site-search/index.html b/_site/log/site-search/index.html deleted file mode 100644 index 77fe19e..0000000 --- a/_site/log/site-search/index.html +++ /dev/null @@ -1,184 +0,0 @@ - - - - - - Full-text search: SA lookup - - - - - - - - - - - -
-
-
-

FULL-TEXT SEARCH: SA LOOKUP

-
03 JANUARY 2026
-
-

Article count is growing. Need a way to search.

- -

Requirements: matches substrings, case-insensitive, fast, secure. No -JavaScript.

- -

Architecture: browser → httpd → slowcgi → Perl CGI script.

- -

Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file -system permissions govern access.

- -

2025-12-30: Regex search.

- -

140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at -higher file counts.

- -

Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to -stop here.

- -

2026-01-03: Suffix Array (SA) based index lookup.

- -

Slurping files on every request bothers me. Regex search depends almost -entirely on hardware for speed.

- -

Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index -built with site:

- -
$ JEKYLL_ENV=production bundle exec jekyll build
-$ cd cgi-bin/
-$ perl indexer.pl
-
- -

Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null -byte sentinel for document boundaries. sa.bin stores suffix offsets as -32-bit unsigned integers, sorted by lexicographical order:

- -
my @sa = 0 .. (length($corpus) - 1);
-{
-    use bytes; # Force compare 8-bit Unicode value comparisons
-    @sa = sort { 
-        # First 64 bytes check (fast path)
-        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
-        # Full string fallback (required for correctness)
-        (substr($corpus, $a) cmp substr($corpus, $b))
-    } @sa;
-}
-
-CORPUS:   a  s  c  i \0  i  m  x
-OFFSET:   0  1  2  3  4  5  6  7
-
-SORTED SUFFIXES:
-
-[4] \0imx
-[0] asci\0imx
-[2] ci\0imx
-[3] i\0imx       <-- "i" from "asci"
-[5] imx          <-- "i" from "imx"
-[6] mx
-[1] sci\0imx
-[7] x
-
- -

Time complexity: O(Lā‹…N log N). Fast path caps L at 64 bytes (length of cache -line), reducing complexity to O(N log N).

- -

Search: Textbook range query with twin binary searches. Random access to -suffixes and the text made possible by the fixed-width offsets:

- -
seek($fh_sa, $mid * 4, 0);
-read($fh_sa, my $bin_off, 4);
-my $off = unpack("L", $bin_off);
-seek($fh_cp, $off, 0);
-read($fh_cp, my $text, $query_len);
-
- -

Small seek/reads are fast on modern SSDs. Keeps memory usage low.

- -

NOTE: mmap.

- -

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).

- -

500 files:

-
    -
  • Index size: 204.94 KB
  • -
  • Indexing time: 0.1475 s
  • -
  • Peak RAM (SA): 8828 KB
  • -
  • Peak RAM (Regex): 9136 KB
  • -
  • Search (SA): 0.0012 s
  • -
  • Search (Regex): 0.0407 s
  • -
- -

1,000 files:

-
    -
  • Index size: 410.51 KB
  • -
  • Indexing time: 0.3101 s
  • -
  • Peak RAM (SA): 8980 KB
  • -
  • Peak RAM (Regex): 9460 KB
  • -
  • Search (SA): 0.0019 s
  • -
  • Search (Regex): 0.0795 s
  • -
- -

10,000 files:

-
    -
  • Index size: 4163.44 KB
  • -
  • Indexing time: 10.9661 s
  • -
  • Peak RAM (SA): 12504 KB
  • -
  • Peak RAM (Regex): 12804 KB
  • -
  • Search (SA): 0.0161 s
  • -
  • Search (Regex): 0.9120 s
  • -
- -

Security: Derived from architectural simplicity. Zero dependencies to manage, -no secrets to hide, no targets for lateral movement.

- -

Runs in chroot.

- -

Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting -concurrent searches via lock-file semaphores. Query length (64B) and result set -(20) capped. All output is HTML-escaped to prevent XSS.

- -

Secure by default. Fast. Durable.

- -

Commit: -6da102d -| Benchmarks: -8a4da68

-
- -
-
-
- - - - - - -- cgit v1.2.3