summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
-rw-r--r--.gitignore1
-rw-r--r--_log/site-search.md44
2 files changed, 26 insertions, 19 deletions
diff --git a/.gitignore b/.gitignore
index e3f0e76..b5ed3aa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,5 +3,6 @@
**/*.swp
**/*.dat
**/*.bin
+**/*.core
_site/
diff --git a/_log/site-search.md b/_log/site-search.md
index 62059e0..f87f6e3 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,17 +1,22 @@
---
-title: 'Search: suffix array + FastCGI'
+title: 'Site search: suffix array + FastCGI'
date: 2026-01-03
layout: post
---
-Needed search for site.
+Built search.
-Requirements: match substrings, case-insensitive, fast, secure. No JavaScript.
+Requirements: match substrings, ignore case, handle 50 concurrent queries with
+1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript,
+text browsers.
-Architecture: browser ↔ httpd ↔ slowcgi ↔ Perl (SA).
+Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine.
-Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
-built with site:
+Server-side regex is viable for a personal site. But an index has clear
+advantages. Not much harder to implement.
+
+Index: suffix array (SA) implemented in Perl. Three files: corpus.bin, sa.bin,
+file_map.dat. Built with site:
```
$ JEKYLL_ENV=production bundle exec jekyll build
@@ -19,9 +24,9 @@ $ cd cgi-bin/
$ perl indexer.pl
```
-Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte
-sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit
-unsigned integers, sorted lexicographically:
+Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel
+marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned
+integers, sorted lexicographically:
```
my @sa = 0 .. (length($corpus) - 1);
@@ -36,13 +41,14 @@ my @sa = 0 .. (length($corpus) - 1);
}
```
-Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64
-bytes (length of a cache line) → O(N log N).
+32-bit offsets limit index size to 4GB—more than sufficient and, if necessary,
+easily expanded.
-32-bit offsets limit index size to 4GB.
+Sort is the real bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at
+64 bytes (length of a typical cache line).
Search: Textbook range query with twin binary searches. Fixed-width offsets
-enable random access:
+enable random access to index:
```
seek($fh_sa, $mid * 4, 0);
@@ -52,7 +58,7 @@ seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```
-Small seek/reads are fast on modern SSDs. Keeps memory usage low.
+Small seek/reads are fast on modern SSDs; keeps memory footprint small.
NOTE: mmap.
@@ -81,18 +87,18 @@ regex search:
- Peak RAM (SA): 12504 KB
- Peak RAM (Regex): 12804 KB
- Search (SA): 0.0161 s
- - Search (Regex): 0.9120 S
+ - Search (Regex): 0.9120 s
-Security: httpd, slowcgi, Perl are in the base system--no dependencies. File
-system permissions govern access. Runs in chroot.
+Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions
+govern access. Runs in chroot.
Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
-concurrent searches. Query length (64B) and result set (20) are capped. All
+concurrent searches; query length (64B) and result set (20) are capped. All
output is HTML-escaped to prevent XSS.
Warranty: 10,000 / 12 → 833 years.
-Next release: inverted index; Anno Domini 2859.
+Next: inverted index, release date: Anno Domini 2859.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"