1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
|
---
title: 'Full-text search: SA lookup'
date: 2026-01-03
layout: post
---
Needed search for site.
Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.
Architecture: browser → httpd → slowcgi → Perl CGI script.
SA index implemented with three files: corpus.bin, sa.bin, file_map.dat. Index
built with site:
```
$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
```
Indexer: Extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte
sentinel marks document boundaries. sa.bin stores suffix offsets as 32-bit
unsigned integers, sorted lexicographically:
```
my @sa = 0 .. (length($corpus) - 1);
{
use bytes; # Force compare 8-bit Unicode value comparisons
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
```
Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64
bytes (length of a cache line) → O(N log N).
Search: Textbook range query with twin binary searches. Uses fixed-width
offsets for random access:
```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```
Small seek/reads are fast on modern SSDs. Keeps memory usage low.
NOTE: mmap.
Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear
regex search:
500 files:
- Index size: 204.94 KB
- Indexing time: 0.1475 s
- Peak RAM (SA): 8828 KB
- Peak RAM (Regex): 9136 KB
- Search (SA): 0.0012 s
- Search (Regex): 0.0407 s
1,000 files:
- Index size: 410.51 KB
- Indexing time: 0.3101 s
- Peak RAM (SA): 8980 KB
- Peak RAM (Regex): 9460 KB
- Search (SA): 0.0019 s
- Search (Regex): 0.0795 s
10,000 files:
- Index size: 4163.44 KB
- Indexing time: 10.9661 s
- Peak RAM (SA): 12504 KB
- Peak RAM (Regex): 12804 KB
- Search (SA): 0.0161 s
- Search (Regex): 0.9120 s
Security: httpd, slowcgi, Perl in base system--no dependencies. File system
permissions govern access. Runs in chroot.
Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
concurrent searches. Query length (64B) and result set (20) capped. All output
is HTML-escaped to prevent XSS.
Warranty: 10,000 / 12 → 833 years. Next release: inverted index, year of our
Lord 2859.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9"
class="external" target="_blank" rel="noopener noreferrer">8a4da68</a>
|