1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
|
---
title: Under-engineered search
date: 2026-01-03
layout: post
---
Developed a suffix-array-based search engine for the site today. While a simple
regex search was enough, couldn't resist the technical elegance of a proper
index.
Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into
UTF-8 bytes. Null byte sentinel marks the document boundaries;
Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin:
```
my @sa = 0 .. (length($corpus) - 1);
{
use bytes; # Force compare raw bytes
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
```
32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
comforting to have.
Search: Textbook range query with two binary searches hosted in a FastCGI
process. Fixed-width offsets enable fast random access to the index:
```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```
Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
security trade-offs. Results may vary by OS.
Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against
linear regex search:
<pre class="pre-no-style">
=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
ARTICLE SIZE: 8 KB
=============================================================
500 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC | SA | REGEX
----------------+----------------------+---------------------
Search time | 0.0014s | 0.0451s
Peak RAM | 8124 KB | 9612 KB
Indexing time | 18.1865s | N/A
Index size | 19610.39 KB | N/A
----------------+----------------------+---------------------
1000 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC | SA | REGEX
----------------+----------------------+---------------------
Search time | 0.0021s | 0.0918s
Peak RAM | 8280 KB | 9960 KB
Indexing time | 43.1748s | N/A
Index size | 39225.06 KB | N/A
----------------+----------------------+---------------------
10000 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC | SA | REGEX
----------------+----------------------+---------------------
Search time | 0.0173s | 1.1275s
Peak RAM | 11848 KB | 13392 KB
Indexing time | 663.3909s | N/A
Index size | 392263.01 KB | N/A
----------------+----------------------+---------------------
</pre>
Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
permissions to govern access. Hardened the system by running it in chroot.
Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
using lock-file semaphores, and capped the query length (64 B) and the result
set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100
8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69
seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to
index 500 articles is untenable.
I under-engineered search.
Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
href="https://git.asciimx.com/site-search-bm/commit/?id=de9d82e8074c9b67a04989f9b6be62890b7c95bb"
class="external" target="_blank" rel="noopener noreferrer">de9d82e</a>
|