1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
|
---
title: Perl + FastCGI + SA search engine
date: 2026-01-02
layout: post
---
Number of articles growing. Need search.
Requirements: substring match, case-insensitive, fast, secure. No JavaScript.
Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.
Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
(sorted byte offsets), file_map.dat (metadata).
Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
byte sentinel for document boundaries. Sort lexicographically::
```
# Use a block that forces byte-level comparison
{
use bytes;
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
```
Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte
length targets cache lines.
Search: binary search for range query. Cap at 20 results--define limits or be
surprised by them.
File IO and memory: many seek/read small chunks beat one large allocation (see
benchmarks for find_one_file.cgi).
Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):
1,000 files: 0.31s indexing, 410 KB total index.
10,000 files: 10.97s indexing, 4.16 MB total index.
Search 'arduino' (0 matches):
1,000 files: 0.002s (SA) vs 0.080s (naive regex).
10,000 files: 0.016s (SA) vs 0.912s (naive regex).
Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
Sanitize input--strip non-printables, limit length, and quote metacharacters
(ReDOS). No exec/system (command injection). Chroot.
Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.
Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
| Benchmarks:
[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)
|