_log/site-search.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128

---
title: 'Full-text search: SA lookup'
date: 2026-01-03
layout: post
---

Article count is growing. Need a way to search.

Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.

Architecture: browser → httpd → slowcgi → Perl CGI script.

Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
system permissions govern access.

2025-12-30: Regex search.

140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
higher file counts.

Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
stop here.

2026-01-03: Suffix Array (SA) based index lookup.

Slurping files on every request bothers me. Regex search depends almost
entirely on hardware for speed.

Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
built with site:

```
$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
```

Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
byte sentinel for document boundaries. sa.bin stores suffix offsets as
32-bit unsigned integers, sorted by lexicographical order:

```
my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

CORPUS:   a  s  c  i \0  i  m  x
OFFSET:   0  1  2  3  4  5  6  7

SORTED SUFFIXES:

[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx       <-- "i" from "asci"
[5] imx          <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
```

Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
line), reducing complexity to O(N log N).

Search: Textbook range query with twin binary searches. Random access to
suffixes and the text made possible by the fixed-width offsets:

```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```

Small seek/reads are fast on modern SSDs. Keeps memory usage low. 

NOTE: mmap.

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).

500 files:
 - Index size: 204.94 KB 
 - Indexing time: 0.1475 s
 - Peak RAM (SA): 8828 KB
 - Peak RAM (Regex): 9136 KB
 - Search (SA): 0.0012 s
 - Search (Regex): 0.0407 s

1,000 files:
 - Index size: 410.51 KB 
 - Indexing time: 0.3101 s
 - Peak RAM (SA): 8980 KB
 - Peak RAM (Regex): 9460 KB
 - Search (SA): 0.0019 s
 - Search (Regex): 0.0795 s

10,000 files:
 - Index size: 4163.44 KB 
 - Indexing time: 10.9661 s
 - Peak RAM (SA): 12504 KB
 - Peak RAM (Regex): 12804 KB
 - Search (SA): 0.0161 s
 - Search (Regex): 0.9120 s

Security: Derived from architectural simplicity. Zero dependencies to manage,
no secrets to hide, no targets for lateral movement. 

Runs in chroot. 

Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
concurrent searches via lock-file semaphores. Query length (64B) and result set
(20) capped. All output is HTML-escaped to prevent XSS.

Secure by default. Fast. Durable.

Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
| Benchmarks:
[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)