_log/site-search.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131

---
title: 'Full-text search: SA lookup'
date: 2026-01-03
layout: post
---

Article count is growing. Need search.

Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.

Architecture: browser → httpd → slowcgi → Perl CGI script.

Perl, httpd, slowcgi are in the base system. Search is anonymous; file system
permissions govern access.

2025-12-30: Regex search.

Threw together a 140-line Perl script; searches 500 files in 40ms. Fast enough.
O(N) drag felt at higher file counts.

Crawling files introduces ReDoS and symlink attack vectors. Both can be
mitigated. Tempted to stop here.

2026-01-03: Suffix array (SA) search.

Regex search depends almost entirely on hardware for speed. Slurping files on
every request bothers me.

Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
built with site:

```
$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
```

Indexer: Extracts HTML, lowercases, and encodes into UTF-8 binary sequences.
Null byte sentinel marks document boundaries. sa.bin stores suffix offsets as
32-bit unsigned integers, sorted by lexicographical order:

```
my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

CORPUS:   a  s  c  i \0  i  m  x
OFFSET:   0  1  2  3  4  5  6  7

SORTED SUFFIXES:

[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx       <-- "i" from "asci"
[5] imx          <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
```

Sort is the bottleneck. Time complexity: O(L⋅N log N). Fast path caps L at 64
bytes (length of a cache line) → O(N log N).

Search: Textbook range query with twin binary searches. Random access to
suffixes and the text made possible by the fixed-width offsets:

```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```

Small seek/reads are fast on modern SSDs. Keeps memory usage low. 

NOTE: mmap.

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).

500 files:
 - Index size: 204.94 KB 
 - Indexing time: 0.1475 s
 - Peak RAM (SA): 8828 KB
 - Peak RAM (Regex): 9136 KB
 - Search (SA): 0.0012 s
 - Search (Regex): 0.0407 s

1,000 files:
 - Index size: 410.51 KB 
 - Indexing time: 0.3101 s
 - Peak RAM (SA): 8980 KB
 - Peak RAM (Regex): 9460 KB
 - Search (SA): 0.0019 s
 - Search (Regex): 0.0795 s

10,000 files:
 - Index size: 4163.44 KB 
 - Indexing time: 10.9661 s
 - Peak RAM (SA): 12504 KB
 - Peak RAM (Regex): 12804 KB
 - Search (SA): 0.0161 s
 - Search (Regex): 0.9120 s

Security: Derived from architectural simplicity. Zero dependencies to manage.
No secrets to hide. No targets for lateral movement. 

Runs in chroot. 

Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
concurrent searches via lock-file semaphores. Query length (64B) and result set
(20) capped. All output is HTML-escaped to prevent XSS.

O(N) drag gone. Architecture durable; secure by default.

Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9"
class="external" target="_blank" rel="noopener noreferrer">8a4da68</a>