summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
blob: d178d07ab5f772b0f1056c8df2332c9c3b19229a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
title: Search engine (Perl + FastCGI + SA)
date: 2026-01-03
layout: post
---

Article count on the website is growing. Need a way to search.

Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.

Architecture: browser → httpd → slowcgi → Perl CGI script.

httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
governed by file system permissions--no secrets to manage. Secure by default.

2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
enough. Start to feel the O(N) pull at higher file counts.

Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
be mitigated. Tempted to stop here.

2026-01-03: Suffix array (SA) based index lookup.

Inefficiency of scanning every file on each request troubles me. Regex search
depends almost entirely on hardware for speed.

Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
byte offsets), file_map.dat (metadata). Index created with the site:

```
$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
```

Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
lexicographical order:

```
my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

CORPUS:   a  s  c  i \0  i  m  x
OFFSET:   0  1  2  3  4  5  6  7

SORTED SUFFIXES:

[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx       <-- "i" from "asci"
[5] imx          <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
```

Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
cache line), reducing complexity to O(N log N).

Search: Textbook range query with two binary searches. Random access to offsets
and the text is possible via the fixed-width offsets:

```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```

Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
to implement. Note to self: mmap.

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).

500 files:
 - Index size: 204.94 KB 
 - Indexing time: 0.1475 s
 - Peak RAM (SA): 8828 KB
 - Peak RAM (Regex): 9136 KB
 - Search (SA): 0.0012 s
 - Search (Regex): 0.0407 s

1,000 files:
 - Index size: 410.51 KB 
 - Indexing time: 0.3101 s
 - Peak RAM (SA): 8980 KB
 - Peak RAM (Regex): 9460 KB
 - Search (SA): 0.0019 s
 - Search (Regex): 0.0795 s

10,000 files:
 - Index size: 4163.44 KB 
 - Indexing time: 10.9661 s
 - Peak RAM (SA): 12504 KB
 - Peak RAM (Regex): 12804 KB
 - Search (SA): 0.0161 s
 - Search (Regex): 0.9120 s

Security: Much of it comes from its architectural simplicity. No dependencies
to manage, no secrets to hide, no assets for lateral movement. Runs in chroot. 

Resource exhaustion and XSS attacks are inherent. The former is mitigated by
limiting concurrent searches via lock-file semaphores and capping query length
(64B) and result sets (20). All output is HTML-escaped to prevent XSS.

Verdict: Fast SA lookup; Works on every conceivable web browser.

Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
| Benchmarks:
[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)