summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
blob: d4d7fe41e58ed27e17194ed701e4465fa652dbf6 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
---
title: Overengineered search
date: 2026-01-03
layout: post
---

Developed a suffix-array-based search engine. While a simple regex search
would've been sufficient, couldn't resist the technical elegance of a proper
index.

Indexer crawls the HTML, lowercases the text, and encodes it into UTF-8 bytes.
Null byte sentinels mark document boundaries; sa.bin stores lexicographically
sorted 32-bit unsigned integer offsets:

```
my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare raw bytes
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```

32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
comforting to have.

O(L⋅N log N) sort is the bottleneck. 100 4.1 KB articles took 97.9s to index.
L=64 fast path reduces that to 1.31s. Experimented with 16, 32, 128, and 256
bytes; 64 was the sweet spot—lower values were marginally faster, higher ones
marginally slower.

Implemented search using a textbook range query with two binary searches,
hosted in a FastCGI process. Fixed-width offsets allow fast random access to
the index:

```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```

Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
security trade-offs. Results may vary by OS.

Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
permissions to govern access. Hardened the system by running it in chroot.

Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
using lock-file semaphores, and capped the query length (64 B) and the result
set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.

Benchmarks: My articles have a 3.42 KB median, 3.43 KB mean, and 5.39 KB max.
Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 4.1 KB) against
linear regex search:

<pre class="pre-no-style">
=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
ARTICLE SIZE: 4.1 KB
=============================================================

100 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC          | SA                   | REGEX               
----------------+----------------------+---------------------
Search time     | 0.0009s              | 0.0084s             
Peak RAM        | 7968 KB              | 9676 KB             
Indexing time   | 1.3332s              | N/A                 
Index size      | 2070.38 KB           | N/A                 
----------------+----------------------+---------------------

300 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC          | SA                   | REGEX               
----------------+----------------------+---------------------
Search time     | 0.0009s              | 0.0242s             
Peak RAM        | 8024 KB              | 9680 KB             
Indexing time   | 4.5658s              | N/A                 
Index size      | 6211.76 KB           | N/A                 
----------------+----------------------+---------------------

5000 files (Targeting: keyword_-1):
----------------+----------------------+---------------------
METRIC          | SA                   | REGEX               
----------------+----------------------+---------------------
Search time     | 0.0088s              | 0.4937s             
Peak RAM        | 9948 KB              | 11436 KB            
Indexing time   | 138.7510s            | N/A                 
Index size      | 103557.18 KB         | N/A                 
----------------+----------------------+---------------------
</pre>

Search scales well—0.9 ms at 100 files, 8.8 ms at 5000. Indexing doesn't. 4.5s
at 300 files is tolerable; 138s at 5000 is impractical.

Warranty: 300 / 6 → 50 years. 

Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
href="https://git.asciimx.com/site-search-bm/commit/?id=f6d7c3fdbecbcb880c0c02fdffefa1f467c46b03"
class="external" target="_blank" rel="noopener noreferrer">f6d7c3f</a>