summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
blob: 0848dce291f4adf1388fa596acee17e51132bdcf (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
---
title: Built a search engine for website based on suffix arrays
date: 2026-01-03
layout: post
---

Built search.

Requirements: match substrings, ignore case, handle 50 concurrent queries with
1 vCPU, 1 GB RAM, 25 GB SSD. No JavaScript—work with uBlock Origin, NoScript,
text browsers.

Architecture: browser ↔ httpd ↔ slowcgi ↔ search engine.

Server-side regex is viable for a personal site. But an index has clear
advantages. Not that much harder to implement.

Index: suffix array (SA) implemented in Perl. Index— corpus.bin, sa.bin,
file_map.dat—built with site:

```
$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
```

Indexer extracts HTML, lowercases, encodes into UTF-8 bytes. Null byte sentinel
marks document boundaries. sa.bin stores suffix offsets as 32-bit unsigned
integers, sorted lexicographically:

```
my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```

32-bit offsets limit index size to 4GB—more than sufficient and, if necessary,
easily expanded. 

Sort (O(L⋅N log N)) is the real bottleneck. Fast path caps L at 64 bytes
(length of a typical cache line).

Search: Textbook range query with two binary searches. Fixed-width offsets
enable random access to index:

```
seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
```

Small seek/reads are fast on modern SSDs; keeps memory footprint small. 

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear
regex search:

<pre class="pre-no-style">
=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
ARTICLE SIZE: 16 KB
=============================================================

500 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0012s  |             0.0407s
Peak RAM        |              8828 KB |              9136 KB
Indexing time   |             0.1475s  |                 N/A
Index size      |            204.94 KB |                 N/A
-------------------------------------------------------------

1,000 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0019s  |             0.0795s
Peak RAM        |              8980 KB |              9460 KB
Indexing time   |             0.3101s  |                 N/A
Index size      |            410.51 KB |                 N/A
-------------------------------------------------------------

10,000 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0161s  |             0.9120s
Peak RAM        |             12504 KB |             12804 KB
Indexing time   |            10.9661s  |                 N/A
Index size      |           4163.44 KB |                 N/A
-------------------------------------------------------------
</pre>

Seek/read consistently outperformed mmap at <1k files. At 10k, mmap was
occasionally faster (~200 µs), but used more memory—possibly OpenBSD's VM
security trade-offs. Results may vary by OS.

Security: httpd, slowcgi, Perl in OpenBSD base system. File system permissions
govern access. Runs in chroot. 

Resource exhaustion and XSS attacks are inherent. Lock-file semaphores limit
concurrent searches. Query length (64B) and result set (20) are capped. All
output is HTML-escaped to prevent XSS.

Warranty: 10,000 / 12 → 833 years. <br>
Next release: inverted index—Anno Domini 2859.

Commit: <a
href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
Benchmarks: <a
href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9"
class="external" target="_blank" rel="noopener noreferrer">8a4da68</a>