summaryrefslogtreecommitdiffstats
path: root/_log/site-search.md
blob: 6af9f633001b9f6c4fe76843b7a67940d535666d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
---
title: Perl + FastCGI + SA search engine
date: 2026-01-02
layout: post
---

Number of articles growing. Need search.

Requirements: substring match, case-insensitive, fast, secure. No JavaScript.

Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.

Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
(sorted byte offsets), file_map.dat (metadata).

Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
byte sentinel for document boundaries. Sort lexicographically::

```
# Use a block that forces byte-level comparison
{
    use bytes; 
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
```

Slow path: O(L⋅N log N). Fast path caps L at 64 bytes → O(N log N). 64-byte
length targets cache lines.

Search: binary search for range query. Cap at 20 results--define limits or be
surprised by them. 

File IO and memory: many seek/read small chunks beat one large allocation (see
benchmarks for find_one_file.cgi).

Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):

1,000 files: 0.31s indexing, 410 KB total index.  
10,000 files: 10.97s indexing, 4.16 MB total index.  

Search 'arduino' (0 matches):  
1,000 files: 0.002s (SA) vs 0.080s (naive regex).  
10,000 files: 0.016s (SA) vs 0.912s (naive regex).

Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
Sanitize input--strip non-printables, limit length, and quote metacharacters
(ReDOS). No exec/system (command injection). Chroot.

Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.

Commit:
[6da102d](https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e)
| Benchmarks:
[8a4da68](https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9)