<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Perl + FastCGI + SA search engine</title>
<link rel="stylesheet" href="/assets/css/main.css">
<link rel="stylesheet" href="/assets/css/skeleton.css">
</head>
<body>
<div id="nav-container" class="container">
<ul id="navlist" class="left">
<li >
<a href="/" class="link-decor-none">hme</a>
</li>
<li >
<a href="/projects/" class="link-decor-none">poc</a>
</li>
<li >
<a href="/about/" class="link-decor-none">abt</a>
</li>
<li>
<a href="/cgi-bin/find.cgi" class="link-decor-none">lup</a>
</li>
<li>
<a href="/feed.xml" class="link-decor-none">rss</a>
</li>
</ul>
</div>
<main>
<div class="container">
<div class="container-2">
<h2 class="center" id="title">PERL + FASTCGI + SA SEARCH ENGINE</h2>
<h5 class="center">02 JANUARY 2026</h5>
<br>
<div class="twocol justify"><p>Number of articles growing. Need search.</p>
<p>Requirements: substring match, case-insensitive, fast, secure. No JavaScript.</p>
<p>Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.</p>
<p>Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
(sorted byte offsets), file_map.dat (metadata).</p>
<p>Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
byte sentinel for document boundaries. Sort lexicographically:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison
{
use bytes;
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
</code></pre></div></div>
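<p>Slow path: O(L⋅N log N), where N is the corpus size in bytes and L the number
of bytes a comparison scans. Fast path caps L at 64 bytes → O(N log N); the
full-string fallback runs only when the first 64 bytes tie. 64-byte length
targets cache lines.</p>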
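<p>A minimal sketch of the indexing steps above, on a toy corpus. Assumptions,
not taken from the commit: the <code>@docs</code> texts are hypothetical and
already extracted, and the 32-bit little-endian layout for sa.bin is a guess at
what "sorted byte offsets" could look like on disk.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Hypothetical, already-extracted article texts.
my @docs = ("Arduino Notes", "Perl and FastCGI");

# Lowercase each document and append a NUL byte as the boundary sentinel.
my $corpus = join '', map { lc($_) . "\0" } @docs;

# One suffix per byte offset, sorted with byte-level comparison.
my @sa;
{
    use bytes;
    @sa = sort { substr($corpus, $a) cmp substr($corpus, $b) }
          0 .. length($corpus) - 1;
}

# Assumed on-disk layout: one 32-bit little-endian offset per suffix.
my $sa_bin = pack 'V*', @sa;

printf "%d bytes of corpus, %d bytes of index\n",
    length($corpus), length($sa_bin);
</code></pre></div></div>
<p>With 4-byte offsets the index is four times the corpus size. The toy sort
skips the 64-byte fast path for brevity.</p>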
<p>Search: binary search for range query. Cap at 20 results–define limits or be
surprised by them.</p>
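<p>The range query can be sketched as two binary searches over the suffix
array: one for the first suffix that starts with the query, one for the first
suffix past it. In-memory toy version; <code>bound</code> and
<code>search</code> are illustrative names, not the functions in find.cgi.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Toy corpus; in the post these would come from corpus.bin and sa.bin.
my $corpus = "arduino notes\0perl and fastcgi\0";
my @sa = sort { substr($corpus, $a) cmp substr($corpus, $b) }
         0 .. length($corpus) - 1;

# First index in @sa whose suffix, truncated to the query length,
# compares >= the query ($strict = 0) or > the query ($strict = 1).
sub bound {
    my ($q, $strict) = @_;
    my ($lo, $hi) = (0, scalar @sa);
    while ($hi > $lo) {
        my $mid = int(($lo + $hi) / 2);
        my $c = substr($corpus, $sa[$mid], length $q) cmp $q;
        if ($c == -1 or ($strict and $c == 0)) { $lo = $mid + 1 }
        else                                   { $hi = $mid }
    }
    return $lo;
}

# Return at most $limit corpus offsets where the query occurs.
sub search {
    my ($query, $limit) = @_;
    my $q = lc $query;
    my ($first, $end) = (bound($q, 0), bound($q, 1));
    my @hits = @sa[$first .. $end - 1];
    splice @hits, $limit if @hits > $limit;
    return @hits;
}

print join(',', sort { $a - $b } search('A', 20)), "\n";
</code></pre></div></div>
<p>Every match sits in one contiguous slice of the array, so the result cap is
a single <code>splice</code>.</p>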
<p>File IO and memory: many small seek/read calls beat one large allocation (see
benchmarks for find_one_file.cgi).</p>
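<p>A sketch of the small-read approach: fetch one suffix offset per seek/read
pair instead of loading the index whole. The temp file stands in for sa.bin;
the 4-byte little-endian record format and the <code>sa_at</code> helper are
assumptions for illustration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;
use Fcntl qw(O_RDONLY SEEK_SET);
use File::Temp qw(tempfile);

# Write a small stand-in for sa.bin (assumed format: 32-bit LE offsets).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} pack('V*', 30, 13, 0, 19);
close $out or die "close: $!";

sysopen(my $fh, $path, O_RDONLY) or die "open: $!";
binmode $fh;

# Fetch the i-th suffix offset with one seek and one 4-byte read,
# instead of slurping the whole index into memory.
sub sa_at {
    my ($i) = @_;
    seek($fh, $i * 4, SEEK_SET) or die "seek: $!";
    my $n = read($fh, my $buf, 4);
    die "short read" unless defined $n and $n == 4;
    return unpack 'V', $buf;
}

print sa_at(2), "\n";
</code></pre></div></div>
<p>A binary search needs about log2(N) such reads per bound, so memory stays
flat as the index grows.</p>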
<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16 KB articles):</p>
<p>1,000 files: 0.31s indexing, 410 KB total index.<br />
10,000 files: 10.97s indexing, 4.16 MB total index.</p>
<p>Search ‘arduino’ (0 matches):<br />
1,000 files: 0.002s (SA) vs 0.080s (naive regex).<br />
10,000 files: 0.016s (SA) vs 0.912s (naive regex).</p>
<p>Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
Sanitize input: strip non-printables, limit length, and quote metacharacters
(ReDoS). No exec/system (command injection). Chroot.</p>
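<p>Input sanitizing could look roughly like this. Assumptions: the 64-character
cap and the <code>sanitize_query</code> name are hypothetical, and the filter
keeps printable ASCII only, which may be stricter than the real script.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Hypothetical limit; the post does not state the actual cap.
my $MAX_QUERY_LEN = 64;

sub sanitize_query {
    my ($raw) = @_;
    my $q = defined $raw ? $raw : '';
    $q =~ s/[^\x20-\x7e]//g;              # strip non-printable bytes
    $q = substr($q, 0, $MAX_QUERY_LEN);   # limit length
    return lc $q;                         # corpus is lowercased too
}

# A query full of regex metacharacters plus a stray NUL byte.
my $q = sanitize_query("(a+)+\x00\$");

# quotemeta escapes every metacharacter, so the query can only ever
# match literally if it reaches a regex.
my $pattern = quotemeta $q;

print "literal match\n" if "x(a+)+\$y" =~ /$pattern/;
</code></pre></div></div>
<p>Quoting matters mostly for the regex path. The suffix array lookup compares
bytes and never compiles the query as a pattern.</p>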
<p>Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.</p>
<p>Commit:
<a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
| Benchmarks:
<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p>
</div>
<p class="post-author right">by W. D. Sadeep Madurange</p>
</div>
</div>
</main>
<div class="footer">
<div class="container">
<div class="twelve columns right container-2">
<p id="footer-text">© ASCIIMX - 2026</p>
</div>
</div>
</div>
</body>
</html>