1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
|
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Search engine (Perl + FastCGI + SA)</title>
<link rel="stylesheet" href="/assets/css/main.css">
<link rel="stylesheet" href="/assets/css/skeleton.css">
</head>
<body>
<div id="nav-container" class="container">
<ul id="navlist" class="left">
<li >
<a href="/" class="link-decor-none">hme</a>
</li>
<li >
<a href="/projects/" class="link-decor-none">poc</a>
</li>
<li >
<a href="/about/" class="link-decor-none">abt</a>
</li>
<li>
<a href="/cgi-bin/find.cgi" class="link-decor-none">lup</a>
</li>
<li>
<a href="/feed.xml" class="link-decor-none">rss</a>
</li>
</ul>
</div>
<main>
<div class="container">
<div class="container-2">
<h2 class="center" id="title">SEARCH ENGINE (PERL + FASTCGI + SA)</h2>
<h5 class="center">03 JANUARY 2026</h5>
<br>
<div class="twocol justify"><p>Article count on the website is growing. Need a way to search.</p>
<p>Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.</p>
<p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p>
<p>httpd, slowcgi, Perl are in the OpenBSD base system. No dependencies. Access
governed by file system permissions–no secrets to manage. Secure by default.</p>
<p>2025-12-30: Regex search. Wrote a 140-line Perl script that searches HTML files
using Regex. Script slurps 500 files (16 KB each) in 40 milliseconds. Fast
enough. Start to feel the O(N) pull at higher file counts.</p>
<p>Regex and the site crawl introduce ReDoS and symlink attack vectors. Both can
be mitigated. Tempted to stop here.</p>
<p>2026-01-03: Suffix array (SA) based index lookup.</p>
<p>Inefficiency of scanning every file on each request troubles me. Regex search
depends almost entirely on hardware for speed.</p>
<p>Built SA index comprising three files: corpus.bin (articles), sa.bin (sorted
byte offsets), file_map.dat (metadata). Index created with the site:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
</code></pre></div></div>
<p>Indexer extracts HTML, lowercases, encodes text as UTF-8 binary sequences, and
saves to corpus.bin. Null byte sentinel marks the document boundaries. Suffix
array stores offsets of all suffixes as 32-bit unsigned integers, sorted by the
lexicographical order:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1);
{
use bytes; # Force compare 8-bit Unicode value comparisons
@sa = sort {
# First 64 bytes check (fast path)
(substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
# Full string fallback (required for correctness)
(substr($corpus, $a) cmp substr($corpus, $b))
} @sa;
}
CORPUS: a s c i \0 i m x
OFFSET: 0 1 2 3 4 5 6 7
SORTED SUFFIXES:
[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx <-- "i" from "asci"
[5] imx <-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
</code></pre></div></div>
<p>Algorithmic complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of a
cache line), reducing complexity to O(N log N).</p>
<p>Search: Textbook range query with two binary searches. Random access to offsets
and the text is possible via the fixed-width offsets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
</code></pre></div></div>
<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low, and was easy
to implement. Note to self: mmap.</p>
<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p>
<p>500 files:</p>
<ul>
<li>Index size: 204.94 KB</li>
<li>Indexing time: 0.1475 s</li>
<li>Peak RAM (SA): 8828 KB</li>
<li>Peak RAM (Regex): 9136 KB</li>
<li>Search (SA): 0.0012 s</li>
<li>Search (Regex): 0.0407 s</li>
</ul>
<p>1,000 files:</p>
<ul>
<li>Index size: 410.51 KB</li>
<li>Indexing time: 0.3101 s</li>
<li>Peak RAM (SA): 8980 KB</li>
<li>Peak RAM (Regex): 9460 KB</li>
<li>Search (SA): 0.0019 s</li>
<li>Search (Regex): 0.0795 s</li>
</ul>
<p>10,000 files:</p>
<ul>
<li>Index size: 4163.44 KB</li>
<li>Indexing time: 10.9661 s</li>
<li>Peak RAM (SA): 12504 KB</li>
<li>Peak RAM (Regex): 12804 KB</li>
<li>Search (SA): 0.0161 s</li>
<li>Search (Regex): 0.9120 s</li>
</ul>
<p>Security: Much of it comes from its architectural simplicity. No dependencies
to manage, no secrets to hide, no assets for lateral movement. Runs in chroot.</p>
<p>Resource exhaustion and XSS attacks are inherent. The former is mitigated by
limiting concurrent searches via lock-file semaphores and capping query length
(64B) and result sets (20). All output is HTML-escaped to prevent XSS.</p>
<p>Verdict: Fast SA lookup; Works on every conceivable web browser.</p>
<p>Commit:
<a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
| Benchmarks:
<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p>
</div>
<p class="post-author right">by W. D. Sadeep Madurange</p>
</div>
</div>
</main>
<div class="footer">
<div class="container">
<div class="twelve columns right container-2">
<p id="footer-text">© ASCIIMX - 2026</p>
</div>
</div>
</div>
</body>
</html>
|