summaryrefslogtreecommitdiffstats
path: root/_site/log/site-search/index.html
blob: 77fe19efc3a692016db627c8724ade4603b73d8d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
<!DOCTYPE html>
<html>
    <head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Full-text search: SA lookup</title>
  <link rel="stylesheet" href="/assets/css/main.css">
  <link rel="stylesheet" href="/assets/css/skeleton.css">
</head>


  <body>

    <div id="nav-container" class="container">
  <ul id="navlist" class="left">
    
    <li >
      <a href="/" class="link-decor-none">hme</a>
    </li>
    <li >
      <a href="/projects/" class="link-decor-none">poc</a>
    </li>
    <li >
      <a href="/about/" class="link-decor-none">abt</a>
    </li>
    <li>
      <a href="/cgi-bin/find.cgi" class="link-decor-none">lup</a>
    </li>
    <li>
      <a href="/feed.xml" class="link-decor-none">rss</a>
    </li>
  </ul>
</div>



    <main>
      <div class="container">
        <div class="container-2">
          <h2 class="center" id="title">FULL-TEXT SEARCH: SA LOOKUP</h2>
          <h5 class="center">03 JANUARY 2026</h5>
          <br>
          <div class="twocol justify"><p>Article count is growing. Need a way to search.</p>

<p>Requirements: matches substrings, case-insensitive, fast, secure. No
JavaScript.</p>

<p>Architecture: browser → httpd → slowcgi → Perl CGI script.</p>

<p>Perl, httpd, slowcgi are in the OpenBSD base system. Instead of secrets, file
system permissions govern access.</p>

<p>2025-12-30: Regex search.</p>

<p>140-line Perl script searches 500 files in 40ms. Fast enough; O(N) pull felt at
higher file counts.</p>

<p>Introduces ReDoS and symlink attack vectors. Both can be mitigated. Tempted to
stop here.</p>

<p>2026-01-03: Suffix Array (SA) based index lookup.</p>

<p>Slurping files on every request bothers me. Regex search depends almost
entirely on hardware for speed.</p>

<p>Implemented SA index with three files: corpus.bin, sa.bin, file_map.dat. Index
built with site:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ JEKYLL_ENV=production bundle exec jekyll build
$ cd cgi-bin/
$ perl indexer.pl
</code></pre></div></div>

<p>Indexer extracts HTML, lowercases, encodes into UTF-8 binary sequences. Null
byte sentinel for document boundaries. sa.bin stores suffix offsets as
32-bit unsigned integers, sorted by lexicographical order:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

CORPUS:   a  s  c  i \0  i  m  x
OFFSET:   0  1  2  3  4  5  6  7

SORTED SUFFIXES:

[4] \0imx
[0] asci\0imx
[2] ci\0imx
[3] i\0imx       &lt;-- "i" from "asci"
[5] imx          &lt;-- "i" from "imx"
[6] mx
[1] sci\0imx
[7] x
</code></pre></div></div>

<p>Time complexity: O(L⋅N log N). Fast path caps L at 64 bytes (length of cache
line), reducing complexity to O(N log N).</p>

<p>Search: Textbook range query with twin binary searches. Random access to
suffixes and the text made possible by the fixed-width offsets:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);
</code></pre></div></div>

<p>Small seek/reads are fast on modern SSDs. Keeps memory usage low.</p>

<p>NOTE: mmap.</p>

<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB).</p>

<p>500 files:</p>
<ul>
  <li>Index size: 204.94 KB</li>
  <li>Indexing time: 0.1475 s</li>
  <li>Peak RAM (SA): 8828 KB</li>
  <li>Peak RAM (Regex): 9136 KB</li>
  <li>Search (SA): 0.0012 s</li>
  <li>Search (Regex): 0.0407 s</li>
</ul>

<p>1,000 files:</p>
<ul>
  <li>Index size: 410.51 KB</li>
  <li>Indexing time: 0.3101 s</li>
  <li>Peak RAM (SA): 8980 KB</li>
  <li>Peak RAM (Regex): 9460 KB</li>
  <li>Search (SA): 0.0019 s</li>
  <li>Search (Regex): 0.0795 s</li>
</ul>

<p>10,000 files:</p>
<ul>
  <li>Index size: 4163.44 KB</li>
  <li>Indexing time: 10.9661 s</li>
  <li>Peak RAM (SA): 12504 KB</li>
  <li>Peak RAM (Regex): 12804 KB</li>
  <li>Search (SA): 0.0161 s</li>
  <li>Search (Regex): 0.9120 s</li>
</ul>

<p>Security: Derived from architectural simplicity. Zero dependencies to manage,
no secrets to hide, no targets for lateral movement.</p>

<p>Runs in chroot.</p>

<p>Resource exhaustion and XSS attacks are inherent. Former mitigated by limiting
concurrent searches via lock-file semaphores. Query length (64B) and result set
(20) capped. All output is HTML-escaped to prevent XSS.</p>

<p>Secure by default. Fast. Durable.</p>

<p>Commit:
<a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
| Benchmarks:
<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p>
</div>
          <p class="post-author right">by W. D. Sadeep Madurange</p>
        </div>
      </div>
    </main>

    <div class="footer">
  <div class="container">
    <div class="twelve columns right container-2">
      <p id="footer-text">&copy; ASCIIMX - 2026</p>
    </div>
  </div>
</div>


  </body>
</html>