<!DOCTYPE html>
<html>
    <head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Perl + FastCGI + SA search engine</title>
  <link rel="stylesheet" href="/assets/css/main.css">
  <link rel="stylesheet" href="/assets/css/skeleton.css">
</head>


  <body>

    <div id="nav-container" class="container">
  <ul id="navlist" class="left">
    
    <li >
      <a href="/" class="link-decor-none">hme</a>
    </li>
    <li >
      <a href="/projects/" class="link-decor-none">poc</a>
    </li>
    <li >
      <a href="/about/" class="link-decor-none">abt</a>
    </li>
    <li>
      <a href="/cgi-bin/find.cgi" class="link-decor-none">lup</a>
    </li>
    <li>
      <a href="/feed.xml" class="link-decor-none">rss</a>
    </li>
  </ul>
</div>



    <main>
      <div class="container">
        <div class="container-2">
          <h2 class="center" id="title">PERL + FASTCGI + SA SEARCH ENGINE</h2>
          <h5 class="center">02 JANUARY 2026</h5>
          <br>
          <div class="twocol justify"><p>Number of articles growing. Need search.</p>

<p>Requirements: substring match, case-insensitive, fast, secure. No JavaScript.</p>

<p>Architecture: OpenBSD httpd → slowcgi (FastCGI) → Perl script.</p>
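
<p>Minimal sketch of the script end of that chain, assuming slowcgi hands the
request over as plain CGI (query in QUERY_STRING, response on stdout).
parse_query and the parameter name q are hypothetical, not taken from
find.cgi:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode a CGI query string into a hash.
sub parse_query {
    my ($qs) = @_;
    my %param;
    for my $pair (split /[&amp;;]/, $qs // '') {
        my ($k, $v) = split /=/, $pair, 2;
        next unless defined $k and length $k;
        $v //= '';
        for ($k, $v) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX percent-decoding
        }
        $param{$k} = $v;
    }
    return %param;
}

my %param = parse_query($ENV{QUERY_STRING});
my $q = $param{q} // '';   # 'q' is an assumed parameter name

# CGI response: headers, blank line, body. Never echo $q unescaped.
print "Content-Type: text/plain\r\n\r\n";
printf "query length: %d\n", length $q;
</code></pre></div></div>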

<p>Data structure: suffix array. Three files: corpus.bin (articles), sa.bin
(sorted byte offsets), file_map.dat (metadata).</p>
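
<p>Rough illustration of what sa.bin holds: every byte offset in the corpus,
ordered by the suffix that starts there. The toy corpus, the build_sa name and
the 32-bit pack format are assumptions; the real on-disk layout may differ.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Return the byte offsets of $corpus, sorted by the suffix starting at each.
sub build_sa {
    my ($corpus) = @_;
    use bytes;   # compare bytes, not characters
    return sort { substr($corpus, $a) cmp substr($corpus, $b) }
           0 .. length($corpus) - 1;
}

my @sa = build_sa("banana");
print "@sa\n";   # 5 3 1 0 4 2  (a, ana, anana, banana, na, nana)

# Assumed serialization: one 32-bit big-endian integer per offset.
my $sa_bin = pack "N*", @sa;
printf "%d bytes\n", length $sa_bin;   # 24 bytes
</code></pre></div></div>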

<p>Indexer crawls posts, extracts HTML with regex, lowercases, concatenates. Null
byte sentinel for document boundaries. Sort lexicographically:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use a block that forces byte-level comparison
{
    use bytes; 
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}
</code></pre></div></div>
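
<p>The corpus that this sort runs over could be assembled along these lines.
A sketch only: the tag-stripping regex, the build_corpus name and the shape of
the offset map are assumptions, not the indexer's actual code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Hypothetical: turn [path, html] pairs into one lowercase corpus with a
# NUL byte after each document, plus a map of document start offsets.
sub build_corpus {
    my @docs = @_;
    my ($corpus, @file_map) = ('');
    for my $doc (@docs) {
        my ($path, $html) = @$doc;
        (my $text = $html) =~ s/&lt;[^&gt;]*&gt;//g;   # crude tag removal
        $text =~ tr/\0//d;                    # keep NUL free for the sentinel
        push @file_map, [length($corpus), $path];
        $corpus .= lc($text) . "\0";
    }
    return ($corpus, \@file_map);
}

my ($corpus, $map) = build_corpus(
    ['a.html', '&lt;p&gt;Hello&lt;/p&gt;'],
    ['b.html', '&lt;h1&gt;World&lt;/h1&gt;'],
);
# $corpus is "hello\0world\0"; b.html starts at offset 6
</code></pre></div></div>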

<p>Slow path: O(L⋅N log N), where L is the compared suffix length. Fast path
caps L at 64 bytes → O(N log N) whenever the first 64 bytes differ. 64-byte
length matches a typical cache line.</p>

<p>Search: binary search over the suffix array for the range of matches. Cap at
20 results: define limits or be surprised by them.</p>
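
<p>Sketch of the range query, held in memory for brevity. Every suffix that
starts with the query sits in one contiguous run of the suffix array, so two
binary searches give its bounds. The bound and search names and the in-memory
arrays are assumptions; the limit of 20 is from the text.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;
use bytes;

# First index in @$sa whose suffix, cut to the query length, is
# &gt;= $q (when $strict is false) or &gt; $q (when $strict is true).
sub bound {
    my ($corpus, $sa, $q, $strict) = @_;
    my ($lo, $hi) = (0, scalar @$sa);
    while ($lo &lt; $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $c = substr($corpus, $$sa[$mid], length $q) cmp $q;
        if ($c &lt; 0 or ($strict and $c == 0)) { $lo = $mid + 1 }
        else                                 { $hi = $mid }
    }
    return $lo;
}

# Return up to $limit corpus offsets where $q occurs.
sub search {
    my ($corpus, $sa, $q, $limit) = @_;
    my $first = bound($corpus, $sa, $q, 0);
    my $last  = bound($corpus, $sa, $q, 1) - 1;
    $last = $first + $limit - 1 if $last - $first + 1 &gt; $limit;
    return @{$sa}[$first .. $last];
}

my $corpus = "banana";
my @sa = sort { substr($corpus, $a) cmp substr($corpus, $b) } 0 .. 5;
my @hits = search($corpus, \@sa, "an", 20);
print "@hits\n";   # 3 1
</code></pre></div></div>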

<p>File IO and memory: many small seek/read calls beat one large allocation (see
benchmarks for find_one_file.cgi).</p>
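
<p>Roughly what the seek/read approach looks like: fetch single suffix-array
entries and short corpus windows on demand instead of slurping either file.
The 4-byte big-endian entry format and the helper names are assumptions.
In-memory filehandles stand in for sa.bin and corpus.bin to keep the sketch
self-contained.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Read the $i-th entry of the suffix array file (assumed: 32-bit big-endian).
sub sa_entry {
    my ($fh, $i) = @_;
    seek($fh, $i * 4, 0) or die "seek: $!";
    my $n = read($fh, my $buf, 4);
    die "short read" unless defined $n and $n == 4;
    return unpack "N", $buf;
}

# Read up to $len bytes of the corpus starting at byte $off.
sub corpus_window {
    my ($fh, $off, $len) = @_;
    seek($fh, $off, 0) or die "seek: $!";
    defined(read($fh, my $buf, $len)) or die "read: $!";
    return $buf;
}

my $sa_bytes     = pack "N*", 5, 3, 1, 0, 4, 2;
my $corpus_bytes = "banana";
open my $sa_fh, '&lt;', \$sa_bytes     or die "open: $!";
open my $c_fh,  '&lt;', \$corpus_bytes or die "open: $!";

my $off = sa_entry($sa_fh, 1);              # 3
print corpus_window($c_fh, $off, 2), "\n";  # an
</code></pre></div></div>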

<p>Benchmarks on T490 (i7-10510U, OpenBSD 7.8, 16KB articles):</p>

<p>1,000 files: 0.31s indexing, 410 KB total index.<br />
10,000 files: 10.97s indexing, 4.16 MB total index.</p>

<p>Search ‘arduino’ (0 matches):<br />
1,000 files: 0.002s (SA) vs 0.080s (naive regex).<br />
10,000 files: 0.016s (SA) vs 0.912s (naive regex).</p>

<p>Security. Semaphore (lock files) limits parallel queries. Escape HTML (XSS).
Sanitize input: strip non-printables, limit length, and quote metacharacters
(ReDoS). No exec/system (command injection). Chroot.</p>
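
<p>Sketch of the input handling listed above. The 64-byte cap, the
ASCII-only filter, the lowercasing step and both function names are
assumptions; quotemeta is Perl's built-in for neutralising regex
metacharacters.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use strict;
use warnings;

# Hypothetical limit; the text only says the length is limited.
my $MAX_QUERY_LEN = 64;

# Strip non-printable bytes, cap the length, lowercase to match the corpus.
sub sanitize_query {
    my ($q) = @_;
    $q = '' unless defined $q;
    $q =~ tr/\x20-\x7e//cd;             # keep printable ASCII only (assumed)
    $q = substr($q, 0, $MAX_QUERY_LEN);
    return lc $q;
}

# Escape the five characters that matter in HTML text and attributes.
sub escape_html {
    my ($s) = @_;
    my %ent = ('&amp;' =&gt; '&amp;amp;', '&lt;' =&gt; '&amp;lt;', '&gt;' =&gt; '&amp;gt;',
               '"' =&gt; '&amp;quot;', "'" =&gt; '&amp;#39;');
    $s =~ s/([&amp;&lt;&gt;"'])/$ent{$1}/g;
    return $s;
}

my $q = sanitize_query("&lt;B&gt;a.*\x00\x07");
print escape_html($q), "\n";   # &amp;lt;b&amp;gt;a.*
print quotemeta($q), "\n";     # \&lt;b\&gt;a\.\*
</code></pre></div></div>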

<p>Verdict: Fast SA lookup. Primary attack vectors mitigated. No dependencies.</p>

<p>Commit:
<a href="https://git.asciimx.com/www/commit/?h=term&amp;id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e">6da102d</a>
| Benchmarks:
<a href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9">8a4da68</a></p>
</div>
          <p class="post-author right">by W. D. Sadeep Madurange</p>
        </div>
      </div>
    </main>

    <div class="footer">
  <div class="container">
    <div class="twelve columns right container-2">
      <p id="footer-text">&copy; ASCIIMX - 2026</p>
    </div>
  </div>
</div>


  </body>
</html>