| -rw-r--r-- | _log/bumblebee.md | 34 |
| -rw-r--r-- | _log/fpm-door-lock-rf.md | 2 |
| -rw-r--r-- | _log/site-search.md | 39 |
| -rw-r--r-- | _log/vcs-1.md | 21 |
4 files changed, 46 insertions, 50 deletions
diff --git a/_log/bumblebee.md b/_log/bumblebee.md
index bbe0490..a5e761d 100644
--- a/_log/bumblebee.md
+++ b/_log/bumblebee.md
@@ -1,40 +1,32 @@
 ---
-title: Built a browser automation script synthesizer
+title: Built a web script synthesizer
 date: 2025-04-02
 layout: post
 project: true
 thumbnail: thumb_sm.png
 ---
 
-One year at the trading firm. Webscrapers are causing problems. CPUs are
-saturated, servers are stalling.
+One year at the trading firm. Scripts are saturating CPUs, stalling servers;
+Forced to restart them.
 
-2025-02: Built a C# WinForms application to record browser sessions and
-automate the synthesis of scripts.
+2025-02: Built a tool to record browser sessions and synthesize better scripts.
 
 <video style="max-width:100%; margin-bottom: 10px" controls="" poster="poster.png">
 <source src="bee.mp4" type="video/mp4">
 </video>
 
-Hosted WebView2 (Edge) in the WinForms application to render web content.
+Stack: C# WinForms, WebView2 (Edge), Scintilla.NET editor.
 
-Intercepted events by injecting JS hooks to web pages (client-side events) and
-listening to WebView events (internal browser events). Converted intercepted
-events to Selenium code by sending through if-else blocks.
+Injected JS hooks, WebView2 and the editor generate events. If-else blocks
+convert them to Selenium code. Optimizer squashes multiple events into single
+commands (e.g., calendar clicks → text input), uses heuristics to improve DOM
+addressing (xpath, id, element).
 
-Implemented a basic optimizer to squash event sequences into single commands
-(e.g., calendar clicks → text input), use heuristics to improve DOM addressing
-(xpath, id, element).
+Two linear lists store events and code—no ASTs. Mid-session manual edits desync
+lists, block optimizer. Workaround: only edit scripts at the end of recording.
 
-Integrated Scintilla.NET editor to allow user more control over the generated
-script.
-
-Events and code are stored in two linear lists. Without ASTs, mid-session
-manual edits desync the lists, block the optimizer. As a workaround, only edit
-scripts at the end of recording.
-
-2025-03: Shipped the first iteration. Began work on a key optimization: bypass
-the browser, grab data files directly when possible.
+2025-03: Shipped the first iteration. Began work on key optimization: bypass
+the browser, grab data files directly.
 
 2025-04: Abandoned project. Left the firm.
 
diff --git a/_log/fpm-door-lock-rf.md b/_log/fpm-door-lock-rf.md
index 86ed5d3..1688a7d 100644
--- a/_log/fpm-door-lock-rf.md
+++ b/_log/fpm-door-lock-rf.md
@@ -12,7 +12,7 @@ Wanted to unlock the door with fingerprint, wirelessly to avoid drilling.
 lines of the transceivers to UART RXD/TXDs of the MCUs. Unreliable—constant
 packet loss.
 
-2025-01: Switched to RFM69 modules. Complete ball-ache to program. Followed the
+2025-01: Switched to RFM69 modules. Ball-ache to program. Followed the
 datasheet as well as I could, audited the code multiple times, cross-checked
 with RadioHead and RFM69 drivers. No luck.
 
diff --git a/_log/site-search.md b/_log/site-search.md
index e25c0fc..d4d7fe4 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -4,13 +4,13 @@ date: 2026-01-03
 layout: post
 ---
 
-Developed a suffix-array-based search engine my personal site. While a simple
-regex search was enough, couldn't resist the technical elegance of a proper
+Developed a suffix-array-based search engine. While a simple regex search
+would've been sufficient, couldn't resist the technical elegance of a proper
 index.
 
-Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into
-UTF-8 bytes. Null byte sentinel marks the document boundaries;
-Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin:
+Indexer crawls the HTML, lowercases the text, and encodes it into UTF-8 bytes.
+Null byte sentinels mark document boundaries; sa.bin stores lexicographically
+sorted 32-bit unsigned integer offsets:
 
 ```
 my @sa = 0 .. (length($corpus) - 1);
@@ -28,12 +28,14 @@ my @sa = 0 .. (length($corpus) - 1);
 32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
 comforting to have.
 
-O(L⋅N log N) sort is slow. 100 4.1 KB articles took 97.9s to index. L=64 fast
-path reduces that to 1.31s (L=16, 32, 64: 1.29-1.31s; 128, 256: 1.33-1.35s).
-Even with fast path optimization, indexer is unusable beyond 300 articles.
+O(L⋅N log N) sort is the bottleneck. 100 4.1 KB articles took 97.9s to index.
+L=64 fast path reduces that to 1.31s. Experimented with 16, 32, 128, and 256
+bytes; 64 was the sweet spot—lower values were marginally faster, higher ones
+marginally slower.
 
-Search: Textbook range query with two binary searches, hosted in a FastCGI
-process. Fixed-width offsets allow fast random access to the index:
+Implemented search using a textbook range query with two binary searches,
+hosted in a FastCGI process. Fixed-width offsets allow fast random access to
+the index:
 
 ```
 seek($fh_sa, $mid * 4, 0);
@@ -47,6 +49,13 @@ Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
 faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
 security trade-offs. Results may vary by OS.
 
+Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
+permissions to govern access. Hardened the system by running it in chroot.
+
+Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
+using lock-file semaphores, and capped the query length (64 B) and the result
+set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
+
 Benchmarks: My articles have a 3.42 KB median, 3.43 KB mean, and 5.39 KB max.
 Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 4.1 KB) against
 linear regex search:
@@ -88,14 +97,10 @@ Index size | 103557.18 KB | N/A
 ----------------+----------------------+---------------------
 </pre>
 
-Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
-permissions to govern access. Hardened the system by running it in chroot.
-
-Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
-using lock-file semaphores, and capped the query length (64 B) and the result
-set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.
+Search scales well—0.9 ms at 100 files, 8.8 ms at 5000. Indexing doesn't. 4.5s
+at 300 files is tolerable; 138s at 5000 is impractical.
 
-Next release: Incremental indexing + SA-IS, Anno Domini 2076.
+Warranty: 300 / 6 → 50 years.
 
 Commit: <a
 href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
diff --git a/_log/vcs-1.md b/_log/vcs-1.md
index ac74ad5..4244007 100644
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ -4,8 +4,8 @@ date: 2026-05-01
 layout: post
 ---
 
-Implemented init, status, add, commit, log, show, and diff commands. Depends
-only on OpenBSD base system tools. Didn't bother with collaborative workflows.
+Implemented init, status, add, commit, log, show, and diff commands using Perl
+and OpenBSD base-system tools. Didn't bother with collaborative workflows.
 
 Initial design mirrored the work tree using symlinks. Using filesystem as a
 database felt clever, but walking directories on every command and the inode
@@ -13,22 +13,21 @@ churn were untenable.
 
 Replaced the symlink architecture with a path-sorted index. The index tracks
 path, mtime, size, and SHA-1 hashes of staged, committed, and
-base files. Hashing is skipped when mtime and size are unchanged. If the file
-and the index share the same timestamp, it's rehashed to catch sub-second
-changes.
+base files. Only entries whose mtime and size changed, or share the same mtime
+as the index are hashed.
 
-Implemented directory scans as a two-finger walk with the index. Linear index
+Implemented directory scans as a two-finger walk with the index; linear index
 access trades random-access speed for sequential IO and keeps memory footprint
 low.
 
-Commits save staged files, trees, and deltas to the content-addressable object
+Commits save staged files, trees, and deltas to a content-addressable object
 store. Bundled deltas into tarballs to conserve inodes. Gzipped objects larger
 than 512 bytes. The threshold was arbitrary. Did not tune further.
 
 Deltas, computed using diff, target the original file. Subsequent versions are
-reconstructed via a single patch—no chains. When the delta exceeds the rebase
-threshold, the file becomes the new base. Diff output is bloated but compresses
-well, so rebase threshold is set to 1.4, assuming a 30-40% compression ratio.
+reconstructed via a single patch—no chains. Diff output is bloated but
+compresses well, so rebase threshold is set to 1.4, assuming a 30-40%
+compression ratio. When the delta exceeds that, the file becomes the new base.
 
 Commands run in memory, using text streams and pipes wherever possible. Left
 MEM_LIMIT configurable to fall back to disk for large repositories:
@@ -113,7 +112,7 @@ from 1,300 to 1,462.
 
 Then fell the GC hammer. Inodes: 41. Space recovered: 8.4 MB.
 
-Precise impact on TBW and write amplification remains unknown.
+Precise impact on TBW and write amplification is not yet known.
 
 Commit: <a
 href="https://git.asciimx.com/urn/commit/?id=79d9ec2bdef0a82172fa0aa56f12004bef206c04"
