| -rw-r--r-- | _log/bumblebee.md | 19 |
| -rw-r--r-- | _log/etlas.md | 6 |
| -rw-r--r-- | _log/neo4j-a-star-search.md | 8 |
| -rw-r--r-- | _log/site-search.md | 92 |
| -rw-r--r-- | _log/vcs-1.md | 5 |
| -rw-r--r-- | index.md | 6 |
6 files changed, 66 insertions, 70 deletions
diff --git a/_log/bumblebee.md b/_log/bumblebee.md
index 21a9a07..bbe0490 100644
--- a/_log/bumblebee.md
+++ b/_log/bumblebee.md
@@ -1,16 +1,16 @@
 ---
-title: Built a browser session script synthesizer
+title: Built a browser automation script synthesizer
 date: 2025-04-02
 layout: post
 project: true
 thumbnail: thumb_sm.png
 ---

-One year at trading firm. Webscraper is giving too many problems. CPUs are
+One year at the trading firm. Webscrapers are causing problems. CPUs are
 saturated, servers are stalling.

-2025-02: Built Bumblebee, a C# WinForms application, to record browser sessions
-and automate the synthesis of scripts.
+2025-02: Built a C# WinForms application to record browser sessions and
+automate the synthesis of scripts.

 <video style="max-width:100%; margin-bottom: 10px" controls="" poster="poster.png">
   <source src="bee.mp4" type="video/mp4">
@@ -20,8 +20,7 @@
 Hosted WebView2 (Edge) in the WinForms application to render web content.
 Intercepted events by injecting JS hooks to web pages (client-side events) and
 listening to WebView events (internal browser events). Converted intercepted
-events to Selenium code by sending through if-else statements. Crude—no time
-for something better.
+events to Selenium code by sending through if-else blocks.

 Implemented a basic optimizer to squash event sequences into single commands
 (e.g., calendar clicks → text input), use heuristics to improve DOM addressing
@@ -30,11 +29,11 @@ Implemented a basic optimizer to squash event sequences into single commands
 Integrated Scintilla.NET editor to allow user more control over the generated
 script.

-Events and code are stored in two linear lists. Mid-session manual edits desync
-the lists, block the optimizer. ASTs are overkill for now. As a workaround,
-only edit scripts at the end of recording.
+Events and code are stored in two linear lists. Without ASTs, mid-session
+manual edits desync the lists, block the optimizer. As a workaround, only edit
+scripts at the end of recording.

-2025-03: Shipped the first iteration and began work on key optimization: bypass
+2025-03: Shipped the first iteration. Began work on a key optimization: bypass
 the browser, grab data files directly when possible.

 2025-04: Abandoned project. Left the firm.
diff --git a/_log/etlas.md b/_log/etlas.md
index 9fd5ded..54745f6 100644
--- a/_log/etlas.md
+++ b/_log/etlas.md
@@ -68,9 +68,9 @@ KB RAM. Deployed a simple Flask API on VPS to manage the watchlist and relay
 the feed. Wrapped the API in FastCGI and exposed it through chroot-ed htpasswd
 + slowcgi + httpd—battle-tested OpenBSD base-system tools.

-Rolled my own stepped graph for simplicity, but the code is hideous. Needed
-vTaskDelay() to prevent the watchdog timer from triggering. Will look into
-Bresenham’s in a future revision.
+Custom stepped graph works but the code is crude. vTaskDelay() is needed to
+keep the watchdog timer from triggering—revisit with Bresenham's line
+algorithm.

 News: Used Channel NewsAsia RSS feed for news. Hand-coded the XML parsing in
 C—no Flask backend at the time. Now that I have one for stocks, will move the
diff --git a/_log/neo4j-a-star-search.md b/_log/neo4j-a-star-search.md
index 8790731..6664421 100644
--- a/_log/neo4j-a-star-search.md
+++ b/_log/neo4j-a-star-search.md
@@ -1,13 +1,13 @@
 ---
-title: Contributed A* search to Neo4J
+title: Contributed A* search to Neo4J algorithms
 date: 2018-03-06
 layout: post
 ---

 Written in 2026, backdated to 2018.

-Before v3.4.0, Neo4J shipped with Dijkstra's shortest path search. The
-algorithm was too slow for our marine vessel tracking application.
+Before v3.4.0, Neo4J algorithms plugin shipped with Dijkstra's shortest path
+search. The algorithm was too slow for our marine vessel tracking application.

 Forked and added A* search. Used the haversine function to steer the search:

@@ -52,7 +52,7 @@ Upstreamed the changes.

 GitHub release: <a
 href="https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases/tag/3.4.0.0"
-class="external" target="_blank" rel="noopener noreferrer">Neo4J v3.4.0</a> |
+class="external" target="_blank" rel="noopener noreferrer">neo4j-contrib v3.4.0</a> |
 <a
 href="https://github.com/neo4j-contrib/neo4j-graph-algorithms/blob/bd9732d9a690319552e134708692acb5a0d6b37c/algo/src/main/java/org/neo4j/graphalgo/impl/ShortestPathAStar.java"
 class="external" target="_blank" rel="noopener noreferrer">Full source</a>.
diff --git a/_log/site-search.md b/_log/site-search.md
index 7b916fe..c1d4c12 100644
--- a/_log/site-search.md
+++ b/_log/site-search.md
@@ -1,5 +1,5 @@
 ---
-title: Overengineered search
+title: Under-engineered search
 date: 2026-01-03
 layout: post
 ---
@@ -8,15 +8,14 @@ Developed a suffix-array-based search engine for the site today. While a simple
 regex search was enough, couldn't resist the technical elegance of a proper
 index.

-Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text,
-and encode it into UTF-8 bytes. Used a null byte sentinel to mark document
-boundaries and stored the lexicographically sorted 32-bit unsigned integer
-offsets to sa.bin:
+Indexer: Indexer crawls the HTML, lowercases the text, and encodes it into
+UTF-8 bytes. Null byte sentinel marks the document boundaries;
+Lexicographically sorted 32-bit unsigned integer offsets are stored in sa.bin:

 ```
 my @sa = 0 .. (length($corpus) - 1);
 {
-    use bytes; # Force compare 8-bit Unicode value comparisons
+    use bytes; # Force compare raw bytes
     @sa = sort {
         # First 64 bytes check (fast path)
         (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) ||
@@ -29,13 +28,8 @@ my @sa = 0 .. (length($corpus) - 1);
 32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but
 comforting to have.

-It takes about 50ms to index my 12-entry website on a T490. As the site grows,
-the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path
-that caps L at 64 bytes—roughly the size of a cache line on common hardware.
-
-Search: Implemented the search in a FastCGI script as a textbook range query
-with two binary searches. Leveraged the fixed-width offsets for fast random
-access to the index:
+Search: Textbook range query with two binary searches hosted in a FastCGI
+process. Fixed-width offsets enable fast random access to the index:

 ```
 seek($fh_sa, $mid * 4, 0);
@@ -45,9 +39,9 @@ seek($fh_cp, $off, 0);
 read($fh_cp, my $text, $query_len);
 ```

-Chose seek + read over mmap because it outperformed mmap for <1k files. At
-10k, mmap was occasionally faster (~200 µs), but used more memory—possibly
-due to OpenBSD’s VM security trade-offs. Results may vary by OS.
+Seek + read outperformed mmap for <1k files. At 10k, mmap was occasionally
+faster (~200 µs), but consumed more memory—possibly due to OpenBSD’s VM
+security trade-offs. Results may vary by OS.

 Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against
 linear regex search:
@@ -55,38 +49,38 @@ linear regex search:
 <pre class="pre-no-style">
 =============================================================
 SEARCH BENCHMARK: Suffix array vs. Linear regex
-ARTICLE SIZE: 16 KB
+ARTICLE SIZE: 8 KB
 =============================================================

-500 files:
-------------------------------------------------------------
-METRIC | SA | REGEX
+500 files (Targeting: keyword_-1):
+----------------+----------------------+---------------------
+METRIC | SA | REGEX
+----------------+----------------------+---------------------
+Search time | 0.0014s | 0.0451s
+Peak RAM | 8124 KB | 9612 KB
+Indexing time | 18.1865s | N/A
+Index size | 19610.39 KB | N/A
+----------------+----------------------+---------------------
+
+1000 files (Targeting: keyword_-1):
 ----------------+----------------------+---------------------
-Search time | 0.0012s | 0.0407s
-Peak RAM | 8828 KB | 9136 KB
-Indexing time | 0.1475s | N/A
-Index size | 204.94 KB | N/A
--------------------------------------------------------------
-
-1,000 files:
--------------------------------------------------------------
-METRIC | SA | REGEX
+METRIC | SA | REGEX
 ----------------+----------------------+---------------------
-Search time | 0.0019s | 0.0795s
-Peak RAM | 8980 KB | 9460 KB
-Indexing time | 0.3101s | N/A
-Index size | 410.51 KB | N/A
--------------------------------------------------------------
-
-10,000 files:
--------------------------------------------------------------
-METRIC | SA | REGEX
+Search time | 0.0021s | 0.0918s
+Peak RAM | 8280 KB | 9960 KB
+Indexing time | 43.1748s | N/A
+Index size | 39225.06 KB | N/A
+----------------+----------------------+---------------------
+
+10000 files (Targeting: keyword_-1):
+----------------+----------------------+---------------------
+METRIC | SA | REGEX
+----------------+----------------------+---------------------
+Search time | 0.0173s | 1.1275s
+Peak RAM | 11848 KB | 13392 KB
+Indexing time | 663.3909s | N/A
+Index size | 392263.01 KB | N/A
 ----------------+----------------------+---------------------
-Search time | 0.0161s | 0.9120s
-Peak RAM | 12504 KB | 12804 KB
-Indexing time | 10.9661s | N/A
-Index size | 4163.44 KB | N/A
--------------------------------------------------------------
 </pre>

 Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system
@@ -96,13 +90,17 @@ Resource exhaustion and XSS attacks are inherent. Limited concurrent searches
 using lock-file semaphores, and capped the query length (64 B) and the result
 set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.

-At six articles a year, this should work for the next 1600 years. Penciled in
-the next release for Anno Domini 3626.
+Performance: Without SA-IS, indexing is slow. With O(L⋅N log N) naive sort, 100
+8 KB articles took 6.58 minutes to index. L=64 fast path reduces that to 2.69
+seconds (L=16, 32, 64: 2.68-2.69s; 128, 256: 2.75-2.77s). Even so, 43.1748s to
+index 500 articles is untenable.
+
+I under-engineered search.

 Commit: <a
 href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e"
 class="external" target="_blank" rel="noopener noreferrer">6da102d</a> |
 Benchmarks: <a
-href="https://git.asciimx.com/site-search-bm/commit/?id=8a4da6809cf9368cd6a5dd7351181ea4256453f9"
-class="external" target="_blank" rel="noopener noreferrer">8a4da68</a>
+href="https://git.asciimx.com/site-search-bm/commit/?id=de9d82e8074c9b67a04989f9b6be62890b7c95bb"
+class="external" target="_blank" rel="noopener noreferrer">de9d82e</a>

diff --git a/_log/vcs-1.md b/_log/vcs-1.md
index f7520df..ac74ad5 100644
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ -113,10 +113,7 @@ from 1,300 to 1,462.

 Then fell the GC hammer. Inodes: 41. Space recovered: 8.4 MB.

-Urn's sequential IO and reduced write frequency are theoretically gentler on
-NAND. Git's dramatic GC pass (12 MB → 3.8 MB) incurs SSD wear Urn likely
-avoids. Precise impact on TBW and write amplification, however, remains
-unknown.
+Precise impact on TBW and write amplification remains unknown.

 Commit: <a
 href="https://git.asciimx.com/urn/commit/?id=79d9ec2bdef0a82172fa0aa56f12004bef206c04"
diff --git a/index.md b/index.md
--- a/index.md
+++ b/index.md
@@ -14,6 +14,8 @@ title: "Home"
   </ul>

   <footer>
-    <p>A journal of personal projects and experiments. <a href="/cgi-bin/find.cgi">Search</a></p>
-    <p>Built with <a href="https://github.com/ronv/minimalist" class="external" target="_blank" rel="noopener noreferrer">Minimalist</a></p>
+    <p>Built with <a href="https://github.com/ronv/minimalist" class="external"
+      target="_blank" rel="noopener noreferrer">Minimalist</a>.
+      <a href="/cgi-bin/find.cgi">Search</a>
+    </p>
   </footer>
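
Note on the _log/site-search.md hunks: the post describes a naively sorted suffix array stored as fixed-width 32-bit unsigned offsets, queried with "two binary searches". A minimal sketch of that idea follows, in Python rather than the post's Perl. It is not the author's code. The function names (`build_index`, `offset_at`, `search`), the little-endian byte order, and keeping the corpus and index in memory instead of seeking in files are all assumptions made for illustration.

```python
import struct


def build_index(corpus: bytes) -> bytes:
    """Naively sort every suffix by raw bytes and pack the offsets.

    Assumption: offsets are little-endian 32-bit unsigned integers;
    the post does not state the byte order used in sa.bin.
    """
    sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    return struct.pack(f"<{len(sa)}I", *sa)


def offset_at(sa_bin: bytes, rank: int) -> int:
    """Fixed-width records allow random access at rank * 4."""
    return struct.unpack_from("<I", sa_bin, rank * 4)[0]


def search(corpus: bytes, sa_bin: bytes, query: bytes) -> list[int]:
    """Return sorted corpus offsets where query occurs."""
    n = len(sa_bin) // 4
    qlen = len(query)

    def prefix(rank: int) -> bytes:
        off = offset_at(sa_bin, rank)
        return corpus[off:off + qlen]

    # First binary search: lowest rank whose prefix is >= query.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: lowest rank whose prefix is > query.
    hi = n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(offset_at(sa_bin, r) for r in range(start, lo))


if __name__ == "__main__":
    # Null bytes mark document boundaries, as in the post.
    corpus = b"banana\x00bandana\x00"
    index = build_index(corpus)
    print(search(corpus, index, b"ana"))
```

The full-suffix sort key here corresponds to the slow path the Performance paragraph complains about; the 64-byte fast path and SA-IS mentioned in the diff are not modeled.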
