From bae6f470f5ae274d4013c024e443e15484c1d2e0 Mon Sep 17 00:00:00 2001
From: Sadeep Madurange <sadeep@asciimx.com>
Date: Thu, 23 Apr 2026 12:54:58 +0800
Subject: Rewrite VCS and Neo4J posts in my style, and fix font-size in pre.

---
 _config.yml                 |   2 +-
 _log/neo4j-a-star-search.md |  41 +++----------
 _log/vcs-1.md               | 143 ++++++++++++++++++++------------------
 assets/css/main.css         |   2 +-
 4 files changed, 74 insertions(+), 114 deletions(-)

diff --git a/_config.yml b/_config.yml
index a327ffb..3a2b528 100644
--- a/_config.yml
+++ b/_config.yml
@@ -1,4 +1,4 @@
-title: "Journal"
+title: "Log"
 baseurl: "" # keep empty for root or subpath like /blog
 author:
diff --git a/_log/neo4j-a-star-search.md b/_log/neo4j-a-star-search.md
index 9f4a7d2..3105662 100644
--- a/_log/neo4j-a-star-search.md
+++ b/_log/neo4j-a-star-search.md
@@ -1,25 +1,15 @@
 ---
-title: That time I forked Neo4J to implement A* search
+title: Neo4J contribution
 date: 2018-03-06
 layout: post
 ---
 
 Written in 2026, backdated to 2018.
 
-Our vessel tracking system hit a wall. We needed shortest paths through a
-13,000-vertex graph. But the system slowed to a crawl after 4,000. Kristian,
-our CTO, suggested replacing Dijkstra's search with the faster A* search—an
-algorithm I had encountered as an undergraduate.
+Before v3.4.0, Neo4J shipped with Dijkstra's shortest path search. The
+algorithm was too slow for our marine vessel tracking application.
 
-The database we used, Neo4J, employed a plug-in system for graph algorithms—a
-mature open-source project developed by experts. Forking it for our unassuming
-start-up was bold. Often, application developers are reluctant to step into
-lower layers.
-
-It took me a week to set up the toolchain, build the project, and run the
-tests. I reviewed Dijkstra's implementation, separating logic from
-infrastructure. I then refactored the algorithm to use the haversine function
-to steer the search:
+Forked and added A* search. Used the haversine function to steer the search:
 
 ```
 private double computeHeuristic(
@@ -43,8 +33,7 @@ private double computeHeuristic(
 }
 ```
 
-And revised the core search loop to update the costs when a better path is
-found:
+Core search loop updates costs when a better path is found:
 
 ```
 private void updateCosts(
@@ -59,25 +48,9 @@ private void updateCosts(
 }
 ```
 
-I remember celebrating a 300x speed up, and comfortably scaling to the target
-13,000.
-
-I was even more thrilled when the upstream project accepted my changes, and
-invited me to contribute. In hindsight, I'm a little surprised that my changes
-were accepted in their original form. The heuristic was tightly coupled to our
-use case. Decoupling it would require significant refactoring and interface
-changes.
-
-It was, for a time, the best outcome I could have imagined.
-
-My celebration of success, however, was short-lived. In a matter of months, as
-the database grew larger and denser—to 16,000 vertices, if memory serves—A*
-search too began to struggle. We fell back on a far less elegant PostgreSQL
-cache. I learnt my first bitter lesson about optimization: dumb caches beat
-clever algorithms. I briefly wondered if I should have stayed with pure
-mathematics.
+Upstreamed the changes.
 
 Source code: Neo4J v3.4.0
diff --git a/_log/vcs-1.md b/_log/vcs-1.md
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ ... @@
-```
-my $current_hash = hash_file_content($p);
-my $stg_path = File::Spec->catfile(TMP_DIR, $p);
-make_path(dirname($stg_path));
-
-(-l $p)
-    ? symlink(readlink($p), $stg_path)
-    : copy($p, $stg_path);
-
-printf $out "%-40s\t%-40s\t%-40s\t%-12d\t%-10d\t%s\n",
-    $current_hash, $idx_entry->{c_hash},
-    $idx_entry->{b_hash}, $wrk_entry->{mtime},
-    $wrk_entry->{size}, $p;
-```
-
-No CAS, no delta chains. One base file per tracked object. Revisions are
-patches against the base. When the patch outgrows the file, snapshot and
-rebase. Commits record directory structure and patchset. Patches stored in a
-tarball—one inode per commit. Tarballs >512 bytes gzipped. Trees deduplicated
-by content hash.
-
-Diff/show: look up revision, find tree and patchset, apply patch. O(1)
-checkout. Single corrupt patch can't poison history.
+When a new file is committed, it's saved in the object store as the base. When
+the file changes, diff generates a patch against the base. If the patch is
+larger than the file, the file becomes the new base.
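+
+The base-or-patch decision is a size check at commit time. A sketch, not the
+urn source (make_patch stands in for the block diff below; store_base and
+store_patch are hypothetical helpers):
+
+```
+# Sketch: store a changed file as a delta, or promote it to a new base.
+my $patch = make_patch($base_path, $new_path);
+if (length($patch) > -s $new_path) {
+    store_base($repo, $p, $new_path);   # patch outgrew the file: new base
+} else {
+    store_patch($repo, $p, $patch);     # otherwise keep the delta
+}
+```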
 
-External tools: sort, diff, patch, tar, gzip—all base system. Pipes and streams
-throughout. MEM_LIMIT falls back to disk for large repos:
+Unix diff doesn't handle binary files well. Rolled a diff for binary files that
+works well enough, except for small changes that shift bytes:
 
 ```
-use constant MEM_LIMIT => 64 * 1024 * 1024;
-use constant CHUNK_LEN => 8192;
-use constant IO_LAYER => ":raw:perlio(layer=" . CHUNK_LEN . ")";
-
-if (!$use_disk) {
-    @buf = sort @buf;
-    return sub {
-        my $line = shift @buf;
-        return unless $line;
-        chomp $line;
-        my ($p, $m, $s) = split(/\t/, $line);
-        return { path => $p, mtime => $m, size => $s };
-    };
-} else {
-    $flush->() if @buf;
-    close $tmp_fh;
-    open(my $sort_fh, "-|", "sort", "-t", "\t", "-k1,1", $tmp_path) or die $!;
-    return sub {
-        my $line = <$sort_fh>;
-        unless ($line) { close $sort_fh; return; }
-        chomp $line;
-        my ($p, $s, $m) = split(/\t/, $line);
-        return { path => $p, mtime => $m, size => $s };
-    };
+my $offset = 0;
+my $patch = pack("Q", $new_size); # header: size of the new file
+while (1) {
+    my $read_new = sysread($f_new, my $buf_new, $blk_size);
+    my $read_old = sysread($f_old, my $buf_old, $blk_size);
+    last if !$read_new && !$read_old;
+
+    # If blocks differ, record the change
+    if (($buf_new // '') ne ($buf_old // '')) {
+        my $data = $buf_new // ''; # the new file may end before the old one
+        # Record format: offset (Q), length (L), raw data
+        $patch .= pack("QL", $offset, length($data)) . $data;
+    }
+    $offset += $blk_size;
 }
 ```
 
-Index and tree processing: O(N) two-finger walk. Streamed line-by-line,
-calibrated buffers. No mmap.
+Commits store sorted lists of paths (tree), base hashes, and patch sets in the
+object store. Patches are stored as tarballs. Tarballs larger than 512 bytes
+are gzipped. Objects are content-addressable. To reconstruct a file, look up
+the revision, follow the base hash, and apply the patch.
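+
+Reconstruction is the mirror image. Again a sketch, not the urn source
+(apply_patch and the zero-padding are mine; assumes the pack("Q") header and
+pack("QL") records from the diff above):
+
+```
+# Sketch: rebuild a file from its base and one binary patch.
+sub apply_patch {
+    my ($base, $patch) = @_;
+    my $new_size = unpack("Q", substr($patch, 0, 8));
+
+    # Start from the base, padded then cut to the new size.
+    my $out = substr($base . ("\0" x $new_size), 0, $new_size);
+
+    my $pos = 8;
+    while ($pos < length($patch)) {
+        my ($off, $len) = unpack("QL", substr($patch, $pos, 12));
+        substr($out, $off, $len) = substr($patch, $pos + 12, $len);
+        $pos += 12 + $len;
+    }
+    return $out;
+}
+```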
+
+Status and add commands scan the work tree, sort entries by path, and perform
+a two-finger walk with the index to minimize random access. Operations are
+performed in memory, often using text streams and pipes. MEM_LIMIT can be used
+to fall back on the disk for large repositories:
+
+```
+my $flush = sub {
+    if (!$use_disk) {
+        # First spill: switch to a block-buffered on-disk temp file.
+        ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
+        $tmp_fh->setvbuf(undef, POSIX::_IOFBF(), $chunk_size);
+        binmode $tmp_fh, ":raw";
+        $use_disk = 1;
+    }
+    print $tmp_fh @buf;
+    @buf = ();      # drain the in-memory buffer after each spill
+    $buf_size = 0;
+};
+
+push @buf, $line;
+$buf_size += length($line);
+$tot_size += length($line);
+
+if ((!$use_disk && $tot_size > MEM_LIMIT) ||
+    ($use_disk && $buf_size > $chunk_size)) {
+    $flush->();
+}
+```
 
-## Benchmarks
-
-Performed on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0:
-
-### Impact of repository size
-
-Small repo:
+Benchmarks on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0:
 
 =============================================================
@@ -128,11 +119,7 @@ Page faults     |        Maj:0 / Min:0 |        Maj:0 / Min:0
 Inodes          |                  347 |                  397
 Repo size       |              4212 KB |              3496 KB
 -------------------------------------------------------------
-
-
-Larger repo:
-
 =============================================================
  BENCHMARK: 5000 files @ 50 depth
 =============================================================
@@ -180,11 +167,7 @@ Page faults     |        Maj:0 / Min:0 |        Maj:0 / Min:0
 Inodes          |                 5264 |                 5342
 Repo size       |             91620 KB |             70592 KB
 -------------------------------------------------------------
-
-### Impact of commits over time
-
 =============================================================
  REBASE BENCHMARK: 1000 files (100 commits)
  CONDITIONS: Depth=2, Files Mod=5%, Change=50%
@@ -249,12 +232,16 @@ Repo size       |             20864 KB |             37008 KB
 TOTAL URN REBASES: 273
 
-Git wins on speed and memory. Cold start is the exception — urn's add + commit
-beats git there. Git's zlib compression wins on initial disk usage.
+Git wins on speed and memory. On small repositories, Urn is competitive.
+
+In high-revision workloads with modest per-file churn, Urn beats git on
+storage. Over 80 commits, git wrote 17 MB to track a 17 MB repo. Urn wrote only
+0.5 MB despite 273 rebases. Git's inode count went from 2,334 to 6,495. Urn's
+went from 1,578 to 1,693.
-Over time the picture flips. 80 commits: git wrote 17 MB, urn wrote 0.5 MB.
-Git's inode count went from 2,334 to 6,495. Urn's went from 1,578 to 1,693. The
-thing it was built to do, it does.
+Git GC reclaims inodes, but doesn't save much space. On a 36.6 MB repo, git
+used 1,819 inodes and 59 MB pre-GC. After GC, the inode count dropped to 1,514,
+but the space shrank by only 6 MB.
 
 Commit: