path: root/_log
author    Sadeep Madurange <sadeep@asciimx.com>    2026-04-23 12:54:58 +0800
committer Sadeep Madurange <sadeep@asciimx.com>    2026-04-23 12:54:58 +0800
commit    bae6f470f5ae274d4013c024e443e15484c1d2e0 (patch)
tree      8923b3d88be35380ab499ecf15cf600cd03acda1 /_log
parent    953550aee9207fae3f1a636c92d76809c8a36e5e (diff)
download  www-bae6f470f5ae274d4013c024e443e15484c1d2e0.tar.gz
Rewrite VCS and neo4J in my style, and fix font-size in pre.
Diffstat (limited to '_log')
-rw-r--r--  _log/neo4j-a-star-search.md   41
-rw-r--r--  _log/vcs-1.md                143
2 files changed, 72 insertions(+), 112 deletions(-)
diff --git a/_log/neo4j-a-star-search.md b/_log/neo4j-a-star-search.md
index 9f4a7d2..3105662 100644
--- a/_log/neo4j-a-star-search.md
+++ b/_log/neo4j-a-star-search.md
@@ -1,25 +1,15 @@
---
-title: That time I forked Neo4J to implement A* search
+title: Neo4J contribution
date: 2018-03-06
layout: post
---
Written in 2026, backdated to 2018.
-Our vessel tracking system hit a wall. We needed shortest paths through a
-13,000-vertex graph. But the system slowed to a crawl after 4,000. Kristian,
-our CTO, suggested replacing Dijkstra's search with the faster A* search—an
-algorithm I had encountered as an undergraduate.
+Before v3.4.0, Neo4J shipped with Dijkstra's shortest path search. The
+algorithm was too slow for our marine vessel tracking application.
-The database we used, Neo4J, employed a plug-in system for graph algorithms—a
-mature open-source project developed by experts. Forking it for our unassuming
-start-up was bold. Often, application developers are reluctant to step into
-lower layers.
-
-It took me a week to set up the toolchain, build the project, and run the
-tests. I reviewed Dijkstra's implementation, separating logic from
-infrastructure. I then refactored the algorithm to use the haversine function
-to steer the search:
+Forked and added A* search. Used the haversine function to steer the search:
```
private double computeHeuristic(
@@ -43,8 +33,7 @@ private double computeHeuristic(
}
```
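For context, a haversine heuristic of the kind the post describes might look like this. This is a sketch only; the class, method, and constant names are illustrative, not Neo4J's actual plugin API:

```java
// Great-circle distance via the haversine formula. As an A* heuristic it
// never overestimates the remaining distance between two coordinates,
// which keeps the search admissible. Names here are illustrative.
public final class Haversine {
    private static final double EARTH_RADIUS_M = 6_371_000.0;

    static double distance(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }
}
```

For vessel positions, straight-line distance over the globe is a natural lower bound on any route, which is what makes it a safe heuristic for shortest-path search.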
-And revised the core search loop to update the costs when a better path is
-found:
+Core search loop updates costs when a better path is found:
```
private void updateCosts(
@@ -59,25 +48,9 @@ private void updateCosts(
}
```
-I remember celebrating a 300x speed up, and comfortably scaling to the target
-13,000.
-
-I was even more thrilled when the upstream project accepted my changes, and
-invited me to contribute. In hindsight, I'm a little surprised that my changes
-were accepted in their original form. The heuristic was tightly coupled to our
-use case. Decoupling it would require significant refactoring and interface
-changes.
-
-It was, for a time, the best outcome I could have imagined.
-
-My celebration of success, however, was short-lived. In a matter of months, as
-the database grew larger and denser—to 16,000 vertices, if memory serves—A*
-search too began to struggle. We fell back on a far less elegant PostgreSQL
-cache. I learnt my first bitter lesson about optimization: dumb caches beat
-clever algorithms. I briefly wondered if I should have stayed with pure
-mathematics.
+Upstreamed the changes.
-Source code: <a
+GitHub release: <a
href="https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases/tag/3.4.0.0"
class="external" target="_blank" rel="noopener noreferrer">Neo4J v3.4.0</a> |
<a
diff --git a/_log/vcs-1.md b/_log/vcs-1.md
index 58cb896..69e9078 100644
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ -1,84 +1,75 @@
---
-title: What would VCS look like if SSD wear was the primary constraint?
-date: 2026-04-20
+title: 'Urn: An experimental SSD-oriented VCS'
+date: 2026-04-23
layout: post
---
-Git: 1819 inodes, 59 MB to track a 36.59 MB repo pre-GC. GC collected 1514
-inodes. Packing reclaimed 6 MB. Can we do better?
+Urn is an experimental VCS built to minimize SSD wear, write amplification, and
+inode churn, even when that costs CPU time.
-PoC implements status, add, commit, log, show, diff. Supports symlinks, binary
-files. Optimized for SSD longevity — sequential reads/writes, reduced TBW/WA.
+Implemented init, status, add, commit, log, show, and diff. Handles text files,
+symlinks, and binary files. Collaborative workflows are out of scope.
-Architecture: sorted index tracks files. Staging copies object to staging area,
-records path, mtime, size, SHA-1:
+Sorted index tracks each file's path, the SHA-1 of its staged content (staged
+hash), its parent commit (commit hash), its base object (base hash), and its
+mtime and size. Permissions are not tracked. mtime and size are used to avoid
+rehashing files that didn't change.
-```
-my $p = $wrk_entry->{path};
-my $current_hash = hash_file_content($p);
-my $stg_path = File::Spec->catfile(TMP_DIR, $p);
-make_path(dirname($stg_path));
-
-(-l $p)
- ? symlink(readlink($p), $stg_path)
- : copy($p, $stg_path);
-
-printf $out "%-40s\t%-40s\t%-40s\t%-12d\t%-10d\t%s\n",
- $current_hash, $idx_entry->{c_hash},
- $idx_entry->{b_hash}, $wrk_entry->{mtime},
- $wrk_entry->{size}, $p;
-```
-
-No CAS, no delta chains. One base file per tracked object. Revisions are
-patches against the base. When the patch outgrows the file, snapshot and
-rebase. Commits record directory structure and patchset. Patches stored in a
-tarball—one inode per commit. Tarballs >512 bytes gzipped. Trees deduplicated
-by content hash.
-
-Diff/show: look up revision, find tree and patchset, apply patch. O(1)
-checkout. Single corrupt patch can't poison history.
+When a new file is committed, it's saved in the object store as the base. When
+the file changes, diff generates a patch against the base. If the patch is
+larger than the file, the file becomes the new base.
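The base-versus-patch decision above reduces to a size comparison. A minimal sketch, in Java rather than urn's Perl, with illustrative names:

```java
// When a revision's patch would be larger than the file it encodes,
// snapshot the file as the new base instead of storing the patch.
// Sketch only; urn's real logic lives in its Perl implementation.
public final class BasePolicy {
    static String choose(int patchLen, int fileLen) {
        return patchLen > fileLen ? "base" : "patch";
    }
}
```

The payoff is bounded reconstruction cost: a file is never more than one patch away from its base, so there are no delta chains to walk.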
-External tools: sort, diff, patch, tar, gzip—all base system. Pipes and streams
-throughout. MEM_LIMIT falls back to disk for large repos:
+Unix diff doesn't handle binary files well. Rolled a diff for binary files that
+works well enough except for small changes that shift bytes:
```
-use constant MEM_LIMIT => 64 * 1024 * 1024;
-use constant CHUNK_LEN => 8192;
-use constant IO_LAYER => ":raw:perlio(layer=" . CHUNK_LEN . ")";
-
-if (!$use_disk) {
- @buf = sort @buf;
- return sub {
- my $line = shift @buf;
- return unless $line;
- chomp $line;
- my ($p, $m, $s) = split(/\t/, $line);
- return { path => $p, mtime => $m, size => $s };
- };
-} else {
- $flush->() if @buf;
- close $tmp_fh;
- open(my $sort_fh, "-|", "sort", "-t", "\t", "-k1,1", $tmp_path) or die $!;
- return sub {
- my $line = <$sort_fh>;
- unless ($line) { close $sort_fh; return; }
- chomp $line;
- my ($p, $s, $m) = split(/\t/, $line);
- return { path => $p, mtime => $m, size => $s };
- };
+my $offset = 0;
+my $patch = pack("Q", $new_size);
+while (1) {
+ my $read_new = sysread($f_new, my $buf_new, $blk_size);
+ my $read_old = sysread($f_old, my $buf_old, $blk_size);
+ last if !$read_new && !$read_old;
+
+ # If blocks differ, record the change
+ if (($buf_new // '') ne ($buf_old // '')) {
+ # Format: Offset (Q), Length (L), raw data
+ $patch .= pack("QL", $offset, length($buf_new)) . $buf_new;
+ }
+ $offset += $blk_size;
}
```
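The fixed-block scheme above, and its weakness, can be sketched in Java, assuming the same layout of a size header followed by (offset, length, data) records. All names are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Fixed-block binary diff/patch in the spirit of the Perl excerpt above:
// compare equal-sized blocks and record (offset, length, data) for blocks
// that differ, after an 8-byte header holding the new file's size.
public final class BlockDiff {
    static final int BLK = 4; // tiny block size, for demonstration only

    static byte[] diff(byte[] oldB, byte[] newB) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(ByteBuffer.allocate(8).putLong(newB.length).array());
        for (int off = 0; off < Math.max(oldB.length, newB.length); off += BLK) {
            byte[] o = Arrays.copyOfRange(oldB, Math.min(off, oldB.length),
                                          Math.min(off + BLK, oldB.length));
            byte[] n = Arrays.copyOfRange(newB, Math.min(off, newB.length),
                                          Math.min(off + BLK, newB.length));
            if (!Arrays.equals(o, n)) {
                out.writeBytes(ByteBuffer.allocate(12).putLong(off)
                                         .putInt(n.length).array());
                out.writeBytes(n);
            }
        }
        return out.toByteArray();
    }

    static byte[] apply(byte[] oldB, byte[] patch) {
        ByteBuffer in = ByteBuffer.wrap(patch);
        byte[] out = Arrays.copyOf(oldB, (int) in.getLong());
        while (in.hasRemaining()) {
            int off = (int) in.getLong();
            int len = in.getInt();
            in.get(out, off, len);
        }
        return out;
    }
}
```

A one-byte insertion shifts every later block, so nearly the whole file gets re-recorded, which is exactly the post's caveat about small changes that shift bytes.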
-Index and tree processing: O(N) two-finger walk. Streamed line-by-line,
-calibrated buffers. No mmap.
+Commits store sorted lists of paths (tree), base hashes, and patch sets in the
+object store. Patches are stored as tarballs. Tarballs larger than 512 bytes
+are gzipped. Objects are content-addressable. To reconstruct a file, look up
+the revision, follow the base hash and apply the patch.
-## Benchmarks
+Status and add commands scan the work tree, sort entries by path, and perform
+a two-finger walk with the index to minimize random access. Operations are
+performed in memory, often using text streams and pipes. MEM_LIMIT falls back
+to disk for large repositories:
-Performed on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0:
-
-### Impact of repository size
+```
+my $flush = sub {
+ if (!$use_disk) {
+ ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
+ $tmp_fh->setvbuf(undef, POSIX::_IOFBF(), $chunk_size);
+ binmode $tmp_fh, ":raw";
+ $use_disk = 1;
+ }
+ print $tmp_fh @buf;
+};
+
+push @buf, $line;
+$buf_size += length($line);
+$tot_size += length($line);
+
+if ((!$use_disk && $tot_size > MEM_LIMIT) ||
+ ($use_disk && $buf_size > $chunk_size)) {
+ $flush->();
+}
+```
-Small repo:
+Benchmarks on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0:
<pre class="pre-no-style">
=============================================================
@@ -128,11 +119,7 @@ Page faults | Maj:0 / Min:0 | Maj:0 / Min:0
Inodes | 347 | 397
Repo size | 4212 KB | 3496 KB
-------------------------------------------------------------
-</pre>
-
-Larger repo:
-<pre class="pre-no-style">
=============================================================
BENCHMARK: 5000 files @ 50 depth
=============================================================
@@ -180,11 +167,7 @@ Page faults | Maj:0 / Min:0 | Maj:0 / Min:0
Inodes | 5264 | 5342
Repo size | 91620 KB | 70592 KB
-------------------------------------------------------------
-</pre>
-### Impact of commits over time
-
-<pre class="pre-no-style">
=============================================================
REBASE BENCHMARK: 1000 files (100 commits)
CONDITIONS: Depth=2, Files Mod=5%, Change=50%
@@ -249,12 +232,16 @@ Repo size | 20864 KB | 37008 KB
TOTAL URN REBASES: 273
</pre>
-Git wins on speed and memory. Cold start is the exception — urn's add + commit
-beats git there. Git's zlib compression wins on initial disk usage.
+Git wins on speed and memory. On small repositories, Urn is competitive.
+
+In high-revision workloads with modest per-file churn, Urn beats git on
+storage. Over 80 commits, git wrote 17 MB to track a 17 MB repo. Urn only wrote
+0.5 MB despite 273 rebases. Git's inode count went from 2,334 to 6,495. Urn's
+went from 1,578 to 1,693.
-Over time the picture flips. 80 commits: git wrote 17 MB, urn wrote 0.5 MB.
-Git's inode count went from 2,334 to 6,495. Urn's went from 1,578 to 1,693. The
-thing it was built to do, it does.
+Git GC reclaims inodes but doesn't save much space. On a 36.6 MB repo, git
+used 1819 inodes and 59 MB pre-GC. After GC, the inode count dropped to 1514,
+but space shrank by only 6 MB.
Commit: <a
href="https://git.asciimx.com/urn/commit/?id=49ae7748e4a95afa1fd9d08f4886952dfc1deca4"