path: root/_log
author    Sadeep Madurange <sadeep@asciimx.com>    2026-04-23 12:54:58 +0800
committer Sadeep Madurange <sadeep@asciimx.com>    2026-04-23 12:54:58 +0800
commit    bae6f470f5ae274d4013c024e443e15484c1d2e0 (patch)
tree      8923b3d88be35380ab499ecf15cf600cd03acda1 /_log
parent    953550aee9207fae3f1a636c92d76809c8a36e5e (diff)
download  www-bae6f470f5ae274d4013c024e443e15484c1d2e0.tar.gz
Rewrite VCS and neo4J in my style, and fix font-size in pre.
Diffstat (limited to '_log')
-rw-r--r--  _log/neo4j-a-star-search.md   41
-rw-r--r--  _log/vcs-1.md                143
2 files changed, 72 insertions(+), 112 deletions(-)
diff --git a/_log/neo4j-a-star-search.md b/_log/neo4j-a-star-search.md
index 9f4a7d2..3105662 100644
--- a/_log/neo4j-a-star-search.md
+++ b/_log/neo4j-a-star-search.md
@@ -1,25 +1,15 @@
---
-title: That time I forked Neo4J to implement A* search
+title: Neo4J contribution
date: 2018-03-06
layout: post
---
Written in 2026, backdated to 2018.
-Our vessel tracking system hit a wall. We needed shortest paths through a
-13,000-vertex graph. But the system slowed to a crawl after 4,000. Kristian,
-our CTO, suggested replacing Dijkstra's search with the faster A* search—an
-algorithm I had encountered as an undergraduate.
+Before v3.4.0, Neo4J shipped with Dijkstra's shortest path search. The
+algorithm was too slow for our marine vessel tracking application.
-The database we used, Neo4J, employed a plug-in system for graph algorithms—a
-mature open-source project developed by experts. Forking it for our unassuming
-start-up was bold. Often, application developers are reluctant to step into
-lower layers.
-
-It took me a week to set up the toolchain, build the project, and run the
-tests. I reviewed Dijkstra's implementation, separating logic from
-infrastructure. I then refactored the algorithm to use the haversine function
-to steer the search:
+Forked and added A* search. Used the haversine function to steer the search:
```
private double computeHeuristic(
@@ -43,8 +33,7 @@ private double computeHeuristic(
}
```
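For context, a haversine heuristic of the kind the post describes might look like this. This is a sketch only; the class, method, and constant names are illustrative, not Neo4J's actual plugin API:

```java
// Great-circle distance via the haversine formula. As an A* heuristic it
// never overestimates the remaining distance between two coordinates,
// which keeps the search admissible. Names here are illustrative.
public final class Haversine {
    private static final double EARTH_RADIUS_M = 6_371_000.0;

    static double distance(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }
}
```

For vessel positions, straight-line distance over the globe is a natural lower bound on any route, which is what makes it a safe heuristic for shortest-path search.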
-And revised the core search loop to update the costs when a better path is
-found:
+Core search loop updates costs when a better path is found:
```
private void updateCosts(
@@ -59,25 +48,9 @@ private void updateCosts(
}
```
-I remember celebrating a 300x speed up, and comfortably scaling to the target
-13,000.
-
-I was even more thrilled when the upstream project accepted my changes, and
-invited me to contribute. In hindsight, I'm a little surprised that my changes
-were accepted in their original form. The heuristic was tightly coupled to our
-use case. Decoupling it would require significant refactoring and interface
-changes.
-
-It was, for a time, the best outcome I could have imagined.
-
-My celebration of success, however, was short-lived. In a matter of months, as
-the database grew larger and denser—to 16,000 vertices, if memory serves—A*
-search too began to struggle. We fell back on a far less elegant PostgreSQL
-cache. I learnt my first bitter lesson about optimization: dumb caches beat
-clever algorithms. I briefly wondered if I should have stayed with pure
-mathematics.
+Upstreamed the changes.
-Source code: <a
+GitHub release: <a
href="https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases/tag/3.4.0.0"
class="external" target="_blank" rel="noopener noreferrer">Neo4J v3.4.0</a> |
<a
diff --git a/_log/vcs-1.md b/_log/vcs-1.md
index 58cb896..69e9078 100644
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ -1,84 +1,75 @@
---
-title: What would VCS look like if SSD wear was the primary constraint?
-date: 2026-04-20
+title: 'Urn: An experimental SSD-oriented VCS'
+date: 2026-04-23
layout: post
---
-Git: 1819 inodes, 59 MB to track a 36.59 MB repo pre-GC. GC collected 1514
-inodes. Packing reclaimed 6 MB. Can we do better?
+Urn is an experimental VCS built to minimize SSD wear, write amplification, and
+inode churn, even when that costs CPU time.
-PoC implements status, add, commit, log, show, diff. Supports symlinks, binary
-files. Optimized for SSD longevity — sequential reads/writes, reduced TBW/WA.
+Implemented init, status, add, commit, log, show, and diff. Handles text files,
+symlinks, and binary files. Collaborative workflows are out of scope.
-Architecture: sorted index tracks files. Staging copies object to staging area,
-records path, mtime, size, SHA-1:
+Sorted index tracks each file's path, the SHA-1 of its staged content (staged
+hash), its parent commit (commit hash), its base object (base hash), and its
+mtime and size. Permissions are not tracked. mtime and size are used to avoid
+rehashing files that didn't change.
-```
-my $p = $wrk_entry->{path};
-my $current_hash = hash_file_content($p);
-my $stg_path = File::Spec->catfile(TMP_DIR, $p);
-make_path(dirname($stg_path));
-
-(-l $p)
- ? symlink(readlink($p), $stg_path)
- : copy($p, $stg_path);
-
-printf $out "%-40s\t%-40s\t%-40s\t%-12d\t%-10d\t%s\n",
- $current_hash, $idx_entry->{c_hash},
- $idx_entry->{b_hash}, $wrk_entry->{mtime},
- $wrk_entry->{size}, $p;
-```
-
-No CAS, no delta chains. One base file per tracked object. Revisions are
-patches against the base. When the patch outgrows the file, snapshot and
-rebase. Commits record directory structure and patchset. Patches stored in a
-tarball—one inode per commit. Tarballs >512 bytes gzipped. Trees deduplicated
-by content hash.
-
-Diff/show: look up revision, find tree and patchset, apply patch. O(1)
-checkout. Single corrupt patch can't poison history.
+When a new file is committed, it's saved in the object store as the base. When
+the file changes, diff generates a patch against the base. If the patch is
+larger than the file, the file becomes the new base.
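The base-versus-patch decision above reduces to a size comparison. A minimal sketch, in Java rather than urn's Perl, with illustrative names:

```java
// When a revision's patch would be larger than the file it encodes,
// snapshot the file as the new base instead of storing the patch.
// Sketch only; urn's real logic lives in its Perl implementation.
public final class BasePolicy {
    static String choose(int patchLen, int fileLen) {
        return patchLen > fileLen ? "base" : "patch";
    }
}
```

The payoff is bounded reconstruction cost: a file is never more than one patch away from its base, so there are no delta chains to walk.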
-External tools: sort, diff, patch, tar, gzip—all base system. Pipes and streams
-throughout. MEM_LIMIT falls back to disk for large repos:
+Unix diff doesn't handle binary files well. Rolled a diff for binary files that
+works well enough except for small changes that shift bytes:
```
-use constant MEM_LIMIT => 64 * 1024 * 1024;
-use constant CHUNK_LEN => 8192;
-use constant IO_LAYER => ":raw:perlio(layer=" . CHUNK_LEN . ")";
-
-if (!$use_disk) {
- @buf = sort @buf;
- return sub {
- my $line = shift @buf;
- return unless $line;
- chomp $line;
- my ($p, $m, $s) = split(/\t/, $line);
- return { path => $p, mtime => $m, size => $s };
- };
-} else {
- $flush->() if @buf;
- close $tmp_fh;
- open(my $sort_fh, "-|", "sort", "-t", "\t", "-k1,1", $tmp_path) or die $!;
- return sub {
- my $line = <$sort_fh>;
- unless ($line) { close $sort_fh; return; }
- chomp $line;
- my ($p, $s, $m) = split(/\t/, $line);
- return { path => $p, mtime => $m, size => $s };
- };
+my $offset = 0;
+my $patch = pack("Q", $new_size);
+while (1) {
+ my $read_new = sysread($f_new, my $buf_new, $blk_size);
+ my $read_old = sysread($f_old, my $buf_old, $blk_size);
+ last if !$read_new && !$read_old;
+
+ # If blocks differ, record the change
+ if (($buf_new // '') ne ($buf_old // '')) {
+ # Format: Offset (Q), Length (L), raw data
+ $patch .= pack("QL", $offset, length($buf_new)) . $buf_new;
+ }
+ $offset += $blk_size;
}
```
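The fixed-block scheme above, and its weakness, can be sketched in Java, assuming the same layout of a size header followed by (offset, length, data) records. All names are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Fixed-block binary diff/patch in the spirit of the Perl excerpt above:
// compare equal-sized blocks and record (offset, length, data) for blocks
// that differ, after an 8-byte header holding the new file's size.
public final class BlockDiff {
    static final int BLK = 4; // tiny block size, for demonstration only

    static byte[] diff(byte[] oldB, byte[] newB) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(ByteBuffer.allocate(8).putLong(newB.length).array());
        for (int off = 0; off < Math.max(oldB.length, newB.length); off += BLK) {
            byte[] o = Arrays.copyOfRange(oldB, Math.min(off, oldB.length),
                                          Math.min(off + BLK, oldB.length));
            byte[] n = Arrays.copyOfRange(newB, Math.min(off, newB.length),
                                          Math.min(off + BLK, newB.length));
            if (!Arrays.equals(o, n)) {
                out.writeBytes(ByteBuffer.allocate(12).putLong(off)
                                         .putInt(n.length).array());
                out.writeBytes(n);
            }
        }
        return out.toByteArray();
    }

    static byte[] apply(byte[] oldB, byte[] patch) {
        ByteBuffer in = ByteBuffer.wrap(patch);
        byte[] out = Arrays.copyOf(oldB, (int) in.getLong());
        while (in.hasRemaining()) {
            int off = (int) in.getLong();
            int len = in.getInt();
            in.get(out, off, len);
        }
        return out;
    }
}
```

A one-byte insertion shifts every later block, so nearly the whole file gets re-recorded, which is exactly the post's caveat about small changes that shift bytes.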
-Index and tree processing: O(N) two-finger walk. Streamed line-by-line,
-calibrated buffers. No mmap.
+Commits store sorted lists of paths (tree), base hashes, and patch sets in the
+object store. Patches are stored as tarballs. Tarballs larger than 512 bytes
+are gzipped. Objects are content-addressable. To reconstruct a file, look up
+the revision, follow the base hash and apply the patch.
-## Benchmarks
+Status and add commands scan the work tree, sort entries by path, and perform
+a two-finger walk with the index to minimize random access. Operations are
+performed in memory, often using text streams and pipes. MEM_LIMIT falls back
+to disk for large repositories:
-Performed on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0:
-
-### Impact of repository size
+```
+my $flush = sub {
+ if (!$use_disk) {
+ ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
+ $tmp_fh->setvbuf(undef, POSIX::_IOFBF(), $chunk_size);
+ binmode $tmp_fh, ":raw";
+ $use_disk = 1;
+ }
+ print $tmp_fh @buf;
+};
+
+push @buf, $line;
+$buf_size += length($line);
+$tot_size += length($line);
+
+if ((!$use_disk && $tot_size > MEM_LIMIT) ||
+ ($use_disk && $buf_size > $chunk_size)) {
+ $flush->();
+}
+```
-Small repo:
+Benchmarks on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0:
<pre class="pre-no-style">
=============================================================
@@ -128,11 +119,7 @@ Page faults | Maj:0 / Min:0 | Maj:0 / Min:0
Inodes | 347 | 397
Repo size | 4212 KB | 3496 KB
-------------------------------------------------------------
-</pre>
-
-Larger repo:
-<pre class="pre-no-style">
=============================================================
BENCHMARK: 5000 files @ 50 depth
=============================================================
@@ -180,11 +167,7 @@ Page faults | Maj:0 / Min:0 | Maj:0 / Min:0
Inodes | 5264 | 5342
Repo size | 91620 KB | 70592 KB
-------------------------------------------------------------
-</pre>
-### Impact of commits over time
-
-<pre class="pre-no-style">
=============================================================
REBASE BENCHMARK: 1000 files (100 commits)
CONDITIONS: Depth=2, Files Mod=5%, Change=50%
@@ -249,12 +232,16 @@ Repo size | 20864 KB | 37008 KB
TOTAL URN REBASES: 273
</pre>
-Git wins on speed and memory. Cold start is the exception — urn's add + commit
-beats git there. Git's zlib compression wins on initial disk usage.
+Git wins on speed and memory. On small repositories, Urn is competitive.
+
+In high-revision workloads with modest per-file churn, Urn beats git on
+storage. Over 80 commits, git wrote 17 MB to track a 17 MB repo. Urn only wrote
+0.5 MB despite 273 rebases. Git's inode count went from 2,334 to 6,495. Urn's
+went from 1,578 to 1,693.
-Over time the picture flips. 80 commits: git wrote 17 MB, urn wrote 0.5 MB.
-Git's inode count went from 2,334 to 6,495. Urn's went from 1,578 to 1,693. The
-thing it was built to do, it does.
+Git GC reclaims inodes but doesn't save much space. On a 36.6 MB repo, git
+used 1819 inodes and 59 MB pre-GC. After GC, the inode count dropped to 1514,
+but space shrank by only 6 MB.
Commit: <a
href="https://git.asciimx.com/urn/commit/?id=49ae7748e4a95afa1fd9d08f4886952dfc1deca4"