diff options
| author | Sadeep Madurange <sadeep@asciimx.com> | 2026-04-24 08:10:24 +0800 |
|---|---|---|
| committer | Sadeep Madurange <sadeep@asciimx.com> | 2026-04-24 09:13:15 +0800 |
| commit | c0e09507547ff80ff93d85989c8954fdcb5292f3 (patch) | |
| tree | d77d99af23736e6105a1731a7d37b187d6c2173a | |
| parent | bc38c10bfdd23fd51c7c3c320de1d84ac98b8dfd (diff) | |
| download | www-c0e09507547ff80ff93d85989c8954fdcb5292f3.tar.gz | |
Tightened prose in VCS post.
| -rw-r--r-- | _log/site-search.md | 5 | ||||
| -rw-r--r-- | _log/vcs-1.md | 173 |
2 files changed, 42 insertions, 136 deletions
diff --git a/_log/site-search.md b/_log/site-search.md index 50de7b2..4f965f9 100644 --- a/_log/site-search.md +++ b/_log/site-search.md @@ -5,7 +5,7 @@ layout: post --- Developed a suffix-array-based search engine for the site today. While a simple -regex search was enough, I couldn't resist the technical elegance of a proper +regex search was enough, couldn't resist the technical elegance of a proper index. Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, @@ -97,7 +97,8 @@ using lock-file semaphores, and capped the query length (64 B) and the result set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape. At my current pace of twelve articles a year, the search should work without -intervention for the next 833 years. Penciled in the next release for 2859 AD. +intervention for the next 833 years. Penciled in the next release for Anno +Domini 2859. Commit: <a href="https://git.asciimx.com/www/commit/?h=term&id=6da102d6e0494a3eac3f05fa3b2cdcc25ba2754e" diff --git a/_log/vcs-1.md b/_log/vcs-1.md index d6ae0b2..3b60ff7 100644 --- a/_log/vcs-1.md +++ b/_log/vcs-1.md @@ -1,5 +1,5 @@ --- -title: Implemented an experimental SSD-friendly VCS +title: Built an experimental SSD-friendly VCS date: 2026-04-23 layout: post --- @@ -7,47 +7,20 @@ layout: post Implemented init, status, add, commit, log, show, and diff. Tracks regular files, symlinks. Didn't bother with collaborative workflows. -Moved away from the initial work tree mirroring with symlinks to an index to -minimize inode churn and opening directories on every status/add command. +Moved away from the initial work tree mirroring with symlinks to a path-sorted +index to minimize inode churn and opening directories on every status/add +command. -Implemented path-sorted index to track staged, commit, and base SHA-1 hashes, -mtime, and size. Like git, mtime and size are used to skip entries that didn't -change. Excluded file permissions. +Implemented the index to track staged, commit, and base SHA-1 hashes, mtime, +and size. Like git, used mtime and size to skip entries that didn't change. +Excluded file permissions. -When a new file is committed, it’s saved in the object store as the base. When -the file changes, diff generates a patch against the base. If the patch is -larger than the file, the file becomes the new base. +Designed work/commit tree scans around a two-finger walk with the index. Linear +index access trades random-access speed for sequential IO; keeps memory +footprint low. -Unix diff doesn't compute binary deltas. Rolled a basic binary diff that works -well enough except for small changes that shift bytes: - -``` -my $patch = pack("Q", $new_size); -while (1) { - my $read_new = sysread($f_new, my $buf_new, $blk_size); - my $read_old = sysread($f_old, my $buf_old, $blk_size); - last if !$read_new && !$read_old; - - # If blocks differ, record the change - if (($buf_new // '') ne ($buf_old // '')) { - # Format: Offset (Q), Length (L), raw data - $patch .= pack("QL", $offset, length($buf_new)) . $buf_new; - } - $offset += $blk_size; -} -``` - -A commit is a tree (list of paths and their base hashes) and a patch set. -Patches are stored as tarballs. Gzipped tarballs larger than 512 bytes. -Objects in the store are content-addressable. To reconstruct a file, look up -the revision file, follow the base hash, and apply the patch. - -Status and add commands scan the work tree, sort entries by path, and perform a -two-finger walk with the index. Linear index access trades random-access speed -for sequential IO. - -Operations are performed in memory—often using text streams and pipes. -MEM_LIMIT can be used to fall back on the disk for large repositories: +Operations run in memory, using text streams and pipes wherever possible. Left +MEM_LIMIT configurable to fall back to disk for large repositories: ``` my $flush = sub { @@ -70,105 +43,37 @@ if ((!$use_disk && $tot_size > MEM_LIMIT) || } ``` -Performed benchmarks on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0: - -<pre class="pre-no-style"> -============================================================= - BENCHMARK: 200 files @ 20 depth -============================================================= - -ACTION: Status -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 0.10s | 0.00s -Max RSS | 0.02 MB | 0.00 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 6 | 27 -Repo size | 20 KB | 116 KB -------------------------------------------------------------- +Implemented the commit command to atomically save (rename) staged files, the +tree, and the deltas to the object store. Bundled deltas into tarballs to +conserve inodes, and gzipped tarballs larger than 512 bytes. Objects in the +store are content-addressable. -ACTION: Add -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 0.16s | 0.17s -Max RSS | 0.02 MB | 0.00 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 225 | 360 -Repo size | 3700 KB | 3348 KB -------------------------------------------------------------- +Computed deltas against the first version of a file (base) to simplify +reconstruction via application of a single patch instead of delta chains. When +the delta outgrows the file, the file becomes the new base. -ACTION: Commit -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 0.17s | 0.04s -Max RSS | 0.02 MB | 0.01 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 347 | 397 -Repo size | 4212 KB | 3496 KB -------------------------------------------------------------- - -ACTION: Status(Clean) -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 0.10s | 0.01s -Max RSS | 0.02 MB | 0.00 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 347 | 397 -Repo size | 4212 KB | 3496 KB -------------------------------------------------------------- - -============================================================= - BENCHMARK: 5000 files @ 50 depth -============================================================= - -ACTION: Status -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 0.26s | 0.00s -Max RSS | 0.02 MB | 0.00 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 6 | 27 -Repo size | 20 KB | 116 KB -------------------------------------------------------------- +Unix diff doesn't compute binary deltas. Rolled a basic binary diff that works +well enough except for small changes that shift bytes: -ACTION: Add -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 2.82s | 4.62s -Max RSS | 0.02 MB | 0.01 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 5055 | 5284 -Repo size | 89444 KB | 70360 KB -------------------------------------------------------------- +``` +my $patch = pack("Q", $new_size); +while (1) { + my $read_new = sysread($f_new, my $buf_new, $blk_size); + my $read_old = sysread($f_old, my $buf_old, $blk_size); + last if !$read_new && !$read_old; -ACTION: Commit -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 1.18s | 0.93s -Max RSS | 0.03 MB | 0.01 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 5264 | 5342 -Repo size | 91620 KB | 70592 KB -------------------------------------------------------------- + # If blocks differ, record the change + if (($buf_new // '') ne ($buf_old // '')) { + # Format: Offset (Q), Length (L), raw data + $patch .= pack("QL", $offset, length($buf_new)) . $buf_new; + } + $offset += $blk_size; +} +``` -ACTION: Status (Clean) -------------------------------------------------------------- -METRIC | URN | GIT -----------------+----------------------+--------------------- -Time | 0.34s | 0.10s -Max RSS | 0.02 MB | 0.01 MB -Page faults | Maj:0 / Min:0 | Maj:0 / Min:0 -Inodes | 5264 | 5342 -Repo size | 91620 KB | 70592 KB -------------------------------------------------------------- +Benchmarks on T490 (i7-10510U, OpenBSD 7.8) against git v2.51.0: +<pre class="pre-no-style"> ============================================================= REBASE BENCHMARK: 1000 files (100 commits) CONDITIONS: Depth=2, Files Mod=5%, Change=50% @@ -241,8 +146,8 @@ storage. Over 80 commits, git wrote 17 MB to track a 17 MB repo. Urn only wrote went from 1,578 to 1,693. Git GC reclaims inodes, but doesn't save much space. On a 36.6 MB repo, git -used up 1819 inodes and 59 MB pre-GC. After GC, inode count dropped to 1514, -but space only shrank by 6 MB. +used up 1819 inodes and 59 MB pre-GC. After GC, the inode count dropped to +1514, but the size only shrank by 6 MB. Commit: <a href="https://git.asciimx.com/urn/commit/?id=49ae7748e4a95afa1fd9d08f4886952dfc1deca4" |
