Diffstat (limited to '_log/vcs-1.md')
| -rw-r--r-- | _log/vcs-1.md | 65 |
1 files changed, 27 insertions, 38 deletions
diff --git a/_log/vcs-1.md b/_log/vcs-1.md
index ac0cc97..63addda 100644
--- a/_log/vcs-1.md
+++ b/_log/vcs-1.md
@@ -1,19 +1,17 @@
 ---
-title: 'Urn: Exploring a SSD-friendly VCS architecture'
+title: What would VCS look like if SSD wear was the primary constraint?
 date: 2026-04-20
 layout: post
 ---
 
-Git takes up 1819 inodes and 59 MB to track a 36.59 MB repository pre-GC. GC
-collected 1514 inodes. Packing only reclaimed 6 MB. Can we do better?
+Git: 1819 inodes, 59 MB to track a 36.59 MB repo pre-GC. GC collected 1514
+inodes. Packing reclaimed 6 MB. Can we do better?
 
-PoC implements status, add, commit, log, show, and diff; Supports symlinks,
-binary files. Architecture tries to balance speed, memory, and storage;
-Prioritizes SSD longevity—sequential read/writes, reduced TBW/WA—to the extent
-Perl and the OS let us.
+PoC implements status, add, commit, log, show, diff. Supports symlinks, binary
+files. Optimized for SSD longevity — sequential reads/writes, reduced TBW/WA.
 
-A sorted index tracks files. Staging an object copies it to staging area, adds
-path, mtime, size, content hash (SHA-1) to index:
+Architecture: sorted index tracks files. Staging copies object to staging area,
+records path, mtime, size, SHA-1:
 
 ```
 my $p = $wrk_entry->{path};
@@ -31,24 +29,17 @@ printf $out "%-40s\t%-40s\t%-40s\t%-12d\t%-10d\t%s\n",
 $wrk_entry->{size}, $p;
 ```
 
-Urn doesn't use delta chains or a CAS. It stores one copy of a file (base).
-Revisions are recorded as patches relative to the base. When the patch outgrows
-the file, urn snapshots that file and tracks future modifications relative to
-the new base.
+No CAS, no delta chains. One base file per tracked object. Revisions are
+patches against the base. When the patch outgrows the file, snapshot and
+rebase. Commits record directory structure and patchset. Patches stored in a
+tarball—one inode per commit. Tarballs >512 bytes gzipped. Trees deduplicated
+by content hash.
 
-Commits record the directory structure and patchset. Patches are stored in a
-tarball to minimize inode usage. Tarballs larger than 512 bytes are gzipped.
-Compressing anything less isn't worth the effort. Trees are deduplicated using
-the content hash. Multiple commits point to a single tree.
+Diff/show: look up revision, find tree and patchset, apply patch. O(1)
+checkout. Single corrupt patch can't poison history.
 
-Diff/show commands look up revision, find tree and patchset in the object
-store, and apply the patch. Single-patch model keeps checkout operations at
-O(1) and resists a single corrupt patch corrupting history.
-
-External tools used like sort, diff, patch, tar, gzip are in the base system.
-Pipes and streams are used to communicate with them to keep the memory
-footprint low. Most operations are executed in memory. MEM_LIMIT can be used to
-fallback to disk when working with large repositories:
+External tools: sort, diff, patch, tar, gzip—all base system. Pipes and streams
+throughout. MEM_LIMIT falls back to disk for large repos:
 
 ```
 use constant MEM_LIMIT => 64 * 1024 * 1024;
@@ -78,8 +69,8 @@ if (!$use_disk) {
 }
 ```
 
-Index/tree processing is O(N)—two-finger walk. To keep IO transparent, they are
-streamed line-by-line, using carefully calibrated buffers, instead of mmap.
+Index and tree processing: O(N) two-finger walk. Streamed line-by-line,
+calibrated buffers. No mmap.
 
 ## Benchmarks
 
@@ -193,8 +184,7 @@ Repo size | 91620 KB | 70592 KB
 
 ### Impact of commits over time
 
-Each commit modifies 2% of files. Modifications simulate small patches: a few
-lines added, and a few lines deleted.
+Each commit modifies 2% of files — a few lines added, a few deleted.
 
 ```
 =============================================================
@@ -257,15 +247,14 @@ Repo Size | 19868 KB | 49840 KB
 -------------------------------------------------------------
 ```
 
-Overall, git has the speed and memory advantage. Curiously though, on a cold
-start, urn's add + commit time beats git's, while git's aggressive zlib
-compression beats urn's disk usage.
+Git wins on speed and memory. Cold start is the exception — urn's add + commit
+beats git there. Git's zlib compression wins on initial disk usage.
 
-Performance profile over many commits tells a different story, however. Git's
-optimized C core eventually and consistently outperforms urn. Urn's disk usage,
-which is what it was optimized for, is more stable than git's. In 80 commits,
-git wrote 27 MB of data to the disk, while urn only wrote 0.6 MB. Git's inode
-use exploded from 2,122 to 5,341. Urn only went from 1,302 to 1,464.
+Over time the picture flips. 80 commits: git wrote 27 MB, urn wrote 0.6 MB.
+Git's inode count went from 2,122 to 5,341. Urn's went from 1,302 to 1,464. The
+thing it was built to do, it does.
 
-Verdict: viable.
+Commit: <a
+href="https://git.asciimx.com/urn/commit/?id=57eb41d13914c2fdadcb863d36d73848a5fd589b"
+class="external" target="_blank" rel="noopener noreferrer">57eb41d</a>
 
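
Note on the "O(N) two-finger walk" named in the third hunk: the diff does not include urn's implementation, so the following is only a minimal Perl sketch of the general technique. It assumes both inputs are already sorted by path and that each record is a `[path, hash]` pair; the sub name `two_finger_walk`, the record layout, and the `A`/`M`/`D` labels are hypothetical and not taken from urn. Unlike the post's description, this sketch holds both lists in memory instead of streaming them line by line.

```perl
use strict;
use warnings;

# Hypothetical helper, not urn's code: compare two path-sorted lists of
# [path, hash] records in one pass and classify every path that differs.
sub two_finger_walk {
    my ($old, $new) = @_;
    my ($i, $j) = (0, 0);
    my @changes;
    while ($i < @$old || $j < @$new) {
        # An exhausted side compares as "greater" so the other side drains.
        my $cmp = $i >= @$old ? 1
                : $j >= @$new ? -1
                : $old->[$i][0] cmp $new->[$j][0];
        if ($cmp < 0) {
            push @changes, "D $old->[$i][0]";    # path only in the old list
            $i++;
        } elsif ($cmp > 0) {
            push @changes, "A $new->[$j][0]";    # path only in the new list
            $j++;
        } else {
            # Same path on both sides: report it only if the hash changed.
            push @changes, "M $old->[$i][0]"
                if $old->[$i][1] ne $new->[$j][1];
            $i++;
            $j++;
        }
    }
    return @changes;
}

# Made-up sample data: a committed tree and a newer index.
my @tree  = (['a.txt', 'h1'], ['b.txt', 'h2'], ['d.txt', 'h4']);
my @index = (['a.txt', 'h1'], ['b.txt', 'h9'], ['c.txt', 'h3']);
print "$_\n" for two_finger_walk(\@tree, \@index);
```

Each iteration advances at least one of the two cursors, so the loop runs at most once per record across both lists, which is where the O(N) bound comes from. On the sample data it prints `M b.txt`, `A c.txt`, `D d.txt`.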
