Practical Merkle Trees
April 10, 2026

I've been thinking a lot about Merkle trees lately. Mostly because I recently had to implement one for a data pipeline re-processing system, and I quickly realized that most of the explainers out there are either deeply academic or hopelessly obsessed with blockchains.
Usually, the explanations stop right after "you hash the leaves, and then you hash the parents". That part is trivial. It's also the least interesting part of making a Merkle tree actually useful in a real system.
I built a tiny playground to get a better feel for what actually matters in practice. Drag across the data stream to insert an anomaly and watch what happens.
If you played around with the demo, you probably noticed the glaring issue.
The top half splits the input exactly every n bytes. The bottom half uses a tiny, naive rolling boundary heuristic (Content-Defined Chunking). It's far from a production-grade CDC algorithm, but it's good enough to illustrate the point: a Merkle tree only localizes change if the leaves have a chance to line back up.
With fixed-size chunking, inserting a single byte shifts every boundary after it. The hash function did nothing wrong; you just picked leaves that are too brittle. With content-defined chunking, the boundaries drift locally and then settle down again. That means later leaf hashes, and therefore later subtrees, can still be reused.
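To make that concrete, here's a minimal rolling-boundary chunker in the same naive spirit as the playground's heuristic. The window size, boundary mask, and minimum chunk size are illustrative picks, not tuned values, and this is nowhere near a production CDC algorithm:

```go
package main

import "fmt"

// chunk splits data at content-defined boundaries: keep a cheap rolling
// sum over a small sliding window, and cut whenever the sum's low bits
// match a fixed pattern. Boundaries depend only on nearby bytes, so an
// insertion disturbs them locally and later boundaries line back up.
// window, mask, and the minimum chunk size are illustrative only.
func chunk(data []byte) [][]byte {
	const window = 16
	const mask = 0x3F // cut on average every ~64 bytes
	var chunks [][]byte
	start := 0
	sum := 0
	for i, b := range data {
		sum += int(b)
		if i >= window {
			sum -= int(data[i-window]) // slide the window forward
		}
		// Enforce a minimum chunk size so we don't cut constantly.
		if i-start >= window && sum&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing remainder
	}
	return chunks
}

func main() {
	a := []byte("the quick brown fox jumps over the lazy dog, twice over")
	b := append([]byte("X"), a...) // insert one byte at the front
	fmt.Println(len(chunk(a)), len(chunk(b)))
}
```

The property that matters: because each cut decision only looks at the last few bytes, the chunking of `b` resynchronizes with the chunking of `a` shortly after the insertion point, so downstream leaves can be reused.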
Domain separation is cheap
The mechanical part of hashing is boring, but there is one detail worth calling out. Prefix your leaf hashes and your internal node hashes differently.
```go
package merkle

import "crypto/sha256"

type Hash [32]byte

func hashLeaf(data []byte) Hash {
	return sha256.Sum256(append([]byte("leaf:"), data...))
}

func hashNode(left, right Hash) Hash {
	buf := make([]byte, 0, len("node:")+64)
	buf = append(buf, "node:"...)
	buf = append(buf, left[:]...)
	buf = append(buf, right[:]...)
	return sha256.Sum256(buf)
}
```
That `leaf:` / `node:` prefix is just domain separation. It prevents structurally different things from accidentally collapsing onto the exact same byte representation: without it, a leaf whose data happens to equal two concatenated child hashes would hash identically to an internal node, which is exactly the ambiguity second-preimage tricks exploit. It's cheap, so just do it.
Same object, different root
Another easy way to make Merkle trees annoying is unstable serialization.
```json
{"name":"lime","count":2}
{"count":2,"name":"lime"}
```
These two objects mean the exact same thing. But if you hash their raw bytes directly, they won't produce the same root. That's not a Merkle problem; that's an encoding problem.
Canonicalization matters way more than most "tree of hashes" explanations admit. If equivalent inputs don't collapse to equivalent bytes, your cache hit rate gets worse, your sync gets noisier, and your "incremental" system starts acting suspiciously full-rebuild-ish.
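One cheap way out for JSON-ish data is to decode and re-encode through a canonical serializer before hashing anything. Go's `encoding/json` happens to sort map keys on marshal, which is enough for this small demo; a real canonicalization scheme (number formatting, unicode escaping, nested types) needs more care:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// canonicalHash decodes JSON and re-encodes it before hashing.
// json.Marshal sorts map keys, so key order stops mattering.
// This is a sketch: real schemes also pin down number formatting,
// string escaping, and so on.
func canonicalHash(raw []byte) ([32]byte, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return [32]byte{}, err
	}
	canon, err := json.Marshal(v)
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(canon), nil
}

func main() {
	a, _ := canonicalHash([]byte(`{"name":"lime","count":2}`))
	b, _ := canonicalHash([]byte(`{"count":2,"name":"lime"}`))
	fmt.Println(a == b) // true: same object, same hash
}
```

Run the two lime objects from above through this and they collapse to the same leaf bytes, which is the whole point.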
Trees explain it, DAGs pay the bills
A plain Merkle tree is fine for an explainer. But real systems usually want a Merkle DAG.
If the same subtree appears twice, a tree stores it twice. A DAG interns it once and lets both parents point at it. That's a much better fit for chunk stores, build caches, package stores, and Git-style object graphs.
There's nothing mystical going on here. The DAG just stops pretending repeated structure is new structure.
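A content-addressed store makes the interning almost automatic: the hash is the key, so storing the same subtree twice is a no-op. This is a toy sketch; `Store` and its in-memory map are illustrative, not a real persistence layer:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type Hash [32]byte

// Store is a toy content-addressed node store. Nodes are keyed by
// their own hash, so identical subtrees are interned automatically:
// both parents end up pointing at the same entry.
type Store struct {
	nodes map[Hash][]byte
}

func NewStore() *Store {
	return &Store{nodes: make(map[Hash][]byte)}
}

// Put stores a blob under its hash and returns the hash. Re-inserting
// identical bytes hits the same key, so nothing new is stored.
func (s *Store) Put(blob []byte) Hash {
	h := sha256.Sum256(blob)
	if _, ok := s.nodes[h]; !ok {
		s.nodes[h] = blob
	}
	return h
}

func main() {
	s := NewStore()
	left := s.Put([]byte("same subtree"))
	right := s.Put([]byte("same subtree")) // deduplicated, no new entry
	fmt.Println(left == right, len(s.nodes)) // true 1
}
```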
Some notes and thoughts
- Chunking strategy matters more than most people expect. If the chunking is bad, your Merkle tree is bad.
- Canonicalization is not optional if you want stable roots.
- Domain separation is cheap. Do it.
- Trees are great for explaining the concept. DAGs are what usually make the storage story pay rent.
That's the part I really like about Merkle trees. The math itself is simple. Most of the real behavior lives in the boring, practical implementation details.