Dacite: data citing with fused hashing.
Version: 0.5.0-draft
Status: Active development
Last updated: 2026-02-27
Dacite is a system for distributed immutable data structures with content-addressed nodes.
All hashes are 256-bit values represented as 4 × 64-bit words (most significant first):
hash = [c0, c1, c2, c3] (4 × i64, big-endian word order)
Word c0 contains the most mixed bits (from fuse) and is used first for HAMT navigation.
Fuse combines two 256-bit hashes using a 4×4 upper triangular matrix over 64-bit cells.
The output is ordered so that the most mixed bits appear first (most significant), optimizing for HAMT navigation which uses leading bits:
Input: a = [a0, a1, a2, a3] (256 bits as 4 × 64-bit words, MSB first)
b = [b0, b1, b2, b3]
Output: c = [c0, c1, c2, c3]
c0 = a0 + a3*b2 + b0 ← most bit mixing (MSB, used for HAMT)
c1 = a1 + b1
c2 = a2 + b2
c3 = a3 + b3 ← least bit mixing (LSB)
All arithmetic is mod 2^64 (unsigned wraparound). Total cost: 5 additions and 1 multiplication.
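A minimal sketch of fuse in Python, with 4-tuples of ints standing in for the four u64 words (not the reference implementation; arithmetic is masked to 64 bits to model wraparound):

```python
MASK = (1 << 64) - 1  # all arithmetic wraps mod 2^64

def fuse(a, b):
    """Combine two 256-bit hashes, each given as 4 big-endian u64 words."""
    return (
        (a[0] + a[3] * b[2] + b[0]) & MASK,  # c0: the only multiplicative mix
        (a[1] + b[1]) & MASK,                # c1
        (a[2] + b[2]) & MASK,                # c2
        (a[3] + b[3]) & MASK,                # c3
    )

ZERO = (0, 0, 0, 0)  # two-sided identity
```

The single cross term a3*b2 is what makes the operation associative but non-commutative, matching the upper-triangular-matrix construction.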
Properties:
[0, 0, 0, 0] is a two-sided identity: fuse(a, 0) = fuse(0, a) = a.

The fuse operation forms a group over (Z/2^64)^4. Every hash has a unique two-sided inverse:
inv([a0, a1, a2, a3]) = [a3*a2 - a0, -a1, -a2, -a3]
Such that:
fuse(inv(a), a) = fuse(a, inv(a)) = [0, 0, 0, 0]
All arithmetic is mod 2^64. The inverse costs 1 multiply + 4 negations.
Unfuse removes a known component from a fused hash:
unfuse(fused, b) = fuse(fused, inv(b))
Given fused = fuse(a, b), then unfuse(fused, b) = a. To strip from the left instead: fuse(inv(a), fused) = b.
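A sketch of inv and unfuse in Python, under the same 4-tuple representation as above (an illustration, not the reference implementation):

```python
MASK = (1 << 64) - 1

def fuse(a, b):
    return ((a[0] + a[3] * b[2] + b[0]) & MASK, (a[1] + b[1]) & MASK,
            (a[2] + b[2]) & MASK, (a[3] + b[3]) & MASK)

def inv(a):
    """Two-sided inverse: fuse(a, inv(a)) == fuse(inv(a), a) == (0, 0, 0, 0)."""
    return ((a[3] * a[2] - a[0]) & MASK, -a[1] & MASK, -a[2] & MASK, -a[3] & MASK)

def unfuse(fused, b):
    """Strip a known right component: unfuse(fuse(a, b), b) == a."""
    return fuse(fused, inv(b))
```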
The group structure enables removing known components from fused hashes, which is the basis for the O(1) content comparison described under Typed Values.
Dacite hashing is built on a precomputed table mapping each byte value (0–255) to a 256-bit hash:
byte_hash: byte → Hash (256 entries)
The default table is seeded using SHA-256: byte_hash[i] = sha256(byte_array([i])). However, any set of 256 distinct, high-quality 32-byte values may be used. This decouples dacite references from any specific hash function at runtime.
The byte hash table is a build-time constant — it is not stored inside the content-addressed space. Instead, the table’s own hash serves as a protocol identifier:
protocol_id = fuse_bytes(concat(table[0], table[1], ..., table[255]))
The table is hashed using itself: each row is a 32-byte value, the rows are concatenated into an 8,192-byte sequence, and fuse_bytes (which uses the table) produces the protocol ID.
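A sketch of the default table seeding and the self-referential protocol ID, in Python with hashlib (tuples of ints stand in for hashes; the unchecked fuse variant is used, as in the spec's fuse_bytes definition):

```python
import hashlib

MASK = (1 << 64) - 1

def fuse(a, b):
    # Unchecked fuse over 4 big-endian u64 words
    return ((a[0] + a[3] * b[2] + b[0]) & MASK, (a[1] + b[1]) & MASK,
            (a[2] + b[2]) & MASK, (a[3] + b[3]) & MASK)

def words(digest):
    """Split a 32-byte value into 4 big-endian u64 words, most significant first."""
    return tuple(int.from_bytes(digest[i:i + 8], "big") for i in range(0, 32, 8))

# Default seeding: byte_hash[i] = sha256(byte_array([i]))
table_rows = [hashlib.sha256(bytes([i])).digest() for i in range(256)]
byte_hash = [words(row) for row in table_rows]

def fuse_bytes(bs):
    acc = (0, 0, 0, 0)
    for b in bs:
        acc = fuse(acc, byte_hash[b])
    return acc

# The table hashed with itself: 256 rows x 32 bytes = 8,192 bytes in, one ID out
protocol_id = fuse_bytes(b"".join(table_rows))
```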
Two stores are compatible if and only if they share the same protocol ID. Implementations check this on first contact (e.g., via a DACITE_PROTOCOL file). The table itself is distributed out-of-band: hardcoded in implementations, published as a spec artifact, or fetched by protocol ID from a registry. The table is never stored as a content-addressed entry — it is infrastructure, not data.
Well-known location: The default byte hash table (SHA-256 seeded) is published as a spec artifact (spec/byte_hash_table.bin, 8,192 bytes) at a well-known URL and included in the reference implementation source. Implementations must verify the table by computing fuse_bytes(concat(table[0..255])) and comparing against the expected protocol ID before use.
All data hashing is performed by mapping bytes through the table and fusing:
fuse_bytes(bs) = reduce(unchecked_fuse, [0,0,0,0], map(byte_hash, bs))
fuse_str(s) = fuse_bytes(utf8_bytes(s))
Properties inherited from fuse:
fuse(fuse_str(a), fuse_str(b)) = fuse_str(a ++ b)

The associativity property means composing fuse results is equivalent to fusing the concatenated bytes. This is both a feature (composability) and a constraint (domain separators are needed between fields — see Typed Values).
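The concatenation property can be checked directly. A sketch in Python (SHA-256-seeded table as in the default seeding; tuples of ints stand in for hashes):

```python
import hashlib

MASK = (1 << 64) - 1

def fuse(a, b):
    return ((a[0] + a[3] * b[2] + b[0]) & MASK, (a[1] + b[1]) & MASK,
            (a[2] + b[2]) & MASK, (a[3] + b[3]) & MASK)

byte_hash = [
    tuple(int.from_bytes(h[i:i + 8], "big") for i in range(0, 32, 8))
    for h in (hashlib.sha256(bytes([n])).digest() for n in range(256))
]

def fuse_bytes(bs):
    # Left fold of (unchecked) fuse from the identity
    acc = (0, 0, 0, 0)
    for b in bs:
        acc = fuse(acc, byte_hash[b])
    return acc

def fuse_str(s):
    return fuse_bytes(s.encode("utf-8"))
```

Because fuse_bytes is a left fold of an associative operation from the identity, fusing two results is identical to fusing the concatenated input.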
Fuse must reject low-entropy inputs and outputs. A hash is low-entropy when its lower 32 bits are zero in all four words:
low_entropy?(h) =
(h[0] & 0xFFFFFFFF) == 0 AND
(h[1] & 0xFFFFFFFF) == 0 AND
(h[2] & 0xFFFFFFFF) == 0 AND
(h[3] & 0xFFFFFFFF) == 0
The checked fuse operation:
fuse(a, b):
REJECT if low_entropy?(a)
REJECT if low_entropy?(b)
result = unchecked_fuse(a, b)
REJECT if low_entropy?(result)
return result
An unchecked variant (unchecked_fuse) is available for internal use where inputs are known to be valid (e.g., combining measures).
See: Hash Fusing — Detecting Low Entropy Failures
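The checked/unchecked split can be sketched as follows (Python illustration; the LowEntropyError name is a stand-in for whatever rejection mechanism an implementation uses):

```python
MASK = (1 << 64) - 1
LOW32 = 0xFFFFFFFF

def unchecked_fuse(a, b):
    return ((a[0] + a[3] * b[2] + b[0]) & MASK, (a[1] + b[1]) & MASK,
            (a[2] + b[2]) & MASK, (a[3] + b[3]) & MASK)

def low_entropy(h):
    # Lower 32 bits are zero in all four words
    return all(w & LOW32 == 0 for w in h)

class LowEntropyError(ValueError):
    """Hypothetical rejection signal; the spec only says REJECT."""

def checked_fuse(a, b):
    if low_entropy(a) or low_entropy(b):
        raise LowEntropyError("low-entropy input")
    result = unchecked_fuse(a, b)
    if low_entropy(result):
        raise LowEntropyError("low-entropy result")
    return result
```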
Dacite has exactly three primitive kinds. Everything in the system is built from these:
A scalar is an atomic, bounded-size value. It is the only primitive that contains raw data (bytes) rather than references to other values. Maximum size: 255 bytes (u8 length).
scalar_hash = fuse_bytes(canonical_bytes)
Scalars are untyped at the primitive level. A byte with value 97 and a char ‘a’ may be the same scalar (same bytes, same hash). Types give meaning to scalars (see Typed Values).
Examples of scalar data: small integers, booleans, single UTF-8 characters, raw bytes.
A seq is an ordered collection of references, implemented as a Finger Tree. It is the universal building block for ordered data.
A seq’s hash is its semantic hash, derived from the accumulated measure:
seq_hash = root.measure.elements_fuse
Two seqs with the same elements in the same order have the same hash, regardless of internal tree structure. Different tree shapes for the same logical sequence normalize to the same hash in the content-addressed store.
A map is an unordered collection of key-value pairs, implemented as a HAMT.
A map’s hash is its semantic hash, which is insertion-order independent because the HAMT traversal order is determined by key hashes (ascending):
map_hash = root.measure.elements_fuse
Where each entry contributes fuse(key_hash, value_hash) to the measure.
A typed value is a 2-element seq:
typed_value = seq(type_name, data)
Where:
This is a structural convention, not enforced by the storage layer. The system treats typed values as ordinary seqs; the “typed” interpretation is applied by consumers.
A type name is an untyped seq of char scalars:
type_name("string") = seq('s', 't', 'r', 'i', 'n', 'g')
Each character is a raw scalar (UTF-8 bytes). The type name’s hash is the seq’s semantic hash (elements-fuse of its char scalar hashes).
Type names are self-documenting: given a typed value, read position 0 to discover its type. No external registry needed.
Convention for built-in types: bare names (e.g., "string", "i64", "vector"). User-defined types should use namespaced names (e.g., "myapp/user") to avoid collisions.
The semantic hash of a typed value uses a null domain separator between the type name and data:
typed_hash = fuse_bytes(type_name_bytes ++ 0x00 ++ encode(data))
Equivalently (by fuse associativity):
typed_hash = fuse(fuse(fuse_str(type_name), fuse_bytes([0x00])), fuse_bytes(encode(data)))
The null separator is essential: without it, fuse(fuse_str(a), fuse_str(b)) = fuse_str(a ++ b) would cause boundary collisions (e.g., type "i64" with data "2" would collide with type "i6" with data "42"). Since type names cannot contain null bytes, the separator makes the encoding unambiguous.
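The boundary collision and its fix can be demonstrated directly. A Python sketch (SHA-256-seeded table; unseparated_hash exists only to illustrate the failure mode):

```python
import hashlib

MASK = (1 << 64) - 1

def fuse(a, b):
    return ((a[0] + a[3] * b[2] + b[0]) & MASK, (a[1] + b[1]) & MASK,
            (a[2] + b[2]) & MASK, (a[3] + b[3]) & MASK)

byte_hash = [
    tuple(int.from_bytes(h[i:i + 8], "big") for i in range(0, 32, 8))
    for h in (hashlib.sha256(bytes([n])).digest() for n in range(256))
]

def fuse_bytes(bs):
    acc = (0, 0, 0, 0)
    for b in bs:
        acc = fuse(acc, byte_hash[b])
    return acc

def typed_hash(type_name, data):
    # Null domain separator between type name and encoded data
    return fuse_bytes(type_name.encode("utf-8") + b"\x00" + data)

def unseparated_hash(type_name, data):
    # What the hash would be WITHOUT the separator (illustration only)
    return fuse_bytes(type_name.encode("utf-8") + data)
```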
Two typed values with different types but the same underlying data can be compared in O(1) by stripping the type name hash from the left:
content_hash(typed_value) = fuse(inv(h(type_name)), typed_hash)
Example: a string "abc" and a vector of chars ['a', 'b', 'c']:
string_content = fuse(inv(h("string")), string_hash)
vector_content = fuse(inv(h("vector")), vector_hash)
string_content == vector_content // true — same underlying data
This works because both have the same data seq (a seq of char scalars), and the group structure of fuse allows cleanly removing the type prefix.
All built-in types follow the seq(type_name, data) convention:
| Type Name | Data | Description |
|---|---|---|
| "null" | null scalar (0 bytes) | Unit type |
| "bool" | 1-byte scalar | Boolean |
| "i8" … "i256" | 1–32 byte scalar (big-endian signed) | Signed integers |
| "u8" … "u256" | 1–32 byte scalar (big-endian unsigned) | Unsigned integers |
| "f32", "f64" | 4 or 8 byte scalar (IEEE 754) | Floating point |
| "char" | 1–4 byte scalar (UTF-8) | Unicode character |
| "string" | seq of char scalars | UTF-8 string |
| "blob" | seq of byte scalars | Binary data |
| "vector" | seq of arbitrary values | Ordered collection |
| "map" | map (HAMT) | Key-value collection |
A set is a map where every key maps to itself:
set({a, b, c}) = map({a: a, b: b, c: c})
There is no "set" type. Sets are ordinary "map" values. Content addressing means the duplicate key/value reference costs zero additional storage — both slots point to the same hash.
Membership test: get(set, x) != nil.
A negative set represents “everything except these elements.” It uses a distinguished sentinel element neg as a key:
negative_set({a, b}) = map({neg: neg, a: a, b: b})
Where neg is a typed value ["neg", null] — a null scalar with the type name "neg".
Membership test for a negative set: x is a member if get(set, x) == nil (inverted logic). The presence of the neg key signals the inversion.
Set operations with positive (pos) and negative (neg) sets:
In the table below, A and B refer to the stored maps (including the neg sentinel when present). All operations are plain map operations — the neg sentinel is just another key that flows through naturally.
| A | B | union (A ∪ B) | intersect (A ∩ B) | difference (A \ B) |
|---|---|---|---|---|
| pos | pos | merge(A, B) | keep(A, B) | remove(A, B) |
| neg | neg | keep(A, B) | merge(A, B) | remove(B, A) |
| pos | neg | remove(B, A) | remove(A, B) | keep(A, B) |
| neg | pos | remove(A, B) | remove(B, A) | merge(A, B) |
Where:
- merge(A, B) — add B’s keys not already in A
- keep(A, B) — keep only A’s keys that are also in B
- remove(A, B) — remove A’s keys that are also in B

Other operations:
| Operation | Definition |
|---|---|
| complement(A) | Toggle the neg key (add or remove) |
| member?(A, x) | pos: get(A, x) != nil · neg: get(A, x) == nil |
This is a pure convention — the store and core types know nothing about sets or neg. Utility functions implement the convention.
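Such utility functions can be sketched over plain dicts (Python illustration; the NEG string is a stand-in for the typed ["neg", null] sentinel, and union shows the table-driven dispatch for one operation):

```python
NEG = "<neg>"  # stand-in for the typed ["neg", null] sentinel value

def negative(s):
    return NEG in s

def member(s, x):
    if negative(s):
        return x != NEG and x not in s  # inverted membership
    return x in s

def complement(s):
    out = dict(s)
    if NEG in out:
        del out[NEG]    # neg -> pos
    else:
        out[NEG] = NEG  # pos -> neg
    return out

def merge(a, b):
    return {**a, **b}

def keep(a, b):
    return {k: v for k, v in a.items() if k in b}

def remove(a, b):
    return {k: v for k, v in a.items() if k not in b}

def union(a, b):
    # Dispatch per the pos/neg table above
    pa, pb = not negative(a), not negative(b)
    if pa and pb:
        return merge(a, b)
    if not pa and not pb:
        return keep(a, b)
    return remove(b, a) if pa else remove(a, b)
```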
Seqs and maps are implemented using tree structures. The internal nodes of these trees are not user-facing values — they are implementation machinery stored in the content-addressed store.
Internal nodes are hashed using their type name (with a null domain separator) and semantic content:
node_hash = fuse(fuse_str(node_type_name ++ 0x00), node.measure.elements_fuse)
The null byte terminates the type name, preventing collisions with longer type names. This means nodes with the same type and the same logical elements produce the same hash, regardless of internal tree shape. Different structural arrangements of the same sequence normalize to a single hash in the store. This is correct because nodes with the same elements are functionally interchangeable.
Exception — HAMT bitmap nodes: Bitmap nodes include the bitmap value in their hash because it determines routing structure. Two bitmaps with the same elements but different bitmaps route lookups differently and are NOT interchangeable:
hamt_bitmap_hash = fuse(node_hash, fuse_bytes(bitmap_as_8_bytes))
Without this, single-child bitmap nodes at different HAMT levels would collide (same elements-fuse, same type), creating self-referential loops in the store.
Collision resistance: The fuse-based hash chain has ~2^96 birthday-bound collision resistance (from the additive structure of fuse components c1–c3). This is weaker than SHA-256’s ~2^128 but astronomically beyond practical attack. The security bound is independent of how the byte hash table is seeded.
Threat model: Fuse-based hashes are designed for cooperative environments where all participants are trusted to produce honest data. In adversarial contexts — where untrusted peers contribute data to a shared store — fuse hashes alone should not be relied upon for integrity. Deployments accepting data from untrusted sources should pair dacite hashes with a cryptographic hash (e.g., SHA-256) for verification. A practical approach: store {dacite_hash, sha256} pairs at trust boundaries and verify the cryptographic hash on ingest.
| Node Type | Description | Data Fields |
|---|---|---|
| "ft/empty" | Empty seq | {measure} |
| "ft/single" | Single element wrapper | {value_hash, measure} |
| "ft/digit" | Finger (1–32 children) | {children: [hash...], measure} |
| "ft/node" | Internal node (2–32 children) | {children: [hash...], measure} |
| "ft/deep" | Deep tree | {left, spine, right, measure} |
| Node Type | Description | Data Fields |
|---|---|---|
| "hamt/empty" | Empty map | {measure} |
| "hamt/entry" | Single key-value pair | {key_hash, key_ref, val_ref, measure} |
| "hamt/bitmap" | Sparse internal node | {bitmap, children: [hash...], measure} |
All child references are hashes pointing to other nodes in the content-addressed store. This ensures every node has bounded size regardless of collection size.
Every internal node caches a measure of its subtree:
Measure = {
count: u64, // number of scalar elements
size_bytes: u64, // total size in bytes of scalar data
elements_fuse: Hash // running fuse of all element hashes
}
Measures combine as a monoid:
combine(m1, m2) = {
count: m1.count + m2.count,
size_bytes: m1.size_bytes + m2.size_bytes,
elements_fuse: unchecked_fuse(m1.elements_fuse, m2.elements_fuse)
}
identity = { count: 0, size_bytes: 0, elements_fuse: [0, 0, 0, 0] }
The elements_fuse field uses unchecked_fuse because the identity element [0, 0, 0, 0] is technically low-entropy; this is safe since measures are internal bookkeeping.
For seq scalars, elements_fuse equals the element’s hash. For map entries, elements_fuse equals fuse(key_ref, val_ref).
The root’s measure gives O(1) access to count, total size, and semantic hash.
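A sketch of the measure monoid in Python (dicts stand in for the Measure record; unchecked_fuse is used for elements_fuse, as the spec requires):

```python
MASK = (1 << 64) - 1

def unchecked_fuse(a, b):
    return ((a[0] + a[3] * b[2] + b[0]) & MASK, (a[1] + b[1]) & MASK,
            (a[2] + b[2]) & MASK, (a[3] + b[3]) & MASK)

IDENTITY = {"count": 0, "size_bytes": 0, "elements_fuse": (0, 0, 0, 0)}

def combine(m1, m2):
    # Monoid combine: counts and sizes add (u64 wraparound), element hashes fuse
    return {
        "count": (m1["count"] + m2["count"]) & MASK,
        "size_bytes": (m1["size_bytes"] + m2["size_bytes"]) & MASK,
        "elements_fuse": unchecked_fuse(m1["elements_fuse"], m2["elements_fuse"]),
    }
```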
The 2–32 branching factor keeps trees shallow, minimizing network round trips. A tree of 1M elements is only ~4 levels deep (32^4 ≈ 1M).
Note: Classic finger trees use a 2–3 branching factor, where amortized O(1) push/pop is proven via a potential-based argument. The 2–32 branching factor used here trades the classic amortized guarantee for shallower trees, reducing network round trips at the expense of more work during node splits and merges. Random access and splits remain O(log n). In practice, the constant factors are favorable because shallow trees mean fewer store fetches — the dominant cost in a distributed setting.
All nodes have bounded size because children are stored as hashes (32 bytes each), not inline.
The key’s hash is consumed 5 bits at a time, from most significant to least:
Level 0: bits 255–251 of key_hash (upper 5 bits of c0)
Level 1: bits 250–246
...
Level 51: bit 0 of key_hash (only 1 bit remains, since 256 = 51 × 5 + 1)
Each 5-bit chunk selects one of 32 possible child positions.
Internal nodes use a 32-bit bitmap to represent which child positions are occupied. The actual children array is compressed — only occupied positions have entries:
idx = popcount(bitmap & ((1 << chunk) - 1))
HAMT traversal visits children in bitmap order (ascending chunk index), which corresponds to ascending key hash order. This makes elements_fuse deterministic regardless of insertion order.
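The chunk extraction and compressed indexing can be sketched as follows (Python; the key hash is modeled as a single 256-bit int rather than 4 words):

```python
def chunk(key_hash, level):
    """5-bit chunk of a 256-bit key hash (as an int), most significant bits first."""
    shift = 251 - 5 * level  # level 0 covers bits 255..251
    if shift >= 0:
        return (key_hash >> shift) & 0b11111
    return key_hash & ((1 << (shift + 5)) - 1)  # final level: fewer than 5 bits left

def child_index(bitmap, c):
    """Index into the compressed children array: popcount of bitmap bits below chunk c."""
    return bin(bitmap & ((1 << c) - 1)).count("1")
```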
All nodes are stored and retrieved by their content hash through a minimal protocol:
s-get(hash) → value | nil // retrieve a value by hash
s-put(hash, value) → store // store a value at its hash (idempotent)
s-has?(hash) → boolean // check existence
s-snapshot() → map // bulk read of all entries (for construction/debugging)
s-merge(entries) → store // bulk insert of {hash: value} pairs
s-reset() → store // clear all entries
Since values are immutable and content-addressed, s-put is idempotent — storing the same value twice is a no-op.
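A minimal in-memory store satisfying this protocol might look like the following (Python sketch; method names are Pythonized stand-ins for s-get, s-put, etc.):

```python
class MemStore:
    """Sketch of the minimal store protocol over an in-memory dict."""

    def __init__(self):
        self._entries = {}

    def get(self, h):                        # s-get: value | nil
        return self._entries.get(h)

    def put(self, h, value):                 # s-put: idempotent
        self._entries.setdefault(h, value)
        return self

    def has(self, h):                        # s-has?
        return h in self._entries

    def merge(self, entries):                # s-merge: bulk insert
        for h, v in entries.items():
            self.put(h, v)
        return self

    def snapshot(self):                      # s-snapshot: bulk read
        return dict(self._entries)

    def reset(self):                         # s-reset: clear all entries
        self._entries.clear()
        return self
```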
Three implementations compose to form a store hierarchy:
Memory store — atom-backed in-memory map. Fast, ephemeral. The default for construction and testing.
File store — content-addressed filesystem with directory sharding (like git objects). Hashes map to file paths: base-dir/ab/cdef0123.... Provides durable persistence.
Layered store — composes an ordered list of stores with read-through semantics. Reads walk the layers top-to-bottom, returning the first hit. Writes propagate to all layers. This enables transparent caching hierarchies:
LayeredStore([MemStore, FileStore])
Reads check memory first, then disk. Writes go to both. A remote store (planned) slots in as another layer — reads that miss locally fetch from the network, and the result is cached in upper layers automatically.
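The read-through walk with upper-layer backfill can be sketched as follows (Python; plain dicts stand in for the member stores):

```python
class LayeredStore:
    """Read-through layering: first hit wins, writes propagate to all layers."""

    def __init__(self, layers):
        self.layers = layers  # dict-like stores, fastest first

    def get(self, h):
        for i, layer in enumerate(self.layers):
            v = layer.get(h)
            if v is not None:
                for upper in self.layers[:i]:
                    upper[h] = v  # cache the hit in the faster layers above
                return v
        return None

    def put(self, h, value):
        for layer in self.layers:
            layer[h] = value  # writes go to every layer
        return self
```

Backfill on read is safe precisely because values are immutable: a re-fetched value is always identical to any cached copy.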
A CacheMap wraps a store as a standard associative map interface:
- get(hash) → fetches from store on demand (lazy loading)
- assoc(hash, value) → commits to store immediately (write-through)
- merge(cm1, cm2) → no-op when sharing the same backing store

Data structures operate on [dacite-map, root-hash] tuples where dacite-map can be either a plain in-memory map or a CacheMap. This abstraction lets the same data-structure code run unchanged over ephemeral in-memory maps and lazily loaded store-backed maps.
The server uses the node’s accumulated size_bytes to decide the response format:
Request:
GET /node/{hash}?inline_under={bytes}
Response modes:
If node.measure.size_bytes > inline_under:
StructureResponse {
kind: "structure",
children: [Hash], // hashes only, client fetches lazily
measure: Measure
}
If node.measure.size_bytes <= inline_under:
InlineResponse {
kind: "inline",
values: [Value], // all scalar values, fully materialized
measure: Measure
}
Default threshold: 1024 bytes (~1 TCP packet). Client controls threshold based on network conditions.
For collections whose leaves are small scalars (strings, blobs), the adaptive fetch mechanism can be extended with leaf coalescing: instead of returning individual scalar hashes, the server materializes contiguous runs of leaf values into raw byte chunks.
Request:
GET /node/{hash}?inline_under={bytes}&leaf_chunk={bytes}
The leaf_chunk parameter specifies the maximum chunk size for coalesced leaves. When the server encounters a subtree whose leaves are uniform small scalars (e.g., chars or bytes), it may respond with:
CoalescedResponse {
kind: "coalesced",
chunks: [
{ offset: 0, data: bytes[...] }, // raw leaf data, concatenated
{ offset: 1024, data: bytes[...] }
],
measure: Measure
}
Each chunk contains the raw scalar data for a contiguous range of leaves, without per-element framing. The receiver can reconstruct individual scalars (and their hashes) from the raw bytes, or hold the chunk as-is for sequential access.
Example: A 1024-byte blob stored as 1024 individual [:u8 b] scalars in a finger tree. Without coalescing, fetching the blob requires retrieving ~32 internal nodes plus 1024 scalar entries (though only 256 are unique). With a leaf_chunk=1024 parameter, the server sends a single 1024-byte chunk — one response, one packet.
The chunk size is client-controlled. Constrained devices may request small chunks (512 bytes); high-bandwidth clients may accept 64 KB or more. The tree structure provides natural split points at any node boundary.
Leaf coalescing is an optimization hint — servers may ignore it and respond with structural or inline modes. Clients must handle all response modes.
Immutable content-addressed data is ideal for caching:
Since all data is immutable and content-addressed, stores operate as caches at every layer. There is no cache invalidation — a hash always maps to the same value. Retention is purely an eviction policy concern.
Store layers form a hierarchy:
Local memory cache → Local disk store → Peer stores → Origin server
Each layer may independently evict entries using LRU or other policies. A cache miss at any layer triggers a fetch from the next layer upstream. This is safe because values are immutable — a re-fetched value is identical to the evicted one.
Root pinning: The bottom layer (origin store with no upstream) is the authoritative store. At this layer, eviction means data loss. Implementations should support root pinning — marking specific collection root hashes as retained. Pinned roots and all nodes reachable from them are exempt from eviction.
Purging a collection: To remove a specific collection, delete its root entry. The orphaned internal nodes will be evicted naturally as the store adds new data and applies its eviction policy. Nodes shared with other collections are not harmed — they remain reachable from their other roots and will not be evicted while referenced.
For immediate purging (e.g., removing sensitive data), a store may walk the collection’s tree and delete all reachable nodes. Shared nodes will be re-fetched on demand when other collections access them. This is the “nuclear option” — correct but potentially costly for other structures that shared subtrees.
Dacite defines two serialization formats:
Both formats represent the same logical data. The binary format is authoritative for hashing and storage; the JSON format is a convenience layer.
The binary format is designed for compact, deterministic storage and transport.
Every serialized node begins with a 1-byte kind tag:
| Tag | Kind | Description |
|---|---|---|
| 0x00 | Scalar | Raw bytes (atomic value) |
| 0x01 | Seq node | Finger tree internal node |
| 0x02 | Map node | HAMT internal node |
| 0x03 | Collection | Top-level typed collection header |
A scalar is the simplest node: a kind tag, a u8 length, and raw bytes.
scalar = 0x00 ++ u8(len) ++ bytes[len]
Maximum scalar size: 255 bytes. The scalar’s hash is fuse_bytes(bytes) — computed over the raw data bytes only, not the framing.
Size examples:
| Scalar | Serialized size |
|---|---|
| Null (0 bytes) | 2 bytes (tag + u8 zero) |
| Boolean | 3 bytes |
| i64 | 10 bytes |
| i256 | 34 bytes |
| UTF-8 char | 3–6 bytes |
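The scalar framing is simple enough to sketch directly (Python illustration of the tag + u8 length + raw bytes layout):

```python
def encode_scalar(data: bytes) -> bytes:
    """Frame a scalar: 0x00 kind tag, u8 length, raw bytes (max 255)."""
    if len(data) > 255:
        raise ValueError("scalar too large")
    return b"\x00" + bytes([len(data)]) + data

def decode_scalar(frame: bytes) -> bytes:
    """Recover the raw data bytes; the hash is computed over these, not the frame."""
    if frame[0] != 0x00:
        raise ValueError("not a scalar frame")
    n = frame[1]
    return frame[2:2 + n]
```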
Seq nodes represent finger tree internals. Each node carries its structural type, measure, and child references:
seq_node = 0x01
++ u8(node_type) // structural subtype
++ measure // cached measure (48 bytes)
++ u8(n_children) // number of child references (0–32)
++ hash[n_children] // child hashes (32 bytes each)
Seq node subtypes:
| Subtype | Value | Children |
|---|---|---|
| Empty | 0x00 | 0 |
| Single | 0x01 | 1 (the element hash) |
| Digit | 0x02 | 1–32 |
| Internal node | 0x03 | 2–32 |
| Deep | 0x04 | 3 (left, spine, right) |
Size range: 51 bytes (empty: tag + subtype + measure + zero-count child field) to 1,075 bytes (32 children).
Map nodes represent HAMT internals:
map_node = 0x02
++ u8(node_type) // structural subtype
++ measure // cached measure (48 bytes)
++ <type-specific fields>
Map node subtypes:
| Subtype | Value | Fields after measure |
|---|---|---|
| Empty | 0x00 | (none) |
| Entry | 0x01 | hash(key_hash) ++ hash(key_ref) ++ hash(val_ref) |
| Bitmap | 0x02 | u32(bitmap) ++ u8(n_children) ++ hash[n_children] |
For entries, key_hash is the routing hash (used for HAMT navigation), while key_ref and val_ref point to the stored key and value nodes.
Size range: 50 bytes (empty) to 1,079 bytes (32-child bitmap).
Collections are top-level typed values whose structure is stored as a tree of seq or map nodes. A collection header records the collection type, a root hash pointing into the tree, and the total size in bytes:
collection = 0x03
++ u8(collection_type) // which collection
++ hash(root) // root of the internal tree
++ u64(count) // number of elements
++ u64(size_bytes) // total materialized size in bytes
Collection subtypes:
| Subtype | Value | Root points to | Description |
|---|---|---|---|
| Vector | 0x00 | Finger tree root | Ordered collection of arbitrary values |
| String | 0x01 | Finger tree root | Seq of char scalars |
| Blob | 0x02 | Finger tree root | Seq of byte scalars |
| Map | 0x03 | HAMT root | Key-value collection |
The root hash references an existing seq node (for vector/string/blob) or map node (for map) already in the store. The collection header does not duplicate any element references — it is a thin typed wrapper over the tree root.
The count field caches the number of elements in the collection. The size_bytes field caches the total byte size of all leaf scalars. Both are available in O(1) from the root’s measure, and are stored here so the collection header is self-contained — no tree traversal needed for basic queries.
Fixed size: 1 + 1 + 32 + 8 + 8 = 50 bytes for every collection, regardless of element count.
The collection’s hash is computed as:
collection_hash = fuse(fuse_str(type_name ++ 0x00), root.measure.elements_fuse)
This is the standard typed value hash — the collection type name fused with the semantic hash of the underlying tree.
Measures appear inline in seq and map nodes:
measure = u64(count) ++ u64(size_bytes) ++ hash(elements_fuse)
Fixed size: 8 + 8 + 32 = 48 bytes.
Scalar typed values (e.g., ["i64" 42]) are serialized as kind 0x00 (scalar) — the type information lives in the store mapping (hash → typed value), not in the serialized bytes.
Collection typed values (string, vector, blob, map) are serialized as kind 0x03 (collection) — a thin header with the collection subtype, root hash, and cached size.
In both cases, the type name is part of the hash computation (via the null-separated domain prefix), not the serialized payload. The store maps each hash to its full [type_name, data] entry; serialization encodes only the data portion.
Recovering the type: Although the serialized bytes do not carry the type name, the type is always recoverable from the store. A typed value is a 2-element seq; position 0 is the type name (itself a seq of char scalars stored in the content-addressed space). Given a typed value’s hash, fetch the seq, then fetch position 0 to discover the type. No out-of-band schema is needed.
A node’s hash is not computed from its serialized bytes. Hashes are computed from the logical content:
- Scalars: fuse_bytes(data_bytes)
- Internal nodes: fuse(fuse_str(node_type_name ++ 0x00), elements_fuse) (with the HAMT bitmap exception)
- Typed values: fuse_bytes(type_name_bytes ++ 0x00 ++ encode(data))

This means two different serialization formats (or versions) can interoperate as long as they compute the same logical hashes. The serialization is a transport/storage concern; the hash is a semantic concern.
Serialization must nonetheless be deterministic: the same logical value always encodes to the same bytes.
The JSON format provides a human-readable representation of dacite values, intended for debugging, tooling, and interchange with web clients.

Every structural node carries a "kind" field: "scalar", "seq", or "map". Structural JSON mirrors the content-addressed store. Each node is a self-contained object with hash references to children:
Scalar:
{
"kind": "scalar",
"hash": "a1b2c3d4...",
"bytes": "base64-encoded-data",
"size": 8
}
Seq node:
{
"kind": "seq",
"subtype": "deep",
"hash": "e5f6a7b8...",
"measure": {
"count": 1000,
"size_bytes": 8000,
"elements_fuse": "1234abcd..."
},
"children": ["hash1...", "hash2...", "hash3..."]
}
Map node:
{
"kind": "map",
"subtype": "bitmap",
"hash": "c9d0e1f2...",
"measure": {
"count": 5,
"size_bytes": 40,
"elements_fuse": "5678efab..."
},
"bitmap": 671088640,
"children": ["hash1...", "hash2..."]
}
Map entry:
{
"kind": "map",
"subtype": "entry",
"hash": "f3a4b5c6...",
"measure": { "count": 1, "size_bytes": 16, "elements_fuse": "..." },
"key_hash": "routing-hash...",
"key_ref": "stored-key-hash...",
"val_ref": "stored-val-hash..."
}
Materialized JSON inlines values recursively, producing familiar nested structures. This is the format web clients and JSON consumers will typically use:
Typed scalar:
{
"kind": "typed",
"type": "i64",
"value": 42
}
String (typed seq of char scalars):
{
"kind": "typed",
"type": "string",
"value": "hello world"
}
Vector:
{
"kind": "typed",
"type": "vector",
"value": [
{"kind": "typed", "type": "i64", "value": 1},
{"kind": "typed", "type": "string", "value": "two"},
{"kind": "typed", "type": "bool", "value": true}
]
}
Map:
{
"kind": "typed",
"type": "map",
"value": {
"name": {"kind": "typed", "type": "string", "value": "Alice"},
"age": {"kind": "typed", "type": "u8", "value": 30}
}
}
Note: In materialized map JSON, keys are rendered as strings for JSON compatibility. The actual dacite key is a typed value; the string rendering is lossy but practical.
For large trees, a hybrid approach inlines small subtrees and uses hash references for large ones. The inline_under parameter from the Distribution Model controls the threshold:
{
"kind": "seq",
"subtype": "deep",
"hash": "e5f6a7b8...",
"measure": {"count": 1000000, "size_bytes": 8000000, "elements_fuse": "..."},
"children": [
{"kind": "typed", "type": "string", "value": "small inline value"},
{"kind": "ref", "hash": "large-subtree-hash...", "measure": {"count": 500000, "...": "..."}}
]
}
The "ref" kind signals an un-fetched subtree. Clients can request materialization by following the hash.
Formal JSON Schema definitions are provided alongside this spec:
- structural.schema.json — validates structural mode nodes (scalars, seq nodes, map nodes with hash references)
- materialized.schema.json — validates materialized mode values (typed values with inlined data, including hybrid ref nodes)

Notes:

- The hash field is optional in materialized mode (clients may not need it)
- The ref kind in the materialized schema supports hybrid mode (partially materialized trees)
- Custom types are validated as typed_custom with a freeform type string

A value round-tripped through JSON and binary must produce the same hash. The JSON format is purely a presentation concern; the authoritative hash is always computed from the logical content per the binary format rules.
Implementations should provide:
- to_json(hash, store, mode) → JSON string
- from_json(json_string) → store entries + root hash

Each implementation follows the core conventions: the neg sentinel for cofinite sets, elements_fuse measures, and the seq(type_name, data) typed-value encoding.

| Language | Status | Location |
|---|---|---|
| Clojure | Reference impl | impl/clojure/ |
| Node.js | Planned | impl/node/ |
| C++ | Planned | impl/cpp/ |