If you’ve ever tapped into the atproto firehose (subscribeRepos), you know it’s less of a stream and more of a digital tsunami. Every "Like," "Post," and "Follow" on the network is broadcast as a pair of concatenated DAG-CBOR objects.

The challenge? While DAG-CBOR is more compact than JSON, it is still self-describing. This means every single 200-byte "Like" event carries the weight of its own schema: strings like app.bsky.feed.like, did:plc:, and rev are repeated millions of times a day.

In this post, we’re going to look at how to use Zstandard (Zstd) dictionaries in Rust to achieve compression ratios of ~80% or better, in the best case turning a 40 Mbps firehose into a 4 Mbps trickle.


The "Small Data" Problem

Traditional compression (like Gzip or standard Zstd) relies on finding patterns within a single block of data. If you’re compressing a 50MB log file, the algorithm has plenty of time to learn that the word commit appears frequently.

However, firehose frames are tiny. By the time a standard compressor realizes app.bsky.feed.post is a repeating pattern, the frame is already over. The compressor starts every frame with "amnesia."

The Solution: Pre-trained DNA

A Zstd Dictionary allows us to provide the compressor with "prior knowledge." We train the dictionary on thousands of real atproto frames so that it already knows the common strings, Lexicon paths, and MST (Merkle Search Tree) structures.

When the compressor sees app.bsky.feed.post, it doesn't store the string; it stores a tiny bit-pointer to the dictionary.


Phase 1: Harvesting Samples

To train a dictionary, you need a representative "corpus." For atproto, this means capturing the raw bytes of the WebSocket frames—specifically the concatenated Header + Body.

Using the zstd crate in Rust, we first need to collect around 10,000 to 20,000 samples (roughly 20MB of raw data).
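The harvesting loop itself can be very simple. A sketch (here `next_frame` is a stand-in for your WebSocket read loop, which is not shown):

```rust
/// Collect raw frames until we have enough bytes to train on.
/// `next_frame` stands in for your WebSocket read loop and yields
/// one concatenated Header + Body frame per call, or None when done.
fn collect_samples(
    mut next_frame: impl FnMut() -> Option<Vec<u8>>,
    target_bytes: usize,
) -> Vec<Vec<u8>> {
    let mut samples = Vec::new();
    let mut total = 0usize;
    while total < target_bytes {
        match next_frame() {
            Some(frame) => {
                total += frame.len();
                samples.push(frame);
            }
            None => break,
        }
    }
    samples
}
```

Aim for the ~20 MB / 10,000–20,000 frame corpus mentioned above; too few samples and the trainer can't find stable patterns.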

Rust

use zstd::dict::from_samples;
use std::fs::File;
use std::io::Write;

/// Takes a collection of raw DAG-CBOR frames and bakes a dictionary.
fn bake_atproto_dictionary(samples: Vec<Vec<u8>>) -> Result<(), Box<dyn std::error::Error>> {
    // 110 KiB matches zstd's default maximum dictionary size (112,640 bytes).
    let dict_size = 110 * 1024;
    
    println!("Analyzing {} samples for common patterns...", samples.len());
    
    // This runs the 'Cover' algorithm to find the most frequent sequences.
    let dict_bytes = from_samples(&samples, dict_size)?;

    let mut file = File::create("atproto_v1.zdict")?;
    file.write_all(&dict_bytes)?;
    
    Ok(())
}

Phase 2: The Zero-Copy Bulk Pipeline

In a high-velocity environment, you cannot afford to reload the dictionary or re-allocate compression contexts for every message. This is where Rust’s type system and the zstd::bulk module shine.

We use an EncoderDictionary to "prepare" the dictionary in memory. This pre-computes the internal tables so that compression is nearly instantaneous.

Rust

use zstd::bulk::Compressor;
use zstd::dict::EncoderDictionary;
use std::fs;

pub struct FirehoseOptimizer<'a> {
    // We store the 'prepared' dictionary for reuse across threads.
    dict: EncoderDictionary<'a>,
}

impl<'a> FirehoseOptimizer<'a> {
    pub fn new(dict_path: &str) -> Self {
        let raw_dict = fs::read(dict_path).expect("Dict not found");
        // Prepare at compression level 3 (the best speed/ratio balance for firehoses).
        let dict = EncoderDictionary::copy(&raw_dict, 3);
        Self { dict }
    }

    pub fn compress(&self, frame: &[u8]) -> Vec<u8> {
        // 'with_prepared_dictionary' avoids the heavy lifting of context setup.
        let mut compressor = Compressor::with_prepared_dictionary(&self.dict)
            .expect("Failed to init compressor");
        
        compressor.compress(frame).expect("Compression failed")
    }
}

Why this is a Game Changer

When we apply this to the atproto firehose, the numbers are staggering. Because the "schema" of the message is moved into the dictionary, the compressed output contains almost nothing but the actual unique data (the specific CID or the text of a post).

The Math of the Firehose

If S_raw is the size of a raw "Like" event (~1.2 KB including all signatures and paths) and S_dict is the compressed size using a dictionary, the per-frame bandwidth saving is:

    Savings = 1 − (S_dict / S_raw)

In some highly optimized cases, "Like" records can drop to as low as 17–20 bytes: with S_raw ≈ 1.2 KB, that's a saving of over 98% on those frames.


Operational Lessons Learned

  • Dictionary IDs: Zstd embeds a Dict_ID in the header. If you update your dictionary but the consumer is using an old one, decompression will fail with a "Dictionary Mismatch" error. Always version your dictionaries (e.g., atproto_v1.zdict).

  • The "Cold Start" Problem: If you are building a Relay, you should offer the dictionary as a static download. New clients can't decode the stream without it.

  • Rust Performance: Avoid the zstd::stream API for this specific use case. The firehose consists of discrete messages; zstd::bulk is specifically designed for "one-shot" compression of many small blocks, significantly reducing memory thrashing.


Conclusion

By training Zstd on the "DNA" of the AT Protocol, we move the cost of the schema from the network wire to the local CPU memory. For developers building mirrors, indexers, or big-data archives of the Bluesky ecosystem, this isn't just an optimization—it's a great way to scale sustainably.

Ready to try it?

Zstd has compression levels from 1 to 22, with a default of 3. For long-term cold storage, anything above 10–12 gives diminishing returns.

Dictionary size impacts compression ratios; in my testing, 1 MB was the sweet spot.

Compression ratios vary due to traffic patterns. YMMV.

Deduplication strategies and clustering are where you'll start to see ~80% compression.

Clustered vertical logging strategies required a custom DAG-CBOR slicer, as well as other optimizations, to keep up with heavy traffic or backfilling.

Index files for searchability and decompression become mandatory if you want to search the data without decompressing whole archive files.
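A minimal frame index is just a seq-sorted list of (sequence, offset, length) records kept alongside each archive file. The names below are illustrative:

```rust
/// One entry per compressed frame in an archive file.
#[derive(Debug, Clone, Copy)]
pub struct FrameIndexEntry {
    pub seq: u64,    // firehose sequence number (the search key)
    pub offset: u64, // byte offset of the frame within the archive
    pub len: u32,    // compressed length in bytes
}

/// Binary-search a seq-sorted index; returns the byte range to read
/// and decompress, without touching the rest of the archive.
pub fn locate(index: &[FrameIndexEntry], seq: u64) -> Option<(u64, u32)> {
    index
        .binary_search_by_key(&seq, |e| e.seq)
        .ok()
        .map(|i| (index[i].offset, index[i].len))
}
```

With this, answering "give me event #N" is one index lookup plus one small ranged read, instead of a full-archive decompression.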

Still have questions? Contact me on Bluesky: