On February 9th, 2025, during Super Bowl LIX, I ran an experiment. While my Ryzen 5 3600 hummed away running the Truth Engine, I wasn't watching the game. I was watching YouTube, listening to music, and occasionally glancing at terminal output as my system processed 1.5 million AT Protocol events in real-time.
The system was connected to one major relay and about 150 hand-picked PDS endpoints. It was deliberately constrained - a stress test to see if the architecture would hold under high traffic.
In one hour, it verified 3 million events.
And it caught 151 discrepancies.
These were events my system saw directly from PDS endpoints that never appeared in the relay firehose - or at least, not within the 3-second window I was measuring. Posts, likes, blocks - all kinds of social activity that simply didn't make it through.
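For the curious, the cross-check itself doesn't need to be complicated. Here's a hedged sketch of how a window-based comparison like this might look - the class and method names are hypothetical illustrations, not the Truth Engine's actual code:

```python
import time
from collections import OrderedDict

WINDOW_SECS = 3.0  # how long an event may lag on the relay before we flag it

class DiscrepancyTracker:
    """Flag events seen via direct PDS connections that never show up on the relay."""

    def __init__(self, window=WINDOW_SECS):
        self.window = window
        self.pending = OrderedDict()  # cid -> time first seen from a PDS

    def saw_on_pds(self, cid, now=None):
        # Record the first direct sighting; insertion order doubles as time order.
        self.pending.setdefault(cid, now if now is not None else time.monotonic())

    def saw_on_relay(self, cid):
        # The relay caught up: this event is accounted for.
        self.pending.pop(cid, None)

    def expired(self, now=None):
        # Return (and drop) every event whose window has elapsed unmatched.
        now = now if now is not None else time.monotonic()
        missing = []
        while self.pending:
            cid, t0 = next(iter(self.pending.items()))
            if now - t0 < self.window:
                break  # entries are insertion-ordered, so the rest are newer
            self.pending.popitem(last=False)
            missing.append(cid)
        return missing
```

Anything still in `pending` after the window closes is a discrepancy - exactly the 151 events the Super Bowl run surfaced.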
I was surprised. Not because I thought the system was perfect, but because I didn't expect to find anything concrete so quickly.
What I Thought I Knew
Before building the Truth Engine, I assumed the AT Protocol firehose was basically a dumb pipe - it aggregated events from PDS nodes and broadcast them. Simple aggregation, no editorial decisions.
Turns out I was wrong.
The firehose does a lot more than aggregate. It filters. It optimizes. It makes decisions about what to broadcast and when. Sometimes it "squashes" events - if a user blocks and immediately unblocks someone, those events might get collapsed or dropped entirely. I discovered this later through additional testing with a 5-minute observation window, where I found scattered likes, posts, and especially blocks being filtered out.
There are also entire categories of events that never make it through official relays. Take podping events - they're part of the AT Protocol network, but if you're listening to the official Bluesky relay, you'll never see them. The relay explicitly filters them out; the documentation even says so.
Is this a bug? A feature? I genuinely don't know. My best guess is that it's a design decision to support their AppView infrastructure - if events that rapidly cancel each other out create backpressure in their ingest pipeline, squashing them makes sense for stability.
Should anyone care? Probably not, unless you're a weirdo like me who wants to see everything happening on the network.
But it's certainly interesting.
The 206ms Advantage
Here's where it gets technical: by connecting directly to PDS endpoints instead of going through a relay, the Truth Engine sees events 150-250ms faster during high-traffic periods - the 206ms in the heading is one representative gap from that range, measured during the audit.
I have terminal screenshots showing my system "winning" races against the official relay during the Super Bowl audit. When the relay's infrastructure is getting hammered during black swan events, that latency gap widens.
Does any developer actually care about 206 milliseconds? Probably not. Unless you're building real-time audit systems or need to guarantee you're seeing events before they potentially get filtered.
The more important point is this: a single consumer-grade machine can outpace enterprise relay infrastructure by going direct to source.
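Measuring that gap only takes first-seen timestamps from both feeds. A minimal sketch of the race bookkeeping - the names and the "pds"/"relay" source labels are mine, not the project's:

```python
import time
from statistics import median

class LatencyRace:
    """Record which source saw each event first, and by how much."""

    def __init__(self):
        self.first = {}    # cid -> (source, first-seen timestamp)
        self.deltas = []   # positive means the direct PDS feed won the race

    def observe(self, cid, source, now=None):
        now = now if now is not None else time.monotonic()
        if cid not in self.first:
            self.first[cid] = (source, now)
            return
        winner, t0 = self.first[cid]
        if winner == source:
            return  # duplicate from the same source; keep waiting for the other feed
        del self.first[cid]
        delta = now - t0
        self.deltas.append(delta if winner == "pds" else -delta)

    def median_advantage_ms(self):
        return 1000 * median(self.deltas) if self.deltas else 0.0
```

Feed every event's CID from both streams through `observe` and the sign of the median tells you who's ahead on the whole.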
The Anti-Database Approach
I was told repeatedly to "just use Postgres" for identity lookups. I said no.
The AT Protocol has a problem: every event needs its DID (decentralized identifier) resolved to its public key to verify authenticity. Hitting a database for every lookup would be a bottleneck. Pinging the PLC directory in real-time? Impossible at scale.
So I built a 14.7GB memory-mapped file containing every identity in the network.
It just... worked. From the start. Cold lookups take about 250 nanoseconds. After the OS pages the frequently-accessed identities into RAM, average lookup time drops to 69 nanoseconds. At that speed, a single CPU core can perform over 14 million identity checks per second.
The OS handles the paging automatically, loading only what's needed. And because social graphs have high-volume handles that tend to be active repeatedly, the whole identity resolution layer is ungodly fast.
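To make the idea concrete, here's a toy version - fixed-width records in a file, memory-mapped and searched with no database in sight. This is my illustrative layout, not the Truth Engine's actual on-disk format, and it uses a binary search where the real system is evidently doing something closer to constant time:

```python
import hashlib
import mmap

REC = 64  # hypothetical layout: 32-byte DID digest + 32-byte public key per record

def did_key(did: str) -> bytes:
    # Fixed-width sort/search key derived from the DID string.
    return hashlib.sha256(did.encode()).digest()

def build_cache(path, identities):
    """identities: iterable of (did, 32-byte pubkey). Sorted so we can binary-search."""
    recs = sorted(did_key(did) + pk for did, pk in identities)
    with open(path, "wb") as f:
        f.write(b"".join(recs))

class IdentityCache:
    def __init__(self, path):
        self._f = open(path, "rb")  # keep the handle alive for the mapping
        self.mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self.n = len(self.mm) // REC

    def lookup(self, did: str):
        # Binary search over fixed-width records; the OS pages in what's touched.
        key, lo, hi = did_key(did), 0, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            k = self.mm[mid * REC : mid * REC + 32]
            if k == key:
                return self.mm[mid * REC + 32 : (mid + 1) * REC]
            lo, hi = (mid + 1, hi) if k < key else (lo, mid)
        return None
```

The point the toy preserves: once the file is mapped, a lookup is a handful of memory reads, and the hot pages stay resident for free.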
Clustered Vertical Logging
The raw DAG-CBOR frames from the AT Protocol are compressed using zstd with a custom-trained 1MB dictionary. The goal was to create an archival system that could reduce egress costs for people running relays or archives.
The clustering strategy is simple: instead of compressing events in the order they arrive, I group them by DID (user). This deduplication strategy lets the compression dictionary see repeated byte patterns across a user's activity, achieving 56% compression on real-world data - and up to 86% on high-volume periods.
With synthetic worst-case data (40% incompressible cryptographic noise), it still hits 53% - within a few points of the Shannon entropy floor, the theoretical compression limit for that mix.
The more events you cluster together, the better the compression.
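The clustering effect is easy to demonstrate with stdlib tools. The sketch below uses zlib's preset-dictionary support as a stand-in for zstd's trained dictionaries - the real system uses zstd; zlib just keeps the example dependency-free:

```python
import zlib
from collections import defaultdict

def compress_clustered(events, dictionary=b""):
    """events: iterable of (did, payload) pairs.

    Instead of compressing in arrival order, group each user's events
    contiguously so repeated byte patterns sit inside the match window.
    """
    by_did = defaultdict(list)
    for did, payload in events:
        by_did[did].append(payload)
    stream = b"".join(p for did in sorted(by_did) for p in by_did[did])
    if dictionary:
        # Preset dictionary: analogous to zstd's trained dictionary.
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=dictionary)
    else:
        c = zlib.compressobj(9)
    return c.compress(stream) + c.flush()
```

Swap in `zstandard` with a trained 1MB dictionary and the same grouping logic applies unchanged.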
2,700 Simultaneous WebSocket Connections
Most people would say running thousands of persistent WebSocket connections on a single machine is insane.
I just... did it.
The system currently maintains connections to every functional PDS in the AT Protocol network that isn't privately piped or dead - about 2,700 endpoints. Getting past the default 1,024 file descriptor limit required some system tuning, but keeping the connections alive and healthy was surprisingly straightforward.
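For reference, the per-process half of that tuning can be done from inside the process itself. A sketch using Python's resource module (Unix-only; the system-wide ceiling still comes from `ulimit`, sysctl, or your service manager):

```python
import resource

def raise_fd_limit(target=65536):
    """Raise this process's soft RLIMIT_NOFILE toward `target`, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    ceiling = target if hard == resource.RLIM_INFINITY else min(target, hard)
    new_soft = max(soft, ceiling)  # never lower an already-generous soft limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

An unprivileged process can raise its soft limit up to the hard limit on its own; going past the hard limit is where the actual system tuning comes in.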
Nobody ever told me it was impossible, so I didn't know I shouldn't try.
The result: a sovereign mesh aggregator that bypasses centralized relays entirely. You become your own relay, pulling directly from the source.
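The supervision pattern behind a mesh like this is simple: one task per endpoint, each reconnecting on failure. A sketch in asyncio, with the actual connect/handle logic injected - these function names are my own, not the project's:

```python
import asyncio

async def maintain(endpoint, connect, handle, stop, backoff=1.0):
    """Keep one endpoint connected until `stop` is set, reconnecting on errors."""
    while not stop.is_set():
        try:
            conn = await connect(endpoint)   # e.g. open a WebSocket subscription
            await handle(endpoint, conn)     # consume frames until it drops
        except Exception:
            await asyncio.sleep(backoff)     # brief pause, then dial again

async def run_mesh(endpoints, connect, handle, stop):
    # One lightweight task per endpoint; the event loop multiplexes them
    # all on a single thread, so 2,700 connections is just 2,700 tasks.
    await asyncio.gather(*(maintain(e, connect, handle, stop) for e in endpoints))
```

With real WebSockets you'd plug a library client into `connect` and a frame loop into `handle`; the supervisor doesn't change.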
The Docker Cage Test
Want to know if this actually works on modest hardware?
I caged the verification logic in a Docker container with a 1-core CPU limit and 2GB RAM cap. That's a 7.3x memory overcommit against the 14.7GB identity cache, relying entirely on the OS to page what's needed.
It verified every event from the relay in real-time without lag or backpressure buildup.
My actual hardware is nothing special: Ryzen 5 3600, 32GB DDR4-3200 RAM, built for $500-600. I have a Sapphire Nitro+ RX 6700 XT, but it sits idle - everything happens on the CPU.
Running at full blast with all 2,700 connections active, the system uses maybe 30% of available CPU. Put a beefier processor behind it and it would laugh at black swan event traffic.
The Human Cost
I need to be honest here: I wrote a whole postmortem about burnout while building this.
Not because the work was technically overwhelming - honestly, most of it felt too easy. The mmap file worked on the first try. The compression strategy hit target numbers immediately. The WebSocket mesh scaled without drama.
That was the problem.
I kept pushing, thinking "this should be harder," waiting for the moment where I'd hit something truly impossible. The burnout wasn't from difficulty - it was from constantly judging myself, feeling like if something didn't break me, it wasn't impressive enough.
Hindsight: I was just working too much and beating myself up for no reason.
Why did I build this for free? Honestly, I wanted to prove I could do something hard. And maybe - I'm being real here - I was desperate to build something that might change my trajectory in life.
Turns out, I did this more for me than anyone else. I thought I needed external validation. I just needed to quit being so harsh on myself.
What "Truth" Actually Means
The Truth Engine's tagline is "Proving the truth of the global firehose."
That's not marketing. It's literal.
Truth means: the firehose is broadcasting what it's supposed to broadcast. And as it turns out, if you want to see everything happening on the AT Protocol network, you won't get it from the major relays.
Every archived segment contains a Blake3 Merkle root. This transforms a simple database into a verifiable ledger - clients can verify the archive's integrity without trusting the archivist.
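A Merkle root over a segment takes only a few lines. This sketch uses BLAKE2 from Python's hashlib as a stand-in, since Blake3 isn't in the standard library - the Truth Engine itself uses Blake3:

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash; swap in blake3 for the real thing.
    return hashlib.blake2b(data, digest_size=32).digest()

def merkle_root(leaves):
    """Binary Merkle tree over event bytes; one 32-byte root commits to a whole segment."""
    if not leaves:
        return h(b"")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0]
```

Change any single event in the segment and the root changes, which is what lets a client check an archive without trusting whoever produced it.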
This matters in a decentralized protocol. If we're building systems where users own their data and identity, we need ways to verify that infrastructure isn't silently filtering or dropping events.
The Truth Engine is the cheapest, most efficient way to broadcast the global truth of the entire active network.
What's Next
The code is open source under MIT license: github.com/ybzeek/sovereign-truth-engine
Running it yourself requires building the 14.7GB identity cache (20+ hour process) and discovering active PDS endpoints (also time-consuming). But all the tools are there. If someone wants to do it, they'll do it. It's not hard.
To be perfectly honest, I don't care what you do with this. Fork it, break it, contribute to it, fund it, ignore it - whatever. In the long run, I built this for myself more than anyone else.
But if this project disappeared tomorrow, the AT Protocol ecosystem would lose something specific: proof that consumer hardware can verify the entire network independently.
No enterprise infrastructure required. No trust in centralized relays. Just a $500 PC and the willingness to connect directly to source.
Support & Contribute
The project is live on GitHub: sovereign-truth-engine
If you believe decentralized protocols need independent verification:
⭐ Star the repo
💰 GitHub Sponsors coming soon (verification pending)
🤝 Contribute code or documentation
📢 Share with infrastructure developers
The code is MIT licensed. The truth should be free.
Built on a Ryzen 5 3600 while listening to YouTube.