95% Smaller, Still Queryable: A New Way to Store Blockchain Data

We compressed 428 MB of real Ethereum mainnet transactions to 21 MB: 95.1% smaller than the raw export, and 63% smaller than the output of xz -9. The data remains queryable without decompressing. Here's what that looks like across 20 chains.

Every blockchain company has the same problem: data grows forever, nobody deletes anything, and you still need to query it.

Ethereum alone has produced 2.5 billion transactions. Stored as standard JSON exports, that's roughly 4 terabytes. Add Solana's 200 billion transactions and the rest of the top 20 chains, and you're looking at 192 TB of archival transaction data, with more arriving in every new block (every 12 seconds on Ethereum).

Most teams deal with this by throwing money at storage. Or they prune history and lose the ability to answer questions about it.

We built something different.


The Result

Tested on 428 MB of real Ethereum mainnet data (275,568 transactions across 1,000 consecutive blocks, downloaded live from a public node):

Format                                  | Size   | % of Original
----------------------------------------|--------|--------------
Raw transaction export (JSONL)          | 428 MB | 100%
Best general-purpose compressor (xz -9) | 57 MB  | 13.3%
Our system                              | 21 MB  | 4.9%

95.1% compression. 63% smaller than xz -9. Fully queryable without decompressing.
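
The headline percentages follow directly from the three sizes in the table. A minimal sketch of the arithmetic (the function name is ours, chosen for illustration):

```python
# Sizes in MB, taken from the results table above.
RAW_MB = 428
XZ_MB = 57
OURS_MB = 21


def savings_pct(original: float, compressed: float) -> float:
    """Percentage of the original size that compression removed."""
    return (1 - compressed / original) * 100


print(f"xz -9 vs raw:       {savings_pct(RAW_MB, XZ_MB):.1f}% smaller")    # 86.7%
print(f"this system vs raw: {savings_pct(RAW_MB, OURS_MB):.1f}% smaller")  # 95.1%
print(f"this system vs xz:  {savings_pct(XZ_MB, OURS_MB):.1f}% smaller")   # 63.2%
```

Note that "63% smaller than xz -9" compares the two compressed files with each other (21 MB against 57 MB), not against the raw export.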

This is not a theoretical projection. This is a real file on disk, containing real Ethereum transactions from blocks 24,708,171 through 24,709,170, queryable right now.


What "Queryable" Means

The compressed archive is not a .gz file you need to decompress before using. It's a structured store that supports:

  • Block lookup: "Give me all transactions in block 24,709,000" returns 271 transactions, instantly.
  • Address search: "Does this address appear anywhere in the dataset?" is answered in under a millisecond, without touching the transaction data.
  • Transaction counting: "How many transactions did the USDT contract receive?" scans the data and returns 2,862 in a few milliseconds.

You can query it from any application that speaks SQL, and practically every programming language has a SQL client.
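
To make the three query types concrete, here is a rough sketch of what they look like as SQL. This post does not describe the archive's actual interface or schema, so the sketch uses an in-memory SQLite table as a stand-in; the table name, column names, and sample rows are all hypothetical. It illustrates the shape of the queries only, not how the real system answers them or how fast.

```python
import sqlite3

# Stand-in for the archive: an in-memory SQLite table.
# Table name, column names, and all rows are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "block_number INTEGER, tx_hash TEXT, from_addr TEXT, to_addr TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (24709000, "0xtx1", "0xalice", "0xtoken"),
        (24709000, "0xtx2", "0xbob", "0xtoken"),
        (24709001, "0xtx3", "0xalice", "0xcarol"),
    ],
)


def txs_in_block(block_number: int) -> list[str]:
    """Block lookup: hashes of all transactions in one block."""
    cur = conn.execute(
        "SELECT tx_hash FROM transactions WHERE block_number = ? ORDER BY tx_hash",
        (block_number,),
    )
    return [row[0] for row in cur]


def address_appears(address: str) -> bool:
    """Address search: does the address occur as sender or recipient?"""
    cur = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM transactions "
        "WHERE from_addr = ? OR to_addr = ?)",
        (address, address),
    )
    return bool(cur.fetchone()[0])


def count_received(address: str) -> int:
    """Transaction counting: how many transactions were sent to the address?"""
    cur = conn.execute(
        "SELECT COUNT(*) FROM transactions WHERE to_addr = ?", (address,)
    )
    return cur.fetchone()[0]


print(txs_in_block(24709000))      # ['0xtx1', '0xtx2']
print(address_appears("0xcarol"))  # True
print(count_received("0xtoken"))   # 2
```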


At Ethereum Scale

Compression improves as the dataset grows: savings rose from 93.2% at 100 blocks to 95.1% at 1,000 blocks.

Dataset Size           | Transactions | Raw    | Compressed | Savings
-----------------------|--------------|--------|------------|--------
100 blocks (20 min)    | 28,362       | 40 MB  | 2.7 MB     | 93.2%
1,000 blocks (3.3 hrs) | 275,568      | 428 MB | 21 MB      | 95.1%
1 day (projected)      | 2M           | 2.9 GB | 161 MB     | 94.3%
1 year (projected)     | 734M         | 1.0 TB | 53 GB      | 94.8%
Full chain (projected) | 2.5B         | 3.9 TB | 192 GB     | 95.1%

The 100-block and 1,000-block rows are measured. The rest are projected using the measured per-transaction cost of 76.6 compressed bytes.
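
The projection itself is a single multiplication. A minimal sketch, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the post does not state explicitly:

```python
# Measured compressed cost per transaction, as quoted above.
BYTES_PER_TX = 76.6


def projected_compressed_bytes(tx_count: float) -> float:
    """Project compressed size by scaling the measured per-transaction cost."""
    return tx_count * BYTES_PER_TX


# Sanity check against the measured 1,000-block dataset (275,568 transactions).
measured = projected_compressed_bytes(275_568)
print(f"1,000 blocks: {measured / 1e6:.1f} MB")  # 21.1 MB

# Full-chain projection for 2.5 billion transactions.
full_chain = projected_compressed_bytes(2.5e9)
print(f"Full chain:   {full_chain / 1e9:.1f} GB")  # 191.5 GB
```

The full-chain result rounds to the 192 GB shown in the table. The projection assumes the per-transaction cost stays constant at scale.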


Across All Major Chains

The same approach works on any blockchain that produces structured transaction data. Here's what it looks like across the top 20 chains:

Chain           | Historical Txs | Raw Archive | Compressed | Savings
----------------|----------------|-------------|------------|--------
Ethereum        | 2.5B           | 3.9 TB      | 192 GB     | 95.1%
Solana          | 200B           | 160 TB      | 15.3 TB    | 90.5%
TRON            | 8B             | 8.0 TB      | 666 GB     | 91.7%
BNB Chain       | 5B             | 6.0 TB      | 426 GB     | 92.9%
Polygon         | 4B             | 4.4 TB      | 339 GB     | 92.3%
Arbitrum        | 1.5B           | 2.1 TB      | 123 GB     | 94.1%
Bitcoin         | 1B             | 1.8 TB      | 89 GB      | 95.1%
Base            | 1B             | 1.2 TB      | 85 GB      | 92.9%
Optimism        | 0.8B           | 1.0 TB      | 67 GB      | 93.5%
Other 11 chains |                | 5.8 TB      | 468 GB     | ~92%
Total           |                | 192 TB      | 17.5 TB    | 90.9%

A company indexing all 20 chains goes from 192 TB to 17.5 TB — while keeping every transaction queryable.


Who This Is For

Blockchain Analytics Companies

Nansen, Dune, Chainalysis, and similar companies index dozens of chains and store years of historical data for customer queries. A 5-chain analytics platform archiving Ethereum, BNB Chain, Polygon, Arbitrum, and Base would reduce storage from 17.6 TB to about 1.2 TB (the sum of the per-chain figures above), saving roughly $4,000/year on S3 alone. At 20 chains with 3 years of growth factored in, that's $72,000/year.
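
Storage-cost figures like these can be sanity-checked with simple arithmetic. The sketch below assumes an S3 Standard list price of about $0.023 per GB-month and 1 TB = 1,000 GB; neither assumption comes from this post, and real pricing varies by region, tier, and volume. It models today's 20-chain totals only, with no growth.

```python
# Assumed S3 Standard list price in USD per GB-month; not from this post.
S3_USD_PER_GB_MONTH = 0.023


def annual_storage_cost_usd(
    size_tb: float, usd_per_gb_month: float = S3_USD_PER_GB_MONTH
) -> float:
    """Yearly cost of keeping size_tb terabytes in object storage."""
    return size_tb * 1000 * usd_per_gb_month * 12


def annual_savings_usd(raw_tb: float, compressed_tb: float) -> float:
    """Yearly cost difference between the raw and compressed archives."""
    return annual_storage_cost_usd(raw_tb) - annual_storage_cost_usd(compressed_tb)


# 20-chain totals from the table above: 192 TB raw, 17.5 TB compressed.
print(f"${annual_savings_usd(192, 17.5):,.0f} per year")  # about $48,000
```

The $72,000/year figure above also factors in three years of growth, which this sketch does not model.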

Node Operators and RPC Providers

Running an Ethereum archive node requires storing the full transaction history. This system could serve as a compressed archival tier behind the live node — keeping historical data queryable at a fraction of the storage cost. A single ETH node saves ~$1,000/year; a multi-chain RPC provider saves considerably more.

Exchanges and Custodians

Regulations in many jurisdictions require exchanges and custodians to keep complete transaction records for 5-7 years. This system keeps those records compliant (queryable, auditable, lossless) while cutting storage costs by 90-95%.

L2 and Rollup Teams

Every Layer 2 needs to store its own transaction history plus references to L1. The EVM-compatible L2s (Arbitrum, Optimism, Base, zkSync, Scroll) show 92-95% compression, among the highest ratios of the chains listed above, because their transaction formats are closest to Ethereum's.


Compared to What Exists

Approach              | Compression | Queryable            | Random Access
----------------------|-------------|----------------------|--------------
Raw JSONL             | 0%          | Scan only            | No
gzip / zstd (generic) | 70-75%      | No                   | No
xz -9 (best generic)  | 87%         | No                   | No
Parquet + zstd        | 70-80%      | Via Spark/DuckDB     | Column-level
This system           | 95.1%       | Yes (SQL compatible) | Block-level

Generic compressors don't let you query the data while it's compressed, and this system's output is 63% smaller than what xz -9 produces.
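
If you want to reproduce the generic-compressor baseline on your own export, Python's standard lzma module can approximate xz -9 by using preset 9 with the .xz container. This is a sketch of one way to measure it, not the benchmark harness used for the numbers above. The sample records are synthetic placeholders that only make the snippet runnable, so their ratio says nothing about real transaction data.

```python
import json
import lzma


def xz9_savings_pct(data: bytes) -> float:
    """Percent size reduction from xz at preset 9 (the level `xz -9` selects)."""
    compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=9)
    return (1 - len(compressed) / len(data)) * 100


def measure_file(path: str) -> float:
    """Measure the baseline on a JSONL export; reads the whole file into memory."""
    with open(path, "rb") as f:
        return xz9_savings_pct(f.read())


# Synthetic placeholder records, used only so the sketch runs end to end.
sample = "\n".join(
    json.dumps({"blockNumber": 24708171 + i // 4, "transactionIndex": i % 4})
    for i in range(2000)
).encode()

print(f"xz preset 9 saved {xz9_savings_pct(sample):.1f}% on the placeholder sample")
```

On a real export you would call measure_file with the path to your JSONL file and compare the result against the 87% row in the table.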


The Numbers Are Real

Every figure reported here as measured comes from actual Ethereum mainnet data downloaded from a public RPC endpoint during live operation. No synthetic data, no cherry-picked blocks, no theoretical estimates presented as measurements.

The 1,000-block test dataset is available for independent verification.


Built with Smallest.zip — Lossless. Queryable. 95% smaller.