Every blockchain company has the same problem: data grows forever, nobody deletes anything, and you still need to query it.
Ethereum alone has produced 2.5 billion transactions. Stored as standard JSON exports, that's roughly 4 terabytes. Add Solana's 200 billion transactions and the other top 20 chains, and you're looking at 192 TB of archival transaction data — growing every 12 seconds.
Most teams deal with this by throwing money at storage. Or they prune history and lose the ability to answer questions about it.
We built something different.
The Result
Tested on 428 MB of real Ethereum mainnet data (275,568 transactions across 1,000 consecutive blocks, downloaded live from a public node):
| Approach | Size | % of Original |
|---|---|---|
| Raw transaction export (JSONL) | 428 MB | 100% |
| Best general-purpose compressor (xz -9) | 57 MB | 13.3% |
| Our system | 21 MB | 4.9% |
95.1% compression. 63% smaller than xz -9. Fully queryable without decompressing.
This is not a theoretical projection. This is a real file on disk, containing real Ethereum transactions from blocks 24,708,171 through 24,709,170, queryable right now.
What "Queryable" Means
The compressed archive is not a .gz file you need to decompress before using. It's a structured store that supports:
- Block lookup: "Give me all transactions in block 24,709,000" — returns 271 transactions, instantly
- Address search: "Does this address appear anywhere in the dataset?" — answered in under a millisecond, without touching the transaction data
- Transaction counting: "How many transactions did the USDT contract receive?" — scans the data, returns 2,862 in a few milliseconds
You can integrate this into any application that speaks SQL, and virtually every programming language has a SQL client.
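The article doesn't show the archive's actual interface, so here is an illustrative sketch of the three query shapes above. It uses an in-memory SQLite table as a stand-in; the table name `transactions` and its columns are hypothetical, and the sample rows are invented:

```python
import sqlite3

# Hypothetical schema standing in for the compressed archive's SQL surface.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE transactions (
    block_number INTEGER, tx_hash TEXT, from_addr TEXT, to_addr TEXT)""")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(24_709_000, "0xaaa", "0x111", "0x222"),
     (24_709_000, "0xbbb", "0x333", "0xdac17f958d2ee523a2206206994597c13d831ec7"),
     (24_709_001, "0xccc", "0x111", "0xdac17f958d2ee523a2206206994597c13d831ec7")])

# Block lookup: all transactions in one block.
rows = con.execute(
    "SELECT tx_hash FROM transactions WHERE block_number = ?",
    (24_709_000,)).fetchall()

# Address search: does an address appear anywhere in the dataset?
seen = con.execute(
    "SELECT EXISTS(SELECT 1 FROM transactions "
    "WHERE from_addr = ? OR to_addr = ?)",
    ("0x111", "0x111")).fetchone()[0]

# Transaction count for a contract (USDT's mainnet address as the example).
usdt = "0xdac17f958d2ee523a2206206994597c13d831ec7"
n = con.execute(
    "SELECT COUNT(*) FROM transactions WHERE to_addr = ?",
    (usdt,)).fetchone()[0]

print(len(rows), seen, n)  # 2 1 2
```

The point is only the shape of the queries; against the real archive, the same statements would run over the compressed store rather than a plain table.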
At Ethereum Scale
As datasets grow, the compression ratio improves.
| Dataset Size | Transactions | Raw | Compressed | Savings |
|---|---|---|---|---|
| 100 blocks (20 min) | 28,362 | 40 MB | 2.7 MB | 93.2% |
| 1,000 blocks (3.3 hrs) | 275,568 | 428 MB | 21 MB | 95.1% |
| 1 day (projected) | 2M | 2.9 GB | 161 MB | 94.3% |
| 1 year (projected) | 734M | 1.0 TB | 53 GB | 94.8% |
| Full chain (projected) | 2.5B | 3.9 TB | 192 GB | 95.1% |
The 100-block and 1,000-block rows are measured. The rest are projected using the measured per-transaction cost of 76.6 compressed bytes.
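The projection method described above is simple enough to check by hand: scale the measured per-transaction compressed cost to larger transaction counts. This sketch uses only the article's own figures (76.6 bytes per transaction, 2.5 billion transactions for the full chain):

```python
# Measured cost from the 1,000-block run: 21 MB / 275,568 transactions.
BYTES_PER_TX = 76.6

def projected_compressed_gb(transactions: int) -> float:
    """Estimated compressed archive size in GB (decimal) for a tx count."""
    return transactions * BYTES_PER_TX / 1e9

full_chain = projected_compressed_gb(2_500_000_000)
print(f"Full chain: {full_chain:.1f} GB")  # ~191.5 GB, i.e. the table's ~192 GB
```

The intermediate rows won't land exactly on this line, since the table's figures are rounded independently, but the full-chain projection reproduces the ~192 GB estimate.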
Across All Major Chains
The same approach works on any blockchain that produces structured transaction data. Here's what it looks like across the top 20 chains:
| Chain | Historical Txs | Raw Archive | Compressed | Savings |
|---|---|---|---|---|
| Ethereum | 2.5B | 3.9 TB | 192 GB | 95.1% |
| Solana | 200B | 160 TB | 15.3 TB | 90.5% |
| TRON | 8B | 8.0 TB | 666 GB | 91.7% |
| BNB Chain | 5B | 6.0 TB | 426 GB | 92.9% |
| Polygon | 4B | 4.4 TB | 339 GB | 92.3% |
| Arbitrum | 1.5B | 2.1 TB | 123 GB | 94.1% |
| Bitcoin | 1B | 1.8 TB | 89 GB | 95.1% |
| Base | 1B | 1.2 TB | 85 GB | 92.9% |
| Optimism | 0.8B | 1.0 TB | 67 GB | 93.5% |
| Other 11 chains | — | 5.8 TB | 468 GB | ~92% |
| Total | — | 192 TB | 17.5 TB | 90.9% |
A company indexing all 20 chains goes from 192 TB to 17.5 TB — while keeping every transaction queryable.
Who This Is For
Blockchain Analytics Companies
Nansen, Dune, Chainalysis, and similar companies index dozens of chains and store years of historical data for customer queries. A 5-chain analytics platform archiving Ethereum, BNB, Polygon, Arbitrum, and Base would reduce storage from 17.6 TB to 1.7 TB — saving roughly $4,000/year on S3 alone. At 20 chains with 3 years of growth factored in, that's $72,000/year.
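A back-of-envelope check of the ~$4,000/year figure, assuming S3 Standard pricing of roughly $0.023 per GB-month (an assumption; actual pricing varies by region and storage class):

```python
# Assumed S3 Standard rate; check current pricing for your region.
S3_PER_GB_MONTH = 0.023

saved_tb = 17.6 - 1.7           # storage no longer needed, from the text
saved_gb = saved_tb * 1000      # decimal TB, matching the article's units
annual = saved_gb * S3_PER_GB_MONTH * 12
print(f"${annual:,.0f}/year")   # ~ $4,388/year
```

That lands a little above the quoted $4,000/year, which is consistent with the article rounding down.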
Node Operators and RPC Providers
Running an Ethereum archive node requires storing the full transaction history. This system could serve as a compressed archival tier behind the live node — keeping historical data queryable at a fraction of the storage cost. A single ETH node saves ~$1,000/year; a multi-chain RPC provider saves considerably more.
Exchanges and Custodians
Regulatory requirements mandate keeping complete transaction records for 5-7 years. This system keeps those records compliant (queryable, auditable, lossless) while cutting storage costs by 90-95%.
L2 and Rollup Teams
Every Layer 2 needs to store its own transaction history plus references to L1. The EVM-compatible L2s (Arbitrum, Optimism, Base, zkSync, Scroll) show 93-94% compression — among the highest ratios in the table, because their transaction formats are closest to Ethereum's.
Compared to What Exists
| Approach | Compression | Queryable | Random Access |
|---|---|---|---|
| Raw JSONL | 0% | Scan only | No |
| gzip / zstd (generic) | 70-75% | No | No |
| xz -9 (best generic) | 87% | No | No |
| Parquet + zstd | 70-80% | Via Spark/DuckDB | Column-level |
| This system | 95.1% | Yes (SQL compatible) | Block-level |
Generic compressors don't allow you to query the data while it's compressed, and our output is 63% smaller than xz -9's.
The Numbers Are Real
Everything reported here was measured on actual Ethereum mainnet data downloaded from a public RPC endpoint during live operation. No synthetic data, no cherry-picked blocks, no theoretical estimates presented as measurements.
The 1,000-block test dataset is available for independent verification.
Built with Smallest.zip — Lossless. Queryable. 95% smaller.