Compressing Neural Network Weights: 40% Smaller Safetensors

We achieved 40% compression on BF16 safetensor model weights - cutting egress costs for terabyte-scale models by thousands of dollars per month.

The Problem: Model Weights Are Expensive to Move

Large language models ship as massive safetensor files. A single frontier model can weigh in at 1 TB or more once you count all the shards. Every time you deploy to a new region, spin up an inference node, or distribute weights to edge servers, you're paying egress.

Here's what moving a 1 TB model looks like on major cloud providers:

| Provider | Egress Rate | Cost per Transfer | 10 Transfers/Month |
|---|---|---|---|
| AWS S3 | $0.09/GB | $90 | $900 |
| Google Cloud | $0.12/GB | $120 | $1,200 |
| Azure | $0.087/GB | $87 | $870 |
| Cloudflare R2 | Free egress | $0 | $0 |

For teams deploying frequently — retraining cycles, multi-region inference, CI/CD pipelines pushing updated weights — egress alone can run $10,000–$15,000/year for a single model on AWS or GCP. And that's before storage costs.

Compress those weights by 40% and you're transferring 600 GB instead of 1 TB. That's $36 saved per transfer on AWS, or $4,320/year at 10 transfers per month.
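The savings are simple arithmetic; here's a small sketch using the rates and sizes from the table above (illustrative only, not a pricing API):

```python
def egress_savings(size_gb, rate_per_gb, compression_ratio, transfers_per_month=10):
    """Annual egress savings from shipping compressed weights.

    compression_ratio is the fraction of the original size removed
    (0.40 means the compressed file is 60% of the original size).
    """
    saved_gb = size_gb * compression_ratio
    per_transfer = saved_gb * rate_per_gb
    annual = per_transfer * transfers_per_month * 12
    return per_transfer, annual

# 1 TB model on AWS S3 ($0.09/GB) at 40% compression:
# roughly $36 per transfer, $4,320 per year
per_transfer, annual = egress_savings(1000, 0.09, 0.40)
```

Swapping in Google Cloud's $0.12/GB rate or a different transfer cadence is a one-argument change.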

BF16 Weights: Below 60% of Original Size

BF16 (bfloat16) is the dominant format for modern model weights. Our algorithm introduces a 1.06% relative error in the weights, which translates to a +0.077% increase in perplexity, essentially no measurable impact on model quality.
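For context on the format itself: a BF16 value is the top 16 bits of an IEEE-754 float32, keeping the sign bit and full 8-bit exponent but only 7 mantissa bits. A stdlib-only sketch of that truncation (illustrative background, not our compression algorithm):

```python
import struct

def float32_to_bf16_bits(x: float) -> int:
    # BF16 = top 16 bits of an IEEE-754 float32:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits (truncated, not rounded)
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_float(b: int) -> float:
    # Widen back to float32 by zero-filling the dropped mantissa bits
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

# 1.0 survives the round trip exactly; values needing more than
# 7 mantissa bits lose precision
assert bf16_bits_to_float(float32_to_bf16_bits(1.0)) == 1.0
```

The small mantissa is also why BF16 tensors resist generic byte-level compressors: most of the entropy sits in those densely used exponent and mantissa bits.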

NVIDIA calls FP8 "lossless" at +0.1-0.5% PPL. We're below that.

This level of error is within the noise floor for most inference workloads; we verified it across several open-weight models with consistently negligible perplexity impact.
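The relative-error figure can be read as a norm ratio between the original and reconstructed weights. A minimal stdlib sketch of that metric, assuming the usual L2 (Frobenius) convention (the post doesn't pin down the exact norm, and real evaluation runs over full weight tensors):

```python
import math

def relative_error(original, reconstructed):
    # ||w - w_hat|| / ||w|| with the L2 norm; a 1.06% relative error
    # means this ratio is ~0.0106 over the model's weights
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, reconstructed)))
    norm = math.sqrt(sum(a * a for a in original))
    return diff / norm

w = [1.0, -2.0, 3.0, -4.0]
w_hat = [x * 1.01 for x in w]   # uniform +1% perturbation
err = relative_error(w, w_hat)  # ≈ 0.01
```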

What This Means for Egress

Putting it together for a 1 TB model on AWS S3:

| Scenario | Transfer Size | Cost per Transfer | Annual (10/mo) |
|---|---|---|---|
| Uncompressed | 1,000 GB | $90.00 | $10,800 |
| Our Compression | 470 GB | $42.30 | $5,076 |

That's a $5,724/year saving on egress alone for a single model with our approach, and the weights decompress to full BF16 on load with minimal quality impact.

Try It

Safetensor compression is available now through our API. Upload your .safetensors file and get back a compressed archive that decompresses to the original format. Works with any BF16 model — Llama, Mistral, Qwen, or your own fine-tunes.
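After decompressing, it's easy to sanity-check that what you got back really is a valid safetensors file: the format starts with an 8-byte little-endian header length followed by a JSON index of tensor names, dtypes, shapes, and byte offsets. A stdlib sketch of that check (a generic format parser, not our API client):

```python
import json
import struct

def read_safetensors_header(data: bytes) -> dict:
    # safetensors layout: u64 little-endian header size, then a JSON header
    # mapping tensor names to {"dtype", "shape", "data_offsets"}
    (n,) = struct.unpack("<Q", data[:8])
    return json.loads(data[8:8 + n].decode("utf-8"))

# Build a tiny in-memory example file to demonstrate the parser
header = {"w": {"dtype": "BF16", "shape": [2, 2], "data_offsets": [0, 8]}}
blob = json.dumps(header).encode("utf-8")
fake_file = struct.pack("<Q", len(blob)) + blob + b"\x00" * 8

parsed = read_safetensors_header(fake_file)
# parsed["w"]["dtype"] is "BF16", parsed["w"]["shape"] is [2, 2]
```

Running this against the decompressed archive confirms tensor names, dtypes, and shapes match the original before you load the model.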

For models over 100 GB, contact us for bulk pricing and dedicated transfer infrastructure.