FASTQ vs BAM: The Storage Trade-Off
Most sequencing pipelines produce both FASTQ (raw reads) and BAM (aligned reads), and many organizations keep both: FASTQ for reprocessing and BAM for downstream analysis. But which one costs more to store, and how much can you save by compressing each?
File Size Comparison
For a typical 30x whole-genome sample:
| Format | Typical Size | What It Contains |
|---|---|---|
| FASTQ (paired) | 200 GB | Raw reads + quality scores |
| BAM (aligned) | 120 GB | Aligned reads + CIGAR + metadata |
| CRAM 3.1 | ~50 GB | Reference-based BAM compression |
The FASTQ figure is larger mainly because it is uncompressed text, whereas BAM is a binary format that is already block-compressed (BGZF). BAM carries the same bases and quality scores, plus alignment information. Many labs store both, totaling 320 GB per sample.
Compression Ratios
Here's how our compressor handles each format:
| Format | Original Size | gzip | Our Compression | Ratio (Compressed / Original) |
|---|---|---|---|---|
| FASTQ | 200 GB | 50 GB | 9 GB | 4.5% |
| BAM | 120 GB | n/a (already compressed) | 17 GB | 14.2% |
| Total | 320 GB | 170 GB | 26 GB | 8.1% |
From 320 GB down to 26 GB per sample — a 92% reduction.
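The ratios in the table can be recomputed directly from the sizes. The following is a minimal sketch that only restates the table's arithmetic; the names `SIZES_GB`, `ratio_percent`, and `summarize` are illustrative and not part of any API.

```python
# Sizes in GB per 30x whole-genome sample, copied from the table above.
SIZES_GB = {
    "FASTQ": {"original": 200, "compressed": 9},
    "BAM": {"original": 120, "compressed": 17},
}


def ratio_percent(original_gb, compressed_gb):
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * compressed_gb / original_gb


def summarize(sizes):
    """Return per-format ratios plus the combined totals."""
    rows = {
        name: round(ratio_percent(s["original"], s["compressed"]), 1)
        for name, s in sizes.items()
    }
    total_original = sum(s["original"] for s in sizes.values())
    total_compressed = sum(s["compressed"] for s in sizes.values())
    rows["Total"] = round(ratio_percent(total_original, total_compressed), 1)
    return rows, total_original, total_compressed


if __name__ == "__main__":
    rows, total_original, total_compressed = summarize(SIZES_GB)
    for name, pct in rows.items():
        print(f"{name}: {pct}% of original size")
    reduction = 100.0 - ratio_percent(total_original, total_compressed)
    print(f"{total_original} GB -> {total_compressed} GB ({reduction:.0f}% reduction)")
```

Swapping in your own per-sample sizes gives a quick estimate of what the same ratios would mean for your data.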
Cost at Scale
For 10,000 whole-genome samples on S3 (the costs below correspond to a flat rate of $0.023 per GB-month):
| Storage Strategy | Total Size | Annual S3 Cost |
|---|---|---|
| FASTQ + BAM (uncompressed) | 3.2 PB | $883,200 |
| FASTQ (gzip) + BAM | 1.7 PB | $469,200 |
| FASTQ (4BIN) + BAM (compressed) | 260 TB | $71,760 |
Compressing both formats saves $811,440 per year compared with storing them as-is.
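The cost figures above can be reproduced with a few lines of arithmetic. This sketch assumes a flat rate of $0.023 per GB-month and decimal units (1 TB = 1,000 GB), which is the rate that matches the table; real S3 pricing is tiered and varies by region and storage class, so treat the output as an estimate. The function and variable names are illustrative.

```python
# Assumption: flat $0.023 per GB-month, decimal units. This rate reproduces
# the table's figures; actual S3 pricing is tiered and region-dependent.
PRICE_PER_GB_MONTH = 0.023
SAMPLES = 10_000

# Per-sample footprint in GB for each storage strategy, from the tables above.
STRATEGIES_GB = {
    "FASTQ + BAM (as-is)": 320,
    "FASTQ (gzip) + BAM": 170,
    "FASTQ + BAM (both compressed)": 26,
}


def annual_cost_usd(gb_per_sample, samples=SAMPLES, price=PRICE_PER_GB_MONTH):
    """Estimate yearly storage cost for a cohort at a flat per-GB-month rate."""
    total_gb = gb_per_sample * samples
    return total_gb * price * 12


if __name__ == "__main__":
    for name, gb in STRATEGIES_GB.items():
        print(f"{name}: ${annual_cost_usd(gb):,.0f}/year")
    saved = annual_cost_usd(320) - annual_cost_usd(26)
    print(f"Saved by compressing both: ${saved:,.0f}/year")
```

Changing `SAMPLES` or `PRICE_PER_GB_MONTH` adapts the estimate to a different cohort size or storage class.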
Should You Keep Both?
Keep FASTQ if:
- You may need to re-align with a newer reference genome
- Your pipeline requires raw reads for custom processing
- Regulatory requirements mandate raw data retention
Keep only BAM if:
- Your alignment is final and won't be re-run
- You need fast access to aligned reads for variant calling
- Storage is your primary bottleneck
Our recommendation: Keep both, compressed. At 26 GB per sample (compressed FASTQ + BAM), you can store 10,000 genomes for $71,760/year on S3. That is well below the roughly $331,200/year it would cost, at the same rate, to store the 120 GB BAM files alone.
Getting Started
Compress both FASTQ and BAM files through our API. Compression is fully lossless for both formats: the decompressed output is bit-identical to the original.
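One way to check the bit-identical claim on your own data is to compare checksums of the original file and the file restored after a round trip. The sketch below uses only Python's standard library and does not call the compression API itself; the helper names and the file paths in the usage example are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large FASTQ/BAM files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_bit_identical(original_path, restored_path):
    """Return True when both files hash to the same SHA-256 digest."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

For example, `is_bit_identical("sample.bam", "sample.restored.bam")` should return `True` after a lossless compress and decompress cycle.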
Sign up for free or contact us for enterprise volumes.