Why Compress FASTQ Before Archiving to S3?
Raw FASTQ files from whole-genome sequencing are enormous — typically 100–300 GB per sample. If you're archiving thousands of samples to S3 or Glacier, compression directly reduces your monthly bill.
The question isn't whether to compress — it's which compressor to use.
Comparing FASTQ Compressors
Here's how the major options stack up on a typical whole-genome FASTQ file:
| Tool | Compressed Size (% of raw) | Compression Speed | Decompression Speed | Lossless |
|---|---|---|---|---|
| gzip | ~25% | Fast | Fast | Yes |
| zstd | ~22% | Very fast | Very fast | Yes |
| Genozip | ~7% | Moderate | Fast | Yes |
| PetaGene | ~5.3% | Moderate | Fast | Yes |
| 4BIN | 4.5% | Moderate | Fast | Yes |
All five are lossless. The main difference is how small the output gets, with the domain-specific tools trading some compression speed for a much smaller file.
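Ratios like these vary with read length, coverage, and quality-score binning, so it can be worth measuring the general-purpose tools on one of your own files. The sketch below is one way to do that; the function names `ratio_pct` and `measure` are made up for this example, and it assumes `gzip` and `zstd` are installed and that the input file is not empty.

```shell
# Print compressed size as a percentage of raw size, to one decimal place.
ratio_pct() {
  awk -v raw="$1" -v comp="$2" 'BEGIN { printf "%.1f\n", 100 * comp / raw }'
}

# Stream a file through a compressor and count the output bytes,
# so no compressed copy is written to disk.
# Usage: measure <file> <command> [args...]
measure() (
  file=$1
  shift
  raw=$(wc -c < "$file")
  comp=$("$@" < "$file" | wc -c)
  printf '%s: %s%%\n' "$1" "$(ratio_pct "$raw" "$comp")"
)
```

For example, `measure sample.fastq gzip -c` and `measure sample.fastq zstd -c` each print the tool name followed by its compressed size as a percentage of the raw file.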
The Cost Impact
For a 10,000-sample archive averaging 200 GB raw per sample (2 PB total), priced at $0.023/GB/month for S3 Standard and $0.004/GB/month for S3 Glacier:
| Tool | Archive Size | S3 Standard ($/yr) | S3 Glacier ($/yr) |
|---|---|---|---|
| gzip | 500 TB | $138,000 | $24,000 |
| Genozip | 140 TB | $38,640 | $6,720 |
| PetaGene | 106 TB | $29,256 | $5,088 |
| 4BIN | 90 TB | $24,840 | $4,320 |
Even at Glacier's lower rates, switching from gzip to 4BIN saves $19,680/year on a 10,000-sample archive ($24,000 vs. $4,320). On S3 Standard the gap is $113,160/year.
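To rerun the arithmetic with your own sample count or rates, a small sketch like the following reproduces the table. The helper names are invented for this example, and the rates ($0.023 and $0.004 per GB per month, with 1 TB counted as 1,000 GB) are the ones the figures above imply, not a quote of current AWS pricing.

```shell
# Archive size in TB after compression: raw TB times compressed-size percent.
archive_tb() {
  awk -v raw_tb="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", raw_tb * pct / 100 }'
}

# Annual storage cost in USD: TB -> GB (1 TB = 1000 GB) x monthly rate x 12.
annual_cost_usd() {
  awk -v tb="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", tb * 1000 * rate * 12 }'
}

# One row per tool: name, archive size, S3 Standard $/yr, S3 Glacier $/yr.
for entry in gzip:25 Genozip:7 PetaGene:5.3 4BIN:4.5; do
  tool=${entry%%:*}
  pct=${entry#*:}
  tb=$(archive_tb 2000 "$pct")
  printf '%-9s %4s TB  $%s  $%s\n' "$tool" "$tb" \
    "$(annual_cost_usd "$tb" 0.023)" "$(annual_cost_usd "$tb" 0.004)"
done
```

Change the `2000` (raw archive size in TB) or the two rates to model a different archive or region.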
How to Compress with 4BIN
Via API
curl -X POST https://smallest.zip/api/compress \
-H "x-api-key: YOUR_API_KEY" \
-F "file=@sample.fastq.gz" \
-o sample.4bin
The API accepts both raw FASTQ and gzip-compressed FASTQ. It returns a .4bin compressed file.
Decompression
curl -X POST https://smallest.zip/api/decompress \
-H "x-api-key: YOUR_API_KEY" \
-F "file=@sample.4bin" \
-o sample.fastq
The decompressed output is bit-identical to the original.
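If you want to confirm that claim on your own data before deleting anything, one option is to compare checksums of the original and the round-tripped file. This is a minimal sketch, assuming GNU `sha256sum` is available and that the API returns uncompressed FASTQ (as the `sample.fastq` output name suggests); `fastq_stream` and `same_fastq` are hypothetical helper names. Gzipped inputs are decompressed before hashing, because the `.gz` container itself will not match the plain FASTQ bytes.

```shell
# Stream a FASTQ file's uncompressed contents, whether or not it is gzipped.
fastq_stream() {
  case "$1" in
    *.gz) gzip -dc -- "$1" ;;
    *)    cat -- "$1" ;;
  esac
}

# Succeed only if both files exist and hold byte-identical FASTQ data.
same_fastq() {
  [ -r "$1" ] && [ -r "$2" ] || return 1
  a=$(fastq_stream "$1" | sha256sum | cut -d' ' -f1)
  b=$(fastq_stream "$2" | sha256sum | cut -d' ' -f1)
  [ "$a" = "$b" ]
}
```

For example, `same_fastq sample.fastq.gz sample.fastq && echo "round-trip OK"` prints the message only when the contents match.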
S3 Archival Workflow
A typical archival pipeline:
- Receive FASTQ from sequencer
- Run QC (FastQC, MultiQC)
- Compress with 4BIN via API
- Upload to S3 (Standard for recent data, Glacier for archives)
- Delete raw FASTQ from local storage once the upload has succeeded
When you need the data again, download it from S3 (restoring it first if it sits in a Glacier class that requires a restore) and decompress via the API. The round-trip is lossless.
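The pipeline steps above can be sketched as a single shell function. Treat this as an illustration under assumptions rather than a reference implementation: it assumes `fastqc`, `curl`, and a configured AWS CLI are on the PATH, reads the key from a hypothetical `API_KEY` environment variable, and uses invented names (`run`, `archive_sample`) and an object key equal to the file name. Setting `DRY_RUN=1` prints each command instead of running it.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# QC, compress, upload, then delete the local raw file.
# Usage: archive_sample <fastq> <bucket> [storage-class]
archive_sample() {
  fastq=$1
  bucket=$2
  class=${3:-DEEP_ARCHIVE}
  out=${fastq%.gz}
  out=${out%.fastq}.4bin

  run fastqc "$fastq" || return 1

  # --fail makes curl exit non-zero on an HTTP error instead of
  # saving the error response as the .4bin file.
  run curl --fail -X POST https://smallest.zip/api/compress \
    -H "x-api-key: ${API_KEY:-YOUR_API_KEY}" \
    -F "file=@${fastq}" \
    -o "$out" || return 1

  run aws s3 cp "$out" "s3://${bucket}/$(basename "$out")" \
    --storage-class "$class" || return 1

  # Reached only if every earlier step succeeded.
  run rm -- "$fastq"
}
```

Running `DRY_RUN=1 archive_sample sample.fastq.gz my-bucket` is a cheap way to check the commands before pointing the function at real data. Each step returns early on failure, so the raw file is removed only after the upload has succeeded.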
Tips for S3 Storage Classes
- S3 Standard: For data accessed regularly (active projects). $0.023/GB/month.
- S3 Infrequent Access: For data accessed a few times per year. $0.0125/GB/month.
- S3 Glacier Deep Archive: For long-term retention. $0.00099/GB/month, but standard retrieval takes up to 12 hours (bulk retrieval up to 48).
With 4BIN compression, even S3 Standard becomes affordable for large archives. A 2 PB archive compressed to 4.5% of its raw size occupies 90 TB, which costs $2,070/month on Standard (90,000 GB × $0.023).
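If you keep recent data on Standard and older data in a colder class, one way to automate the move is an S3 lifecycle rule. The sketch below only generates the rule's JSON. The function name, rule ID, prefix, and 90-day threshold are placeholders chosen for illustration, so adjust them to your own retention policy.

```shell
# Print a lifecycle configuration that transitions objects under a prefix
# to a colder storage class after a given number of days.
# Usage: lifecycle_json <prefix> <days> [storage-class]
lifecycle_json() {
  prefix=$1
  days=$2
  class=${3:-DEEP_ARCHIVE}
  cat <<EOF
{
  "Rules": [
    {
      "ID": "archive-compressed-fastq",
      "Status": "Enabled",
      "Filter": { "Prefix": "${prefix}" },
      "Transitions": [
        { "Days": ${days}, "StorageClass": "${class}" }
      ]
    }
  ]
}
EOF
}

# Example (hypothetical bucket name and prefix):
#   lifecycle_json "fastq/" 90 > lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket my-sequencing-archive \
#     --lifecycle-configuration file://lifecycle.json
```

With a rule like this in place, uploads can land in Standard and move to the archive class on their own once they pass the age threshold.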
Get Started
Create a free account to get API access, or check the full API documentation.