
How to Compress FASTQ Files for S3 Archival

Step-by-step guide to compressing FASTQ files for long-term S3 storage. Compare gzip, Genozip, PetaGene, and 4BIN compression ratios and costs.

Why Compress FASTQ Before Archiving to S3?

Raw FASTQ files from whole-genome sequencing are enormous — typically 100–300 GB per sample. If you're archiving thousands of samples to S3 or Glacier, compression directly reduces your monthly bill.

The question isn't whether to compress — it's which compressor to use.

Comparing FASTQ Compressors

Here's how the major options stack up on a typical whole-genome FASTQ file:

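| Tool | Compressed Size (% of raw) | Compression Speed | Decompression Speed | Lossless |
| --- | --- | --- | --- | --- |
| gzip | ~25% | Fast | Fast | Yes |
| zstd | ~22% | Very fast | Very fast | Yes |
| Genozip | ~7% | Moderate | Fast | Yes |
| PetaGene | ~5.3% | Moderate | Fast | Yes |
| 4BIN | 4.5% | Moderate | Fast | Yes |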

All five are lossless. They differ mainly in how small the output gets, with the three smallest trading some compression speed for it.

The Cost Impact

For a 10,000-sample archive (average 200 GB raw per sample = 2 PB total):

| Tool | Archive Size | S3 Standard ($/yr) | S3 Glacier ($/yr) |
| --- | --- | --- | --- |
| gzip | 500 TB | $138,000 | $24,000 |
| Genozip | 140 TB | $38,640 | $6,720 |
| PetaGene | 106 TB | $29,256 | $5,088 |
| 4BIN | 90 TB | $24,840 | $4,320 |

Even at Glacier's low rates, 4BIN saves $19,680/year over gzip for a 10,000-sample archive.
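The figures in the cost table follow from simple arithmetic. The sketch below reproduces them under stated assumptions: decimal units (1 TB = 1,000 GB), the $0.023/GB/month S3 Standard rate quoted later in this post, and a Glacier rate of $0.004/GB/month, which is the rate implied by the table rather than one the article states. The function name is illustrative.

```python
def annual_storage_cost(raw_tb: float, ratio: float, usd_per_gb_month: float) -> float:
    """Yearly storage cost in USD for an archive compressed to `ratio` of its raw size."""
    compressed_gb = raw_tb * 1000 * ratio  # decimal units: 1 TB = 1,000 GB
    return compressed_gb * usd_per_gb_month * 12


RAW_TB = 10_000 * 200 / 1000  # 10,000 samples x 200 GB each = 2,000 TB (2 PB)
RATIOS = {"gzip": 0.25, "Genozip": 0.07, "PetaGene": 0.053, "4BIN": 0.045}
S3_STANDARD = 0.023  # USD per GB-month
S3_GLACIER = 0.004   # USD per GB-month (assumption: implied by the table)

for tool, ratio in RATIOS.items():
    standard = annual_storage_cost(RAW_TB, ratio, S3_STANDARD)
    glacier = annual_storage_cost(RAW_TB, ratio, S3_GLACIER)
    print(f"{tool:9s} {RAW_TB * ratio:6.0f} TB  ${standard:>9,.0f}  ${glacier:>7,.0f}")
```

Swapping in your own sample count, average sample size, and regional rates gives a quick estimate for a different archive. Request and retrieval charges are not modeled.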

How to Compress with 4BIN

Via API

curl -X POST https://smallest.zip/api/compress \
  -H "x-api-key: YOUR_API_KEY" \
  -F "file=@sample.fastq.gz" \
  -o sample.4bin

The API accepts both raw FASTQ and gzip-compressed FASTQ. It returns a .4bin compressed file.

Decompression

curl -X POST https://smallest.zip/api/decompress \
  -H "x-api-key: YOUR_API_KEY" \
  -F "file=@sample.4bin" \
  -o sample.fastq

The decompressed output is bit-identical to the original.
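One way to check the bit-identical claim on your own data is to compare checksums before and after a round trip. The sketch below uses only the Python standard library. It assumes, as the curl examples suggest but the article does not state, that a gzip-compressed upload comes back as plain FASTQ, so `.gz` inputs are hashed after gunzipping. The function names are illustrative.

```python
import gzip
import hashlib


def sha256_of_stream(fileobj, chunk_size=1 << 20):
    """Hash a binary file object in 1 MiB chunks so large FASTQ files fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def fastq_digest(path):
    """SHA-256 of the uncompressed FASTQ content; .gz inputs are gunzipped on the fly."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return sha256_of_stream(f)


def roundtrip_ok(original_path, restored_path):
    """True if the restored file has the same FASTQ content as the original."""
    return fastq_digest(original_path) == fastq_digest(restored_path)


# Example (paths match the curl commands above):
# roundtrip_ok("sample.fastq.gz", "sample.fastq")
```

Running this check before deleting any raw data is a cheap safeguard for an archive you cannot regenerate.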

S3 Archival Workflow

A typical archival pipeline:

  1. Receive FASTQ from sequencer
  2. Run QC (FastQC, MultiQC)
  3. Compress with 4BIN via API
  4. Upload to S3 (Standard for recent data, Glacier for archives)
  5. Delete raw FASTQ from local storage

When you need the data again, download from S3 and decompress via the API. The round-trip is lossless.
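Steps 3 to 5 of the pipeline above can be scripted. The following is a rough sketch, not a reference implementation: it wraps the article's own curl call and the AWS CLI's `aws s3 cp` command via `subprocess`. The bucket name, the output naming rule, and the `--fail` flag (which makes curl exit non-zero on HTTP errors) are assumptions added for illustration, and QC (steps 1 and 2) is left out.

```python
import subprocess
from pathlib import Path

API_URL = "https://smallest.zip/api/compress"  # endpoint from the curl example


def compress_cmd(fastq: Path, api_key: str) -> list:
    """Step 3: the article's curl call as an argument list."""
    # Assumed naming rule: sample.fastq or sample.fastq.gz -> sample.4bin
    out = fastq.with_name(fastq.name.split(".fastq")[0] + ".4bin")
    return ["curl", "--fail", "-X", "POST", API_URL,
            "-H", f"x-api-key: {api_key}",
            "-F", f"file=@{fastq}",
            "-o", str(out)]


def upload_cmd(archive: Path, bucket: str, storage_class: str = "DEEP_ARCHIVE") -> list:
    """Step 4: upload with the AWS CLI; `bucket` is a hypothetical bucket name."""
    return ["aws", "s3", "cp", str(archive), f"s3://{bucket}/{archive.name}",
            "--storage-class", storage_class]


def archive_sample(fastq: Path, api_key: str, bucket: str) -> None:
    """Compress, upload, then delete the local raw file."""
    cmd = compress_cmd(fastq, api_key)
    subprocess.run(cmd, check=True)        # raises CalledProcessError on failure
    archive = Path(cmd[-1])                # the -o target
    subprocess.run(upload_cmd(archive, bucket), check=True)
    fastq.unlink()                         # step 5: reached only if both commands succeeded


# Example: archive_sample(Path("sample.fastq.gz"), "YOUR_API_KEY", "my-archive-bucket")
```

Because `check=True` raises on any non-zero exit, the raw FASTQ is only deleted after both the compression and the upload have succeeded. For irreplaceable data you may want to verify a decompression round trip before the delete step.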

Tips for S3 Storage Classes

  • S3 Standard: For data accessed regularly (active projects). $0.023/GB/month.
  • S3 Infrequent Access: For data accessed a few times per year. $0.0125/GB/month.
  • S3 Glacier Deep Archive: For long-term retention. $0.00099/GB/month, but standard retrievals take up to 12 hours (bulk retrievals up to 48).

With 4BIN compression, even S3 Standard becomes affordable for large archives. A 2 PB raw archive compressed to 4.5% is 90 TB, which costs $2,070/month on Standard.
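To compare the three storage classes for the same archive, the per-GB rates listed above can be plugged into one small function. This is a minimal sketch that covers storage charges only: request fees, retrieval fees, and minimum storage durations are ignored, and decimal units (1 TB = 1,000 GB) are assumed.

```python
# USD per GB-month, as listed in the storage class tips
RATES = {
    "S3 Standard": 0.023,
    "S3 Infrequent Access": 0.0125,
    "S3 Glacier Deep Archive": 0.00099,
}


def monthly_cost(raw_tb: float, ratio: float, usd_per_gb_month: float) -> float:
    """Monthly storage cost in USD for raw_tb of data compressed to `ratio` of its size."""
    return raw_tb * 1000 * ratio * usd_per_gb_month


for name, rate in RATES.items():
    # 2 PB raw at 4.5% is 90 TB compressed
    print(f"{name}: ${monthly_cost(2000, 0.045, rate):,.2f}/month")
```

For the 90 TB archive this works out to $2,070 on Standard, $1,125 on Infrequent Access, and about $89 on Deep Archive per month.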

Get Started

Create a free account to get API access, or check the full API documentation.