Seismic Data Compression — Cut Petabyte SEG-Y Storage by 95%

The math at 1 petabyte / year

Active O&G surveys routinely produce 1–10 PB per program. Here's what 1 PB of SEG-Y costs to store for a year, before any compression on smallest.zip:

Option	How it's sold	Typical 1 PB-year cost	Notes
Raw SEG-Y on S3 Standard	$0.023 / GB-mo¹	~$282,000 / yr	No compression, no dedup, no sub-volume access
Raw SEG-Y on S3 Glacier Deep Archive	$0.00099 / GB-mo¹	~$12,150 / yr	Cold-only, 12+ hr restore, no random access, retrieval fees
TerraSpark Compression	Per-survey licence, contact sales²	~$$$$ (six figures typical)	Industry standard for active seismic; closed format, per-seat licensing
ZGY (Schlumberger)	Bundled with Petrel	Not separately priced	Internal format; not generally available as a standalone codec
smallest.zip Seismic Archive	$0.10 / GB processed + $0.005 / GB-mo stored (compressed)	~$103,000 / yr first survey, dropping fast with dedup	Per-GB SaaS or on-prem enterprise; lossless / lossy / wavelet in one codec

1 PB ingest @ $0.10 = $100,000 one-time + 1 PB compressed to ~70 TB at $0.005 = ~$4,300 / yr storage. Second similar survey adds only the recipe — at 100:1 dedup the marginal storage cost on a re-archive run is <$50.
¹AWS S3 us-east-1 list prices, ignoring egress and retrieval. ²TerraSpark public materials do not list per-survey pricing — figure based on customer reports.

Three modes, one codec

Pick the right trade-off per workflow stage. All three share the same on-disk store, so cross-mode dedup is automatic.

L

Lossless

Byte-exact roundtrip on int16 SEG-Y (format 3). For the regulatory archive — what the regulator hands back is bit-identical to what you sent. Delta + zstd + content-addressed dedup. Float-format SEG-Y (formats 1, 5) routes through a high-fidelity wavelet path; ask sales about our format-5 native lossless beta.

Q

Lossy

Typical 20:1 single-file ratio at 55+ dB PSNR. For active processing where the seismic interpreter cares about reflectivity, not the last quantization bit. 6/7/8-bit profiles let you tune the curve.

W

Wavelet

3-D Daubechies-4 DWT with 64×64×N bricks. Random-access sub-volume decode — pull a 64-trace inline strip without unpacking the whole survey. Powers interactive viewers on top of compressed archives.

Cross-survey deduplication — the compound win

SEG-Y archives are full of duplication: regulatory re-submissions, time-lapse acquisitions over unchanged geology, regional overlap zones, dev/test replicas of production data. Our codec hashes each compressed sub-volume; identical content stores once across the whole tenant.

Scenario (5.8 MB int16 F3 slice, lossless mode)	Recipes total	Store DB	System total	Effective ratio
1 survey (first upload)	6 KB	4.0 MB	4.0 MB	0.69 (1.4×)
5 re-uploads of the same survey	31 KB	4.1 MB	4.1 MB	0.14 (7.1× / 86% smaller)

Verified in our codec audit: each duplicate after the first adds only ~6 KB (the recipe), regardless of file size. At petabyte fleet scale this is where the codec earns its keep.

Honest caveat: dedup triggers on identical compressed sub-volumes — replicas, overlap zones, re-archives, time-lapse over unchanged geology. Two independently-acquired surveys of nominally-similar terrain rarely dedup, because acquisition noise differs at every trace. We don't sell fuzzy dedup; we sell content-addressed dedup that's mathematically exact.

Benchmarks

All numbers from codec-audit/segy/VALIDATION-REPORT.md — reproducible from the audit script.

Input	Mode	Recipe	Encode time	PSNR	Byte-exact
F3 slice (5.8 MB, int16, 5000 traces)	Lossless	6.2 KB	0.1 s	∞	byte-exact
F3 slice (5.8 MB, int16, 5000 traces)	Lossy 7-bit	91 KB	1.7 s	55.2 dB	n/a
F3 slice (5.8 MB, int16, 5000 traces)	Wavelet (db4, medium)	6.6 KB	10.2 s	53.8 dB	n/a
Synthetic seismic (2.0 MB, 900 traces, Ricker + noise)	Wavelet medium	1.2 KB	0.6 s	~52 dB	n/a
5 × F3 slice (29.1 MB total, identical re-uploads)	Lossless + dedup	31 KB + 4.08 MB store	~0.5 s	∞	byte-exact

Encode is CPU-intensive on the first survey, especially in wavelet mode (~2 MB/s single-threaded; we shard for higher throughput on enterprise nodes). Decompression is much faster — typical 30–50 MB/s. The win compounds across surveys that share sub-volumes.

See it on your own seismic

Drop a SEG-Y file (up to 200 MB). Pick a mode. Get the recipe + store back as a single bundle.

Compress a SEG-Y file →

No signup. Rate-limited to 1 per hour per IP because the encode is heavy. Larger files? Talk to sales.

Frequently asked questions

Which SEG-Y trace formats are supported?

All five: IBM float (format 1), int32 (2), int16 (3), fixed-point (4), and IEEE float (5). Lossless is bit-exact for int16 today; int32 and float formats round-trip through a high-fidelity wavelet path (53–60 dB PSNR). Native float-lossless is in beta — contact sales for early access.

Are headers preserved?

Yes. The 3200-byte EBCDIC textual header and the 400-byte binary file header are stored verbatim. Per-trace 240-byte headers are template-compressed (we extract the constant fields and delta-encode the varying ones), then restored byte-for-byte on decode.

Will it work with Petrel, Kingdom, OpenSeisWorks?

Yes — the output of decompression is a fully-conformant SEG-Y file (rev1 / rev2). You decompress on the way out and feed the resulting .segy to any interpretation package. There is nothing proprietary in the output.

How does wavelet mode enable sub-volume access?

We bricks each survey into 64×64×N cubes and DWT each brick independently. To extract a 64-inline strip, we pull only the bricks that intersect it, inverse-DWT, and return a NumPy array — no need to decompress the whole survey. Powers responsive viewers on top of cold storage.

Can I run this on-prem / air-gapped?

Yes — enterprise tier ships a single binary (or Docker image) you run in your data center. No callbacks, no telemetry, no internet required. Bring your own object store (S3, MinIO, Ceph, Azure Blob, NFS).

How does this compare to TerraSpark?

TerraSpark is the incumbent for active seismic compression — high quality, well-trusted, but priced per-survey with closed format and per-seat licensing. We're priced per-GB processed and per-GB-month stored, with the cross-survey-dedup compound savings as the main differentiator at petabyte scale. The decoder is single-binary, open-format, and yours forever on enterprise.

What about ZGY?

ZGY is great if you're all-in on Petrel, but it's not generally available as a standalone codec — and it's not designed for cross-survey deduplication across a heterogeneous fleet. We complement Petrel rather than replace it: ZGY in the workstation, smallest.zip in the archive.

Encode speed?

Single-thread: lossless ~50 MB/s, lossy ~3 MB/s, wavelet ~2 MB/s. We shard per-brick and per-survey for parallel ingest — typical enterprise box does 1 TB/hour wavelet encode. Decompression is faster: 30–80 MB/s single-thread, much higher in parallel.

Compliance, residency, SLAs?

SOC 2 Type II aligned. TLS 1.3 in transit, AES-256 at rest. EU and US data residency. 99.9% on standard tier, 99.99% with multi-region replication on enterprise. Talk to us for specifics.

What happens if smallest.zip disappears?

The decompressor is a standalone binary; enterprise customers get a perpetual source-available licence to it. Your archive is never trapped — worst case, you spin up the binary, point it at the store and recipes, and pull out original SEG-Y.

Compress a petabyte of seismic for less than the cost of one TerraSpark licence

Drop a real SEG-Y file and see the compression on your own survey. No signup, no credit card.

Try it free, no signup Talk to sales

Seismic data, ~95% smaller at petabyte scale