HUPO-PSI Recommendation · Working Draft

File Format Specification¶

An open, columnar, random-access file format for storing mass spectra, chromatograms, and rich instrument metadata — designed to succeed mzML for modern, large-scale mass spectrometry.

Status: Draft Version 0.9 · Draft 5 Built on Apache Parquet CC-BY-ND 4.0

This is a working draft

This document is a DRAFT under active revision and has not yet been ratified through the HUPO-PSI Document Process. Content, schemas, and controlled-vocabulary terms may change. Feedback is explicitly solicited — please open an issue.

Abstract¶

mzPeak is an open, next-generation file format for mass spectrometry data. Advances in MS instrumentation — higher resolution, faster scan rates, greater sensitivity, and the growing use of ion mobility and imaging — have sharply increased the volume and dimensionality of the data produced in proteomics, metabolomics, and lipidomics. Established open formats such as mzML and imzML increasingly struggle with the resulting file sizes, data-access patterns, and metadata demands, while vendor-specific formats offer faster access at the cost of interoperability and long-term archival stability. mzPeak is designed to address these needs as a community standard; the motivation and design rationale are set out in the accompanying white paper (1).

The format takes a hybrid approach: an archive of Apache Parquet tables — packaged in an uncompressed ZIP container, or an unpacked directory, together with a small JSON index — that stores numerical signal data as compact, columnar binary while keeping metadata both human- and machine-readable and anchored in the PSI-MS controlled vocabulary. Building on the widely available Apache Parquet and Apache Arrow libraries rather than a single core implementation, the on-disk structure is language-independent. Uncompressed archive members and a required Parquet page index let a reader locate and decode a single spectrum without parsing an entire run, the basis for random access from local disk, object storage, or over HTTP. This document specifies that format: its container, signal-data layouts, index file, and metadata model.

What you'll find here¶

Introduction

Why mzPeak exists, what problems it solves, and a high-level tour of an archive.

Overview & motivation
Foundations

A short Parquet primer and how multiple Parquet files are bundled into one archive.

Parquet & containers
Data Layouts

Packed metadata tables, the array index, and the point & chunked signal layouts.

Layouts
Archive Organization

The mzpeak_index.json file, data kinds, and entity types that organise a run.

The index file
File Schemas

Concrete column-level schemas for spectra, chromatograms, and wavelength spectra.

Schemas
Implementations

The read/write libraries — Rust, Python, R, C#, Java, JavaScript/TypeScript, and C++.

Implementations
Tools

Converters, the conformance validator, browser viewers, and example data.

Tools
Reference

Glossary, bibliography, authorship, and the intellectual-property statement.

Reference

Anatomy of an archive¶

An mzPeak archive (.mzpeak) is a small set of named files:

File	Contents
`mzpeak_index.json`	Manifest describing every file in the archive by entity type and data kind.
`spectra_metadata.parquet`	Spectrum-level and file-level metadata (descriptions, scans, precursors, selected ions).
`spectra_data.parquet`	Spectrum profile signal data, in point or chunked layout.
`spectra_peaks.parquet` (optional)	Spectrum centroid signal, stored separately from profile data.
`chromatograms_metadata.parquet`	Chromatogram-level and file-level metadata.
`chromatograms_data.parquet`	Chromatogram signal data, in point or chunked layout.

Additional Parquet files may be added to cover further modalities (for example wavelength spectra) as the format grows.

Reference implementations¶

mzPeak is backed by seven independent, from-scratch implementations (not bindings to a single core):

Rust — read/write reference implementation · HUPO-PSI/mzPeak
Python — read-only, zero-copy Arrow/Pandas API · mzpeak_prototyping
R — read-only, dplyr-compatible access · mzpeak_prototyping
C# — read/write · HUPO-PSI/mzPeak.NET
Java — read/write demonstrator · okohlbacher/mzPeakJ
JavaScript / TypeScript — read-only, runs in the browser, Node, and Deno · online viewer
C++ — read-only, work in progress · OpenMS/mzpeak

See Implementations for details, and Tools for converters, the conformance validator, and viewers.

Status of this document¶

This document provides information to the proteomics community about the mzPeak file format. Distribution is unlimited. It will be ratified via the HUPO Proteomics Standards Initiative (PSI) Document Process, and any alterations must also follow the HUPO-PSI Document Process.

Version: Draft 5 of version 0.9