PSI Recommendation
PSI Mass Spectrometry and Proteomics Informatics Working Groups Status: DRAFT
Status of this document
This document provides information to the proteomics community about a TODO. Distribution is unlimited. This document will be ratified via the HUPO Proteomics Standards Initiative (PSI) Document Process. Any alterations of this document MUST also follow the HUPO PSI Document Process.
Version Draft 5 of version 0.9
mzpeak_index.json
spectra_data.parquet
spectra_peaks.parquetchromatograms_data.parquet
chromatograms_metadata.parquetwavelength_spectra_metadata.parquetThe key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” are to be interpreted as described in RFC 2119 (2).
This document describes both a file format and a set of suggested algorithms for preparing data to be stored in that format. The original author (Joshua Klein), includes snippets of Python code to do these operations under the assumption that at this time most technical programmers will know Python, that Python is effectively executable pseudocode, save that the code snippets use three components, abstract base classes from the standard library for type annotations, NumPy v2.1 for certain array operations which are assumed to be understandable, and PyArrow v20.0 for operations on Apache Arrow arrays which are conceptually equivalent for in-memory representations of the data stored in Parquet files.
mzPeak is an archive of multiple Parquet files, stored directly in an uncompressed ZIP archive or unpacked directory/prefix. Each Parquet file describes a different facet of the stored mass spectrometry run. While the the data model draws on prior art like mzML (https://peptideatlas.org/tmp/mzML1.1.0.html), it is not a direct re-implementation in a Parquet table. It does attempt to re-use concepts like controlled vocabularies where feasible as well as arbitrary additional user metadata.
Components of an mzPeak archive:
mzpeak_index.json: Definition of the files present in
the archive, encoded as JSON. This makes resolving files by controlled
terms easier than matching file names.spectra_metadata.parquet: Spectrum level metadata and
file-level metadata. Includes spectrum descriptions, scans, precursors,
and selected ions using packed parallel tables.spectra_data.parquet: Spectrum signal data in either
profile or centroid mode. May be in point layout or chunked layout which
have different size and random access characteristics.spectra_peaks.parquet (optional): Spectrum centroids
stored explicitly separately from whatever signal is in
spectra_data.parquet, such as from instrument vendors who
store both profile and centroid versions of the same spectra. This file
may not always be present.chromatograms_metadata.parquet: Chromatogram-level
metadata and file-level metadata. Includes chromatogram descriptions, as
well as precursors and selected ions using packed parallel tables.chromatograms_data.parquet: Chromatogram signal data.
May be in point layout or chunked layout which have different size and
random access characteristics. Intensity measures with different units
may be stored in parallel.and other Parquet files may be added to cover additional modalities as needed like the wavelength spectra.
This is a minimal overview of Parquet. For more details, please see https://parquet.apache.org/ for further explanation.
Parquet files contain a physical data schema defining how their data columns are encoded in bytes on disk. This schema supports arbitrary levels of nullability, nesting (groups) or repetition (lists). These physical data types may also be mapped to one or more “logical types”.
There is a broader many-to-many mapping between Parquet schemas and Apache Arrow schemas. Arrow supports many types that Parquet does not, but they share a common abstraction of columnar data storage with a notion of per-value nullability, and while they store these concepts differently, it is straight-forwards to convert from one to the other.
At the end of a Parquet file is a footer containing user-defined metadata along with the file’s schema, offsets and search indices. This user-defined metadata is stored in key-value pairs, which makes it amenable to serializing light-weight, immediately interesting metadata there that do not make sense to force into the data columns.
Parquet is a strongly typed binary columnar data format with layered blocked compression that permits a degree of random access.
As some languages do not have a concept of 64-bit addresses, all
implementations MUST handle both large_* and not
large_* variants of collection arrays, list,
string, and binary types.
TODO: write more here regarding compression vs. random access granularity. Lots of knobs to twist
When writing an mzPeak archive, the writer MUST write a page index. Most libraries that write Parquet support writing the page index, even if they do not directly support reading informed by the page index.
TODO: write more here
In order to pack multiple Parquet tables together under a single file name on disk, we need a container file format. To that end we use the ZIP archive to bundle multiple files together. ZIP files start with a header containing the magic bytes followed by a sequence of blocks of (header, file) pairs, terminating with a central directory listing how to find each file in the archive. Files saved in a ZIP may be stored compressed or uncompressed. When mzPeak is stored in a ZIP it MUST store its member files uncompressed. Files stored uncompressed can be read directly without requiring some intervening decompression to occur to reveal the Parquet file, and the Parquet file itself contains layered compression that is superior to that of most plain ZIP compressors.
TAR archives are designed for a linear traversal. In order to know all of the files in the archive, you must jump from header entry to header entry until you reach the end of the archive. Compared to ZIP’s central directory index, this is less efficient and more expensive for object stores. TAR does not support per file encryption either, making protecting parts of the archive that are not in Parquet more difficult.
If an mzPeak archive is stored in an unpacked directory, the directory name is treated as the name of the name of the run file.
The format specification described in this document is not being developed in isolation; indeed, it is designed to be complementary to, and thus used in conjunction with, several existing and emerging models. Related specifications include the following:
The spectra_metadata.parquet and
chromatograms_metadata.parquet store multiple schemas in
parallel. In these Parquet files, the root schema is made up of several
branched “group” or “struct” (Parquet vs. Arrow nomenclature) that may
be null at any level. We use relational database language, specifically,
“primary key” and “foreign key” to describe the interconnections between
the different tables that are packed together here.
Here is a stripped down example where two rows of related MS1 and MS2
spectra. Treat scan.source_index,
precursor.source_index,
precursor.precursor_index,
selected_ion.source_index, and
selected_ion.precursor_index as a foreign key with respect
to spectrum.index, a primary key.
precursor.source_index refers to the spectrum
which this precursor record belongs to and
precursor.precursor_index refers to the
spectrum that is that is the precursor of the
spectrum referenced by precursor.source_index,
and (precursor.source_index,
precursor.precursor_index) forms a compound primary key.
Any of these columns may be null which means that such a
record does not exist in the table. This also applies to the
selected_ion facet.
| spectrum | scan | precursor | selected_ion | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| index | id | time |
MS_1000511_ ms_level |
source_ index |
MS_1000616_preset_ scan_configuration |
source_ index |
precursor_ index |
isolation_ window |
source_ index |
precursor_ index |
MS_1000744_selected_ ion_mz |
| … | |||||||||||
| 502 | scan=502 | 20.51 | 1 | 502 | 3 | 503 | 502 | {…} | 503 | 502 | 233.5 |
| 503 | scan=503 | 20.531 | 2 | 503 | 2 | 504 | 502 | {…} | 504 | 502 | 562.3 |
| … | |||||||||||
Like mzML, mzPeak makes heavy use of controlled vocabularies for representing rich metadata. mzPeak uses controlled vocabulary terms in several ways:
has_value_type) OR the a CURIE for
a child of the column name. For example
MS_1000525_spectrum_representation
would have values that are CURIEs for a child term,
MS:1000127 “centroid spectrum” or MS:1000128
“profile spectrum”, as appropriate for the spectrum the row is
describing.MS_1000511_ms_level
would hold an integer value, as appropriate for the spectrum the row is
describing.parameters arrays,
akin to <cvParam /> in mzML. For every schema facet
of a metadata table, a parameters column is allowed. See
the parameters list for details.parameters listThe parameters column may be present in any facet of a
metadata table. It MUST be a list of the following schema:
optional group field_id=-1 parameters (List) {
repeated group field_id=-1 list {
optional group field_id=-1 item {
optional group field_id=-1 value {
optional int64 field_id=-1 integer;
optional double field_id=-1 float;
optional binary field_id=-1 string (String);
optional boolean field_id=-1 boolean;
}
optional binary field_id=-1 accession (String);
optional binary field_id=-1 name (String);
optional binary field_id=-1 unit (String);
}
}
}The parameters.list.item.value group must have a column
for each data type so a parameter can take on one of these value types.
Unused type slots MUST be null. (QUESTION: Should
this support lists/maps?). parameters entries MAY
have unit defined by a controlled vocabulary term CURIE stored in the
parameters.list.item.unit column. As in mzML’s
<userParam />, uncontrolled parameters may also be
included in the parameters list by simply storing a
parameter WITHOUT a value in the
parameters.list.item.accession column.
NOTE: Parquet columns MUST be uniquely
named, so if a parameter is present multiple times in single entry it
MUST be stored in the parameters column.
NOTE: Writers are encouraged to, when sufficient context is available, encode parameters that are present in most rows of a table as columns. This is more space efficient and opens the door to easy predicate filtering.
When a term has a value, if it is stored in the
parameters list, the value MUST be stored using
one of the provided value types as given by the fixed schema. When
storing a term that has been encoded as a separate column, values can be
stored in any data type supported by Parquet. This flexibility allows us
to pick a physical type that uses an appropriate size and precision for
the value being stored, but can also create a lot of redundant code to
handle a column that we expect contains an integer but whether the file
was written using an 8-, 16-, 32-, or 64-bit integer, and whether it was
stored signed or unsigned. Some languages can handle these naturally
using dynamic typing or templates, while others require manually repeat
implementations for each type a column might use.
When representing a controlled vocabulary term concept as a column in the table, the column name SHOULD use the following inflection rules to construct the column name:
${CV_CODE}_${CV_ACCESSION}_${CLEANED_NAME} where:
${CV_CODE} is the identifier for the controlled
vocabulary itself, MS for PSI-MS or UO for the
unit ontology.${CV_ACCESSION} is the accession number.
MS:1000016 for “scan start time” would be
1000016.${CLEANED_NAME} is the term’s name with any non-Parquet
column compatible characters replaced with _. The regular
expression /[^a-zA-Z0-9_\\\\-]+/ is sufficient to match all
of these characters in ASCII. For MS:1000016 for “scan
start time”, this would be MS_1000016_scan_start_timemz to avoid unnecessary additional underscores.
For MS:1000504 for “base peak m/z”, this would be written
MS_1000504_base_peak_mz._unit_${UNIT_CV_CODE}_${UNIT_CV_ACCESSION} to the inflected
name. MS_1000528_lowest_observed_mz_unit_MS_1000040 would
correspond to MS:1000528 “lowest observed m/z” with a unit
of MS:1000040 “m/z”._unit appended to the
column name whose value is the unit’s CURIE.A row value that is null should be treated as being
absent, having no value. If a foreign key column is null,
assume the entry does not exist in the table, as in the case where an
MS2 spectrum is stored without MS1 spectra as in MGF files, or a slice
of an MS run. If it is the primary key of the table, the reader
SHOULD skip of the columns in that row for that table.
A writer implementation is SHOULD to minimize the number of
interspersed rows that are null, but this is not strictly
required. Minimizing the interspersed nulls improves compressability.
See the images below, the “Packed Tables” has all of the rows of each
parallel table contiguous, while the “Sparse Tables” diagram shows rows
of nulls intermixed

It is common in mass spectrometry to talk about a spectrum
having an m/z array as synonymous with having been measured in
the m/z dimension, and those m/z values are represented using some kind
of physical data type in memory, likewise having an intensity array
corresponding to the abundance of the signal parallel to the m/z array.
In mzML, it is possible to use different physical data types for these
two dimensions on different spectra in the same file, and there may well
be legitimate use-cases for that. mzPeak can store array data in two
ways. One way to store the arrays as columns in a signal data layout,
burning the column into the schema and added to the
array index. Another way is to store it as an
auxiliary array which will be stored in the associated
metadata table’s *.auxiliary_arrays value for that entity’s
row. Auxiliary data arrays can be individually configured by the writer,
have custom compression or data type decoding or cvParams, but it cannot
be searched or sliced (read a segment of) without decoding the entire
array, just as in mzML. By contrast, any array that is written as a
column is encoded directly in Parquet, is part of the schema, and
subject to its adaptive encoding process and compression. Currently, we
assume that the first sorted array is the axis around which all other
values are arranged, sorting rank 0, and any arrays that are shorter or
longer SHOULD instead be stored in
auxiliary_arrays as well.
When writing, if an array with a sorting rank is unsorted, the entry’s data arrays MUST be re-sorted accordingly. Failure to do so introduces integrity errors.
In order to properly annotate what kind of array a column
is, we include a JSON-serialized array index in
the Parquet key-value
metadata, an list of data structures that describe each array in
controlled vocabulary. A column is part of the Parquet file’s schema and
must always exist and have a homogenous type of value or a be marked
null for each row. The array index is stored
in the Parquet metadata for the data arrays or peaks files under
<entity_type>_array_index,
e.g. spectrum_array_index for spectra or
chromatogram_array_index.
{
"prefix": "point",
"entries": [
{
"context": "spectrum", // This is an array describing a spectrum
"path": "point.mz", // The path to the column for this array in the Parquet schema
"data_type": "MS:1000523", // The controlled vocabulary term for the data type of this array
"array_type": "MS:1000514", // The controlled vocabulary term for the array itself
"array_name": "m/z array", // A human readable name and a place to store a custom name through `non-standard array`
"unit": "MS:1000040", // The values in this array have the unit m/z
"buffer_format": "point", // This column uses the point layout
"transform": null, // No transformation was applied to this data
"data_processing_id": null, // No specific data processing pipeline was applied, use the default data processing method
"buffer_priority": "primary", // This is the primary m/z array, default all queries to read this column when looking for m/z values
"sorting_rank": 0 // This array's column is assumed to be sorted within entries, with all other arrays' columns sorted afterwards
},
{
"context": "spectrum",
"path": "point.intensity",
"data_type": "MS:1000521",
"array_type": "MS:1000515",
"array_name": "intensity array",
"unit": "MS:1000131", // The unit of this array is detector counts
"buffer_format": "point",
"transform": null,
"data_processing_id": null,
"buffer_priority": "primary",
"sorting_rank": null // This array does not impose any sorting order on the data
}
]
}This array index describes the table shown below for the point layout
Governed by schema/array_index.json.
Depending upon the signal data layout being used, arrays will be stored in different formats.
Available formats: - point: This array is stored using
the point layout. The
point layout is all-or-nothing, every array MUST
be in that format. - chunk_values: This array is part of
the chunked layout. It
contains the list of values of the “main” axis bounded between the
chunk’s start and end point. These values are encoded to be more
compressable. This encoding is in addition to the Parquet
encoding step. - chunk_start: This array is part of the chunked layout. It contains the
starting value of the “main” axis for the chunk, inclusive. -
chunk_end: This array is part of the chunked layout. It contains the
ending value of the “main” axis for the chunk, inclusive. It sjould be
less than the next chunk’s chunk_start value. -
chunk_encoding: This array is part of the chunked layout. It contains a
CURIE indicating how chunk_values was encoded. -
chunk_secondary: This array is part of the chunked layout. It contains the
list of values of an array other than the main axis for the chunk. -
chunk_transform: This array is part of the chunked layout. It contains the
list of raw byte contents of an array in the chunk that was opaquely
transformed, e.g. using MS-Numpress. It may be present in addition to a
referenced chunk_values or chunk_secondary
column.
Currently the chunked layout only supports a single
chunking dimension. A single file in the chunked layout
MUST use chunk_start,
chunk_end, chunk_encoding, and
chunk_values EXACTLY once.
In Parquet, all column names and types need to be known before you
can begin writing, and no two columns can have the same name + path.
Normally, we have a coordinate array column (e.g. m/z or time) and an
intensity array column. If you have intensity arrays with different
units or different data types, they would need to be defined as separate
arrays in the array index and thus have
distinct names. While this case may be uncommon for spectra, when
working with diagnostic traces stored as chromatograms this can be
unavoidable. For ergonomics, we want to use simple column names most of
the time, and it would be ideal if the most common columns had
consistent names as this makes using files from raw Parquet tools
easier. To that end, the most common (as defined by the implementation)
version of each array type SHOULD have a
buffer_priority property of primary and
receive a short and consistent name. The table below lists recommended
short names:
| accession | name | column name |
|---|---|---|
| MS:1000514 | m/z array | mz |
| MS:1000515 | intensity array | intensity |
| MS:1000516 | charge array | charge |
| MS:1000517 | signal to noise array | signal_to_noise |
| MS:1000595 | time array | time |
| MS:1000617 | wavelength array | wavelength |
| MS:1002530 | baseline array | baseline |
| MS:1002529 | resolution array | resolution |
| MS:1002893 | ion mobility array | ion_mobility |
| MS:1003007 | raw ion mobility array | raw_ion_mobility |
| MS:1002816 | mean ion mobility array | mean_ion_mobility |
| MS:1003154 | deconvoluted ion mobility array | deconvoluted_ion_mobility |
| MS:1003008 | raw inverse reduced ion mobility array | raw_inverse_reduced_ion_mobility |
| MS:1003006 | mean inverse reduced ion mobility array | mean_inverse_reduced_ion_mobility |
| MS:1003155 | deconvoluted inverse reduced ion mobility array | deconvoluted_inverse_reduced_ion_mobility |
| MS:1003153 | raw ion mobility drift time array | raw_drift_time |
| MS:1002477 | mean ion mobility drift time array | mean_drift_time |
| MS:1003156 | deconvoluted ion mobility drift time array | deconvoluted_ion_mobility_drift_time |
Parquet can write page indices on any column that is a leaf node in the schema based upon the value being stored prior to applying encoding and compression. To that effect, we must take care when trying to store data cleverly. The following section may refer to spectra, but these are applicable more broadly.
When storing spectrum data, some vendors will produce arrays with lots of “empty” regions filled with zero intensity values along a semi-regularly spaced m/z axis. These regions hold little information, so all but the first and last zero intensity points are removed. This is only meaningful for profile data. Readers SHOULD assume that zero runs have been stripped.
For spectra with many small gaps, even zero run stripping leaves too
much unhelpful information in the data. We can instead replace the
flanking zero intensity points with null m/z and intensity
values and Parquet will skip storing the expensive 32- and/or 64-bit
values, retaining only the validity buffer bit flag. We can separately
fit a simple m/z spacing model using weighted least squares of the
form:
δmz ∼ β0 + β1mz + β2mz2 + ϵ
or using the following Python code:
class DeltaCurveRegressionModel:
beta: np.ndarray
def __init__(self, beta: np.ndarray):
self.beta = beta
@classmethod
def fit(
cls,
mz_array,
delta_array,
weights: np.ndarray | None = None,
threshold: float | None = None,
rank: int = 2,
):
if weights is None:
weights = np.ones(len(mz_array))
else:
weights = weights
if threshold is None:
threshold = 1.0
# Drop all entries where the gap between m/z values > threshold
raw = mz_array[1:][delta_array <= threshold]
w = weights[1:][delta_array <= threshold]
y = delta_array[delta_array <= threshold]
# Build the design matrix
data = [data.append(np.ones_like(raw))]
for i in range(1, rank + 1):
data.append(raw**i)
data = np.stack(data, axis=-1)
# Use the QR decomposition to solve the weighted least squares problem
# to estimate weights predicting δ m/z.
# https://stats.stackexchange.com/a/490782/59613
chol_w = np.sqrt(w)
qr = np.linalg.qr(chol_w[:, None] * data)
v = qr.Q.T.dot(chol_w * y)
beta = solve_triangular(qr.R, v)
# Numerically equivalent to and more stable than the direct inversion
# beta = np.linalg.inv((data.T * w).dot(data)).dot(data.T * w).dot(y)
return cls(beta)
def predict(self, mz: float) -> float:
acc = self.beta[0]
for i in range(1, len(self.beta)):
acc += self.beta[i] * mz ** i
return accThen when reading the the null-marked data, use either the local 2nd
median δmz
or the learned model for that spectrum to compute the m/z spacing for
singleton points to achieve a very accurate reconstruction. Because the
non-zero m/z points remain unchanged, the reconstructed signal’s peak
apex or centroid should be unaffected. If the peak is composed of only
three points including the two zero intensity spots, no meaningful peak
model can be fit in any case so the miniscule angle change this would
induce are still effectively lossless. The parameters of the model
learned for each entry MUST be stored in the relevant entry’s
row in the associated metadata table as mz_delta_model
(QUESTION: should this become a CV param and dissociate it from m/z?
Probably).

Keep in mind that all Numpress compression methods are still available and still provide superior size reduction, but carry this slightly larger loss of accuracy. Using a Numpress compression is a transformation that requires the Chunked Layout.
A zero run is defined explicitly as a sequence of 3 or more zero
values that should be reduced to just the first and last positions in
the run. Zero runs can be very long and outside of certain scenarios
which assume a complete grid of coordinate values, provide no value. If
zero runs need to be reconstructed beyond the flanking points left at
the end of these runs, the same method used to fill in
nulls here can be used to extend zero runs.
def find_where_not_zero_run(data: Sequence[Number]) -> Sequence[int]:
"""
Construct a list of positions that are not part of a zero run.
A zero run is any position *i* such that:
1. ``x[i] == 0``
2. ``(i == 0) or (x[i - 1] == 0)``
3.``(i == (len(x) - 1)) or (x[i + 1] == 0)``
We build a position list here because we need to extract these positions
from ALL dimension arrays for this entity, not just the current array.
Parameters
----------
data : :class:`Sequence` of :class:`Number`
The numerical data to traverse
Returns
-------
:class:`np.ndarray` of :class:`np.uintp`
"""
n = len(data)
n1 = n - 1
was_zero = False
acc = []
i = 0
while i < n:
v = data[i]
if v is not None:
if v == 0:
if (was_zero or (len(acc) == 0)) and ((i < n1 and data[i + 1] == 0) or i == n1):
pass
else:
acc.append(i)
was_zero = True
else:
acc.append(i)
was_zero = False
else:
acc.append(i)
was_zero = False
i += 1
return np.array(acc, dtype=np.uintp)Zero intensity points on the sides of peaks still use non-trivial
amounts of storage in sparse datasets. This step does not match
all zero intensity points, only those that occur on the flanks of
profile peaks. Once these indices are found, they can be used to
construct the null mask or “validity bitmap” of an Arrow
array which is equivalent to how a Parquet column chunk would be
constructed.
def is_zero_pair_mask(data: Sequence[Number]) -> "np.typing.NDArray[np.bool_]":
'''
Create a boolean mask for positions that are composed of two zeroes in a row.
Parameters
----------
data : :class:`Sequence` of :class:`Number`
The numerical data to traverse
Returns
-------
:class:`np.ndarray` of :class:`bool`
'''
n = len(data)
n1 = n - 1
was_zero = False
acc = []
for i, v in enumerate(data):
if v == 0:
if was_zero or (i < n1 and data[i + 1] == 0):
acc.append(True)
else:
acc.append(False)
was_zero = True
else:
acc.append(False)
was_zero = False
return np.array(acc)Decoding null pairs, the process of undoing null marking involves finding
regions bounded between two nulls, or the start of the
array and a null, or a null and the end of the
array, then filling the null values with either a locally estimated
value when you have more than one value to estimate the median delta
from, or to use the regression model described ealier to impute the
value for a single point. Unpaired null values MAY
only be found as the first or last null in the array, any
other unpaired nulls are unrecoverable errors. A run of
three or more null values is encountered, it MAY
be recoverable but should not occur under normal operation.
The locally estimated value SHOULD be the second median of
the spacing of the current segment’s non-null values. The
regression model is used to predict the spacing from the non-null value
within a segment with only one non-null value.
def find_pairs(mask: Sequence[bool]) -> Sequence[int]:
"""
Construct index ranges between pairs of :const:`True` values in
``mask``.
The first and last index range will include the beginning and ending
of the array respectively, even if the mask does not start/end with a
:const:`True` value.
The resulting array will have two columns, the start and end indices
of the spans between two :const:`True` values (or the termini of the array).
.. warning::
This function *can* fail or produce incorrect output if there are runs of
:const:`True` values longer than 2 in the ``mask``. It is also necessary to
buffer all data for an entry before invoking this function so that batches
do not artificially disrupt pairs.
Parameters
----------
mask : :class:`Sequence` of :class:`bool`
Returns
-------
np.typing.NDarray[int]
"""
parts = []
indices = np.where(mask)[0]
if len(indices) == 0:
return np.array([[0, len(mask)]])
if indices[0] != 0:
parts.append([0])
parts.append(indices)
if indices[-1] != len(mask) - 1:
parts.append([len(mask) - 1])
indices = np.concat(parts)
indices = indices.reshape((-1, 2))
indices[:, 1] += 1
return indices
def estimate_median_delta(data: Sequence[Number]) -> tuple[Number, np.typing.NDArray]:
"""
Find the 2nd median of ``np.diff(data)``.
This is a relatively crude spacing estimate for continuous profile data.
Returns
-------
:class:`Number`
The 2nd median of ``np.diff(data)``
:class:`np.ndarray`
The values from which the previous return values were estimated
"""
deltas = np.diff(data)
median = np.median(deltas)
deltas_below = deltas[deltas <= median]
median = np.median(deltas_below)
return median, deltas_below
def fill_nulls(
data: pa.Array, common_delta: DeltaModelBase
) -> "np.typing.NDArray":
"""
Fill ``null`` values in ``data`` using the ``common_delta`` model or the locally estimated
median delta if sufficient data are available.
Parameters
----------
data : :class:`pyarrow.Array`
The data array to fill nulls in with ``common_delta``
common_delta : :class:`DeltaModelBase` or :class:`Number`
The common spacing model, either specified as a model instance or as a single constant spacing term.
Returns
-------
np.ndarray
"""
if not isinstance(common_delta, DeltaModelBase):
if isinstance(common_delta, Number):
common_delta = ConstantDeltaModel(common_delta)
else:
common_delta = DeltaCurveRegressionModel(np.asarray(common_delta))
pair_indices = find_pairs(data.is_null())
chunks = []
for (start, end) in pair_indices:
# Get the values in the array between start and end
chunk = np.asarray(data.slice(start, end - start))
n = len(chunk)
# The set of values that are not null
has_real = chunk[~np.isnan(chunk)]
n_has_real = len(has_real)
if n_has_real == 1:
# If there is only one non-null value, this is a singleton
# point, but it might only have one or two sides to pad
if n == 2:
if np.isnan(chunk[0]):
chunk[0] = chunk[1] - common_delta(chunk[1])
else:
chunk[1] = chunk[0] + common_delta(chunk[0])
elif n == 3:
dx = common_delta(chunk[1])
chunk[0] = chunk[1] - dx
chunk[2] = chunk[1] + dx
else:
raise Exception()
else:
# Otherwise this is a run of values, so we can estimate a more accurate
# delta directly from the data
dx, _ = estimate_median_delta(has_real)
if np.isnan(chunk[0]):
chunk[0] = chunk[1] - dx
if np.isnan(chunk[-1]):
chunk[-1] = chunk[-2] + dx
chunks.append(chunk)
return np.concat(chunks)Unless otherwise noted, readers SHOULD treat
null values in the sorting rank 0 array of the entry as
governed by this model with parallel null values in any
intensity arrays as 0. The former should have a
transformation value of MS:1003901 and the latter should have a
transformation value MS:1003902. All other values for those points
should be read as-is with null semantics meaning that the value was
absent. Writers using null marking SHOULD only use
null for the first sorting dimension and associated
intensity value, all other columns should be written as-is.
When an array is present in an entry, but is not encoded as a column
in the schema, it must be stored as an auxiliary array. This can happen
when mixing different kinds of detectors in a single collection, or
especially with diagnostic traces where every array might have different
dimensions along a shared time axis or subsampled arrays. Auxiliary data
arrays have a schema similar to binaryDataArray
in mzML, encoded in Parquet. They are described in JSON schema at schema/auxiliary_array.json
optional group auxiliary_arrays (List) {
repeated group list {
optional group item {
optional group data (List) {
repeated group list {
required int32 item (Int(bitWidth=8, isSigned=false));
}
}
optional group name {
optional group value {
optional int64 integer;
optional double float;
optional binary string (String);
optional boolean boolean;
}
optional binary accession (String);
optional binary name (String);
optional binary unit (String);
}
optional binary data_type (String);
optional binary compression (String);
optional binary unit (String);
optional group parameters (List) {
repeated group list {
optional group item {
optional group value {
optional int64 integer;
optional double float;
optional binary string (String);
optional boolean boolean;
}
optional binary accession (String);
optional binary name (String);
optional binary unit (String);
}
}
}
optional binary data_processing_ref (String);
}
}
}
When storing data arrays, the point layout stores the data as-is in
parallel arrays alongside a repeated index column. The top-level node is
named point and it is a group with an arbitrary number of
columns. The entity index column MUST be the first column under
point.

| point | ||
|---|---|---|
| spectrum_index | mz | intensity |
| 1 | 213.2 | 1002 |
| 1 | 506.9 | 500 |
| 1 | 758 | 405 |
| … | … | … |
| 2 | 329.1 | 50 |
| 2 | 516.5 | 5002 |
| 2 | 783.8 | 302 |
This layout is simple, but carries several advantages. Scalar columns are easily filtered along the page-level range index. This makes multi-dimensional queries easier to write and optimize. The arrays are transparently encoded and compressed by Parquet, so the data may still be stored compactly. The data must be stored as-is in order to use the page index so no additional obscuring transformations can be used.
When storing data arrays, the chunked layout treats one array, which
must be sorted, as the “main” axis, cutting the array into chunks of a
fixed size along that coordinate space (e.g. steps of 50 m/z) and taking
the same segments from parallel arrays. The main axis chunks’ start,
end, and a repeated index are recorded as columns, and then each array
may be encoded as-is or with an opaque transform (e.g. delta-encoding,
Numpress). The start and end interval permits granular random access
along the main axis as well as the source index. The top-level node is
named chunk and it has a layout as shown below. The entity
index column MUST be the first column under
chunk.

| chunk | |||||
|---|---|---|---|---|---|
| spectrum_index | mz_chunk_start | mz_chunk_end | mz_chunk_values | chunk_encoding | intensity |
| 1 | 200 | 250 | [0.0013, …, 0.0013] | MS:1003089 | […] |
| 1 | 250 | 300 | [0.0014, …, 0.0014] | MS:1003089 | […] |
| 1 | 500 | 550 | [0.0014, …, 0.0015] | MS:1003089 | […] |
| … | … | … | … | … | … |
| 2 | 200 | 250 | [0.0013, …, 0.0013] | MS:1003089 | […] |
| 2 | 350 | 400 | [0.0014, …, 0.0014] | MS:1003089 | […] |
| 2 | 400 | 450 | [0.0013, …, 0.0014] | MS:1003089 | […] |
This example uses a delta-encoding for the m/z array chunks’ values,
which can be efficiently reconstructed with very high precision for
64-bit floats. The m/z values within the mz_chunk_values
list aren’t accessible to the page index, but the
_chunk_start and _chunk_end columns are. The
chunk values are still subject to Parquet encodings so they can be byte
shuffled as well which further improves compression.
The chunked layout’s column naming rules:
<entity>_index (integer): The index key for the
entity this chunk belongs to.<array_name>_chunk_start (float64): The first
coordinate value in this chunk, where it starts at inclusively.
This array’s entry in the array index’s
buffer_format MUST be
chunk_start.<array_name>_chunk_end (float64): The last
coordinate value in this chunk, where it ends at inclusively.
This array’s entry in the array index’s
buffer_format MUST be chunk_end.<array_name>_chunk_values (list): The encoded
coordinates from array_name according to the
chunk_encoding column. This array’s entry in the array index’s
buffer_format MUST be
chunk_values.chunk_encoding (CURIE): The method by which
<array_name>_chunk_values were encoded. See Chunk Encodings for more details. This
array’s entry in the array index’s
buffer_format MUST be
chunk_encoding.All other columns are expected to be list arrays whose
names are simply their array_name as described in the index
with buffer_format chunk_secondary, or
surrogate arrays with the buffer format
chunk_transform.
The process for constructing a chunk table for a signal entry may
break in any pattern so long as the chunks are non-overlapping and
ascending. The chunking procedure needs to be null-aware,
particularly aware of null pairs used to denote masked
regions. An algorithm for producing equal width chunks is given below.
The granularity of the chunking is configurable, trading off random
access granularity vs. compression efficiency.
import pyarrow as pa
def null_chunk_every(data: pa.Array, width: float) -> list[tuple[int, int]]:
"""
Partition a sorted numerical array into segments spanning `width` units.
This operation is null-aware, so sparse arrays can be partitioned.
Parameter
---------
data : pa.Array
The data to be partitioned
width : float
The spacing (in units along the data dimension) between chunks
Returns
-------
list[tuple[int, int]]
The start and end index of each chunk
"""
start = None
n = len(data)
i = 0
# Find the first non-null position
while i < n:
v = data[i]
if v.is_valid:
start = v.as_py()
break
else:
i += 1
# If we never found a non-null position, just return a single chunk
if start is None:
return [(0, n)]
chunks = []
offset = 0
threshold = start + width
i = 0
while i < n:
v = data[i]
if v.is_valid:
v = v.as_py()
if v > threshold:
if ((i + 1) < n) and (not data[i + 1].is_valid):
while ((i + 1) < n) and (not data[i + 1].is_valid):
i += 1
# We don't want to create a chunk of length 1, especially not if it is a null
# point. If not, we have to relax the width requirement.
if i - offset > 1:
chunks.append((offset, i))
offset = i
# Update the threshold. We might need to update multiple times if the next value
# is far away.
while threshold < v:
threshold += width
# Look ahead and see if the next value is not null since this one is.
elif ((i + 1) < n) and (data[i + 1].is_valid):
i += 1
v = data[i].as_py()
if v > threshold:
i -= 1
chunks.append((offset, i))
offset = i
# Update the threshold. We might need to update multiple times if the next value
# is far away.
while threshold < v:
threshold += width
i += 1
if offset != n:
chunks.append((offset, n))
return chunksChunk Encoding Controlled Vocabulary: MS:1000576|no compression
When storing centroids or data that are not similarly spaced as is usually the case for pre-centroided spectra, but still want to use the chunked layout, no special encoding of the chunked values is necessary. The values within each chunk are written as-is to the chunk values array. This doesn’t improve compressibility but it maintains a consistent schema for other entries that would benefit from a different encoding.
Note: The start point is excluded from the chunk values array.
Chunk Encoding Controlled Vocabulary: MS:1003089|truncation, delta prediction and zlib compression
When working with data that is laid out on a locally (almost) uniform grid using 64-bit floats, it is possible to improve compression by computing a delta encoding of the coordinates.
Note: The start point is excluded from the chunk values array.
import pyarrow as pa
def null_delta_encode(data: pa.Array) -> pa.Array:
"""
Delta-encode an Arrow array containing nulls. Nulls are encoded as null values, and treated as 0.0
for the purposes of computing the next delta.
Parameters
----------
data : pa.Array
The data to delta encode
Returns
-------
pa.Array
"""
acc = []
it = iter(data)
# Get the first entry in the array. It will be the first point of reference but not part
# of the delta sequence unless it is `null`
last = next(it)
if not last.is_valid:
acc.append(last)
for item in it:
# If the value isn't `null`,
if item.is_valid:
val = item.as_py()
# Compute a delta relative to the last item if it was not `null`
if last.is_valid:
acc.append(pa.scalar(val - last.as_py()))
# otherwise treat the last value as 0.0, the additive identity
else:
acc.append(item)
# Update last item
last = item
else:
# Append the `null` unmodified and update the last item.
acc.append(item)
last = item
return pa.array(acc)
def null_delta_decode(data: pa.Array, start: pa.Scalar) -> pa.Array:
"""
Decode an Arrow array that was delta-encoded *with* nulls.
This is necessarily a copying operation.
Parameters
----------
data : pa.Array
The data to be decoded.
start : pa.Scalar
The starting value, an offset
Returns
-------
pa.Array
"""
acc = []
# If the first value is `null`,
if not data[0].is_valid:
# and the second value is `null`,
if not data[1].is_valid:
# then append the `start` value, we started at a non-null value immediately followed by a null pair.
acc.append(start)
start = pa.scalar(None, data.type)
else:
# otherwise use the starting point
acc.append(start.as_py())
last = start
for item in data:
# if the current point is valid
if item.is_valid:
val = item.as_py()
# and the last is valid
if last.is_valid:
# reconstitute the delta encoded value at this position
last = pa.scalar(val + last.as_py())
acc.append(last)
else:
# otherwise the last value is assumed to be zero so it does
# not need to be adjusted
acc.append(item)
last = item
else:
# otherwise this position is null and we carry it forward as such
acc.append(item)
last = item
return pa.array(acc)Chunk Encoding Controlled Vocabulary: MS:1002312|MS-Numpress linear prediction compression
This uses the Numpress linear prediction method (10.1074/mcp.O114.037879) to compress the chunk’s values as raw bytes. Numpress creates a buffer containing an 8 byte fixed point, 4 byte value 0, 4 byte value 1, followed by 2 byte residuals for all subsequent values in the array. This means that the array is by definition not alignable to a 4- or 8-byte type. It also has no concept of nullity, which means it is not compatible with null marking.
To store Numpress linear encoded arrays, an extra column is added
<array_name>_numpress_linear_bytes is added alongside
the <array_name>_chunk_values column. It is a list of
byte arrays (large_list<u8> in Arrow parlance,
not large_binary, see discussion
of string type “optimization”). The array index entry for this
column MUST have buffer_format of
chunk_transform and the same array type, array name, data
type, unit, and data processing ID as the _chunk_values
column. The transform field in the index MUST be
MS:1002312.
Note: The start point is INCLUDED from the chunk values array. It is a specific component of the Numpress encoded bytes.
In some cases, we might prefer to store data lossily in non-uniform,
unaligned, or otherwise non-standard data types that don’t have a
physical representation in Parquet. MS-Numpress’s short logged float
(SLOF) and positive integer encodings are good examples of these cases.
While the Numpress-Linear chunk encoding works for the coordinate
dimension, we also support using other opaque transformations to encode
the secondary arrays in chunks. These columns are also recorded in the
array index with buffer_format =
chunk_transform, the transform field in the
index must be the CURIE for the relevant encoding method, such as MS:1002314
for MS-Numpress’s SLOF encoding. The column’s data type MUST be
a list of byte arrays, though the type in the array index
MUST be the decoded array’s real type after decoding. The
column names SHOULD be of the form
<array_name>_<transform_name>_bytes,
e.g. intensity_numpress_slof_bytes.
To read a single entry (e.g. spectrum, chromatogram) that is stored in chunks, the following procedure may be used:
chunk_start,
chunk_end, chunk_encoding, and
chunk_values in the array index. The
<entity_type>_index column MUST be
the first column in the table, so it always has index 0.<entity_type>_index
column equals the entry’s index.chunk_start column’s value in appropriate order, usually
ascending, for the quantity being measured.chunk_values
column according to the chunk_encoding column and any
transform listed in the relevant array index.
Unpack the chunk_secondary columns and process any
tranforms as necessary, accumulating each column’s data arrays across
rows. Processing transforms may require additional information from the
entry_type’s metadata table.Couldn’t we just unwrap the top-level struct and move on with things?
Perhaps, but the top-level structure leaves the door open for two use-cases:
point at the root
of the schema, you know this is a point
layout, not a chunked layout
file.mzpeak_index.jsonAn mzPeak archive is made up of multiple named files. To leave room for future files and avoid having to do complicated file name resolution, we use an index file that identifies the contents of each file. This broadly defines the kinds of schemas those files might have. The file MUST be serialized with UTF8.
TODO: Add wavelength files to examples
{
"files": [
{
"name": "spectra_data.parquet",
"entity_type": "spectrum",
"data_kind": "data arrays"
},
{
"name": "spectra_metadata.parquet",
"entity_type": "spectrum",
"data_kind": "metadata"
},
{
"name": "chromatograms_data.parquet",
"entity_type": "chromatogram",
"data_kind": "data arrays"
},
{
"name": "chromatograms_metadata.parquet",
"entity_type": "chromatogram",
"data_kind": "metadata"
}
],
"metadata": {
"version": "0.9.0",
"cv_list": [
{"id": "MS", "full_name": "Proteomics Standards Initiative Mass Spectrometry Ontology", "uri": "http://purl.obolibrary.org/obo/ms/4.1.248/ms.obo", "version": "4.1.248"},
{"id": "UO", "full_name": "Units of measurement ontology", "uri": "http://purl.obolibrary.org/obo/uo/releases/2026-01-16/uo.obo", "version": "2026-01-16"}
],
"file_description": {...},
}
}Governed by JSONSchema schema/mzpeak_index.json
The data_kind and entity_kind fields are
loose enumerations. They are expected to grow over time.
The file level metadata
SHOULD be stored in mzpeak_index.metadata and the
metadata Parquet files’ key-value pairs as JSON-encoded metadata
governed by the associated schemas below:
cv_listfile_descriptioninstrument_configuration_listdata_processing_method_listsoftware_listsample_listscan_settings_listrunQUESTION: Anything we put in the mzpeak_index.json is
necessarily shown in cleartext to all readers unless ZIP encryption is
used, but ZIP encryption is well known to be flawed and inconsistent.
Anything in a Parquet file’s footer key-value pairs is encryptable. The
index is JSON for convenience. We could use Parquet for that
too, but it’s overkill and harder for scripting languages to get at.
The data_kind field tells the reader the semantics of
the data stored in this file, and approximately what kind of schema to
expect.
There are currently 5 controlled values for
data_kind:
data arrays: Expected to use one of the point or chunked
layout. These files contain the signal data, usually in its “raw” form,
for the entity_type being described.peaks: This, like data arrays, is expected
to use the point or chunked layout as well. Where
data arrays might store any kind of signal data,
peaks implies that the data are processed and that there
exists an entry in data arrays that is less refined. This
is useful when storing both profile and centroid signal for a spectrum,
for example.metadata: Expected to use the packed parallel table
layout. This describes the entity’s metadata, everything but the
homogenous signal arrays stored in the data arrays file.
This file may still be large.proprietary: The layout and schema of this file is
entirely the purview of the writer which may be an instrument vendor.
These files should be ignored unless the reader is for that
instrument vendor. It may not be a Parquet file. Instrument vendors are
encouraged to use this classification on binary files or other difficult
to digest contents. Text or XML configuratin files may still be of
interest to the broader community in an evolving metadata
landscape.other: The file is none of the other listed types. This
may not be a Parquet file.Any value outside of these is assumed to be treated as
other. Files labeled as other. Any files
treated as other data kinds are implementation defined, as
are proprietary files, but other files may be
still be of interest to non-vendor readers.
Data KindThis list is necessarily incomplete as new use cases are likely to emerge. For instance, it might be desirable to store extracted LC-(IM)-MS feature bounding boxes as a separate file.
feature map for extracted featuresspectrum, but there isn’t a one-to-one or one-to-many
relationship between spectra and LC-MS, so a new entity type might be
needed.The entity_type tells the reader what is being
described in this file, in concert with the data_kind. This
makes helps the reader connect the right file to the right API.
There are currently 3 controlled values for
entity_type
spectrum: The file describes mass spectra, entities
defined as occuring at a singular point in time, or as semantically
close to this as possible in the face of framed or cycled acquisition,
with a mass-related unit like m/z or neutral mass as the coordinate of
measure.chromatogram: The file describes chromatograms or other
measurements over time like diagnostic traces. QUESTION: should
these be called “traces” instead?wavelength spectrum: Similar to spectrum,
except that the unit of measure is an electromagnetic wavelength
measurement. There is substantially more heterogeneity in the types of
analyzers that measure wavelengths compared to mass analyzers.
Additionally, time series may be constructed around wavelengths but
stored as chromatogram entries.other: The file is none of the other listed types. This
may describe something not yet covered by the living specification.Any value outside of these is assumed to be treated as
other.
Entity TypeTODO: Expand this
spectra_data.parquetFile index entry:
{
"name": "spectra_data.parquet",
"entity_type": "spectrum",
"data_kind": "data arrays"
}The spectrum signal data is encoded using either point layout or chunked layout. The entity index
column MUST be named spectrum_index, and if a time
column is written alongside it, it SHOULD be named
spectrum_time. Non-mass spectra like UV or DAD spectra
should be written in the wavelength_spectra_data.parquet
file.
When using null marking, follow the null semantics for signal data with care for profile data.
Only profile spectra should be written to this file, centroid
spectra, or processed, centroid views of profile spectra when storing
both modes should be written to the peak data file.
The number of points written to this file for a particular spectrum
MUST be written to the spectrum.MS_1003060_number_of_data_points
column in the spectra_metadata.parquet
file to facilitate appropriate reading operation planning.
When selecting a Parquet encoding for columns, favor:
spectrum_index: delta
encoding. This encoding is ideal for repetitive or slowly increasing
integer series.spectrum_time: byte
stream splitspectra_peaks.parquetFile index entry:
{
"name": "spectra_peaks.parquet",
"entity_type": "spectrum",
"data_kind": "peaks"
}The spectrum peak lists separately stored from the raw signal stored
in spectra_data.parquet.
The entity index column MUST be named
spectrum_index, and if a time column is written alongside
it, it SHOULD be named spectrum_time. Any centroid
spectra MUST be written to this file, not spectra_data.parquet.
The number of peaks written for a given spectrum in this file
MUST be written to the spectrum.MS_1003059_number_of_peaks
column in the spectra_metadata.parquet
file to facilitate reading operation planning.
File index entry:
{
"name": "spectra_metadata.parquet",
"entity_type": "spectrum",
"data_kind": "metadata"
}This metadata table uses the packed parallel metadata
table schema. The parallel schemas are shown below. The general
order of columns in unspecified, but spectrum.index,
scan.source_index, precursor.source_index, and
selected_ion.source_index MUST be the first column
of their respective schemas. Wherever these lists say MAY, that
value may either be stored as a column or as an entry in the parameter list but a column tends to
make more sense if it is usually present.
spectrum (group)
index (uint64): The ascending 0-based index. This
MUST be incrementally increasing by 1 per entry and
SHOULD be written in time-sorted ascending order (QUESTION:
Should there be a CV to denote this state?). This is the “primary key”
for the spectrum schema, making it the root unit of
addressability.id (string): The “nativeID” string identifier for the
spectrum, formatted according to to a native identifier
format. The specific nativeID format SHOULD be specified in
the file level metadata under
.file_description.source_files[0].parameters, as in
mzML.time (float64): The starting time for data acquisition
for this spectrum. This value SHOULD be replicated from the
parallel scan schema for simpler filtering, but when a
spectrum has multiple scans, it SHOULD refer to the minimum
value if the run is defined in acquisition time order.MS_1000511_ms_level
(integer): Stage number achieved in a multi stage mass spectrometry
acquisition as an integer, or null for non-mass
spectra.data_processing_ref (string): The identifier for a
data_processing that governs this spectrum if it deviates
from the default method specified in file
level metadata under .run.default_data_processing_id,
null otherwise.parameters (list): A list of controlled or uncontrolled
parameters that describe this spectrum. See the parameter list section for more
details.number_of_auxiliary_arrays (integer): The number of
auxiliary arrays that are stored with this row’s
auxiliary_arrays column. This is useful for quickly telling
if a reader needs to go through the more expensive process of reading
these and decoding them.auxiliary_arrays (list): A list of structures that
describe a data array that did not fit within the constraints as
described in the arrays and columns
section. These may be large and care should be used in deciding to
eagerly load them (or not). They are described in the auxiliary data arraysmz_delta_model (list of float64): A list of
double-precision floating point values that parameterize the m/z delta
model for reconstructing null marked
dataMS_1000525_spectrum_representation (CURIE):
MS_1000465_scan_polarity
(integer): The polarity of the spectrum represented as either a
1 (positive), -1 (negative), or
null. (QUESTION: This could also just be a CURIE)MS_1000559_spectrum_type
(CURIE): A child of MS:1000559:
MS_1003060_number_of_data_points
(integer): The number of profile data points stored for this spectrum in
the spectra_data.parquet
file.MS_1003059_number_of_peaks
(integer): The number of discrete peaks stored for this spectrum in the
spectra_peaks.parquet
file.scan (group): Scan or acquisition from original raw
file used to create a spectrum.
source_index (uint64): The index of the
spectrum this scan belongs to. This is a foreign key.scan_index (uint64): The ascending 0-based index. This
MUST be incrementally increasing by 1 per entry. This is used
to uniquely identify a scan, especially when there are multiple scans
per spectrum as in some summing or averaging scenarios.spectrum_reference (string): Refers to another spectrum
that corresponds to this scan. For spectra local to this archive, its
id attribute is used. For spectra from external
sources, a USI SHOULD
be used. When the collection is not published, use
USI000000 for the collection identifier and the
id of a source file in the
@file_description.source_filesinstrument_configuration_ref (integer): The identifier
for instrument_configuration that governs this scan.parameters (list): A list of controlled or uncontrolled
parameters that describe this scan. See the parameter list section for more
details.ion_mobility_value (float64): Optional, the ion
mobility measurement for this scan.ion_mobility_type (CURIE): Optional, the kind of ion
mobility being measured using a child of MS:1002892.scan_windowsprecursor (group): The method of precursor ion
selection and activation
source_index (uint64): The index of the
spectrum this precursor record belongs to. This is a foreign key.precursor_index (uint64): The index of the
spectrum that is the precursor record was created from. This is a
foreign key.precursor_id (string): The id of the
spectrum referenced by precursor_index.isolation_window (group): The isolation (or
‘selection’) window configured to isolate one or more ions
parameters (list)activation (group): The type and energy level used to
activate or dissociate this precursor.
parameters (list): A list of controlled or uncontrolled
parameters that describe this precursor’s activation. See the parameter list section for more
details.selected_ion (group): An ion isolated for dissociation.
source_index (uint64): The index of the
spectrum this selected ion belongs to. This is a foreign key.precursor_index (uint64): The index of the
spectrum that is the selected ion was created from. This is a foreign
key.ion_mobility_value (float64): Optional, the ion
mobility measurement for this ion.ion_mobility_type (CURIE): Optional, the kind of ion
mobility being measured using a child of MS:1002892.parameters (list): A list of controlled or uncontrolled
parameters that describe this selected ion. See the parameter list section for more
details.QUESTION: Is there a better way to make ion mobility storage generic over type (“ion mobility drift time”, “inverse reduced ion mobility”, “FAIMS compensation voltage”)?
chromatograms_data.parquetFile index entry:
{
"name": "chromatograms_data.parquet",
"entity_type": "chromatogram",
"data_kind": "data arrays"
}The chromatogram signal data is encoded using either point layout or chunked layout. The entity index
column MUST be named chromatogram_index. The
default primary axis for this type of signal data is a MS:1000595|time array,
though the unit is up to the writer.
When selecting a Parquet encoding for columns, favor:
chromatogram_index: delta
encoding. This encoding is ideal for repetitive or slowly increasing
integer series.chromatograms_metadata.parquetFile index entry:
{
"name": "chromatograms_metadata.parquet",
"entity_type": "chromatogram",
"data_kind": "metadata"
}chromatogram (group)
index (integer): The ascending 0-based index. This
MUST be incrementally increasing by 1 per entry and
SHOULD be written in time-sorted ascending order (QUESTION:
Should there be a CV to denote this state?). This is the “primary key”
for the chromatogram schema, making it the root unit of
addressability.id (string): A unique string identifier for the
chromatogram.MS_1000465_scan_polarity
(integer): The polarity of the chromatogram represented as either a
1 (positive), -1 (negative), or
null. (QUESTION: This could also just be a CURIE)data_processing_ref (integer): The identifier for a
data_processing that governs this chromatogram if it
deviates from the default method specified in file level metadata under
.run.default_data_processing_id, null
otherwise.parameters (list): A list of controlled or uncontrolled
parameters that describe this chromatogram. See the parameter list section for more
details.number_of_auxiliary_arrays (integer): The number of
auxiliary arrays that are stored with this row’s
auxiliary_arrays column. This is useful for quickly telling
if a reader needs to go through the more expensive process of reading
these and decoding them.auxiliary_arrays (list): A list of structures that
describe a data array that did not fit within the constraints as
described in the arrays and columns
section. These may be large and care should be used in deciding to
eagerly load them (or not). They are described in the auxiliary data arraysMS_1003060_number_of_data_points
(integer): The number of data points stored for this chromatogram in the
chromatograms_data.parquet
file.precursor (group): The method of precursor ion
selection and activation
source_index (integer): The index of the
chromatogram this precursor record belongs to. This is a foreign
key.precursor_index (integer): The index of
the chromatogram that is the precursor record was created from. This is
a foreign key.precursor_id (string): The id of the
chromatogram referenced by precursor_index.isolation_window (group): The isolation (or
‘selection’) window configured to isolate one or more ions
parameters (list)activation (group): The type and energy level used to
activate or dissociate this precursor.
parameters (list): A list of controlled or uncontrolled
parameters that describe this precursor’s activation. See the parameter list section for more
details.selected_ion (group): An ion isolated for dissociation.
source_index (integer): The index of the
chromatogram this selected ion belongs to. This is a foreign key.precursor_index (integer): The index of
the chromatogram that is the selected ion was created from. This is a
foreign key.ion_mobility_value (float): Optional, the ion mobility
measurement for this ion.ion_mobility_type (CURIE): Optional, the kind of ion
mobility being measured using a child of MS:1002892.parameters (list): A list of controlled or uncontrolled
parameters that describe this selected ion. See the parameter list section for more
details.File index entry:
{
"name": "wavelength_spectra_data.parquet",
"entity_type": "wavelength spectrum",
"data_kind": "data arrays"
}The wavelength spectrum signal data is encoded using either point layout or chunked layout. The entity index
column MUST be named wavelength_spectrum_index,
and if a time column is written alongside it, it SHOULD be
named wavelength_spectrum_time. This SHOULD only
be present if wavelength spectra are included in the mzPeak archive.
When using null marking, follow the null semantics for signal data with care for profile data.
wavelength_spectra_metadata.parquetFile index entry:
{
"name": "wavelength_spectra_metadata.parquet",
"entity_type": "wavelength spectrum",
"data_kind": "metadata"
}This metadata table uses the packed parallel metadata
table schema. This should only be present if wavelength spectra are
included in the mzPeak archive. The parallel schemas are shown below.
The general order of columns in unspecified, but
spectrum.index and scan.source_index
MUST be the first column of their respective schemas. Wherever
these lists say MAY, that value may either be stored as a
column or as an entry in the parameter
list but a column tends to make more sense if it is usually present.
It mirrors the spectrum
metadata layout, but does not include a precursor or
selected_ion facet because EMR spectra have not been
observed with isolation and fragmentation yet.
This metadata is stored separately from the mass spectra, allowing the two different data modalities to have divergent schemas without inflating the number of empty columns, and for ease of searching, so the reader does not need to sort through mass spectra while looking for EMR spectra and vice-versa.
spectrum (group)
index (uint64): The ascending 0-based index. This
MUST be incrementally increasing by 1 per entry and
SHOULD be written in time-sorted ascending order (QUESTION:
Should there be a CV to denote this state?). This is the “primary key”
for the spectrum schema, making it the root unit of
addressability.id (string): The “nativeID” string identifier for the
spectrum, formatted according to to a native identifier
format. The specific nativeID format SHOULD be specified in
the file level metadata under
.file_description.source_files[0].parameters, as in
mzML.time (float64): The starting time for data acquisition
for this spectrum. This value SHOULD be replicated from the
parallel scan schema for simpler filtering, but when a
spectrum has multiple scans, it SHOULD refer to the minimum
value if the run is defined in acquisition time order.data_processing_ref (string): The identifier for a
data_processing that governs this spectrum if it deviates
from the default method specified in file
level metadata under .run.default_data_processing_id,
null otherwise.parameters (list): A list of controlled or uncontrolled
parameters that describe this spectrum. See the parameter list section for more
details.number_of_auxiliary_arrays (integer): The number of
auxiliary arrays that are stored with this row’s
auxiliary_arrays column. This is useful for quickly telling
if a reader needs to go through the more expensive process of reading
these and decoding them.auxiliary_arrays (list): A list of structures that
describe a data array that did not fit within the constraints as
described in the arrays and columns
section. These may be large and care should be used in deciding to
eagerly load them (or not). They are described in the auxiliary data arraysMS_1000559_spectrum_type
(CURIE): A child of MS:1000559:
MS_1003060_number_of_data_points
(integer): The number of profile data points stored for this spectrum in
the wavelength_spectra_data.parquet
file.scan (group): Scan or acquisition from original raw
file used to create a spectrum.
source_index (uint64): The index of the
spectrum this scan belongs to. This is a foreign key.instrument_configuration_ref (integer): The identifier
for instrument_configuration that governs this scan.parameters (list): A list of controlled or uncontrolled
parameters that describe this scan. See the parameter list section for more
details.scan_windows: See equivalent substructure for spectraJoshua A. Klein Boston MA, USA joshua.adam.klein@gmail.com
Tim Van Den Bossche, Ghent University, Ghent, Belgium; VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium Tim.VanDenBossche@ugent.be
Samuel Wein Wissenschaftlicher Mitarbeiter, Institute for Bioinformatics and Medical Informatics, University of Tübingen samuel.wein@uni-tuebingen.de
Oliver Kohlbacher Professor, Applied Bioinformatics, Dept. of Computer Science, University of Tübingen; Director, Institute for Bioinformatics and Medical Informatics, University of Tübingen; Director, Institute for Translational Bioinformatics, University Hospital Tübingen oliver.kohlbacher@uni-tuebingen.de
TODO: Fill in with more people from the mailing list
TODO: Fill in with more people from the mailing list
The PSI takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the PSI Chair.
The PSI invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this recommendation. Please address the information to the PSI Chair (see contact information at PSI website).
Copyright (C) 2026 by the Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) under the CC-BY-ND 4.0 license (https://creativecommons.org/licenses/by-nd/4.0/).
TODO: The mzPeak name is currently held in trust by the OpenMS Inc. The details of the trademark are described here