mzPeak File Format

PSI Recommendation

PSI Mass Spectrometry and Proteomics Informatics Working Groups

Status: DRAFT


Status of this document

This document provides information to the proteomics community about a TODO. Distribution is unlimited. This document will be ratified via the HUPO Proteomics Standards Initiative (PSI) Document Process. Any alterations of this document MUST also follow the HUPO PSI Document Process.

Version Draft 5 of version 0.9

Abstract

Introduction

Description of the need

Issues to be addressed

Notational conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” are to be interpreted as described in RFC 2119 (2).

Code samples

This document describes both a file format and a set of suggested algorithms for preparing data to be stored in that format. The original author (Joshua Klein) includes snippets of Python code for these operations, on the assumption that most technical programmers currently know Python and that Python is effectively executable pseudocode. The snippets rely on three components: abstract base classes from the standard library for type annotations, NumPy v2.1 for array operations that are assumed to be understandable, and PyArrow v20.0 for operations on Apache Arrow arrays, which are conceptually equivalent to the in-memory representation of the data stored in Parquet files.

Overview

What is mzPeak?

mzPeak is an archive of multiple Parquet files, stored directly in an uncompressed ZIP archive or in an unpacked directory/prefix. Each Parquet file describes a different facet of the stored mass spectrometry run. While the data model draws on prior art like mzML (https://peptideatlas.org/tmp/mzML1.1.0.html), it is not a direct re-implementation in a Parquet table. It does attempt to re-use concepts like controlled vocabularies where feasible, as well as arbitrary additional user metadata.

Components of an mzPeak archive:

and other Parquet files may be added to cover additional modalities as needed like the wavelength spectra.

Anatomy of a Parquet file

This is a minimal overview of Parquet. For further explanation, please see https://parquet.apache.org/.

The schema

Parquet files contain a physical data schema defining how their data columns are encoded in bytes on disk. This schema supports arbitrary levels of nullability, nesting (groups) or repetition (lists). These physical data types may also be mapped to one or more “logical types”.

There is a broader many-to-many mapping between Parquet schemas and Apache Arrow schemas. Arrow supports many types that Parquet does not, but they share a common abstraction of columnar data storage with a notion of per-value nullability. While they store these concepts differently, it is straightforward to convert from one to the other.

The metadata key-value pairs

Parquet schematic

At the end of a Parquet file is a footer containing user-defined metadata along with the file’s schema, offsets and search indices. This user-defined metadata is stored in key-value pairs, which makes it amenable to serializing light-weight, immediately interesting metadata there that do not make sense to force into the data columns.

The columnar data

Parquet is a strongly typed binary columnar data format with layered blocked compression that permits a degree of random access.

As some languages do not have a concept of 64-bit addresses, all implementations MUST handle both the large_* and the non-large_* variants of the collection array types: list, string, and binary.

Row Groups and Pages

TODO: write more here regarding compression vs. random access granularity. Lots of knobs to twist

Index levels

When writing an mzPeak archive, the writer MUST write a page index. Most libraries that write Parquet support writing the page index, even if they do not directly support reading informed by the page index.

TODO: write more here

Container

ZIP archives

In order to pack multiple Parquet tables together under a single file name on disk, we need a container file format. To that end we use the ZIP archive to bundle multiple files together. ZIP files start with a header containing the magic bytes followed by a sequence of blocks of (header, file) pairs, terminating with a central directory listing how to find each file in the archive. Files saved in a ZIP may be stored compressed or uncompressed. When mzPeak is stored in a ZIP it MUST store its member files uncompressed. Files stored uncompressed can be read directly without requiring some intervening decompression to occur to reveal the Parquet file, and the Parquet file itself contains layered compression that is superior to that of most plain ZIP compressors.
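The uncompressed storage requirement can be sketched with Python's standard `zipfile` module. The helper names `pack_uncompressed` and `all_members_stored` are hypothetical, and the member payload is a placeholder rather than a real Parquet file.

```python
import io
import zipfile


def pack_uncompressed(members: dict[str, bytes]) -> bytes:
    """Bundle named byte payloads into a ZIP archive without compression."""
    buffer = io.BytesIO()
    # ZIP_STORED writes each member's bytes verbatim into the archive.
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def all_members_stored(archive_bytes: bytes) -> bool:
    """Check that every member of the archive was stored uncompressed."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return all(
            info.compress_type == zipfile.ZIP_STORED for info in archive.infolist()
        )


# Placeholder payload; a real archive would hold complete Parquet files.
payload = b"PAR1 placeholder payload PAR1"
packed = pack_uncompressed({"spectra_metadata.parquet": payload})
print(all_members_stored(packed))
```

Because the members are stored verbatim, a reader can locate a member through the central directory and hand the corresponding byte range directly to a Parquet reader.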

Why not TAR?

TAR archives are designed for a linear traversal. In order to know all of the files in the archive, you must jump from header entry to header entry until you reach the end of the archive. Compared to ZIP’s central directory index, this is less efficient and more expensive for object stores. TAR does not support per file encryption either, making protecting parts of the archive that are not in Parquet more difficult.

Unpacked archives

If an mzPeak archive is stored in an unpacked directory, the directory name is treated as the name of the run file.

Relationship to other specifications

The format specification described in this document is not being developed in isolation; indeed, it is designed to be complementary to, and thus used in conjunction with, several existing and emerging models. Related specifications include the following:

  1. PSI Universal Spectrum Identifier (http://psidev.info/USI). The PSI Universal Spectrum Identifier is designed to provide a universal mechanism for referring to a specific spectrum in public repositories.
  2. mzML (https://www.psidev.info/mzml). The current PSI recommendation for storing raw or processed mass spectrometry data.
  3. SDRF (https://www.psidev.info/sdrf-sample-data-relationship-format).
  4. imzML (https://github.com/imzML/imzML/).

Controlled Vocabularies (CV)

Data Layouts

Packed Parallel Metadata Tables

The spectra_metadata.parquet and chromatograms_metadata.parquet files store multiple schemas in parallel. In these Parquet files, the root schema is made up of several branches, each a “group” or “struct” (Parquet vs. Arrow nomenclature), that may be null at any level. We use relational database language, specifically “primary key” and “foreign key”, to describe the interconnections between the different tables that are packed together here.

Here is a stripped-down example with two rows of related MS1 and MS2 spectra. Treat scan.source_index, precursor.source_index, precursor.precursor_index, selected_ion.source_index, and selected_ion.precursor_index as foreign keys with respect to spectrum.index, a primary key. precursor.source_index refers to the spectrum to which the precursor record belongs, and precursor.precursor_index refers to the spectrum that is the precursor of the spectrum referenced by precursor.source_index. Together, (precursor.source_index, precursor.precursor_index) forms a compound primary key. Any of these columns may be null, which means that no such record exists in the table. The same applies to the selected_ion facet.

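| spectrum.index | spectrum.id | spectrum.time | spectrum.MS_1000511_ms_level | scan.source_index | scan.MS_1000616_preset_scan_configuration | precursor.source_index | precursor.precursor_index | precursor.isolation_window | selected_ion.source_index | selected_ion.precursor_index | selected_ion.MS_1000744_selected_ion_mz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 502 | scan=502 | 20.51 | 1 | 502 | 3 | 503 | 502 | {…} | 503 | 502 | 233.5 |
| 503 | scan=503 | 20.531 | 2 | 503 | 2 | 504 | 502 | {…} | 504 | 502 | 562.3 |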

Controlled Vocabulary Terms

Like mzML, mzPeak makes heavy use of controlled vocabularies for representing rich metadata. mzPeak uses controlled vocabulary terms in several ways:

  1. As columns. When a term is used as a column name, that column’s values are either the defined value of an expected type for the term (e.g. the term has_value_type) OR a CURIE for a child of the term named by the column. For example:
    1. The column MS_1000525_spectrum_representation would have values that are CURIEs for a child term, MS:1000127 “centroid spectrum” or MS:1000128 “profile spectrum”, as appropriate for the spectrum the row is describing.
    2. The column MS_1000511_ms_level would hold an integer value, as appropriate for the spectrum the row is describing.
  2. As structural elements. In several places in the format, like the array index, we use CURIEs to reference named concepts that explain the semantics of the data structure without changing the shape of the data structure.
  3. As pluggable metadata carriers in parameters arrays, akin to <cvParam /> in mzML. For every schema facet of a metadata table, a parameters column is allowed. See the parameters list for details.

The parameters list

The parameters column may be present in any facet of a metadata table. It MUST be a list of the following schema:

optional group field_id=-1 parameters (List) {
  repeated group field_id=-1 list {
    optional group field_id=-1 item {
      optional group field_id=-1 value {
        optional int64 field_id=-1 integer;
        optional double field_id=-1 float;
        optional binary field_id=-1 string (String);
        optional boolean field_id=-1 boolean;
      }
      optional binary field_id=-1 accession (String);
      optional binary field_id=-1 name (String);
      optional binary field_id=-1 unit (String);
    }
  }
}

The parameters.list.item.value group must have a column for each data type so that a parameter can take on one of these value types. Unused type slots MUST be null. (QUESTION: Should this support lists/maps?) Entries in parameters MAY have a unit defined by a controlled vocabulary term CURIE stored in the parameters.list.item.unit column. As with mzML’s <userParam />, uncontrolled parameters may also be included in the parameters list by storing a parameter WITHOUT a value in the parameters.list.item.accession column.

NOTE: Parquet columns MUST be uniquely named, so if a parameter is present multiple times in a single entry it MUST be stored in the parameters column.

NOTE: Writers are encouraged to, when sufficient context is available, encode parameters that are present in most rows of a table as columns. This is more space efficient and opens the door to easy predicate filtering.

Typing Parameter Values

When a term has a value, if it is stored in the parameters list, the value MUST be stored using one of the provided value types as given by the fixed schema. When storing a term that has been encoded as a separate column, values can be stored in any data type supported by Parquet. This flexibility allows us to pick a physical type with an appropriate size and precision for the value being stored, but it can also create a lot of redundant code: a reader that expects a column to contain an integer must still handle whether the file was written using an 8-, 16-, 32-, or 64-bit integer, and whether it was stored signed or unsigned. Some languages can handle these naturally using dynamic typing or templates, while others require manually repeated implementations for each type a column might use.

Column Name Inflection

When representing a controlled vocabulary term concept as a column in the table, the column name SHOULD use the following inflection rules to construct the column name:

  1. The base column name is ${CV_CODE}_${CV_ACCESSION}_${CLEANED_NAME} where:
    1. ${CV_CODE} is the identifier for the controlled vocabulary itself, MS for PSI-MS or UO for the unit ontology.
    2. ${CV_ACCESSION} is the accession number. MS:1000016 for “scan start time” would be 1000016.
    3. ${CLEANED_NAME} is the term’s name with any non-Parquet column compatible characters replaced with _. The regular expression /[^a-zA-Z0-9_\-]+/ is sufficient to match all of these characters in ASCII. For MS:1000016 “scan start time”, the cleaned name is scan_start_time, giving the column name MS_1000016_scan_start_time.
    4. The string “m/z” appears so frequently it SHOULD be rewritten mz to avoid unnecessary additional underscores. For MS:1000504 for “base peak m/z”, this would be written MS_1000504_base_peak_mz.
  2. IF there is a single unit kind specified for all values in the column it SHOULD be specified by appending _unit_${UNIT_CV_CODE}_${UNIT_CV_ACCESSION} to the inflected name. MS_1000528_lowest_observed_mz_unit_MS_1000040 would correspond to MS:1000528 “lowest observed m/z” with a unit of MS:1000040 “m/z”.
  3. IF the unit for this column varies it SHOULD be specified as a separate column with _unit appended to the column name whose value is the unit’s CURIE.

Null Semantics for Metadata

A row value that is null should be treated as being absent, having no value. If a foreign key column is null, assume the entry does not exist in the table, as in the case where an MS2 spectrum is stored without MS1 spectra, as in MGF files or a slice of an MS run. If the primary key of a table is null, the reader SHOULD skip all of the columns in that row for that table.

A writer implementation SHOULD minimize the number of interspersed rows that are null, but this is not strictly required. Minimizing the interspersed nulls improves compressibility. See the images below: the “Packed Tables” diagram has all of the rows of each parallel table contiguous, while the “Sparse Tables” diagram shows rows of nulls intermixed.

Signal Data Layouts

Arrays and Columns

It is common in mass spectrometry to talk about a spectrum having an m/z array as synonymous with having been measured in the m/z dimension, with those m/z values represented using some physical data type in memory, and likewise having an intensity array corresponding to the abundance of the signal parallel to the m/z array. In mzML, it is possible to use different physical data types for these two dimensions on different spectra in the same file, and there may well be legitimate use-cases for that.

mzPeak can store array data in two ways. One way is to store the array as a column in a signal data layout, burning the column into the schema and adding it to the array index. The other way is to store it as an auxiliary array in the associated metadata table’s *.auxiliary_arrays value for that entity’s row. Auxiliary data arrays can be individually configured by the writer and can have custom compression, data type decoding, or cvParams, but they cannot be searched or sliced (reading only a segment) without decoding the entire array, just as in mzML. By contrast, any array that is written as a column is encoded directly in Parquet, is part of the schema, and is subject to Parquet’s adaptive encoding process and compression.

Currently, we assume that the first sorted array, the one with sorting rank 0, is the axis around which all other values are arranged. Any arrays that are shorter or longer than it SHOULD instead be stored in auxiliary_arrays.

When writing, if an array with a sorting rank is unsorted, the entry’s data arrays MUST be re-sorted accordingly. Failure to do so introduces integrity errors.
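A minimal NumPy sketch of this re-sorting step is shown below. It assumes all arrays of the entry have the same length and are held in a dictionary keyed by column name; the helper `sort_entry_arrays` is hypothetical.

```python
import numpy as np


def sort_entry_arrays(
    arrays: dict[str, np.ndarray], axis_name: str
) -> dict[str, np.ndarray]:
    """Reorder every array of an entry by the array with sorting rank 0."""
    # A stable sort keeps the relative order of points with equal coordinates.
    order = np.argsort(arrays[axis_name], kind="stable")
    # Apply the same permutation to all arrays so points stay aligned.
    return {name: values[order] for name, values in arrays.items()}


entry = {
    "mz": np.array([300.0, 100.0, 200.0]),
    "intensity": np.array([3.0, 1.0, 2.0]),
}
sorted_entry = sort_entry_arrays(entry, "mz")
```

Applying one permutation to every array is what preserves the pairing between each coordinate and its parallel values.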

The Array Index

In order to properly annotate what kind of array a column is, we include a JSON-serialized array index in the Parquet key-value metadata: a list of data structures that describe each array using controlled vocabulary terms. A column is part of the Parquet file’s schema, so it must always exist and must have a homogeneous value type, with values optionally marked null for each row. The array index is stored in the Parquet metadata for the data arrays or peaks files under <entity_type>_array_index, e.g. spectrum_array_index for spectra or chromatogram_array_index for chromatograms.

{
  "prefix": "point",
  "entries": [
    {
      "context": "spectrum", // This is an array describing a spectrum
      "path": "point.mz", // The path to the column for this array in the Parquet schema
      "data_type": "MS:1000523", // The controlled vocabulary term  for the data type of this array
      "array_type": "MS:1000514", // The controlled vocabulary term for the array itself
      "array_name": "m/z array", // A human readable name and a place to store a custom name through `non-standard array`
      "unit": "MS:1000040", // The values in this array have the unit m/z
      "buffer_format": "point", // This column uses the point layout
      "transform": null, // No transformation was applied to this data
      "data_processing_id": null, // No specific data processing pipeline was applied, use the default data processing method
      "buffer_priority": "primary", // This is the primary m/z array, default all queries to read this column when looking for m/z values
      "sorting_rank": 0 // This array's column is assumed to be sorted within entries, with all other arrays' columns sorted afterwards
    },
    {
      "context": "spectrum",
      "path": "point.intensity",
      "data_type": "MS:1000521",
      "array_type": "MS:1000515",
      "array_name": "intensity array",
      "unit": "MS:1000131", // The unit of this array is detector counts
      "buffer_format": "point",
      "transform": null,
      "data_processing_id": null,
      "buffer_priority": "primary",
      "sorting_rank": null // This array does not impose any sorting order on the data
    }
  ]
}

This array index describes the table shown below for the point layout

Governed by schema/array_index.json.

Buffer Format

Depending upon the signal data layout being used, arrays will be stored in different formats.

Available formats:

  - point: This array is stored using the point layout. The point layout is all-or-nothing: every array MUST be in that format.
  - chunk_values: This array is part of the chunked layout. It contains the list of values of the “main” axis bounded between the chunk’s start and end point. These values are encoded to be more compressible. This encoding is in addition to the Parquet encoding step.
  - chunk_start: This array is part of the chunked layout. It contains the starting value of the “main” axis for the chunk, inclusive.
  - chunk_end: This array is part of the chunked layout. It contains the ending value of the “main” axis for the chunk, inclusive. It should be less than the next chunk’s chunk_start value.
  - chunk_encoding: This array is part of the chunked layout. It contains a CURIE indicating how chunk_values was encoded.
  - chunk_secondary: This array is part of the chunked layout. It contains the list of values of an array other than the main axis for the chunk.
  - chunk_transform: This array is part of the chunked layout. It contains the list of raw byte contents of an array in the chunk that was opaquely transformed, e.g. using MS-Numpress. It may be present in addition to a referenced chunk_values or chunk_secondary column.

Currently the chunked layout only supports a single chunking dimension. A single file in the chunked layout MUST use chunk_start, chunk_end, chunk_encoding, and chunk_values EXACTLY once.
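The exactly-once rule lends itself to a simple check over the array index entries. The following sketch counts the `buffer_format` values; the helper `check_chunked_layout` and the shape of its report are assumptions for illustration.

```python
from collections import Counter

# Buffer formats that a chunked-layout file must use exactly once.
REQUIRED_ONCE = ("chunk_start", "chunk_end", "chunk_encoding", "chunk_values")


def check_chunked_layout(entries: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    counts = Counter(entry["buffer_format"] for entry in entries)
    return [
        f"{fmt} appears {counts[fmt]} times, expected exactly 1"
        for fmt in REQUIRED_ONCE
        if counts[fmt] != 1
    ]


valid_entries = [
    {"buffer_format": "chunk_start"},
    {"buffer_format": "chunk_end"},
    {"buffer_format": "chunk_encoding"},
    {"buffer_format": "chunk_values"},
    {"buffer_format": "chunk_secondary"},
]
problems = check_chunked_layout(valid_entries)
```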

Buffer Priority, Naming

In Parquet, all column names and types need to be known before you can begin writing, and no two columns can have the same name + path. Normally, we have a coordinate array column (e.g. m/z or time) and an intensity array column. If you have intensity arrays with different units or different data types, they would need to be defined as separate arrays in the array index and thus have distinct names. While this case may be uncommon for spectra, when working with diagnostic traces stored as chromatograms this can be unavoidable. For ergonomics, we want to use simple column names most of the time, and it would be ideal if the most common columns had consistent names as this makes using files from raw Parquet tools easier. To that end, the most common (as defined by the implementation) version of each array type SHOULD have a buffer_priority property of primary and receive a short and consistent name. The table below lists recommended short names:

| accession | name | column name |
| --- | --- | --- |
| MS:1000514 | m/z array | mz |
| MS:1000515 | intensity array | intensity |
| MS:1000516 | charge array | charge |
| MS:1000517 | signal to noise array | signal_to_noise |
| MS:1000595 | time array | time |
| MS:1000617 | wavelength array | wavelength |
| MS:1002530 | baseline array | baseline |
| MS:1002529 | resolution array | resolution |
| MS:1002893 | ion mobility array | ion_mobility |
| MS:1003007 | raw ion mobility array | raw_ion_mobility |
| MS:1002816 | mean ion mobility array | mean_ion_mobility |
| MS:1003154 | deconvoluted ion mobility array | deconvoluted_ion_mobility |
| MS:1003008 | raw inverse reduced ion mobility array | raw_inverse_reduced_ion_mobility |
| MS:1003006 | mean inverse reduced ion mobility array | mean_inverse_reduced_ion_mobility |
| MS:1003155 | deconvoluted inverse reduced ion mobility array | deconvoluted_inverse_reduced_ion_mobility |
| MS:1003153 | raw ion mobility drift time array | raw_drift_time |
| MS:1002477 | mean ion mobility drift time array | mean_drift_time |
| MS:1003156 | deconvoluted ion mobility drift time array | deconvoluted_ion_mobility_drift_time |

Array name recommendations

Data Arrays, Encoding, Transformations and Parquet

Parquet can write page indices on any column that is a leaf node in the schema based upon the value being stored prior to applying encoding and compression. To that effect, we must take care when trying to store data cleverly. The following section may refer to spectra, but these are applicable more broadly.

Zero Run Stripping

When storing spectrum data, some vendors will produce arrays with lots of “empty” regions filled with zero intensity values along a semi-regularly spaced m/z axis. These regions hold little information, so all but the first and last zero intensity points are removed. This is only meaningful for profile data. Readers SHOULD assume that zero runs have been stripped.

Null Marking

For spectra with many small gaps, even zero run stripping leaves too much unhelpful information in the data. We can instead replace the flanking zero intensity points with null m/z and intensity values and Parquet will skip storing the expensive 32- and/or 64-bit values, retaining only the validity buffer bit flag. We can separately fit a simple m/z spacing model using weighted least squares of the form:

$$\delta mz \sim \beta_0 + \beta_1 mz + \beta_2 mz^2 + \epsilon$$

or using the following Python code:

Python code for fitting the weighted least squares model
import numpy as np
from scipy.linalg import solve_triangular  # requires SciPy in addition to NumPy


class DeltaCurveRegressionModel:
    beta: np.ndarray

    def __init__(self, beta: np.ndarray):
        self.beta = beta

    @classmethod
    def fit(
        cls,
        mz_array: np.ndarray,
        delta_array: np.ndarray,
        weights: np.ndarray | None = None,
        threshold: float | None = None,
        rank: int = 2,
    ):
        # delta_array is assumed to be np.diff(mz_array), so it is one
        # element shorter than mz_array and weights.
        if weights is None:
            weights = np.ones(len(mz_array))

        if threshold is None:
            threshold = 1.0

        # Drop all entries where the gap between m/z values > threshold
        keep = delta_array <= threshold
        raw = mz_array[1:][keep]
        w = weights[1:][keep]
        y = delta_array[keep]

        # Build the design matrix with columns [1, mz, mz^2, ..., mz^rank]
        data = [np.ones_like(raw)]
        for i in range(1, rank + 1):
            data.append(raw**i)
        data = np.stack(data, axis=-1)

        # Use the QR decomposition to solve the weighted least squares problem
        # to estimate weights predicting δ m/z.
        # https://stats.stackexchange.com/a/490782/59613
        chol_w = np.sqrt(w)
        qr = np.linalg.qr(chol_w[:, None] * data)
        v = qr.Q.T.dot(chol_w * y)
        beta = solve_triangular(qr.R, v)

        # Numerically equivalent to and more stable than the direct inversion
        # beta = np.linalg.inv((data.T * w).dot(data)).dot(data.T * w).dot(y)
        return cls(beta)

    def predict(self, mz: float) -> float:
        # Evaluate the polynomial beta[0] + beta[1] * mz + beta[2] * mz^2 + ...
        acc = self.beta[0]
        for i in range(1, len(self.beta)):
            acc += self.beta[i] * mz ** i
        return acc

Then, when reading the null-marked data, use either the local 2nd median δmz or the learned model for that spectrum to compute the m/z spacing for singleton points to achieve a very accurate reconstruction. Because the non-zero m/z points remain unchanged, the reconstructed signal’s peak apex or centroid should be unaffected. If the peak is composed of only three points including the two zero intensity points, no meaningful peak model can be fit in any case, so the minuscule angle change this would induce leaves the data effectively lossless. The parameters of the model learned for each entry MUST be stored in the relevant entry’s row in the associated metadata table as mz_delta_model (QUESTION: should this become a CV param and dissociate it from m/z? Probably).

Figures: Thermo dataset with null marking; Sciex dataset with delta encoding and null marking.

Keep in mind that all Numpress compression methods are still available and still provide superior size reduction, but carry a slightly larger loss of accuracy. Using a Numpress compression is a transformation that requires the Chunked Layout.

Removing Zero Runs

A zero run is defined explicitly as a sequence of 3 or more zero values that should be reduced to just the first and last positions in the run. Zero runs can be very long and outside of certain scenarios which assume a complete grid of coordinate values, provide no value. If zero runs need to be reconstructed beyond the flanking points left at the end of these runs, the same method used to fill in nulls here can be used to extend zero runs.

Python code for finding zero runs
from collections.abc import Sequence
from numbers import Number

import numpy as np


def find_where_not_zero_run(data: Sequence[Number]) -> np.ndarray:
    """
    Construct a list of positions that are not part of a zero run.

    A zero run is any position *i* such that:
      1. ``x[i] == 0``
      2. ``(i == 0) or (x[i - 1] == 0)``
      3. ``(i == (len(x) - 1)) or (x[i + 1] == 0)``

    We build a position list here because we need to extract these positions
    from ALL dimension arrays for this entity, not just the current array.

    Parameters
    ----------
    data : :class:`Sequence` of :class:`Number`
        The numerical data to traverse

    Returns
    -------
    :class:`np.ndarray` of :class:`np.uintp`
    """
    n1 = len(data) - 1
    was_zero = False
    acc = []
    for i, v in enumerate(data):
        if v is not None and v == 0:
            # The previous value was zero, or nothing has been kept yet
            zero_before = was_zero or len(acc) == 0
            # The next value is zero, or this is the last position
            zero_after = i == n1 or data[i + 1] == 0
            if not (zero_before and zero_after):
                # A flanking zero next to real signal is kept
                acc.append(i)
            was_zero = True
        else:
            # Non-zero and null values are always kept
            acc.append(i)
            was_zero = False
    return np.array(acc, dtype=np.uintp)

Finding Flanking Zero Pairs

Zero intensity points on the sides of peaks still use non-trivial amounts of storage in sparse datasets. This step does not match all zero intensity points, only those that occur on the flanks of profile peaks. Once these indices are found, they can be used to construct the null mask or “validity bitmap” of an Arrow array which is equivalent to how a Parquet column chunk would be constructed.

Python code for finding flanking zeros
from collections.abc import Sequence
from numbers import Number

import numpy as np


def is_zero_pair_mask(data: Sequence[Number]) -> "np.typing.NDArray[np.bool_]":
    '''
    Create a boolean mask for positions that are composed of two zeroes in a row.

    Parameters
    ----------
    data : :class:`Sequence` of :class:`Number`
        The numerical data to traverse

    Returns
    -------
    :class:`np.ndarray` of :class:`bool`
    '''
    n = len(data)
    n1 = n - 1
    was_zero = False
    acc = []
    for i, v in enumerate(data):
        if v == 0:
            # A zero is part of a pair if its neighbor on either side is zero
            if was_zero or (i < n1 and data[i + 1] == 0):
                acc.append(True)
            else:
                acc.append(False)
            was_zero = True
        else:
            acc.append(False)
            was_zero = False
    # An explicit dtype keeps the result boolean even for empty input
    return np.array(acc, dtype=bool)

Decoding Null Pairs

Decoding null pairs, the process of undoing null marking, involves finding regions bounded by two nulls, by the start of the array and a null, or by a null and the end of the array. The null values are then filled with either a locally estimated value, when there is more than one value from which to estimate the median delta, or a value imputed for a single point using the regression model described earlier. Unpaired null values MAY only be found as the first or last null in the array; any other unpaired nulls are unrecoverable errors. If a run of three or more null values is encountered, it MAY be recoverable but should not occur under normal operation.

The locally estimated value SHOULD be the second median of the spacing of the current segment’s non-null values. The regression model is used to predict the spacing from the non-null value within a segment with only one non-null value.

Python code for filling null marked values

def find_pairs(mask: Sequence[bool]) -> Sequence[int]:
    """
    Construct index ranges between pairs of :const:`True` values in
    ``mask``.

    The first and last index range will include the beginning and ending
    of the array respectively, even if the mask does not start/end with a
    :const:`True` value.

    The resulting array will have two columns, the start and end indices
    of the spans between two :const:`True` values (or the termini of the array).

    .. warning::
      This function *can* fail or produce incorrect output if there are runs of
      :const:`True` values longer than 2 in the ``mask``. It is also necessary to
      buffer all data for an entry before invoking this function so that batches
      do not artificially disrupt pairs.

    Parameters
    ----------
    mask : :class:`Sequence` of :class:`bool`

    Returns
    -------
    np.typing.NDarray[int]
    """
    parts = []
    indices = np.where(mask)[0]
    if len(indices) == 0:
        return np.array([[0, len(mask)]])
    if indices[0] != 0:
        parts.append([0])
    parts.append(indices)
    if indices[-1] != len(mask) - 1:
        parts.append([len(mask) - 1])
    indices = np.concat(parts)
    indices = indices.reshape((-1, 2))
    indices[:, 1] += 1
    return indices


def estimate_median_delta(data: Sequence[Number]) -> tuple[Number, np.typing.NDArray]:
    """
    Find the 2nd median of ``np.diff(data)``.

    This is a relatively crude spacing estimate for continuous profile data.

    Returns
    -------
    :class:`Number`
        The 2nd median of ``np.diff(data)``
    :class:`np.ndarray`
        The values from which the previous return values were estimated
    """
    deltas = np.diff(data)
    median = np.median(deltas)
    deltas_below = deltas[deltas <= median]
    median = np.median(deltas_below)
    return median, deltas_below


def fill_nulls(
    data: pa.Array, common_delta: DeltaModelBase
) -> "np.typing.NDArray":
    """
    Fill ``null`` values in ``data`` using the ``common_delta`` model or the locally estimated
    median delta if sufficient data are available.

    Parameters
    ----------
    data : :class:`pyarrow.Array`
        The data array to fill nulls in with ``common_delta``
    common_delta : :class:`DeltaModelBase` or :class:`Number`
        The common spacing model, either specified as a model instance or as a single constant spacing term.

    Returns
    -------
    np.ndarray
    """
    if not isinstance(common_delta, DeltaModelBase):
        if isinstance(common_delta, Number):
            common_delta = ConstantDeltaModel(common_delta)
        else:
            common_delta = DeltaCurveRegressionModel(np.asarray(common_delta))

    pair_indices = find_pairs(data.is_null())

    chunks = []
    for (start, end) in pair_indices:
        # Get the values in the array between start and end
        chunk = np.asarray(data.slice(start, end - start))
        n = len(chunk)
        # The set of values that are not null
        has_real = chunk[~np.isnan(chunk)]
        n_has_real = len(has_real)
        if n_has_real == 1:
            # If there is only one non-null value, this is a singleton
            # point, but it might only have one or two sides to pad
            if n == 2:
                if np.isnan(chunk[0]):
                    chunk[0] = chunk[1] - common_delta(chunk[1])
                else:
                    chunk[1] = chunk[0] + common_delta(chunk[0])
            elif n == 3:
                dx = common_delta(chunk[1])
                chunk[0] = chunk[1] - dx
                chunk[2] = chunk[1] + dx
            else:
                raise ValueError(f"Unexpected chunk of length {n} with a single non-null value")
        else:
            # Otherwise this is a run of values, so we can estimate a more accurate
            # delta directly from the data
            dx, _ = estimate_median_delta(has_real)
            if np.isnan(chunk[0]):
                chunk[0] = chunk[1] - dx
            if np.isnan(chunk[-1]):
                chunk[-1] = chunk[-2] + dx
        chunks.append(chunk)
    return np.concat(chunks)

Null Semantics for Signal Data

Unless otherwise noted, readers SHOULD treat null values in the sorting rank 0 array of the entry as governed by this model, and parallel null values in any intensity arrays as 0. The former should have a transformation value of MS:1003901 and the latter a transformation value of MS:1003902. All other values for those points should be read as-is, with null meaning that the value was absent. Writers using null marking SHOULD only use null for the first sorting dimension and its associated intensity value; all other columns should be written as-is.

Auxiliary Data Arrays

When an array is present in an entry, but is not encoded as a column in the schema, it must be stored as an auxiliary array. This can happen when mixing different kinds of detectors in a single collection, or especially with diagnostic traces where every array might have different dimensions along a shared time axis or subsampled arrays. Auxiliary data arrays have a schema similar to binaryDataArray in mzML, encoded in Parquet. They are described in JSON schema at schema/auxiliary_array.json.

optional group auxiliary_arrays (List) {
  repeated group list {
    optional group item {
      optional group data (List) {
        repeated group list {
          required int32 item (Int(bitWidth=8, isSigned=false));
        }
      }
      optional group name {
        optional group value {
          optional int64 integer;
          optional double float;
          optional binary string (String);
          optional boolean boolean;
        }
        optional binary accession (String);
        optional binary name (String);
        optional binary unit (String);
      }
      optional binary data_type (String);
      optional binary compression (String);
      optional binary unit (String);
      optional group parameters (List) {
        repeated group list {
          optional group item {
            optional group value {
              optional int64 integer;
              optional double float;
              optional binary string (String);
              optional boolean boolean;
            }
            optional binary accession (String);
            optional binary name (String);
            optional binary unit (String);
          }
        }
      }
      optional binary data_processing_ref (String);
    }
  }
}
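
As a rough sketch of how a reader might turn one auxiliary_arrays item back into numbers, the snippet below decodes the data byte list according to data_type and compression. It assumes that both fields hold PSI-MS CURIEs (MS:1000521 and MS:1000523 for 32- and 64-bit floats, MS:1000519 and MS:1000522 for 32- and 64-bit integers, MS:1000576 for no compression, MS:1000574 for zlib compression) and that values are little-endian as in mzML. Neither assumption is stated by the schema above, so schema/auxiliary_array.json remains authoritative. The function name decode_auxiliary_array and the lookup tables are hypothetical.

```python
import struct
import zlib

# Assumed CURIE -> (struct format character, byte width) table.
# Hypothetical helper, not part of the specification.
DATA_TYPES = {
    "MS:1000521": ("f", 4),  # 32-bit float
    "MS:1000523": ("d", 8),  # 64-bit float
    "MS:1000519": ("i", 4),  # 32-bit integer
    "MS:1000522": ("q", 8),  # 64-bit integer
}
NO_COMPRESSION = "MS:1000576"
ZLIB_COMPRESSION = "MS:1000574"


def decode_auxiliary_array(item: dict) -> list:
    """Decode the ``data`` byte list of one auxiliary array item into numbers."""
    raw = bytes(item["data"])  # ``data`` is stored as a list of uint8 values
    if item["compression"] == ZLIB_COMPRESSION:
        raw = zlib.decompress(raw)
    elif item["compression"] != NO_COMPRESSION:
        raise ValueError(f"unsupported compression: {item['compression']}")
    code, width = DATA_TYPES[item["data_type"]]
    count = len(raw) // width
    # "<" selects little-endian byte order, an assumption carried over from mzML
    return list(struct.unpack(f"<{count}{code}", raw))


# Usage: round-trip three 64-bit floats through zlib
payload = zlib.compress(struct.pack("<3d", 1.0, 2.5, 4.0))
item = {"data": list(payload), "data_type": "MS:1000523", "compression": "MS:1000574"}
values = decode_auxiliary_array(item)  # [1.0, 2.5, 4.0]
```

Other compression or transform CURIEs would need their own branches, and the name, unit, and parameters fields are not consulted here.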

Point Layout

When storing data arrays, the point layout stores the data as-is in parallel arrays alongside a repeated index column. The top-level node is named point and it is a group with an arbitrary number of columns. The entity index column MUST be the first column under point.

point
spectrum_index mz intensity
1 213.2 1002
1 506.9 500
1 758 405
2 329.1 50
2 516.5 5002
2 783.8 302

This layout is simple, but carries several advantages. Scalar columns are easily filtered along the page-level range index. This makes multi-dimensional queries easier to write and optimize. The arrays are transparently encoded and compressed by Parquet, so the data may still be stored compactly. The data must be stored as-is in order to use the page index so no additional obscuring transformations can be used.
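
To make the shape of the point layout concrete, the sketch below flattens per-spectrum arrays into the three parallel columns from the table above and then selects points by entity index and m/z range. Plain Python lists stand in for Parquet columns, and the function names to_point_columns and select_points are hypothetical. A real reader would push these predicates down to the Parquet page index instead of scanning every row.

```python
def to_point_columns(spectra: dict) -> dict:
    """Flatten {spectrum_index: (mz, intensity)} into parallel point columns."""
    columns = {"spectrum_index": [], "mz": [], "intensity": []}
    for index in sorted(spectra):
        mz, intensity = spectra[index]
        # The entity index is repeated once per point in the spectrum
        columns["spectrum_index"].extend([index] * len(mz))
        columns["mz"].extend(mz)
        columns["intensity"].extend(intensity)
    return columns


def select_points(columns: dict, spectrum_index: int, mz_low: float, mz_high: float) -> list:
    """Return (mz, intensity) pairs of one spectrum with mz in [mz_low, mz_high]."""
    rows = zip(columns["spectrum_index"], columns["mz"], columns["intensity"])
    return [
        (mz, intensity)
        for index, mz, intensity in rows
        if index == spectrum_index and mz_low <= mz <= mz_high
    ]


# Usage with the values from the table above
spectra = {
    1: ([213.2, 506.9, 758.0], [1002, 500, 405]),
    2: ([329.1, 516.5, 783.8], [50, 5002, 302]),
}
columns = to_point_columns(spectra)
selected = select_points(columns, 2, 300.0, 600.0)  # [(329.1, 50), (516.5, 5002)]
```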

Chunked Layout

When storing data arrays, the chunked layout treats one array, which must be sorted, as the “main” axis, cutting the array into chunks of a fixed size along that coordinate space (e.g. steps of 50 m/z) and taking the same segments from parallel arrays. The main axis chunks’ start, end, and a repeated index are recorded as columns, and then each array may be encoded as-is or with an opaque transform (e.g. delta-encoding, Numpress). The start and end interval permits granular random access along the main axis as well as the source index. The top-level node is named chunk and it has a layout as shown below. The entity index column MUST be the first column under chunk.

chunk
spectrum_index mz_chunk_start mz_chunk_end mz_chunk_values chunk_encoding intensity
1 200 250 [0.0013, …, 0.0013] MS:1003089 […]
1 250 300 [0.0014, …, 0.0014] MS:1003089 […]
1 500 550 [0.0014, …, 0.0015] MS:1003089 […]
2 200 250 [0.0013, …, 0.0013] MS:1003089 […]
2 350 400 [0.0014, …, 0.0014] MS:1003089 […]
2 400 450 [0.0013, …, 0.0014] MS:1003089 […]

This example uses a delta-encoding for the m/z array chunks’ values, which can be efficiently reconstructed with very high precision for 64-bit floats. The m/z values within the mz_chunk_values list aren’t accessible to the page index, but the _chunk_start and _chunk_end columns are. The chunk values are still subject to Parquet encodings so they can be byte shuffled as well which further improves compression.

The chunked layout’s column naming rules:

  1. <entity>_index (integer): The index key for the entity this chunk belongs to.
  2. <array_name>_chunk_start (float64): The first coordinate value in this chunk; the chunk starts here, inclusive. This array’s entry in the array index’s buffer_format MUST be chunk_start.
  3. <array_name>_chunk_end (float64): The last coordinate value in this chunk; the chunk ends here, inclusive. This array’s entry in the array index’s buffer_format MUST be chunk_end.
  4. <array_name>_chunk_values (list): The encoded coordinates from array_name according to the chunk_encoding column. This array’s entry in the array index’s buffer_format MUST be chunk_values.
  5. chunk_encoding (CURIE): The method by which <array_name>_chunk_values were encoded. See Chunk Encodings for more details. This array’s entry in the array index’s buffer_format MUST be chunk_encoding.

All other columns are expected to be list arrays whose names are simply their array_name as described in the index with buffer_format chunk_secondary, or surrogate arrays with the buffer format chunk_transform.
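
The naming rules above can be illustrated with a minimal sketch that builds a single chunk row for an m/z segment using delta encoding. It assumes the segment is non-empty, sorted, and free of nulls, that the start coordinate is kept only in mz_chunk_start, and that secondary arrays such as intensity keep one value per point including the start point. The function name make_chunk_row is hypothetical, and a Python dict stands in for a Parquet row.

```python
DELTA_ENCODING = "MS:1003089"


def make_chunk_row(spectrum_index: int, mz: list, intensity: list) -> dict:
    """Build one chunked-layout row from a sorted, null-free m/z segment."""
    # Differences between consecutive coordinates. The start point is not
    # repeated here because it is already stored in mz_chunk_start.
    deltas = [b - a for a, b in zip(mz, mz[1:])]
    return {
        "spectrum_index": spectrum_index,  # entity index comes first
        "mz_chunk_start": mz[0],
        "mz_chunk_end": mz[-1],
        "mz_chunk_values": deltas,
        "chunk_encoding": DELTA_ENCODING,
        "intensity": list(intensity),  # a chunk_secondary array, stored as-is
    }


# Usage: three points spaced 0.25 apart
row = make_chunk_row(1, [200.0, 200.25, 200.5], [10.0, 50.0, 12.0])
# row["mz_chunk_values"] is [0.25, 0.25]
```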

Splitting Data Into Chunks

The process for constructing a chunk table for a signal entry may break in any pattern so long as the chunks are non-overlapping and ascending. The chunking procedure needs to be null-aware, particularly aware of null pairs used to denote masked regions. An algorithm for producing equal width chunks is given below. The granularity of the chunking is configurable, trading off random access granularity vs. compression efficiency.

Python code for partitioning chunks of up to a given width with null pairs present
import pyarrow as pa

def null_chunk_every(data: pa.Array, width: float) -> list[tuple[int, int]]:
    """
    Partition a sorted numerical array into segments spanning `width` units.

    This operation is null-aware, so sparse arrays can be partitioned.

    Parameters
    ----------
    data : pa.Array
        The data to be partitioned
    width : float
        The spacing (in units along the data dimension) between chunks

    Returns
    -------
    list[tuple[int, int]]
        The start and end index of each chunk
    """
    start = None
    n = len(data)
    i = 0
    # Find the first non-null position
    while i < n:
        v = data[i]
        if v.is_valid:
            start = v.as_py()
            break
        else:
            i += 1

    # If we never found a non-null position, just return a single chunk
    if start is None:
        return [(0, n)]

    chunks = []
    offset = 0
    threshold = start + width
    i = 0
    while i < n:
        v = data[i]
        if v.is_valid:
            v = v.as_py()
            if v > threshold:
                # Advance past any run of nulls that immediately follows this value
                while ((i + 1) < n) and (not data[i + 1].is_valid):
                    i += 1
                # We don't want to create a chunk of length 1, especially not if it is a null
                # point. If not, we have to relax the width requirement.
                if i - offset > 1:
                    chunks.append((offset, i))
                    offset = i
                # Update the threshold. We might need to update multiple times if the next value
                # is far away.
                while threshold < v:
                    threshold += width
        # Look ahead and see if the next value is not null since this one is.
        elif ((i + 1) < n) and (data[i + 1].is_valid):
            i += 1
            v = data[i].as_py()
            if v > threshold:
                i -= 1
                chunks.append((offset, i))
                offset = i
                # Update the threshold. We might need to update multiple times if the next value
                # is far away.
                while threshold < v:
                    threshold += width
        i += 1
    if offset != n:
        chunks.append((offset, n))
    return chunks

Chunk Encodings

Basic Encoding

Chunk Encoding Controlled Vocabulary: MS:1000576|no compression

When storing centroids or other data that are not regularly spaced, as is usually the case for pre-centroided spectra, while still using the chunked layout, no special encoding of the chunk values is necessary. The values within each chunk are written as-is to the chunk values array. This doesn’t improve compressibility, but it maintains a schema consistent with other entries that would benefit from a different encoding.

Note: The start point is excluded from the chunk values array.

Delta Encoding

Chunk Encoding Controlled Vocabulary: MS:1003089|truncation, delta prediction and zlib compression

When working with data that is laid out on a locally (almost) uniform grid using 64-bit floats, it is possible to improve compression by computing a delta encoding of the coordinates.

Note: The start point is excluded from the chunk values array.

Python code for delta encode/decode with null awareness
import pyarrow as pa

def null_delta_encode(data: pa.Array) -> pa.Array:
    """
    Delta-encode an Arrow array containing nulls. Nulls are encoded as null values, and treated as 0.0
    for the purposes of computing the next delta.

    Parameters
    ----------
    data : pa.Array
        The data to delta encode

    Returns
    -------
    pa.Array
    """
    acc = []
    it = iter(data)
    # Get the first entry in the array. It will be the first point of reference but not part
    # of the delta sequence unless it is `null`
    last = next(it)
    if not last.is_valid:
        acc.append(last)

    for item in it:
        # If the value isn't `null`,
        if item.is_valid:
            val = item.as_py()
            # Compute a delta relative to the last item if it was not `null`
            if last.is_valid:
                acc.append(pa.scalar(val - last.as_py()))
            # otherwise treat the last value as 0.0, the additive identity
            else:
                acc.append(item)
            # Update last item
            last = item
        else:
            # Append the `null` unmodified and update the last item.
            acc.append(item)
            last = item
    return pa.array(acc)


def null_delta_decode(data: pa.Array, start: pa.Scalar) -> pa.Array:
    """
    Decode an Arrow array that was delta-encoded *with* nulls.

    This is necessarily a copying operation.

    Parameters
    ----------
    data : pa.Array
        The data to be decoded.
    start : pa.Scalar
        The starting value, an offset

    Returns
    -------
    pa.Array
    """
    acc = []
    # If the first value is `null`,
    if not data[0].is_valid:
        # and the second value is `null`,
        if not data[1].is_valid:
            # then append the `start` value, we started at a non-null value immediately followed by a null pair.
            acc.append(start)
        start = pa.scalar(None, data.type)
    else:
        # otherwise use the starting point
        acc.append(start.as_py())
    last = start
    for item in data:
        # if the current point is valid
        if item.is_valid:
            val = item.as_py()
            # and the last is valid
            if last.is_valid:
                # reconstitute the delta encoded value at this position
                last = pa.scalar(val + last.as_py())
                acc.append(last)
            else:
                # otherwise the last value is assumed to be zero so it does
                # not need to be adjusted
                acc.append(item)
                last = item
        else:
            # otherwise this position is null and we carry it forward as such
            acc.append(item)
            last = item
    return pa.array(acc)

Numpress Linear Encoding

Chunk Encoding Controlled Vocabulary: MS:1002312|MS-Numpress linear prediction compression

This uses the Numpress linear prediction method (10.1074/mcp.O114.037879) to compress the chunk’s values as raw bytes. Numpress creates a buffer containing an 8 byte fixed point, 4 byte value 0, 4 byte value 1, followed by 2 byte residuals for all subsequent values in the array. This means that the array is by definition not alignable to a 4- or 8-byte type. It also has no concept of nullity, which means it is not compatible with null marking.

To store Numpress linear encoded arrays, an extra column named <array_name>_numpress_linear_bytes is added alongside the <array_name>_chunk_values column. It is a list of byte arrays (large_list<u8> in Arrow parlance, not large_binary, see discussion of string type “optimization”). The array index entry for this column MUST have buffer_format of chunk_transform and the same array type, array name, data type, unit, and data processing ID as the _chunk_values column. The transform field in the index MUST be MS:1002312.

Note: The start point is INCLUDED in the chunk values array. It is a specific component of the Numpress encoded bytes.

Opaque Array Transforms

In some cases, we might prefer to store data lossily in non-uniform, unaligned, or otherwise non-standard data types that don’t have a physical representation in Parquet. MS-Numpress’s short logged float (SLOF) and positive integer encodings are good examples of these cases. While the Numpress-Linear chunk encoding works for the coordinate dimension, we also support using other opaque transformations to encode the secondary arrays in chunks. These columns are also recorded in the array index with buffer_format = chunk_transform; the transform field in the index must be the CURIE for the relevant encoding method, such as MS:1002314 for MS-Numpress’s SLOF encoding. The column’s data type MUST be a list of byte arrays, though the type in the array index MUST be the array’s real type after decoding. The column names SHOULD be of the form <array_name>_<transform_name>_bytes, e.g. intensity_numpress_slof_bytes.

How to read a single entry from the chunked encoding

To read a single entry (e.g. spectrum, chromatogram) that is stored in chunks, the following procedure may be used:

  1. Identify which columns are annotated as chunk_start, chunk_end, chunk_encoding, and chunk_values in the array index. The <entity_type>_index column MUST be the first column in the table, so it always has index 0.
  2. Find the row group which contains the entry’s index value by querying the row group-level metadata. Optionally, if the page index is available, find the row ranges for the pages that contain that index.
  3. Read the selected row group (or data page row range) and filter the selected rows to only those whose <entity_type>_index column equals the entry’s index.
  4. Optionally, sort the rows with respect to their chunk_start column’s value in appropriate order, usually ascending, for the quantity being measured.
  5. Process each selected row, decoding its chunk_values column according to the chunk_encoding column and any transform listed in the relevant array index. Unpack the chunk_secondary columns and process any transforms as necessary, accumulating each column’s data arrays across rows. Processing transforms may require additional information from the entity type’s metadata table.
  6. If the entry has additional auxiliary arrays, they must be read from the metadata table and decoded.
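
Steps 3 to 5 of the procedure above can be sketched in plain Python, with a list of dicts standing in for the rows read from the selected row group. This is a minimal illustration under several assumptions: the main axis is m/z, only the MS:1000576 and MS:1003089 chunk encodings appear, the chunks contain no nulls, and intensity is the only chunk_secondary array. Row-group selection (steps 1 and 2) and auxiliary arrays (step 6) are left out. The names decode_chunk_values and read_entry are hypothetical.

```python
from itertools import accumulate

NO_COMPRESSION = "MS:1000576"
DELTA_ENCODING = "MS:1003089"


def decode_chunk_values(start: float, values: list, encoding: str) -> list:
    """Rebuild the main-axis coordinates of one chunk, re-attaching the start point."""
    if encoding == NO_COMPRESSION:
        return [start, *values]
    if encoding == DELTA_ENCODING:
        # Running sum of the deltas, seeded with the chunk's start coordinate
        return list(accumulate(values, initial=start))
    raise ValueError(f"unsupported chunk encoding: {encoding}")


def read_entry(rows: list, spectrum_index: int) -> tuple:
    """Assemble the m/z and intensity arrays of one spectrum from chunk rows."""
    # Step 3: keep only the rows belonging to the requested entry
    selected = [row for row in rows if row["spectrum_index"] == spectrum_index]
    # Step 4: order the chunks along the main axis
    selected.sort(key=lambda row: row["mz_chunk_start"])
    mz, intensity = [], []
    # Step 5: decode each chunk and accumulate the arrays across rows
    for row in selected:
        mz.extend(
            decode_chunk_values(
                row["mz_chunk_start"], row["mz_chunk_values"], row["chunk_encoding"]
            )
        )
        intensity.extend(row["intensity"])
    return mz, intensity
```

Because delta decoding is a running sum of floating-point values, reconstructed coordinates may differ from the originals in the last few bits unless the writer accounts for this when encoding.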

Why all these root nodes?

Couldn’t we just unwrap the top-level struct and move on with things?

Perhaps, but the top-level structure leaves the door open for three use-cases:

  1. Clear schema signaling. When you see point at the root of the schema, you know this is a point layout, not a chunked layout file.
  2. Unaligned proprietary data. A specialized writer or reader might wish to embed other information that is not directly connected to the primary schema’s addressable unit (e.g. a spectrum, a data point), and this leaves open a door for that to be introduced. It is assumed that this is unlikely at this time, but it is a quantum physics universe.
  3. More table packing. Early in mzPeak’s design, we tried to pack tables together as much as possible, as in the packed parallel table layout, but this proved to be very inefficient to write despite being no slower to read. This might have been an implementation detail rather than a limitation of Parquet itself. We don’t want to throw out the opportunity to return to that in the future; keeping the root nodes means doing so would change only how readers reach the tables, rather than requiring a schema-breaking change.

Index File - mzpeak_index.json

An mzPeak archive is made up of multiple named files. To leave room for future files and avoid having to do complicated file name resolution, we use an index file that identifies the contents of each file. This broadly defines the kinds of schemas those files might have. The file MUST be serialized with UTF-8.

TODO: Add wavelength files to examples

{
  "files": [
    {
      "name": "spectra_data.parquet",
      "entity_type": "spectrum",
      "data_kind": "data arrays"
    },
    {
      "name": "spectra_metadata.parquet",
      "entity_type": "spectrum",
      "data_kind": "metadata"
    },
    {
      "name": "chromatograms_data.parquet",
      "entity_type": "chromatogram",
      "data_kind": "data arrays"
    },
    {
      "name": "chromatograms_metadata.parquet",
      "entity_type": "chromatogram",
      "data_kind": "metadata"
    }
  ],
  "metadata": {
    "version": "0.9.0",
    "cv_list": [
      {"id": "MS", "full_name": "Proteomics Standards Initiative Mass Spectrometry Ontology", "uri": "http://purl.obolibrary.org/obo/ms/4.1.248/ms.obo", "version": "4.1.248"},
      {"id": "UO", "full_name": "Units of measurement ontology", "uri": "http://purl.obolibrary.org/obo/uo/releases/2026-01-16/uo.obo", "version": "2026-01-16"}
    ],
    "file_description": {...},
  }
}

Governed by JSONSchema schema/mzpeak_index.json

The data_kind and entity_type fields are loose enumerations. They are expected to grow over time.
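
As a minimal sketch of how a reader might use the index, the snippet below loads mzpeak_index.json from a ZIP-packaged archive and resolves a file name from its entity_type and data_kind. It assumes the index sits at the root of the archive under exactly that name. The function names load_index and find_file are hypothetical, and only the Python standard library is used.

```python
import json
import zipfile
from typing import Optional


def load_index(archive) -> dict:
    """Read mzpeak_index.json from a ZIP-packaged mzPeak archive.

    ``archive`` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open("mzpeak_index.json") as handle:
            # The index MUST be serialized as UTF-8
            return json.loads(handle.read().decode("utf-8"))


def find_file(index: dict, entity_type: str, data_kind: str) -> Optional[str]:
    """Return the name of the first file matching both fields, or None."""
    for entry in index["files"]:
        if entry["entity_type"] == entity_type and entry["data_kind"] == data_kind:
            return entry["name"]
    return None
```

For the example index above, find_file(index, "spectrum", "metadata") would return "spectra_metadata.parquet". An archive stored as an unpacked directory could be handled the same way by opening the JSON file directly.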

File-Level Metadata

The file level metadata SHOULD be stored in mzpeak_index.metadata and the metadata Parquet files’ key-value pairs as JSON-encoded metadata governed by the associated schemas below:

QUESTION: Anything we put in the mzpeak_index.json is necessarily shown in cleartext to all readers unless ZIP encryption is used, but ZIP encryption is well known to be flawed and inconsistent. Anything in a Parquet file’s footer key-value pairs is encryptable. The index is JSON for convenience. We could use Parquet for that too, but it’s overkill and harder for scripting languages to get at.

Data Kind

The data_kind field tells the reader the semantics of the data stored in this file, and approximately what kind of schema to expect.

There are currently 5 controlled values for data_kind:

Any value outside of these is treated as other. Files labeled other are implementation-defined, as are proprietary files, but other files may still be of interest to non-vendor readers.

Adding a new Data Kind

This list is necessarily incomplete as new use cases are likely to emerge. For instance, it might be desirable to store extracted LC-(IM)-MS feature bounding boxes as a separate file.

  1. Pick a name that will fit within the index JSON file. Prefer lower case names, e.g. feature map for extracted features.
  2. Pick a layout or layouts associated with this data kind, e.g. packed parallel table for lists of bounding boxes with associated metadata.
  3. Describe the relationships with valid Entity Types (see below). Prefer simple relationships like one-to-one or one-to-many. If no existing entity type is reasonable, create a new entity type. For example, an LC-MS feature might associate with spectrum, but there isn’t a one-to-one or one-to-many relationship between spectra and LC-MS features, so a new entity type might be needed.

Entity Type

The entity_type tells the reader what is being described in this file, in concert with the data_kind. This helps the reader connect the right file to the right API.

There are currently 3 controlled values for entity_type

Any value outside of these is assumed to be treated as other.

Adding a new Entity Type

TODO: Expand this

Spectrum Signal Data File - spectra_data.parquet

File index entry:

{
  "name": "spectra_data.parquet",
  "entity_type": "spectrum",
  "data_kind": "data arrays"
}

The spectrum signal data is encoded using either point layout or chunked layout. The entity index column MUST be named spectrum_index, and if a time column is written alongside it, it SHOULD be named spectrum_time. Non-mass spectra like UV or DAD spectra should be written in the wavelength_spectra_data.parquet file.

When using null marking, follow the null semantics for signal data with care for profile data.

Only profile spectra should be written to this file. Centroid spectra, or processed centroid views of profile spectra when storing both modes, should be written to the peak data file. The number of points written to this file for a particular spectrum MUST be written to the spectrum.MS_1003060_number_of_data_points column in the spectra_metadata.parquet file to facilitate appropriate reading operation planning.

Recommendations

When selecting a Parquet encoding for columns, favor:

Spectrum Peak Data - spectra_peaks.parquet

File index entry:

{
  "name": "spectra_peaks.parquet",
  "entity_type": "spectrum",
  "data_kind": "peaks"
}

The spectrum peak lists are stored separately from the raw signal stored in spectra_data.parquet. The entity index column MUST be named spectrum_index, and if a time column is written alongside it, it SHOULD be named spectrum_time. Any centroid spectra MUST be written to this file, not spectra_data.parquet. The number of peaks written for a given spectrum in this file MUST be written to the spectrum.MS_1003059_number_of_peaks column in the spectra_metadata.parquet file to facilitate reading operation planning.

Spectrum Metadata - spectra_metadata.parquet

File index entry:

{
  "name": "spectra_metadata.parquet",
  "entity_type": "spectrum",
  "data_kind": "metadata"
}

This metadata table uses the packed parallel metadata table schema. The parallel schemas are shown below. The general order of columns is unspecified, but spectrum.index, scan.source_index, precursor.source_index, and selected_ion.source_index MUST be the first column of their respective schemas. Wherever these lists say MAY, that value may be stored either as a column or as an entry in the parameter list, but a column tends to make more sense if it is usually present.

QUESTION: Is there a better way to make ion mobility storage generic over type (“ion mobility drift time”, “inverse reduced ion mobility”, “FAIMS compensation voltage”)?

Chromatogram Signal Data - chromatograms_data.parquet

File index entry:

{
  "name": "chromatograms_data.parquet",
  "entity_type": "chromatogram",
  "data_kind": "data arrays"
}

The chromatogram signal data is encoded using either point layout or chunked layout. The entity index column MUST be named chromatogram_index. The default primary axis for this type of signal data is a MS:1000595|time array, though the unit is up to the writer.

Recommendations

When selecting a Parquet encoding for columns, favor:

Chromatogram Metadata - chromatograms_metadata.parquet

File index entry:

{
  "name": "chromatograms_metadata.parquet",
  "entity_type": "chromatogram",
  "data_kind": "metadata"
}

Wavelength Spectrum Signal Data - wavelength_spectra_data.parquet

File index entry:

{
  "name": "wavelength_spectra_data.parquet",
  "entity_type": "wavelength spectrum",
  "data_kind": "data arrays"
}

The wavelength spectrum signal data is encoded using either point layout or chunked layout. The entity index column MUST be named wavelength_spectrum_index, and if a time column is written alongside it, it SHOULD be named wavelength_spectrum_time. This SHOULD only be present if wavelength spectra are included in the mzPeak archive.

When using null marking, follow the null semantics for signal data with care for profile data.

Wavelength Spectrum Metadata - wavelength_spectra_metadata.parquet

File index entry:

{
  "name": "wavelength_spectra_metadata.parquet",
  "entity_type": "wavelength spectrum",
  "data_kind": "metadata"
}

This metadata table uses the packed parallel metadata table schema. This should only be present if wavelength spectra are included in the mzPeak archive. The parallel schemas are shown below. The general order of columns is unspecified, but spectrum.index and scan.source_index MUST be the first column of their respective schemas. Wherever these lists say MAY, that value may be stored either as a column or as an entry in the parameter list, but a column tends to make more sense if it is usually present. It mirrors the spectrum metadata layout, but does not include a precursor or selected_ion facet because EMR spectra have not been observed with isolation and fragmentation yet.

This metadata is stored separately from the mass spectra, allowing the two different data modalities to have divergent schemas without inflating the number of empty columns, and for ease of searching, so the reader does not need to sort through mass spectra while looking for EMR spectra and vice-versa.

Authors Information

Joshua A. Klein Boston MA, USA joshua.adam.klein@gmail.com

Tim Van Den Bossche, Ghent University, Ghent, Belgium; VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium Tim.VanDenBossche@ugent.be

Samuel Wein Wissenschaftlicher Mitarbeiter, Institute for Bioinformatics and Medical Informatics, University of Tübingen samuel.wein@uni-tuebingen.de

Oliver Kohlbacher Professor, Applied Bioinformatics, Dept. of Computer Science, University of Tübingen; Director, Institute for Bioinformatics and Medical Informatics, University of Tübingen; Director, Institute for Translational Bioinformatics, University Hospital Tübingen oliver.kohlbacher@uni-tuebingen.de

TODO: Fill in with more people from the mailing list

Contributors

TODO: Fill in with more people from the mailing list

Intellectual Property Statement

The PSI takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the PSI Chair.

The PSI invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this recommendation. Please address the information to the PSI Chair (see contact information at PSI website).

Copyright Notice

Copyright (C) 2026 by the Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) under the CC-BY-ND 4.0 license (https://creativecommons.org/licenses/by-nd/4.0/).

TODO: The mzPeak name is currently held in trust by the OpenMS Inc. The details of the trademark are described here

Glossary

References