SIRIUS (software)

SIRIUS
Developer(s) Böcker Group at FSU Jena & Bright Giant GmbH
Initial release 2009
Stable release 5.8.5 / 8 November 2023
Repository https://github.com/boecker-lab/sirius
Written in Java
Operating system Linux, Windows, macOS
Available in English
Type mass spectrometry, structure elucidation, chemistry, bioinformatics
License GNU Affero General Public License v3.0 for client software, web-services free for non-commercial use, commercial subscription offered by Bright Giant GmbH
Website https://bio.informatik.uni-jena.de/software/sirius/

SIRIUS is Java-based open-source software for the identification of small molecules from fragmentation mass spectrometry data without the use of spectral libraries. It combines the analysis of isotope patterns in MS1 spectra with the analysis of fragmentation patterns in MS2 spectra. SIRIUS is the umbrella application comprising CSI:FingerID, CANOPUS, COSMIC and ZODIAC.


SIRIUS, including its web services for structural elucidation, is freely available to use for academic research. Bright Giant GmbH offers subscription-based access to the SIRIUS web services for commercial users.

SIRIUS is not suitable for analyzing proteomics MS data.

History

The SIRIUS software is developed by the group of Sebastian Böcker at the Friedrich Schiller University Jena, Germany, and since 2019 together with Bright Giant GmbH. SIRIUS development started in 2009 as software for identifying the molecular formula by decomposing high-resolution isotope patterns (also called MS1 data). [1] The name is an acronym resulting from this original purpose: Sum formula Identification by Ranking Isotope patterns Using mass Spectrometry.

In 2008 the group introduced the concept of fragmentation trees [2] for identification of the molecular formula based on fragmentation mass spectrometry data, also called tandem MS or MS2 data. Back then, identification of small molecules was approached by searching in a reference spectral library. [3] Examples of such libraries include MassBank, [4] METLIN, [5] or NIST/EPA/NIH EI-MS Library. [6] However, this is limited to known molecules with available standards that have been measured and put in a reference spectral library. For unknown molecules, identification of the molecular formula is a crucial step. [2] In 2011/2012, the group conceived fragmentation trees as a means of structural elucidation by automatically comparing these fragmentation trees. [7] [8] Fragmentation pattern similarities are strongly correlated with the chemical similarity of molecules. [8] Thus, aligning the fragmentation tree of an unknown molecule to a set of known molecules helps to elucidate its structure. Fragmentation trees were introduced in SIRIUS 2. [7]

Also in 2012, the group of Juho Rousu at the University of Helsinki, Finland, introduced a machine learning method to predict molecular properties from tandem MS data. [9] This concept was combined with the fragmentation tree concept in 2015, resulting in CSI:FingerID, [10] which was introduced in SIRIUS 3. The fragmentation tree is used to predict a molecular fingerprint of the unknown molecule using machine learning, and this fingerprint is in turn used to search a molecular structure database such as PubChem. Molecular structure databases are orders of magnitude larger than reference spectral libraries (PubChem contained ~111 million compounds in 2021, [11] compared to ~50,000 compounds in the NIST Tandem Mass Spectral Library in 2023 [12]). This kind of structure identification refers to the identity and connectivity (with bond multiplicities) of the atoms, but not to stereochemistry information. Elucidation of stereochemistry is currently beyond the power of automated search engines.

SIRIUS 3 also introduced the Graphical User Interface (GUI).

In 2020, in cooperation with the group of Pieter C Dorrestein at UC San Diego, USA, molecular formula identification was improved based on derivative networks from complete biological datasets to rank molecular formula candidates. [13] This method is called ZODIAC and has been integrated into SIRIUS 4. [14]

Also in 2020, in cooperation with Rousu's and Dorrestein's groups, CANOPUS for systematic compound class annotation was introduced to SIRIUS 4. [15]

In 2022, the COSMIC confidence score was added to the CSI:FingerID structure identification workflow in SIRIUS 4, allowing users to determine the trustworthiness of the identification. [16]

Data

SIRIUS uses data from liquid chromatography tandem mass spectrometry (LC-MS/MS). It requires high-resolution, high-mass-accuracy MS1 and MS2 data as input. LC is not mandatory for SIRIUS, but it is often required to separate individual compounds in complex samples.

SIRIUS expects both MS1 and MS2 spectra as input. Omitting the MS1 data is possible, but it makes the analysis more time-consuming and can lead to poorer results.

SIRIUS and CSI:FingerID have been trained on a wide variety of data, including data from different instrument types. Certain aspects of the mass spectra are important to successfully process the data:

Common MS file formats, such as .csv, .ms or .mgf files, can be imported into SIRIUS. SIRIUS can import full LC-MS runs (.mzML) or single compounds. At present, SIRIUS only handles singly charged compounds. [17]

Features

SIRIUS identifies small molecules in a two step approach: [17]

The following algorithms are implemented in SIRIUS:

SIRIUS: Molecular formula identification

SIRIUS is the name of the umbrella application, but (for historic reasons) also the name for the identification of the molecular formula. Molecular formula refers to the elemental composition of the molecule. The mere mass of a molecule is not sufficient to determine the correct molecular formula. [17] Even with very high mass accuracy, many molecular formulas can explain a mass measured in a spectrum, in particular in higher mass regions. In SIRIUS, molecular formula identification is done using isotope pattern analysis on the MS1 data as well as fragmentation tree computation on the MS2 data. The score of a molecular formula candidate is a combination of the isotope pattern score and the fragmentation tree score.

To identify the molecular formula, SIRIUS considers all possible molecular formulas for a set of elements. The elements most abundant in living beings are hydrogen (H), carbon (C), nitrogen (N), oxygen (O), and phosphorus (P). This is the default set of elements in SIRIUS. Some less common elements result in very characteristic isotope pattern changes and can be detected automatically. [20] Detectable elements are sulfur (S), chlorine (Cl), bromine (Br), boron (B) and selenium (Se). The current version of SIRIUS uses a deep neural network for auto-detection of elements from the isotope and fragmentation pattern of the query molecule. [14]

For very large molecules or in case of missing data (e.g., a missing isotope pattern), it is possible to restrict SIRIUS to molecular formulas found in a database, such as PubChem.

Decomposition of mass

To quickly generate a manageable number of molecular formula candidates, the monoisotopic mass is decomposed into all possible molecular formulas that would lead to this mass. There are two definitions of the monoisotopic mass: [21] (1) the sum of the masses of the most abundant naturally occurring stable isotope of each atom (i.e. the highest peak of the isotope pattern), and (2) the sum of the masses of the lightest naturally occurring stable isotope of each atom (i.e. the peak of the isotope pattern with the lowest mass). For small molecules, the lightest peak is usually also the highest peak of the isotope pattern. In the computational context of SIRIUS, the second definition is used.

Decomposing the monoisotopic mass into all possible molecular formulas requires a mass interval that takes into account the measurement inaccuracy of the instrument. This real-valued decomposition is transformed into a problem instance with integer masses by using a blowup factor. The resulting problem is known as the change-making problem, which is well studied and can be solved in runtime linear in the size of the output. [22]
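The integer transformation can be illustrated with a minimal Python sketch. It is not the SIRIUS implementation: the function name `decompose`, the restriction to the elements H, C, N and O, the blowup factor of 1000 and the simple rounding-slack bound are assumptions made for this example, and the enumeration is not the output-linear algorithm cited above. Real masses are rounded to integers, a table records which integer masses are reachable, and every candidate is checked against the real-valued mass interval at the end.

```python
# Monoisotopic masses (Da) of the lightest stable isotope of each element.
MASSES = {"H": 1.00782503223, "C": 12.0, "N": 14.00307400443, "O": 15.99491461957}


def decompose(mass, ppm=5.0, blowup=1000):
    """Return all formulas (element -> count) whose mass is within +/- ppm of `mass`."""
    elements = list(MASSES)
    tol = mass * ppm * 1e-6
    lo, hi = mass - tol, mass + tol

    # Blow up real masses to integers; each atom adds at most 0.5 rounding error,
    # so the integer interval is widened by a slack proportional to the atom count.
    int_masses = [round(MASSES[e] * blowup) for e in elements]
    max_atoms = int(hi / min(MASSES.values())) + 1
    slack = max_atoms // 2 + 1
    int_lo = max(0, int(lo * blowup) - slack)
    int_hi = int(hi * blowup) + 1 + slack

    # reachable[i][m] == 1 if integer mass m is a sum of the first i element masses.
    reachable = [bytearray(int_hi + 1) for _ in range(len(elements) + 1)]
    reachable[0][0] = 1
    for i, a in enumerate(int_masses, start=1):
        prev, cur = reachable[i - 1], reachable[i]
        for m in range(int_hi + 1):
            if prev[m] or (m >= a and cur[m - a]):
                cur[m] = 1

    results = []
    counts = [0] * len(elements)

    def backtrack(i, m):
        a = int_masses[i - 1]
        if i == 1:
            # Only the first element is left: it must explain the remainder exactly.
            if m % a == 0:
                counts[0] = m // a
                real = sum(c * MASSES[e] for c, e in zip(counts, elements))
                if lo <= real <= hi:  # final check against the real-valued interval
                    results.append({e: c for e, c in zip(elements, counts) if c})
            return
        k = 0
        while k * a <= m:
            if reachable[i - 1][m - k * a]:  # prune dead branches using the table
                counts[i - 1] = k
                backtrack(i - 1, m - k * a)
            k += 1

    for target in range(int_lo, int_hi + 1):
        if reachable[len(elements)][target]:
            backtrack(len(elements), target)
    return results
```

Calling `decompose(180.06339)` returns the candidates within 5 ppm of that mass, including the glucose formula C6H12O6. No chemical plausibility filter is applied in this sketch, so the list may also contain formulas that cannot correspond to a real molecule.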

Isotope pattern analysis

Isotope patterns of the candidate molecular formulas are simulated starting with the isotopic distributions of the individual elements, and then combining these distributions by folding. [23] [1]

The simulated isotope pattern is compared with the measured pattern by assigning probabilities to the observed masses and intensities. [1]
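The folding step is a discrete convolution of the per-element isotope distributions. The following Python sketch shows the idea under simplifying assumptions that are not taken from SIRIUS: peaks are indexed by nominal mass offset only (no exact masses or isotopic fine structure), the abundance values are approximate natural abundances, and the names `ISOTOPES`, `fold` and `isotope_pattern` are hypothetical.

```python
# Approximate natural isotope abundances, indexed by nominal mass offset (+0, +1, +2).
ISOTOPES = {
    "H": [0.999885, 0.000115],
    "C": [0.9893, 0.0107],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
}


def fold(p, q, n_peaks):
    """Convolve two distributions, keeping only the first n_peaks entries."""
    out = [0.0] * min(n_peaks, len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j < len(out):
                out[i + j] += pi * qj
    return out


def isotope_pattern(formula, n_peaks=5):
    """Fold the element distributions once per atom of the formula."""
    pattern = [1.0]  # distribution of the empty molecule
    for element, count in formula.items():
        for _ in range(count):
            pattern = fold(pattern, ISOTOPES[element], n_peaks)
    return pattern
```

For example, `isotope_pattern({"C": 6, "H": 12, "O": 6})` gives the relative intensities of the monoisotopic peak and the following peaks for C6H12O6. Truncating after every fold does not change the retained peaks, because higher offsets never contribute to lower ones.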

Fragmentation tree computation

A fragmentation tree is a representation of the fragmentation process similar to “fragmentation diagrams” created by experts. The fragmentation tree annotates the MS2 spectrum by providing a molecular formula for each fragment peak. Peaks that do not receive an annotation are considered noise peaks. The fragmentation tree also predicts the fragmentation reactions (called losses) leading to the fragment peaks. Fragmentation trees are a valuable tool for deducing information about the fragmentation but are not a precise depiction of the actual fragmentation process. [7]

To identify the molecular formula of an unknown molecule, a separate fragmentation tree is computed for every molecular formula candidate. In other words, the method attempts to reconstruct the fragmentation process that led to this MS2 spectrum for each candidate molecular formula. This makes it possible to compare the different hypotheses that a particular candidate is actually the correct molecular formula. The best-scoring fragmentation tree (i.e. the fragmentation process that best explains the spectrum) corresponds to the most likely molecular formula explanation.

ZODIAC: Improved molecular formula identification

ZODIAC improves the ranking of the formula candidates provided by SIRIUS. [13] Organisms produce related metabolites derived from multiple but limited biosynthetic pathways. For a full LC-MS/MS run that is derived from a biological sample or any other set of derivatives, the relation of the metabolites is reflected in their similarity. Those similarities are in turn reflected in joint fragments and losses between the fragmentation trees and can be leveraged to improve molecular formula identification of the individual molecules.

ZODIAC uses the top X molecular formula candidates for each molecule from SIRIUS to build a similarity network, and uses Bayesian statistics to re-rank those candidates. Prior probabilities are derived from fragmentation tree similarity. Finding an optimal solution to the resulting computational problem is NP-hard, therefore Gibbs sampling is used.

ZODIAC stands for ZODIAC: Organic compound Determination by Integral Assignment of elemental Compositions.

CSI:FingerID identifies the structure of a molecule by predicting its molecular fingerprint and using this fingerprint to search in a molecular structure database. [10]

Molecular fingerprints

A molecular fingerprint is a binary vector, where each position corresponds to a specific molecular property. In this representation, a given position X may encode the presence or absence of a particular substructure, with '1' indicating presence and '0' indicating absence. Various types of molecular fingerprints exist, including PubChem CACTVS fingerprints, Klekota-Roth fingerprints, [24] MACCS fingerprints, and Extended-Connectivity Fingerprints (ECFP). [25] A molecular fingerprint can be deterministically computed from a given molecular structure. Different molecular structures may yield the same molecular fingerprint.

Predicting molecular fingerprints

CSI:FingerID predicts a probabilistic fingerprint with a variety of molecular properties from several fingerprint types. The fingerprint is predicted from the given spectrum and its corresponding fragmentation tree using deep kernel learning, [26] [10] which is a combination of kernel methods and deep neural networks. Not only the top scoring molecular formula but multiple high-scoring molecular formula candidates are considered.

Comparing molecular fingerprints

To search in a molecular structure database requires a metric to compare and score the molecular fingerprints. Tanimoto similarity (Jaccard index) is a commonly employed metric. A similarity value of 1 signifies identical fingerprints, while a value of 0 indicates structures that do not share any molecular properties. The calculated similarity value depends on the choice of fingerprint type.
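As a minimal sketch, the Tanimoto similarity of two binary fingerprints can be computed as the number of positions set in both vectors divided by the number of positions set in at least one. The function name `tanimoto` and the convention of returning 1.0 for two all-zero fingerprints are assumptions of this example, not part of SIRIUS.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length binary fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # properties present in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # properties present in at least one
    # Assumed convention: two fingerprints without any set bit count as identical.
    return both / either if either else 1.0
```

For the fingerprints `[1, 1, 0, 1]` and `[1, 0, 0, 1]`, two properties are shared and three are present in at least one of them, so the similarity is 2/3.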

CSI:FingerID employs a logarithmic posterior probability to rank the structure candidates, where scores are represented as negative numbers, and zero is the optimum. [27] This scoring function results in a higher number of correct identifications. [10] Tanimoto similarities are also given.

COSMIC: Identification confidence

The COSMIC confidence score assigns a confidence to CSI:FingerID structure identifications. [16] The idea is similar to False Discovery Rates: All molecules in a large dataset are analysed using CSI:FingerID, the top-ranked hit for each molecule will be evaluated by COSMIC and the most trustworthy identifications can be selected for further analysis. COSMIC does not re-rank structure candidates of a particular molecule nor does it discard any identifications.

COSMIC employs a confidence score that combines E-value estimation and a linear support vector machine (SVM) with enforced directionality. Calibration of CSI:FingerID scores is achieved using E-value estimates. [28] Because generating decoys for small molecule structures is a non-trivial task, candidates in PubChem serve as a proxy for decoys here.

The score distribution is modeled as a mixture distribution of log-normal distributions, and the P-value and E-value of a hit score are estimated using the kernel density estimate of PubChem candidate scores. The SVM is employed to classify whether a hit is correct, utilizing features such as the calibrated score, score differences to other candidates, the total peak intensity explained by the fragmentation tree, and the cardinality of molecular fingerprints. Learning is constrained to a linear SVM to mitigate the risk of overfitting, and the directionality of features is enforced. This involves making upfront decisions about whether high or low values of a feature should enhance the confidence in an identification. For instance, a high CSI:FingerID score of a hit should increase but never decrease the confidence that the hit is correct. Some features necessitate the existence of at least two candidates for comparison, and separate SVMs are trained for single instances. The decision values of the SVM are mapped to posterior probability estimates using Platt scaling. [29] This comprehensive approach ensures a robust and nuanced assessment of the confidence in molecule identifications. [16]
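Platt scaling maps an SVM decision value f to a probability estimate 1 / (1 + exp(A·f + B)), where the parameters A and B are fitted on labelled data. The sketch below illustrates this mapping only. It fits A and B by plain gradient descent on the log-loss, which is a simplification of Platt's original procedure, and the function names, learning rate and step count are assumptions of this example rather than details of COSMIC.

```python
import math


def fit_platt(decision_values, labels, lr=0.1, steps=5000):
    """Fit A and B of P(correct | f) = 1 / (1 + exp(A*f + B)) by gradient descent."""
    a, b = 0.0, 0.0
    n = len(decision_values)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for f, y in zip(decision_values, labels):
            p = 1.0 / (1.0 + math.exp(a * f + b))
            # The derivative of the log-loss with respect to (A*f + B) is (y - p).
            grad_a += (y - p) * f
            grad_b += y - p
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_probability(f, a, b):
    """Map a decision value to a posterior probability estimate."""
    return 1.0 / (1.0 + math.exp(a * f + b))
```

With decision values such as `[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]` and labels `[0, 0, 1, 0, 1, 1]`, the fitted A is negative, so larger decision values map to higher probability estimates.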

CANOPUS: compound class prediction

CANOPUS is short for class assignment and ontology prediction using mass spectrometry. [15] It predicts the compound classes from the molecular fingerprint predicted by CSI:FingerID. This approach is completely database-free, i.e. it is not even limited to molecules that are listed in structure databases.

CANOPUS employs a deep neural network (DNN) [30] to predict 2,497 compound classes. The DNN was trained on 4.10 million compound structures with compound classes assigned by ClassyFire. [31] No MS/MS data was used for training, but instead simulated ‘realistic’ probabilistic fingerprints for the training molecular structures were used. The DNN predicts all compound classes simultaneously.

For full biological datasets, CANOPUS provides a comprehensive overview of compound classes present in the sample and allows for comparisons between different cohorts at compound class level.

Areas of application

Small molecules are essential components found throughout nature, playing a significant role in various fields such as drug discovery, diagnostics, food science, environmental monitoring, and more. Effectively addressing many global challenges hinges on the comprehensive identification of small molecules in complex samples. These complex mixtures contain thousands of different molecules measurable in a single mass spectrometry run.

The identification of unknown small molecules is considered a critical bottleneck in metabolomics, natural product research, and related fields, given that well over 90% of all small molecules remain unknown. [32] [33] Traditionally, analyses were based on targeted approaches that are limited to the rediscovery of known molecules. In contrast, untargeted analysis is a top-down strategy that avoids the need for a prior specific hypothesis on expected small molecules. The focus shifts from asking, "Is molecule X present in the sample?" to "Which (unknown) molecules are present in the sample and might be relevant for downstream analysis?"

SIRIUS is designed for the untargeted structural elucidation of unknown molecules, addressing various challenges:

Examples of application

Limitations

Limitation of the measurement method

Mass spectra alone lack sufficient information to unambiguously identify every molecule. Some molecules produce almost indistinguishable spectra – even more similar than the same molecule measured on two different instruments. [21] Extensive follow-up experiments are required for unambiguous identification.

Consequently, it is impossible to always correctly identify a molecular structure from a mass spectrum alone. Thus, CSI:FingerID, like other methods for structure database search, cannot guarantee finding the correct molecular structure as the first hit. That is why it is important to rank the correct structure very high in an extensive list of candidates and to assess the confidence in the top hit.

Limitation of structure databases

Structure databases are orders of magnitude larger than spectral libraries but still incomplete. [40] It is understood that not every existing biomolecule is or will be contained in structure databases.

For these instances, SIRIUS offers several solutions:

Independent evaluation of the software

CASMI (Critical Assessment of Small Molecule Identification) [41] is an open contest on the identification of small molecules from mass spectrometry data, and was launched in 2012 by Emma Schymanski and Steffen Neumann. [42]

In CASMI 2016, CSI:FingerID and a derivative of CSI:FingerID, in which the Böcker Group was also involved, won first and second place in the category “Best Automatic Structural Identification - In Silico Fragmentation Only”. Also, CSI:FingerID had the best result for ranking the correct molecule structure at position one (70 out of 127, positive mode). [43] [44]

In CASMI 2017, SIRIUS plus CSI:FingerID won in 3 of 4 categories: “Best Structure Identification on Natural Products”, “Best Automatic Structural Identification - In Silico Fragmentation Only”, “Best Automatic Candidate Ranking”. [45]

In CASMI 2022, six out of 16 contestants used SIRIUS in their workflow to identify the best molecular structure candidates. SIRIUS won in the categories “Correct elemental formulas”, “Correct compound structure classes” and “Correct 2D chemical structures”. CASMI 2022 included compounds that were not even contained in PubChem. [46]

Awards and recognition

Sebastian Böcker's group at FSU Jena won the 2022 Thuringian Research Award in the Applied Research category for SIRIUS and the underlying methods. [47] [48]

SIRIUS was recognized as a "method to watch" by Nature Methods in 2020. [49]

Licences

SIRIUS is developed by the group of Sebastian Böcker at FSU Jena in close collaboration with Bright Giant GmbH. SIRIUS is provided as a software-as-a-service solution. The client software is open-source and installed on the users' computers. Molecular formula annotation using fragmentation trees and isotope pattern analysis is performed on the user's local computer and requires no subscription.

The SIRIUS web services for structural elucidation, including molecular fingerprint prediction, structure database search, confidence score assessment and compound class prediction, require a user account. The web services are hosted by FSU Jena and are free for academic and other non-commercial use. Academic institutions are identified by their email domain, and access is granted automatically. In some cases, further validation might be required.

Bright Giant GmbH offers subscription-based access to the SIRIUS web services for structural elucidation for commercial users.

Alternatives

Other algorithms and software for searching in structure databases are CFM-ID, [50] [51] ICEBERG, [52] MetFrag, [53] MS-FINDER, [54] [55] MetaboScape® (Bruker), MassHunter (Agilent) or Compound Discoverer™ (Thermo Fisher Scientific).

See also

Related Research Articles

<span class="mw-page-title-main">Mass spectrometry</span> Analytical technique based on determining mass to charge ratio of ions

Mass spectrometry (MS) is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio. Mass spectrometry is used in many different fields and is applied to pure samples as well as complex mixtures.

<span class="mw-page-title-main">Tandem mass spectrometry</span> Type of mass spectrometry

Tandem mass spectrometry, also known as MS/MS or MS2, is a technique in instrumental analysis where two or more stages of analysis using one or more mass analyzer are performed with an additional reaction step in between these analyses to increase their abilities to analyse chemical samples. A common use of tandem MS is the analysis of biomolecules, such as proteins and peptides.

<span class="mw-page-title-main">Gas chromatography–mass spectrometry</span> Analytical method

Gas chromatography–mass spectrometry (GC–MS) is an analytical method that combines the features of gas-chromatography and mass spectrometry to identify different substances within a test sample. Applications of GC–MS include drug detection, fire investigation, environmental analysis, explosives investigation, food and flavor analysis, and identification of unknown samples, including that of material samples obtained from planet Mars during probe missions as early as the 1970s. GC–MS can also be used in airport security to detect substances in luggage or on human beings. Additionally, it can identify trace elements in materials that were previously thought to have disintegrated beyond identification. Like liquid chromatography–mass spectrometry, it allows analysis and detection even of tiny amounts of a substance.

<span class="mw-page-title-main">Metabolomics</span> Scientific study of chemical processes involving metabolites

Metabolomics is the scientific study of chemical processes involving metabolites, the small molecule substrates, intermediates, and products of cell metabolism. Specifically, metabolomics is the "systematic study of the unique chemical fingerprints that specific cellular processes leave behind", the study of their small-molecule metabolite profiles. The metabolome represents the complete set of metabolites in a biological cell, tissue, organ, or organism, which are the end products of cellular processes. Messenger RNA (mRNA), gene expression data, and proteomic analyses reveal the set of gene products being produced in the cell, data that represents one aspect of cellular function. Conversely, metabolic profiling can give an instantaneous snapshot of the physiology of that cell, and thus, metabolomics provides a direct "functional readout of the physiological state" of an organism. There are indeed quantifiable correlations between the metabolome and the other cellular ensembles, which can be used to predict metabolite abundances in biological samples from, for example mRNA abundances. One of the ultimate challenges of systems biology is to integrate metabolomics with all other -omics information to provide a better understanding of cellular biology.

<span class="mw-page-title-main">Metabolome</span>

The metabolome refers to the complete set of small-molecule chemicals found within a biological sample. The biological sample can be a cell, a cellular organelle, an organ, a tissue, a tissue extract, a biofluid or an entire organism. The small molecule chemicals found in a given metabolome may include both endogenous metabolites that are naturally produced by an organism as well as exogenous chemicals that are not naturally produced by an organism.

<span class="mw-page-title-main">Matrix-assisted laser desorption/ionization</span> Ionization technique

In mass spectrometry, matrix-assisted laser desorption/ionization (MALDI) is an ionization technique that uses a laser energy-absorbing matrix to create ions from large molecules with minimal fragmentation. It has been applied to the analysis of biomolecules and various organic molecules, which tend to be fragile and fragment when ionized by more conventional ionization methods. It is similar in character to electrospray ionization (ESI) in that both techniques are relatively soft ways of obtaining ions of large molecules in the gas phase, though MALDI typically produces far fewer multi-charged ions.

In chemistry, isotopologues are molecules that differ only in their isotopic composition. They have the same chemical formula and bonding arrangement of atoms, but at least one atom has a different number of neutrons than the parent.

Infrared multiple photon dissociation (IRMPD) is a technique used in mass spectrometry to fragment molecules in the gas phase usually for structural analysis of the original (parent) molecule.

A tandem mass tag (TMT) is a chemical label that facilitates sample multiplexing in mass spectrometry (MS)-based quantification and identification of biological macromolecules such as proteins, peptides and nucleic acids. TMT belongs to a family of reagents referred to as isobaric mass tags which are a set of molecules with the same mass, but yield reporter ions of differing mass after fragmentation. The relative ratio of the measured reporter ions represents the relative abundance of the tagged molecule, although ion suppression has a detrimental effect on accuracy. Despite these complications, TMT-based proteomics has been shown to afford higher precision than Label-free quantification. In addition to aiding in protein quantification, TMT tags can also increase the detection sensitivity of certain highly hydrophilic analytes, such as phosphopeptides, in RPLC-MS analyses.

<span class="mw-page-title-main">Mass (mass spectrometry)</span> Physical quantities being measured

The mass recorded by a mass spectrometer can refer to different physical quantities depending on the characteristics of the instrument and the manner in which the mass spectrum is displayed.

<span class="mw-page-title-main">Protein mass spectrometry</span> Application of mass spectrometry

Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. Mass spectrometry is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instrumentations have been developed for its many uses. Its applications include the identification of proteins and their post-translational modifications, the elucidation of protein complexes, their subunits and functional interactions, as well as the global measurement of proteins in proteomics. It can also be used to localize proteins to the various organelles, and determine the interactions between different proteins as well as with membrane lipids.

<span class="mw-page-title-main">Quantitative proteomics</span> Analytical chemistry technique

Quantitative proteomics is an analytical chemistry technique for determining the amount of proteins in a sample. The methods for protein identification are identical to those used in general proteomics, but include quantification as an additional dimension. Rather than just providing lists of proteins identified in a certain sample, quantitative proteomics yields information about the physiological differences between two biological samples. For example, this approach can be used to compare samples from healthy and diseased patients. Quantitative proteomics is mainly performed by two-dimensional gel electrophoresis (2-DE), preparative native PAGE, or mass spectrometry (MS). However, a recent developed method of quantitative dot blot (QDB) analysis is able to measure both the absolute and relative quantity of an individual proteins in the sample in high throughput format, thus open a new direction for proteomic research. In contrast to 2-DE, which requires MS for the downstream protein identification, MS technology can identify and quantify the changes.

<span class="mw-page-title-main">Mass spectral interpretation</span>

Mass spectral interpretation is the method employed to identify the chemical formula, characteristic fragment patterns and possible fragment ions from the mass spectra. Mass spectra is a plot of relative abundance against mass-to-charge ratio. It is commonly used for the identification of organic compounds from electron ionization mass spectrometry. Organic chemists obtain mass spectra of chemical compounds as part of structure elucidation and the analysis is part of many organic chemistry curricula.

<span class="mw-page-title-main">Fragmentation (mass spectrometry)</span>

In mass spectrometry, fragmentation is the dissociation of energetically unstable molecular ions formed from passing the molecules mass spectrum. These reactions are well documented over the decades and fragmentation patterns are useful to determine the molar weight and structural information of unknown molecules. Fragmentation that occurs in tandem mass spectrometry experiments has been a recent focus of research, because this data helps facilitate the identification of molecules.

The METLIN Metabolite and Chemical Entity Database is the largest repository of experimental tandem mass spectrometry and neutral loss data acquired from standards. The tandem mass spectrometry data on over 930,000 molecular standards is provided to facilitate the identification of chemical entities from tandem mass spectrometry experiments. In addition to the identification of known molecules, it is also useful for identifying unknowns using its similarity searching technology. All tandem mass spectrometry data comes from the experimental analysis of standards at multiple collision energies and in both positive and negative ionization modes.

The Yeast Metabolome Database (YMDB) is a comprehensive, high-quality, freely accessible, online database of small molecule metabolites found in or produced by Saccharomyces cerevisiae. The YMDB was designed to facilitate yeast metabolomics research, specifically in the areas of general fermentation as well as wine, beer and fermented food analysis. YMDB supports the identification and characterization of yeast metabolites using NMR spectroscopy, GC-MS spectrometry and Liquid chromatography–mass spectrometry. The YMDB contains two kinds of data: 1) chemical data and 2) molecular biology/biochemistry data. The chemical data includes 2027 metabolite structures with detailed metabolite descriptions along with nearly 4000 NMR, GC-MS and LC/MS spectra.

<span class="mw-page-title-main">Secondary electrospray ionization</span>

Secondary electro-spray ionization (SESI) is an ambient ionization technique for the analysis of trace concentrations of vapors, where a nano-electrospray produces charging agents that collide with the analyte molecules directly in gas-phase. In the subsequent reaction, the charge is transferred and vapors get ionized, most molecules get protonated and deprotonated. SESI works in combination with mass spectrometry or ion-mobility spectrometry.

Within the environmental sciences, screening broadly refers to a set of analytical techniques used to monitor levels of potentially hazardous organic compounds in the environment, particularly in tandem with mass spectrometry techniques. Such screening techniques are typically classified as either targeted, where compounds of interest are chosen before the analysis begins, or non-targeted, where compounds of interest are chosen at a later stage of the analysis. These two techniques can be organized into at least three approaches: target screening, using reference standards that are analogous to the target compound; suspect screening, which uses a library of cataloged data such as exact mass, isotope patterns, and chromatographic retention times in lieu of reference standards; and non-target screening, using no pre-existing knowledge for comparison before analysis. As such, target screening is most useful when monitoring the presence of specific organic compounds—particularly for regulatory purposes—which requires higher selectivity and sensitivity. When the number of detected compounds and associated metabolites needs to be maximized for discovering new or emerging environmental trends or biomarkers for disease, a more non-targeted approach has traditionally been used. However, the rapid improvement of mass spectrometers into more high-resolution forms, with increased sensitivity, has made suspect and non-target screening more attractive, either as stand-alone approaches or in conjunction with more targeted methods.
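
The suspect screening approach described above can be illustrated with a minimal Python sketch: each observed m/z value is compared against a suspect list of exact masses within a parts-per-million tolerance. The function names, the 5 ppm tolerance and the suspect list are assumptions chosen for illustration (the [M+H]+ values are approximate); real workflows would additionally use isotope patterns and retention times.

```python
from typing import Dict, List, Tuple


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error of an observed m/z relative to a theoretical m/z, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def suspect_screen(
    observed_mzs: List[float],
    suspects: Dict[str, float],
    tolerance_ppm: float = 5.0,  # assumed tolerance, not taken from the text
) -> List[Tuple[float, str, float]]:
    """Return (observed m/z, suspect name, ppm error) for every match within tolerance."""
    hits = []
    for mz in observed_mzs:
        for name, ref_mz in suspects.items():
            err = ppm_error(mz, ref_mz)
            if abs(err) <= tolerance_ppm:
                hits.append((mz, name, err))
    return hits


# Hypothetical suspect list: approximate [M+H]+ m/z values, for illustration only.
SUSPECTS = {
    "caffeine": 195.0877,
    "atrazine": 216.1010,
    "carbamazepine": 237.1022,
}

if __name__ == "__main__":
    peaks = [195.0880, 300.0000, 237.1015]  # made-up observed peaks
    for mz, name, err in suspect_screen(peaks, SUSPECTS):
        print(f"{mz:.4f} matches {name} ({err:+.2f} ppm)")
```

A match on exact mass alone is only a tentative annotation; this is why suspect screening also draws on the other cataloged data mentioned above.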

Emma Schymanski

Emma Schymanski is a chemist known for her work identifying unknown organic compounds, particularly pollutants, and is an advocate for open science.

References

  1. Böcker, Sebastian; Letzel, Matthias C.; Lipták, Zsuzsanna; Pervukhin, Anton (15 January 2009). "SIRIUS: decomposing isotope patterns for metabolite identification". Bioinformatics. 25 (2): 218–224. doi:10.1093/bioinformatics/btn603. PMC   2639009 . PMID   19015140.
  2. Böcker, Sebastian; Rasche, Florian (15 August 2008). "Towards de novo identification of metabolites by analyzing tandem mass spectra". Bioinformatics. 24 (16): i49–i55. doi:10.1093/bioinformatics/btn270. PMID   18689839.
  3. Scheubert, Kerstin; Hufsky, Franziska; Böcker, Sebastian (December 2013). "Computational mass spectrometry for small molecules". Journal of Cheminformatics. 5 (1): 12. doi: 10.1186/1758-2946-5-12 . PMC   3648359 . PMID   23453222.
  4. Horai, Hisayuki; Arita, Masanori; Kanaya, Shigehiko; Nihei, Yoshito; Ikeda, Tasuku; Suwa, Kazuhiro; Ojima, Yuya; Tanaka, Kenichi; Tanaka, Satoshi; Aoshima, Ken; Oda, Yoshiya; Kakazu, Yuji; Kusano, Miyako; Tohge, Takayuki; Matsuda, Fumio; Sawada, Yuji; Hirai, Masami Yokota; Nakanishi, Hiroki; Ikeda, Kazutaka; Akimoto, Naoshige; Maoka, Takashi; Takahashi, Hiroki; Ara, Takeshi; Sakurai, Nozomu; Suzuki, Hideyuki; Shibata, Daisuke; Neumann, Steffen; Iida, Takashi; Tanaka, Ken; Funatsu, Kimito; Matsuura, Fumito; Soga, Tomoyoshi; Taguchi, Ryo; Saito, Kazuki; Nishioka, Takaaki (July 2010). "MassBank: a public repository for sharing mass spectral data for life sciences". Journal of Mass Spectrometry. 45 (7): 703–714. Bibcode:2010JMSp...45..703H. doi: 10.1002/jms.1777 . PMID   20623627.
  5. Smith, Colin A; O'Maille, Grace; Want, Elizabeth J; Qin, Chuan; Trauger, Sunia A; Brandon, Theodore R; Custodio, Darlene E; Abagyan, Ruben; Siuzdak, Gary (December 2005). "METLIN: A Metabolite Mass Spectral Database". Therapeutic Drug Monitoring. 27 (6): 747–751. doi:10.1097/01.ftd.0000179845.53213.39. PMID   16404815. S2CID   14774455.
  6. "Mass Spectrometry Data Center, NIST". chemdata.nist.gov.
  7. Rasche, Florian; Svatoš, Aleš; Maddula, Ravi Kumar; Böttcher, Christoph; Böcker, Sebastian (15 February 2011). "Computing Fragmentation Trees from Tandem Mass Spectrometry Data". Analytical Chemistry. 83 (4): 1243–1251. doi:10.1021/ac101825k. PMID   21182243.
  8. Rasche, Florian; Scheubert, Kerstin; Hufsky, Franziska; Zichner, Thomas; Kai, Marco; Svatoš, Aleš; Böcker, Sebastian (3 April 2012). "Identifying the Unknowns by Aligning Fragmentation Trees". Analytical Chemistry. 84 (7): 3417–3426. doi:10.1021/ac300304u. PMID   22390817.
  9. Heinonen, Markus; Shen, Huibin; Zamboni, Nicola; Rousu, Juho (15 September 2012). "Metabolite identification and molecular fingerprint prediction through machine learning". Bioinformatics. 28 (18): 2333–2341. doi:10.1093/bioinformatics/bts437. hdl: 20.500.11850/55584 . PMID   22815355.
  10. Dührkop, Kai; Shen, Huibin; Meusel, Marvin; Rousu, Juho; Böcker, Sebastian (13 October 2015). "Searching molecular structure databases with tandem mass spectra using CSI:FingerID". Proceedings of the National Academy of Sciences. 112 (41): 12580–12585. Bibcode:2015PNAS..11212580D. doi: 10.1073/pnas.1509788112 . PMC   4611636 . PMID   26392543.
  11. Kim, Sunghwan; Chen, Jie; Cheng, Tiejun; Gindulyte, Asta; He, Jia; He, Siqian; Li, Qingliang; Shoemaker, Benjamin A; Thiessen, Paul A; Yu, Bo; Zaslavsky, Leonid; Zhang, Jian; Bolton, Evan E (8 January 2021). "PubChem in 2021: new data content and improved web interfaces". Nucleic Acids Research. 49 (D1): D1388–D1395. doi:10.1093/nar/gkaa971. PMC   7778930 . PMID   33151290.
  12. "2023 Release of the NIST EI and Tandem Libraries" (PDF). National Institute of Standards and Technology (NIST). Retrieved 12 January 2023.
  13. Ludwig, Marcus; Nothias, Louis-Félix; Dührkop, Kai; Koester, Irina; Fleischauer, Markus; Hoffmann, Martin A.; Petras, Daniel; Vargas, Fernando; Morsy, Mustafa; Aluwihare, Lihini; Dorrestein, Pieter C.; Böcker, Sebastian (13 October 2020). "Database-independent molecular formula annotation using Gibbs sampling through ZODIAC". Nature Machine Intelligence. 2 (10): 629–641. doi:10.1038/s42256-020-00234-6.
  14. Dührkop, Kai; Fleischauer, Markus; Ludwig, Marcus; Aksenov, Alexander A.; Melnik, Alexey V.; Meusel, Marvin; Dorrestein, Pieter C.; Rousu, Juho; Böcker, Sebastian (April 2019). "SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information". Nature Methods. 16 (4): 299–302. doi:10.1038/s41592-019-0344-8. PMID   30886413. S2CID   81985235.
  15. Dührkop, Kai; Nothias, Louis-Félix; Fleischauer, Markus; Reher, Raphael; Ludwig, Marcus; Hoffmann, Martin A.; Petras, Daniel; Gerwick, William H.; Rousu, Juho; Dorrestein, Pieter C.; Böcker, Sebastian (April 2021). "Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra". Nature Biotechnology. 39 (4): 462–471. doi:10.1038/s41587-020-0740-8. PMID   33230292.
  16. Hoffmann, Martin A.; Nothias, Louis-Félix; Ludwig, Marcus; Fleischauer, Markus; Gentry, Emily C.; Witting, Michael; Dorrestein, Pieter C.; Dührkop, Kai; Böcker, Sebastian (March 2022). "High-confidence structural annotation of metabolites absent from spectral libraries". Nature Biotechnology. 40 (3): 411–421. doi:10.1038/s41587-021-01045-9. PMC   8926923 . PMID   34650271.
  17. Ludwig, Marcus; Fleischauer, Markus; Dührkop, Kai; Hoffmann, Martin A.; Böcker, Sebastian (2020). "De Novo Molecular Formula Annotation and Structure Elucidation Using SIRIUS 4". Computational Methods and Data Analysis for Metabolomics. Methods in Molecular Biology. Vol. 2104. pp. 185–207. doi:10.1007/978-1-0716-0239-3_11. ISBN   978-1-0716-0238-6. PMID   31953819. S2CID   210709539.
  18. Röst, Hannes L; Sachsenberg, Timo; Aiche, Stephan; Bielow, Chris; Weisser, Hendrik; Aicheler, Fabian; Andreotti, Sandro; Ehrlich, Hans-Christian; Gutenbrunner, Petra; Kenar, Erhan; Liang, Xiao; Nahnsen, Sven; Nilse, Lars; Pfeuffer, Julianus; Rosenberger, George; Rurik, Marc; Schmitt, Uwe; Veit, Johannes; Walzer, Mathias; Wojnar, David; Wolski, Witold E; Schilling, Oliver; Choudhary, Jyoti S; Malmström, Lars; Aebersold, Ruedi; Reinert, Knut; Kohlbacher, Oliver (September 2016). "OpenMS: a flexible open-source software platform for mass spectrometry data analysis" (PDF). Nature Methods. 13 (9): 741–748. doi:10.1038/nmeth.3959. PMID   27575624. S2CID   873670.
  19. Schmid, Robin; Heuckeroth, Steffen; Korf, Ansgar; Smirnov, Aleksandr; Myers, Owen; Dyrlund, Thomas S.; Bushuiev, Roman; Murray, Kevin J.; Hoffmann, Nils; Lu, Miaoshan; Sarvepalli, Abinesh; Zhang, Zheng; Fleischauer, Markus; Dührkop, Kai; Wesner, Mark; Hoogstra, Shawn J.; Rudt, Edward; Mokshyna, Olena; Brungs, Corinna; Ponomarov, Kirill; Mutabdžija, Lana; Damiani, Tito; Pudney, Chris J.; Earll, Mark; Helmer, Patrick O.; Fallon, Timothy R.; Schulze, Tobias; Rivas-Ubach, Albert; Bilbao, Aivett; Richter, Henning; Nothias, Louis-Félix; Wang, Mingxun; Orešič, Matej; Weng, Jing-Ke; Böcker, Sebastian; Jeibmann, Astrid; Hayen, Heiko; Karst, Uwe; Dorrestein, Pieter C.; Petras, Daniel; Du, Xiuxia; Pluskal, Tomáš (April 2023). "Integrative analysis of multimodal mass spectrometry data in MZmine 3". Nature Biotechnology. 41 (4): 447–449. doi:10.1038/s41587-023-01690-2. PMC   10496610 . PMID   36859716.
  20. Meusel, Marvin; Hufsky, Franziska; Panter, Fabian; Krug, Daniel; Müller, Rolf; Böcker, Sebastian (2 August 2016). "Predicting the Presence of Uncommon Elements in Unknown Biomolecules from Isotope Patterns". Analytical Chemistry. 88 (15): 7556–7566. doi:10.1021/acs.analchem.6b01015. PMID   27398867.
  21. Böcker, Sebastian (29 April 2022). Algorithmic Mass Spectrometry (PDF) (Version 0.8.4 ed.). Retrieved 12 January 2024.
  22. Bocker, Sebastian; Liptak, Zsuzsanna (August 2007). "A Fast and Simple Algorithm for the Money Changing Problem". Algorithmica. 48 (4): 413–432. doi:10.1007/s00453-007-0162-8. S2CID   17652643.
  23. Kubinyi, Hugo (June 1991). "Calculation of isotope distributions in mass spectrometry. A trivial solution for a non-trivial problem". Analytica Chimica Acta. 247 (1): 107–119. Bibcode:1991AcAC..247..107K. doi:10.1016/S0003-2670(00)83059-7.
  24. Klekota, Justin; Roth, Frederick P. (1 November 2008). "Chemical substructures that enrich for biological activity". Bioinformatics. 24 (21): 2518–2525. doi:10.1093/bioinformatics/btn479. PMC   2732283 . PMID   18784118.
  25. Rogers, David; Hahn, Mathew (24 May 2010). "Extended-Connectivity Fingerprints". Journal of Chemical Information and Modeling. 50 (5): 742–754. doi:10.1021/ci100050t. PMID   20426451.
  26. Dührkop, Kai (24 June 2022). "Deep kernel learning improves molecular fingerprint prediction from tandem mass spectra". Bioinformatics. 38 (Supplement_1): i342–i349. doi:10.1093/bioinformatics/btac260. PMC   9235503 . PMID   35758813.
  27. Ludwig, Marcus; Dührkop, Kai; Böcker, Sebastian (1 July 2018). "Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints". Bioinformatics. 34 (13): i333–i340. doi:10.1093/bioinformatics/bty245. PMC   6022630 . PMID   29949965.
  28. Keich, Uri; Noble, William Stafford (6 February 2015). "On the Importance of Well-Calibrated Scores for Identifying Shotgun Proteomics Spectra". Journal of Proteome Research. 14 (2): 1147–1160. doi:10.1021/pr5010983. PMC   4324453 . PMID   25482958.
  29. Platt, John C. (29 September 2000). "Probabilities for SV Machines". Advances in Large-Margin Classifiers: 61–74. doi:10.7551/mitpress/1113.003.0008. ISBN   978-0-262-28397-7.
  30. LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (28 May 2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID   26017442. S2CID   3074096.
  31. Djoumbou Feunang, Yannick; Eisner, Roman; Knox, Craig; Chepelev, Leonid; Hastings, Janna; Owen, Gareth; Fahy, Eoin; Steinbeck, Christoph; Subramanian, Shankar; Bolton, Evan; Greiner, Russell; Wishart, David S. (December 2016). "ClassyFire: automated chemical classification with a comprehensive, computable taxonomy". Journal of Cheminformatics. 8 (1): 61. doi: 10.1186/s13321-016-0174-y . PMC   5096306 . PMID   27867422.
  32. da Silva, Ricardo R.; Dorrestein, Pieter C.; Quinn, Robert A. (13 October 2015). "Illuminating the dark matter in metabolomics". Proceedings of the National Academy of Sciences. 112 (41): 12549–12550. doi: 10.1073/pnas.1516878112 . PMC   4611607 . PMID   26430243.
  33. Hulleman, Tobias; Turkina, Viktoriia; O’Brien, Jake W.; Chojnacka, Aleksandra; Thomas, Kevin V.; Samanipour, Saer (26 September 2023). "Critical Assessment of the Chemical Space Covered by LC–HRMS Non-Targeted Analysis". Environmental Science & Technology. 57 (38): 14101–14112. Bibcode:2023EnST...5714101H. doi:10.1021/acs.est.3c03606. PMC   10537454 . PMID   37704971.
  34. Ottosson, Filip; Russo, Francesco; Abrahamsson, Anna; MacSween, Nadia; Courraud, Julie; Nielsen, Zaki Krag; Hougaard, David M.; Cohen, Arieh S.; Ernst, Madeleine (5 April 2023). "Effects of Long-Term Storage on the Biobanked Neonatal Dried Blood Spot Metabolome". Journal of the American Society for Mass Spectrometry. 34 (4): 685–694. doi:10.1021/jasms.2c00358. PMC   10080689 . PMID   36913955.
  35. Le Loarer, Alexandre; Marcellin-Gros, Rémy; Dufossé, Laurent; Bignon, Jérôme; Frédérich, Michel; Ledoux, Allison; Queiroz, Emerson Ferreira; Wolfender, Jean-Luc; Gauvin-Bialecki, Anne; Fouillaud, Mireille (8 March 2023). "Prioritization of Microorganisms Isolated from the Indian Ocean Sponge Scopalina hapalia Based on Metabolomic Diversity and Biological Activity for the Discovery of Natural Products". Microorganisms. 11 (3): 697. doi: 10.3390/microorganisms11030697 . PMC   10057949 . PMID   36985270.
  36. Weber, Ronja; Streckenbach, Bettina; Welti, Lara; Inci, Demet; Kohler, Malcolm; Perkins, Nathan; Zenobi, Renato; Micic, Srdjan; Moeller, Alexander (31 March 2023). "Online breath analysis with SESI/HRMS for metabolic signatures in children with allergic asthma". Frontiers in Molecular Biosciences. 10. doi: 10.3389/fmolb.2023.1154536 . PMC   10102578 . PMID   37065443.
  37. Li, Xianjiang; Tu, Mengling; Yang, Bingxin; Ma, Wen; Li, Hongmei (October 2023). "Structurally related impurity profiling of thiacloprid by orbitrap and de novo identification tool". Microchemical Journal. 193: 109123. doi:10.1016/j.microc.2023.109123. S2CID   260123222.
  38. Uzi-Gavrilov, S; Tik, Z; Sabti, O; Meijler, MM (17 July 2023). "Chemical Modification of a Bacterial Siderophore by a Competitor in Dual-Species Biofilms". Angewandte Chemie (International ed. In English). 62 (29): e202300585. doi: 10.1002/anie.202300585 . PMID   37211536.
  39. Li, Min; Mao, Junhong; Diaz, Isabel; Kopylova, Evguenia; Melnik, Alexey V.; Aksenov, Alexander A.; Tipton, Craig D.; Soliman, Nadia; Morgan, Andrea M.; Boyd, Thomas (18 July 2023). "Multi-omic approach to decipher the impact of skincare products with pre/postbiotics on skin microbiome and metabolome". Frontiers in Medicine. 10. doi: 10.3389/fmed.2023.1165980 . PMC   10392128 . PMID   37534320.
  40. Hufsky, Franziska; Böcker, Sebastian (September 2017). "Mining molecular structure databases: Identification of small molecules based on fragmentation mass spectrometry data". Mass Spectrometry Reviews. 36 (5): 624–633. Bibcode:2017MSRv...36..624H. doi:10.1002/mas.21489. PMID   26763615.
  41. "Critical Assessment of Small Molecule Identification" . Retrieved 12 January 2023.
  42. Schymanski, Emma; Neumann, Steffen (25 June 2013). "The Critical Assessment of Small Molecule Identification (CASMI): Challenges and Solutions". Metabolites. 3 (3): 517–538. doi: 10.3390/metabo3030517 . PMC   3901296 . PMID   24958137.
  43. Schymanski, Emma L.; Ruttkies, Christoph; Krauss, Martin; Brouard, Céline; Kind, Tobias; Dührkop, Kai; Allen, Felicity; Vaniya, Arpana; Verdegem, Dries; Böcker, Sebastian; Rousu, Juho; Shen, Huibin; Tsugawa, Hiroshi; Sajed, Tanvir; Fiehn, Oliver; Ghesquière, Bart; Neumann, Steffen (December 2017). "Critical Assessment of Small Molecule Identification 2016: automated methods". Journal of Cheminformatics. 9 (1): 22. doi: 10.1186/s13321-017-0207-1 . PMC   5368104 . PMID   29086042.
  44. "CASMI 2016 Results" . Retrieved 12 January 2023.
  45. "CASMI 2017 Results" . Retrieved 12 January 2023.
  46. "CASMI 2022 Results" . Retrieved 12 January 2023.
  47. "Thüringer Forschungspreis 2022". YouTube. Thüringer Wirtschafts- & Wissenschaftsministerium. Retrieved 12 January 2023.
  48. Schönfelder, Ute (6 April 2022). "Artificial Intelligence identifies small molecules: Bioinformatics team awarded 2022 Thuringian Research Prize in the category Applied Research". Friedrich Schiller University Jena. Retrieved 12 January 2023.
  49. Singh, Arunima (January 2020). "Tools for metabolomics". Nature Methods. 17 (1): 24. doi:10.1038/s41592-019-0710-6. PMID   31907484.
  50. Allen, Felicity; Pon, Allison; Wilson, Michael; Greiner, Russ; Wishart, David (1 July 2014). "CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra". Nucleic Acids Research. 42 (W1): W94–W99. doi:10.1093/nar/gku436. PMC   4086103 . PMID   24895432.
  51. Wang, Fei; Allen, Dana; Tian, Siyang; Oler, Eponine; Gautam, Vasuk; Greiner, Russell; Metz, Thomas O.; Wishart, David S. (5 July 2022). "CFM-ID 4.0 - a web server for accurate MS-based metabolite identification". Nucleic Acids Research. 50 (W1): W165–W174. doi:10.1093/nar/gkac383. PMC   9252813 . PMID   35610037.
  52. Goldman, Samuel; Li, Janet; Coley, Connor W. (2023). "Generating Molecular Fragmentation Graphs with Autoregressive Neural Networks". arXiv: 2304.13136 [q-bio.QM].
  53. Ruttkies, Christoph; Schymanski, Emma L.; Wolf, Sebastian; Hollender, Juliane; Neumann, Steffen (December 2016). "MetFrag relaunched: incorporating strategies beyond in silico fragmentation". Journal of Cheminformatics. 8 (1): 3. doi: 10.1186/s13321-016-0115-9 . PMC   4732001 . PMID   26834843.
  54. Tsugawa, Hiroshi; Kind, Tobias; Nakabayashi, Ryo; Yukihira, Daichi; Tanaka, Wataru; Cajka, Tomas; Saito, Kazuki; Fiehn, Oliver; Arita, Masanori (16 August 2016). "Hydrogen Rearrangement Rules: Computational MS/MS Fragmentation and Structure Elucidation Using MS-FINDER Software". Analytical Chemistry. 88 (16): 7946–7958. doi:10.1021/acs.analchem.6b00770. PMC   7063832 . PMID   27419259.
  55. Lai, Zijuan; Tsugawa, Hiroshi; Wohlgemuth, Gert; Mehta, Sajjan; Mueller, Matthew; Zheng, Yuxuan; Ogiwara, Atsushi; Meissen, John; Showalter, Megan; Takeuchi, Kohei; Kind, Tobias; Beal, Peter; Arita, Masanori; Fiehn, Oliver (January 2018). "Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics". Nature Methods. 15 (1): 53–56. doi:10.1038/nmeth.4512. PMC   6358022 . PMID   29176591.