Chemical file format

A chemical file format is a type of data file used specifically for depicting molecular data. One of the most widely used families is the chemical table file format, which includes Structure Data Format (SDF) files. These are text files that represent multiple chemical structure records and associated data fields. The XYZ file format is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and Cartesian coordinates. The Protein Data Bank Format is commonly used for proteins but is also used for other types of molecules. Many other formats exist and are detailed below. Various software systems are available to convert from one format to another.

Distinguishing formats

Chemical information is usually provided as files or streams and many formats have been created, with varying degrees of documentation. The format is indicated in three ways:
(see § The Chemical MIME Project)

Chemical Markup Language

Chemical Markup Language (CML) is an open standard for representing molecular and other chemical data. The open source project includes XML Schema, source code for parsing and working with CML data, and an active community. The articles Tools for Working with Chemical Markup Language and XML for Chemistry and Biosciences discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, XDrawChem and MarvinView.

Protein Data Bank Format

The Protein Data Bank Format is commonly used for proteins but it can be used for other types of molecules as well. It was originally designed as, and continues to be, a fixed-column-width format and thus officially has a built-in maximum number of atoms, of residues, and of chains; this resulted in splitting very large structures such as ribosomes into multiple files. For example, the E. coli 70S ribosome was represented as 4 PDB files in 2009: 3I1M, 3I1N, 3I1O and 3I1P. In 2014 they were consolidated into a single file, 4V6C. Many tools can nevertheless read files that exceed the official limits.

Some PDB files contain an optional section describing atom connectivity as well as position. Because these files are sometimes used to describe macromolecular assemblies or molecules represented in explicit solvent, they can grow very large and are often compressed. Some tools, such as Jmol and KiNG, [1] can read PDB files in gzipped format. The wwPDB maintains the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in PDB format specification (to version 3.0) in August 2007, and a remediation of many file problems in the existing database. [2] The typical file extension for a PDB file is .pdb, although some older files use .ent or .brk. Some molecular modeling tools write nonstandard PDB-style files that adapt the basic format to their own needs.
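Because the format is fixed-column-width, a coordinate record can be read by slicing each line at known character positions rather than by splitting on whitespace. The following is a minimal sketch, assuming the column layout of ATOM and HETATM records as published in the wwPDB format specification; the function name parse_pdb_atom_line and the dictionary keys are illustrative, not part of any library.

```python
def parse_pdb_atom_line(line):
    """Parse one ATOM/HETATM record by fixed column positions.

    Column ranges follow the wwPDB format description (1-based columns
    converted to 0-based Python slices). Returns None for other records.
    """
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "element": line[76:78].strip(),   # columns 77-78
    }
```

Slicing by position is what lets a reader cope with adjacent fields that run together without a separating space, which whitespace splitting would misread.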

GROMACS format

The GROMACS file format family was created for use with the molecular simulation software package GROMACS. It closely resembles the PDB format but was designed for storing output from molecular dynamics simulations, so it allows for additional numerical precision and optionally retains information about particle velocity as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is .gro.

CHARMM format

The CHARMM molecular dynamics package [3] can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF (protein structure file) are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are .crd and .psf respectively.

GSD format

The General Simulation Data (GSD) file format was created for efficient reading and writing of generic particle simulations, primarily, but not exclusively, those from HOOMD-blue. The package also contains a Python module that reads and writes HOOMD schema GSD files with an easy-to-use syntax.

Ghemical file format

The Ghemical software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag (!Header, !Info, !Atoms, !Bonds, !Coord, !PartialCharges and !End).
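The tag-delimited layout can be illustrated with a small section splitter. This is only a sketch: it assumes that every section starts on a line whose first token is one of the tags named above and that !End terminates the file. The function name split_gpr_sections is hypothetical, and the contents of each section are not interpreted.

```python
def split_gpr_sections(text):
    """Group the lines of a GPR-style file by their leading '!' tag.

    Returns a dict mapping each tag (e.g. "!Atoms") to the list of lines
    in that section, starting with the tag line itself. Assumes tags
    appear at the start of a line; section contents are left unparsed.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("!"):
            tag = line.split()[0]
            current = sections.setdefault(tag, [])
            current.append(line)
            if tag == "!End":
                break  # nothing after !End belongs to the file body
        elif current is not None and line.strip():
            current.append(line)
    return sections
```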

The proposed MIME type for this format is application/x-ghemical.

SYBYL Line Notation

SYBYL Line Notation (SLN) is a chemical line notation. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of Markush structure queries. The syntax also supports the specification of combinatorial libraries of ChemDraw.

Example SLNs

Description | SLN string
Benzene | C[1]H:CH:CH:CH:CH:CH:@1
Alanine | NH2C[s=n]H(CH3)C(=O)OH
Query showing R sidechain | R1[hac>1]C[1]:C:C:C:C:C:@1
Query for amide/sulfamide | NHC=M1{M1:O,S}

SMILES

The simplified molecular input line entry system, or SMILES, [4] is a line notation for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.

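Hydrogen atoms are normally left implicit rather than written out. Other atoms are represented by their element symbols, such as B, C, N, O, F, P, S, Cl, Br, and I; lowercase symbols, as in c1ccccc1, denote aromatic atoms. The symbol = represents double bonds and # represents triple bonds. Branching is indicated by ( ). Rings are indicated by pairs of matching digits that mark where a ring opens and closes.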

Some examples are:

Name | Formula | SMILES string
Methane | CH4 | C
Ethanol | C2H6O | CCO
Benzene | C6H6 | C1=CC=CC=C1 or c1ccccc1
Ethylene | C2H4 | C=C
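To make the notation concrete, the sketch below counts the non-hydrogen atoms in a SMILES string. It is deliberately limited: it assumes the string uses only the unbracketed element symbols listed above (plus their lowercase aromatic forms) and it does not handle bracket atoms, charges or isotopes. The name count_heavy_atoms is illustrative; a real application would use a cheminformatics toolkit instead.

```python
import re
from collections import Counter

# Two-letter symbols are listed first so that "Cl" is not read as "C".
_ATOM_TOKEN = re.compile(r"Cl|Br|[BCNOFPSI]|[bcnops]")


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms in a simple (bracket-free) SMILES string."""
    counts = Counter()
    for token in _ATOM_TOKEN.findall(smiles):
        # Aromatic atoms are written in lowercase; normalise to the element symbol.
        counts[token.capitalize()] += 1
    return counts
```

Bond symbols, branch parentheses and ring-closure digits are skipped by the pattern, so both spellings of benzene in the table give the same count of six carbon atoms.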

XYZ

The XYZ file format is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and cartesian coordinates.
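The layout described above is simple enough to read in a few lines. The following is a minimal sketch, assuming a single-structure file with whitespace-separated fields; the function name parse_xyz is illustrative. The first field of each atom line is kept as a string, so it works whether the file uses atomic symbols or atomic numbers.

```python
def parse_xyz(text):
    """Parse a single-structure XYZ file given as a string.

    Returns (comment, atoms), where atoms is a list of
    (label, x, y, z) tuples with float coordinates.
    """
    lines = text.splitlines()
    n_atoms = int(lines[0].split()[0])  # line 1: number of atoms
    comment = lines[1]                  # line 2: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        label, x, y, z = line.split()[:4]
        atoms.append((label, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count on line 1 does not match the atom lines")
    return comment, atoms
```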

MDL number

The MDL number contains a unique identification number for each reaction and variation. The format is RXXXnnnnnnnn. R indicates a reaction, XXX indicates which database contains the reaction record. The numeric portion, nnnnnnnn, is an 8-digit number.
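The RXXXnnnnnnnn pattern can be checked with a regular expression. This is a sketch under one assumption not stated above: that the three-character database code consists of uppercase letters or digits. The function name split_mdl_number is hypothetical.

```python
import re

# "R", a 3-character database code (assumed uppercase letters or digits),
# then exactly 8 digits.
_MDL_NUMBER = re.compile(r"R([A-Z0-9]{3})(\d{8})")


def split_mdl_number(value):
    """Return (database_code, record_number) or None if the format does not match."""
    match = _MDL_NUMBER.fullmatch(value)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```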

Other common formats

Chemical table file formats, such as Structure Data Format (SDF) files, are among the most widely used industry standards. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of CTfile Formats. [5]

PubChem also has XML and ASN1 file formats, which are export options from the PubChem online database. Both are exported as text, although ASN1 is most often a binary format.

A large number of other formats are listed in the table below.

Converting between formats

OpenBabel and JOELib are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables. The general form of an OpenBabel conversion command is

obabel -i input_format input_file -o output_format -O output_file

For example, to convert the file epinephrine.sdf from SDF to CML, use the command

obabel -i sdf epinephrine.sdf -o cml -O epinephrine.cml

The resulting file is epinephrine.cml.
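The same conversion can be scripted. The sketch below wraps the command line in Python; it assumes the obabel executable is installed and on the PATH, and that the output file is passed with the -O option as in the obabel command-line documentation. The helper names build_obabel_command and convert are illustrative, not part of OpenBabel.

```python
import shutil
import subprocess


def build_obabel_command(in_format, in_file, out_format, out_file):
    """Return the obabel argument list for a single-file conversion."""
    return ["obabel", "-i", in_format, in_file, "-o", out_format, "-O", out_file]


def convert(in_format, in_file, out_format, out_file):
    """Run obabel; raises if the executable is missing or the conversion fails."""
    if shutil.which("obabel") is None:
        raise RuntimeError("obabel executable not found on PATH")
    subprocess.run(
        build_obabel_command(in_format, in_file, out_format, out_file),
        check=True,
    )
```

Calling convert("sdf", "epinephrine.sdf", "cml", "epinephrine.cml") would then perform the conversion described above. Passing the arguments as a list avoids shell quoting problems with unusual file names.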

IOData is a free and open-source Python library for parsing, storing, and converting various file formats commonly used by quantum chemistry, molecular dynamics, and plane-wave density-functional-theory software programs. It also supports a flexible framework for generating input files for various software packages. A complete list of supported formats is available at https://iodata.readthedocs.io/en/latest/formats.html.

A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools JChemPaint (based on the Chemistry Development Kit), XDrawChem (based on OpenBabel), Chime, Jmol, Mol2mol [6] and Discovery Studio fit into this category.

The Chemical MIME Project

"Chemical MIME" is a de facto approach for adding MIME types to chemical streams.

This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. ... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion. [7]

In 1998 the work was formally published in the JCIM. [8]

File extension | MIME type | Proper name | Description
.alc | chemical/x-alchemy | Alchemy Format
.csf | chemical/x-cache-csf | CAChe MolStruct CSF
.cbin, .cascii, .ctab | chemical/x-cactvs-binary | CACTVS format
.cdx | chemical/x-cdx | ChemDraw eXchange file
.cer | chemical/x-cerius | MSI Cerius II format
.c3d | chemical/x-chem3d | Chem3D Format
.chm | chemical/x-chemdraw | ChemDraw file
.cif | chemical/x-cif | Crystallographic Information File, Crystallographic Information Framework | Promulgated by the International Union of Crystallography
.cmdf | chemical/x-cmdf | CrystalMaker Data format
.cml | chemical/x-cml | Chemical Markup Language | XML based Chemical Markup Language.
.cpa | chemical/x-compass | Compass program of the Takahashi
.bsd | chemical/x-crossfire | Crossfire file
.csm, .csml | chemical/x-csml | Chemical Style Markup Language
.ctx | chemical/x-ctx | Gasteiger group CTX file format
.cxf, .cef | chemical/x-cxf | Chemical eXchange Format
.emb, .embl | chemical/x-embl-dl-nucleotide | EMBL Nucleotide Format
.spc | chemical/x-galactic-spc | SPC format for spectral and chromatographic data
.inp, .gam, .gamin | chemical/x-gamess-input | GAMESS Input format
.fch, .fchk | chemical/x-gaussian-checkpoint | Gaussian Checkpoint Format
.cub | chemical/x-gaussian-cube | Gaussian Cube (Wavefunction) Format
.gau, .gjc, .gjf, .com | chemical/x-gaussian-input | Gaussian Input Format
.gcg | chemical/x-gcg8-sequence | Protein Sequence Format
.gen | chemical/x-genbank | ToGenBank Format
.istr, .ist | chemical/x-isostar | IsoStar Library of Intermolecular Interactions
.jdx, .dx | chemical/x-jcamp-dx | JCAMP Spectroscopic Data Exchange Format
.kin | chemical/x-kinemage | Kinetic (Protein Structure) Images; Kinemage
.mcm | chemical/x-macmolecule | MacMolecule File Format
.mmd, .mmod | chemical/x-macromodel-input | MacroModel Molecular Mechanics
.mol | chemical/x-mdl-molfile | MDL Molfile
.smiles, .smi | chemical/x-daylight-smiles | Simplified molecular input line entry specification | A line notation for molecules.
.sdf | chemical/x-mdl-sdfile | Structure-Data File
.el | chemical/x-sketchel | SketchEl Molecule
.ds | chemical/x-datasheet | SketchEl XML DataSheet
.inchi | chemical/x-inchi | IUPAC International Chemical Identifier (InChI)
.jsd, .jsdraw | chemical/x-jsdraw | JSDraw native file format
.helm, .ihelm | chemical/x-helm | Pistoia Alliance HELM string | A line notation for biological molecules
.xhelm | chemical/x-xhelm | Pistoia Alliance XHELM XML file | XML based HELM including monomer definitions
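A program can use the table above to label chemical files with their MIME type. The sketch below registers a handful of the listed extensions with Python's standard mimetypes module; the selection of entries and the helper names make_chemical_mime_db and guess_chemical_type are illustrative. A private MimeTypes instance is used so that the process-wide defaults are left untouched.

```python
import mimetypes

# A few extension-to-type pairs copied from the table above.
CHEMICAL_MIME_TYPES = {
    ".cml": "chemical/x-cml",
    ".cif": "chemical/x-cif",
    ".mol": "chemical/x-mdl-molfile",
    ".sdf": "chemical/x-mdl-sdfile",
    ".smi": "chemical/x-daylight-smiles",
    ".inchi": "chemical/x-inchi",
}


def make_chemical_mime_db():
    """Return a MimeTypes database that knows the chemical types above."""
    db = mimetypes.MimeTypes()
    for extension, mime_type in CHEMICAL_MIME_TYPES.items():
        db.add_type(mime_type, extension)
    return db


def guess_chemical_type(filename, db=None):
    """Return the MIME type for filename, or None if the extension is unknown."""
    if db is None:
        db = make_chemical_mime_db()
    mime_type, _encoding = db.guess_type(filename)
    return mime_type
```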

Support

For Linux/Unix, configuration files are available as a "chemical-mime-data" package in .deb, RPM and tar.gz formats to register chemical MIME types on a web server. [9] [10] Programs can then register as viewer, editor or processor for these formats so that full support for chemical MIME types is available.

Sources of chemical data

Here is a short list of sources of freely available molecular data. Many more resources exist on the Internet. Links to these sources are given in the references below.

  1. The US National Institutes of Health PubChem database is a huge source of chemical data. All of the data is in two dimensions. Data includes SDF, SMILES, PubChem XML, and PubChem ASN1 formats.
  2. The worldwide Protein Data Bank (wwPDB) [11] is an excellent source of protein and nucleic acid molecular coordinate data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
  3. eMolecules is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a SMILES string for each compound. eMolecules supports fast substructure searching based on parts of the molecular structure.
  4. ChemExper is a commercial database for molecular data. The search results include a two-dimensional structure diagram and a MOL file for many compounds.
  5. New York University Library of 3-D Molecular Structures.
  6. The US Environmental Protection Agency's Distributed Structure-Searchable Toxicity (DSSTox) Database Network is a project of the EPA's Computational Toxicology Program. The database provides SDF molecular files with a focus on carcinogenic and otherwise toxic substances.

References

  1. Chen, V.B.; et al. (2009). "KING (Kinemage, Next Generation): A versatile interactive molecular and scientific visualization program". Protein Science. 18 (11): 2403–2409. doi:10.1002/pro.250. PMC 2788294. PMID 19768809.
  2. Henrick, K.; et al. (2008). "Remediation of the protein data bank archive". Nucleic Acids Research. 36 (Database issue): D426–D433. doi:10.1093/nar/gkm937. PMC 2238854. PMID 18073189.
  3. Brooks, B.R.; et al. (1983). "CHARMM: A program for macromolecular energy, minimization, and dynamics calculations". J. Comput. Chem. 4 (2): 187–217. doi:10.1002/jcc.540040211. S2CID 91559650.
  4. Weininger, David (1988). "SMILES, a Chemical Language and Information System: 1: Introduction to Methodology and Encoding Rules". Journal of Chemical Information and Modeling. 28 (1): 31–36. doi:10.1021/ci00057a005. S2CID 5445756.
  5. MDL Information Systems 2005
  6. Mol2mol homepage
  7. The Chemical MIME Home Page (accessed 2013-January-24)
  8. Rzepa, H. S.; Murray-Rust, P.; Whitaker, B. J. (1998). "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange". Journal of Chemical Information and Modeling. 38 (6): 976. doi:10.1021/ci9803233.
  9. "Package Search Results for "chemical-mime" | Debian".
  10. "Why Use SourceForge? Features and Benefits".
  11. Berman, H.M.; et al. (2003). "Announcing the worldwide Protein Data Bank". Nature Structural Biology. 10 (12): 980. doi:10.1038/nsb1203-980. PMID 14634627.