Chemical table file

Chemical table file (CT file) is a family of text-based chemical file formats that describe molecules and chemical reactions. One format, for example, lists each atom in a molecule, the x-y-z coordinates of that atom, and the bonds among the atoms.

File formats

There are several file formats in the family.

The formats were created by MDL Information Systems (MDL). MDL was acquired by Symyx Technologies, which then merged with Accelrys Corp.; the combined company is now called BIOVIA, a subsidiary of Dassault Systemes of Dassault Group. [1]

The CT file is an open format. BIOVIA publishes its specification. [2] BIOVIA requires users to register to download the CT file format specifications. [3]

Molfile

ctab
Filename extension: .mol
Internet media type: chemical/x-mdl-molfile
Type of format: chemical file format

An MDL Molfile is a file format for holding information about the atoms, bonds, connectivity and coordinates of a molecule.

A molfile consists of header information, followed by the Connection Table (CT), which contains the atom information and then the bond connections and types, followed by sections for more complex information.

The molfile is sufficiently common that most, if not all, cheminformatics software systems/applications are able to read the format, though not always to the same degree. It is also supported by some computational software such as Mathematica.

The current de facto standard version is molfile V2000, although, more recently, the V3000 format has been circulating widely enough to present a potential compatibility issue for those applications that are not yet V3000-capable.

The contents of a Molfile of L-Alanine, block by block:

Header block (3 lines)

    L-Alanine
      ABCDEFGH09071717443D
    Exported

Line 1 is the title line (can be blank but line must exist). Line 2 is the program / file timestamp line (name of source program and a file timestamp). Line 3 is the comment line (can be blank but line must exist).

Connection table

Counts line:

      6  5  0  0  1  0              3 V2000

Atom block (1 line for each atom): x, y, z (in angstroms), element, etc.

       -0.6622    0.5342    0.0000 C   0  0  2  0  0  0
        0.6622   -0.3000    0.0000 C   0  0  0  0  0  0
       -0.7207    2.0817    0.0000 C   1  0  0  0  0  0
       -1.8622   -0.3695    0.0000 N   0  3  0  0  0  0
        0.6220   -1.8037    0.0000 O   0  0  0  0  0  0
        1.9464    0.4244    0.0000 O   0  5  0

Bond block (1 line for each bond): 1st atom, 2nd atom, type, etc.

      1  2  1  0  0  0  0
      1  3  1  0  1  0  0
      1  4  1  0  0  0  0
      2  5  2  0  0  0  0
      2  6  1  0  0  0  0

Properties block

    M  CHG  2   4   1   6  -1
    M  ISO  1   3  13

END line (NOTE: some programs don't like a blank line before M  END)

    M  END
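The block structure above can be sketched in a few lines of Python. This is a minimal illustration, not a full reader: the names `MolBlocks`, `split_molfile` and `property_pairs` are hypothetical, the atom and bond counts are assumed to sit in the first two 3-character columns of the counts line, and the `M  CHG` / `M  ISO` lines are assumed to hold an entry count followed by atom-number/value pairs, as the example suggests.

```python
from dataclasses import dataclass


@dataclass
class MolBlocks:
    header: list[str]      # title, program/timestamp, comment
    counts: str            # the counts line, unparsed
    atoms: list[str]       # one line per atom
    bonds: list[str]       # one line per bond
    properties: list[str]  # "M  ..." lines before "M  END"


def split_molfile(text: str) -> MolBlocks:
    """Slice a V2000 molfile into its blocks using the atom and bond counts."""
    lines = text.splitlines()
    if len(lines) < 4:
        raise ValueError("expected three header lines and a counts line")
    counts = lines[3]
    # Assumption: counts are stored in fixed 3-character columns.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    bond_start = 4 + n_atoms
    prop_start = bond_start + n_bonds
    properties = []
    for line in lines[prop_start:]:
        if line.startswith("M  END"):
            break
        properties.append(line)
    return MolBlocks(lines[:3], counts, lines[4:bond_start],
                     lines[bond_start:prop_start], properties)


def property_pairs(line: str) -> dict[int, int]:
    """Read an 'M  CHG' or 'M  ISO' line as {atom number: value}."""
    tokens = line.split()
    n_entries = int(tokens[2])
    values = [int(t) for t in tokens[3:3 + 2 * n_entries]]
    return dict(zip(values[0::2], values[1::2]))


# The L-alanine example, built line by line so the column spacing is explicit.
ALANINE = "\n".join([
    "L-Alanine",
    "  ABCDEFGH09071717443D",
    "Exported",
    "  6  5  0  0  1  0" + " " * 12 + "  3 V2000",
    "   -0.6622    0.5342    0.0000 C   0  0  2  0  0  0",
    "    0.6622   -0.3000    0.0000 C   0  0  0  0  0  0",
    "   -0.7207    2.0817    0.0000 C   1  0  0  0  0  0",
    "   -1.8622   -0.3695    0.0000 N   0  3  0  0  0  0",
    "    0.6220   -1.8037    0.0000 O   0  0  0  0  0  0",
    "    1.9464    0.4244    0.0000 O   0  5  0",
    "  1  2  1  0  0  0  0",
    "  1  3  1  0  1  0  0",
    "  1  4  1  0  0  0  0",
    "  2  5  2  0  0  0  0",
    "  2  6  1  0  0  0  0",
    "M  CHG  2   4   1   6  -1",
    "M  ISO  1   3  13",
    "M  END",
])

blocks = split_molfile(ALANINE)
print(len(blocks.atoms), len(blocks.bonds))        # 6 5
for line in blocks.properties:
    print(line.split()[1], property_pairs(line))   # CHG {4: 1, 6: -1} / ISO {3: 13}
```

Read this way, the properties block carries the +1 charge on atom 4 (N), the -1 charge on atom 6 (O) and the mass 13 on atom 3 (C).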

Counts line block specification

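Value   Description                                Type
6       number of atoms                            [Generic]
5       number of bonds                            [Generic]
0       number of atom lists                       [Query]
1       chiral flag: 1 = chiral; 0 = not chiral    [Generic]
0       number of stext entries                    [ISIS/Desktop]
3       number of lines of additional properties   [Generic]
V2000   mol version

The values are those of the L-alanine counts line above; the fourth field of that line (0) is obsolete and is omitted here.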
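As a rough sketch of how the counts line can be read, the following Python treats it as fixed-width 3-character columns. The column offsets (0, 3, 6, 12, 15 and 30 for the numeric fields, 33 to 38 for the version) are an assumption based on the usual V2000 layout and should be checked against the BIOVIA specification; `Counts` and `parse_counts_line` are hypothetical names.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    n_atoms: int
    n_bonds: int
    n_atom_lists: int
    chiral: bool
    n_stext: int
    n_properties: int
    version: str


def _field(line: str, start: int, width: int = 3) -> int:
    """Read one fixed-width integer column; blank or missing columns count as 0."""
    raw = line[start:start + width].strip()
    return int(raw) if raw else 0


def parse_counts_line(line: str) -> Counts:
    # Assumed offsets: atoms 0, bonds 3, atom lists 6, (obsolete 9),
    # chiral 12, stext 15, (obsolete 18-29), properties 30, version 33.
    return Counts(
        n_atoms=_field(line, 0),
        n_bonds=_field(line, 3),
        n_atom_lists=_field(line, 6),
        chiral=_field(line, 12) == 1,
        n_stext=_field(line, 15),
        n_properties=_field(line, 30),
        version=line[33:39].strip(),
    )


# Counts line of the L-alanine example, with the blank obsolete columns spelled out.
line = "  6  5  0  0  1  0" + " " * 12 + "  3 V2000"
print(parse_counts_line(line))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields are not guaranteed to be separated by a space once a count reaches three digits.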

Bond block specification

The Bond Block is made up of bond lines, one line per bond, with the following format:

111 222 ttt sss xxx rrr ccc

where the values are described in the following table:

Field  Meaning                  Values
111    first atom number
222    second atom number
ttt    bond type                1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any
sss    bond stereo              For single bonds: 0 = not stereo, 1 = up, 4 = either, 6 = down. For double bonds: 0 = use x-, y-, z-coords from atom block to determine cis or trans, 3 = cis or trans (either) double bond
xxx    not used
rrr    bond topology            0 = Either, 1 = Ring, 2 = Chain
ccc    reacting center status   0 = unmarked, 1 = a center, -1 = not a center. Additional: 2 = no change, 4 = bond made/broken, 8 = bond order changes, 12 = 4 + 8 (both made/broken and changes); 5 = (4 + 1), 9 = (8 + 1), and 13 = (12 + 1) are also possible
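The field table can be turned into a small decoder. The sketch below is illustrative only: it assumes each field occupies a 3-character column, treats the additive reacting-center codes as bit flags (which is what 12 = 4 + 8 and 13 = 12 + 1 suggest), and uses hypothetical names (`Bond`, `parse_bond_line`, `decode_reacting_center`). The sample bond line is made up for the demonstration and is not taken from the L-alanine file.

```python
from typing import NamedTuple

BOND_TYPES = {1: "single", 2: "double", 3: "triple", 4: "aromatic",
              5: "single or double", 6: "single or aromatic",
              7: "double or aromatic", 8: "any"}
SINGLE_STEREO = {0: "not stereo", 1: "up", 4: "either", 6: "down"}
DOUBLE_STEREO = {0: "from coordinates", 3: "cis or trans (either)"}
TOPOLOGY = {0: "either", 1: "ring", 2: "chain"}
CENTER_FLAGS = {1: "a center", 2: "no change",
                4: "bond made/broken", 8: "bond order changes"}


class Bond(NamedTuple):
    first_atom: int
    second_atom: int
    bond_type: str
    stereo: str
    topology: str
    reacting_center: list[str]


def decode_reacting_center(code: int) -> list[str]:
    """Expand a reacting-center code into the statuses it combines."""
    if code == 0:
        return ["unmarked"]
    if code == -1:
        return ["not a center"]
    return [name for bit, name in CENTER_FLAGS.items() if code & bit]


def parse_bond_line(line: str) -> Bond:
    # Seven 3-character columns: 111 222 ttt sss xxx rrr ccc (xxx is not used).
    cols = [line[i:i + 3].strip() for i in range(0, 21, 3)]
    first, second, btype, stereo, _unused, topology, center = (
        int(c) if c else 0 for c in cols)
    # The meaning of the stereo code depends on the bond type.
    stereo_names = DOUBLE_STEREO if btype == 2 else SINGLE_STEREO
    return Bond(first, second,
                BOND_TYPES.get(btype, f"unknown ({btype})"),
                stereo_names.get(stereo, f"unknown ({stereo})"),
                TOPOLOGY.get(topology, f"unknown ({topology})"),
                decode_reacting_center(center))


print(parse_bond_line("  1  3  1  1  0  0  0"))  # illustrative single bond drawn "up"
print(decode_reacting_center(13))
# ['a center', 'bond made/broken', 'bond order changes']
```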

Extended Connection Table (V3000)

The extended (V3000) molfile consists of a regular molfile “no structure” followed by a single molfile appendix that contains the body of the connection table (Ctab). The following figure shows both an alanine structure and the extended molfile corresponding to it.

Note that the “no structure” is flagged with the “V3000” instead of the “V2000” version stamp. There are two other changes to the header in addition to the version:

Unlike the V2000 molfile, the V3000 extended Rgroup molfile has the same header format as a non-Rgroup molfile.

The extended molfile of L-Alanine, block by block:

Header block

    L-Alanine
    GSMACCS-II07189510252D 1 0.00366 0.00000 0
    Figure 1, J. Chem. Inf. Comput. Sci., Vol 32, No. 3., 1992
      0  0  0  0  0               999 V3000

The second line is the header with timestamp, the third is the comment line, and the fourth is the V2000-compatibility line.

Connection table

    M  V30 BEGIN CTAB

Counts line:

    M  V30 COUNTS 6 5 0 0 1

Atom block:

    M  V30 BEGIN ATOM
    M  V30 1 C -0.6622 0.5342 0 0 CFG=2
    M  V30 2 C 0.6622 -0.3 0 0
    M  V30 3 C -0.7207 2.0817 0 0 MASS=13
    M  V30 4 N -1.8622 -0.3695 0 0 CHG=1
    M  V30 5 O 0.622 -1.8037 0 0
    M  V30 6 O 1.9464 0.4244 0 0 CHG=-1
    M  V30 END ATOM

Bond block:

    M  V30 BEGIN BOND
    M  V30 1 1 1 2
    M  V30 2 1 1 3 CFG=1
    M  V30 3 1 1 4
    M  V30 4 2 2 5
    M  V30 5 1 2 6
    M  V30 END BOND

End of the connection table and of the file:

    M  V30 END CTAB
    M  END
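Unlike V2000, the V3000 lines in the example are free-format, with optional `KEY=value` items at the end of each atom line. A minimal Python sketch of reading one atom line follows. The field order (index, atom type, x, y, z, a sixth integer field, then optional properties) is inferred from the example, the name `aamap` for the sixth field is an assumption, and the function name is hypothetical. Quoted values and continuation lines are not handled.

```python
def parse_v30_atom(line: str) -> dict:
    """Parse one line of a V3000 atom block into a dictionary."""
    tokens = line.split()
    if tokens[:2] != ["M", "V30"]:
        raise ValueError("not a V3000 line: " + line)
    index, symbol, x, y, z, aamap, *extras = tokens[2:]
    return {
        "index": int(index),
        "type": symbol,
        "xyz": (float(x), float(y), float(z)),
        "aamap": int(aamap),  # assumed name for the sixth field
        # Optional trailing items such as CHG=1, MASS=13 or CFG=2.
        "props": dict(item.split("=", 1) for item in extras),
    }


atom = parse_v30_atom("M  V30 4 N -1.8622 -0.3695 0 0 CHG=1")
print(atom["type"], atom["xyz"], atom["props"])
# N (-1.8622, -0.3695, 0.0) {'CHG': '1'}
```

The charge and isotope information that V2000 keeps in separate `M  CHG` and `M  ISO` property lines appears here directly on the atom line.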

Counts line

A counts line is required, and must be first. It specifies the number of atoms, bonds, 3D objects, and Sgroups. It also specifies whether or not the CHIRAL flag is set. Optionally, the counts line can specify molregno. This is only used when the regno exceeds 999999 (the limit of the format in the molfile header line). The format of the counts line is:

    M  V30 COUNTS na nb nsg n3d chiral [REGNO=regno]

Field    Example  Meaning
na       6        number of atoms
nb       5        number of bonds
nsg      0        number of Sgroups
n3d      0        number of 3D constraints
chiral   1        1 if the molecule is chiral
regno             molecule or model regno (optional)
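A short Python sketch of reading this line, including the optional `REGNO` item, might look as follows. The function name and the returned keys are hypothetical, and the registry number in the second call is invented for the demonstration.

```python
from typing import Optional


def parse_v30_counts(line: str) -> dict:
    """Parse 'M  V30 COUNTS na nb nsg n3d chiral [REGNO=regno]'."""
    tokens = line.split()
    if tokens[:3] != ["M", "V30", "COUNTS"]:
        raise ValueError("not a V3000 counts line: " + line)
    na, nb, nsg, n3d, chiral, *optional = tokens[3:]
    regno: Optional[int] = None
    for item in optional:
        key, _, value = item.partition("=")
        if key == "REGNO":
            regno = int(value)
    return {
        "atoms": int(na),
        "bonds": int(nb),
        "sgroups": int(nsg),
        "constraints_3d": int(n3d),
        "chiral": chiral == "1",
        "regno": regno,
    }


print(parse_v30_counts("M  V30 COUNTS 6 5 0 0 1"))
# Hypothetical registry number above the 999999 header limit:
print(parse_v30_counts("M  V30 COUNTS 6 5 0 0 1 REGNO=1234567")["regno"])
```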

SDF

ctab
Filename extension: .sd, .sdf
Internet media type: chemical/x-mdl-sdfile
Type of format: chemical file format

SDF is one of a family of chemical-data file formats developed by MDL; it is intended especially for structural information. "SDF" stands for structure-data format, and SDF files actually wrap the molfile (MDL Molfile) format. Multiple records are delimited by lines consisting of four dollar signs ($$$$). A key feature of this format is its ability to include associated data.

Associated data items are denoted as follows:

    > <Unique_ID>
    XCA3464366

    > <ClogP>
    5.825

    > <Vendor>
    Sigma

    > <Molecular Weight>
    499.611

Multiple-line data items are also supported. The MDL SDF-format specification requires that a hard-carriage-return character be inserted if a single line of any text field exceeds 200 characters. This requirement is frequently violated in practice, as many SMILES and InChI strings exceed that length.
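To make the record structure concrete, here is a minimal Python sketch that splits SDF text on the `$$$$` delimiter and collects the data items of each record. It assumes each data header has the form `> <name>`, that a blank line ends a data item, and that the molfile part ends at `M  END`. The function names and the second record's identifier are hypothetical, and a trailing record without a delimiter is ignored.

```python
from typing import Iterator


def split_record(lines: list[str]) -> tuple[str, dict[str, str]]:
    """Separate one SDF record into its molfile text and its data items."""
    end = next((i for i, line in enumerate(lines)
                if line.startswith("M  END")), None)
    if end is None:
        raise ValueError("record has no 'M  END' line")
    data: dict[str, list[str]] = {}
    name = None
    for line in lines[end + 1:]:
        if line.startswith(">") and "<" in line:
            # Data header: the field name sits between '<' and the last '>'.
            name = line[line.index("<") + 1:line.rindex(">")]
            data[name] = []
        elif not line.strip():
            name = None  # a blank line ends the current data item
        elif name is not None:
            data[name].append(line)  # multi-line values are kept line by line
    molblock = "\n".join(lines[:end + 1])
    return molblock, {key: "\n".join(value) for key, value in data.items()}


def read_sdf(text: str) -> Iterator[tuple[str, dict[str, str]]]:
    """Yield (molfile text, data items) for each '$$$$'-terminated record."""
    record: list[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield split_record(record)
            record = []
        else:
            record.append(line)


EMPTY_COUNTS = "  0" * 10 + "999 V2000"  # placeholder molfile with no atoms
SDF = "\n".join([
    "first record", "", "", EMPTY_COUNTS, "M  END",
    "> <Unique_ID>", "XCA3464366", "",
    "> <ClogP>", "5.825", "",
    "> <Vendor>", "Sigma", "",
    "> <Molecular Weight>", "499.611", "",
    "$$$$",
    "second record", "", "", EMPTY_COUNTS, "M  END",
    "> <Unique_ID>", "HYPOTHETICAL-2", "",
    "$$$$",
])

for molblock, data in read_sdf(SDF):
    print(molblock.splitlines()[0], data)
```

All values are returned as strings; converting `ClogP` or `Molecular Weight` to numbers is left to the caller, since the format itself does not type the data items.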

Other formats of the family

There are other, less commonly used formats of the family.

References

  1. Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.; Laufer, J. (1992). "Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited". Journal of Chemical Information and Modeling. 32 (3): 244. doi:10.1021/ci00007a012.
  2. "CT File Formats" (PDF). Biovia. August 2020. Archived (PDF) from the original on 2021-02-19. Retrieved 2021-02-19.
  3. "Registration form". Biovia. 13 August 2020. Archived from the original on 2020-10-01. Retrieved 2021-02-19.