A chemical file format is a type of data file used specifically for representing molecular data. One of the most widely used is the chemical table file format, which is similar to Structure Data Format (SDF) files. These are text files that represent multiple chemical structure records and associated data fields. The XYZ file format is a simple format that usually gives the number of atoms on the first line and a comment on the second, followed by one line per atom with an atomic symbol (or atomic number) and Cartesian coordinates. The Protein Data Bank Format is commonly used for proteins but is also used for other types of molecules. Many other formats exist and are detailed below. Various software systems are available to convert from one format to another.
Chemical information is usually provided as files or streams and many formats have been created, with varying degrees of documentation. The format is indicated in three ways:
Chemical Markup Language (CML) is an open standard for representing molecular and other chemical data. The open source project includes XML Schema, source code for parsing and working with CML data, and an active community. The articles Tools for Working with Chemical Markup Language and XML for Chemistry and Biosciences discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, XDrawChem and MarvinView.
Simpler chemical formats focus on describing the connectivity of atoms (and sometimes their stereochemistry). They include:
InChI is an IUPAC-standard format for describing molecules. Multiple InChI strings can be joined into an "RInChI" to describe a chemical reaction, or into a "MInChI" to describe a mixture.
The simplified molecular input line entry system, or SMILES, is a line notation for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates.
Hydrogen atoms are usually not written explicitly. Other atoms are represented by their element symbols <code>B</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>F</code>, <code>P</code>, <code>S</code>, <code>Cl</code>, <code>Br</code>, and <code>I</code>. The symbol <code>=</code> represents double bonds and <code>#</code> represents triple bonds. Branching is indicated by <code>( )</code>. Rings are indicated by pairs of matching digits.
Some examples are
Multiple SMILES strings can be joined into a "reaction SMILES", which describes a chemical reaction.
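The notation rules above can be illustrated with a small tokenizer. This is a minimal sketch, not a full SMILES parser: it only accepts the element symbols, bond symbols, branch parentheses and ring digits listed above, and it rejects everything else (aromatic lowercase atoms, bracket atoms, charges, stereochemistry). The function names are hypothetical. The reaction helper assumes the common "reactants>agents>products" layout, with individual molecules separated by <code>.</code>, which is not spelled out in the text above.

```python
import re

# Tokens covered by this sketch: the listed element symbols, "=" and "#"
# bonds, branch parentheses and single ring-closure digits. Two-letter
# symbols come first so "Cl" is not read as "C" plus an unknown "l".
TOKEN_RE = re.compile(r"Cl|Br|[BCNOFPSI]|[=#()]|\d")
ATOMS = {"B", "C", "N", "O", "F", "P", "S", "I", "Cl", "Br"}


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise on unsupported syntax."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")
    return tokens


def count_heavy_atoms(smiles: str) -> int:
    """Count non-hydrogen atoms (hydrogens are implicit in this subset)."""
    return sum(token in ATOMS for token in tokenize(smiles))


def split_reaction(reaction: str) -> tuple[list[str], ...]:
    """Split an assumed 'reactants>agents>products' reaction SMILES."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    return tuple([mol for mol in part.split(".") if mol] for part in parts)
```

For example, <code>tokenize("CC(=O)O")</code> yields the atoms, the branch and the double bond of acetic acid as separate tokens, and <code>count_heavy_atoms("C1CCCCC1")</code> counts the six ring carbons of cyclohexane while skipping the two ring-closure digits.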
SYBYL Line Notation (SLN) is a chemical line notation. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of Markush structure queries. The syntax also supports the specification of combinatorial libraries of ChemDraw.
Some chemical formats describe the coordinates of atoms. This is important for
One of the most widely used industry standards is the family of chemical table file formats. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). This family includes the MOLfile, the SDfile (Structure Data Format, MOLfile with metadata), the RXNfile (multiple MOLfiles put together to describe a chemical reaction), and RDfile (RXNfile with metadata).
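The relationship between a MOLfile and an SDfile can be sketched in a few lines of Python. The sketch below assumes the widely used V2000 layout: each record ends with a <code>$$$$</code> line, the fourth line of the structure block is the counts line with the atom and bond counts in its first two three-character fields, and each data field starts with a <code>&gt;</code> header carrying the field name in angle brackets and ends at a blank line. The function names are hypothetical, and V3000 records are not handled.

```python
import re

# Header of an SDfile data field, e.g. ">  <NAME>".
FIELD_HEADER = re.compile(r"^>.*?<([^>]+)>")


def split_sdf(text: str) -> list[list[str]]:
    """Split SDfile text into records (lists of lines) at '$$$$' lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def parse_record(lines: list[str]) -> dict:
    """Read the name, the V2000 counts line and the data fields."""
    counts = lines[3]  # fourth line of the embedded MOLfile
    record = {
        "name": lines[0].strip(),
        "n_atoms": int(counts[0:3]),
        "n_bonds": int(counts[3:6]),
        "fields": {},
    }
    key = None
    for line in lines[4:]:
        match = FIELD_HEADER.match(line)
        if match:
            key = match.group(1)
            record["fields"][key] = ""
        elif key is not None:
            if line.strip():
                previous = record["fields"][key]
                record["fields"][key] = f"{previous}\n{line}" if previous else line
            else:
                key = None  # a blank line closes the current data field
    return record
```

Keeping the split step separate from the per-record parse mirrors the format itself: the structure block is an ordinary MOLfile, and the SDfile only adds the data fields and the record terminator.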
The Protein Data Bank Format is an obsolete format for protein structures developed in 1972. It is a fixed-width format and thus limited to a maximum number of atoms, residues, and chains; this resulted in very large structures such as ribosomes being split into multiple files. For example, the E. coli 70S ribosome was represented as 4 PDB files in 2009: 3I1M, 3I1N, 3I1O, and 3I1P. In 2014, they were consolidated into a single file, 4V6C.
Some PDB files contained an optional section describing atom connectivity as well as position. Because these files were sometimes used to describe macromolecular assemblies or molecules represented in explicit solvent, they could grow very large and were often compressed. Some tools, such as Jmol and KiNG, could read PDB files in gzipped format. The wwPDB maintained the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in PDB format specification (to version 3.0) in August 2007, and a remediation of many file problems in the existing database. The typical file extension for a PDB file was <code>.pdb</code>, although some older files used <code>.ent</code> or <code>.brk</code>. Some molecular modeling tools wrote nonstandard PDB-style files that adapted the basic format to their own needs.
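The fixed-width layout can be made concrete with a short reader for coordinate records. This is a minimal sketch that assumes the column positions of the published <code>ATOM</code>/<code>HETATM</code> record layout (serial number in columns 7 to 11, coordinates in columns 31 to 54, element symbol in columns 77 and 78); the class and function names are hypothetical. The five-column serial field shows where the size limit comes from: it cannot hold a number above 99999. Files ending in <code>.gz</code> are opened through <code>gzip</code>, matching the compressed files mentioned above.

```python
import gzip
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM record using fixed column slices."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),      # columns 7-11
        name=line[12:16].strip(),    # columns 13-16
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),        # columns 31-38
        y=float(line[38:46]),
        z=float(line[46:54]),
        element=line[76:78].strip(),  # empty if the line is short
    )


def read_atoms(path: str) -> list[Atom]:
    """Read all coordinate records from a plain or gzipped PDB file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [
            parse_atom_line(line)
            for line in handle
            if line.startswith(("ATOM  ", "HETATM"))
        ]
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in a PDB record may touch without any separating space.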
The GROMACS file format family was created for use with the molecular simulation software package GROMACS. It closely resembles the PDB format but was designed for storing output from molecular dynamics simulations, so it allows for additional numerical precision and optionally retains information about particle velocity as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is <code>.gro</code>.
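The optional velocity columns can be illustrated with a parser for a single atom line of a <code>.gro</code> file. This is a minimal sketch that assumes the default fixed-width layout (residue number, residue name, atom name and atom number in five columns each, then three position fields of eight columns and, optionally, three velocity fields of eight columns); files written with a non-default precision use wider fields and are not handled. The function name is hypothetical.

```python
def parse_gro_atom(line: str) -> dict:
    """Parse one atom line of a .gro file in the default fixed-width layout."""
    line = line.rstrip()
    atom = {
        "res_num": int(line[0:5]),
        "res_name": line[5:10].strip(),
        "atom_name": line[10:15].strip(),
        "atom_num": int(line[15:20]),
        # Positions occupy three 8-column fields starting at column 21.
        "position": tuple(float(line[i:i + 8]) for i in (20, 28, 36)),
        "velocity": None,
    }
    # Velocities are optional: present only if the line is long enough.
    if len(line) >= 68:
        atom["velocity"] = tuple(float(line[i:i + 8]) for i in (44, 52, 60))
    return atom
```

A line that stops after the third position field yields <code>velocity</code> set to <code>None</code>, so the same function serves both coordinate-only files and trajectory frames that keep velocities.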
The CHARMM molecular dynamics package can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF (protein structure file) are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are <code>.crd</code> and <code>.psf</code> respectively.
In 2014, the PDB format was officially replaced with mmCIF. mmCIF is a new text format for representing atomic coordinates and "biological assemblies", i.e. assemblies of molecules. It can express things that the PDB format cannot express, so some newer PDB structures may not have PDB files available (but a "bundle file" containing PDB files split from the main mmCIF file can be downloaded).
There is also a more verbose XML variant.
The General Simulation Data (GSD) file format was created for efficient reading and writing of generic particle simulation data, primarily (but not exclusively) from HOOMD-blue. The package also contains a Python module that reads and writes HOOMD schema GSD files with an easy-to-use syntax. https://gsd.readthedocs.io
The Ghemical software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag (<code>!Header</code>, <code>!Info</code>, <code>!Atoms</code>, <code>!Bonds</code>, <code>!Coord</code>, <code>!PartialCharges</code> and <code>!End</code>).
The XYZ file format is a simple format that usually gives the number of atoms on the first line and a comment on the second, followed by one line per atom with an atomic symbol (or atomic number) and Cartesian coordinates.
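Because the layout is so simple, a reader fits in a few lines. The following is a minimal sketch for a single-structure XYZ file; the function name is hypothetical, and any columns after the three coordinates are ignored.

```python
def parse_xyz(text: str) -> tuple[str, list[tuple[str, float, float, float]]]:
    """Parse XYZ text into (comment, [(symbol, x, y, z), ...])."""
    lines = text.splitlines()
    n_atoms = int(lines[0].strip())  # first line: number of atoms
    comment = lines[1]               # second line: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        # The first column is kept as a string, so it may hold either an
        # atomic symbol or an atomic number.
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("fewer atom lines than the declared atom count")
    return comment, atoms
```

Checking the number of atom lines against the declared count catches truncated files, which is the main error the format lets a reader detect.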
PubChem offers data export for SDF, JSON, XML, and ASNT/B formats.
These "formats" are references to entries in specific databases. Examples include:
These formats wrap other formats.
The Mixfile format is a JSON-based format for describing mixtures.
Just as extension names are used to distinguish file types in folders, MIME types are used to distinguish data-stream types on the Internet. "Chemical MIME" is a project for proposing MIME types for chemical streams. <blockquote> This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. ... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion.</blockquote> In 1998 the work was formally published in the JCIM.
For Linux/Unix, configuration files are available as a "chemical-mime-data" package in .deb, RPM and tar.gz formats to register chemical MIME types on a web server. Programs can then register as viewer, editor or processor for these formats so that full support for chemical MIME types is available.
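The idea of registering chemical MIME types can be sketched with Python's standard <code>mimetypes</code> module. The mapping below is a small illustrative subset of extension-to-type pairs as commonly listed for the chemical MIME proposal; it is an assumption of this sketch rather than a complete or authoritative table, and the function name is hypothetical. A private <code>MimeTypes</code> instance is used so the process-wide registry is left untouched.

```python
import mimetypes

# Illustrative subset of chemical MIME types (assumed for this sketch).
CHEMICAL_TYPES = {
    ".cml": "chemical/x-cml",
    ".mol": "chemical/x-mdl-molfile",
    ".sdf": "chemical/x-mdl-sdfile",
    ".pdb": "chemical/x-pdb",
    ".xyz": "chemical/x-xyz",
}


def chemical_mime_registry() -> mimetypes.MimeTypes:
    """Return a MIME registry that also knows the chemical types above."""
    registry = mimetypes.MimeTypes()
    for extension, mime_type in CHEMICAL_TYPES.items():
        # add_type overrides any existing mapping for the extension.
        registry.add_type(mime_type, extension)
    return registry
```

Calling <code>guess_type</code> on the returned registry gives a (type, encoding) pair, so a gzipped file name such as <code>structure.pdb.gz</code> is reported with the chemical type and a <code>gzip</code> encoding.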
OpenBabel and JOELib are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables.
For example, to convert the file epinephrine.sdf in SDF to CML use the command
The resulting file is epinephrine.cml.
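A conversion like this can also be scripted. The sketch below is one possible way to do it, assuming the Open Babel command-line program <code>obabel</code> is installed and on the PATH, and relying on it to infer the input and output formats from the file extensions; the helper names are hypothetical, and this is not necessarily the exact command referred to above.

```python
import shutil
import subprocess


def build_obabel_command(source: str, target: str) -> list[str]:
    """Build an obabel invocation; formats are inferred from extensions."""
    return ["obabel", source, "-O", target]


def convert(source: str, target: str) -> None:
    """Run the conversion, failing early if obabel is not installed."""
    if shutil.which("obabel") is None:
        raise RuntimeError("the 'obabel' executable was not found on PATH")
    # check=True raises CalledProcessError if obabel exits with an error.
    subprocess.run(build_obabel_command(source, target), check=True)
```

With these helpers, <code>convert("epinephrine.sdf", "epinephrine.cml")</code> would write the CML file next to the SDF input.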
IOData is a free and open-source Python library for parsing, storing, and converting various file formats commonly used by quantum chemistry, molecular dynamics, and plane-wave density-functional-theory software programs. It also supports a flexible framework for generating input files for various software packages. A complete list of supported formats is available at https://iodata.readthedocs.io/en/latest/formats.html.
A number of tools intended for viewing and editing molecular structures are able to read in files in a number of formats and write them out in other formats. The tools JChemPaint (based on the Chemistry Development Kit), XDrawChem (based on OpenBabel), Chime, Jmol, Mol2mol and Discovery Studio fit into this category.
Here is a short list of sources of freely available molecular data. Many more resources exist on the Internet than are listed here. Links to these sources are given in the references below.
Small molecules:
Macromolecules: