Condensed representation
The chemical structure representation of a protein sequence can be compressed efficiently by representing each unmodified natural amino acid as a single so-called pseudo atom.
![Example protein sequence.](sequence.png)
Example protein sequence. The second residue of the second chain has been modified from Glu to 4-carboxy-glutamate.
![Full structure version of example protein.](full_structure.png)
Full structure representation of example protein sequence.
![Condensed structure version of example protein.](condensed_structure.png)
Condensed structure representation of example protein sequence. Only the 4-carboxy-glutamate is represented as a full structure. All other residues are represented by single atoms.
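The condensing rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the writer that produces the molfiles: the names `condense`, `Node`, and `THREE_LETTER`, the choice of a one-letter sequence plus a position-to-name map of modifications as input, and the three-residue chain in the usage lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


@dataclass
class Node:
    """One residue in the condensed view (hypothetical helper type)."""
    position: int         # 1-based position within the chain
    label: str            # pseudo atom label, or the name of the modification
    full_structure: bool  # True if the residue must be drawn atom by atom


def condense(sequence: str,
             modifications: Optional[Dict[int, str]] = None) -> List[Node]:
    """Map each unmodified residue to a pseudo atom; keep modified ones in full."""
    modifications = modifications or {}
    nodes = []
    for position, code in enumerate(sequence, start=1):
        if code not in THREE_LETTER:
            raise ValueError(f"unknown residue code {code!r} at position {position}")
        if position in modifications:
            # Modified residue: cannot be condensed without losing information.
            nodes.append(Node(position, modifications[position], True))
        else:
            nodes.append(Node(position, THREE_LETTER[code], False))
    return nodes


if __name__ == "__main__":
    # Hypothetical chain whose second residue (Glu) has been modified.
    for node in condense("AEK", {2: "4-carboxy-glutamate"}):
        kind = "full structure" if node.full_structure else "pseudo atom"
        print(node.position, node.label, kind)
```

Only the residues listed in `modifications` are flagged for full-structure output, which mirrors the figure above: a single expanded residue surrounded by pseudo atoms.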
Examples
Below are examples of full structure molfiles generated from protein entries with a varying number of expressed amino acids.
- Human insulin, 51 residues, 5.8 kDa
- Heat shock protein, 101 residues, 10.8 kDa
- Growth hormone, 191 residues, 22.3 kDa
- Factor VIIa, 406 residues, 45.1 kDa
- Serum albumin, 585 residues, 66.4 kDa
- Angiotensin enzyme, 788 residues, 90.7 kDa
- Collagen, 999 residues, 107 kDa
- Multi-drug resistant protein, 1531 residues, 172 kDa
- Myoferlin, 2061 residues, 235 kDa
If you attempt to register these molecules in a normal chemistry database, you will find that registration becomes a very lengthy process as the number of residues grows.
Using the condensed representation instead greatly reduces the size and complexity of the molfile, without loss of chemical information. Compare the file sizes and registration performance of the full-structure files above with the equivalent condensed representations using pseudo atoms below.
- Human insulin, 51 residues, condensed representation
- Heat shock protein, 101 residues, condensed representation
- Growth hormone, 191 residues, condensed representation
- Factor VIIa, 406 residues, condensed representation
- Serum albumin, 585 residues, condensed representation
- Angiotensin enzyme, 788 residues, condensed representation
- Collagen, 999 residues, condensed representation
- Multi-drug resistant protein, 1531 residues, condensed representation
- Myoferlin, 2061 residues, condensed representation
The condensed representation above is known as the "simple condensed" format. Symyx/MDL Draw users will probably want to use the "MDL condensed format", which includes extra annotations that enable the drawing tool to expand and contract residues and to lay out disulfide bridges neatly.
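To get a rough feel for the size reduction that pseudo atoms give, the following Python sketch compares the number of atoms a full structure needs with the number of nodes in a condensed one. It is only an estimate under stated assumptions: hydrogens are ignored, the chain is unmodified, each residue is counted with its standard number of non-hydrogen atoms plus one extra oxygen for the C-terminus, and the condensed form is assumed to use exactly one pseudo atom per residue with no extra terminal nodes. The function names and the three-residue peptide are hypothetical.

```python
# Non-hydrogen atoms per residue when it is part of a peptide chain
# (free amino acid minus the oxygen lost with water on bond formation).
HEAVY_ATOMS = {
    "G": 4, "A": 5, "S": 6, "C": 6, "V": 7, "T": 7, "P": 7,
    "L": 8, "I": 8, "N": 8, "D": 8, "M": 8, "Q": 9, "E": 9,
    "K": 9, "H": 10, "F": 11, "R": 11, "Y": 12, "W": 14,
}


def full_heavy_atom_count(sequence: str) -> int:
    """Estimate the atoms in a full-structure molfile for one unmodified chain."""
    # The +1 accounts for the additional oxygen of the C-terminal carboxyl group.
    return sum(HEAVY_ATOMS[code] for code in sequence) + 1


def condensed_node_count(sequence: str) -> int:
    """Assumed size of the condensed form: one pseudo atom per residue."""
    return len(sequence)


if __name__ == "__main__":
    peptide = "GAW"  # hypothetical Gly-Ala-Trp tripeptide
    print(full_heavy_atom_count(peptide), "atoms in full structure")
    print(condensed_node_count(peptide), "pseudo atoms when condensed")
```

For the tripeptide this gives 24 atoms against 3 pseudo atoms. The ratio depends on composition, but since every standard residue has at least four non-hydrogen atoms, the condensed form of an unmodified chain is always several times smaller by this measure.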