Recent advances in experimental techniques have led to a rapid growth in complexity, size, and number of macromolecular structures that are made available through the Protein Data Bank. This creates a challenge for macromolecular visualization and analysis. Macromolecular structure files, such as PDB or PDBx/mmCIF files, can be slow to transfer and parse, and hard to incorporate into third-party software tools. Here, we present a new binary and compressed data representation, the MacroMolecular Transmission Format, MMTF, as well as software implementations in several languages that have been developed around it, which address these issues. We describe the new format and its APIs and demonstrate that it is several times faster to parse, and about a quarter of the file size of the current standard format, PDBx/mmCIF. As a consequence of the new data representation, it is now possible to visualize structures with millions of atoms in a web browser, or to keep the whole PDB archive in memory or parse it within a few minutes on average computers, which opens up new ways of thinking about how to implement efficient algorithms in structural bioinformatics. The PDB archive is available in MMTF file format through web services and data are updated on a weekly basis.

3D macromolecular structural data is growing ever more complex and plentiful in the wake of substantive advances in experimental and computational structure determination methods including macromolecular crystallography, cryo-electron microscopy, and integrative methods. Efficient means of working with 3D macromolecular structural data for archiving, analyses, and visualization are central to facilitating interoperability and reusability in compliance with the FAIR Principles. We address two challenges posed by growth in data size and complexity. First, data size is reduced by bespoke compression techniques. Second, complexity is managed through improved software tooling and fully leveraging available data dictionary schemas. To this end, we introduce BinaryCIF, a serialization of Crystallographic Information File (CIF) format files that maintains full compatibility with related data schemas, such as PDBx/mmCIF, while reducing file sizes by more than a factor of two versus gzip compressed CIF files. Moreover, for the largest structures, BinaryCIF provides even better compression: factors of ten and four versus CIF files and gzipped CIF files, respectively. Herein, we describe CIFTools, a set of libraries in Java and TypeScript for generic and typed handling of CIF and BinaryCIF files. Together, BinaryCIF and CIFTools enable lightweight, efficient, and extensible handling of 3D macromolecular structural data.

Working with the 2016-W4 pdf, which has 2 large streams (page 1 & 2), along with a bunch of other objects and smaller streams. I'm trying to inflate (decompress) the stream(s), to work with the source data, but am struggling. I'm only able to get corrupt input and invalid checksum errors.

I've written a test script to help debug, and have pulled out smaller streams from the file to test with. Here are 2 streams from the original pdf, along with their length objects:

I copied just the stream contents into new files within Vim (excluding the carriage returns after stream and before endstream).

I tried compress/flate (RFC 1951), after removing the first 2 bytes (CMF, FLG).

I've converted the streams to bytes for the below:

package main
...
flateReaderFn = func(r io.Reader) (io.ReadCloser, error) ...
... else if indObj, is := obj.(*core.