From 3936957d678b7220a014f93c57fdcb104e87e301 Mon Sep 17 00:00:00 2001 From: David Sehnal Date: Tue, 31 Dec 2019 11:10:38 +0100 Subject: [PATCH] moving to mol* --- README.md | 471 +++------------------------------------------------ benchmark.md | 35 ++++ encoding.md | 252 +++++++++++++++++++++++++++ principle.md | 131 ++++++++++++++ 4 files changed, 445 insertions(+), 444 deletions(-) create mode 100644 benchmark.md create mode 100644 encoding.md create mode 100644 principle.md diff --git a/README.md b/README.md index bf97444..33e7690 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,10 @@ -![Version 0.3.0](http://img.shields.io/badge/Version-0.3.0-blue.svg?style=flat) +![Version 0.3.0](https://img.shields.io/badge/Version-0.3.0-blue.svg?style=flat) ![BinaryCIF](img/logo.png) -What is BinaryCIF -================= - BinaryCIF is a data format that stores text based CIF files using a more efficient binary encoding. It enables both lossless and lossy -compression of the original CIF file. - -The BinaryCIF format support is implemented as part of the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js). -The format is also used by the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer) service to encode macromolecular data. +compression of the original CIF file. BinaryCIF is currently mainly used by RCSB PDB and PDBe and is supported by the [Mol*](https://github.com/molstar/molstar) and [LiteMol](https://github.com/dsehnal/LiteMol) viewers. Some aspects of the BinaryCIF format, namely using [MessagePack](https://msgpack.org/) as the container and the usage the fixed point, run length, delta, and integer packing encodings was @@ -19,462 +13,51 @@ inspired by the [MMTF data format](http://mmtf.rcsb.org). 
Table of contents ================= -* [Basic Principles](#basic-principles) -* [BinaryCIF Format](#binarycif-format) - - [Data Layout](#data-layout) - - [Encoding Methods](#encoding-methods) - - [Encoding Process](#encoding-process) - - [Decoding Process](#decoding-process) - - [Reference Implementation](#reference-implementation) +* [Implementations](#implementations) +* [Principles](#principles) * [Use Cases](#use-cases) - [CoordinateServer](#coordinateserver) -* [Benchmark](#benchmark) - - [HIV-1 Capsid size](#hiv-1-capsid-size) - - [Whole PDB archive size](#whole-pdb-archive-size) - - [Read and write performance](#read-and-write-performance) - -Basic Principles -================ - -In this chapter the basic ideas behind the BinaryCIF will are discussed. - -[CIF](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) is a text based format for storing tabular data. -The data is stored row by row using this syntax: - -``` -loop_ -_category.field1 -_category.field2 -... -_category.fieldK -value-1_1 value-1_2 ... value-1_K -... -value-N_1 value-N_2 ... value-N_K -``` - -For example, the table called ``atoms`` with columns ``type, id, element, x, y, z`` - -|type|id|element|x|y|z| -|:---:|---:|:---:|---:|---:|---:| -|ATOM|1|C|0|0|0| -|ATOM|2|C|1|0|0| -|ATOM|3|O|0|1|0| -|HETATM|4|Fe|0|0|1| - -would be stored in CIF as - -``` -loop_ -_atoms.type -_atoms.id -_atoms.element -_atoms.x -_atoms.y -_atoms.z -ATOM 1 C 0 0 0 -ATOM 2 C 1 0 0 -ATOM 3 O 0 1 0 -HETATM 4 Fe 0 0 1 -``` - -If we want to compress the rows using a dictionary compression, it would identify -the string ATOM as a repeated substring and represent the rows something along the lines of - -``` -A = ATOM - -{A} 1 C 0 0 0 -{A} 2 C 1 0 0 -{A} 3 O 0 1 0 -HETATM 4 Fe 0 0 1 -``` - -where ``{A}`` is a dictionary reference to the string ``ATOM``. At first, it would seem -that this is an efficient solution. 
However, the problem with this data representation is that -it is actually hard to compress because related data is not next to each other. - -Fortunately, we can do much better than this: we can transpose the tabular data -and store them *per column* instead of *per row*: - -``` -_atoms.type: ATOM ATOM ATOM HETATM -_atoms.id: 1 2 3 4 -_atoms.element: C C O Fe -_atoms.x 0 1 0 0 -_atoms.y 0 0 1 0 -_atoms.z 0 0 0 1 -``` - -Now, we can compress all the repeating ATOM values using a method called run-length encoding: - -``` -_atoms.type: {ATOM, 3} HETATM -``` - -Where ``{ATOM, 3}`` means *repeat the string* ``ATOM`` *3 times*. If the value ATOM repeats -1 million times (which is quite common), this approach saves us a lot of space. - -Similarly, we can apply different encoding schemes to other types of data. -For example, the sequence - -``` -1 2 3 4 5 ... n -``` - -can be encoded using delta encoding as - -``` -1 1 1 1 1 ... -``` - -meaning we start with 1, then add 1 to the previous value, ending up with 2, then add 1 to the -previous values as well getting 3, etc. At this point, we can use the run-length encoding -approach from the ATOM example and end up with - -``` -{1, n} -``` - -to represent the original sequence of integers from 1 to n. - -The final step is to use binary instead of text encoding to store our data to make it more -space efficient. 
For example, storing the number 1234 as text requires 4 bytes: - -``` -"1" "2" "3" "4" - -0x31 0x32 0x33 0x34 -``` - -However, storing the number as a 16-bit integer, we required only 2 bytes: - -``` -4 * 256 + 210 - - 0x04 0xD2 -``` - -Applying the different encoding methods, the representation of our ``atoms`` table becomes - -``` -_atoms.type: {ATOM, 3} HETATM -_atoms.id: {1, 4} -_atoms.element: {C, 2} O Fe -_atoms.x 0x00 0x01 0x00 0x00 -_atoms.y 0x00 0x00 0x01 0x00 -_atoms.z 0x00 0x00 0x00 0x01 -``` - -BinaryCIF Format -================ - -## Data Layout - -A [CIF file](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) ([example](https://www.ebi.ac.uk/pdbe/static/entry/1tqn_updated.cif)) contains: - -* One or more data blocks -* Each data block has one or more category -* Each category has one or more field -* Each field contains *data* - -To represent this hierarchy, the basic shape of the BinaryCIF file defines the following -interfaces: - -``` -File { - version: string - encoder: string - dataBlocks: DataBlock[] -} - -DataBlock { - header: string - categories: Category[] -} - -Category { - name: string - rowCount: number - columns: Column[] -} - -Column { - name: string - data: Data - mask: Data -} - -Data { - data: Uint8Array - encoding: Encoding[] -} -``` - -The most interesting part is the ``Data`` interface where the actual data is stored. -The interface has two properties: ``data`` which is just an array of bytes (the binary data) -and an array of *encodings* that describes the transformations that were applied to the -source data to obtain the final binary result stored in the ``data`` field. - -Additionally, the ``Column`` interface defines a ``mask`` property used to determine -if a certain value is present, not present (``.`` token in CIF), or unknown (``?`` token in CIF). 
- -Currently, BinaryCIF supports these encoding methods: - -``` -type Encoding = - | ByteArray - | FixedPoint - | IntervalQuantization - | RunLength - | Delta - | IntegerPacking - | StringArray -``` - -## Encoding Methods - -### Byte Array - -Encodes an array of numbers of specified types as raw bytes. + - [DensityServer](#densityserver) -``` -ByteArray { - kind = "ByteArray" - type: Int8 | Int16 | Int32 | Uint8 | Uint16 | Uint32 | Float32 | Float64 -} -``` - -### Fixed Point - -Converts an array of floating point numbers to an array of 32-bit integers multiplied -by a given factor. - -``` -FixedPoint { - kind = "FixedPoint" - factor: number - srcType: Float32 | Float64 -} -``` - -#### Example - -``` -[1.2, 1.23, 0.123] ----FixedPoint---> -{ factor = 100 } [120, 123, 12] -``` - -### Interval Quantization - -Converts an array of floating point numbers to an array of 32-bit integers where -the values are quantized within a given interval into specified number of -discrete steps. Values lower than the minimum value or greater than the -maximum are reprented by the respective boundary values. - -``` -FixedPoint { - kind = "IntervalQuantization" - min: number, - max: number, - numSteps: number, - srcType: Float32 | Float64 -} -``` - -#### Example - -``` -[0.5, 1, 1.5, 2, 3, 1.345 ] ----IntervalQuantization---> -{ min = 1, max = 2, numSteps = 3 } [0, 0, 1, 2, 2, 1] -``` - -### Run Length - -Represents each integer value in the input as a pair of ``(value, number of repeats)`` -and stores the result sequentially as an array of 32-bit integers. Additionally, -stores the size of the original array to make decoding easier. - -``` -RunLength { - kind = "RunLength" - srcType: int[] - srcSize: number -} -``` - -#### Example - -``` -[1, 1, 1, 2, 3, 3] ----RunLength---> -{ srcSize = 6 } [1, 3, 2, 1, 3, 2] -``` - -### Delta - -Stores the input integer array as an array of consecutive differences. 
- -``` -Delta { - kind = "Delta" - origin: number - srcType: int[] -} -``` - -Because delta encoding is often used in conjuction with integer packing, -the ``origin`` property is present. This is to optimize the case -where the first value is large, but the differences are small. - -#### Example - -``` -[1000, 1003, 1005, 1006] ----Delta---> -{ origin = 1000, srcType = Int32 } [0, 2, 2, 1] -``` - -### Integer Packing - -Stores a 32-bit integer array using 8- or 16-bit values. Includes the size -of the input array for easier decoding. The encoding is more effective -when only unsigned values are privided. - -``` -IntegerPacking { - kind = "IntegerPacking" - byteCount: number - srcSize: number - isUnsigned: boolean -} -``` - -#### Example - -``` -[1, 2, -3, 128] ----IntgerPacking---> -{ byteCount = 1, srcSize = 4, isUnsigned = false } [1, 2, -3, 127, 1] -``` - -### String Array - -Stores an array of strings as a concatenation of all unique strings, an array of offsets -describing substrings, and indices into the offset array. -indices to corresponding substrings. - -``` -StringArray { - kind = "StringArray" - dataEncoding: Encoding[] - stringData: string - offsetEncoding: Encoding[] - offsets: Uint8Array -} -``` - -#### Example - -``` -['a','AB','a'] ----StringArray---> -{ stringData = 'aAB', offsets = [0, 1, 3] } [0, 1, 0] -``` - -Encoding Process ----------------- - -To encode the data, a sequence of encoding transformations needs to be specified -for each input column. For example, to encode the ``_atoms.id`` column -from the background section, we could specify the encoding as ``[Delta, RunLength, IntegerPacking]``: - -``` -[1, 2, 3, 4] ----Delta---> -{ srcType = Int32 } [1, 1, 1, 1] ----RunLength---> -{ srcSize = 4 } [1, 4] ----IntegerPacking---> -{ byteCount = 1, srcSize = 2 } [1, 4] -``` - -**Little endian** is used everywhere to encode the data. 
- -Once each column has been encoded and the ``File`` data structure built, the -[MessagePack](https://msgpack.org/) format (which is more or less a binary encoding of the standard -JSON format) is used to produce the final binary result. - -Optionally, the data can be compressed using standard methods such as Gzip to achieve -further compression. - -Decoding Process ----------------- - -To decode the BinaryCIF data, first the MessagePack data are decoded and then -for each column, the binary data are decoded applying inverses of the transformations -specified in the ``encoding`` array backwards. So to decode the encoding specified by -``[Delta, RunLength, IntegerPacking]`` we would first apply the decoding -of ``IntegerPacking``, then ``RunLength``, and finally ``Delta``. - -Reference Implementation ------------------------- +Implementations +================= -The BinaryCIF format is implemented in the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js). +BinaryCIF is currently available as TypeScript (JavaScript) and Java. -Specific useful parts of the code: +- TypeScript implementation is part of the [Mol* project](https://github.com/molstar/molstar) (the original implementation of the BinaryCIF format is the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js)). +- [Mol* ciftools](https://github.com/molstar/ciftools) are available as a standalone library/tools for conversion of text CIF to BinaryCIF. +- Java implementation is available at [rcsb/ciftools-java](https://github.com/rcsb/ciftools-java). 
-- [Data Layout and Encoding Methods Interfaces](https://github.com/dsehnal/CIFTools.js/blob/master/src/Binary/Encoding.ts) -- [CIF Dictionary Interfaces](https://github.com/dsehnal/CIFTools.js/blob/master/src/Dictionary.ts) - - [BinaryCIF implementation of the Interfaces](https://github.com/dsehnal/CIFTools.js/blob/master/src/Binary/Dictionary.ts) +Principles +========== -All the important code can be found in [this folder](https://github.com/dsehnal/CIFTools.js/tree/master/src/Binary). Be sure to check out -the [examples](https://github.com/dsehnal/CIFTools.js/tree/master/examples). +* [Basic Principles](principle) +* [BinaryCIF Format](encoding) +* [Benchmark](benchmark) Use Cases ========= ## CoordinateServer -BinaryCIF is supported by the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer), a web service for +BinaryCIF is supported by the [CoordinateServer](https://cs.litemol.org), a web service for delivering subsets of 3D macromolecular data stored in the [mmCIF format](http://mmcif.wwpdb.org/). The server can return data both in the text and binary version of the CIF format, with the binary representation being a lot more efficient (see the [benchmark](#benchmark)). -Benchmark -========= - -The BinaryCIF format has been applied to encode macromolecular data stored using [mmCIF](http://mmcif.wwpdb.org/) -data format (mmCIF is a schema of categories and fields that desribe a macromolecular structure -stored using the CIF format). The raw data for the benchmark are included in this repository. - -- The "CoordinateServer" results were obtained using the corresponding query of the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer). -- The "MMTF" results were obtained using [MMTF](https://mmtf.rcsb.org) version 1.0. - -## HIV-1 Capsid size - -Encoding the currenly largest structure in the PDB.org archive, the [HIV-1 Capsid](https://pdbe.org/3j3q) -with 2.44M atoms, BinaryCIF achieves very good results. 
- -![3j3q size](img/bench-3j3q.png) - - -## Whole PDB archive size - -- (*) RCSB PDB: 122333 Entries, some 404'ed; recompressed using the same compression level as the other data. -- (**) reduced = alpha + phosphate trace + HET - - -![PDB Size](img/bench-pdb.png) +## DensityServer -## Read and write performance +BinaryCIF is supported by the [DensityServer](https://ds.litemol.org), a web service for accessing subsets of volumetric density data that automatically downsamples the data depending on the volume of the requested region to reduce the bandwidth requirements and provide near-instant access to even the largest data sets. -This is the performance of BinaryCIF implementation of the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js), -[LiteMol](https://github.com/dsehnal/LiteMol), and the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer). +------------------- -![Parse](img/bench-perf-parse.png) +## Contributing +Just open an issue or make a pull request. All contributions are welcome. -![Write](img/bench-perf-read.png) +## Funding +Funding sources include but are not limited to: +* [RCSB PDB](https://www.rcsb.org) funding by a grant [DBI-1338415; PI: SK Burley] from the NSF, the NIH, and the US DoE +* [PDBe, EMBL-EBI](https://pdbe.org) +* [CEITEC](https://www.ceitec.eu/) \ No newline at end of file diff --git a/benchmark.md b/benchmark.md new file mode 100644 index 0000000..6559250 --- /dev/null +++ b/benchmark.md @@ -0,0 +1,35 @@ +Benchmark +========= + +The BinaryCIF format has been applied to encode macromolecular data stored using the [mmCIF](http://mmcif.wwpdb.org/) +data format (mmCIF is a schema of categories and fields that describe a macromolecular structure +stored using the CIF format). The raw data for the benchmark are included in this repository. + +- The "CoordinateServer" results were obtained using the corresponding query of the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer).
+- The "MMTF" results were obtained using [MMTF](https://mmtf.rcsb.org) version 1.0. + +## HIV-1 Capsid size + +Encoding the currently largest structure in the PDB.org archive, the [HIV-1 Capsid](https://pdbe.org/3j3q) +with 2.44M atoms, BinaryCIF achieves very good results. + +![3j3q size](img/bench-3j3q.png) + + +## Whole PDB archive size + +- (*) RCSB PDB: 122333 Entries, some 404'ed; recompressed using the same compression level as the other data. +- (**) reduced = alpha + phosphate trace + HET + + +![PDB Size](img/bench-pdb.png) + +## Read and write performance + +This is the performance of the BinaryCIF implementation of the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js), +[LiteMol](https://github.com/dsehnal/LiteMol), and the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer). + +![Parse](img/bench-perf-parse.png) + +![Write](img/bench-perf-read.png) + diff --git a/encoding.md b/encoding.md new file mode 100644 index 0000000..a89303a --- /dev/null +++ b/encoding.md @@ -0,0 +1,252 @@ +BinaryCIF Format +================ + +## Data Layout + +A [CIF file](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) ([example](https://www.ebi.ac.uk/pdbe/static/entry/1tqn_updated.cif)) contains: + +* One or more data blocks +* Each data block has one or more categories +* Each category has one or more fields +* Each field contains *data* + +To represent this hierarchy, the basic shape of the BinaryCIF file defines the following +interfaces: + +``` +File { + version: string + encoder: string + dataBlocks: DataBlock[] +} + +DataBlock { + header: string + categories: Category[] +} + +Category { + name: string + rowCount: number + columns: Column[] +} + +Column { + name: string + data: Data + mask: Data +} + +Data { + data: Uint8Array + encoding: Encoding[] +} +``` + +The most interesting part is the ``Data`` interface where the actual data is stored.
+The interface has two properties: ``data``, which is just an array of bytes (the binary data), +and ``encoding``, an array of *encodings* that describes the transformations that were applied to the +source data to obtain the final binary result stored in the ``data`` field. + +Additionally, the ``Column`` interface defines a ``mask`` property used to determine +if a certain value is present, not present (``.`` token in CIF), or unknown (``?`` token in CIF). + +Currently, BinaryCIF supports these encoding methods: + +``` +type Encoding = + | ByteArray + | FixedPoint + | IntervalQuantization + | RunLength + | Delta + | IntegerPacking + | StringArray +``` + +## Encoding Methods + +### Byte Array + +Encodes an array of numbers of specified types as raw bytes. + +``` +ByteArray { + kind = "ByteArray" + type: Int8 | Int16 | Int32 | Uint8 | Uint16 | Uint32 | Float32 | Float64 +} +``` + +### Fixed Point + +Converts an array of floating point numbers to an array of 32-bit integers by multiplying +each value by a given factor. + +``` +FixedPoint { + kind = "FixedPoint" + factor: number + srcType: Float32 | Float64 +} +``` + +#### Example + +``` +[1.2, 1.23, 0.123] +---FixedPoint---> +{ factor = 100 } [120, 123, 12] +``` + +### Interval Quantization + +Converts an array of floating point numbers to an array of 32-bit integers where +the values are quantized within a given interval into a specified number of +discrete steps. Values lower than the minimum value or greater than the +maximum are represented by the respective boundary values. + +``` +IntervalQuantization { + kind = "IntervalQuantization" + min: number, + max: number, + numSteps: number, + srcType: Float32 | Float64 +} +``` + +#### Example + +``` +[0.5, 1, 1.5, 2, 3, 1.345] +---IntervalQuantization---> +{ min = 1, max = 2, numSteps = 3 } [0, 0, 1, 2, 2, 1] +``` + +### Run Length + +Represents each run of repeated integer values in the input as a pair of ``(value, number of repeats)`` +and stores the result sequentially as an array of 32-bit integers.
Additionally, +stores the size of the original array to make decoding easier. + +``` +RunLength { + kind = "RunLength" + srcType: int[] + srcSize: number +} +``` + +#### Example + +``` +[1, 1, 1, 2, 3, 3] +---RunLength---> +{ srcSize = 6 } [1, 3, 2, 1, 3, 2] +``` + +### Delta + +Stores the input integer array as an array of consecutive differences. + +``` +Delta { + kind = "Delta" + origin: number + srcType: int[] +} +``` + +Because delta encoding is often used in conjunction with integer packing, +the ``origin`` property is present. This is to optimize the case +where the first value is large, but the differences are small. + +#### Example + +``` +[1000, 1003, 1005, 1006] +---Delta---> +{ origin = 1000, srcType = Int32 } [0, 3, 2, 1] +``` + +### Integer Packing + +Stores a 32-bit integer array using 8- or 16-bit values. Includes the size +of the input array for easier decoding. The encoding is more effective +when only unsigned values are provided. A value outside the representable range is stored as a run of boundary values followed by the remainder, as in the example below. + +``` +IntegerPacking { + kind = "IntegerPacking" + byteCount: number + srcSize: number + isUnsigned: boolean +} +``` + +#### Example + +``` +[1, 2, -3, 128] +---IntegerPacking---> +{ byteCount = 1, srcSize = 4, isUnsigned = false } [1, 2, -3, 127, 1] +``` + +### String Array + +Stores an array of strings as a concatenation of all unique strings, an array of offsets +describing the substrings, and an array of indices into the offset array that maps +each input string to its corresponding substring. + +``` +StringArray { + kind = "StringArray" + dataEncoding: Encoding[] + stringData: string + offsetEncoding: Encoding[] + offsets: Uint8Array +} +``` + +#### Example + +``` +['a','AB','a'] +---StringArray---> +{ stringData = 'aAB', offsets = [0, 1, 3] } [0, 1, 0] +``` + +Encoding Process +---------------- + +To encode the data, a sequence of encoding transformations needs to be specified +for each input column.
For example, to encode the ``_atoms.id`` column +from the Basic Principles chapter, we could specify the encoding as ``[Delta, RunLength, IntegerPacking]``: + +``` +[1, 2, 3, 4] +---Delta---> +{ origin = 0, srcType = Int32 } [1, 1, 1, 1] +---RunLength---> +{ srcSize = 4 } [1, 4] +---IntegerPacking---> +{ byteCount = 1, srcSize = 2 } [1, 4] +``` + +**Little endian** is used everywhere to encode the data. + +Once each column has been encoded and the ``File`` data structure built, the +[MessagePack](https://msgpack.org/) format (which is more or less a binary encoding of the standard +JSON format) is used to produce the final binary result. + +Optionally, the data can be compressed using standard methods such as Gzip to achieve +further compression. + +Decoding Process +---------------- + +To decode the BinaryCIF data, first the MessagePack data are decoded and then, +for each column, the binary data are decoded by applying the inverses of the transformations +specified in the ``encoding`` array in reverse order. So to decode the encoding specified by +``[Delta, RunLength, IntegerPacking]`` we would first apply the decoding +of ``IntegerPacking``, then ``RunLength``, and finally ``Delta``. \ No newline at end of file diff --git a/principle.md b/principle.md new file mode 100644 index 0000000..75dd3cb --- /dev/null +++ b/principle.md @@ -0,0 +1,131 @@ +Basic Principles +================ + +This chapter discusses the basic ideas behind BinaryCIF. + +[CIF](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) is a text-based format for storing tabular data. +The data is stored row by row using this syntax: + +``` +loop_ +_category.field1 +_category.field2 +... +_category.fieldK +value-1_1 value-1_2 ... value-1_K +... +value-N_1 value-N_2 ...
value-N_K +``` + +For example, the table called ``atoms`` with columns ``type, id, element, x, y, z`` + +|type|id|element|x|y|z| +|:---:|---:|:---:|---:|---:|---:| +|ATOM|1|C|0|0|0| +|ATOM|2|C|1|0|0| +|ATOM|3|O|0|1|0| +|HETATM|4|Fe|0|0|1| + +would be stored in CIF as + +``` +loop_ +_atoms.type +_atoms.id +_atoms.element +_atoms.x +_atoms.y +_atoms.z +ATOM 1 C 0 0 0 +ATOM 2 C 1 0 0 +ATOM 3 O 0 1 0 +HETATM 4 Fe 0 0 1 +``` + +If we want to compress the rows using dictionary compression, the compressor would identify +the string ATOM as a repeated substring and represent the rows along the lines of + +``` +A = ATOM + +{A} 1 C 0 0 0 +{A} 2 C 1 0 0 +{A} 3 O 0 1 0 +HETATM 4 Fe 0 0 1 +``` + +where ``{A}`` is a dictionary reference to the string ``ATOM``. At first, it would seem +that this is an efficient solution. However, the problem with this data representation is that +it is actually hard to compress because related data is not next to each other. + +Fortunately, we can do much better than this: we can transpose the tabular data +and store it *per column* instead of *per row*: + +``` +_atoms.type: ATOM ATOM ATOM HETATM +_atoms.id: 1 2 3 4 +_atoms.element: C C O Fe +_atoms.x: 0 1 0 0 +_atoms.y: 0 0 1 0 +_atoms.z: 0 0 0 1 +``` + +Now, we can compress all the repeating ATOM values using a method called run-length encoding: + +``` +_atoms.type: {ATOM, 3} HETATM +``` + +Where ``{ATOM, 3}`` means *repeat the string* ``ATOM`` *3 times*. If the value ATOM repeats +1 million times (which is quite common), this approach saves us a lot of space. + +Similarly, we can apply different encoding schemes to other types of data. +For example, the sequence + +``` +1 2 3 4 5 ... n +``` + +can be encoded using delta encoding as + +``` +1 1 1 1 1 ... +``` + +meaning we start with 1, then add 1 to the previous value, ending up with 2, then add 1 to the +previous value as well, getting 3, etc.
At this point, we can use the run-length encoding +approach from the ATOM example and end up with + +``` +{1, n} +``` + +to represent the original sequence of integers from 1 to n. + +The final step is to use binary instead of text encoding to store our data to make it more +space efficient. For example, storing the number 1234 as text requires 4 bytes: + +``` +"1" "2" "3" "4" + +0x31 0x32 0x33 0x34 +``` + +However, storing the number as a 16-bit integer requires only 2 bytes: + +``` +4 * 256 + 210 + + 0x04 0xD2 +``` + +Applying the different encoding methods, the representation of our ``atoms`` table becomes + +``` +_atoms.type: {ATOM, 3} HETATM +_atoms.id: {1, 4} +_atoms.element: {C, 2} O Fe +_atoms.x: 0x00 0x01 0x00 0x00 +_atoms.y: 0x00 0x00 0x01 0x00 +_atoms.z: 0x00 0x00 0x00 0x01 +``` \ No newline at end of file
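As a rough illustration of the encoding and decoding process described in encoding.md, the following Python sketch chains delta, run-length, and signed integer packing on the ``_atoms.id`` column and then undoes them in reverse order. It is not taken from the Mol* or CIFTools.js sources: all function names are hypothetical, the packing rule (emit the boundary value repeatedly, then the remainder) is an assumption inferred from the Integer Packing example, and the final ByteArray and MessagePack steps are left out.

```python
from typing import List


def delta_encode(values: List[int], origin: int = 0) -> List[int]:
    """Store each value as the difference from its predecessor (the first from `origin`)."""
    out, prev = [], origin
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas: List[int], origin: int = 0) -> List[int]:
    """Inverse of delta_encode: accumulate the differences starting at `origin`."""
    out, prev = [], origin
    for d in deltas:
        prev += d
        out.append(prev)
    return out


def run_length_encode(values: List[int]) -> List[int]:
    """Flatten runs of equal values into consecutive (value, repeat count) pairs."""
    out: List[int] = []
    for v in values:
        if len(out) >= 2 and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


def run_length_decode(pairs: List[int]) -> List[int]:
    """Inverse of run_length_encode: expand each (value, count) pair."""
    out: List[int] = []
    for i in range(0, len(pairs), 2):
        out.extend([pairs[i]] * pairs[i + 1])
    return out


def pack_signed(values: List[int], byte_count: int = 1) -> List[int]:
    """Assumed packing rule: values that do not fit are split into boundary values plus a remainder."""
    upper = (1 << (8 * byte_count - 1)) - 1   # 127 for one byte
    lower = -upper - 1                        # -128 for one byte
    out: List[int] = []
    for v in values:
        if v >= 0:
            while v >= upper:
                out.append(upper)
                v -= upper
        else:
            while v <= lower:
                out.append(lower)
                v -= lower
        out.append(v)
    return out


def unpack_signed(packed: List[int], byte_count: int = 1) -> List[int]:
    """Inverse of pack_signed: sum boundary values until a non-boundary value closes the number."""
    upper = (1 << (8 * byte_count - 1)) - 1
    lower = -upper - 1
    out: List[int] = []
    acc = 0
    for t in packed:
        acc += t
        if t != upper and t != lower:
            out.append(acc)
            acc = 0
    return out


# Encode: [Delta, RunLength, IntegerPacking]
ids = [1, 2, 3, 4]
packed = pack_signed(run_length_encode(delta_encode(ids)))

# Decode: apply the inverses in reverse order
restored = delta_decode(run_length_decode(unpack_signed(packed)))
```

The decoder never needs to know how the column was produced beyond the ordered list of encodings, which is why each ``Data`` record carries its own ``encoding`` array.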