Python: pyballc
Javascript: ballcjs
Please see the latest v0.9 update below
BAllCools is a comprehensive tool designed to handle binary AllC (BAllC) files. It addresses the storage challenges posed by the large volume of single-cell data. Conventional AllC files store methylation data as text, which becomes very bulky at single-cell scale. BAllCools stores and retrieves the same data efficiently by converting these text files into binary files, saving substantial storage space. The BAllC format saves >55% storage compared to the AllC format, and BAllCools accelerates BAllC operations such as data merging (ballcools merge) by ~30x.
Note: Currently, ballcools is designed to only solve the storage problem of single cell methylation data. There is no plan to add analysis functions to ballcools to replace allcools.
For background on the (B)AllC format and the specification of the BAllC format, please see doc/ballc_spec.pdf.
Install BAllCools to the current environment
conda install -c jksr ballcools
or install to a new environment
conda create -n ballcenv ballcools -c jksr
docker pull jksrtw/ballcools
docker run --rm jksrtw/ballcools ballcools --help
sudo permission may be needed when working with docker.
git clone https://github.com/jksr/ballcools
cd ballcools && make
The executable ballcools will then be created in the folder bin.
htslib >=1.10, <2.0
libdeflate >=1.6, <2.0
- view: allows you to view the data stored in a BAllC file.
- index: indexes a BAllC file to expedite the retrieval of data.
- a2b: converts an AllC file (text format) into a BAllC file (binary format), allowing for more efficient storage and faster access.
- b2a: converts a BAllC file (binary format) into an AllC file (text format).
- meta: extracts and indexes Cytosines from a genome sequence file (fasta format) and stores them in a CMeta file (bed format).
- query: retrieves data from a BAllC file by genomic range and cytosine context.
- merge: merges multiple BAllC files into a single file. This is ~30x faster than merging AllC files directly.
After installation, BAllCools can be run with the following command:
ballcools [OPTIONS] [SUBCOMMAND]
For help with the tool, use the -h or --help option:
ballcools -h
BAllCools: Binary AllC file tools v(0.0.1)
Usage: ballcools [OPTIONS] [SUBCOMMAND]
Options:
-h,--help Print this help message and exit
Subcommands:
view View data stored in a BAllC file.
index Index a BAllC file.
a2b Convert an AllC file to a BAllc file.
b2a Convert an BAllC file to a Allc file.
meta Extract and index C from a genome sequence file (fasta) and store in a CMeta file (bed format).
query Query info from a BAllC file
check check a BAllC file
merge Merge BAllC files
This will print a help message with a summary of the subcommands and their functionalities.
A CMeta file stores the context and the strandness of each cytosine of a given genome. Although the CMeta file is not required for BAllC files and BAllCools, it is highly recommended to have one to accompany the associated BAllC files. For details of the CMeta file, please check doc/ballc_spec.pdf.
For a given genome (e.g. a standard one like mm10 or hg38, or an individual-specific genome), ballcools meta can be used to generate the corresponding CMeta file from the genome fasta file (.fasta or .fa). This step usually takes around 30 min. It only needs to be run once for each genome, and the resulting CMeta file can be used for all BAllC files associated with that genome.
BAllC files can be created from AllC files with the command ballcools a2b. It is highly recommended to use the --assembly_info option to specify a human-readable label (e.g. mm10, hg38, hg38-donor1, etc.) so that the corresponding genome and the associated CMeta file are not mismatched later.
Another option, --note, can be used as well to specify more info about the genome or other meta info/notes.
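As a rough sketch of how these options might be combined when converting many cells, the dry run below only prints one a2b command per AllC file instead of running it. The positional order (AllC path, BAllC path, chromosome size file) follows the a2b usage shown in the v0.9 notes; the directory layout, the `.allc.tsv.gz` and `.ballc` suffixes, the file names, and passing the label as a separate argument after `--assembly_info` are all assumptions for illustration. Check `ballcools a2b -h` before running anything for real.

```shell
# Hypothetical inputs: adjust to your own data.
allc_dir="allc"                 # folder of per-cell AllC files (assumed layout)
chrom_size="mm10.chrom.sizes"   # chromosome size file (assumed name)
assembly="mm10"                 # human-readable genome label

# Derive the output name by swapping the assumed ".allc.tsv.gz" suffix for ".ballc".
ballc_name() {
    printf '%s.ballc\n' "${1%.allc.tsv.gz}"
}

# Build the a2b command line for one file. It is printed, not executed (dry run).
a2b_command() {
    printf 'ballcools a2b %s %s %s --assembly_info %s --note %s\n' \
        "$1" "$(ballc_name "$1")" "$chrom_size" "$assembly" "'converted from AllC'"
}

for allc in "$allc_dir"/*.allc.tsv.gz; do
    [ -e "$allc" ] || continue   # skip when the glob matches nothing
    a2b_command "$allc"
done
```

Printing the commands first makes it easy to confirm that every cell gets the same assembly label before any file is written.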
Converting AllC to BAllC files requires a chromosome size file, a tab-separated file with the chromosome name in the 1st column and the chromosome length in the 2nd column. These files are usually available alongside the genome fasta from online resources like UCSC, RefSeq, etc. E.g., for hg38, see https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/.
See ballcools a2b -h for details.
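When no ready-made chromosome size file is at hand (for example, for an individual-specific genome), one possible route is to derive it from a fasta index. The sketch below assumes an index in the `.fai` layout written by `samtools faidx`, whose first two columns are sequence name and length. The index content here is a small mock so the example runs on its own, and the file names are hypothetical.

```shell
# Mock fasta index standing in for a real "genome.fa.fai".
# Columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH (values below are made up).
workdir="$(mktemp -d)"
printf 'chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n' > "$workdir/genome.fa.fai"

# Keep only name and length: the two-column chromosome size layout.
cut -f1,2 "$workdir/genome.fa.fai" > "$workdir/genome.chrom.sizes"

# Sanity check: count rows that do not have exactly two fields
# or whose length is not a positive integer.
bad_rows="$(awk -F'\t' 'NF != 2 || $2 !~ /^[0-9]+$/ || $2 <= 0' \
    "$workdir/genome.chrom.sizes" | wc -l | tr -d ' ')"
echo "malformed rows: $bad_rows"
cat "$workdir/genome.chrom.sizes"
```

The chromosome names in this file should match the names used in the AllC files being converted.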
scBAllC files can be merged to create pseudo-bulk BAllC files with the command ballcools merge. This is a much faster (>30x) replacement for merging AllC files directly.
After merging, you could query the pseudo-bulk BAllC files directly.
Because tools like allcools or methylpy already provide versatile functions for analyzing (pseudo)bulk AllC files, and the larger storage requirement of the AllC format for bulk-level data is usually tolerable, one can convert the merged BAllC files back to AllC files for downstream analysis.
To query a BAllC file, use the command ballcools query. When the corresponding CMeta file is given with the option --cmetapath, the cytosine context and strandness are output as well. Otherwise, only the methylated and total read counts are output.
BAllCools was described in BAllC and BAllCools: Efficient Formatting and Operating for Single-Cell DNA Methylation Data. W Tian, W Ding, JR Ecker. BioRxiv (https://doi.org/10.1101/2023.09.22.559047). Please cite the paper if you use BAllCools in your research.
@article {Tian2023ballc,
author = {Tian, Wei and Ding, Wubin and Ecker, Joseph R},
title = {BAllC and BAllCools: Efficient Formatting and Operating for Single-Cell DNA Methylation Data},
year = {2023},
doi = {10.1101/2023.09.22.559047},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://doi.org/10.1101/2023.09.22.559047},
journal = {BioRxiv}
}
Better error handling. ballcools now outputs human-readable errors when, for example, a wrong or bad input file is used.
query now supports using a bed file as input (-b).
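A regions file for the -b option could be prepared along these lines. This is only a sketch: it assumes the plain three-column BED layout (chromosome, start, end, tab-separated, with 0-based half-open coordinates as in the standard BED convention), and the coordinates and file are made up. Whether ballcools interprets coordinates exactly this way, and where -b goes among the other query arguments, should be confirmed with `ballcools query -h`.

```shell
# Write two hypothetical regions to a temporary BED file.
regions_bed="$(mktemp)"
printf 'chr1\t3000000\t3001000\nchr2\t5000000\t5002500\n' > "$regions_bed"

# Count rows with fewer than 3 fields, non-numeric coordinates, or start >= end.
invalid="$(awk -F'\t' 'NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2+0 >= $3+0' \
    "$regions_bed" | wc -l | tr -d ' ')"
echo "invalid regions: $invalid"
```

Validating the file up front avoids discovering a malformed row partway through a long query.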
A permanent cmeta (and cmeta index) path can now be specified in the header note of a ballc file. If it is specified, ballcools can query the cmeta automatically, without the cmeta path being given explicitly with the -c option of the query function.
BAllCools now searches for the pattern "cmeta((CMETA_PATH))" in the header note of a BAllC file. If found, ballcools query will use this info. To add such info, users can use
ballcools a2b ALLC_PATH BALLC_PATH CHROM_SIZE -n "cmeta((CMETA_PATH))"
where -n adds a note to the ballc header. Both local and online (e.g. a URL) CMETA_PATHs are supported.
The index file path does not need to be given if the index sits in the same location as the cmeta file and is named as the cmeta file name plus a '.tbi' suffix. Otherwise, it can also be specified in the header note of a BAllC file with the pattern "cmetaidx((CMETA_INDEX_PATH))".
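To make the two header-note patterns concrete, the sketch below assembles a note string and then pulls the paths back out with sed. The paths are hypothetical, putting both patterns in one note separated by a space is an assumption, and the `extract_field` helper is purely illustrative: it is not how ballcools itself parses the header.

```shell
# Hypothetical paths for illustration only.
cmeta_path="/data/genomes/mm10/mm10.cmeta.bed.gz"
cmeta_index="/data/indexes/mm10.cmeta.bed.gz.tbi"

# Default index location: the cmeta name plus ".tbi".
# No cmetaidx entry is needed when the index lives there.
default_index="${cmeta_path}.tbi"

# Note text that could be passed to a2b via -n "$note".
note="cmeta((${cmeta_path})) cmetaidx((${cmeta_index}))"

# Illustrative helper: print the path stored under a key (cmeta or cmetaidx).
# Paths containing ")" are not handled.
extract_field() {
    printf '%s\n' "$2" | sed -n "s/.*${1}((\([^)]*\))).*/\1/p"
}

echo "cmeta:    $(extract_field cmeta "$note")"
echo "cmetaidx: $(extract_field cmetaidx "$note")"
```

Here the index is deliberately placed away from the default location, which is the case where the cmetaidx pattern is needed.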
- A bug that caused all ballc files generated with ballcools versions before 0.9 to be written in "bulk" format even when "single cell" format was specified (see doc/ballc_spec.pdf).
This does not affect any user application; it only increases the ballc file size slightly. You can ignore this and keep using the old files, or convert them to allc files and then to the "single cell" ballc format with the updated ballcools (ver>=0.9).
- ballcools view now provides more details of the file header
- ballcools a2b now indexes the ballc file directly. No ballcools index call is needed any more
- ballcools query has been rewritten, with a huge performance boost, especially for large genome regions (>100M). As a result, the query no longer tolerates a mismatched cmeta file and will raise an error. In the previous version, the query function could behave more tolerantly with options such as --skip_mismatch and --warn_mismatch. The old query function is deprecated and has been renamed ballcools query_slow. This function will be removed in a future version.
Add support to query multiple regions at the same time