tools for binary file format of DNA methylation data


jksr/ballcools


BAllCools: Binary AllC File Tools


Other programming language bindings and related resources

Python: pyballc

Javascript: ballcjs

Please see the latest v0.9 update below

BAllCools is a toolkit for handling binary AllC (BAllC) files. It addresses the storage burden created by the large number of cells in single-cell datasets. Conventional AllC files store methylation data as text, which becomes bulky at single-cell scale. BAllCools converts these text files into binary files, which saves substantial storage space and makes the data efficient to store and retrieve. The BAllC format saves >55% of storage compared with the AllC format, and BAllCools accelerates BAllC operations such as data merging (ballcools merge) by ~30x.

Note: Currently, ballcools is designed only to solve the storage problem of single-cell methylation data. There is no plan to add analysis functions to ballcools to replace allcools.

Background and BAllC format

For background on the (B)AllC format and the specification of the BAllC format, please see doc/ballc_spec.pdf.

Installation and compilation

The most convenient way to install BAllCools is with conda.

Install BAllCools to the current environment

conda install -c jksr ballcools

or install to a new environment

conda create -n ballcenv ballcools -c jksr

BAllCools can also be used with docker

docker pull jksrtw/ballcools
docker run --rm jksrtw/ballcools ballcools --help

sudo permission may be needed when working with docker.

Compiling from source requires a C++ compiler (e.g. g++) and make

git clone https://github.com/jksr/ballcools
cd ballcools && make

The executable ballcools will then be created in the folder bin.

Dependency

htslib >=1.10, <2.0
libdeflate >=1.6, <2.0

Features

  • view: allows you to view the data stored in a BAllC file.
  • index: indexes a BAllC file to expedite the retrieval of data.
  • a2b: converts an AllC file (text format) into a BAllC file (binary format), allowing for more efficient storage and faster access.
  • b2a: converts a BAllC file (binary format) into an AllC file (text format).
  • meta: extracts and indexes Cytosines from a genome sequence file (fasta format) and stores them in a CMeta file (bed format).
  • query: retrieves data from a BAllC file by genome range and cytosine context.
  • merge: merges multiple BAllC files into a single file. This is ~30x faster than merging AllC files directly.

Usage

After installation, BAllCools can be run with the following command:

ballcools [OPTIONS] [SUBCOMMAND]

For help with the tool, use the -h or --help option:

ballcools -h
BAllCools: Binary AllC file tools v(0.0.1)
Usage: ballcools [OPTIONS] [SUBCOMMAND]

Options:
  -h,--help                   Print this help message and exit

Subcommands:
  view                        View data stored in a BAllC file.
  index                       Index a BAllC file.
  a2b                         Convert an AllC file to a BAllc file.
  b2a                         Convert an BAllC file to a Allc file.
  meta                        Extract and index C from a genome sequence file (fasta) and store in a CMeta file (bed format).
  query                       Query info from a BAllC file
  check                       check a BAllC file
  merge                       Merge BAllC files

This will print a help message with a summary of the subcommands and their functionalities.

Basic workflow

Figure: ballcools workflow

Create the CMeta file

A CMeta file stores the context and the strandness of each cytosine of a given genome. Although the CMeta file is not required for BAllC files and BAllCools, it is highly recommended to have one to accompany the associated BAllC files. For details of the CMeta file, please check doc/ballc_spec.pdf.

For a given genome (e.g. a standard one like mm10 or hg38, or an individual-specific genome), ballcools meta can be used to generate the corresponding CMeta file from the genome fasta file (.fasta or .fa). This step usually takes around 30 min. It only needs to be run once for each genome, and the resulting CMeta file can be used for all BAllC files associated with that genome.

Create BAllC files

BAllC files can be created from AllC files with the command ballcools a2b. It is highly recommended to use the --assembly_info option to specify a human-readable label (e.g. mm10, hg38, hg38-donor1) so that the corresponding genome and the associated CMeta file are not mismatched later.

The --note option can also be used to record more information about the genome or other metadata and notes.

Converting AllC files to BAllC files requires a chromosome size file, which is a tab-separated file with the chromosome name in the 1st column and the chromosome length in the 2nd column. These files can usually be downloaded together with the genome fasta from online resources such as UCSC or RefSeq. E.g. for hg38, see https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/.
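
Since the chromosome size file has a fixed two-column layout, it may help to check it before conversion. The following is a minimal sketch and not part of ballcools; the function name check_chrom_sizes and the example file name are hypothetical. It only verifies the layout described above: two tab-separated columns, with a positive integer in the second.

```shell
# check_chrom_sizes FILE  (hypothetical helper, not a ballcools command)
# Exit status 0 only if every line has exactly two tab-separated fields
# and the second field is a positive integer.
check_chrom_sizes() {
    awk -F'\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ || $2 + 0 <= 0 {
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example (file name is an assumption):
# check_chrom_sizes hg38.chrom.sizes && echo "chromosome size file looks fine"
```

Malformed lines are reported on stderr with their line number, so a file that mixes spaces and tabs is easy to locate and fix.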

See ballcools a2b -h for details.
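
To make the command shape concrete, here is a small dry-run sketch that only prints an a2b command line for each AllC file rather than running it. The helper name a2b_cmd, the example file names, and the output naming rule (replacing a trailing .allc.tsv.gz with .ballc) are assumptions for illustration. The positional order ALLC_PATH BALLC_PATH CHROM_SIZE follows the a2b usage shown in the Update section; check ballcools a2b -h for the exact syntax in your version.

```shell
# a2b_cmd ALLC_PATH CHROM_SIZE ASSEMBLY  (hypothetical helper)
# Prints, but does not run, an a2b command with an assembly label.
a2b_cmd() {
    allc=$1
    chrom_size=$2
    assembly=$3
    # Assumed naming rule: sample.allc.tsv.gz -> sample.ballc
    ballc="${allc%.allc.tsv.gz}.ballc"
    printf 'ballcools a2b %s %s %s --assembly_info %s\n' \
        "$allc" "$ballc" "$chrom_size" "$assembly"
}

# Dry run over a few hypothetical per-cell files.
for f in cell1.allc.tsv.gz cell2.allc.tsv.gz; do
    a2b_cmd "$f" hg38.chrom.sizes hg38
done
```

Printing the commands first makes it easy to confirm that every file in a batch gets the same assembly label before anything is converted. The sketch does not quote paths, so it assumes file names without spaces.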

Merge BAllC files

Single-cell BAllC (scBAllC) files can be merged to create pseudo-bulk BAllC files with the command ballcools merge. This is a much faster (>30x) replacement for merging AllC files directly. After merging, you can query the pseudo-bulk BAllC files directly.

Because tools like allcools and methylpy already provide versatile functions for analyzing (pseudo-)bulk AllC files, and the larger storage requirement of the AllC format is usually tolerable for bulk-level data, one can convert the merged BAllC files back to AllC files for downstream analysis.

Query BAllC files

To query a BAllC file, use the command ballcools query. When the corresponding CMeta file is given with the option --cmetapath, the cytosine context and strandness are included in the output. Otherwise, only the methylated read count and the total read count are output.

Citation

BAllCools was described in BAllC and BAllCools: Efficient Formatting and Operating for Single-Cell DNA Methylation Data. W Tian, W Ding, JR Ecker. BioRxiv (https://doi.org/10.1101/2023.09.22.559047). Please cite the paper if you use BAllCools in your research.

@article {Tian2023ballc,
    author = {Tian, Wei and Ding, Wubin and Ecker, Joseph R},
    title = {BAllC and BAllCools: Efficient Formatting and Operating for Single-Cell DNA Methylation Data},
    year = {2023},
    doi = {10.1101/2023.09.22.559047},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://doi.org/10.1101/2023.09.22.559047},
    journal = {BioRxiv}
}

Update

v0.9.9

5/21/2024

Better error handling. ballcools now outputs human-readable errors when, for example, a wrong or bad input file is used.

5/17/2024

query now supports using a BED file as input (-b).

5/14/2024

A permanent cmeta (and cmeta index) path can now be specified in the header note of a ballc file. If this info is present, ballcools can locate the cmeta automatically, without the cmeta path being given explicitly with the -c option of the query function.

BAllCools now searches for the pattern "cmeta((CMETA_PATH))" in the header note of a BAllC file. If it is found, ballcools query will use this info. To add such info, users can use

ballcools a2b ALLC_PATH BALLC_PATH CHROM_SIZE -n "cmeta((CMETA_PATH))"

where -n adds a note to the ballc header. Both local and online (e.g. a URL) CMETA_PATHs are supported.

The index file path does not need to be given if the index sits next to the cmeta file and is named as the cmeta file name plus a '.tbi' postfix. Otherwise, it can also be specified in the header note of a BAllC file with the pattern "cmetaidx((CMETA_INDEX_PATH))".
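
The note pattern and the default index location can be sketched with a few shell helpers. This is an illustration only: the function names and the example path are hypothetical, and the sed expression merely shows how the documented pattern can be recognized. It is not a description of how ballcools parses the header internally.

```shell
# Build the note fragment that points at a CMeta file.
cmeta_note() { printf 'cmeta((%s))' "$1"; }

# Default index location: the CMeta path with a ".tbi" postfix.
default_cmeta_index() { printf '%s.tbi' "$1"; }

# Recover the CMeta path from a note string (illustrative only).
extract_cmeta_path() {
    printf '%s\n' "$1" | sed -n 's/.*cmeta((\([^)]*\))).*/\1/p'
}

# The fragment can then be passed to a2b with -n, for example:
# ballcools a2b ALLC_PATH BALLC_PATH CHROM_SIZE -n "$(cmeta_note /refs/hg38.cmeta.gz)"
```

With the hypothetical path above, cmeta_note produces cmeta((/refs/hg38.cmeta.gz)) and default_cmeta_index produces /refs/hg38.cmeta.gz.tbi, which is where the index would be expected if no cmetaidx pattern is given.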

v0.9.0 (3/2/2024)

An update with an important performance improvement and a bug fix.

  • Fixed a bug that caused all ballc files generated with ballcools versions before 0.9 to be in "bulk" format even if "single cell" format was specified (see doc/ballc_spec.pdf).

This bug does not affect any user application; it only increases the ballc file size slightly. You can ignore it and keep using the old files, or convert them to allc files and then to the "single cell" ballc format with the updated ballcools (ver>=0.9).

  • ballcools view now provides more details of the file header

  • ballcools a2b now indexes the ballc file directly; a separate ballcools index call is no longer needed

  • ballcools query has been rewritten, with a huge performance boost, especially for large genome regions (>100M). As a result, query no longer tolerates a mismatched cmeta file and raises an error instead. In previous versions, the query function allowed more tolerant behavior through options such as --skip_mismatch and --warn_mismatch. The old query function is deprecated and has been renamed ballcools query_slow. It will be removed in a future version.

v0.0.6 (2/22/2024)

Added support for querying multiple regions at the same time
