Efficiently perform simple merge/append on many VCFs? #9097

bbimber · 2025-02-18T14:44:49Z

Hello,

This isn't exactly a GATK question, but I think this problem would be faced by many GATK users. Our current practice is to run many GATK-based jobs scatter/gathered, meaning 100s or 1000s of individual jobs where each operates against a set of coordinates. The result is N VCFs that we need to append to one another and bgzip to make a final VCF. This process is painfully slow when dealing with many jobs and large VCFs. Are there simple Linux tricks to make this faster?

Our current pattern is something like this:

#!/bin/bash

# first write the VCF header to a file by itself named header.vcf

# Then zcat them in a block, piped to bgzip:
{
cat header.vcf
zcat vcf1.vcf.gz | grep -v '^#';
zcat vcf2.vcf.gz | grep -v '^#';
zcat vcf3.vcf.gz | grep -v '^#';
zcat vcf4.vcf.gz | grep -v '^#';
# etc.... (one such line per remaining shard)
} | bgzip -f --threads XX > finalVcf.vcf.gz
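With 100s or 1000s of shards, the same pattern can be driven from a list file instead of one hand-written line per shard. The following is only a sketch of that idea, not a faster method: `append_vcf_bodies` and `shards.txt` are hypothetical names that do not appear in the original post, and the list is assumed to already be in genomic order.

```shell
#!/bin/bash
# Sketch: emit one header, then the header-stripped body of each shard in order.
# append_vcf_bodies and the shard list file are hypothetical, for illustration only.
append_vcf_bodies() {
    local header_file="$1"
    local shard_list="$2"   # one .vcf.gz path per line, assumed to be in genomic order

    cat "$header_file"
    while IFS= read -r shard; do
        # gzip -dc behaves like zcat; bgzipped files are valid gzip streams.
        # "|| true" keeps a shard with no records from failing under "set -e".
        gzip -dc "$shard" | grep -v '^#' || true
    done < "$shard_list"
}
```

It would be used the same way as the block above, for example `append_vcf_bodies header.vcf shards.txt | bgzip -f --threads XX > finalVcf.vcf.gz`, with `XX` left as a placeholder as in the original. Every record is still decompressed and recompressed, so this only tidies the script.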

I hope this slightly off topic question is alright here. I appreciate any suggestions people might have.

gokalpcelik · 2025-02-18T16:40:06Z

Why don't you use bcftools or gatk GatherVcfs?

In your case bcftools would be the best. Just provide all the parameters and pass the list of files with wildcards or regexes.
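As a rough sketch of what the bcftools route could look like: `bcftools concat` has a `--naive` mode that concatenates compressed files without recompressing them, which matches the "no parsing" goal, but it requires all shards to be the same type and to share the same header. The helper names `write_shard_list` and `concat_shards_naive`, the `*.vcf.gz` naming, and the assumption that shard names sort lexically into genomic order are all illustrative, not taken from the thread.

```shell
#!/bin/bash
# write_shard_list and concat_shards_naive are hypothetical helper names.

# Write the shard paths, one per line, in glob (lexical) order.
# Assumes shard names sort into genomic order, e.g. zero-padded indices.
write_shard_list() {
    local shard_dir="$1"
    local list_file="$2"
    printf '%s\n' "$shard_dir"/*.vcf.gz > "$list_file"
}

# Concatenate the listed shards without decompressing or parsing records,
# then build a tabix index for the result.
# --naive requires all shards to be the same type with identical headers.
concat_shards_naive() {
    local list_file="$1"
    local out_vcf="$2"
    bcftools concat --naive --file-list "$list_file" --output "$out_vcf"
    bcftools index --tbi "$out_vcf"
}
```

Because `--naive` does not inspect the records, it is worth checking the output once (for example by comparing record counts against the shards) before relying on it in a pipeline.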

bbimber · 2025-02-18T16:44:15Z

There is nothing VCF-specific about this, so frankly I would like to avoid any parsing whatsoever (which is why GATK is honestly often not the right direction). Fair point on bcftools - I will compare them.

The point of this question is about speed and low overhead.

cmnbroad · 2025-02-18T21:29:39Z

I believe both GatherVcfs (picard) and GatherVcfsCloud (gatk) try to avoid doing any parsing at all if the shards are already block-gzipped - I believe they just transfer the blocks directly. So at least for that case they might be fine.
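The reason a block transfer can work without parsing is that the gzip format allows several compressed members to be joined byte-for-byte into one valid stream, and bgzip output is a series of such gzip-compatible blocks. The sketch below demonstrates only that property, with plain `gzip` standing in for `bgzip` (an assumption made so it runs anywhere) and with made-up file names and records. It does not show how GatherVcfs is implemented: a real gather tool also has to drop the headers of later shards and deal with the BGZF end-of-file marker, which this ignores.

```shell
#!/bin/bash
# Demonstration only: plain gzip stands in for bgzip, and all names are made up.

# Byte-level concatenation of compressed files; nothing is decompressed or parsed.
concat_gzip_members() {
    cat "$@"
}

workdir=$(mktemp -d)
printf 'chr1\t100\n' | gzip -c > "$workdir/part1.gz"
printf 'chr1\t200\n' | gzip -c > "$workdir/part2.gz"
concat_gzip_members "$workdir/part1.gz" "$workdir/part2.gz" > "$workdir/joined.gz"

# Decompressing the joined file yields the records of both parts, in order.
gzip -dc "$workdir/joined.gz"
```

The cost of the join is a plain file copy, which is why this approach can be much cheaper than decompressing, filtering with `grep`, and recompressing every shard.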

bbimber · 2025-02-18T21:34:41Z

Interesting - I will check that out.
