Hello,
This isn't exactly a GATK question, but I think this problem is faced by many GATK users. Our current practice is to run many GATK-based jobs scatter/gathered, meaning hundreds or thousands of individual jobs, each operating against a set of coordinates. The result is N VCFs that we need to append to one another and bgzip to produce a final VCF. This process is painfully slow with many jobs and large VCFs. Are there simple Linux tricks to make it faster?
Our current pattern is something like this:
#!/bin/bash
# First write the VCF header to a file by itself named header.vcf,
# then zcat the shards in a block, piped to bgzip:
{
cat header.vcf
zcat vcf1.vcf.gz | grep -v '^#';
zcat vcf2.vcf.gz | grep -v '^#';
zcat vcf3.vcf.gz | grep -v '^#';
zcat vcf4.vcf.gz | grep -v '^#';
# ...and so on for each remaining shard
} | bgzip -f --threads XX > finalVcf.vcf.gz
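For concreteness, one variation on the same pattern fans the per-shard decompression out across cores while keeping output order. This is only a sketch: it assumes GNU parallel is installed, the shard names and thread counts are placeholders, and the list fed to parallel must already be in genomic order (a lexicographic glob may not be).
# Sketch only: shard names and the thread count (8) are placeholders.
# parallel -k preserves the output order of the argument list.
{
cat header.vcf
parallel -k -j 8 "zcat {} | grep -v '^#'" ::: vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz vcf4.vcf.gz
} | bgzip -f --threads 8 > finalVcf.vcf.gz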
I hope this slightly off-topic question is alright here. I appreciate any suggestions people might have.
There is nothing VCF-specific to this - the whole point is to avoid any parsing whatsoever (which is why GATK is honestly often not the right direction). Fair point on bcftools - I will compare them.
The point of this question is speed and low overhead.
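For reference, the bcftools route being compared would look roughly like this. A sketch only: shard names are placeholders, all shards must already be bgzipped, and --naive skips the usual header checks, so it is on the user to make sure the shards share the same samples and are listed in genomic order.
# --naive copies the existing BGZF blocks instead of decompressing,
# parsing, and recompressing each record.
bcftools concat --naive -o finalVcf.vcf.gz \
    vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz vcf4.vcf.gz
# For hundreds of shards, pass an ordered list via --file-list instead:
# bcftools concat --naive --file-list shards.txt -o finalVcf.vcf.gz
tabix -p vcf finalVcf.vcf.gz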
I believe both GatherVcfs (picard) and GatherVcfsCloud (gatk) try to avoid doing any parsing at all if the shards are already block-gzipped - I believe they just transfer the blocks directly. So at least for that case they might be fine.
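If that holds, the invocations would be roughly as below. A sketch only: shard names are placeholders, the shards must be listed in genomic order, and flag spellings should be checked against the installed Picard/GATK version.
# Picard (classic I=/O= syntax):
java -jar picard.jar GatherVcfs I=vcf1.vcf.gz I=vcf2.vcf.gz I=vcf3.vcf.gz O=finalVcf.vcf.gz
# GATK4:
gatk GatherVcfsCloud -I vcf1.vcf.gz -I vcf2.vcf.gz -I vcf3.vcf.gz -O finalVcf.vcf.gz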