allelicBalance and checkHetsIndvVCF on WGS data #17

Open
npatel-ah opened this issue Apr 30, 2024 · 1 comment

npatel-ah commented Apr 30, 2024

Hello,

I've come across this informative article https://speciationgenomics.github.io/allelicBalance/ which brings me to the "allelicBalance" and "checkHetsIndvVCF" scripts.

The article suggests that the script is not designed for WGS data, but the presentation at https://github.com/speciationgenomics/presentations/blob/master/PDFs_2022/AllelicBalance_PCRduplication.pdf shows an example with WGS data.

What are the downsides of using "checkHetsIndvVCF.sh" for WGS analysis? I am also curious how haploid individuals and difficult regions of the genome, such as repeats, will affect the results.

Regards,
Nihir

Contributor

joanam commented Jan 18, 2025

Hi Nihir,

This tool is absolutely meant for NGS data. I would use it on SNPs only. It takes a VCF file as input and only works for diploid individuals. Repeats that are collapsed in the reference genome will generate false heterozygous sites and can thus in theory affect the results. If two repeat copies map to a single collapsed copy in the reference genome, this creates false heterozygotes with a read frequency of 0.5 at every site where the two copies differ. This would not produce a false signature of contamination. If more than two repeat copies map to the collapsed region in the reference genome, it could in principle cause biased read frequencies. However, I have never seen this be a problem in any of the datasets I have worked with. Perhaps if the reference genome is quite poorly assembled or from a distant relative, and there has been a recent TE expansion, it could cause problems.

Best,
Joana
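
The read-frequency argument in the reply above can be illustrated with a small sketch. This is not part of checkHetsIndvVCF.sh. The function name `expected_alt_fraction` is hypothetical, and the model assumes that every repeat copy contributes the same read depth, that each copy is homozygous within the individual, and that there is no mapping bias.

```python
from fractions import Fraction


def expected_alt_fraction(copies_total, copies_with_alt):
    """Expected fraction of reads carrying the non-reference allele at a
    site inside a collapsed repeat.

    Assumption (illustrative only): all `copies_total` repeat copies map to
    one collapsed reference copy with equal depth, and `copies_with_alt` of
    them carry the differing base at this site.
    """
    if copies_total < 1:
        raise ValueError("copies_total must be at least 1")
    if not 0 <= copies_with_alt <= copies_total:
        raise ValueError("copies_with_alt must be between 0 and copies_total")
    return Fraction(copies_with_alt, copies_total)


if __name__ == "__main__":
    # List the expected read fraction for 2, 3 and 4 collapsed copies.
    for total in (2, 3, 4):
        for alt in range(1, total):
            frac = expected_alt_fraction(total, alt)
            print(f"{total} copies, {alt} differing: {frac}")
```

Under these assumptions, two collapsed copies give a fraction of 1/2, which looks like an ordinary heterozygote, while three collapsed copies give 1/3 or 2/3, which is the kind of biased read frequency mentioned in the reply.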
