Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tumor-normal variant calling results change drastically after TrimGalore's application #201

Closed
fmazzarotto opened this issue Feb 4, 2025 · 4 comments

Comments

@fmazzarotto
Copy link

Hello,

I am processing 150bp Illumina paired-end whole-exome sequenced tumor-normal sample pairs, with an ensemble of 6 somatic variant callers.

I ran TrimGalore with the following command (which I then repeat on the normal control sample):

trim_galore \
--fastqc_args "--outdir $ResultsDir/Fastqc --format fastq --threads 6" \
--output_dir $TempSampleDir \
--cores 8 \
--paired \
${TumorSampleName}_R1_raw.fastq.gz ${TumorSampleName}_R2_raw.fastq.gz 

However, variant calling results change drastically without vs with trimming, to the point that there is clearly something wrong (e.g. Strelka2 goes from calling hundreds of PASS variants - including some lab-validated ones, to flagging all calls with the LowDepth/LowEVS filter - other callers also go from calling hundreds of variants to none). Average depth across the target region is >600x for the tumor and >200x for the normal sample, so the Strelka2 LowDepth filters don't make sense to me.

One thing I don't understand (and that I suspect may explain what I see) is that TrimGalore recognizes the Illumina universal adapter in 96741 reads (9.6%) of the first million reads of the tumor R1 file, but then the summary says that "reads with adapters" were 41% (39M on about 95.6M). A similar discrepancy is present on all files (tumor/normal R1/R2).

Is it possible I am using a wrong command? And is there something that can explain such a large discrepancy between the estimated 9.6% reads with adapters and the effective 41%?

Thanks very much

@FelixKrueger
Copy link
Owner

Hi there,

Looking at your command, I don't think that there is anything wrong with it. It should remove Illumina adapters, and further filter reads for a quality threshold of Phred 20 - and not do anything sinister to the reads that prevent variant callers from working. Trying to understand what LowEVS is, it seems that other have experienced the sample thing: All variants being labeled as "LowEVS" #231
. Or https://groups.google.com/g/strelka-discuss/c/v0farBTs0Iw)

Maybe you could try disabling the quality trimming (-q 0) to see if that makes a difference? (thinking back to the GATK best practices course I once attended, I thought that variant callers can deal with low quality values, and know how to handle this? I have also reached out to the authors of the nf-core Sarek pipeline for their thoughts on this.

Regarding the discrepancy of numbers:
The adapter auto-detection in Trim Galore looks for perfect matches of adapters in the first 1M sequences. These sequences are ~13bp long, which apparently was found in ~10% of cases.

The 41% statistic is a cumulative value of all sequences that could the adapter, so for the Illumina adapater AGATCGGAAGAGC it could match A, AG, AGA, AGAT, AGATC, etc. as well as full length read-through contaminants. So again, this is all fine, and expected. You can see at the trimming report that the majority of what was removed is probably 1-4 bp, everything beyond 10bp or so is probably true adapter.

@FelixKrueger
Copy link
Owner

Just relaying some further information here from @maxulysse and @tdanhorn:

"The LowEVS filter is Strelka's attempt to score its confidence in variants. I know nothing about it, but It might be based on machine learning, the key word (and E in EVS) being "empirical", which means based on data, rather than theory. I would assume that this was trained on data as it comes off the sequencer (i.e. not quality filtered, but probably adapter-trimmed). If you throw out a bunch of reads based on Q-scores, the data may no longer conform to the assumptions the EVS is based on. Furthermore, you would be reducing the number of reads, and if the coverage is already low or border-line, this would certainly affect variant calling (since in most sequence analysis applications confidence comes from high numbers)."

and

... adapter trimming is indicated, quality-trimming isn’t?

"That is my gut-feeling yes. Whenever you filter or trim, you are throwing away information, so I am generally weary of doing it for just making data "clean and nice" (obviously, adapters are not helpful and should be removed).
I think that issue warrants more study -- is that >200x coverage (which is certainly good) still there after the trimming? If so, that would certainly be strange, especially if it affects all callers. It might be useful to compare CRAMs/BAMs with versus without quality trimming in IGV and look at the loci where there are high-confidence variants ... If this is WES, the coverage can vary a lot from one locus to another."

So yea, I think the current consensus is what I also suggested above, namely to try disabling the quality trimming (-q 0) and look at the results in some closer detail.

Please let us know how you get on!

@fmazzarotto
Copy link
Author

Hi Felix,

thank you so much for your replies! I looked into this more closely, and I realized there was a (very) dumb mistake in the way my scripts were parsing fastq files. Incredibly, I haven't realized this in days while testing the various options. Sorry about this.

Now that everything seems to work fine, I can confirm that results look good when applying TrimGalore prior to mapping. Both trimming and quality filtering seem to impact the number of LowEVS-flagged variants, but neither of them seem to modify the number of PASS calls substantially.

To put it in practical terms: without running TrimGalore, Strelka2 calls 8316 variants, including 852 PASS and 7462 LowEVS. Running TrimGalore with its default quality filtering (q>20), the total number of variants increases to 9239, including however a very similar number of PASS calls (863) and a higher number of LowEVS variants (8373). Running TrimGalore without quality filtering (--quality 0), the total number of variants is 8989, including 859 PASS and 8127 LowEVS.

So, in this case at least it looks like adapter trimming has a bigger effect on the number of LowEVS calls than read quality filtering, but overall the number of PASS variants remains constant (which I take as a good sign).

Based on this, I would probably be incline to use TrimGalore letting it perform its default read quality filtering. If you think it would be wise to disable that, I will proceed that way.

Thanks again for your assistance!

@FelixKrueger
Copy link
Owner

Thanks for taking a closer look and reporting back here, these are good news all round!

It's good to see this put to a test. I am glad that the quality trimming doesn't seem to make much of difference (and that if anything the default Trim Galore results come out so well).

Wishing you all the best with your downstream analyses!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants