Some questions regarding the preprocessing of the RNA-seq data #7

aeijpe · 2024-10-31T16:44:56Z

Hi!

First of all, thank you very much for your amazing work and for providing the code, it’s been very helpful. I have a few questions on how you obtain your rna_clean.csv files, which, as I understand, you use as your RNA-seq data for training the MMP.

According to the preprocess_pancancer_TCGA_normalized_RNA.ipynb notebook, you obtain the pan-cancer normalized RNA-seq data from the Xena database. My first question is about the choice to use pan-cancer normalized data. Since each model is tested within the same TCGA cohort it’s trained on, I was wondering if there was a specific reason for choosing pan-cancer normalization for cohort-specific testing.

Then, I have a few questions on how you went from the downloaded RNA-seq pan-cancer normalized data (df_raw) to the data stored in the rna_clean.csv files. According to the notebook, you transpose the data and keep only the genes that are also in the hallmark gene sets. However, when I follow these steps, I get data that is different from the rna_clean.csv files you provided.
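A minimal sketch of the transpose-and-filter step as described above may help pin down where the results diverge. It assumes `df_raw` is a genes-by-samples table, as Xena matrices typically are, and that the Hallmark genes are available as a plain set of symbols. The helper name `filter_to_hallmark`, the toy gene names, and the toy sample IDs are hypothetical and not taken from the notebook.

```python
import pandas as pd


def filter_to_hallmark(df_raw, hallmark_genes):
    """Transpose genes-by-samples to samples-by-genes, keep Hallmark genes only."""
    df = df_raw.T  # rows become samples, columns become genes
    keep = [gene for gene in df.columns if gene in hallmark_genes]
    return df[keep]


# Toy stand-in for the downloaded matrix (hypothetical genes and samples).
df_raw = pd.DataFrame(
    {"SAMPLE_1": [1.0, 2.0, 3.0], "SAMPLE_2": [4.0, 5.0, 6.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
hallmark_genes = {"GENE_A", "GENE_C", "GENE_Z"}  # GENE_Z is absent from df_raw

df_filtered = filter_to_hallmark(df_raw, hallmark_genes)
print(df_filtered)
```

With this procedure, a gene can only appear in the output if it is present in both the raw matrix and the Hallmark set, so any gene in rna_clean.csv that is missing from the raw download must have come from another source or step.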

Regarding the TCGA-LUAD cohort:

I noticed that although the raw dataset from Xena includes 4168 genes that are also in the Hallmark set, LUAD/rna_clean.csv has only 1022 of these. Could you share any insight into why the additional 3146 genes were removed?
Moreover, some values differ between the data downloaded from Xena and the rna_clean.csv file that you provide. For example, the gene values for subjects TCGA-44-2657 and TCGA-38-4625 differ between the two files, while other subjects, such as TCGA-05-4398 and TCGA-38-4625, have the same values. Could you explain why these adjustments were made?
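The comparison behind these observations can be sketched roughly as follows. This is an illustrative check, not the exact script used for the numbers above. It assumes both tables are samples-by-genes DataFrames whose sample IDs have already been brought to the same format. The function name `compare_tables` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def compare_tables(df_xena, df_clean, atol=1e-6):
    """Report gene overlap and the shared samples whose values differ."""
    shared_genes = df_xena.columns.intersection(df_clean.columns)
    shared_samples = df_xena.index.intersection(df_clean.index)
    a = df_xena.loc[shared_samples, shared_genes].to_numpy()
    b = df_clean.loc[shared_samples, shared_genes].to_numpy()
    # A sample counts as "same" only if every shared gene matches within tolerance.
    same = np.isclose(a, b, atol=atol, equal_nan=True).all(axis=1)
    return {
        "n_shared_genes": len(shared_genes),
        "only_in_xena": sorted(set(df_xena.columns) - set(df_clean.columns)),
        "only_in_clean": sorted(set(df_clean.columns) - set(df_xena.columns)),
        "differing_samples": [s for s, ok in zip(shared_samples, same) if not ok],
    }


# Toy data (hypothetical): S1 matches on shared genes, S2 differs on G2.
df_xena = pd.DataFrame(
    [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]], index=["S1", "S2"], columns=["G1", "G2", "G3"]
)
df_clean = pd.DataFrame(
    [[0.1, 0.2, 9.0], [1.0, 2.5, 9.0]], index=["S1", "S2"], columns=["G1", "G2", "G4"]
)
report = compare_tables(df_xena, df_clean)
print(report)
```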

Regarding the TCGA-BRCA cohort:

Am I correct in assuming that BRCA/rna_clean.csv was derived from the same pan-cancer normalized data source? (https://tcga-xena-hub.s3.us-east-1.amazonaws.com/download/TCGA.BRCA.sampleMap%2FHiSeqV2_PANCAN.gz)
I noticed that the raw dataset from Xena lacks 73 genes present in both the Hallmark gene sets and rna_clean.csv. Some examples of such genes are ACKR1, ACKR3, ADGRA2, ADGRE1, ADGRG1. These genes are missing from the raw data, but appear with non-zero values in your BRCA/rna_clean.csv. Could you share how these values were obtained?
I also observed that the per-sample mean and standard deviation of gene expression in rna_clean.csv are around 3 and 4, whereas the downloaded data has a mean and standard deviation of around 0 and 1. Could you clarify whether additional processing steps were applied to obtain these values, and why? Also, why were these adjustments made for BRCA but not for LUAD?
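For clarity on how the per-sample statistics are meant here, a small sketch: it assumes a samples-by-genes DataFrame and computes the mean and standard deviation across genes for each sample. The helper name `per_sample_stats` and the toy values are hypothetical. Note that pandas uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd


def per_sample_stats(df):
    """Mean and standard deviation of expression across genes, per sample (row)."""
    return pd.DataFrame({"mean": df.mean(axis=1), "std": df.std(axis=1)})


# Toy data (hypothetical): S1 is centered at 0, S2 is shifted and more spread out.
df_toy = pd.DataFrame(
    [[-1.0, 0.0, 1.0], [1.0, 3.0, 5.0]],
    index=["S1", "S2"],
    columns=["G1", "G2", "G3"],
)
stats = per_sample_stats(df_toy)
print(stats)
```

Running the same function on the filtered Xena data and on rna_clean.csv, restricted to shared genes, would make the shift in distribution directly comparable.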

Regarding the TCGA-BLCA cohort:

Here, I have the same questions as for the BRCA cohort. I assume that you use the pan-cancer normalized dataset from Xena to obtain your BLCA/rna_clean.csv (https://tcga-xena-hub.s3.us-east-1.amazonaws.com/download/TCGA.BLCA.sampleMap%2FHiSeqV2_PANCAN.gz). This raw dataset from Xena is missing some genes that are in the rna_clean.csv file you provide. The missing genes are the same as the ones missing from the BRCA dataset.
Again, the values of the genes present in both files differ for all samples, and the distributions show the same shift as in the BRCA dataset.

I’d be very grateful for any insights you can share to help me understand the preprocessing steps I may have missed. Thank you for your time and for any guidance you can provide.
