added all files

IDSIA · Dec 4, 2024 · 6119a2e
commit 6119a2e

Showing 9 changed files with 59,752 additions and 0 deletions.
diff --git a/DKD_clin.rds b/DKD_clin.rds
diff --git a/DKD_tpm.rds b/DKD_tpm.rds
diff --git a/README.md b/README.md
@@ -0,0 +1,21 @@
+# CLIER
+
+This repository contains code and test data for the protocol **"Protocol for interpretable and context-specific single-cell informed deconvolution of bulk RNA-seq data"**, currently under review for STAR Protocols.
+
+## Detailed Description of Files
+
+### Code
+- **protocol_code.R**: Contains the lines of code included in the manuscript (excluding the processing from FASTQ to TPM).
+- **aux_functions.R**: Contains all the R functions necessary for executing the protocol.
+- **align_fastq.sh**: Automates the process of downloading FASTQ files and aligning paired-end RNA-Seq data using the STAR aligner.
+
+### Data
+- **kidney_atlas_matrix.rds**: Contains the single-cell signatures atlas built in "A transfer learning framework to elucidate the clinical relevance of altered proximal tubule cell states in kidney disease" (Legouis et al., 2024).
+- **kidney_atlas_info.xlsx**: Contains descriptions of the signatures included in the single-cell signatures atlas built in Legouis et al., 2024.
+- **DKD_tpm.rds**: Contains a processed version (TPM) of the dataset GSE142025, also used in Legouis et al., 2024.
+- **DKD_clin.rds**: Contains clinical information (fibrosis) regarding the dataset GSE142025.
+- **genelength.txt**: Contains gene lengths (to be used in data processing).
+
+## On the Execution
+
+The code in **protocol_code.R** can be fully executed using the test data provided in this repository. Users who want to skip the training phase (which takes approximately 9 hours) and test a pre-trained model can find the KCLIER model [here](https://drive.switch.ch/index.php/s/OpvMh1vGRgRmKKf), together with other intermediate files produced during execution. We share these files separately because they are too large to host on GitHub.
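Note on the TPM data: the README describes **DKD_tpm.rds** as TPM-processed and lists **genelength.txt** as an input to data processing, but the conversion itself is not part of this commit. For reference, the standard TPM definition is given below. This is an assumption about what the processing computes, not a statement of the authors' exact procedure. Here $c_i$ is the read count for gene $i$ and $\ell_i$ is its length:

$$
\mathrm{TPM}_i = 10^{6} \cdot \frac{c_i / \ell_i}{\sum_{j} c_j / \ell_j}
$$

Because the length-normalized counts are divided by their own sum, the unit of $\ell_i$ (bases or kilobases) cancels, and the values for each sample sum to $10^{6}$.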
diff --git a/align_fastq.sh b/align_fastq.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+
+# Define the array of URLs
+URL_LIST=(
+    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/031/SRR10691631/SRR10691631_1.fastq.gz"
+    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/031/SRR10691631/SRR10691631_2.fastq.gz"
+)
+
+# Define the FASTQ, output, and STAR genome directories (placeholders; edit before running)
+FASTQ_DIR=/fastq_folder
+OUTPUT_DIR=/output_folder
+STAR_GENOME=/yourstargenomefolder
+
+# Create the FASTQ and output directories if they don't exist
+mkdir -p "$FASTQ_DIR" "$OUTPUT_DIR"
+
+for URL in "${URL_LIST[@]}"; do
+    # Extract filename from URL
+    FILE_NAME=$(basename "$URL")
+
+    # Download the fastq file
+    curl -L "$URL" -o "$FASTQ_DIR/$FILE_NAME"
+
+    # Check if the download was successful
+
+    if [ $? -eq 0 ]; then
+        echo "Download of $FILE_NAME completed successfully."
+    else
+        echo "Error in downloading $FILE_NAME."
+    fi
+done
+
+for R1 in "${FASTQ_DIR}"/*_1.fastq.gz
+do
+    # Derive R2 file by replacing "_1.fastq.gz" with "_2.fastq.gz"
+    R2="${R1/_1.fastq.gz/_2.fastq.gz}"
+    # Extract the sample name (e.g., SRR10691631 from SRR10691631_1.fastq.gz)
+    sample_name=$(basename "${R1}" _1.fastq.gz)
+    # Define the output prefix
+    output_prefix="${OUTPUT_DIR}/${sample_name}."
+    # Run STAR alignment (ncpus is read from the environment; falls back to 1 thread if unset)
+    STAR --runThreadN "${ncpus:-1}" \
+        --genomeDir "${STAR_GENOME}" \
+        --outFileNamePrefix "${output_prefix}" \
+        --readFilesIn "${R1}" "${R2}" \
+        --readFilesCommand zcat \
+        --quantMode GeneCounts \
+        --twopassMode Basic
+done
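
Note on align_fastq.sh: the alignment loop assumes that every `*_1.fastq.gz` file has a matching `*_2.fastq.gz` mate, and the download loop reports errors without stopping the script. A pre-flight check along the following lines could catch a missing or empty mate before STAR is launched. This is a minimal sketch, not part of the commit. The function name `find_unpaired_samples` is hypothetical, and it assumes the same `_1`/`_2` naming convention used by the script.

```shell
#!/bin/bash

# Hypothetical helper (not part of the commit): print the sample name of every
# *_1.fastq.gz file in a directory whose *_2.fastq.gz mate is missing or empty.
# Returns 0 when all samples are paired, 1 otherwise.
find_unpaired_samples() {
    local fastq_dir="$1"
    local status=0
    local r1 r2
    for r1 in "$fastq_dir"/*_1.fastq.gz; do
        # An unmatched glob stays literal; skip it when no R1 files exist
        [ -e "$r1" ] || continue
        # Build the expected mate path by swapping the "_1" suffix for "_2"
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -s "$r2" ]; then
            basename "$r1" _1.fastq.gz
            status=1
        fi
    done
    return "$status"
}
```

One way to use it is to call `find_unpaired_samples "$FASTQ_DIR" || exit 1` between the download loop and the alignment loop, so that the script stops with a list of incomplete samples instead of passing a nonexistent file to STAR.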