added all files
dmalpetti authored Dec 4, 2024
0 parents commit 6119a2e
Showing 9 changed files with 59,752 additions and 0 deletions.
Binary file added DKD_clin.rds
Binary file not shown.
Binary file added DKD_tpm.rds
Binary file not shown.
21 changes: 21 additions & 0 deletions README.md
@@ -0,0 +1,21 @@
# CLIER

This repository contains code and test data for the protocol **"Protocol for interpretable and context-specific single-cell informed deconvolution of bulk RNA-seq data"**, currently under review for STAR Protocols.

## Detailed Description of Files

### Code
- **protocol_code.R**: Contains the lines of code included in the manuscript (excluding the processing from FASTQ to TPM).
- **aux_functions.R**: Contains all the R functions necessary for executing the protocol.
- **align_fastq.sh**: Automates the process of downloading FASTQ files and aligning paired-end RNA-Seq data using the STAR aligner.

### Data
- **kidney_atlas_matrix.rds**: Contains the single-cell signatures atlas built in "A transfer learning framework to elucidate the clinical relevance of altered proximal tubule cell states in kidney disease" (Legouis et al., 2024).
- **kidney_atlas_info.xlsx**: Contains descriptions of the signatures included in the single-cell signatures atlas built in Legouis et al., 2024.
- **DKD_tpm.rds**: Contains a processed version (TPM) of the dataset GSE142025, also used in Legouis et al., 2024.
- **DKD_clin.rds**: Contains clinical information (fibrosis) regarding the dataset GSE142025.
- **genelength.txt**: Contains gene lengths (to be used in data processing).

## On the Execution

The code in **protocol_code.R** can be fully executed using the test data provided in this repository. Users who want to skip the training phase (which takes approximately 9 hours) and test a pre-trained model can find the KCLIER model [here](https://drive.switch.ch/index.php/s/OpvMh1vGRgRmKKf), together with other intermediate files produced during the execution. We share these files separately because they are too large to fit on GitHub.
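
Before launching a run, it can help to confirm that the files listed above are present. The following is a minimal sketch, not part of the protocol: `check_inputs` is a hypothetical helper, and it assumes that all listed code and data files sit in a single directory and that **protocol_code.R** is run from there.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repository): check that the code and
# data files listed in this README exist in the directory given as $1.
# Prints each missing file to stderr and returns 1 if any is missing.
check_inputs() {
    dir="${1:-.}"
    missing=0
    for f in protocol_code.R aux_functions.R \
             kidney_atlas_matrix.rds kidney_atlas_info.xlsx \
             DKD_tpm.rds DKD_clin.rds genelength.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "Missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use, assuming R is installed and the working directory is the repository root, is `check_inputs . && Rscript protocol_code.R`.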
49 changes: 49 additions & 0 deletions align_fastq.sh
@@ -0,0 +1,49 @@
#!/bin/bash

# Define the array of URLs
URL_LIST=(
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/031/SRR10691631/SRR10691631_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/031/SRR10691631/SRR10691631_2.fastq.gz"
)

# Define the FASTQ, output, and STAR genome directories
FASTQ_DIR=/fastq_folder
OUTPUT_DIR=/output_folder
STAR_GENOME=/yourstargenomefolder

# Create the FASTQ and output directories if they don't exist
mkdir -p "$FASTQ_DIR" "$OUTPUT_DIR"

for URL in "${URL_LIST[@]}"; do
    # Extract filename from URL
    FILE_NAME=$(basename "$URL")

    # Download the fastq file and check whether the download was successful
    if curl -L "$URL" -o "$FASTQ_DIR/$FILE_NAME"; then
        echo "Download of $FILE_NAME completed successfully."
    else
        echo "Error in downloading $FILE_NAME."
    fi
done

for R1 in "${FASTQ_DIR}"/*_1.fastq.gz
do
    # Derive R2 file by replacing the "_1.fastq.gz" suffix with "_2.fastq.gz"
    R2="${R1%_1.fastq.gz}_2.fastq.gz"
    # Extract the sample name (e.g., SRR10691631 from SRR10691631_1.fastq.gz)
    sample_name=$(basename "${R1}" _1.fastq.gz)
    # Define the output prefix
    output_prefix="${OUTPUT_DIR}/${sample_name}."
    # Run STAR alignment. The thread count is taken from $ncpus, which this
    # script never sets; fall back to 1 thread if it is unset in the environment.
    STAR --runThreadN "${ncpus:-1}" \
        --genomeDir "${STAR_GENOME}" \
        --outFileNamePrefix "${output_prefix}" \
        --readFilesIn "${R1}" "${R2}" \
        --readFilesCommand zcat \
        --quantMode GeneCounts \
        --twopassMode Basic
done
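
With `--quantMode GeneCounts`, STAR writes a `ReadsPerGene.out.tab` file for each sample (here `<sample_name>.ReadsPerGene.out.tab`, given the output prefix above). Its first four rows are summary counters (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`), and each remaining row holds a gene ID followed by the unstranded, first-strand, and second-strand counts. The sketch below shows one way to pull a single count column out of such a file. `extract_gene_counts` is a hypothetical helper that is not part of this repository, and the right column depends on the library's strandedness, which is not stated here.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repository): print "gene_id<TAB>count"
# for one STAR ReadsPerGene.out.tab file.
#   $1: path to the ReadsPerGene.out.tab file
#   $2: count column to keep (2 = unstranded [default], 3 = strand 1, 4 = strand 2)
extract_gene_counts() {
    counts_file="$1"
    column="${2:-2}"
    # Skip the four N_* summary rows at the top, then print gene ID and count
    awk -F'\t' -v col="$column" 'NR > 4 { print $1 "\t" $col }' "$counts_file"
}
```

For example, `extract_gene_counts /output_folder/SRR10691631.ReadsPerGene.out.tab > SRR10691631.counts.tsv` would save the unstranded counts for the sample aligned above. Converting these counts to TPM is a separate step that this sketch does not cover.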