Clustering by variable and not by UMAP after RunHarmony #266

RRingquist · 2025-01-27T19:13:42Z

Hello, I am attempting to integrate a dataset of 22 samples (~1M cells total), and I specifically would like to regress out two covariates (GCResponse and Subject). I have attempted this two ways, with separate issues arising from each. I followed the standard Seurat pipeline for QC on the merged object and found that the two covariates in question showed significant batch effects:

```r
# Run QC on the Merged Seurat
Combined[["percent.mt"]] <- PercentageFeatureSet(Combined, pattern = "^MT-")
VlnPlot(Combined, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0)
Combined <- subset(Combined, subset = nFeature_RNA >= 250 & nCount_RNA >= 500 & percent.mt < 15) # Need to adjust these parameters based on violin plot
Combined <- NormalizeData(Combined)

# Identify the most variable genes in the merged object
Combined <- FindVariableFeatures(Combined,
                                 selection.method = "vst",
                                 nfeatures = 2000,
                                 verbose = FALSE)

# Scale the counts
Combined <- ScaleData(Combined)

# Perform PCA
Combined <- RunPCA(Combined, npcs = 50)

# Visualize the unintegrated dataset
Combined <- FindNeighbors(Combined, dims = 1:20, reduction = "pca")
Combined <- FindClusters(Combined, resolution = 0.3, cluster.name = "unintegrated_clusters")
Combined <- RunUMAP(Combined, dims = 1:20, reduction = "pca", reduction.name = "umap.unintegrated")
DimPlot(Combined, reduction = "umap.unintegrated", group.by = 'Subject')
```

The unintegrated UMAP shows significant batch effects arising from the 'Subject' covariate (and in turn 'GCResponse'):

I then attempted to use harmony integration via IntegrateLayers:

```r
# Integrate the dataset
library(parallel)
detectCores() # 24
num_cores = 30
combined.hy <- IntegrateLayers(
  object = Combined, method = HarmonyIntegration,
  orig.reduction = "pca", new.reduction = "harmony",
  theta = 4, lambda = 0.5,
  max.iter.harmony = 20L, max.iter.cluster = 50L,
  verbose = TRUE, num.threads = num_cores)

combined.hy <- FindNeighbors(combined.hy, dims = 1:30)
combined.hy <- FindClusters(combined.hy, resolution = 0.3)
combined.hy <- RunUMAP(combined.hy, reduction = "harmony", dims = 1:30, reduction.name = "umap.harmony")
DimPlot(combined.hy, reduction = "umap.harmony", split.by = 'GCResponse', label.size = 2)
```

This somewhat mitigated the effects of the 'Subject' covariate, but still resulted in fairly substantial 'GCResponse' effects (especially between the Healthy and High Responder groups, which is unexpected biologically):

I then attempted to instead run harmony integration directly via RunHarmony:

```r
# Integrate the dataset
combined.Hy <- RunHarmony(Combined,
                          group.by.vars = c("Subject", "GCResponse"),
                          reduction = "pca", assay.use = "SCT", reduction.save = "harmony",
                          dims = 1:30, theta = c(4,4), lambda = 0.5, max_iter = 50)

combined.Hy <- FindNeighbors(combined.Hy, dims = 1:30)
combined.Hy <- FindClusters(combined.Hy, resolution = 0.3)
combined.Hy <- RunUMAP(combined.Hy, reduction = "harmony", dims = 1:30, reduction.name = "umap.harmony")
DimPlot(combined.Hy, reduction = "umap.harmony", split.by = 'GCResponse', label.size = 2)
```

After some tuning of the parameters, I achieved a more uniform UMAP with minimal effects from Subject or GCResponse. My issue here is that the cluster assignments now seem to track GCResponse rather than the structure visible in the UMAP space.

I am at a loss for how to correct for this, as adjusting the dims and resolution parameters in FindNeighbors and FindClusters, respectively, has not helped. Any advice would be highly appreciated, thank you!

pati-ni · 2025-01-27T19:42:47Z

Could you specify the reduction layer for the neighbors?

```r
combined.Hy <- FindNeighbors(combined.Hy, dims = 1:30)
combined.Hy <- FindClusters(combined.Hy, resolution = 0.3)
```

Also, IntegrateLayers is a leaky abstraction that does not properly handle multiple covariates. Use RunHarmony() instead.
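
For illustration, a minimal sketch of what specifying the reduction could look like. FindNeighbors() uses reduction = "pca" by default, so in the snippets above the neighbor graph (and therefore the clusters) would come from the uncorrected PCA while the UMAP comes from the Harmony embedding. The helper name cluster_on_reduction and the derived names "harmony_clusters" and "umap.harmony" are assumptions made for this example, not part of Seurat or harmony, and the cluster.name argument assumes a Seurat v5 installation as used in the question.

```r
# Sketch: build the neighbor graph, clusters, and UMAP from the same
# corrected embedding. Assumes `obj` is a Seurat (v5) object that already
# holds a reduction named "harmony", e.g. saved by RunHarmony().
# `cluster_on_reduction` is a hypothetical helper, not a library function.
cluster_on_reduction <- function(obj, reduction = "harmony", dims = 1:30,
                                 resolution = 0.3) {
  # Name the reduction explicitly; otherwise FindNeighbors() falls back to "pca".
  obj <- Seurat::FindNeighbors(obj, reduction = reduction, dims = dims)
  # Store the clusters under a reduction-specific metadata column.
  obj <- Seurat::FindClusters(obj, resolution = resolution,
                              cluster.name = paste0(reduction, "_clusters"))
  # Embed the same dimensions that were used for the neighbor graph.
  obj <- Seurat::RunUMAP(obj, reduction = reduction, dims = dims,
                         reduction.name = paste0("umap.", reduction))
  obj
}

# Usage with the object from the question (not run here):
# combined.Hy <- cluster_on_reduction(combined.Hy)
# Seurat::DimPlot(combined.Hy, reduction = "umap.harmony",
#                 group.by = "harmony_clusters", split.by = "GCResponse")
```

Keeping the reduction and dims in one place makes it harder for the clustering and the UMAP to drift onto different embeddings when parameters are tuned.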
