Advanced integration tutorial
In this tutorial, we explore the integration of two multi-sample datasets using the harmonypy
and scanorama
libraries. The first dataset originates from a colorectal cancer study, comprising 12 tissue sections distributed across two biological replicates for each of the six samples. The details of this dataset are documented in Valdeolivas et al. 2023. The second dataset featured in this tutorial is derived from a mouse
renal ischemia study, encompassing five samples collected at various time points, as outlined by Eryn E. et al. 2022.
[1]:
import scanorama
import numpy as np
import scanpy as sc
from glob import glob
import matplotlib.pyplot as plt
import chrysalis as ch
I. Colorectal Cancer
For this tutorial we are working with h5ad
files that can be downloaded using this link.
[2]:
data_dir = '/mnt/c/Users/demeter_turos/PycharmProjects/chrysalis/data/crc_samples'
1.1 Preprocess samples
We start by reading the samples, performing QC and normalization, detecting spatially variable genes, and then saving the processed samples.
[3]:
samples = glob(data_dir + '/*.h5ad')
for sample in samples:
ad = sc.read_h5ad(sample)
# replace .uns dictionary key with the `sample_id` column to make it easier reach the H&E images
spatial_key = list(ad.uns['spatial'].keys())[0]
ad.uns['spatial'][ad.obs['sample_id'][0]] = ad.uns['spatial'][spatial_key]
del ad.uns['spatial'][spatial_key]
# we can optionally use ENSEMBL gene IDs instead of gene symbols to deal with non-unique gene identifiers
# this circumvents issues during integration
ad.var['gene_symbols'] = ad.var_names
ad.var_names = ad.var['gene_ids']
assert len(np.unique(ad.var_names)) == len(ad.var)
# ad.var_names_make_unique()
# filter low quality spots
sc.pp.calculate_qc_metrics(ad, inplace=True)
sc.pp.filter_genes(ad, min_cells=10)
sc.pp.filter_cells(ad, min_counts=2000)
sc.pp.filter_cells(ad, min_genes=10)
# normalize counts
sc.pp.normalize_total(ad, inplace=True, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(ad)
# detect spatially variable genes
ch.detect_svgs(ad, min_morans=0.05, min_spots=0.05)
# save samples
ad.write_h5ad(data_dir + f'/ch_{ad.obs["sample_id"][0]}.h5ad')
Calculating SVGs: 100%|██████████| 12285/12285 [01:25<00:00, 143.85it/s]
Calculating SVGs: 100%|██████████| 11955/11955 [01:30<00:00, 131.57it/s]
Calculating SVGs: 100%|██████████| 12191/12191 [01:27<00:00, 139.06it/s]
Calculating SVGs: 100%|██████████| 12495/12495 [01:08<00:00, 183.03it/s]
Calculating SVGs: 100%|██████████| 12966/12966 [00:47<00:00, 272.88it/s]
Calculating SVGs: 100%|██████████| 11224/11224 [00:58<00:00, 193.35it/s]
Calculating SVGs: 100%|██████████| 12022/12022 [01:03<00:00, 190.16it/s]
Calculating SVGs: 100%|██████████| 12346/12346 [00:19<00:00, 628.97it/s]
Calculating SVGs: 100%|██████████| 11769/11769 [00:56<00:00, 206.85it/s]
Calculating SVGs: 100%|██████████| 13264/13264 [01:08<00:00, 194.32it/s]
Calculating SVGs: 100%|██████████| 12280/12280 [00:18<00:00, 675.07it/s]
Calculating SVGs: 100%|██████████| 11580/11580 [00:36<00:00, 315.34it/s]
1.2 Create an integrated AnnData object
Next we read back the saved samples and take a look at the H&E.
[4]:
ch_samples = glob(data_dir + '/ch_*.h5ad')
# read AnnData objects into a list
adatas = [sc.read_h5ad(ad) for ad in ch_samples]
# get the sample names stored from the 'sample_id' column
sample_names = [ad.obs["sample_id"][0] for ad in adatas]
# look at the H&Es
plt.rcParams['figure.dpi'] = 60
fig, ax = plt.subplots(3, 4, figsize=(4 * 4, 3 * 4))
ax = ax.flatten()
for a in ax:
a.axis('off')
for idx, ad in enumerate(adatas):
sc.pl.spatial(ad, color=None, size=1.5, alpha=0,
ax=ax[idx], show=False, cmap='viridis')
ax[idx].set_title(f'{ad.obs["sample_id"][0]}')
plt.tight_layout()
plt.show()
Next, we can use ch.plot_svg_matrix
to visualize the SVG overlap matrix. By examining the pairwise overlaps between spatially variable genes across samples, we can gain insights into the expected variability in tissue compartments. This analysis also enables the identification of significant overlaps, particularly between biological replicates.
[5]:
# plot spatially variable gene overlaps
plt.rcParams['figure.dpi'] = 80
ch.plot_svg_matrix(adatas, figsize=(8, 7), obs_name='sample_id', cluster=True)
plt.show()
To initiate the integration process, simply invoke ch.integrate_adatas
on the list of AnnData objects. This function concatenates the samples into a single AnnData instance, creating an outer-joined SVG list.
[6]:
# concatenate AnnData objects
adata = ch.integrate_adatas(adatas, sample_names, sample_col='ch_sample_id')
print(adata)
AnnData object with n_obs × n_vars = 18146 × 12438
obs: 'in_tissue', 'array_row', 'array_col', 'ground_truth', 'sample_id', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts', 'n_genes', 'ch_sample_id'
var: 'gene_ids', 'feature_types', 'genome', 'gene_symbols', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', "Moran's I", 'spatially_variable'
uns: 'spatial'
obsm: 'spatial'
varm: 'spatially_variable', "Moran's I"
We perform PCA here without integration and save the results to 'chr_X_pca_uncorrected'
in .obsm
in order to compare the results later.
[7]:
# replace ENSEMBL IDs with the gene symbols and make them unique
adata.var_names = list(adata.var['gene_symbols'])
adata.var_names_make_unique()
# run PCA
ch.pca(adata, n_pcs=50)
# save uncorrected PCs to compare the results later
adata.obsm['chr_X_pca_uncorrected'] = adata.obsm['chr_X_pca'].copy()
1.3 Integration with Harmony
We will use ch.harmony_integration
to run harmonypy
on the PCA embeddings. This will overwrite adata.obsm['chr_X_pca']
unless specified otherwise.
[8]:
# perform harmony integration
ch.harmony_integration(adata, 'ch_sample_id', random_state=42, block_size=0.05)
# set 'sample_id' as categorical variable for plotting
# adata.obs['ch_sample_id'] = adata.obs['ch_sample_id'].astype('category')
plt.rcParams['figure.dpi'] = 100
ch.plot_explained_variance(adata)
plt.show()
2024-01-08 09:33:03,056 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-01-08 09:33:10,982 - harmonypy - INFO - sklearn.KMeans initialization complete.
2024-01-08 09:33:11,079 - harmonypy - INFO - Iteration 1 of 10
2024-01-08 09:33:15,313 - harmonypy - INFO - Iteration 2 of 10
2024-01-08 09:33:23,389 - harmonypy - INFO - Iteration 3 of 10
2024-01-08 09:33:30,830 - harmonypy - INFO - Iteration 4 of 10
2024-01-08 09:33:34,859 - harmonypy - INFO - Converged after 4 iterations
1.4 Integration with Scanorama
To run scanorama, we can utilize the correct_scanpy
function. This function operates on a list of AnnData objects and produces a list containing the corrected expression matrices, so we perform this step prior to running ch.integrate_adatas
. Lastly, we copy the PCA embedding to the main AnnData object (adata
).
[9]:
# dropping the 'ch_sample_id' as it was already added by running 'ch.integrate_adatas' in the previous cell.
for ad in adatas:
ad.obs.drop(columns=['ch_sample_id'], inplace=True)
scr_adata = scanorama.correct_scanpy(adatas, return_dimred=True)
scr_adata = ch.integrate_adatas(scr_adata, sample_col='ch_sample_id')
# replace ENSEMBL IDs with the gene symbols and make them unique
scr_adata.var_names = list(scr_adata.var['gene_symbols'])
scr_adata.var_names_make_unique()
# run PCA
ch.pca(scr_adata, n_pcs=50)
# save uncorrected PCs
adata.obsm['chr_X_pca_scanorama'] = scr_adata.obsm['chr_X_pca'].copy()
Found 12438 genes among all datasets
[[0.00000000e+00 8.73689010e-01 0.00000000e+00 0.00000000e+00
8.44594595e-03 4.88102502e-03 9.25925926e-03 0.00000000e+00
7.58533502e-03 1.45918833e-02 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 9.16590284e-04 0.00000000e+00
2.21283784e-01 1.95241001e-02 3.76639865e-02 0.00000000e+00
3.79266751e-03 3.62017804e-02 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 9.09504838e-01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
2.10586227e-02 1.22025625e-03 1.70745589e-03 0.00000000e+00
6.32111252e-04 5.93471810e-04 7.31707317e-02 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00061013e-01 1.05574324e-01 8.71391076e-01
2.78128951e-02 3.04054054e-02 5.18292683e-02 2.95275591e-03]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 7.87065284e-02 1.09823063e-02
9.05815424e-01 5.34124629e-03 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 5.24934383e-02
1.95954488e-02 7.98219585e-01 6.09756098e-03 4.13943355e-02]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
5.24934383e-03 5.24934383e-02 4.57317073e-02 3.14960630e-02]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.10619469e-01 0.00000000e+00 2.95275591e-03]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 3.26219512e-01 5.01968504e-02]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 9.48170732e-01]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
Processing datasets (10, 11)
Processing datasets (2, 3)
Processing datasets (5, 8)
Processing datasets (0, 1)
Processing datasets (4, 7)
Processing datasets (6, 9)
Processing datasets (9, 10)
Processing datasets (1, 4)
Processing datasets (8, 9)
Processing datasets (4, 6)
Processing datasets (4, 5)
1.6 Visualization
Now we can run ch.aa
and ch.plot_samples
to visualize the tissue compartments and inspect the results of the integration.
[10]:
# uncorrected
ch.aa(adata, n_pcs=20, n_archetypes=10, pca_key='chr_X_pca_uncorrected')
plt.rcParams['figure.dpi'] = 60
ch.plot_samples(adata, rows=3, cols=4, dim=10, suptitle='Uncorrected', sample_col='ch_sample_id', show_title=True)
[12]:
# harmony integration
ch.aa(adata, n_pcs=20, n_archetypes=10)
plt.rcParams['figure.dpi'] = 60
ch.plot_samples(adata, rows=3, cols=4, dim=10, suptitle='Harmony', sample_col='ch_sample_id', show_title=True)
[13]:
# scanorama integration
ch.aa(adata, n_pcs=20, n_archetypes=10, pca_key='chr_X_pca_scanorama')
plt.rcParams['figure.dpi'] = 60
ch.plot_samples(adata, rows=3, cols=4, dim=10, suptitle='Scanorama', sample_col='ch_sample_id', show_title=True)
1.7 PCA plot
We can also take a look at the PCA embeddings to inspect the alignment.
[14]:
plt.rcParams['figure.dpi'] = 80
fig, axs = plt.subplots(1, 3, figsize=(12, 4))
axs[0].scatter(x=adata.obsm['chr_X_pca_uncorrected'][:, 0], y=adata.obsm['chr_X_pca_uncorrected'][:, 1],
rasterized=True, s=4, c=adata.obs['ch_sample_id'].cat.codes, cmap='tab20')
axs[0].set_title('Uncorrected')
axs[0].set_xlabel('PC 0')
axs[0].set_ylabel('PC 1')
axs[1].scatter(x=adata.obsm['chr_X_pca'][:, 0], y=adata.obsm['chr_X_pca'][:, 1],
rasterized=True, s=4, c=adata.obs['ch_sample_id'].cat.codes, cmap='tab20')
axs[1].set_title('Harmony')
axs[1].set_xlabel('PC 0')
axs[1].set_ylabel('PC 1')
axs[2].scatter(x=adata.obsm['chr_X_pca_scanorama'][:, 0], y=adata.obsm['chr_X_pca_scanorama'][:, 1],
rasterized=True, s=4, c=adata.obs['ch_sample_id'].cat.codes, cmap='tab20')
axs[2].set_title('Scanorama')
axs[2].set_xlabel('PC 0')
axs[2].set_ylabel('PC 1')
plt.tight_layout()
plt.show()
II. Renal Ischemia
In this section we integrate the mouse renal ischemia dataset. Input files can be downloaded from here.
[15]:
data_dir = '/mnt/c/Users/demeter_turos/PycharmProjects/chrysalis/data/renal_ischemia'
[16]:
samples = glob(data_dir + '/*.h5ad')
for sample in samples:
ad = sc.read_h5ad(sample)
# replace .uns dictionary key with the `sample_id` column to make it easier reach the H&E images
spatial_key = list(ad.uns['spatial'].keys())[0]
ad.uns['spatial'][ad.obs['sample_id'][0]] = ad.uns['spatial'][spatial_key]
del ad.uns['spatial'][spatial_key]
# we can optionally use ENSEMBL gene IDs instead of gene symbols to deal with non-unique gene identifiers
# this circumvents issues during integration
ad.var['gene_symbols'] = ad.var_names
ad.var_names = ad.var['gene_ids']
assert len(np.unique(ad.var_names)) == len(ad.var)
# ad.var_names_make_unique()
# filter low quality spots
sc.pp.calculate_qc_metrics(ad, inplace=True)
sc.pp.filter_genes(ad, min_cells=10)
sc.pp.filter_cells(ad, min_counts=2000)
sc.pp.filter_cells(ad, min_genes=500)
# normalize counts
sc.pp.normalize_total(ad, inplace=True, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(ad)
# detect spatially variable genes
ch.detect_svgs(ad, min_morans=0.05, min_spots=0.05)
# save samples
ad.write_h5ad(data_dir + f'/ch_{ad.obs["sample_id"][0]}.h5ad')
Calculating SVGs: 100%|██████████| 7325/7325 [00:42<00:00, 173.06it/s]
Calculating SVGs: 100%|██████████| 9705/9705 [00:57<00:00, 168.30it/s]
Calculating SVGs: 100%|██████████| 11019/11019 [01:20<00:00, 136.71it/s]
Calculating SVGs: 100%|██████████| 11283/11283 [00:57<00:00, 196.33it/s]
Calculating SVGs: 100%|██████████| 9958/9958 [00:50<00:00, 197.38it/s]
[17]:
ch_samples = glob(data_dir + '/ch_*.h5ad')
# read AnnData objects into a list
adatas = [sc.read_h5ad(ad) for ad in ch_samples]
# get the sample names stored from the 'sample_id' column
sample_names = [ad.obs["sample_id"][0] for ad in adatas]
# look at the H&Es
plt.rcParams['figure.dpi'] = 60
fig, ax = plt.subplots(2, 3, figsize=(3 * 4, 2 * 4))
ax = ax.flatten()
for a in ax:
a.axis('off')
for idx, ad in enumerate(adatas):
sc.pl.spatial(ad, color=None, size=1.5, alpha=0,
ax=ax[idx], show=False, cmap='viridis')
ax[idx].set_title(f'{ad.obs["sample_id"][0]}')
plt.tight_layout()
plt.show()
[18]:
# plot spatially variable gene overlaps
plt.rcParams['figure.dpi'] = 80
ch.plot_svg_matrix(adatas, figsize=(8, 7), obs_name='sample_id', cluster=True)
plt.show()
[19]:
# concatenate AnnData objects
adata = ch.integrate_adatas(adatas, sample_names, sample_col='ch_sample_id')
print(adata)
AnnData object with n_obs × n_vars = 8771 × 12217
obs: 'in_tissue', 'array_row', 'array_col', 'sample_id', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts', 'n_genes', 'ch_sample_id'
var: 'gene_ids', 'feature_types', 'genome', 'gene_symbols', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', "Moran's I", 'spatially_variable'
uns: 'spatial'
obsm: 'spatial'
varm: 'spatially_variable', "Moran's I"
[20]:
# replace ENSEMBL IDs with the gene symbols and make them unique
adata.var_names = list(adata.var['gene_symbols'])
adata.var_names_make_unique()
# run PCA
ch.pca(adata, n_pcs=50)
# save uncorrected PCs to compare the results later
adata.obsm['chr_X_pca_uncorrected'] = adata.obsm['chr_X_pca'].copy()
[21]:
# harmony integration
ch.harmony_integration(adata, 'ch_sample_id', random_state=42, block_size=0.05)
# set 'sample_id' as categorical variable for plotting
# adata.obs['ch_sample_id'] = adata.obs['ch_sample_id'].astype('category')
plt.rcParams['figure.dpi'] = 100
ch.plot_explained_variance(adata)
plt.show()
2024-01-08 09:50:39,647 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-01-08 09:50:43,533 - harmonypy - INFO - sklearn.KMeans initialization complete.
2024-01-08 09:50:43,580 - harmonypy - INFO - Iteration 1 of 10
2024-01-08 09:50:45,973 - harmonypy - INFO - Iteration 2 of 10
2024-01-08 09:50:47,773 - harmonypy - INFO - Iteration 3 of 10
2024-01-08 09:50:51,101 - harmonypy - INFO - Iteration 4 of 10
2024-01-08 09:50:53,032 - harmonypy - INFO - Converged after 4 iterations
[22]:
# dropping the 'ch_sample_id' as it was already added by running 'ch.integrate_adatas' in the previous cell.
for ad in adatas:
ad.obs.drop(columns=['ch_sample_id'], inplace=True)
# scanorama integration
scr_adata = scanorama.correct_scanpy(adatas, return_dimred=True)
scr_adata = ch.integrate_adatas(scr_adata, sample_col='ch_sample_id')
# replace ENSEMBL IDs with the gene symbols and make them unique
scr_adata.var_names = list(scr_adata.var['gene_symbols'])
scr_adata.var_names_make_unique()
# run PCA
ch.pca(scr_adata, n_pcs=50)
# save uncorrected PCs
adata.obsm['chr_X_pca_scanorama'] = scr_adata.obsm['chr_X_pca'].copy()
Found 12217 genes among all datasets
[[0. 0.38012959 0.07921999 0.06276661 0.06827048]
[0. 0. 0.46652268 0.05714286 0.01170351]
[0. 0. 0. 0.49142857 0.11768531]
[0. 0. 0. 0. 0.71846554]
[0. 0. 0. 0. 0. ]]
Processing datasets (3, 4)
Processing datasets (2, 3)
Processing datasets (1, 2)
Processing datasets (0, 1)
Processing datasets (2, 4)
[23]:
# uncorrected
ch.aa(adata, n_pcs=20, n_archetypes=10, pca_key='chr_X_pca_uncorrected')
plt.rcParams['figure.dpi'] = 60
ch.plot_samples(adata, rows=2, cols=3, dim=10, suptitle='Uncorrected', sample_col='ch_sample_id', show_title=True, rotation=30)
[24]:
# harmony integration
ch.aa(adata, n_pcs=20, n_archetypes=10)
plt.rcParams['figure.dpi'] = 60
ch.plot_samples(adata, rows=2, cols=3, dim=10, suptitle='Harmony', sample_col='ch_sample_id', show_title=True, rotation=30)
[25]:
# scanorama integration
ch.aa(adata, n_pcs=20, n_archetypes=10, pca_key='chr_X_pca_scanorama')
plt.rcParams['figure.dpi'] = 60
ch.plot_samples(adata, rows=2, cols=3, dim=10, suptitle='Scanorama', sample_col='ch_sample_id', show_title=True, rotation=30)
[26]:
plt.rcParams['figure.dpi'] = 80
fig, axs = plt.subplots(1, 3, figsize=(12, 4))
axs[0].scatter(x=adata.obsm['chr_X_pca_uncorrected'][:, 0], y=adata.obsm['chr_X_pca_uncorrected'][:, 1],
rasterized=True, s=4, c=adata.obs['ch_sample_id'].cat.codes, cmap='tab10')
axs[0].set_title('Uncorrected')
axs[0].set_xlabel('PC 0')
axs[0].set_ylabel('PC 1')
axs[1].scatter(x=adata.obsm['chr_X_pca'][:, 0], y=adata.obsm['chr_X_pca'][:, 1],
rasterized=True, s=4, c=adata.obs['ch_sample_id'].cat.codes, cmap='tab10')
axs[1].set_title('Harmony')
axs[1].set_xlabel('PC 0')
axs[1].set_ylabel('PC 1')
axs[2].scatter(x=adata.obsm['chr_X_pca_scanorama'][:, 0], y=adata.obsm['chr_X_pca_scanorama'][:, 1],
rasterized=True, s=4, c=adata.obs['ch_sample_id'].cat.codes, cmap='tab10')
axs[2].set_title('Scanorama')
axs[2].set_xlabel('PC 0')
axs[2].set_ylabel('PC 1')
plt.tight_layout()
plt.show()