Reproducing the original dandelion method/paper
2025-01-13
vignette_reproduce_original.rmd
In this vignette, we will demonstrate how to perform TCR trajectory
analysis starting from data that has already been processed by
dandelion
in python. This is to demonstrate that the
original method/results in the dandelion
paper can be
reproduced.
Load data
First, we will load the demo data. This is a down-sampled dataset
from the Suo et al 2024
Nature Biotechnology paper. It contains 10,000 cells with the TCR
information and the dimensionality reduced data (scVI) that we need for
this tutorial. The gene expression matrix is not required for this
tutorial so it is not included in the demo data. We will show in a
separate tutorial how to start with data from
scRepertoire
.
data(sce_vdj)
We will set the seed so that the plots and results are consistent.
set.seed(100)
Filter the data
To begin, we will filter the data and extract the TCR information so that we can construct pseudobulks.
Because the colData
of this single-cell object is
populated with the TCR information from dandelion
in python
(through this method: dandelion
-> anndata
-> anndata2ri
, which essentially converts an
AnnData
object in python to
SingleCellExperiment
in R), we can directly use the
setupVdjPseudobulk
function to extract the TCR information
and construct the pseudobulks.
Here, we also need to specify the allowed_chain_status
to keep to the relevant contigs. Default for
allowed_chain_status
is NULL
, which will keep
all contigs. In the standard R workflow that starts from
scRepertoire
, we will assume that all the QC and filtering
has already been handled by scRepertoire
.
sce_vdj <- setupVdjPseudobulk(sce_vdj,
already.productive = FALSE,
allowed_chain_status = c("Single pair", "Extra pair", "Extra pair-exception", "Orphan VDJ", "Orphan VDJ-exception")
)
We can visualise the UMAP of the filtered data.
plotUMAP(sce_vdj, color_by = "anno_lvl_2_final_clean")
Milo object and neighbourhood graph construction
We will use miloR to create the pseudobulks based on the gene expression data. The goal is to construct a neighbourhood graph with many neighbors with which we can sample the representative neighbours to form the objects.
library(miloR)
traj_milo <- Milo(sce_vdj)
milo_object <- buildGraph(traj_milo, k = 50, d = 20, reduced.dim = "X_scvi")
milo_object <- makeNhoods(milo_object, reduced_dims = "X_scvi", d = 20)
Construct pseudobulked VDJ feature space
Next, we will construct the pseudobulked VDJ feature space using the neighbourhood graph constructed above. We will also run PCA on the pseudobulked VDJ feature space.
pb.milo <- vdjPseudobulk(milo_object, col_to_take = "anno_lvl_2_final_clean")
# pbs = milo_object@nhoods
pb.milo <- runPCA(pb.milo, assay.type = "Feature_space")
We can visualise the PCA of the pseudobulked VDJ feature space.
plotPCA(pb.milo, color_by = "anno_lvl_2_final_clean")
TCR trajectory inference using Absorbing Markov Chain
In the original dandelion
python package, the trajectory
inference is done using the palantir
package. Here, we
implement the absorbing markov chain approach in dandelionR to infer the
trajectory, leveraging on destiny
for diffusion map
computation.
Define root and branch tips
library(SingleCellExperiment)
# extract the PCA matrix
pca <- t(as.matrix(reducedDim(pb.milo, type = "PCA")))
# define the CD8 terminal cell as the top-most cell and CD4 terminal cell as the bottom-most cell
branch.tips <- c(which.max(pca[2, ]), which.min(pca[2, ]))
names(branch.tips) <- c("CD8+T", "CD4+T")
# define the start of our trajectory as the right-most cell
root <- which.max(pca[1, ])
Construct diffusion map
library(destiny)
# Run diffusion map on the PCA
dm <- DiffusionMap(t(pca), n_pcs = 50, n_eigs = 10)
Compute diffussion pseudotime on diffusion map
# the root is automatically called DPT + index of the root cell
DPTroot <- paste0("DPT", root)
# store pseudotime in milo object
pb.milo$pseudotime <- dif.pse[[DPTroot]]
# set the colours for pseudotime
pal <- colorRampPalette(rev((RColorBrewer::brewer.pal(9, "RdYlBu"))))(255)
plotPCA(pb.milo, color_by = "pseudotime") + scale_colour_gradientn(colours = pal)
Markov chain construction on the pseudobulk VDJ feature space
pb.milo <- markovProbability(
milo = pb.milo,
diffusionmap = dm,
terminal_state = branch.tips,
root_cell = root,
pseudotime_key = "pseudotime"
)
Visualising branch probabilities
With the Markov chain probabilities computed, we can visualise the branch probabilities towards CD4+ or CD8+ T-cell fate on the PCA plot.
plotPCA(pb.milo, color_by = "CD8+T") + scale_color_gradientn(colors = pal)
plotPCA(pb.milo, color_by = "CD4+T") + scale_color_gradientn(colors = pal)
Transfer
The next step is to project the pseudotime and the branch probability information from the pseudobulks back to each cell in the dataset. If the cell do not belong to any of the pseudobulk, it will be removed. If a cell belongs to multiple pseudobulk samples, its value should be calculated as a weighted average of the corresponding values from each pseudobulk, where each weight is inverse of the size of the pseudobulk.
Project pseudobulk data to each cell
cdata <- projectPseudotimeToCell(milo_object, pb.milo, branch.tips)
Visualise the trajectory data on a per cell basis
plotUMAP(cdata, color_by = "anno_lvl_2_final_clean", dimred = "UMAP_knngraph")
plotUMAP(cdata, color_by = "pseudotime", dimred = "UMAP_knngraph") + scale_color_gradientn(colors = pal)
plotUMAP(cdata, color_by = "CD4+T", dimred = "UMAP_knngraph") + scale_color_gradientn(colors = pal)
plotUMAP(cdata, color_by = "CD8+T", dimred = "UMAP_knngraph") + scale_color_gradientn(colors = pal)
And that’s it! We have successfully inferred the trajectory of the T-cells in this dataset!
Session info
## R Under development (unstable) (2025-01-11 r87563)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] miloR_2.3.0 edgeR_4.5.1
## [3] limma_3.63.3 scater_1.35.0
## [5] scuttle_1.17.0 SingleCellExperiment_1.29.1
## [7] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [11] IRanges_2.41.2 S4Vectors_0.45.2
## [13] BiocGenerics_0.53.3 generics_0.1.3
## [15] MatrixGenerics_1.19.1 matrixStats_1.5.0
## [17] scRepertoire_2.2.1 ggplot2_3.5.1
## [19] dandelionR_0.99.0 destiny_3.21.0
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] cubature_2.1.1 splines_4.5.0 tibble_3.2.1
## [4] polyclip_1.10-7 xts_0.14.1 lifecycle_1.0.4
## [7] globals_0.16.3 lattice_0.22-6 MASS_7.3-64
## [10] magrittr_2.0.3 vcd_1.4-13 sass_0.4.9
## [13] rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10
## [16] spam_2.11-0 sp_2.1-4 cowplot_1.1.3
## [19] RColorBrewer_1.1-3 abind_1.4-8 purrr_1.0.2
## [22] ggraph_2.2.1 nnet_7.3-20 pracma_2.4.4
## [25] tweenr_2.0.3 evmix_2.12 GenomeInfoDbData_1.2.13
## [28] ggrepel_0.9.6 irlba_2.3.5.1 listenv_0.9.1
## [31] iNEXT_3.0.1 MatrixModels_0.5-3 RSpectra_0.16-2
## [34] parallelly_1.41.0 pkgdown_2.1.1 codetools_0.2-20
## [37] smoother_1.3 DelayedArray_0.33.3 ggforce_0.4.2
## [40] tidyselect_1.2.1 UCSC.utils_1.3.0 farver_2.1.2
## [43] ScaledMatrix_1.15.0 viridis_0.6.5 jsonlite_1.8.9
## [46] BiocNeighbors_2.1.2 e1071_1.7-16 tidygraph_1.3.1
## [49] progressr_0.15.1 Formula_1.2-5 survival_3.8-3
## [52] ggalluvial_0.12.5 systemfonts_1.1.0 tools_4.5.0
## [55] ragg_1.3.3 stringdist_0.9.15 Rcpp_1.0.14
## [58] glue_1.8.0 gridExtra_2.3 SparseArray_1.7.2
## [61] laeken_0.5.3 xfun_0.50 ranger_0.17.0
## [64] TTR_0.24.4 ggthemes_5.1.0 dplyr_1.1.4
## [67] withr_3.0.2 numDeriv_2016.8-1.1 BiocManager_1.30.25
## [70] fastmap_1.2.0 boot_1.3-31 bluster_1.17.0
## [73] SparseM_1.84-2 VIM_6.2.2 digest_0.6.37
## [76] rsvd_1.0.5 R6_2.5.1 textshaping_0.4.1
## [79] colorspace_2.1-1 gtools_3.9.5 tidyr_1.3.1
## [82] hexbin_1.28.5 data.table_1.16.4 robustbase_0.99-4-1
## [85] class_7.3-23 graphlayouts_1.2.1 httr_1.4.7
## [88] htmlwidgets_1.6.4 S4Arrays_1.7.1 scatterplot3d_0.3-44
## [91] uwot_0.2.2 pkgconfig_2.0.3 gtable_0.3.6
## [94] lmtest_0.9-40 XVector_0.47.2 htmltools_0.5.8.1
## [97] carData_3.0-5 dotCall64_1.2 SeuratObject_5.0.2
## [100] scales_1.3.0 knn.covertree_1.0 ggdendro_0.2.0
## [103] knitr_1.49 rjson_0.2.23 reshape2_1.4.4
## [106] curl_6.1.0 proxy_0.4-27 cachem_1.1.0
## [109] zoo_1.8-12 stringr_1.5.1 parallel_4.5.0
## [112] vipor_0.4.7 desc_1.4.3 pillar_1.10.1
## [115] grid_4.5.0 vctrs_0.6.5 pcaMethods_1.99.0
## [118] VGAM_1.1-12 car_3.1-3 BiocSingular_1.23.0
## [121] beachmat_2.23.6 cluster_2.1.8 beeswarm_0.4.0
## [124] evaluate_1.0.3 truncdist_1.0-2 cli_3.6.3
## [127] locfit_1.5-9.10 compiler_4.5.0 rlang_1.1.4
## [130] crayon_1.5.3 future.apply_1.11.3 labeling_0.4.3
## [133] plyr_1.8.9 fs_1.6.5 ggbeeswarm_0.7.2
## [136] stringi_1.8.4 viridisLite_0.4.2 BiocParallel_1.41.0
## [139] assertthat_0.2.1 gsl_2.1-8 munsell_0.5.1
## [142] quantreg_5.99.1 Matrix_1.7-1 RcppHNSW_0.6.0
## [145] RcppEigen_0.3.4.0.2 patchwork_1.3.0 future_1.34.0
## [148] statmod_1.5.0 evd_2.3-7.1 igraph_2.1.3
## [151] memoise_2.0.1 bslib_0.8.0 DEoptimR_1.1-3-1
## [154] ggplot.multistats_1.0.1