--- title: "Explore non-linear embeddings with liminal" author: "Stuart Lee" output: rmarkdown::html_vignette link-citations: yes bibliography: liminal.bib vignette: > %\VignetteIndexEntry{liminal} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(ggplot2) theme_set(theme_bw()) library(liminal) ``` The **liminal** R package is designed to help you understand and interrogate non-linear dimension reduction results via the (grand) tour and linked graphics. It offers two main functions that are useful for exploring high-dimensional datasets, `limn_tour()` and `limn_tour_link()`. Both of these functions create interactive visualisations that are embedded in either the RStudio Viewer pane or on a web browser through **shiny** [@R-shiny]. # Touring high-dimensional space A tour is a dynamic visualisation that is displayed as a smooth animation of low-dimensional projections. A high dimensional dataset is projected onto a sequence of lower dimensional targets (also called bases) enabling the tour to explore the subspace of all lower dimensional projections. The way that targets are generated is called the tour path, in **liminal** we default to using the grand tour, which generates random Gaussian targets, although any tour path available in the **tourr** package can be used [@tourr2011; @R-tourr]. For more details, see @Lee2021-gn for a review of tour methods and @tourr2011 for their implementation in R. Let's take a look at an example, which has been adapted from the **phateR** package [@R-phateR; @phateR2019].. The **liminal** package comes with a built in `data.frame` which is a high-dimensional tree structured dataset called `fake_trees`. It consists of `r nrow(fake_trees)` observations over 100 numeric variables. The tree can really be embedded in 2 dimensions and has 10 branches. First, let's generate principal components: ```{r pc-view} library(liminal) data("fake_trees") pcs <- prcomp(fake_trees[, -ncol(fake_trees)]) # var explained head(cumsum(pcs$sdev / sum(pcs$sdev))) ``` And visualise the results as scatter plot by augmenting the original data: ```{r pca-xy} library(ggplot2) fake_trees <- dplyr::bind_cols(fake_trees, as.data.frame(pcs$x)) ggplot(fake_trees, aes(x = PC1, y = PC2, color = branches)) + geom_point() + scale_color_manual(values = limn_pal_tableau10()) ``` We see some separation of the branches in first two principal components, however, we can't see each of the branches clearly or their relation to each other. We can tour the components that represent most of the variation in the data to get a sense of the underlying structure. The following code generates a shiny application, that provides an interface to the tour: ```{r, eval = FALSE} # this loads a shiny app on the first fifteen PCs limn_tour(fake_trees, cols = PC1:PC15, color = branches) ``` The interface consists of the tour view which is a dynamic scatterplot and an axis view which corresponds to magnitude and direction of the generated targets. ```{r interface, echo = FALSE, fig.align='center'} knitr::include_graphics("figures/limn_tour_basic.png") ``` From the tour view, we can see that the blue branch is hidden (after highlighting it and letting the animation play) and forms the backbone of the tree. Brushing on the tour view is activated with the shift key plus a mouse drag. It will highlight points that fall inside the brush and pause the current view. There are several additional interactions available on this view: * There is a play button, that when pressed will start the tour. * There is also a text view of the half range which is the maximum squared Euclidean distance between points in the tour view. The half range is a scale factor for projections and can be thought of as a way of zooming in and out on points. It can be dynamically modified by scrolling (via a mouse-wheel). To reset double click the tour view. * The legend can be toggled to highlight groups of points with shift+mouse-click. Multiple groups can be selected in this way. To reset double click the legend title. # Adding non-linear embeddings We can also compare this to t-SNE embedding run with default settings: ```{r tsne-xy} set.seed(2099) tsne <- Rtsne::Rtsne(dplyr::select(fake_trees, dplyr::starts_with("dim"))) tsne_df <- data.frame(tsneX = tsne$Y[,1], tsneY = tsne$Y[,2]) ggplot(tsne_df, aes(x = tsneX, y = tsneY, color = fake_trees$branches)) + geom_point() + scale_color_manual(values = limn_pal_tableau10()) ``` The topology is a little messed up as the blue branch is now broken into two distinct pieces. We can see where our embedding is different via a linked tour: ```{r, eval = FALSE} limn_tour_link(embed_data = tsne_df, tour_data = fake_trees, cols = PC1:PC10, # tour columns to select color = branches # variable to highlight across both view, can come for either data frames ) ``` This function requires two tables that will be linked together in separate views. The tour interface is the same as above, except now brushing on the tour view will highlight points on the right hand side scatter plot. The right hand side scatterplot view is an interactive scatterplot. Brushing on the right view is activated via click and drag movements. ```{r limn-interface, echo = FALSE, fig.align='center'} knitr::include_graphics("figures/limn_tour_linked.png") ``` We can see from brushing on the right, where t-SNE has broken up the global structure in the data and distorted the distance between points. For either interface, you can assign the results to an R object, that will return a list consisting of the selected basis (target) and the brushing bounding boxes. ```{r, eval = FALSE} res <- limn_tour_link(embed_data = tsne_df, tour_data = fake_trees, cols = PC1:PC10, # tour columns to select color = branches # variable to highlight across both view, can come for either data frames ) ``` # References {-}