The liminal R
package is designed to help you understand and interrogate non-linear
dimension reduction results via the (grand) tour and linked graphics. It
offers two main functions that are useful for exploring high-dimensional
datasets, limn_tour()
and limn_tour_link()
.
Both of these functions create interactive visualisations that are
embedded in either the RStudio Viewer pane or on a web browser through
shiny (Chang et al. 2021).
A tour is a dynamic visualisation that is displayed as a smooth animation of low-dimensional projections. A high dimensional dataset is projected onto a sequence of lower dimensional targets (also called bases) enabling the tour to explore the subspace of all lower dimensional projections. The way that targets are generated is called the tour path, in liminal we default to using the grand tour, which generates random Gaussian targets, although any tour path available in the tourr package can be used (Wickham et al. 2011; Wickham and Cook 2021). For more details, see Lee et al. (2021) for a review of tour methods and Wickham et al. (2011) for their implementation in R.
Let’s take a look at an example, which has been adapted from the phateR package (Srinivasan 2021; Moon et al. 2019)..
The liminal package comes with a built in
data.frame
which is a high-dimensional tree structured
dataset called fake_trees
. It consists of 3000 observations
over 100 numeric variables. The tree can really be embedded in 2
dimensions and has 10 branches.
First, let’s generate principal components:
library(liminal)
data("fake_trees")
pcs <- prcomp(fake_trees[, -ncol(fake_trees)])
# var explained
head(cumsum(pcs$sdev / sum(pcs$sdev)))
#> [1] 0.05569726 0.09768259 0.13586130 0.17200716 0.20405094 0.23398570
And visualise the results as scatter plot by augmenting the original data:
library(ggplot2)
fake_trees <- dplyr::bind_cols(fake_trees, as.data.frame(pcs$x))
ggplot(fake_trees, aes(x = PC1, y = PC2, color = branches)) +
geom_point() +
scale_color_manual(values = limn_pal_tableau10())
We see some separation of the branches in first two principal components, however, we can’t see each of the branches clearly or their relation to each other.
We can tour the components that represent most of the variation in the data to get a sense of the underlying structure.
The following code generates a shiny application, that provides an interface to the tour:
# this loads a shiny app on the first fifteen PCs
limn_tour(fake_trees, cols = PC1:PC15, color = branches)
The interface consists of the tour view which is a dynamic scatterplot and an axis view which corresponds to magnitude and direction of the generated targets.
From the tour view, we can see that the blue branch is hidden (after highlighting it and letting the animation play) and forms the backbone of the tree.
Brushing on the tour view is activated with the shift key plus a mouse drag. It will highlight points that fall inside the brush and pause the current view.
There are several additional interactions available on this view: * There is a play button, that when pressed will start the tour. * There is also a text view of the half range which is the maximum squared Euclidean distance between points in the tour view. The half range is a scale factor for projections and can be thought of as a way of zooming in and out on points. It can be dynamically modified by scrolling (via a mouse-wheel). To reset double click the tour view. * The legend can be toggled to highlight groups of points with shift+mouse-click. Multiple groups can be selected in this way. To reset double click the legend title.
We can also compare this to t-SNE embedding run with default settings:
set.seed(2099)
tsne <- Rtsne::Rtsne(dplyr::select(fake_trees, dplyr::starts_with("dim")))
tsne_df <- data.frame(tsneX = tsne$Y[,1],
tsneY = tsne$Y[,2])
ggplot(tsne_df, aes(x = tsneX, y = tsneY, color = fake_trees$branches)) +
geom_point() +
scale_color_manual(values = limn_pal_tableau10())
The topology is a little messed up as the blue branch is now broken into two distinct pieces.
We can see where our embedding is different via a linked tour:
limn_tour_link(embed_data = tsne_df,
tour_data = fake_trees,
cols = PC1:PC10, # tour columns to select
color = branches # variable to highlight across both view, can come for either data frames
)
This function requires two tables that will be linked together in
separate views. The tour interface is the same as above, except now
brushing on the tour view will highlight points on the right hand side
scatter plot. The right hand side scatterplot view is an interactive
scatterplot.
Brushing on the right view is activated via click and drag
movements.
We can see from brushing on the right, where t-SNE has broken up the global structure in the data and distorted the distance between points. For either interface, you can assign the results to an R object, that will return a list consisting of the selected basis (target) and the brushing bounding boxes.