---
title: "Using liminal to understand high dimensional parameter space"
output: rmarkdown::html_vignette
link-citations: yes
bibliography: liminal.bib
vignette: >
  %\VignetteIndexEntry{geometry_parameter_space}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ggplot2)
theme_set(theme_bw())
```


This example is modified from the examples tours described in @Cook2018-jm.
Here we use a tour to explore principal components space and 
any non-linear structure and clusters via t-SNE.

## Setting up the data 

Data were obtained from CT14HERA2 parton distribution function
fits as used in @Cook2018-jm. There are 28 directions in the parameter
space of parton distribution function fit, each point in the variables
labelled X1-X56 indicate moving +- 1 standard deviation from the 'best' 
(maximum likelihood estimate) fit of the function. Each observation has
all predictions of the corresponding measurement from an experiment. 
 (see table 3 in that paper for more explicit details). 
 
The remaining columns are:
 
 * InFit: A flag indicating whether an observation entered the fit of
   CT14HERA2 parton distribution function
 * Type: First number of ID
 * ID: contains the identifier of experiment, 1XX/2XX/5XX corresponds
 to Deep Inelastic Scattering (DIS) / Vector Boson Production (VBP) / 
  Strong Interaction (JET). Every ID points to an experimental paper.   
 * pt: the per experiment observational id
 * x,mu: the kinematics of a parton. x is the parton momentum fraction, and
 mu is the factorisation scale.

First, we take the load the data as a data.frame:

```{r pdfsense-prepare}
library(liminal)
data(pdfsense)
```

## Linear embeddings and the tour

First we can estimate all `nrow(pdfsense)` principal components
using on the parton distribution fits:
```{r pdfsense}
pcs  <- prcomp(pdfsense[, 7:ncol(pdfsense)])
```

Using this data structure, we can produce a screeplot:
```{r, echo = TRUE}
res <- data.frame(
  component = 1:56, 
  variance_explained = cumsum(pcs$sdev / sum(pcs$sdev))
)

ggplot(res, aes(x = component, y = variance_explained)) +
  geom_point() +
  scale_x_continuous(
    breaks = seq(0, 60, by = 5)
  ) +
  scale_y_continuous(
    labels = function(x) paste0(100*x, "%")
  )
```

Approximately 70% of the variance in the pdf fits are explained by the first 15 principal components.

Next we augment our original data with the principal components:

```{r}
pdfsense <- dplyr::bind_cols(
  pdfsense, 
  as.data.frame(pcs$x)
)
pdfsense$Type <- factor(pdfsense$Type)
```

We can view a simple tour via`limn_tour()` and color points
by their experimental group

```{r, eval = FALSE}
limn_tour(pdfsense, PC1:PC6, Type)
```


## Non-Linear embeddings

Now we can set up a non-linear embedding via t-SNE, here
we embed all 56 principal components.
```{r}
set.seed(3099)
start <- clamp_sd(as.matrix(dplyr::select(pdfsense, PC1, PC2)), sd = 1e-4)
tsne <- Rtsne::Rtsne(
  dplyr::select(pdfsense, PC1:PC56),
  pca = FALSE, 
  normalize = TRUE,
  perplexity = 50,
  exaggeration_factor = nrow(pdfsense) / 100,
  Y_init = start
)
```

Once we have run t-SNE we tidy it into a `data.frame`, to perform a linked
tour.

```{r tsne}
tsne_embedding <- as.data.frame(tsne$Y)
tsne_embedding <- dplyr::rename(tsne_embedding, tsneX = V1, tsneY = V2)
tsne_embedding$Type <- pdfsense$Type
```

We can view the clusters using a static scatter plot:
```{r}
ggplot(tsne_embedding, 
       aes(x = tsneX, y = tsneY, color = Type)) +
  geom_point() +
  scale_color_manual(values = limn_pal_tableau10())
```

We can link a tour view next to the embedding to give us
a clear picture of the clustering:

```{r, eval = FALSE}
limn_tour_link(
  tour_data = pdfsense,
  embed_data = tsne_embedding,
  cols = PC1:PC6,
  color = Type
)
```


# References {-}