library(knitr)
library(scatterplot3d)
library(Rtsne)
library(coRanking)
npoints <- 1000
theta <- runif(npoints, 0, 2 * pi)
u <- runif(npoints, -1, 0.8)
data <- list()
data$x <- sqrt(1 - u ^ 2) * cos(theta)
data$y <- sqrt(1 - u ^ 2) * sin(theta)
data$z <- u
data$col <-
rgb(colorRamp(colors = c("red", "yellow", "green"))( (data$z + 1) / 2),
maxColorValue = 255)
data <- as.data.frame(data, stringsAsFactors = F)
The co-ranking matrix is a tool to assess the quality of a dimensionality reduciton.
The fishbowl data set consists of a sphere with a hole on top:
Dimensionality reductions:
dim.red <- list()
## dim.red$isomap <- isomap(dist(data[c("x","y","z")]), k = 20)
## dim.red$kpca <- kpca(~x + y + z, data)
dim.red$tsne <- Rtsne(data[c("x", "y", "z")])
dim.red$pca <- princomp(data[c("x", "y", "z")])
## plot(dim.red$isomap$points, col = data$col)
## plot(rotated(dim.red$kpca), col = data$col)
plot(dim.red$tsne$Y, col = data$col,
xlab = "tsne I", ylab = "tsne II",
main = "t-SNE")
plot(dim.red$pca$scores, col = data$col,
xlab = "PCA I", ylab = "PCA II",
main = "PCA")
the corresponding co-ranking matrices are:
Q.tsne <- coranking(data[c("x", "y", "z")], dim.red$tsne$Y)
Q.pca <- coranking(data[c("x", "y", "z")], dim.red$pca$scores[, 1:2])
imageplot(Q.tsne, main = "t-SNE")
imageplot(Q.pca, main = "PCA")
As one can observe, the t-SNE tears the data, points that were close are far apart, e.g. the upper border of the fish bowl. This is reflected in the co-ranking matrix in higher values in the upper right part. The PCA crushes the points, the data set gets flattened and points that were far away are very close now, which manifests in more values in the lower left part of the co-ranking matrix.
t-SNE also tends to keep the smaller distances between points, i.e. values close to the diagonal are higher in the upper left part, while PCA keeps the larger distances, i.e. the values close to the diagonal on the lower right are higher.
Defined by Lueks et al. (2011). A general quality measure depending on K could be defined as:
$$Q_{NX}(K) = \frac{1}{KN} \sum_{k=1}^{K}\sum_{l=1}^{K}Q_{kl}$$
The values of QNX(K) are split into local and global values by
$$K_\max = \arg\max_K\text{LCMC}(K) = \arg\max_K\left(Q_{NX} - \frac{K}{N-1}\right)$$
qnx.tsne <- coRanking:::Q_NX(Q.tsne)
qnx.pca <- coRanking:::Q_NX(Q.pca)
lcmc.tsne <- LCMC(Q.tsne)
lcmc.pca <- LCMC(Q.pca)
Kmax.tsne <- which.max(lcmc.tsne)
Kmax.pca <- which.max(lcmc.pca)
yrange <- range(c(qnx.tsne, qnx.pca))
plot(qnx.tsne, xlab = "K", ylab = expression(Q[NX]), type = "l", ylim = yrange, col = 1)
abline(v = Kmax.tsne, col = 1, lty = 2)
text(Kmax.tsne, mean(yrange) + 0.1, expression(K[max]), col = 1, pos = 4)
lines(qnx.pca, main = "PCA", xlab = "K", ylab = expression(Q[NX]), ylim = 0:1, col = 2)
abline(v = Kmax.pca, col = 2, lty = 2)
text(Kmax.pca, mean(yrange) - 0.1, expression(K[max]), col = 2, pos = 4)
legend("bottomright", legend = c("t-SNE", "PCA"), lty = 1, col = 1:2)
Qlocal is the mean over the values left of the maximum, Qglobal is the mean over the values right of the maximum.
Qlocal is considered to be more important than Qglobal.