Measures of similarity in the 20 newsgroups dataset

The 20 newsgroups dataset is a collection of posts on 20 topics, ranging from cryptology to guns to baseball. I looked at three measures of similarity: Jaccard, cosine, and L2. Comparing each article with every other article and taking the average similarity for each pair of newsgroups, we get the following heat maps:
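For reference, here is a minimal sketch of how the bag-of-words vectors might be built, assuming scikit-learn's fetch_20newsgroups and CountVectorizer (the exact preprocessing behind the plots may differ):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load the posts and turn each one into a bag-of-words count vector.
newsgroups = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)   # sparse matrix: articles x vocabulary
labels = np.array(newsgroups.target)            # newsgroup index (0-19) for each article
group_names = newsgroups.target_names           # the 20 newsgroup names
```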

Cosine similarity heat map
Jaccard similarity heat map
L2 similarity heat map

Cosine similarity seems the most reasonable, because it considers the relative frequency of words rather than the raw counts. Consider two articles, A and B, where B contains the same words as A but each word appears twice as many times. A good similarity measure ought to indicate that the articles are highly similar. Here the cosine similarity would be 1, the (weighted) Jaccard similarity would be 0.5, and the L2 distance would be some non-zero number. With Jaccard and L2, the total number of words in each article influences the measure, so when one article is much longer than another, the two will look more dissimilar.
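As a quick check on that intuition, here is a sketch with hypothetical count vectors, using the weighted (min/max) form of Jaccard on counts:

```python
import numpy as np

a = np.array([3.0, 1.0, 2.0])   # hypothetical word counts for article A
b = 2 * a                        # article B: same words, each appearing twice as often

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # 1.0
jaccard = np.minimum(a, b).sum() / np.maximum(a, b).sum()   # 0.5
l2_distance = np.linalg.norm(a - b)                         # ||a|| > 0, so not "identical"
```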

Let’s look at the cosine similarity plot, but with values < 0.45 removed:

Cosine similarity > 0.45 heat map

The newsgroups with the highest within-group similarity include soc.religion.christian and talk.politics.guns, and the pair soc.religion.christian + talk.politics.guns also stands out; perhaps these two newsgroups have similar demographics. Other similar pairs include soc.religion.christian + alt.atheism and soc.religion.christian + talk.religion.misc, which seems plausible: there is presumably some overlap in discussion of religion, or the lack of it.


Next, we look at nearest-neighbor counts: for each article, we find the most similar article in a different newsgroup, and tally how often each pair of (article's newsgroup, neighbor's newsgroup) occurs.
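A sketch of how those counts might be tallied, reusing the count matrix X and labels from the sketch above, and restricting each article's neighbor to articles from a different newsgroup per the description (swap in Jaccard or L2 as needed):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(X)                           # pairwise article similarities
counts = np.zeros((20, 20), dtype=int)

for i in range(sims.shape[0]):
    candidates = np.where(labels != labels[i])[0]     # articles from other newsgroups
    j = candidates[np.argmax(sims[i, candidates])]    # nearest neighbor among them
    counts[labels[i], labels[j]] += 1                 # tally the (source, neighbor) pair
```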

Jaccard similarity nearest-neighbor heat map

The average similarity plots are symmetric because each of these similarity measures is symmetric in its arguments: for any articles x and y, sim(x, y) = sim(y, x); nothing depends on the order of the bag-of-words vectors.

The nearest-neighbor plot is asymmetric. If an article A has its largest Jaccard similarity to an article B, that does not mean that B has its largest Jaccard similarity to A. For example, say there are three articles X, Y, and Z, where X and Y are similar but Z is very different from both. Z's nearest neighbor might be X, but X's nearest neighbor is Y, not Z. So, just because an article in newsgroup M has its largest similarity to an article in newsgroup N does not mean that any article in newsgroup N has its largest similarity to an article in newsgroup M.

Looking at the Jaccard nearest-neighbor heat map, these groups are similar: talk.religion.misc + alt.atheism, soc.religion.christian + alt.atheism, rec.sport.hockey + rec.sport.baseball, comp.sys.ibm.pc.hardware + comp.os.ms-windows.misc,  comp.sys.mac.hardware + comp.sys.ibm.pc.hardware.

Comparing the two Jaccard plots, there is some overlap in which newsgroups look similar, such as soc.religion.christian + alt.atheism. In the nearest-neighbor plot, some newsgroups appear similar that do not in the average similarity plot, such as comp.sys.mac.hardware + comp.sys.ibm.pc.hardware and rec.sport.hockey + rec.sport.baseball. The average similarity plot has a fairly even spread of values, whereas the nearest-neighbor counts are mostly low, with a few high counts.

Average similarity is better suited to comparing newsgroups. With nearest neighbors, each article contributes a discrete count, so disparate newsgroups can wrongly appear similar. The articles in a newsgroup may be extremely dissimilar to the articles in every other newsgroup, as appears to be the case for misc.forsale: in the Jaccard and cosine average similarity plots, misc.forsale is dissimilar to everything else, yet in the nearest-neighbor plot a noticeable number of misc.forsale articles have their nearest neighbor in comp.sys.ibm.pc.hardware (probably because a lot of PCs are for sale), but not the other way around. Likewise, the articles in rec.sport.hockey and rec.sport.baseball may not be all that similar to each other, but they are more similar to each other than to anything else.


Next, we look at how reducing the number of dimensions affects the quality of results for measures of similarity. Here’s the cosine similarity nearest-neighbor heat map:

Cosine similarity nearest-neighbor heat map

Now we reduce the number of dimensions with a random projection whose entries are drawn from a standard normal distribution.
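A sketch of the projection, assuming the count matrix X from above; the entries of the projection matrix are drawn i.i.d. from a standard normal, and the 1/sqrt(d) scaling keeps inner products roughly unbiased:

```python
import numpy as np

def random_normal_projection(X, d, seed=0):
    """Project the rows of X down to d dimensions with a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
    return X @ R   # works for dense arrays and scipy sparse matrices

X_d = random_normal_projection(X, d=50)
```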

Cosine similarity nearest-neighbor heat map, d=10
Cosine similarity nearest-neighbor heat map, d=25
Cosine similarity nearest-neighbor heat map, d=50
Cosine similarity nearest-neighbor heat map, d=100


Wall-clock times (seconds)

With no dimension reduction, computing the cosine similarities took 202.86 s and finding the nearest neighbors took 0.90 s.

d      dimension reduction    cosine similarities    nearest neighbors
10       1.89                  34.92                  0.76
25       4.87                  81.79                  0.70
50       8.97                 158.62                  0.78
100     18.95                 319.64                  0.73

Wall-clock time for both the dimension reduction and the cosine similarity computation increased roughly linearly with d, while the nearest-neighbor step stayed roughly constant.

Target dimension d=100 gave comparable results to the original embedding.


Now let’s look at a single article, and see how cosine similarities compare after dimension reduction.

Cosine similarities for a single article, d=10
Cosine similarities for a single article, d=25
Cosine similarities for a single article, d=50
Cosine similarities for a single article, d=100

The error is the vertical distance from a point on the scatterplot to the line y = x. As d increases, the sum of the errors and the standard deviation of the errors both get smaller, because more of the information in the original full-dimensional vectors is retained.
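In code, with full_sims and reduced_sims holding the cosine similarities of the chosen article to every other article before and after projection (a sketch; the variable names are assumptions):

```python
errors = np.abs(reduced_sims - full_sims)   # vertical distance to the line y = x
print(errors.sum(), errors.std())           # sum and spread of the errors
```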

Looking at the target dimension vs. sum of errors:

d      sum of errors
10     300.6589293
25     113.3587640
50      74.9733475
100     67.0568351

It appears that the sum of errors decreases as d increases, leveling off for larger d.


Now we try dimension reduction with a projection matrix of random signs (±1) instead of standard normal draws.
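The only change from the Gaussian sketch above is how the projection matrix is drawn (again assuming the same X):

```python
d = 50
rng = np.random.default_rng(0)
R = rng.choice([-1.0, 1.0], size=(X.shape[1], d)) / np.sqrt(d)   # random ±1 entries, scaled
X_d = X @ R
```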

Cosine similarity nearest-neighbor heat map, random sign, d=10
Cosine similarity nearest-neighbor heat map, random sign, d=25
Cosine similarity nearest-neighbor heat map, random sign, d=50
Cosine similarity nearest-neighbor heat map, random sign, d=100
Cosine similarities for a single article, random sign, d=10
Cosine similarities for a single article, random sign, d=25
Cosine similarities for a single article, random sign, d=50
Cosine similarities for a single article, random sign, d=100

d      sum of errors (random normal)    sum of errors (random sign)
10     300.6589293                      191.6908585
25     113.3587640                      111.0387588
50      74.9733475                       84.4925114
100     67.0568351                       66.6889494

The results of dimension reduction by random sign and random normal distribution were similar. For both dimensionally-reduced matrices, the plot for d=100 was comparable to the one with full dimensions.

PCA on data from the 1000 Genomes Project

I took a dataset of individuals from the 1000 Genomes Project, with a subsample of roughly 10,000 nucleobase positions per individual, and gave each position a binary encoding based on the mode (the most common base) at that position.
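A sketch of that encoding, assuming an (individuals x positions) array of nucleobase characters; an entry becomes 0 when it matches the most common base at that position and 1 otherwise (the actual encoding details may differ):

```python
import numpy as np

def encode_binary(bases):
    """bases: array of shape (n_individuals, n_positions) of nucleobase characters."""
    encoded = np.zeros(bases.shape, dtype=np.uint8)
    for j in range(bases.shape[1]):
        values, counts = np.unique(bases[:, j], return_counts=True)
        mode = values[np.argmax(counts)]           # most common base at this position
        encoded[:, j] = bases[:, j] != mode        # 1 where the individual differs from the mode
    return encoded
```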


The individuals were from seven populations of African ancestry:

YRI: Yoruba in Ibadan, Nigeria
LWK: Luhya in Webuye, Kenya
GWD: Gambian in Western Divisions in the Gambia
MSL: Mende in Sierra Leone
ESN: Esan in Nigeria
ASW: Americans of African Ancestry in SW USA
ACB: African Caribbeans in Barbados

A map of the populations in the data

Plotting the first and second principal components, we see the components capture geographic information.

v1 vs. v2 components, grouped by population
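The plot above could be produced roughly like this, assuming the encoded matrix from the sketch above and a populations array with one label per individual:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# encoded: binary matrix from the sketch above; populations: one label per individual (assumed)
pca = PCA(n_components=4)
components = pca.fit_transform(encoded)      # columns are v1, v2, v3, v4 for each individual

for pop in np.unique(populations):
    mask = populations == pop
    plt.scatter(components[mask, 0], components[mask, 1], s=10, label=pop)
plt.xlabel("v1")
plt.ylabel("v2")
plt.legend()
plt.show()
```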

On the v1 axis, the populations appear genetically similar except for LWK, ACB, and ASW. The LWK population in east Africa is relatively dissimilar to the populations on the west coast of Africa. Populations ACB and ASW are even more dissimilar and have a wide spread. Perhaps there is greater genetic diversity for the ACB and ASW populations because they are more likely to have mixed ancestry. So the first principal component captures genetic similarity to west African coast populations.

On the v2 axis, we see GWD in a cluster and MSL in a cluster, and ESN, YRI, and LWK in a cluster. ACB and ASW span both the MSL and the ESN/YRI/LWK clusters. So the second principal component captures the split between the two populations on the western part of the coast (GWD + MSL) from the other central and eastern populations (ESN/YRI/LWK), while suggesting individuals in the ACB and ASW populations could have ancestry from either region.


Plotting the first and third principal components, we see the third component captures gender.

v1 vs. v3 components, grouped by gender

Plotting the first and fourth principal components, we see the fourth component captures whether the individual belongs to the LWK population.

v1 vs. v4 components, grouped by population

Python libraries used: numpy, scikit-learn, pandas, matplotlib