Measures of similarity in the 20 newsgroups dataset

The 20 newsgroups dataset is a collection of newsgroup posts on 20 topics, ranging from cryptology to guns to baseball. I looked at three measures of similarity: Jaccard, cosine, and L2. Comparing each article with every other article and averaging the similarities for each pair of newsgroups gives the heat maps shown below.
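
The heat maps were computed along the lines of the following minimal sketch, shown here for cosine similarity and assuming scikit-learn's copy of the dataset and a plain bag-of-words representation (the exact preprocessing may have differed):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

newsgroups = fetch_20newsgroups(subset="train")
X = CountVectorizer().fit_transform(newsgroups.data)  # bag-of-words, one row per article
labels = newsgroups.target
n_groups = len(newsgroups.target_names)

# Full n x n matrix of pairwise cosine similarities between articles
sims = cosine_similarity(X)

# Average similarity over every (article in group i, article in group j) pair
heatmap = np.zeros((n_groups, n_groups))
for i in range(n_groups):
    for j in range(n_groups):
        heatmap[i, j] = sims[np.ix_(labels == i, labels == j)].mean()
```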

Cosine similarity heat map
Jaccard similarity heat map
L2 similarity heat map

Cosine similarity seems the most reasonable, because it considers the relative frequency of words rather than the raw counts. Take two articles A and B, where B contains the same words as A but each word appears twice as often. A similarity measure ought to indicate that these articles are highly similar. The (weighted) Jaccard similarity would be 0.5, the cosine similarity would be 1, and the L2 distance would be some non-zero number. With Jaccard and L2, the number of words in each article influences the measure, so when one article has many more words than another, the two will appear more dissimilar.
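
As a small worked check of that A/B example (using a weighted Jaccard on counts, which is the variant assumed here):

```python
import numpy as np

a = np.array([2.0, 1.0, 3.0])   # word counts in article A
b = 2 * a                        # article B: same words, each appearing twice as often

jaccard = np.minimum(a, b).sum() / np.maximum(a, b).sum()  # weighted Jaccard -> 0.5
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # -> 1.0
l2_dist = np.linalg.norm(a - b)                            # -> non-zero distance

print(jaccard, cosine, l2_dist)  # 0.5 1.0 3.7416...
```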

Let’s look at the cosine similarity plot, but with values < 0.45 removed:

Cosine similarity > 0.45 heat map

Pairs with high similarity include soc.religion.christian + soc.religion.christian and talk.politics.guns + talk.politics.guns (each newsgroup with itself), along with soc.religion.christian + talk.politics.guns; perhaps these two newsgroups have similar demographics. Other similar pairs include soc.religion.christian + alt.atheism and soc.religion.christian + talk.religion.misc, which seems plausible: there is presumably some overlap in discussing religion, or the lack of it.


Next, we look at nearest-neighbor counts. For each article, we find the most similar article in a different newsgroup and tally which newsgroup that nearest neighbor belongs to.
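
A sketch of this counting for the cosine case, reusing sims and labels from the earlier snippet; here an article's own newsgroup is excluded when searching for its nearest neighbor, matching the description above (the Jaccard version is analogous):

```python
import numpy as np

nn_counts = np.zeros((n_groups, n_groups), dtype=int)
for idx in range(sims.shape[0]):
    row = sims[idx].copy()
    row[labels == labels[idx]] = -np.inf   # ignore articles from the same newsgroup
    neighbor = row.argmax()                # most similar article elsewhere
    nn_counts[labels[idx], labels[neighbor]] += 1
```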

Jaccard similarity nearest-neighbor heat map

The average similarity plots are symmetric because, for each of the similarity measures, any pair of articles x and y gives the same value for (x, y) and (y, x); nothing in the formulas depends on the order of the two bag-of-words vectors.

The nearest-neighbor plot is asymmetric. If an article A has the largest Jaccard similarity to an article B, that does not mean B has the largest Jaccard similarity to A. For example, say there are three articles X, Y, and Z, where X and Y are similar but Z is very different from both. If Z is most similar to X, that does not mean X is most similar to Z; in this case, X is most similar to Y. So just because an article in newsgroup M has its nearest neighbor in newsgroup N does not mean that an article in newsgroup N will have its nearest neighbor in newsgroup M.

Looking at the Jaccard nearest-neighbor heat map, these groups are similar: talk.religion.misc + alt.atheism, soc.religion.christian + alt.atheism, rec.sport.hockey + rec.sport.baseball, comp.sys.ibm.pc.hardware + comp.os.ms-windows.misc,  comp.sys.mac.hardware + comp.sys.ibm.pc.hardware.

Comparing the two Jaccard plots, there is some overlap in which newsgroups look similar, such as soc.religion.christian + alt.atheism. The nearest-neighbor plot also shows some pairs that do not stand out in the average similarity plot, such as comp.sys.mac.hardware + comp.sys.ibm.pc.hardware and rec.sport.hockey + rec.sport.baseball. The average similarity plots have a more even spread of values, whereas the counts in the nearest-neighbor plot are mostly low, with a few large counts.

Average similarity is better suited to comparing newsgroups. With nearest neighbors, each article casts a single discrete vote, so disparate newsgroups can wrongly appear similar. The articles in a newsgroup may be extremely dissimilar to articles in every other newsgroup, as seems to be the case for misc.forsale: in the Jaccard and cosine average similarity plots, misc.forsale looks dissimilar to all the other newsgroups, yet in the nearest-neighbor plot a noticeable number of misc.forsale articles have their nearest neighbors in comp.sys.ibm.pc.hardware, probably because a lot of PCs are for sale, but not the other way around. Likewise, the articles in rec.sport.hockey and rec.sport.baseball might not be very similar to each other, but they are more similar to each other than to anything else.


Next, we look at how reducing the number of dimensions affects the quality of the similarity results. Here's the cosine similarity nearest-neighbor heat map:

Cosine similarity nearest-neighbor heat map

Now we reduce the dimensionality by projecting each bag-of-words vector onto d random directions whose entries are drawn from a standard normal distribution.
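
A minimal sketch of this projection, assuming the bag-of-words matrix X from earlier (the function name and seed are illustrative):

```python
import numpy as np

def reduce_dimensions(X, d, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], d))   # vocabulary_size x d, i.i.d. standard normal
    return X @ R                               # n_articles x d reduced representation

X_reduced = reduce_dimensions(X, d=10)
```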

Cosine similarity nearest-neighbor heat map, d=10
Cosine similarity nearest-neighbor heat map, d=25
Cosine similarity nearest-neighbor heat map, d=50
Cosine similarity nearest-neighbor heat map, d=100


Wall-clock times (seconds)

With no dimension reduction, calculating the cosine similarities took 202.86 seconds and finding nearest neighbors took 0.90 seconds.

d      dimension reduction (s)    cosine similarities (s)    nearest neighbors (s)
10              1.89                      34.92                      0.76
25              4.87                      81.79                      0.70
50              8.97                     158.62                      0.78
100            18.95                     319.64                      0.73

For both the dimension reduction step and the cosine similarity computation, wall-clock time increased roughly linearly with d. This makes sense: with n articles and a vocabulary of size V, the projection is an n×V by V×d matrix product and the similarity computation is essentially an n×d by d×n product, so the work in both grows linearly with d.
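
For reference, a minimal sketch of how these timings might be collected, reusing reduce_dimensions and X from the earlier snippets (the actual numbers of course depend on the machine):

```python
import time
from sklearn.metrics.pairwise import cosine_similarity

for d in (10, 25, 50, 100):
    t0 = time.perf_counter()
    X_d = reduce_dimensions(X, d)      # project down to d dimensions
    t1 = time.perf_counter()
    sims_d = cosine_similarity(X_d)    # n x n similarity matrix in the reduced space
    t2 = time.perf_counter()
    print(f"d={d}: reduction {t1 - t0:.2f}s, cosine similarities {t2 - t1:.2f}s")
```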

Target dimension d=100 gave comparable results to the original embedding.


Now let’s look at a single article, and see how cosine similarities compare after dimension reduction.

Cosine similarities for a single article, d=10
Cosine similarities for a single article, d=25
Cosine similarities for a single article, d=50
Cosine similarities for a single article, d=100

The error is the vertical distance from a point on the scatterplot to the line y = x. As d increases, the sum of the errors and the standard deviation of the errors both get smaller, because more of the information in the original full-dimensional word counts has been retained.
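
A sketch of how the errors for a single article might be computed, assuming the full-dimensional similarities sims from earlier and a reduced matrix X_d for one of the target dimensions (the article index here is purely illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

article = 0                                        # illustrative: index of the article examined
full_sims = sims[article]                          # cosine similarities using all dimensions
reduced_sims = cosine_similarity(X_d[article:article + 1], X_d)[0]

errors = np.abs(reduced_sims - full_sims)          # vertical distance to the line y = x
print(errors.sum(), errors.std())
```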

Looking at the target dimension vs. sum of errors:

d      sum of errors
10         300.66
25         113.36
50          74.97
100         67.06

The sum of errors decreases as d increases, but with diminishing returns; it appears to approach an asymptote.


Now we try dimension reduction with random sign (±1) entries instead of entries drawn from a standard normal distribution.
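
A sketch of the sign-based projection, mirroring reduce_dimensions above; only the distribution of the entries of the random matrix changes:

```python
import numpy as np

def reduce_dimensions_sign(X, d, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], d))   # random ±1 projection matrix
    return X @ R
```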

Cosine similarity nearest-neighbor heat map, random sign, d=10
Cosine similarity nearest-neighbor heat map, random sign, d=25
Cosine similarity nearest-neighbor heat map, random sign, d=50
Cosine similarity nearest-neighbor heat map, random sign, d=100
Cosine similarities for a single article, random sign, d=10
Cosine similarities for a single article, random sign, d=25
Cosine similarities for a single article, random sign, d=50
Cosine similarities for a single article, random sign, d=100

d      sum of errors (random normal)    sum of errors (random sign)
10               300.66                          191.69
25               113.36                          111.04
50                74.97                           84.49
100               67.06                           66.69

The results of dimension reduction by random sign and random normal distribution were similar. For both dimensionally-reduced matrices, the plot for d=100 was comparable to the one with full dimensions.
