The 20 newsgroups dataset is a data set of posts on 20 topics, ranging from cryptology to guns to baseball. I looked at 3 measures of similarity: Jaccard, cosine, and L2. Comparing each article with every other article, and taking the average similarity for that newsgroup, we get the following heat maps:
Cosine similarity seems the most reasonable, because it considers the relative frequency of words instead of the actual frequency. Take the case where there are two articles, A and B, and article A is the same as article B, except each word in A appears twice as many times in B. The similarity measure ought to indicate the articles are highly similar. The Jaccard similarity would be 0.5, cosine similarity would be 1, and L2 similarity would be some non-zero number. With Jaccard and L2 similarity, the number of words in each article has some influence on the similarity measure, so when one article has a lot more words than another, they will appear more dissimilar.
Let’s look at the cosine similarity plot, but with values < 0.45 removed:
Pairs of similar newsgroups include soc.religion.christian + soc.religion.christian, talk.politics.guns + talk.politics.guns, soc.religion.christian + talk.politics.guns. Perhaps these two newsgroups have similar demographics. Other similar pairs include soc.religion.christian + alt.atheism and soc.religion.christian + talk.religion.misc. This seems plausible, that there is some overlap discussing religion or lack of it.
Next, we look at nearest-neighbor counts. For each article in a newsgroup, there is an article in another newsgroup that has largest similarity.
The average similarity plots are symmetric, because in the formulas for different similarity measures, for any article x and y, (x,y) and (y, x) return the same value, there’s nothing dependent on the order of the bag-of-words vectors.
The nearest-neighbor plot is asymmetric. If an article A has the largest Jaccard similarity to an article B, that does not mean that B has the largest Jaccard similarity to A. For example, say there are three articles X, Y, and Z. X and Y are similar, but Z is very different from both. If Z is most similar to, say, X, that does not mean X is most similar to Z, in this case X is most similar to Y. So, just because an article in a newsgroup M has the largest similarity to an article in a newsgroup N, does not mean that an article in newsgroup N will have the largest similarity to an article in newsgroup M.
Looking at the Jaccard nearest-neighbor heat map, these groups are similar: talk.religion.misc + alt.atheism, soc.religion.christian + alt.atheism, rec.sport.hockey + rec.sport.baseball, comp.sys.ibm.pc.hardware + comp.os.ms-windows.misc, comp.sys.mac.hardware + comp.sys.ibm.pc.hardware.
Comparing the Jaccard plots, there is some overlap in similar newsgroups, such as soc.religion.christian + alt.atheism. In the nearest-neighbor plot, there are some newsgroups that appear similar that do not seem similar in the average similarity plot, such as comp.sys.mac.hardware + comp.sys.ibm.pc.hardware and rec.sport.hockey + rec.sport.baseball. Average similarity plots appear to have a more even distribution of similarity measures, whereas the counts in the nearest-neighbor plot are mostly low with some high counts.
Using average similarity is more suited to comparing newsgroups. With nearest-neighbors, each article has some discrete influence on similarity, so disparate newsgroups could wrongfully appear similar. It could be the case that the articles in a newsgroup are extremely dissimilar to articles in other newsgroups, such as the articles in misc.forsale. Looking at the Jaccard and cosine average similarity plots, it appears misc.forsale is dissimilar to the other newsgroups. In the nearest-neighbor plot, a noticeable number of articles in misc.forsale are nearest-neighbors to comp.sys.ibm.pc.hardware, probably because there are a lot of PCs for sale, but not the other way around. Likewise, the articles in rec.sport.hockey and rec.sport.baseball might not be similar to each other, but they are more similar to each other than to other newsgroups.
Next, we look at how reducing the number of dimensions affects the quality of results for measures of similarity. Here’s the cosine similarity nearest-neighbor heat map:
Now we reduce the dimensions by randomly drawing the features with a standard normal distribution.
Wall-clock times (seconds)
With no dimension reduction, calculating cosine similarities took 202.858168125 sec, finding nearest neighbors took 0.902053117752 sec.
|d||dimension reduction||calculating cosine similarities||finding nearest neighbors|
For dimension reduction and calculating cosine similarities, wall-clock time increased linearly with d.
Target dimension d=100 gave comparable results to the original embedding.
Now let’s look at a single article, and see how cosine similarities compare after dimension reduction.
The error is the vertical distance from a point on the scatterplot to y=x. As d increases, the sum of the errors and the standard deviation of the errors gets smaller, because more of the information about the original words in full dimensions has been retained.
Looking at the target dimension vs. sum of errors:
d sum of errors
It appears that the sum of errors asymptotically decreases as d increases.
Now we try dimension reduction with a random sign (±1) instead of a normal distribution.
|d||sum of errors, random normal distribution||sum of errors (d), random sign|
The results of dimension reduction by random sign and random normal distribution were similar. For both dimensionally-reduced matrices, the plot for d=100 was comparable to the one with full dimensions.