PCA on data from the 1000 Genomes Project

The dataset consists of individuals from the 1000 Genomes Project, with a subsample of ~10,000 nucleobases for each individual. Each nucleobase was given a binary encoding based on the mode (the most common nucleobase) at that position.
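As a rough illustration of a mode-based encoding, here is a minimal sketch using numpy. The post does not say which value marks a match, so the choice of 0 for "matches the mode" and 1 for "differs from the mode" is an assumption, as are the function name `encode_by_mode` and the toy data.

```python
import numpy as np

def encode_by_mode(bases):
    """Return a 0/1 matrix: 0 where an individual carries the most common
    nucleobase at that position, 1 where it carries anything else.
    (The 0/1 direction is an assumption, not taken from the post.)"""
    bases = np.asarray(bases)
    encoded = np.empty(bases.shape, dtype=np.uint8)
    for j in range(bases.shape[1]):
        values, counts = np.unique(bases[:, j], return_counts=True)
        mode = values[np.argmax(counts)]  # ties go to the first value in sorted order
        encoded[:, j] = bases[:, j] != mode
    return encoded

# Toy data made up for illustration: 4 individuals (rows) x 3 positions (columns)
toy = np.array([
    ["A", "C", "G"],
    ["A", "T", "G"],
    ["G", "C", "G"],
    ["A", "C", "T"],
])
encoded = encode_by_mode(toy)
print(encoded)
```

Each row of the result is one individual and each column one nucleobase position, which is the shape PCA expects.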


The individuals came from 7 populations of African ancestry:

YRI: Yoruba in Ibadan, Nigeria
LWK: Luhya in Webuye, Kenya
GWD: Gambian in Western Divisions in the Gambia
MSL: Mende in Sierra Leone
ESN: Esan in Nigeria
ASW: Americans of African Ancestry in SW USA
ACB: African Caribbeans in Barbados

A map of the populations in the data

Plotting the first and second principal components, we see the components capture geographic information.

v1 vs. v2 components, grouped by population

On the v1 axis, the populations appear genetically similar except for LWK, ACB, and ASW. The LWK population in East Africa is relatively dissimilar to the populations on the west coast of Africa. The ACB and ASW populations are even more dissimilar and show a wide spread, perhaps because individuals in these populations are more likely to have mixed ancestry. The first principal component therefore appears to capture genetic similarity to the West African coastal populations.

On the v2 axis, GWD forms one cluster, MSL forms another, and ESN, YRI, and LWK form a third. ACB and ASW span both the MSL and the ESN/YRI/LWK clusters. The second principal component therefore separates the two populations on the western part of the coast (GWD and MSL) from the central and eastern populations (ESN, YRI, and LWK), and it suggests that individuals in the ACB and ASW populations could have ancestry from either region.
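For readers who want to reproduce this kind of projection, here is a minimal sketch of how the component scores could be computed. It uses a plain numpy SVD rather than scikit-learn (which the post lists); `sklearn.decomposition.PCA` should give the same scores up to the sign of each component. The data is synthetic, with two made-up groups "A" and "B", and `pca_scores` is a hypothetical helper name.

```python
import numpy as np

def pca_scores(X, n_components=4):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each position (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded genotype matrix: two groups whose
# variant frequencies differ at the first 50 of 200 positions.
n_per_group, n_positions = 40, 200
freq_a = np.full(n_positions, 0.2)
freq_b = freq_a.copy()
freq_b[:50] = 0.8
X = np.vstack([
    rng.random((n_per_group, n_positions)) < freq_a,
    rng.random((n_per_group, n_positions)) < freq_b,
]).astype(np.uint8)
labels = np.array(["A"] * n_per_group + ["B"] * n_per_group)

scores, explained = pca_scores(X)

# Mean position of each group in component space; well-separated centroids
# on a component suggest that component tracks the grouping.
centroids = {g: scores[labels == g].mean(axis=0) for g in ("A", "B")}
print(centroids["A"][:2], centroids["B"][:2])
```

Plotting `scores[:, 0]` against `scores[:, 1]` with one color per label would give a scatter plot of the same kind as the figure above.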


Plotting the first and third principal components, we see the third component captures gender.

v1 vs. v3 components, grouped by gender

Plotting the first and fourth principal components, we see the fourth component captures whether the individual belongs to the LWK population.

v1 vs. v4 components, grouped by population
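Reading a plot is one way to decide what a component captures. A numeric check is also possible. The sketch below is not from the original analysis: it computes a standardized mean difference (Cohen's d) between a group and everyone else on each component, using made-up scores in which only the third component depends on the label. The helper name `component_separation` is hypothetical.

```python
import numpy as np

def component_separation(scores, in_group):
    """Absolute difference of group means on each component,
    in units of the pooled standard deviation (Cohen's d)."""
    scores = np.asarray(scores, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    a, b = scores[in_group], scores[~in_group]
    pooled_var = (
        (len(a) - 1) * a.var(axis=0, ddof=1)
        + (len(b) - 1) * b.var(axis=0, ddof=1)
    ) / (len(a) + len(b) - 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)

# Made-up scores: 100 individuals x 4 components. The binary label
# shifts only the third component (index 2).
is_member = np.arange(100) % 2 == 0
scores = rng.normal(size=(100, 4))
scores[is_member, 2] += 5.0

sep = component_separation(scores, is_member)
best = int(np.argmax(sep))  # index of the component that best separates the label
print(sep.round(2), best)
```

With real scores, `in_group` could be a gender indicator or a mask such as `population == "LWK"`, and a large value on one component would support the reading taken from the plot.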

Python libraries used: numpy, scikit-learn, pandas, matplotlib
