Friday, June 15, 2012

Principal Component Analysis Possibly Not Picking Up Low Frequency Saami Connections

I'm not sure this 2010 observation is still scientifically relevant in mid-2012 (the PCA method of SNP analysis may have been refined since then, particularly given how quickly this field gains new knowledge), but here is an interesting comment posted by ohwilleke regarding U5b1b:

“Whatever ancient affinities the Sami may have with Southern Europeans via mtDNA haplogroup U5, it is not evident in the total genome content.”

If connections illustrated by mtDNA haplogroups found in 90% of Saami aren’t showing up in PCA analysis of total genome content, then something is wrong with either the PCA analysis method, or the Southern European total genome sample.

My suspicion is that the Southern European total genome sample, which has far fewer individuals than the Southern European mtDNA datasets, is failing to pick up traits that, while predominant in the Saami, are very rare in other places where these haplogroups are found today. That contrast could be due to founder effects in the Saami population (perhaps the entire pre-modern population had roots in people on a handful of coastal canoes who were genetically atypical), to subsequent dilution of the European and North African populations by overwhelming levels of immigration from elsewhere (for example, upon the arrival of Neolithic farmers), or to both.

“[The characteristic U5 mtDNA haplogroup subtype is] found generally at low frequencies (<2%) in Berber populations and in other African groups (such as the Fulbe) known to have intermingled with Berbers (Rosa et al. 2004). The motif also shows similarly low frequencies in virtually all European populations, except the Saami of northern Scandinavia, in which it reaches ~48% (Tambets et al 2004).”

In other words, I think it is likely that the much smaller sample sizes used in whole genome studies simply miss the low frequency contributions to the mix entirely, because the smaller samples lack people who carry these low frequency genes.
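To put a rough number on that intuition, here is a minimal sketch of how likely a sample is to contain no carrier at all of a lineage present at 2% frequency. It assumes individuals are drawn independently and at random, which real study cohorts may not satisfy, and the sample sizes are hypothetical, chosen only for illustration rather than taken from any actual study.

```python
def prob_sample_misses(frequency: float, sample_size: int) -> float:
    """Chance that none of `sample_size` independent random draws
    carries a lineage present at `frequency` in the population."""
    return (1.0 - frequency) ** sample_size


# Hypothetical sample sizes; 0.02 is the "<2%" ceiling quoted above.
for n in (25, 100, 500):
    print(f"n={n:4d}  P(no carrier sampled) = {prob_sample_misses(0.02, n):.4f}")
```

Under these assumptions a sample of 25 misses the lineage more often than not, while a sample of several hundred almost never does.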

PCA, because it focuses on the average similarity of whole genome comparisons rather than on phylogenetically notable outlier components, likewise obscures the contribution of low frequency genetic elements, as does a tool like Admixture until you have a very large number of source populations in the mix.
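The averaging effect can be illustrated with a toy calculation. This is not PCA itself, just a simpler stand-in: it compares allele frequencies averaged over many loci. The two source populations, their randomly generated allele frequencies, and the 2% mixing share are all assumptions made up for the sketch.

```python
import random


def mixed_frequencies(major, minor, minor_share):
    """Allele frequencies of a population that draws `minor_share`
    of its ancestry from `minor` and the rest from `major`."""
    return [(1 - minor_share) * a + minor_share * b for a, b in zip(major, minor)]


def mean_abs_difference(p, q):
    """Average absolute allele-frequency difference across loci."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
major = [rng.random() for _ in range(10_000)]  # hypothetical majority source
minor = [rng.random() for _ in range(10_000)]  # hypothetical rare ancestral source

mixed = mixed_frequencies(major, minor, 0.02)
gap = mean_abs_difference(minor, major)    # how different the two sources are
shift = mean_abs_difference(mixed, major)  # how far 2% admixture moves the average

print(f"source gap: {gap:.4f}, shift from 2% admixture: {shift:.4f}")
```

In this toy model the shift is exactly 2% of the gap between the two sources, so a genome-wide average barely moves even when the minor source is very distinct.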

The Saami affinity to Berbers and Southern Europeans is not really to these populations as a whole, but to an ancestral population that has left a less than 2% trace in the modern Berber and Southern European populations, but has left a 90% impact on modern Saami populations in the matriline (and hence, presumably, has had a total genome impact in the Saami of 45% give or take, or much more, if there was also a major Y-DNA contribution).

It would really make more sense to measure how much of a Saami contribution there is in Berber and Southern European whole genomes than to look at how much of the Saami total genome comes from other source populations, since the relative uniformity of Saami genetics and the Saami cultural context suggest that it may be a relatively pure ancestral type that is no longer found in other populations. If the analysis proceeded that way, one would expect to see a small component (under 2%) of whatever color was assigned to the Saami in a wide variety of European and Berber populations.

