Although Steve Hsu is a physicist and doesn't work in the field of evolutionary genetics, I find this explanation of genetic clusters very clear and concise:
Information Processing: Metric on the space of genomes and the scientific basis for race
Suppose that the human genome has 30,000 distinct genes, which we will label as i = 1, 2, ..., N, where N = 30k. Next, suppose that there are n_i variants or alleles (mutations) of the i-th gene. Then each human's genetic information can be described as a point on a lattice of size n_1 x n_2 x n_3 x ... x n_N, or equivalently as an N-tuple of integers whose i-th entry ranges from 1 to n_i. For the simplified case where there are exactly 10 variants of each gene, the number of points in this N-dimensional space is 10^N or 10^{30k}, one for each distinct 30k-digit number. It's a space of very high dimension, but this doesn't stop us from defining a metric, or measure of distance, between any two points in the space. (For simplicity we ignore restrictions on this space which might result from incompatibility of certain combinations, etc.)
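To make the lattice picture concrete, here is a minimal Python sketch. The passage above does not say which metric is intended, so the Hamming distance (the number of genes at which two genomes carry different variants) is used purely as an illustrative assumption. The names random_genome and hamming_distance are hypothetical helpers, and the genomes are random tuples rather than real data.

```python
import math
import random

N = 30_000               # number of genes, as in the text
VARIANTS_PER_GENE = 10   # simplified case: exactly 10 variants of every gene


def random_genome(rng, n_genes=N, n_variants=VARIANTS_PER_GENE):
    """Return one lattice point: an N-tuple of integers in 1..n_variants."""
    return tuple(rng.randint(1, n_variants) for _ in range(n_genes))


def hamming_distance(g1, g2):
    """Count the genes at which two genomes differ (an assumed choice of metric)."""
    if len(g1) != len(g2):
        raise ValueError("genomes must have the same number of genes")
    return sum(a != b for a, b in zip(g1, g2))


rng = random.Random(0)   # fixed seed so the run is reproducible
a = random_genome(rng)
b = random_genome(rng)
c = random_genome(rng)

d_ab = hamming_distance(a, b)

# The lattice has 10^N points; work with the exponent instead of the full number.
log10_points = N * math.log10(VARIANTS_PER_GENE)

print(f"distance(a, b) = {d_ab} out of {N} genes")
print(f"number of lattice points = 10^{log10_points:.0f}")
```

Any function that is zero only for identical genomes, symmetric, and obeys the triangle inequality would serve as a metric here; the Hamming distance is just the simplest one that treats every gene equally.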
Note that the genomes of all of the humans who have ever lived occupy only a small subset of this space, since most possible variations have never been realized. For this reason, the surprise expressed by biologists that humans have so few genes (not many more than a worm, and far fewer than the 100k of earlier estimates) is no cause for concern: the number of possible organisms that might result from 30k genes is enormous, far more than the number of molecules in the visible universe.
This clustering is a natural consequence of geographical isolation, inheritance and natural selection operating over the last 50k years since humans left Africa.
Every allele probably occurs in each ethnic group, but with varying frequency. Suppose that for a particular gene there are 3 common variants (v1, v2, v3), all the rest being very rare. Then, for example, one might find that in ethnic group A the distribution is v1 75%, v2 15%, v3 10%, while for ethnic group B the distribution is v1 2%, v2 6%, v3 92%. Suppose this pattern is repeated for several genes, with the common variants in population A being rare in population B, and vice versa. Then one might find a very dramatic difference in expressed phenotype between the two populations. For example, if skin color is determined by (say) 10 genes, and those genes have the distribution pattern given above, nearly all of population A might be fair-skinned while all of population B is dark, even though there is complete overlap in the set of common alleles. Perhaps carrying variant v3 in 7 out of 10 pigmentation genes makes you dark. This is highly likely for an individual in population B with the given probabilities, but highly unlikely in population A.
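The size of this effect can be checked with a short calculation. The sketch below rests on simplifying assumptions that are not in the text: each of the 10 genes is treated as an independent draw, each individual carries a single variant per gene, and "dark" is read as carrying v3 in at least 7 of the 10 genes. The function name prob_at_least is a hypothetical helper.

```python
from math import comb


def prob_at_least(k_min, n, p):
    """Binomial tail: probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


N_GENES = 10     # pigmentation genes in the example
THRESHOLD = 7    # assumed: v3 in at least 7 of the 10 genes gives the dark phenotype

P_V3_GROUP_A = 0.10   # frequency of v3 in group A, from the text
P_V3_GROUP_B = 0.92   # frequency of v3 in group B, from the text

p_dark_a = prob_at_least(THRESHOLD, N_GENES, P_V3_GROUP_A)
p_dark_b = prob_at_least(THRESHOLD, N_GENES, P_V3_GROUP_B)

print(f"P(dark | group A) = {p_dark_a:.2e}")
print(f"P(dark | group B) = {p_dark_b:.4f}")
```

Under these assumptions the probability comes out at roughly 99% in group B and on the order of one in a hundred thousand in group A, even though both groups carry all three variants of every gene.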
We see that there can be dramatic group differences in phenotypes even if there is complete allele overlap between two groups, as long as the frequency or probability distributions are distinct. But it is these distributions that are measured by the metric we defined earlier. Two groups that form distinct clusters are likely to exhibit different frequency distributions over various genes, leading to group differences.