Haplotype, LD, and Association Tests


This is the first post of my QE prep series. I’ll be taking notes on my study for the general knowledge section of my QE. I’m gonna start this off with my favorite topic in our GGG series classes: population genetics.

What Are Haplotype and Linkage Disequilibrium (LD)

To properly understand linkage disequilibrium, I want to go back to the “Dark Age” of genetics, when Gregor Mendel brilliantly discovered (Mendelian) laws of inheritance without any knowledge of the molecular basis.

  • The First Mendelian Law (Law of Segregation) states that each organism has two alleles for each trait and one of the two alleles is randomly passed on to an offspring during sexual reproduction.

    • For example, an individual with an AB genotype is equally likely to pass an A allele or a B allele to its offspring (there are many assumptions being made in this statement but I’ll roll with it for now).
  • The Second Mendelian Law (Law of Independent Assortment) states that alleles for different traits are passed on independently of each other during sexual reproduction.

    • For example, if alleles A/a control trait 1 and alleles B/b control trait 2 (assuming they are not at the same locus), then whether an individual passes on allele A or a is independent of whether it passes on B or b. Independent assortment implies the product rule:

      If an individual with an AaBb genotype (Aa at locus 1 and Bb at locus 2) mates with another individual with the same AaBb genotype, the probability of their offspring being aabb is 1/16.

    • In case you’re wondering how to get this number, here is my favorite way to calculate it:

In the above diagram, the first step applies the Law of Segregation: I calculated the probability of each gamete containing each allele separately. The second step is where the Law of Independent Assortment comes into play: the probability of each combination of two alleles is the product of the probabilities of each allele (in one gamete). The last step is just the random combination of two gametes (for a diploid organism).
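The same three steps can be checked by brute-force enumeration. This is only a minimal sketch, assuming both of Mendel's laws hold; the helper name `gamete_probs` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def gamete_probs(genotype):
    """Gamete probabilities for a multi-locus genotype such as ("Aa", "Bb").

    Assumes each allele at a locus is passed on with probability 1/2
    (segregation) and that loci are independent (independent assortment).
    """
    probs = {}
    for alleles in product(*genotype):          # pick one allele per locus
        gamete = "".join(alleles)
        p = Fraction(1, 2) ** len(genotype)     # product rule across loci
        probs[gamete] = probs.get(gamete, 0) + p
    return probs

parent = ("Aa", "Bb")
gametes = gamete_probs(parent)                  # AB, Ab, aB, ab

# An aabb offspring needs an "ab" gamete from each of the two AaBb parents.
p_aabb = gametes["ab"] * gametes["ab"]
print(p_aabb)                                   # 1/16
```

Using `Fraction` keeps the result exact, so it can be compared directly with the 1/16 above.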

However, just as every law has an exception, it was discovered that many traits in various species didn’t really follow the Law of Independent Assortment. The following example shows a case of “dependent assortment”:

Sweet peas have purple (P) or red (p) flowers and long (L) or short (l) pollen grains. Peas from two pure lines (PPLL and ppll) were crossed. The offspring (F1) all have purple flowers and long pollen grains (PpLl). They are then self-crossed (PpLl x PpLl).

Assuming independent assortment, we expect to observe all 4 phenotypes in F2 with the following frequencies:

| Phenotype | Frequency |
|---|---|
| Purple (P_), Long (L_) | 9/16 |
| Purple (P_), Short (ll) | 3/16 |
| Red (pp), Long (L_) | 3/16 |
| Red (pp), Short (ll) | 1/16 |

However, large deviations from the expected phenotype frequencies were observed:

| Phenotype | Exp. Count | Obs. Count |
|---|---|---|
| Purple, Long (P_L_) | 216 | 284 |
| Purple, Short (P_ll) | 72 | 21 |
| Red, Long (ppL_) | 72 | 21 |
| Red, Short (ppll) | 24 | 55 |

Data from Bateson et al.
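To put a number on "large deviations", here is a rough sketch of a Pearson goodness-of-fit statistic for the 9:3:3:1 ratio. One assumption to flag: I compute the expected counts from the observed total (381), so they differ slightly from the rounded expectations in the table, which sum to 384.

```python
# Observed F2 counts from the table above (Bateson et al.).
observed = {"P_L_": 284, "P_ll": 21, "ppL_": 21, "ppll": 55}

# 9:3:3:1 ratio expected under independent assortment.
ratio = {"P_L_": 9, "P_ll": 3, "ppL_": 3, "ppll": 1}

total = sum(observed.values())
expected = {k: total * r / 16 for k, r in ratio.items()}

# Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over the 4 classes.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

print(f"total = {total}, chi-square = {chi2:.1f} (df = 3)")
```

The statistic comes out far above 7.81, the 5% critical value for 3 degrees of freedom, so independent assortment is a poor fit for these counts.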

We know that in the parental lines (PPLL x ppll), P and L, as well as p and l, always travel together because both parents are homozygous. But Mendel’s second law predicts that they should have assorted independently from F1 to F2. Instead of random combinations, we see that P still appears together with L more often than expected, and so does p with l. It’s as if P and L were somehow linked. This phenomenon is termed Linkage. We now know that this is because the two loci are close together on the same chromosome, and during meiosis the DNA on one chromosome largely segregates together (apart from some recombination).

When two loci are linked, the product rule no longer applies. This causes a non-random association of alleles at different loci in a given population, which is termed Linkage Disequilibrium or LD. (Technically LD does not always mean physical linkage. More on this later.)

When we talk about linkage, knowing the genotype of an individual alone is no longer enough: we wish to know which alleles are “linked”, i.e. sit on the same chromosome. In genetics, we call a set of alleles that are on the same chromosome and likely to be inherited together a haplotype. For example, an individual with a PpLl genotype may carry either of these pairs of haplotypes:

  • PL and pl
  • Pl and pL

Quantify LD

There are several different ways to quantify the extent of linkage between any two loci:

  • Coefficient of Linkage Disequilibrium (D)
  • Normalized coefficient of Linkage Disequilibrium (D’)
  • Correlation coefficient ($\gamma^2$, often written $r^2$) and $\chi^2$ test

I’ll briefly go over them below.

Linkage Disequilibrium Coefficient ($D$):

As mentioned before, if two loci are independent, their frequencies follow the product rule:

For two independent loci A and B, each with two alleles ($A_1$, $A_2$ and $B_1$, $B_2$, respectively), we have: $$f_{A_1B_1} = f_{A_1}f_{B_1}$$

Now if A and B are not independent, this equation no longer holds. In that case we have: $$f_{A_1B_1} = f_{A_1}f_{B_1} + D$$

Here $D$ measures the extent of linkage disequilibrium between $A_1$ and $B_1$. We can calculate the same for all 4 haplotypes:

| B \ A | frequency | $A_1$ | $A_2$ |
|---|---|---|---|
| frequency | | $f_{A_1}$ | $f_{A_2}$ |
| $B_1$ | $f_{B_1}$ | $f_{A_1}f_{B_1} + D_{A_1B_1}$ | $f_{A_2}f_{B_1} + D_{A_2B_1}$ |
| $B_2$ | $f_{B_2}$ | $f_{A_1}f_{B_2} + D_{A_1B_2}$ | $f_{A_2}f_{B_2} + D_{A_2B_2}$ |

Solving the above equations, we get:

$D_{A_1B_1} = f_{A_1B_1} - f_{A_1}f_{B_1}$

$D_{A_1B_2} = f_{A_1B_2} - f_{A_1}f_{B_2}$

$D_{A_2B_1} = f_{A_2B_1} - f_{A_2}f_{B_1}$

$D_{A_2B_2} = f_{A_2B_2} - f_{A_2}f_{B_2}$

For a given population, we have

$f_{A_1} + f_{A_2} = 1$

$f_{B_1} + f_{B_2} = 1$

$f_{A_1B_1} + f_{A_1B_2} = f_{A_1}$

$f_{A_1B_1} + f_{A_2B_1} = f_{B_1}$

$f_{A_2B_1} + f_{A_2B_2} = f_{A_2} = 1 - f_{A_1}$

$f_{A_1B_2} + f_{A_2B_2} = f_{B_2} = 1 - f_{B_1}$

We can then expand $D_{A_2B_2}$ using these identities:

$$\begin{aligned}
D_{A_2B_2} &= f_{A_2B_2} - (1 - f_{A_1})(1 - f_{B_1}) \\
&= 1 - f_{B_1} - f_{A_1B_2} - (1 - f_{A_1})(1 - f_{B_1}) \\
&= 1 - f_{B_1} - f_{A_1} + f_{A_1B_1} - (1 - f_{A_1})(1 - f_{B_1}) \\
&= 1 - f_{B_1} - f_{A_1} + f_{A_1B_1} - 1 + f_{B_1} + f_{A_1} - f_{A_1}f_{B_1} \\
&= f_{A_1B_1} - f_{A_1}f_{B_1} = D_{A_1B_1}
\end{aligned}$$

Similarly we can prove that $D_{A_1B_1} = - D_{A_1B_2} = - D_{A_2B_1} = D_{A_2B_2} = D$. (Note that this is only true for diallelic loci. For loci with more than two alleles, we need to calculate $D$ for each allele pair separately.)
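A quick numeric check of this sign pattern, as a sketch. The haplotype frequencies below are hypothetical numbers I picked for illustration, not data from anywhere.

```python
# Hypothetical haplotype frequencies (must sum to 1).
hap = {("A1", "B1"): 0.4, ("A1", "B2"): 0.2,
       ("A2", "B1"): 0.1, ("A2", "B2"): 0.3}

# Allele frequencies are the marginal sums of the haplotype table.
f_allele = {}
for (a, b), f in hap.items():
    f_allele[a] = f_allele.get(a, 0.0) + f
    f_allele[b] = f_allele.get(b, 0.0) + f

# D for each haplotype: observed frequency minus the product-rule expectation.
D = {h: f - f_allele[h[0]] * f_allele[h[1]] for h, f in hap.items()}

for h, d in D.items():
    print(h, round(d, 4))
```

With these frequencies the two "coupling" haplotypes ($A_1B_1$, $A_2B_2$) get $D = 0.1$ and the other two get $-0.1$.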

This makes sense because linkage disequilibrium should be a measurement of two loci not any two specific alleles in those loci. Therefore the D for any haplotype between the two given loci should be the same (or at least the absolute value of it). We then have:

$D = D_{A_1B_1} = f_{A_1B_1} - f_{A_1}f_{B_1}$

$D = -D_{A_1B_2} = -f_{A_1B_2} + f_{A_1}f_{B_2}$

$D = -D_{A_2B_1} = -f_{A_2B_1} + f_{A_2}f_{B_1}$

$D = D_{A_2B_2} = f_{A_2B_2} - f_{A_2}f_{B_2}$

Next we can prove that $D = f_{A_1B_1}f_{A_2B_2} - f_{A_1B_2}f_{A_2B_1}$:

$$\begin{aligned}
f_{A_1B_1}f_{A_2B_2} &= (f_{A_1}f_{B_1} + D)(f_{A_2}f_{B_2} + D) \\
&= f_{A_1}f_{A_2}f_{B_1}f_{B_2} + D(f_{A_1}f_{B_1} + f_{A_2}f_{B_2}) + D^2
\end{aligned}$$

$$\begin{aligned}
f_{A_1B_2}f_{A_2B_1} &= (f_{A_1}f_{B_2} - D)(f_{A_2}f_{B_1} - D) \\
&= f_{A_1}f_{A_2}f_{B_1}f_{B_2} - D(f_{A_1}f_{B_2} + f_{A_2}f_{B_1}) + D^2
\end{aligned}$$

Take the difference:

$$\begin{aligned}
f_{A_1B_1}f_{A_2B_2} - f_{A_1B_2}f_{A_2B_1} &= f_{A_1}f_{A_2}f_{B_1}f_{B_2} + D(f_{A_1}f_{B_1} + f_{A_2}f_{B_2}) + D^2 \\
&\quad - \left(f_{A_1}f_{A_2}f_{B_1}f_{B_2} - D(f_{A_1}f_{B_2} + f_{A_2}f_{B_1}) + D^2\right) \\
&= D(f_{A_1}f_{B_1} + f_{A_2}f_{B_2} + f_{A_1}f_{B_2} + f_{A_2}f_{B_1}) \\
&= D(f_{A_1} + f_{A_2})(f_{B_1} + f_{B_2}) = D \times 1 = D
\end{aligned}$$
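For anyone who trusts numbers more than algebra, the identity can be spot-checked on random haplotype frequencies. A small sketch; the function names are my own.

```python
import random

random.seed(0)  # fixed seed so the check is reproducible

def d_from_definition(f11, f12, f21, f22):
    """D = f(A1B1) - f(A1) * f(B1)."""
    f_a1 = f11 + f12
    f_b1 = f11 + f21
    return f11 - f_a1 * f_b1

def d_from_cross_product(f11, f12, f21, f22):
    """D = f(A1B1) * f(A2B2) - f(A1B2) * f(A2B1)."""
    return f11 * f22 - f12 * f21

max_gap = 0.0
for _ in range(1000):
    raw = [random.random() for _ in range(4)]
    freqs = [x / sum(raw) for x in raw]        # normalize so they sum to 1
    gap = abs(d_from_definition(*freqs) - d_from_cross_product(*freqs))
    max_gap = max(max_gap, gap)

# The two formulas agree up to floating-point rounding.
print(max_gap < 1e-12)
```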

Now we have a measure of linkage disequilibrium. By definition, we know that $D=0$ indicates independent loci (no linkage). What should the maximum value of $D$ be?

Say we have a population with the following allele frequencies: $f_{A_1} = f_{A_2} = 0.5$ and $f_{B_1} = f_{B_2} = 0.5$.

Suppose there is complete linkage between $A_1$ and $B_1$, meaning we have the following haplotype frequencies: $f_{A_1B_1} = f_{A_2B_2} = 0.5$ and $f_{A_1B_2} = f_{A_2B_1} = 0$.

We can then calculate $D = f_{A_1B_1}f_{A_2B_2} - f_{A_1B_2}f_{A_2B_1} = 0.5 \times 0.5 - 0 = 0.25$

When two loci with allele frequencies of 0.5 are in complete linkage, we have $D=0.25$ (or $-0.25$ if $A_1$ is paired with $B_2$ instead). These are the extremes: for any two loci, we always have $D \in [-0.25,0.25]$.
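The range can also be confirmed by scanning all haplotype frequency combinations on a grid. A rough sketch; the grid resolution (`steps = 50`) is an arbitrary choice of mine.

```python
# Scan haplotype "counts" (i, j, k, l) that sum to `steps` and track the
# extreme values of D = f11 * f22 - f12 * f21.
steps = 50
d_min = d_max = 0.0
for i in range(steps + 1):                      # A1B1
    for j in range(steps + 1 - i):              # A1B2
        for k in range(steps + 1 - i - j):      # A2B1
            l = steps - i - j - k               # A2B2 takes the remainder
            d = (i * l - j * k) / steps ** 2
            d_min, d_max = min(d_min, d), max(d_max, d)

print(d_min, d_max)                             # -0.25 0.25
```

The maximum is reached at $f_{A_1B_1} = f_{A_2B_2} = 0.5$ and the minimum at $f_{A_1B_2} = f_{A_2B_1} = 0.5$.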

This scale is not very intuitive but we can work with it. The next question is, does equal $D$ mean equal linkage disequilibrium?

Consider the following two populations:

  1. Pop 1 has the following haplotype frequencies: $$\begin{gathered} f_{A_1B_1} = f_{A_2B_2} = 0.34 \\ f_{A_1B_2} = f_{A_2B_1} = 0.16 \end{gathered}$$ We can calculate $D = 0.34^2 - 0.16^2 = 0.09$
  2. Pop 2 has the following haplotype frequencies: $$\begin{gathered} f_{A_1B_1} = 0.9 \\ f_{A_2B_2} = 0.1 \\ f_{A_1B_2} = f_{A_2B_1} = 0 \end{gathered}$$ We can calculate $D = 0.9 \times 0.1 - 0 = 0.09$

Even though $D$ at these loci is exactly the same for both populations, we can clearly see that pop 2 has complete linkage between these two loci while pop 1 does not.

This example shows a major problem of $D$ as a measure of linkage disequilibrium: it ignores the allele frequencies in a given population.

Standardized Linkage Disequilibrium Coefficient ($D'$)

To address this problem, we can standardize $D$ against allele frequency: $$D' = \frac{D}{D_{max}}$$

$D_{max}$ is the maximum possible $D$ in a population with the same allele frequencies (but different haplotype frequencies). For $D > 0$, the largest possible value of $f_{A_1B_1}$ is $\min(f_{A_1}, f_{B_1})$, so $$D_{max} = \max(f_{A_1B_1}) - f_{A_1}f_{B_1} = \min(f_{A_1},f_{B_1}) - f_{A_1}f_{B_1}$$ For $D < 0$, the bound on $|D|$ is $\min(f_{A_1}f_{B_1}, f_{A_2}f_{B_2})$ instead, and $D'$ is commonly reported as $|D|/D_{max}$. We can also prove that every haplotype between two given loci gives the same $D'$.

For example, in the above two populations:

  1. Pop 1: $$\begin{gathered} f_{A_1} = f_{A_2} = f_{B_1} = f_{B_2} = 0.5 \\ D_{max} = 0.5 - 0.5 \times 0.5 = 0.25 \\ D' = \frac{D}{D_{max}} = \frac{0.09}{0.25} = \frac{9}{25} \end{gathered}$$
  2. Pop 2: $$\begin{gathered} f_{A_1} = f_{B_1} = 0.9 \\ f_{A_2} = f_{B_2} = 0.1 \\ D_{max} = 0.9 - 0.9 \times 0.9 = 0.09 \\ D' = \frac{D}{D_{max}} = \frac{0.09}{0.09} = 1 \end{gathered}$$

Now we can see that $D'$ for population 2 is much higher than that of population 1, indicating a much higher linkage disequilibrium between these loci in pop 2. For $D \ge 0$, this definition gives $D' \in [0,1]$.
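The two populations can be run through the same calculation. This is a minimal sketch that only handles the $D > 0$ case used in this section; `d_and_d_prime` is a name I made up.

```python
def d_and_d_prime(f11, f12, f21, f22):
    """Return (D, D') from the frequencies of A1B1, A1B2, A2B1, A2B2.

    Only valid for D > 0, where D_max = min(f_A1, f_B1) - f_A1 * f_B1.
    """
    f_a1 = f11 + f12
    f_b1 = f11 + f21
    d = f11 * f22 - f12 * f21
    d_max = min(f_a1, f_b1) - f_a1 * f_b1
    return d, d / d_max

populations = {
    "pop 1": (0.34, 0.16, 0.16, 0.34),
    "pop 2": (0.9, 0.0, 0.0, 0.1),
}
for name, freqs in populations.items():
    d, d_prime = d_and_d_prime(*freqs)
    print(name, round(d, 4), round(d_prime, 4))
```

Both populations share $D = 0.09$, but $D'$ separates them: 0.36 for pop 1 and 1 for pop 2.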

However, $D’$ also has its own problem. Consider the following scenario:

Suppose we genotyped 100 individuals in a population and counted the occurrences of each haplotype between loci A and B as follows:

| Haplotype | Count | Total |
|---|---|---|
| $A_1B_1$ | 50 | 100 |
| $A_1B_2$ | 49 | 100 |
| $A_2B_1$ | 0 | 100 |
| $A_2B_2$ | 1 | 100 |

We can get haplotype and allele frequencies: $$\begin{gathered} f_{A_1B_1} = 0.5,\ f_{A_1B_2} = 0.49 \\ f_{A_2B_1} = 0,\ f_{A_2B_2} = 0.01 \\ f_{A_1} = 0.99,\ f_{A_2} = 0.01 \\ f_{B_1} = 0.5,\ f_{B_2} = 0.5 \end{gathered}$$ And calculate $D'$: $$D' = \frac{f_{A_1B_1} f_{A_2B_2} - f_{A_1B_2} f_{A_2B_1}}{\min(f_{A_1}, f_{B_1}) - f_{A_1}f_{B_1}} = \frac{0.5 \times 0.01 - 0.49 \times 0}{0.5 - 0.5 \times 0.99} = 1$$

In this case, $D'$ indicates strong linkage between the two loci. However, looking at the data, we can’t confidently draw any conclusion about the linkage between the two loci. The reason is that all samples but one carry either the $A_1B_1$ or the $A_1B_2$ haplotype. The only one that carries an $A_2$ allele could simply be the result of a spontaneous mutation rather than inheritance from a parent. We do not have enough information to say whether or not the A locus is linked with the B locus because $A_2$ is almost entirely missing from the data.

In other words, $D'$ does not tell us how significant a linkage is or how confident we are in declaring a linkage disequilibrium. For this purpose, we turn to Pearson’s correlation coefficient.
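One way to see how little that single $A_2$ haplotype tells us: move it from $B_2$ to $B_1$ and recompute. The sketch below reports $|D'|$ and, as an assumption beyond the formula used above, normalizes negative $D$ by the usual bound $\min(f_{A_1}f_{B_1}, f_{A_2}f_{B_2})$. The "flipped" table is hypothetical.

```python
def abs_d_prime(n11, n12, n21, n22):
    """|D'| from haplotype counts of A1B1, A1B2, A2B1, A2B2."""
    n = n11 + n12 + n21 + n22
    f11, f12, f21, f22 = n11 / n, n12 / n, n21 / n, n22 / n
    f_a1, f_a2 = f11 + f12, f21 + f22
    f_b1, f_b2 = f11 + f21, f12 + f22
    d = f11 * f22 - f12 * f21
    if d >= 0:
        d_max = min(f_a1 * f_b2, f_a2 * f_b1)   # same bound as in the text
    else:
        d_max = min(f_a1 * f_b1, f_a2 * f_b2)   # assumed bound for D < 0
    return abs(d) / d_max if d_max > 0 else 0.0

observed = abs_d_prime(50, 49, 0, 1)   # counts from the table above
flipped = abs_d_prime(50, 49, 1, 0)    # hypothetical: the lone A2 sits on B1
print(observed, flipped)
```

Both tables give $|D'| = 1$, so a single haplotype is enough to report "complete" LD in either direction.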

Correlation coefficient ($\gamma^2$)

Consider the population above:

| Haplotype | Count | Total |
|---|---|---|
| $A_1B_1$ | 50 | 100 |
| $A_1B_2$ | 49 | 100 |
| $A_2B_1$ | 0 | 100 |
| $A_2B_2$ | 1 | 100 |

We calculated that $D=0.005$. Now, instead of standardizing it as we did with $D'$, we try a different approach: $$\gamma^2 = \frac{D^2}{f_{A_1}f_{A_2}f_{B_1}f_{B_2}}$$

We have $\gamma^2 = \frac{0.005^2}{0.99 \times 0.01 \times 0.5 \times 0.5} \approx 0.01$

Now this seems to align well with our assessment of the data! As a matter of fact, this is the square of the correlation coefficient between the two loci in this population!

Let’s arbitrarily assign 1 to an $A_1$ allele and -1 to an $A_2$ allele, and likewise 1 to $B_1$ and -1 to $B_2$. Now we can plot the data and calculate $\gamma$:

Correlation plot

In the above plot we have $R=0.1$, which matches what we calculated above ($\gamma^2 = 0.01$, so $\gamma = 0.1$)! Now that we have a correlation coefficient, we can use a $\chi^2$ test on our dataset:
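The match between $R$ and $\gamma$ can be reproduced without the plot. A small sketch using only the standard library, with the same arbitrary +1/-1 coding:

```python
from math import sqrt

# Haplotype counts from the table above.
counts = {("A1", "B1"): 50, ("A1", "B2"): 49, ("A2", "B1"): 0, ("A2", "B2"): 1}
code = {"A1": 1, "A2": -1, "B1": 1, "B2": -1}   # arbitrary +1 / -1 coding

# Expand the counts into one (x, y) pair per sampled haplotype.
xs, ys = [], []
for (a, b), c in counts.items():
    xs += [code[a]] * c
    ys += [code[b]] * c

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
r = cov / sqrt(var_x * var_y)                   # Pearson correlation

# The same quantity from D and the allele frequencies.
f = {h: c / n for h, c in counts.items()}
f_a1 = f[("A1", "B1")] + f[("A1", "B2")]
f_b1 = f[("A1", "B1")] + f[("A2", "B1")]
d = f[("A1", "B1")] * f[("A2", "B2")] - f[("A1", "B2")] * f[("A2", "B1")]
gamma2 = d ** 2 / (f_a1 * (1 - f_a1) * f_b1 * (1 - f_b1))

print(round(r, 4), round(r ** 2, 4), round(gamma2, 4))
```

The squared Pearson correlation and $D^2 / (f_{A_1}f_{A_2}f_{B_1}f_{B_2})$ agree, and the coding choice (+1/-1 versus, say, 0/1) does not change the result because correlation is invariant to linear rescaling.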

$$\begin{gathered} \chi_S^2=\gamma^2 \times N=0.01 \times 100=1 \\ P(\chi^2>\chi_S^2, df=1) = 0.317 \end{gathered}$$

The p-value is also shown in the above plot. Clearly, we don’t see a significant correlation (i.e. linkage) between the two loci.

This is the beauty of $\gamma^2$: it not only tells us the strength of a linkage ($\gamma^2$) but also indicates our confidence in such linkage given the data ($\chi^2$)!

Now remember, the absence of significance is NOT evidence of the absence of linkage! Look at the above figure and notice that the confidence interval is very wide towards $x=-1$. This is because we have only a single data point at $x=-1$. This could easily be a sampling error or, as mentioned above, a spontaneous mutation. It is possible that there are more individuals with the $A_2$ allele but we just didn’t include them in our samples for unknown reasons. In this case, since we can’t reliably estimate the frequencies of the $A_2B_1$ and $A_2B_2$ haplotypes, we can’t conclude with any confidence whether or not there is linkage between the two loci.
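The p-value can be reproduced with the standard library alone, since for one degree of freedom the $\chi^2$ upper tail reduces to a complementary error function. A sketch; `chi2_sf_df1` is a helper name I made up.

```python
from math import erfc, sqrt

def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for a chi-square variable with df = 1."""
    return erfc(sqrt(x / 2.0))

gamma2 = 0.005 ** 2 / (0.99 * 0.01 * 0.5 * 0.5)   # unrounded gamma^2
n = 100                                            # sample size from the table
chi2 = gamma2 * n

print(round(chi2, 3), round(chi2_sf_df1(chi2), 3))
print(round(chi2_sf_df1(1.0), 3))                  # rounded statistic from the text
```

With the rounded statistic of 1 this gives the 0.317 quoted above; the unrounded $\gamma^2$ gives a marginally smaller p-value, and neither is close to significant.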
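To illustrate how much of this comes down to sample size, here is a hypothetical sketch: keep the haplotype proportions fixed and only change $N$. The larger sample sizes are made up for illustration.

```python
from math import erfc, sqrt

gamma2 = 0.005 ** 2 / (0.99 * 0.01 * 0.5 * 0.5)   # from the table above

results = []
for n in (100, 1_000, 10_000):        # hypothetical sample sizes
    chi2 = gamma2 * n                 # chi-square statistic, df = 1
    p = erfc(sqrt(chi2 / 2.0))        # upper-tail p-value for df = 1
    results.append((n, chi2, p))
    print(n, round(chi2, 2), f"{p:.2g}")
```

The same weak correlation that is nowhere near significant at $N = 100$ would be significant with ten times as many samples, which is why the non-significant result here says more about the data than about the loci.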

Now why do we care about LD and haplotype?

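LD is a deviation from linkage equilibrium, the state in which alleles at different loci associate at random. It is the multi-locus counterpart of a deviation from Hardy-Weinberg equilibrium, which describes genotype frequencies at a single locus.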

Sichong Peng
PhD student

I study equine genetics/genomics at UC Davis Veterinary school. My primary interest is functional annotation of non-model organisms and its applications.