Computer Algorithms Applied to Allele Frequency Estimation


Abstract

In this report, we explore maximum-likelihood estimation using the iterative Newton-Raphson method. Because this algorithm is not guaranteed to converge, the step size of each update may need to be modified. In some cases, the original problem can be reformulated as a "missing data" problem, which admits more robust methods such as the Expectation-Maximization algorithm. We illustrate these concepts with an example from statistical genetics.

Introduction

The ABO gene, located on chromosome 9, has three alleles: A, B, and O.

An individual's ABO blood type is determined by the inheritance of one of these three alleles from each parent, resulting in various possible combinations:

Parent alleles   Genotype   Phenotype
A + A            AA         A
A + O            AO         A
A + B            AB         AB
B + B            BB         B
B + O            BO         B
O + O            OO         O

Both the A and B alleles are dominant over O, so individuals with an AO genotype exhibit the A phenotype and individuals with a BO genotype exhibit the B phenotype. Individuals with type O blood have the OO genotype.


The A and B alleles are codominant, meaning that if an individual inherits an A allele from one parent and a B allele from the other, their phenotype will be AB.

Table 1: Genotype to Phenotype

Genotype   Phenotype
AA         A
BB         B
AB         AB
OO         O
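For later reference, the genotype-to-phenotype map is just a small lookup table; the snippet below is illustrative only, and the name PHENOTYPE is ours (Python is used for all code sketches in this report):

```python
# Genotype-to-phenotype lookup implied by Table 1 and the dominance
# rules above (AO and BO are the heterozygous cases from the text).
PHENOTYPE = {
    "AA": "A", "AO": "A",   # A is dominant over O
    "BB": "B", "BO": "B",   # B is dominant over O
    "AB": "AB",             # A and B are codominant
    "OO": "O",
}
```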

In a large random sample obtained from Berlin, we observe the following blood-type counts:

  • 9123 blood type A (nA = 9123)
  • 2987 blood type B (nB = 2987)
  • 1269 blood type AB (nAB = 1269)
  • 7725 blood type O (nO = 7725)

The goal of this study is to estimate the allele frequencies of A, B, and O, denoted p = freq(A), q = freq(B), and r = freq(O) = 1 - p - q, respectively.


Assuming Hardy–Weinberg equilibrium, the genotype frequencies can be calculated from the trinomial expansion of (p + q + r)², each term of which corresponds to one genotype:
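```latex
(p+q+r)^2
  = \underbrace{p^2}_{AA} + \underbrace{2pr}_{AO}
  + \underbrace{q^2}_{BB} + \underbrace{2qr}_{BO}
  + \underbrace{2pq}_{AB} + \underbrace{r^2}_{OO} = 1 .
```

Since the observed blood-type counts follow a multinomial distribution, the likelihood function can be expressed as: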

L(p, q | nA, nB, nAB, nO) = [n! / (nA! nB! nAB! nO!)] · (p² + 2p(1 - p - q))^nA · (q² + 2q(1 - p - q))^nB · (2pq)^nAB · ((1 - p - q)²)^nO

where:

  • nA, nB, nAB, and nO represent the observed counts for each blood type.
  • n is the total sample size.
  • p, q, and r are the allele frequencies of A, B, and O, respectively.

By taking the natural logarithm of the likelihood function, we obtain the log-likelihood function:

l = ln L(p, q) ∼ nA ln(p² + 2p(1 - p - q)) + nB ln(q² + 2q(1 - p - q)) + nAB ln(2pq) + nO ln((1 - p - q)²)
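As a minimal sketch of this function in code (the counts are the Berlin data above; the name log_lik is ours):

```python
import numpy as np

# Observed phenotype counts from the Berlin sample.
N_A, N_B, N_AB, N_O = 9123, 2987, 1269, 7725

def log_lik(p, q):
    """Log-likelihood up to an additive constant (the multinomial
    coefficient), with r = 1 - p - q; -inf outside the triangle."""
    r = 1.0 - p - q
    if p <= 0 or q <= 0 or r <= 0:
        return -np.inf
    return (N_A * np.log(p**2 + 2*p*r) + N_B * np.log(q**2 + 2*q*r)
            + N_AB * np.log(2*p*q) + N_O * np.log(r**2))
```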

To estimate the allele frequencies, we compute the partial derivatives of the log-likelihood function with respect to p and q (writing r = 1 - p - q for brevity):

∂l/∂p = nA · [2r / (p² + 2pr)] + nB · [-2q / (q² + 2qr)] + nAB · [1 / p] + nO · [-2 / r]

∂l/∂q = nA · [-2p / (p² + 2pr)] + nB · [2r / (q² + 2qr)] + nAB · [1 / q] + nO · [-2 / r]
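Continuing the sketch above (grad_log_lik is our name; the counts and numpy import come from the previous block):

```python
def grad_log_lik(p, q):
    """Gradient (dl/dp, dl/dq) of log_lik, from the formulas above."""
    r = 1.0 - p - q
    d_a = p**2 + 2*p*r   # denominator of the blood-type-A terms
    d_b = q**2 + 2*q*r   # denominator of the blood-type-B terms
    dp = N_A * 2*r/d_a - N_B * 2*q/d_b + N_AB/p - 2*N_O/r
    dq = -N_A * 2*p/d_a + N_B * 2*r/d_b + N_AB/q - 2*N_O/r
    return np.array([dp, dq])
```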

The motivation for using computer algorithms in this context is that setting these partial derivatives to zero yields equations that cannot be solved in closed form, so the maximum-likelihood estimates must be found numerically.

Methods

Grid Search

The purpose of grid search is to visualize the likelihood function. It involves evaluating the log-likelihood over a fine grid of values of p and q. Since p, q, and r are all positive, the constraints are 0 < p, q, r < 1. The interval [0, 1] is evenly divided into 400 intervals for both p and q, so [0, 1]² is divided into 160,000 sub-squares, and the log-likelihood is evaluated at each corner. Note that the log-likelihood is only defined within a triangular region because of the constraint r = 1 - p - q > 0; for {(p, q) | p + q ≥ 1}, missing values (NAs) are assigned. The resulting grid is also convenient for subsequent 3D visualization.
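A sketch of the grid search using log_lik from above (the reported optima fall on a 400-point-per-axis grid over [0, 1], which is what np.linspace assumes here; skipped points play the role of the NAs):

```python
# Exhaustive evaluation over the triangle {p > 0, q > 0, p + q < 1}.
grid = np.linspace(0.0, 1.0, 400)
best = (-np.inf, np.nan, np.nan)
for p in grid:
    for q in grid:
        if p + q >= 1.0:
            continue                  # outside the triangle: NA
        ll = log_lik(p, q)
        if ll > best[0]:
            best = (ll, p, q)
print(best)  # cf. Results: (-24823.11, 0.2882206, 0.1052632)
```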

Newton-Raphson

Instead of directly searching for the maximum of the log-likelihood, we aim to find the point (p0, q0) where ∇l(p0, q0) = 0 (the zero vector). We then verify that (p0, q0) is indeed a maximum using the second partial derivative test. The classical numerical method for finding such a point is the Newton-Raphson method.

The algorithm begins with an initial guess (p1, q1). To measure how close ∇l(p1, q1) is to the zero vector, we compute the Euclidean norm d = ‖∇l(p1, q1)‖. If d is less than a predefined small value ε, the algorithm has converged. If not, the following steps are executed:

We compute the Hessian matrix H(p1, q1) as follows:

H(p1, q1) = ( ∂²/∂p² l(p1, q1)    ∂²/∂p∂q l(p1, q1) )
            ( ∂²/∂q∂p l(p1, q1)   ∂²/∂q² l(p1, q1)  )

Where:

∂²/∂p² l = nA · [-2(p² + 2pr + 2r²) / (p² + 2pr)²] + nB · [-4q² / (q² + 2qr)²] - nAB / p² - 2nO / r²
∂²/∂q² l = nA · [-4p² / (p² + 2pr)²] + nB · [-2(q² + 2qr + 2r²) / (q² + 2qr)²] - nAB / q² - 2nO / r²
∂²/∂p∂q l = ∂²/∂q∂p l = nA · [-2p² / (p² + 2pr)²] + nB · [-2q² / (q² + 2qr)²] - 2nO / r²

(Evaluated at the MLE reported below, these expressions reproduce the Hessian given in the Results section.)
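In code, the Hessian entries above become (hessian is our name; counts and numpy as before):

```python
def hessian(p, q):
    """2x2 Hessian of log_lik, from the second derivatives above."""
    r = 1.0 - p - q
    d_a = p**2 + 2*p*r
    d_b = q**2 + 2*q*r
    h_pp = (-N_A * 2*(d_a + 2*r**2)/d_a**2 - N_B * 4*q**2/d_b**2
            - N_AB/p**2 - 2*N_O/r**2)
    h_qq = (-N_A * 4*p**2/d_a**2 - N_B * 2*(d_b + 2*r**2)/d_b**2
            - N_AB/q**2 - 2*N_O/r**2)
    h_pq = -N_A * 2*p**2/d_a**2 - N_B * 2*q**2/d_b**2 - 2*N_O/r**2
    return np.array([[h_pp, h_pq], [h_pq, h_qq]])
```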

Then, we update (p2, q2) = (p1, q1) - [H(p1, q1)]⁻¹ ∇l(p1, q1). This process is iterated until convergence is achieved (d < ε).

It's worth noting that the Newton-Raphson method is not guaranteed to converge. To improve convergence in this constrained and small parameter space, one can introduce a step constant η into the update: (p2, q2) = (p1, q1) - η[H(p1, q1)]⁻¹ ∇l(p1, q1), where η = 1 recovers the classical method.
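Putting the pieces together, a sketch of the (optionally damped) iteration; the name newton, the tolerance eps, and max_iter are our choices, not taken from the report:

```python
def newton(p, q, eta=1.0, eps=1e-8, max_iter=100):
    """Newton-Raphson on the log-likelihood: eta = 1 is the classical
    method, eta < 1 the modified (damped) step discussed above."""
    x = np.array([p, q], dtype=float)
    for steps in range(1, max_iter + 1):
        g = grad_log_lik(*x)
        if np.linalg.norm(g) < eps:        # d < eps: converged
            return x, steps
        x = x - eta * np.linalg.solve(hessian(*x), g)
        if x.min() <= 0.0 or x.sum() >= 1.0:
            raise RuntimeError("iterate left the parameter space")
    raise RuntimeError("no convergence within max_iter")

# Example: newton(0.2, 0.2) should land near (0.2876856, 0.106555).
```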


Expectation-Maximization Algorithm

The primary challenge in using the ordinary calculus approach to find the maximum-likelihood estimate lies in the presence of latent variables: the genotype counts nAA, nAO, nBB, and nBO are not directly observed; only the sums nA = nAA + nAO and nB = nBB + nBO are. An effective approach to such a missing data problem is the Expectation-Maximization (EM) algorithm.

E-step:

Given the initial parameter values p0 and q0, with r0 = 1 - p0 - q0, the following expected values of the latent genotype counts are computed:

  • E[nAA] = n0AA = p0² / (p0² + 2p0r0) · nA
  • E[nAO] = n0AO = 2p0r0 / (p0² + 2p0r0) · nA
  • E[nBB] = n0BB = q0² / (q0² + 2q0r0) · nB
  • E[nBO] = n0BO = 2q0r0 / (q0² + 2q0r0) · nB

M-step:

The parameter values p1 and q1 are updated using the MLE of the complete-data likelihood function:

  • p1 = pMLE = (2n0AA + n0AO + nAB) / (2n)
  • q1 = qMLE = (2n0BB + n0BO + nAB) / (2n)

Subsequently, the new values p1 and q1 are fed back into the E-step, and the process repeats until the convergence criterion d((p_new, q_new), (p_old, q_old)) < ε is met, i.e., until the Euclidean distance between successive iterates falls below a small predefined ε. For a detailed derivation of the E-step and M-step, please refer to the appendix.
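A sketch of the full iteration (em, eps, and max_iter are our names/choices; counts and numpy as in the earlier blocks):

```python
def em(p, q, eps=1e-8, max_iter=500):
    """EM for the ABO allele frequencies, alternating the E- and
    M-steps above until successive iterates are within eps."""
    n = N_A + N_B + N_AB + N_O
    for steps in range(1, max_iter + 1):
        r = 1.0 - p - q
        # E-step: expected latent genotype counts given (p, q).
        n_aa = N_A * p**2 / (p**2 + 2*p*r)
        n_ao = N_A - n_aa
        n_bb = N_B * q**2 / (q**2 + 2*q*r)
        n_bo = N_B - n_bb
        # M-step: complete-data MLE ("allele counting").
        p_new = (2*n_aa + n_ao + N_AB) / (2*n)
        q_new = (2*n_bb + n_bo + N_AB) / (2*n)
        if np.hypot(p_new - p, q_new - q) < eps:
            return p_new, q_new, steps
        p, q = p_new, q_new
    raise RuntimeError("no convergence within max_iter")

# Example: em(0.2, 0.2) should land near (0.2876842, 0.1065549).
```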

Results

The three highest log-likelihood values obtained from the grid search are as follows:

      p           q       log-likelihood
  0.2882206   0.1052632      -24823.11
  0.2882206   0.1077694      -24823.13
  0.2857143   0.1077694      -24823.29

Based on this grid estimate, the log-likelihood function is maximized at approximately p = 0.2882 and q = 0.1053.

The left graph is a 3D surface plot computed from the 160,000 grid points. Because the log-likelihood falls away steeply near the boundary of the parameter space, the shape of the region of highest likelihood is unclear there. The right plot, computed from 40,000 points, sacrifices some accuracy but visualizes the area of interest more clearly: its purple region marks the highest likelihood, concentrated at small values of p and q.

Analysis and Consistency

Based on the previously generated graph, our initial guess for the Newton-Raphson algorithm was set to (0.2, 0.2).

Newton-Raphson Algorithm

The Newton-Raphson algorithm converged in 6 steps, giving the maximum-likelihood estimate (MLE) (0.2876856, 0.106555).

Upon convergence, the Hessian matrix at this critical point was evaluated:

H = ( -178604.05    -53661.04 )
    (  -53661.04   -434906.59 )

With a determinant |H| > 0 and H11 < 0, it can be concluded that this critical point corresponds to a local maximum, as per the second partial derivative test.
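This check is easy to script with the hessian sketch from the Methods section (equivalently, both eigenvalues of H are negative, i.e., H is negative definite):

```python
H = hessian(0.2876856, 0.106555)
assert np.linalg.det(H) > 0 and H[0, 0] < 0  # second partial derivative test
print(np.linalg.eigvalsh(H))                 # both eigenvalues negative
```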

Considering that graphing the likelihood function might not always be feasible, sampling initial points from a uniform distribution over the triangular region is a reasonable alternative (a sketch of such a sampler follows the run summaries below). The algorithm was run 5 times with random initial points, yielding the following results:

1st run:

  • Initial point: (0.6774393, 0.1207194)
  • Newton-Raphson algorithm converged in 27 steps
  • MLE achieved at (0.2876856, 0.106555)

2nd run:

  • Initial point: (0.2191412, 0.336263)
  • The algorithm did not converge (the first update left the parameter space; see the Discussion)

3rd run:

  • Initial point: (0.2370166, 0.01267092)
  • Newton-Raphson algorithm converged in 31 steps
  • MLE achieved at (0.2876856, 0.106555)

4th run:

  • Initial point: (0.7543569, 0.0003512141)
  • Newton-Raphson algorithm converged in 34 steps
  • MLE achieved at (0.2876856, 0.106555)

5th run:

  • Initial point: (0.01996564, 0.8115543)
  • Newton-Raphson algorithm converged in 44 steps
  • MLE achieved at (0.2876856, 0.106555)
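The random starting points used above can be drawn uniformly from the triangular parameter space, for example with the reflection trick below (random_start is our name; the report does not specify its sampler):

```python
rng = np.random.default_rng()   # seed not specified in the report

def random_start():
    """Uniform draw from the open triangle {p > 0, q > 0, p + q < 1}:
    draw a uniform point in the unit square and reflect it if it
    falls in the upper triangle."""
    p, q = rng.random(2)
    if p + q >= 1.0:
        p, q = 1.0 - p, 1.0 - q
    return p, q
```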

It is important to note that the Newton method is not guaranteed to converge. To improve convergence, a modified Newton-Raphson algorithm was implemented with η = 0.5, half the step of the traditional Newton-Raphson update (η = 1), and run again from random initial points:

1st run:

  • Initial point: (0.6469028, 0.1392)
  • Modified Newton-Raphson algorithm converged in 27 steps
  • MLE achieved at (0.2876856, 0.106555)

2nd run:

  • Initial point: (0.6185018, 0.1819331)
  • Modified Newton-Raphson algorithm converged in 28 steps
  • MLE achieved at (0.2876856, 0.106555)

3rd run:

  • Initial point: (0.1360972, 0.05821356)
  • Modified Newton-Raphson algorithm converged in 31 steps
  • MLE achieved at (0.2876856, 0.106555)

4th run:

  • Initial point: (0.1291526, 0.3423457)
  • Modified Newton-Raphson algorithm converged in 34 steps
  • MLE achieved at (0.2876856, 0.106555)

5th run:

  • Initial point: (0.002582699, 0.6186041)
  • Modified Newton-Raphson algorithm converged in 44 steps
  • MLE achieved at (0.2876856, 0.106555)

As with Newton-Raphson, the use of (0.2, 0.2) as the initial point for the EM algorithm was motivated by the 3D plot.

EM Algorithm

The EM algorithm converged in 5 steps, resulting in the MLE at (0.2876842, 0.1065549).

To verify the consistency of this result, the EM algorithm was repeated for the same random starting points as used in the Newton-Raphson algorithm. The following results were obtained:

1st run:

  • Initial point: (0.6774393, 0.1207194)
  • EM algorithm converged in 7 steps
  • MLE achieved at (0.2876861, 0.106555)

2nd run:

  • Initial point: (0.2191412, 0.336263)
  • EM algorithm converged in 5 steps
  • MLE achieved at (0.2876868, 0.1065551)

3rd run:

  • Initial point: (0.2370166, 0.01267092)
  • EM algorithm converged in 5 steps
  • MLE achieved at (0.2876833, 0.1065548)

4th run:

  • Initial point: (0.7543569, 0.0003512141)
  • EM algorithm converged in 7 steps
  • MLE achieved at (0.287686, 0.106555)

5th run:

  • Initial point: (0.01996564, 0.8115543)
  • EM algorithm converged in 5 steps
  • MLE achieved at (0.287684, 0.1065549)

Discussion

Grid search is the most intuitive and straightforward method, but it has limitations. It requires a solid understanding of the parameter space, since it is not feasible to search over all real numbers. In the ABO example, the parameter space Θ = {(p, q) ∈ ℝ² | p, q > 0, p + q < 1} is easy to define. However, no matter how fine the granularity, a grid covers only finitely many parameter pairs, so the exact maximizer is almost surely missed. Moreover, as the dimensionality of the problem increases, the computational cost becomes prohibitive. The practical goal is to get as close as possible to the maximum within reasonable computational constraints.

From the trajectory plots shown above, it is evident that when the starting point is far from the maximum, the EM algorithm moves toward the maximum more rapidly than the Newton method. Notably, there is one case in which the Newton method failed to converge. The failure occurred because the Newton update places no control on the length of the step, so an update can land outside the parameter space. For instance, starting from (0.2191412, 0.336263), the first Newton update led to (0.3268845, -0.1251772), where q < 0. To address this, the step size of each update was scaled down, as described above. However, smaller steps mean slower convergence, so a trade-off between convergence speed and computational cost must be considered.

In contrast, the EM algorithm is more robust: each EM update is guaranteed not to decrease the likelihood. Both the Newton and EM algorithms may converge to a local maximum, so to assess consistency it is advisable to run both algorithms multiple times from random starting points. Grid search, on the other hand, surveys the whole parameter space and so cannot be trapped by a local maximum, although it is unlikely to land exactly on the maximizer of a continuous function.

Appendix

Detailed Mathematical Derivation of E-step and M-step

The genotype at the ith observed data point is denoted as Gi. The likelihood of complete data is given by:

Lcomplete ∝ Πi=1 to n (p²)^1{Gi=AA} · (2pr)^1{Gi=AO} · (q²)^1{Gi=BB} · (2qr)^1{Gi=BO} · (2pq)^1{Gi=AB} · (r²)^1{Gi=OO}

So, the log-likelihood of complete data is:

lcomplete ∼ Σi=1 to n [1{Gi=AA} ln(p²) + 1{Gi=AO} ln(2pr) + 1{Gi=BB} ln(q²) + 1{Gi=BO} ln(2qr) + 1{Gi=AB} ln(2pq) + 1{Gi=OO} ln(r²)]

Given some initial values p0 and q0, the only uncertainty in computing E[lcomplete | p0, q0] lies in which of the genotypes AA, AO, BB, and BO the type-A and type-B individuals carry. Because lcomplete is linear in the missing counts and expectation is linear, it suffices to compute E[nAA], E[nAO], E[nBB], and E[nBO]; these determine E[lcomplete | p0, q0] completely. Let r0 = 1 - p0 - q0.

E[nAA | nA, p0, q0] = E(Σi=1 to n 1{Gi=AA} | nA) = Σi=1 to n P(Gi=AA | nA) = nA · p0² / (p0² + 2p0r0) = n0AA

(each of the nA type-A individuals carries genotype AA with conditional probability p0² / (p0² + 2p0r0) under Hardy–Weinberg proportions, and all other individuals contribute zero)

E[nAO | nA, p0, q0] = nA - E[nAA] = (1 - p0² / (p0² + 2p0r0)) · nA = 2p0r0 / (p0² + 2p0r0) · nA = n0AO

E[nBB | nB, p0, q0] = E(Σi=1 to n 1{Gi=BB} | nB) = Σi=1 to n P(Gi=BB | nB) = nB · q0² / (q0² + 2q0r0) = n0BB

E[nBO | nB, p0, q0] = nB - E[nBB] = (1 - q0² / (q0² + 2q0r0)) · nB = 2q0r0 / (q0² + 2q0r0) · nB = n0BO

Now we have:

E[lcomplete | p0, q0] ∼ n0AA ln(p²) + n0AO ln(2pr) + n0BB ln(q²) + n0BO ln(2qr) + nAB ln(2pq) + nO ln(r²)

Note that the expected counts are fixed numbers here, so this expression is a function of (p, q) alone (with r = 1 - p - q).

For the M-step, we find the maximizer of E[lcomplete] over (p, q) using ordinary calculus:

∂/∂p E[lcomplete] = 2n0AA / p + n0AO (r - p) / (pr) - n0BO / r + nAB / p - 2nO / r = 0

This leads to: (2n0AA + n0AO + nAB) r = (n0AO + n0BO + 2nO) p

∂/∂q E[lcomplete] = 2n0BB / q + n0BO (r - q) / (qr) - n0AO / r + nAB / q - 2nO / r = 0

This leads to: (2n0BB + n0BO + nAB) r = (n0AO + n0BO + 2nO) q

Substituting r = 1 - p - q and solving the two equations simultaneously yields closed-form solutions for p and q, as shown below.
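Explicitly, writing A, B, C (our shorthand) for the three bracketed totals:

```latex
\begin{aligned}
A &= 2n^{0}_{AA}+n^{0}_{AO}+n_{AB}, \qquad
B  = 2n^{0}_{BB}+n^{0}_{BO}+n_{AB}, \qquad
C  = n^{0}_{AO}+n^{0}_{BO}+2n_{O}, \\
Ar &= Cp, \quad Br = Cq
\;\Longrightarrow\;
1 = p+q+r = \frac{(A+B+C)\,r}{C},
\quad\text{and } A+B+C = 2n, \\
\text{so}\quad r &= \frac{C}{2n}, \qquad
p = \frac{A}{2n} = \frac{2n^{0}_{AA}+n^{0}_{AO}+n_{AB}}{2n}, \qquad
q = \frac{B}{2n} = \frac{2n^{0}_{BB}+n^{0}_{BO}+n_{AB}}{2n},
\end{aligned}
```

which are exactly the M-step updates p1 and q1 used above.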
