In this report, we explore the estimation of Maximum-Likelihood functions using the iterative Newton-Raphson method. However, this algorithm is not guaranteed to converge, prompting the need to modify the step-size updates in each iteration. In some cases, the original problem can be reformulated as a "missing" data problem, leading to the application of more robust methods like the Expectation-Maximization algorithm. We illustrate these concepts using an example from statistical genetics.
The ABO gene, located on chromosome 9, has three alleles: A, B, and O.
An individual's ABO blood type is determined by the inheritance of one of these three alleles from each parent, resulting in various possible combinations:
Parent 1 Allele | Parent 2 Allele | Genotype | Phenotype |
---|---|---|---|
A | A | AA | A |
A | O | AO | A |
B | B | BB | B |
B | O | BO | B |
A | B | AB | AB |
O | O | OO | O |
Both A and B alleles are dominant over O, so individuals with an AO genotype exhibit an A phenotype. Individuals with type O blood have OO genotypes.
The A and B alleles are codominant, meaning that if an individual inherits an A allele from one parent and a B allele from the other, their phenotype will be AB.
Genotype | Phenotype |
---|---|
AA | A |
AO | A |
BB | B |
BO | B |
AB | AB |
OO | O |
In a large random sample obtained from Berlin, we observe the following blood type frequencies:
The goal of this study is to estimate the allele frequencies of A, B, and O, denoted p = freq(A), q = freq(B), and r = freq(O) = 1 - p - q, respectively.
Assuming Hardy–Weinberg equilibrium, the genotype frequencies can be calculated from the trinomial expansion of (p + q + r)². The blood type distribution follows a multinomial distribution, so the likelihood function can be expressed as:
L(p, q | nA, nB, nAB, nO) = [n! / (nA! nB! nAB! nO!)] · (p² + 2p(1 - p - q))^nA · (q² + 2q(1 - p - q))^nB · (2pq)^nAB · ((1 - p - q)²)^nO
where nA, nB, nAB, and nO are the observed counts of the four blood types and n = nA + nB + nAB + nO is the sample size.
By taking the natural logarithm of the likelihood function, we obtain the log-likelihood function:
l = ln L(p, q) ∼ nA · ln(p² + 2p(1 - p - q)) + nB · ln(q² + 2q(1 - p - q)) + nAB · ln(2pq) + nO · ln((1 - p - q)²)
where ∼ denotes equality up to an additive constant (the multinomial coefficient), which does not affect the maximization.
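As a concrete sketch, the log-likelihood can be evaluated numerically. The blood-type counts below are assumptions for illustration only, since the observed Berlin table is not reproduced in this excerpt; they were chosen to be consistent with the estimates reported later in the report.

```python
from math import log

# Assumed blood-type counts (the observed Berlin frequency table is not
# reproduced here); chosen for consistency with the estimates reported below.
nA, nB, nAB, nO = 9123, 2987, 1269, 7725

def loglik(p, q):
    """Log-likelihood l(p, q), up to the additive multinomial constant."""
    r = 1.0 - p - q
    return (nA * log(p**2 + 2*p*r)      # type A: genotypes AA and AO
            + nB * log(q**2 + 2*q*r)    # type B: genotypes BB and BO
            + nAB * log(2*p*q)          # type AB
            + nO * log(r**2))           # type O: genotype OO
```

Candidate parameter values can then be compared directly; larger log-likelihood values are better.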
To proceed with estimating the allele frequencies, we need to compute the partial derivatives of the log-likelihood function with respect to p and q:
∇p l = nA · [2(1 - p - q) / (p² + 2p(1 - p - q))] - nB · [2q / (q² + 2q(1 - p - q))] + nAB/p - 2nO/(1 - p - q)
∇q l = -nA · [2p / (p² + 2p(1 - p - q))] + nB · [2(1 - p - q) / (q² + 2q(1 - p - q))] + nAB/q - 2nO/(1 - p - q)
The motivation for using computer algorithms in this context is that the likelihood equations ∇l = 0 cannot be solved in closed form, so the maximum likelihood estimates must be found numerically.
The purpose of the grid search is to visualize the likelihood function by evaluating the log-likelihood over a mesh of candidate values for p and q. Since p, q, and r are all positive, the constraints are 0 < p, q, r < 1. The interval [0, 1] is evenly divided into 400 intervals for both p and q, so [0, 1]² is divided into 160,000 sub-squares, and the likelihood is evaluated at each corner of these sub-squares. Note that the log-likelihood is only defined within a triangular region because of the constraint r = 1 - p - q > 0; for {(p, q) | p + q ≥ 1}, NA values are assigned. This grid evaluation is also convenient for the subsequent 3D visualization.
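A minimal grid search along these lines might look as follows. The counts are assumptions for illustration (the observed Berlin table is not reproduced in this excerpt), chosen for consistency with the estimates reported later.

```python
from math import log

# Assumed blood-type counts (the observed Berlin table is not reproduced here).
nA, nB, nAB, nO = 9123, 2987, 1269, 7725

def loglik(p, q):
    r = 1.0 - p - q
    return (nA * log(p**2 + 2*p*r) + nB * log(q**2 + 2*q*r)
            + nAB * log(2*p*q) + nO * log(r**2))

# Evaluate on a 400 x 400 grid over (0, 1)^2; points with p + q >= 1 lie
# outside the triangular parameter space and are skipped (the "NA" cells).
m = 400
best = None
for i in range(1, m):
    for j in range(1, m):
        p, q = i / m, j / m
        if p + q >= 1.0:
            continue
        ll = loglik(p, q)
        if best is None or ll > best[0]:
            best = (ll, p, q)

print(best)  # (highest log-likelihood on the grid, p, q)
```

The grid maximizer is only accurate up to the grid spacing (here 1/400), which is why the iterative methods below are still needed.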
Instead of directly searching for maximum values of log-likelihood, we aim to find the point (p0, q0) where ∇l(p0, q0) = ~0. Subsequently, we verify if (p0, q0) is indeed a maximum point using the second partial derivative test. The classical numerical method employed to find this point is the Newton-Raphson method.
The algorithm begins with an initial guess of (p1, q1). To measure the proximity of ∇l(p1, q1) to ~0, we calculate the Euclidean metric, denoted as d(∇l(p1, q1), ~0). If d is less than a predefined small value ε, the algorithm has converged. If not, the following steps are executed:
We compute the Hessian matrix H(p1, q1) as follows:
H(p1, q1) =
| ∂²/∂p² l(p1, q1)   ∂²/∂p∂q l(p1, q1) |
| ∂²/∂q∂p l(p1, q1)  ∂²/∂q² l(p1, q1)  |
Where:
With r = 1 - p - q, the entries are:
∂²/∂p² l = nA · [-2(p² + 2pr + 2r²) / (p² + 2pr)²] + nB · [-4q² / (q² + 2qr)²] - nAB/p² - 2nO/r²
∂²/∂q² l = nA · [-4p² / (p² + 2pr)²] + nB · [-2(q² + 2qr + 2r²) / (q² + 2qr)²] - nAB/q² - 2nO/r²
∂²/∂p∂q l = ∂²/∂q∂p l = nA · [-2p² / (p² + 2pr)²] + nB · [-2q² / (q² + 2qr)²] - 2nO/r²
Then, we update (p2, q2) = (p1, q1) - [H(p1, q1)]⁻¹ ∇l(p1, q1). This process is iterated until convergence is achieved (d < ε).
It is worth noting that the Newton-Raphson method is not guaranteed to converge. To potentially improve convergence in this constrained, small parameter space, one can introduce a step-size constant η in the update equation: (p2, q2) = (p1, q1) - η[H(p1, q1)]⁻¹ ∇l(p1, q1).
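A sketch of the (optionally damped) Newton-Raphson iteration, using the analytic gradient and Hessian given above. The blood-type counts are assumptions (the observed table is not reproduced in this excerpt), chosen for consistency with the estimates reported later.

```python
from math import hypot

# Assumed blood-type counts (illustration only).
nA, nB, nAB, nO = 9123, 2987, 1269, 7725

def grad_hess(p, q):
    """Analytic gradient and Hessian of the log-likelihood at (p, q)."""
    r = 1.0 - p - q
    Dp = p*p + 2*p*r          # p^2 + 2pr
    Dq = q*q + 2*q*r          # q^2 + 2qr
    gp = nA*2*r/Dp - nB*2*q/Dq + nAB/p - 2*nO/r
    gq = -nA*2*p/Dp + nB*2*r/Dq + nAB/q - 2*nO/r
    hpp = -nA*2*(p*p + 2*p*r + 2*r*r)/Dp**2 - nB*4*q*q/Dq**2 - nAB/p**2 - 2*nO/r**2
    hqq = -nA*4*p*p/Dp**2 - nB*2*(q*q + 2*q*r + 2*r*r)/Dq**2 - nAB/q**2 - 2*nO/r**2
    hpq = -nA*2*p*p/Dp**2 - nB*2*q*q/Dq**2 - 2*nO/r**2
    return (gp, gq), (hpp, hpq, hqq)

def newton(p, q, eta=1.0, eps=1e-6, max_iter=100):
    """Damped Newton-Raphson: (p, q) <- (p, q) - eta * H^{-1} grad."""
    for _ in range(max_iter):
        (gp, gq), (hpp, hpq, hqq) = grad_hess(p, q)
        if hypot(gp, gq) < eps:        # Euclidean distance of grad to ~0
            break
        det = hpp*hqq - hpq*hpq        # invert the 2x2 Hessian directly
        dp = (hqq*gp - hpq*gq) / det
        dq = (hpp*gq - hpq*gp) / det
        p, q = p - eta*dp, q - eta*dq
    return p, q

p_hat, q_hat = newton(0.2, 0.2)                  # full Newton step, eta = 1
p_damped, q_damped = newton(0.2, 0.2, eta=0.5)   # damped variant
# both runs should converge to the same maximizer
```

Solving the 2x2 system directly avoids any linear-algebra dependency; for higher-dimensional problems one would factor the Hessian instead.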
The primary challenge in using the ordinary calculus approach to find the Maximum Likelihood Estimation (MLE) lies in the presence of latent variables: the genotype counts nAA, nAO, nBB, and nBO are not directly observed; only the phenotype counts nA = nAA + nAO, nB = nBB + nBO, nAB, and nO are. An effective approach for addressing such a missing data problem is the Expectation-Maximization (EM) algorithm.
Given the initial parameter values p0 and q0, with r0 = 1 - p0 - q0, the following expected genotype counts are computed (E-step):
n0AA = [p0² / (p0² + 2p0r0)] · nA, n0AO = [2p0r0 / (p0² + 2p0r0)] · nA,
n0BB = [q0² / (q0² + 2q0r0)] · nB, n0BO = [2q0r0 / (q0² + 2q0r0)] · nB
The parameter values p1 and q1 are then updated using the MLE of the complete-data likelihood function (M-step):
p1 = (2n0AA + n0AO + nAB) / (2n), q1 = (2n0BB + n0BO + nAB) / (2n)
Subsequently, the new values p1 and q1 are used in the next E-step, and this process continues until the convergence criterion d((pnew, qnew), (pold, qold)) < ε is met, where ε is a small predefined value and d is the Euclidean distance. For a detailed derivation of the E-step and M-step, please refer to the appendix.
The three highest likelihoods obtained from the grid search are as follows:
p | q | Log-likelihood |
---|---|---|
0.2882206 | 0.1052632 | -24823.11 |
0.2882206 | 0.1077694 | -24823.13 |
0.2857143 | 0.1077694 | -24823.29 |
Based on this estimation, the log-likelihood function is maximized near p ≈ 0.2882 and q ≈ 0.1053.
The left graph represents a 3D surface plot calculated using 160,000 points. Due to the uncontrolled behavior near the boundary, the shape of the region with the highest likelihood is unclear. The right plot, calculated with 40,000 points, sacrifices some accuracy but provides a clearer visualization of the area of interest. In the right graph, the purple region represents the area with the highest likelihood, focusing on small values of p and q.
Based on the previously generated graph, our initial guess for the Newton-Raphson algorithm was set to (0.2, 0.2).
The Newton-Raphson algorithm converged in 6 steps, resulting in the Maximum Likelihood Estimation (MLE) at (0.2876856, 0.106555).
Upon convergence, the Hessian matrix at this critical point was evaluated:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | -178604.05 | -53661.04 |
Row 2 | -53661.04 | -434906.59 |
With a determinant |H| > 0 and H11 < 0, it can be concluded that this critical point corresponds to a local maximum, as per the second partial derivative test.
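This check is easy to reproduce numerically from the reported Hessian entries:

```python
# Hessian entries at the critical point, as reported above.
h11, h12, h22 = -178604.05, -53661.04, -434906.59

det = h11 * h22 - h12 * h12   # determinant of the symmetric 2x2 Hessian
# Second partial derivative test: det > 0 and h11 < 0 imply a local maximum.
print(det > 0 and h11 < 0)    # -> True
```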
Considering that graphing the likelihood function might not always be feasible, sampling initial points from a uniform distribution within the triangular region is a reasonable alternative. The algorithm was run 5 times with random initial points, yielding the following results:
It is important to note that the Newton method is not guaranteed to converge. To potentially enhance convergence, a modified Newton-Raphson algorithm was implemented with step constant η = 0.5, i.e., half the full Newton step (η = 1).
The use of (0.2, 0.2) as an initial point for the EM algorithm was encouraged by the 3D plot.
The EM algorithm converged in 5 steps, resulting in the MLE at (0.2876842, 0.1065549).
To verify the consistency of this result, the EM algorithm was repeated for the same random starting points as used in the Newton-Raphson algorithm. The following results were obtained:
Grid search is the most intuitive and straightforward method, but it has limitations. It requires a solid understanding of the parameter space, since it is not feasible to search over all real numbers. In the ABO example, the parameter space Θ = {(p, q) ∈ ℝ² | p > 0, q > 0, p + q < 1} can be easily defined. However, no matter how fine the granularity, a grid can only cover a finite number of parameter pairs, leaving out countless candidate maximum points. Additionally, as the dimensionality of the problem increases, the computational cost becomes prohibitively high. The goal is therefore to find the closest possible solution within reasonable computational constraints.
From the trajectory plots shown above, it is evident that when the initial point is far from the maximum, the EM algorithm converges towards the maximum more rapidly than the Newton method. Notably, there is one case where the Newton method failed to converge. The failure occurred because the Newton update does not control how far the iterate moves, so an update can land outside the parameter space. For instance, starting from (0.2191412, 0.336263), the first Newton update led to (0.3268845, -0.1251772), which is infeasible since q < 0. To address this, the step size of each update was reduced, as indicated above. However, smaller step sizes slow convergence, so a trade-off between convergence speed and computational cost must be considered.
In contrast, the EM algorithm exhibits greater robustness: each EM update is guaranteed not to decrease the likelihood. Both the Newton and EM algorithms may converge to local maxima, so to assess consistency it is advisable to run both algorithms several times from random starting points. Grid search, by contrast, evaluates the whole parameter space and thus is unlikely to be trapped at a local maximum of a continuous function, though it locates the maximum only up to the grid resolution.
The genotype of the ith observation is denoted Gi, and r = 1 - p - q. The complete-data likelihood is:
Lcomplete ∝ Π(i=1 to n) (p²)^1{Gi=AA} · (2pr)^1{Gi=AO} · (q²)^1{Gi=BB} · (2qr)^1{Gi=BO} · (2pq)^1{Gi=AB} · (r²)^1{Gi=OO}
So, the log-likelihood of complete data is:
lcomplete ∼ Σi=1 to n [1{Gi=AA} ln(p²) + 1{Gi=AO} ln(2pr) + 1{Gi=BB} ln(q²) + 1{Gi=BO} ln(2qr) + 1{Gi=AB} ln(2pq) + 1{Gi=OO} ln(r²)]
Given some initial values p0 and q0, the only uncertainty in computing E[lcomplete | p0, q0] is whether a type-A individual has genotype AA or AO, and whether a type-B individual has genotype BB or BO. Since lcomplete is linear in the missing indicator variables and expectation is linear, it suffices to compute E[nAA], E[nAO], E[nBB], and E[nBO]; then E[lcomplete | p0, q0] is completely determined. Let r0 = 1 - p0 - q0.
E[nAA | nA, p0, q0] = E(Σ(i: type A) 1{Gi=AA} | nA) = Σ(i: type A) P(Gi = AA | type A) = [p0² / (p0² + 2p0r0)] · nA = n0AA
E[nAO | nA, p0, q0] = nA - E[nAA] = [2p0r0 / (p0² + 2p0r0)] · nA = n0AO
E[nBB | nB, p0, q0] = E(Σ(i: type B) 1{Gi=BB} | nB) = Σ(i: type B) P(Gi = BB | type B) = [q0² / (q0² + 2q0r0)] · nB = n0BB
E[nBO | nB, p0, q0] = nB - E[nBB] = [2q0r0 / (q0² + 2q0r0)] · nB = n0BO
Now we have:
E[lcomplete | p0, q0] ∼ n0AA ln(p²) + n0AO ln(2pr) + n0BB ln(q²) + n0BO ln(2qr) + nAB ln(2pq) + nO ln(r²)
Note that the expected counts n0AA, n0AO, n0BB, n0BO depend on (p0, q0), while (p, q, r) are the free variables over which the M-step maximizes.
For the M-step, we can find the maximum of E[lcomplete] using ordinary calculus:
∂/∂p E[lcomplete] = 2n0AA/p + n0AO(r - p)/(pr) - n0BO/r + nAB/p - 2nO/r = 0
This leads to: (2n0AA + n0AO + nAB) r = (n0AO + n0BO + 2nO) p
∂/∂q E[lcomplete] = 2n0BB/q + n0BO(r - q)/(qr) - n0AO/r + nAB/q - 2nO/r = 0
This leads to: (2n0BB + n0BO + nAB) r = (n0AO + n0BO + 2nO) q
Solving these two equations together with r = 1 - p - q (the coefficients on the two sides of both equations sum to 2n) yields the closed-form updates p1 = (2n0AA + n0AO + nAB) / (2n) and q1 = (2n0BB + n0BO + nAB) / (2n).
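As a sanity check, one can verify numerically that these closed-form updates satisfy both stationarity equations. The expected genotype counts below are made-up values used only for illustration:

```python
# Made-up expected genotype counts (illustration only).
nAA0, nAO0, nBB0, nBO0 = 700.0, 8423.0, 100.0, 2887.0
nAB, nO = 1269.0, 7725.0
n = nAA0 + nAO0 + nBB0 + nBO0 + nAB + nO   # total sample size

# Closed-form M-step updates (allele counting on the completed data).
p1 = (2 * nAA0 + nAO0 + nAB) / (2 * n)
q1 = (2 * nBB0 + nBO0 + nAB) / (2 * n)
r1 = 1.0 - p1 - q1

# Both stationarity equations hold (up to floating-point error).
assert abs((2 * nAA0 + nAO0 + nAB) * r1 - (nAO0 + nBO0 + 2 * nO) * p1) < 1e-6
assert abs((2 * nBB0 + nBO0 + nAB) * r1 - (nAO0 + nBO0 + 2 * nO) * q1) < 1e-6
```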
Computer Algorithms Applied to Allele Frequency Estimation. (2024, Jan 18). Retrieved from https://studymoose.com/document/computer-algorithms-applied-to-allele-frequency-estimation