Chapter 4 EM Algorithm

4.1 Introduction

The EM algorithm can be viewed as a special case of the MM algorithm. Proposed by Dempster, Laird, and Rubin (1977), it is one of the pillars of modern computational statistics. Every EM algorithm has some notion of missing data.

Setup:

  • Complete data \(X = (Y, Z)\), with density \(f(x | \theta)\).
  • Observed data \(Y\). Some function \(t(X) = Y\) collapses \(X\) into \(Y\).
  • Missing data \(Z\).

The definition of \(X\) is left up to the creativity of the statistician. The general idea is to choose \(X\) so that the MLE is trivial for complete data.

4.2 EM Algorithm

The EM algorithm iterates between the two steps until convergence:

  1. E-step: compute the conditional expectation \[\begin{equation*} Q(\theta | \theta_n) = E [\log f(X | \theta) | Y = y, \theta_n], \end{equation*}\] where \(\theta_n\) is the current estimate of \(\theta\). Note that what is needed is the expectation of the complete-data loglikelihood, i.e., of functions of the missing data \(Z\), rather than of \(Z\) itself; the two coincide when the complete-data loglikelihood is linear in \(Z\).

  2. M-step: maximize \(Q(\theta | \theta_n)\) with respect to \(\theta\) to obtain the new estimate \(\theta_{n + 1}\).

Ascent property: Let \(g(y | \theta)\) be the observed likelihood. Then the EM algorithm enjoys the ascent property: \[\begin{equation*} \log g(y | \theta_{n + 1}) \ge \log g(y | \theta_n). \end{equation*}\]

It is sufficient to show the minorization inequality: \[\begin{equation*} \log g(y | \theta) \ge Q(\theta | \theta_n) + \log g(y | \theta_n) - Q(\theta_n | \theta_n), \end{equation*}\] because the M-step guarantees \(Q(\theta_{n+1} | \theta_n) \ge Q(\theta_n | \theta_n)\), so evaluating the right-hand side at \(\theta = \theta_{n+1}\) gives \(\log g(y | \theta_{n+1}) \ge \log g(y | \theta_n)\).

Information inequality: For densities \(h_1\) and \(h_2\) with respect to measure \(\mu\), \(E_{h_1} \log h_1(X) \ge E_{h_1} \log h_2(X)\) with equality only if \(h_1 = h_2\) almost everywhere relative to \(\mu\).

To prove the ascent property, let \(h_1(x | y, \theta) = f(x|\theta) / g(y | \theta)\) and \(h_2(x | y, \theta_n) = f(x | \theta_n) / g(y | \theta_n)\) be conditional densities of \(X\) on the set \(\{x: t(x) = y\}\) with respect to some measure \(\mu_y\).
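For completeness, the ascent property follows by taking conditional expectations with respect to \(h_2\) of the identity \(\log g(y | \theta) = \log f(x | \theta) - \log h_1(x | y, \theta)\), which holds for any \(x\) with \(t(x) = y\), and then applying the information inequality:
\[\begin{align*} \log g(y | \theta) &= Q(\theta | \theta_n) - E[\log h_1(X | y, \theta) \mid Y = y, \theta_n]\\ &\ge Q(\theta | \theta_n) - E[\log h_2(X | y, \theta_n) \mid Y = y, \theta_n]\\ &= Q(\theta | \theta_n) - \{Q(\theta_n | \theta_n) - \log g(y | \theta_n)\}, \end{align*}\]
which is exactly the minorization inequality.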

4.3 Example: Clustering by EM

Suppose that \(y_1, \ldots, y_n\) form a random sample from a mixture density \(h(y) = \sum_{j=1}^k \pi_j h_j(y | \theta)\), where \(\pi_j > 0\), \(j = 1, \ldots, k\), \(\sum_{j=1}^k \pi_j = 1\), and \(h_j\) is the density function of group \(j\) with parameter \(\theta\). For example, each group density can be normal with common variance \(\sigma^2\) but group-specific mean \(\mu_j\).

Problem: estimate the parameters \(\pi_j\)’s and \(\theta\).

Missing data: let \(z_{ij}\) be the indicator such that \(z_{ij} = 1\) if observation \(y_i\) is from group \(j\), and zero otherwise.

Complete-data loglikelihood: \[ \sum_{i=1}^n \sum_{j=1}^k z_{ij}[\log \pi_j + \log h_j(y_i | \theta)]. \]

E-step: compute \(w_{ij}\), the conditional expectation of \(z_{ij}\) given \(y_i\) and the current values of the \(\pi_j\)’s and \(\theta\). By Bayes’ rule, \[\begin{equation*} w_{ij} = \frac{\pi_j h_j(y_i | \theta)}{\sum_{l=1}^k \pi_l h_l(y_i | \theta)}. \end{equation*}\]

M-step: with missing values filled by their conditional expectations \(w_{ij}\), the maximization step separates \(\pi\) from \(\theta\).
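As a concrete illustration, here is a minimal R sketch of this E/M cycle for the normal location mixture mentioned above (common variance \(\sigma^2\), group-specific means \(\mu_j\)); the function name and interface are illustrative rather than part of the notes.

```r
## EM for a k-component normal mixture with common variance (illustrative sketch)
normmix_em <- function(y, k, pi, mu, sigma2, maxit = 500, tol = 1e-8) {
  n <- length(y)
  for (it in 1:maxit) {
    ## E-step: posterior membership probabilities w_ij by Bayes' rule
    dens <- sapply(1:k, function(j) pi[j] * dnorm(y, mu[j], sqrt(sigma2)))
    w <- dens / rowSums(dens)                      # n x k matrix of w_ij
    ## M-step: the update of pi separates from that of (mu, sigma2)
    pi_new     <- colMeans(w)
    mu_new     <- colSums(w * y) / colSums(w)
    sigma2_new <- sum(w * outer(y, mu_new, "-")^2) / n
    done <- max(abs(c(pi_new - pi, mu_new - mu, sigma2_new - sigma2))) < tol
    pi <- pi_new; mu <- mu_new; sigma2 <- sigma2_new
    if (done) break
  }
  list(pi = pi, mu = mu, sigma2 = sigma2, w = w, iter = it)
}
```

Each iteration increases the observed-data loglikelihood (the ascent property), so monitoring that quantity is a useful convergence check.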

Question: what if there is no closed form E-step or M-step?

4.4 Variants of EM

4.4.1 MCEM

Classroom examples may give the impression that the E-step consists of replacing the missing data by their conditional expectations given the observed data at the current parameter values. This is the case in many examples because the complete-data loglikelihood is linear in the missing data \(Z\), but it does not hold in general.

When the E-step has no closed form, it can be approximated by Monte Carlo integration; this variant of the EM algorithm is known as Monte Carlo EM (MCEM) (Wei and Tanner 1990). The Monte Carlo E-step draws a sample \(Z^{(1)}, \ldots, Z^{(m)}\) from the conditional distribution of the missing data given the observed data at the current parameter value, and approximates \(Q(\theta | \theta_n)\) by the Monte Carlo average \[ \hat Q(\theta | \theta_n) = \frac{1}{m} \sum_{t=1}^m \log f(y, Z^{(t)} | \theta). \]

The Monte Carlo sample size can differ from iteration to iteration; a smaller sample size may be sufficient early on, but larger sample sizes are needed in the vicinity of convergence to control the Monte Carlo error. Importance sampling can be used as well.
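As a toy illustration of a Monte Carlo E-step (this example is not from the notes; the censored-normal model, names, and settings below are assumptions chosen for simplicity), consider estimating the mean \(\mu\) of \(N(\mu, 1)\) data when values above a known cutoff are right-censored. The censored values are the missing data, and the Monte Carlo E-step imputes them by sampling from the truncated normal:

```r
## MCEM sketch: right-censored N(mu, 1) data with known cutoff cc (illustrative)
set.seed(1)
mu_true <- 2; cc <- 2.5; n <- 200
z    <- rnorm(n, mu_true, 1)
y    <- pmin(z, cc)                 # observed (possibly censored) values
cens <- z > cc                      # censoring indicators (observed)

mu <- mean(y)                       # crude starting value
for (iter in 1:50) {
  m <- 100 * iter                   # grow the Monte Carlo sample size with iterations
  ## Monte Carlo E-step: draw from N(mu, 1) truncated to (cc, Inf) via inverse CDF
  u    <- matrix(runif(sum(cens) * m, pnorm(cc, mu, 1), 1), nrow = sum(cens))
  zmis <- qnorm(u, mu, 1)
  ## M-step: with sigma = 1 known, Q is maximized by averaging the observed values
  ## and the Monte Carlo means of the imputed censored values
  mu <- (sum(y[!cens]) + sum(rowMeans(zmis))) / n
}
mu                                  # MCEM estimate of mu
```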

MCEM routines need to address two challenges (Levine and Casella 2001): (1) how do we minimize the computational cost of obtaining a sample, and (2) how do we choose the Monte Carlo sample size? Rejection sampling and importance sampling can be used for the first. For the second, the Monte Carlo sample size should be increased, in an automated way, at iterations where the change in the parameter value is swamped by Monte Carlo error.

4.4.2 ECM

The Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin 1993) is a class of generalized EM (GEM) algorithms (Dempster, Laird, and Rubin 1977), in which the M-step is only partially implemented: the new estimate improves the \(Q\)-function found in the E-step, and hence the likelihood, but does not necessarily maximize it.

The ECM algorithm replaces the M-step of the EM algorithm with several computationally simpler conditional maximization (CM) steps. Each CM-step maximizes the \(Q\)-function found in the preceding E-step subject to constraints on \(\theta\), where the collection of all constraints is such that the maximization is over the full parameter space of \(\theta\). It is essentially coordinate ascent!

A CM-step might be in closed form or it might itself require iteration, but because the CM maximizations are over lower-dimensional spaces, they are often simpler, faster, and more stable than the corresponding full maximization called for by the M-step of the EM algorithm, especially when iteration is required. The ECM algorithm typically converges more slowly than EM in terms of the number of iterations, but it can be faster in total computing time. More importantly, the ECM algorithm preserves the appealing convergence properties of the EM algorithm, such as its monotone convergence.

4.5 Standard Errors

4.5.1 Supplemental EM (SEM)

Meng and Rubin (1991) proposed a general automated algorithm, named SEM, to obtain a numerically stable estimate of the asymptotic variance matrix of the estimator from the EM algorithm. The method uses the fact that the rate of convergence of EM is governed by the fraction of missing information: the increase in variability due to the missing information is estimated and added to the complete-data variance matrix.

Keep the notation \(X = (Y, Z)\) and consider the factorization \[ f(X | \theta) = g(Y | \theta) k(Z | Y; \theta), \] where \(f\), \(g\), and \(k\) are the joint, marginal, and conditional densities of their arguments, respectively. The observed loglikelihood is the difference between the complete loglikelihood and the conditional loglikelihood \[ L(\theta | Y) = L(\theta | X) - \log k(Z | Y; \theta). \] Taking second derivatives, averaging over \(k(Z | Y; \theta)\), and evaluating at the MLE \(\theta = \theta^*\), we have \[ I_o(\theta^* | Y) = I_{oc} - I_{om}, \] where \[ I_o(\theta | Y) = - \frac{\partial^2 \log g(Y | \theta)}{\partial\theta\partial\theta^{\top}} \] is the observed information matrix, \[ I_{oc} = E\left[ - \frac{\partial^2 \log f(X | \theta)}{\partial\theta\partial\theta^{\top}} \Big\vert Y, \theta\right] \Big\vert_{\theta = \theta^*} \] is the conditional expectation of the complete-data observed information given the observed data, and \[ I_{om} = E\left[ - \frac{\partial^2 \log k(Z | Y; \theta)}{\partial\theta\partial\theta^{\top}} \Big\vert Y, \theta\right] \Big\vert_{\theta = \theta^*} \] is viewed as the missing information. The interpretation is appealing: \[ \mbox{observed information} = \mbox{complete information} - \mbox{missing information}, \] which is known as the “missing information principle”.

The observed information can be written as \[ I_o(\theta^* | Y) = (I - R) I_{oc}, \] where \(R = I_{om} I_{oc}^{-1}\). It has been shown that the convergence rate of the EM algorithm is \(R\) (Dempster, Laird, and Rubin 1977). If an estimate of \(R\) is available, the target variance matrix can be estimated using \[ I_{oc}^{-1} (I - R)^{-1} = I_{oc}^{-1} + I_{oc}^{-1}R (I - R)^{-1}. \]

The SEM algorithm needs to evaluate \(I_{oc}^{-1}\) and \(R\). Evaluation of \(I_{oc}\) is simplified if \(X\) is from an exponential family: it can be obtained simply by substituting the conditional expectations of the complete-data sufficient statistics \(S(X)\) computed in the last E-step. Non-exponential-family cases can be handled by linearizing the complete loglikelihood in terms of \(S(X)\).

The computation of \(R\) is done numerically. Each element of \(R\) is estimated by the component-wise rate of convergence of a “forced EM”. Let \(r_{ij}\) be the \((i,j)\)th element of \(R\). Let \[ \theta^{(t)}(i) = (\theta_1^*, \ldots, \theta_{i-1}^*, \theta_i^{(t)}, \theta_{i+1}^*, \ldots, \theta_d^*), \] that is, only the \(i\)th component of the \(d\)-dimensional vector \(\theta^{(t)}(i)\) is active, in the sense that all other components are fixed at their MLEs. Given \(\theta^*\) and the EM iterate \(\theta^{(t)}\), one iteration of the EM algorithm is run from \(\theta^{(t)}(i)\) to obtain \(\tilde\theta^{(t+1)}(i) = M\big(\theta^{(t)}(i)\big)\). Then, obtain \[ r_{ij}^{(t)} = \frac{\tilde\theta_j^{(t+1)}(i) - \theta_j^*}{\theta_i^{(t)} - \theta_i^*}, \qquad j = 1, \ldots, d. \] This ratio approximates the slope of the EM map \(\theta^{(t + 1)} = M(\theta^{(t)})\) at \(\theta^*\); in practice \(r_{ij}^{(t)}\) is computed for successive values of \(t\) until it stabilizes, and the stable value is taken as \(r_{ij}\).
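A minimal R sketch of this forced-EM computation, assuming a user-supplied function em_map() that performs one EM iteration \(\theta \mapsto M(\theta)\) (all names here are illustrative):

```r
## One forced-EM pass: estimate R from the MLE theta_star and an EM iterate theta_t
sem_R <- function(em_map, theta_star, theta_t) {
  d <- length(theta_star)
  R <- matrix(NA_real_, d, d)
  for (i in 1:d) {
    theta_i    <- theta_star
    theta_i[i] <- theta_t[i]          # only component i is "active"
    theta_new  <- em_map(theta_i)     # one EM iteration from the perturbed point
    R[i, ]     <- (theta_new - theta_star) / (theta_t[i] - theta_star[i])
  }
  R                                   # the (i, j) entry approximates r_ij at this t
}
```

Repeating this over successive EM iterates \(\theta^{(t)}\) until the entries stabilize gives the estimate of \(R\) to plug into the variance formula above.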

4.5.2 Direct Calculation of the Information Matrix

Oakes (1999) derived an explicit formula for the observed information matrix in terms of derivatives of the \(Q\) function (conditional expectation of the complete-data loglikelihood given the observed data) invoked by the EM algorithm.

Start from the fundamental identity: \[\begin{equation} L(\phi' \mid Y) = Q(\phi' | \phi) - E_{X\mid Y,\phi} \log k(X \mid Y; \phi'), \tag{4.1} \end{equation}\] where \(L(\phi' \mid Y)\) is the observed-data loglikelihood, \(Q(\phi' \mid \phi) = E_{X \mid Y,\phi} L_0(\phi', X)\), and \(L_0\) is the complete-data loglikelihood.

Assume that the usual exchange of expectation with respect to \(X\) and differentiation in \(\phi\) holds for \(\log k(X | Y; \phi)\). This gives the two identities \[ E_{X\mid Y,\phi} \frac{\partial\log k(X\mid Y; \phi)}{\partial \phi} = 0 \] and \[ E_{X\mid Y,\phi} \frac{\partial^2 \log k(X\mid Y; \phi)}{\partial\phi^2} + E_{X\mid Y,\phi} \frac{\partial \log k(X\mid Y; \phi)}{\partial \phi} \left[\frac{\partial \log k(X\mid Y; \phi)}{\partial \phi}\right]^{\top} = 0. \]

Differentiating (4.1) with respect to \(\phi'\) gives \[\begin{equation} \frac{\partial L}{\partial \phi'} = \frac{\partial Q(\phi' | \phi)}{\partial \phi'} - E_{X\mid Y, \phi} \frac{\partial \log k(X\mid Y; \phi')}{\partial \phi'} . \tag{4.2} \end{equation}\] Setting \(\phi' = \phi\) makes the second term vanish by the first identity above, which leads to \[ \frac{\partial L}{\partial \phi} = \left\{\frac{\partial Q(\phi' | \phi)}{\partial \phi'} \right\}_{\phi' = \phi}. \]

Differentiating (4.2) with respect to \(\phi'\) and \(\phi\) gives, respectively, \[ \frac{\partial^2 L}{\partial \phi'^2} = \frac{\partial^2 Q(\phi'|\phi)}{\partial \phi'^2} - E_{X\mid Y, \phi} \frac{\partial^2 \log k(X\mid Y; \phi')}{\partial \phi'^2}, \] and \[ 0 = \frac{\partial^2 Q(\phi'|\phi)}{\partial \phi'\partial \phi} - E_{X\mid Y, \phi} \frac{\partial \log k(X\mid Y; \phi')}{\partial \phi'} \left[\frac{\partial \log k(X\mid Y; \phi)}{\partial \phi} \right]^{\top}. \] Setting \(\phi' = \phi\), adding the two equations, and using the information identity give \[ \frac{\partial^2 L}{\partial \phi^2} = \left\{\frac{\partial^2 Q(\phi'|\phi)}{\partial \phi'^2} + \frac{\partial^2 Q(\phi'|\phi)}{\partial \phi'\partial \phi} \right\}_{\phi' = \phi}, \] which is valid for all \(\phi\). The second term is the “missing information” due to the fact that only \(Y\), instead of \(X\), is observed.

If the complete data are from an exponential family, the second term reflects the sensitivity of the imputed complete-data sufficient statistic to changes in the hypothesized parameter value.
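Because the formula involves only derivatives of the \(Q\)-function, it can also be evaluated numerically. The sketch below uses central finite differences; the function Qfun(phi_prime, phi), the MLE phi_star, and the step size are assumptions supplied by the user:

```r
## Observed information via Oakes' formula, by finite differences (illustrative)
obs_info_oakes <- function(Qfun, phi_star, eps = 1e-4) {
  d  <- length(phi_star)
  H1 <- matrix(0, d, d)   # d^2 Q / d phi'^2      at phi' = phi = phi_star
  H2 <- matrix(0, d, d)   # d^2 Q / d phi' d phi  at phi' = phi = phi_star
  for (i in 1:d) for (j in 1:d) {
    ei <- ej <- numeric(d); ei[i] <- eps; ej[j] <- eps
    H1[i, j] <- (Qfun(phi_star + ei + ej, phi_star) - Qfun(phi_star + ei - ej, phi_star) -
                 Qfun(phi_star - ei + ej, phi_star) + Qfun(phi_star - ei - ej, phi_star)) / (4 * eps^2)
    H2[i, j] <- (Qfun(phi_star + ei, phi_star + ej) - Qfun(phi_star + ei, phi_star - ej) -
                 Qfun(phi_star - ei, phi_star + ej) + Qfun(phi_star - ei, phi_star - ej)) / (4 * eps^2)
  }
  -(H1 + H2)              # observed information matrix at phi_star
}
```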

4.6 Acceleration

Can we obtain fast convergence without sacrificing the stability of EM?

4.7 Example: Hidden Markov Model

A hidden Markov model (HMM) is a dependent mixture model. The model consists of two parts: 1) a Markov process for the unobserved state process \(\{S_t: t = 1, 2, \ldots\}\); and 2) a state-dependent process for the observed data \(\{X_t: t = 1, 2, \ldots\}\) such that when \(S_t\) is known, the distribution of \(X_t\) depends on \(S_t\) only and not on previous states or observations. It can be characterized by \[\begin{align} \Pr(S_t \mid S_1, \ldots, S_{t - 1}) &= \Pr(S_t \mid S_{t-1}),\quad t = 2, 3, \ldots, \\ \Pr(X_t \mid S_1, \ldots, S_t; X_1, \ldots, X_{t-1}) &= \Pr(X_t \mid S_t), \quad t = 1, 2, \ldots. \end{align}\] HMMs have a wide range of applications.

Suppose that the process \(S_t\) has \(m\) states. Let \(\delta\) be the initial distribution of \(S_1\), \(\Gamma\) the probability transition matrix of \(S_t\), and \(\theta\) the parameter vector in \(\Pr(X_t | S_t)\). Given the observed data \(\{x_t: t = 1, \ldots, T\}\), these parameters can be estimated by an application of the EM algorithm in which the unobserved states \(\{s_t: t = 1, \ldots, T\}\) are treated as missing data.

To devise the EM algorithm, we need the complete-data loglikelihood. Let \(u_j(t) = I(s_t = j)\), \(t = 1, \ldots, T\), \(j = 1, \ldots, m\), and \(v_{jk}(t) = I(s_{t-1} = j, s_t = k)\), \(t = 2, \ldots, T\), \(j, k \in \{1, \ldots, m\}\). The complete-data loglikelihood is \[\begin{align*} &\log p(s_1, \ldots, s_T, x_1, \ldots, x_T)\\ &= \log \left[ \delta_{s_1} \prod_{t=2}^T \Gamma_{s_{t-1}, s_t} \prod_{t=1}^T p(x_t | s_t; \theta) \right]\\ &= \log \delta_{s_1} + \sum_{t=2}^T \log \Gamma_{s_{t-1}, s_t} + \sum_{t=1}^T \log p(x_t | s_t; \theta)\\ &= \sum_{j=1}^m u_j(1) \log \delta_j + \sum_{j=1}^m \sum_{k=1}^m \sum_{t=2}^T v_{jk}(t) \log \Gamma_{jk} + \sum_{j=1}^m \sum_{t=1}^T u_j(t) \log p(x_t | s_t = j; \theta). \end{align*}\]

In the E-step, the quantities \(u_j(t)\) and \(v_{jk}(t)\) need to be replaced with their conditional expectations given \(\{x_t: t = 1, \ldots, T\}\). Define the row vector \(\alpha(t)\) as \[ \alpha(t) = \delta P(x_1) \prod_{r=2}^t \Gamma P(x_r), \] where \(P(x_r) = \mathrm{diag}[p(x_r | 1; \theta), \ldots, p(x_r | m; \theta)]\) and an empty product is the identity matrix. This is the vector of forward probabilities, because its \(j\)th component is \(\alpha_j(t) = p(x_1, \ldots, x_t, s_t = j)\). Define the column vector \(\beta(t)\) as \[ \beta(t) = \prod_{r = t + 1}^T \Gamma P(x_r) \mathbf{1}, \] where \(\mathbf{1}\) is the column vector of ones; in particular \(\beta(T) = \mathbf{1}\). This is the vector of backward probabilities, because its \(j\)th component is \(\beta_j(t) = p(x_{t+1}, \ldots, x_T | s_t = j)\). The forward probabilities depend on the initial distribution \(\delta\) (the model need not assume stationarity of \(S_t\)); the backward probabilities do not involve \(\delta\).

It follows that for all \(t \in \{1, \ldots, T\}\) and \(j \in \{1, \ldots, m\}\), \[ \alpha_j(t) \beta_j(t) = p(x_1, \ldots, x_T, s_t = j). \] Consequently, \(\alpha(t) \beta(t) = p(x_1, \ldots, x_T) = L_T\) for each \(t\). It can be verified that, for \(t \in \{1, \ldots, T\}\), \[\begin{equation} p(S_t = j | x_1, \ldots, x_T) = \alpha_j(t) \beta_j(t) / L_T, \tag{4.3} \end{equation}\] and that, for \(t \in \{2, \ldots, T\}\), \[\begin{equation} p(s_{t - 1} = j, s_t = k | x_1, \ldots, x_T) = \alpha_j(t - 1) \Gamma_{jk} p(x_t | k; \theta) \beta_k(t) / L_T. \tag{4.4} \end{equation}\]

Equations (4.3) and (4.4), respectively, give the needed conditional expectation in the E-step for \(u_j(t)\) and \(v_{jk}(t)\) given the observations \(\{x_t: t = 1, \ldots, T\}\).
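A minimal R sketch of these computations is given below; dens is assumed to be a \(T \times m\) matrix with entries \(p(x_t \mid j; \theta)\) supplied by the user, and all names are illustrative. Direct products of probabilities underflow for long series, so in practice the recursions are scaled or carried out on the log scale.

```r
## Forward-backward recursions and the E-step quantities (4.3)-(4.4), illustrative
forward_backward <- function(delta, Gamma, dens) {
  TT <- nrow(dens); m <- ncol(dens)
  alpha <- beta <- matrix(0, TT, m)
  alpha[1, ] <- delta * dens[1, ]                       # alpha(1) = delta P(x_1)
  for (t in 2:TT) alpha[t, ] <- (alpha[t - 1, ] %*% Gamma) * dens[t, ]
  beta[TT, ] <- 1                                       # beta(T) = 1 (empty product)
  for (t in (TT - 1):1) beta[t, ] <- Gamma %*% (dens[t + 1, ] * beta[t + 1, ])
  L <- sum(alpha[TT, ])                                 # likelihood L_T
  u <- alpha * beta / L                                 # E[u_j(t) | x], Eq. (4.3)
  v <- lapply(2:TT, function(t)                         # E[v_jk(t) | x], Eq. (4.4)
    outer(alpha[t - 1, ], dens[t, ] * beta[t, ]) * Gamma / L)
  list(alpha = alpha, beta = beta, L = L, u = u, v = v)
}
```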

In the M-step, the maximization with respect to the three sets of parameters \(\delta\), \(\Gamma\), and \(\theta\) can be done separately, since each term in the sum depends on only one set. Maximization with respect to \(\delta\) and \(\Gamma\) has closed-form solutions, while maximization with respect to \(\theta\) generally must be done numerically.
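For concreteness, writing \(\hat u_j(t)\) and \(\hat v_{jk}(t)\) for the conditional expectations obtained from (4.3) and (4.4), the closed-form updates are the standard Baum–Welch ones; the \(\lambda_j\) update shown applies to the Poisson state-dependent distribution used in the exercise below: \[\begin{align*} \hat\delta_j = \hat u_j(1), \qquad \hat\Gamma_{jk} = \frac{\sum_{t=2}^T \hat v_{jk}(t)}{\sum_{k'=1}^m \sum_{t=2}^T \hat v_{jk'}(t)}, \qquad \hat\lambda_j = \frac{\sum_{t=1}^T \hat u_j(t)\, x_t}{\sum_{t=1}^T \hat u_j(t)}. \end{align*}\]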

4.8 Exercises

4.8.1 Finite mixture regression

Given \(n\) independent observations of the response \(Y \in \mathbb{R}\) and predictor \(\mathbf{X} \in \mathbb{R}^p\), multiple linear regression models are commonly used to explore the conditional mean structure of \(Y\) given \(\mathbf{X}\). However, in many applications, the underlying assumption that the regression relationship is homogeneous across all the observations \((y_1, \mathbf{x}_1),\ldots,(y_n, \mathbf{x}_n)\) can be easily violated. Instead, the observations may form several distinct clusters indicating mixed relationships between the response and the predictors. Such heterogeneity can be more appropriately modeled by a finite mixture regression model, consisting of, say, \(m\) homogeneous groups/components.

Suppose the density of \(y_i\) (conditional on \(\mathbf{x}_i\)), is given by \[\begin{eqnarray} f(y_i\mid \mathbf{x}_i,\boldsymbol{\Psi})= \sum_{j=1}^{m} \pi_{j}\phi(y_i;\mathbf{x}_i^{\top}\boldsymbol{\beta}_{j}, \sigma^2),\qquad i=1,\ldots,n, \tag{4.5} \end{eqnarray}\] where \(\pi_j\)s are called mixing proportions, \(\boldsymbol{\beta}_j\) is the regression coefficient vector for the \(j\)th group, \(\phi(\cdot\:;\mu,\sigma^2)\) denotes the density function of \(N(\mu,\sigma^2)\), and \(\boldsymbol{\Psi}=(\pi_1,\boldsymbol{\beta}_1,\ldots,\pi_m,\boldsymbol{\beta}_m,\sigma)^T\) collects all the unknown parameters.

Maximum likelihood estimation is commonly used to infer the unknown parameter \(\boldsymbol{\Psi}\) in (4.5), i.e., \[\begin{equation} \hat{\boldsymbol{\Psi}}_{\mbox{mle}}=\arg\max_{\boldsymbol{\Psi}}\sum_{i=1}^n\log\left\{\sum_{j=1}^m\pi_j\phi(y_i;\textbf{x}_i^{\top}\boldsymbol{\beta}_{j},\sigma^2)\right\}. \tag{4.6} \end{equation}\] The MLE does not have an explicit form, and the problem is usually solved by the EM algorithm.

Let \(z_{ij} = 1\) if the \(i\)th observation is from the \(j\)th component and zero otherwise. Denote the complete data by \(\{(\mathbf{x}_{i}, \mathbf{z}_{i}, y_{i}); i =1,2,\ldots,n\}\), where the component labels \(\mathbf{z}_{i} = (z_{i1}, z_{i2}, \ldots, z_{im})\) are not observable, or “missing”, in practice. The complete log-likelihood can be written as \[ l_{n}^{c}(\boldsymbol{\Psi})=\sum_{i=1}^{n}\sum_{j=1}^{m}z_{ij}\log \left\{\pi_{j}\phi(y_{i}-\mathbf{x}_{i}^{\top}\boldsymbol{\beta}_{j};0,\sigma^{2})\right\}. \] In the E-step, we calculate the conditional expectation of the complete log-likelihood with respect to \(\mathbf{z}_{i}\) given the data and the current parameter values, and in the M-step, we maximize this conditional expectation with respect to \(\boldsymbol{\Psi}\).

At the \(k\)th iteration, the E-Step computes the conditional expectation of \(l_{n}^{c}(\boldsymbol{\Psi})\): \[\begin{align*} Q(\boldsymbol{\Psi}\mid \boldsymbol{\Psi}^{(k)}) =\sum_{i=1}^{n}\sum_{j=1}^{m}p_{ij}^{(k+1)} \left\{\log\pi_{j}+\log\phi(y_{i}-\textbf{x}_{i}^{\top}\boldsymbol{\beta}_{j};0,\sigma^{2})\right\}, \end{align*}\] where \[\begin{align*} p_{ij}^{(k+1)}&=E(z_{ij}\mid y_{i},\textbf{x}_{i};\boldsymbol{\Psi}^{(k)}) =\frac{\pi_{j}^{(k)}\phi(y_{i}-\textbf{x}_{i}^{\top}\boldsymbol{\beta}_{j}^{(k)};0,\sigma^{2^{(k)}})}{\sum_{j=1}^{m}\pi_{j}^{(k)}\phi(y_{i}-\textbf{x}_{i}^{\top}\boldsymbol{\beta}_{j}^{(k)};0,\sigma^{2^{(k)}})}. \end{align*}\] The M-Step maximizes \(Q(\boldsymbol{\Psi} \mid \boldsymbol{\Psi}^{(k)})\) to obtain \[\begin{align*} \pi_{j}^{(k+1)} &=\frac{\sum_{i=1}^{n}p_{ij}^{(k+1)}}{n}\\ \boldsymbol{\beta}_{j}^{(k+1)}&=\left(\sum_{i=1}^{n}\textbf{x}_i\textbf{x}_{i}^{\top}p_{ij}^{(k+1)}\right)^{-1}\left(\sum_{i=1}^{n}\textbf{x}_ip_{ij}^{(k+1)}y_i\right),\qquad j=1,\ldots,m;\\ \sigma^{2^{(k+1)}}&=\frac{\sum_{j=1}^{m}\sum_{i=1}^{n}p_{ij}^{(k+1)}(y_{i}-\textbf{x}_{i}^{\top}\boldsymbol{\beta}_{j}^{(k+1)})^{2}}{n}. \end{align*}\]

  1. Follow the lecture notes to verify the validity of the provided E- and M-steps. That is, derive the updating rules in the given algorithm based on the construction of an EM algorithm.

  2. Implement this algorithm in R with a function regmix_em. The inputs of the function are y for the response vector, xmat for the design matrix, pi.init for initial values of the \(\pi_j\)’s (a \(K \times 1\) vector), beta.init for initial values of the \(\boldsymbol{\beta}_j\)’s (a \(p \times K\) matrix, where \(p\) is ncol(xmat) and \(K\) is the number of components in the mixture), sigma.init for the initial value of \(\sigma\), and a control list specifying the maximum number of iterations and the convergence tolerance. The output of this function is the EM estimate of all the parameters.

  3. Here is a function to generate data from the mixture regression model.
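A minimal simulator consistent with model (4.5); the name regmix_sim and its interface are illustrative assumptions:

```r
## Simulate n observations from the finite mixture regression model (4.5)
regmix_sim <- function(n, pi, beta, sigma) {
  p <- nrow(beta)                              # beta is p x m, one column per component
  m <- ncol(beta)
  xmat <- matrix(rnorm(n * p), n, p)           # design matrix (no intercept column)
  grp  <- sample(m, n, replace = TRUE, prob = pi)
  y    <- rowSums(xmat * t(beta[, grp, drop = FALSE])) + rnorm(n, sd = sigma)
  list(y = y, xmat = xmat, grp = grp)
}
```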

Generate data with the following and estimate the parameters.
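For example, with illustrative settings (the specific values below are assumptions chosen only to demonstrate the calls):

```r
set.seed(20)
bet <- matrix(c(1, 1, -1, -1, 1, -1), nrow = 2)        # 2 x 3 coefficient matrix
dat <- regmix_sim(n = 400, pi = c(0.3, 0.4, 0.3), beta = bet, sigma = 1)
fit <- regmix_em(y = dat$y, xmat = dat$xmat,
                 pi.init    = rep(1/3, 3),
                 beta.init  = bet * 0.8,               # start near (not at) the truth
                 sigma.init = 1,
                 control    = list(maxit = 500, tol = 1e-5))  # element names up to your implementation
```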

4.8.2 Acceleration of EM algorithm

Use the SQUAREM package to accelerate the EM algorithm of the finite mixture regression. See Ravi Varadhan’s article for his example.

  1. Write a function regmix_em1step() to implement the one-step EM iteration.

  2. Wrap your implementation of the EM algorithm to a function that uses regmix_em1step().

  3. Call SQUAREM::squarem() with appropriate inputs to find the MLE (a sketch of the call pattern follows this list).

  4. Compare the speed of the two versions with package microbenchmark.
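A sketch of the intended call pattern follows; it assumes regmix_em1step(par, y, xmat, m) takes a packed parameter vector \((\pi, \mathrm{vec}(\boldsymbol\beta), \sigma)\) as its first argument and returns the updated vector after one EM step, and it reuses the illustrative data and starting values from above (the packing scheme and argument names are assumptions):

```r
library(SQUAREM)
library(microbenchmark)

## pack the starting values into a single vector: (pi, vec(beta), sigma)
par0 <- c(rep(1/3, 3), as.vector(bet * 0.8), 1)

## accelerated EM: extra arguments (y, xmat, m) are passed on to regmix_em1step()
fit_sq <- SQUAREM::squarem(par = par0, fixptfn = regmix_em1step,
                           y = dat$y, xmat = dat$xmat, m = 3)
fit_sq$par                                   # accelerated EM estimate (packed)

## timing comparison: plain EM (regmix_em from the previous exercise) vs SQUAREM
microbenchmark::microbenchmark(
  em = regmix_em(dat$y, dat$xmat, rep(1/3, 3), bet * 0.8, 1,
                 control = list(maxit = 500, tol = 1e-5)),
  squarem = SQUAREM::squarem(par0, fixptfn = regmix_em1step,
                             y = dat$y, xmat = dat$xmat, m = 3),
  times = 10
)
```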

4.8.3 A Poisson-HMM for earthquake data

Consider an \(m\)-state HMM where the observed data follow a Poisson distribution with mean \(\lambda_i\) for state \(i\), \(i = 1, \ldots, m\). This model has three sets of parameters: the initial distribution \(\delta\), the transition probability matrix \(\Gamma\), and the Poisson means \(\lambda = (\lambda_1, \ldots, \lambda_m)\). Suppose the observed data is a vector \(x\). Write a function poisson.hmm.em() that implements the EM algorithm for this model with these arguments:

  • x: the observed data;
  • m: the number of states;
  • lambda: initial value of \(\lambda\);
  • Gamma: initial value of \(\Gamma\);
  • delta: initial value of \(\delta\);
  • control: a named list specifying the convergence tolerance, the maximum number of iterations, and whether or not to trace the iterations.

Apply the function to model the frequency of major earthquakes (magnitude 7 or above) in the world from 1900 to 2015 with a 2-state HMM and a 3-state HMM.

References

Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1). JSTOR: 1–38.

Levine, Richard A, and George Casella. 2001. “Implementations of the Monte Carlo EM Algorithm.” Journal of Computational and Graphical Statistics 10 (3). Taylor & Francis: 422–39.

Meng, Xiao-Li, and Donald B Rubin. 1991. “Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm.” Journal of the American Statistical Association 86 (416). Taylor & Francis: 899–909.

Meng, Xiao-Li, and Donald B Rubin. 1993. “Maximum Likelihood Estimation via the ECM Algorithm: A General Framework.” Biometrika 80 (2). Biometrika Trust: 267–78.

Oakes, David. 1999. “Direct Calculation of the Information Matrix via the EM.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (2). Wiley Online Library: 479–82.

Wei, Greg CG, and Martin A Tanner. 1990. “A Monte Carlo Implementation of the EM Algorithm and the Poor Man’s Data Augmentation Algorithms.” Journal of the American Statistical Association 85 (411). Taylor & Francis: 699–704.