Theory#

What is the Ensemble Smoother with Multiple Data Assimilations (ESMDA) ?

If you are interested in the math behind pyesmda, you’ve comed to the right place. The following is an extract summary coming from the PhD of Antoine COLLET (see chapter 4.4.1, 4.4.3 and 4.4.4 in Collet [2024]).

General “black-box” framework#

Although these three “black-box” or “derivative-free” inversion methods may at first appear to be unrelated, not least because their respective authors present them in this way and do not use the same notations, they are in fact very similar in that they minimize the same objective function and can be derived from the same approximation form of Newton’s method: Gauss-Newton iterations. But they differ in the computation of certain matrices and in the implementation.

Bayesian framework#

Considering the measurement equation

\[\mathbf{d}_{\mathrm{obs}} = \mathcal{F}(\mathbf{s}) + \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_{\mathrm{obs}}),\]

where \(\mathcal{F}\) designates the forward operator predictions masked in the data domain to match observations, the history matching problem consists in finding \(\mathbf{s}\) knowing \(\mathbf{d}_{\mathrm{obs}}\). This is equivalent to maximizing the probability function \(P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}})\) to obtain \(\widehat{\mathbf{s}}\) the maximum likelihood estimate (MLE). Using Bayes theorem, we can calculate the posterior probability density function (PDF) of an event that will occur based on another event that has already occurred with

\[\begin{split}\begin{aligned} \mathrm{Bayes' rule:} \quad P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}}) &= P(\mathbf{s}) \times \dfrac{P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s})}{P(\mathbf{d}_{\mathrm{obs}})}, \\ & \propto \underbrace{P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s})}_{\mathrm{likelyhood}} \times \underbrace{P(\mathbf{s})}_{\mathrm{prior}}, \end{aligned}\end{split}\]

where \(P(\mathbf{d}_{\mathrm{obs}})\) is neglected because it is always positive, it does not depend on \(\mathbf{s}\) and it is very difficult to compute. Since the measurement errors are assumed to be Gaussian, i.e., \(\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_{\mathrm{obs}})\), then \(P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s})\) is Gaussian, and noting that \(\mathbb{E}\left[\mathbf{d}_{\mathrm{obs}}|\mathbf{s}\right] = \mathcal{F}(\mathbf{s})\) and \(\mathrm{cov}(\mathcal{F}(\mathbf{s})) = \mathbf{C}_{\mathrm{obs}}\), it follows that

\[P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s}) = \dfrac{1}{(2\pi)^{N_{\mathrm{obs}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{obs}})}}} \exp \left( - \dfrac{1}{2}\left(\mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s})\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1}\left(\mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s})\right) \right).\]

In addition, one always has some prior information about the plausibility of models. In the case of history matching (HM), this includes a geological prior model (hopefully stochastic) constructed from log, core and seismic data, as well as information about the depositional environment. A general assumption is that the updated parameter \(\mathbf{s}\) is a realization of a random \(N_{\mathrm{s}}-\mathrm{dimensional}\) vector \(\mathbf{S}\), which is (multivariate) Gaussian or multinormal with mean \(\mathbf{s}_{\mathrm{prior}}\) and autocovariance \(\mathbf{C}_{\mathrm{prior}}\). In other words, \(\mathbf{S} \sim \mathcal{N}(\mathbf{s}_{\mathrm{prior}}, \mathbf{C}_{\mathrm{prior}})\) and the probability density for \(\mathbf{S}\) is

\[P(\mathbf{s}) = \dfrac{1}{(2\pi)^{N_{\mathrm{s}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{prior}})}}} \exp \left( - \dfrac{1}{2}\left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right) \right).\]

Maximizing the product of the prior and the likelihood inserted into Bayes’ theorem yields

\[P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}}) = a \exp\left( -\mathcal{J} \right),\]

with

(1)#\[\begin{split}\begin{aligned} a &= \dfrac{1}{(2\pi)^{N_{\mathrm{obs}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{obs}})}}} \times \dfrac{1}{(2\pi)^{N_{\mathrm{s}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{prior}})}}}, \\ \mathcal{J}(\mathbf{s}) &= \underbrace{\frac{1}{2} \left(\mathcal{F}(\mathbf{s}) - \mathbf{d}_{\mathrm{obs}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1}\left(\mathcal{F}(\mathbf{s}) - \mathbf{d}_{\mathrm{obs}}\right)}_{\mathrm{likelihood~(data misfit)}} \underbrace{+ \frac{1}{2} \left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right)}_{\mathrm{prior}}, \end{aligned}\end{split}\]

Maximizing \(P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}})\), i.e., finding the MAP estimate (the mode of the posterior distribution), is the same as minimizing its negative logarithm, i.e., the negative log-likelihood \(\mathcal{J}\).

Gauss-Newton iterations#

The MAP or most likely values are obtained by minimizing \(\mathcal{J}\) from (1) with respect to the vector \(\mathbf{s}\). As explained earlier, based on the initial guess (or most recent “good solution”) \(\mathbf{s}_{\ell}\), a Newton-type iterative approach leads to a new solution \(\mathbf{s}_{\ell+1}\) according to

\[\mathbf{s}_{\ell+1} = \mathbf{s}_{\ell} - \gamma_{\ell} \left(\dfrac{\partial^{2} \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}^{2}}\right)^{-1} \dfrac{\partial \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}}.\]

Defining \(\mathbf{J}_{\ell}\), the \((N_{\mathrm{obs}} \times N_{\mathrm{s}})\) Jacobian matrix of \(\mathcal{F}\) at \(\mathbf{s}_{\ell}\) as

\[\begin{split}\mathbf{J}_{\ell} = \left.\dfrac{\partial \mathcal{F}}{\partial \mathbf{s}} \right|_{\mathbf{s}=\mathbf{s}_{\ell}} = \mathcal{F}(\mathbf{s}_{\ell})\nabla_{\mathrm{s}}^{\mathrm{T}} = \displaystyle \begin{bmatrix} \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{0}}{\partial s_{0}} & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{0}}{\partial s_{1}} & \ldots & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{0}}{\partial s_{N_{\mathrm{s}}-1}} \\ \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{1}}{\partial s_{0}} & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{1}}{\partial s_{1}} & \ldots & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{1}}{\partial s_{N_{\mathrm{s}}-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{N_{\mathrm{obs}}-1}}{\partial s_{0}} & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{N_{\mathrm{obs}}-1}}{\partial s_{1}} & \ldots & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{N_{\mathrm{obs}}-1}}{\partial s_{N_{\mathrm{s}}-1}} \end{bmatrix},\end{split}\]

with the gradient row vector operator \(\nabla_{\mathrm{s}}^{\mathrm{T}} = \begin{bmatrix} \dfrac{\partial.}{\partial s_{0}}, & \ldots & \dfrac{\partial.}{\partial s_{N_{\mathrm{s}} - 1}} \end{bmatrix}\) and the column vector \(\mathcal{F}(\mathbf{s}_{\ell})\), \(\mathcal{F}\) is linearized around \(\mathbf{s}_{\ell}\) with \(\mathcal{F}(\mathbf{s}_{\ell+1}) \approx \mathcal{F}(\mathbf{s}_{\ell}) + \mathbf{J}_{\ell}(\mathbf{\mathbf{s}_{\ell+1}} - \mathbf{s}_{\ell})\). Then, the gradient of \(\mathcal{J}\) reads

(2)#\[\dfrac{\partial \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}} = \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell})- \mathbf{d}_{\mathrm{obs}}\right) + \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\right).\]

The Hessian is approximated in the Gauss-Newton way, i.e., neglecting \(\nabla^{2}_{\mathrm{s}} \mathcal{F}(\mathbf{s}_{j})^{\mathrm{T}}\):

(3)#\[\begin{split}\begin{aligned} \dfrac{\partial^{2} \mathcal{J} (\mathbf{s}_{\ell})}{\partial (\mathbf{s}_{\ell})^{2}} & = \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \nabla^{2}_{\mathbf{s}} \mathcal{F}(\mathbf{s}_{\ell})^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\right) + \mathbf{C}_{\mathrm{prior}}^{-1}, \\ & \approx \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}. \end{aligned}\end{split}\]

A Gauss-Newton iteration is then defined as

(4)#\[\begin{split}\begin{aligned} \mathbf{s}_{j, \ell+1} & = \mathbf{s}_{\ell} &- \gamma_{\ell} & \left(\dfrac{\partial^{2} \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}^{2}}\right)^{-1} \dfrac{\partial \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}}, \\ & = \mathbf{s}_{\ell} &- \gamma_{\ell} & \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\Bigg)^{-1} \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\right) + \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\right)\Bigg), \\ & = \mathbf{s}_{\ell} &- \gamma_{\ell} & \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\Bigg)^{-1} \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\right), \\ & &- \gamma_{\ell} & \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\Bigg)^{-1} \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\right). \end{aligned}\end{split}\]

Since inverting large covariance matrices is not practical, we use the two following matrix inversion lemmas [Golub and Van Loan, 1996, Petersen and Pedersen, 2008] for the second and third right-hand side terms of (4) respectively

(5)#\[\begin{split}\begin{aligned} & \displaystyle \left(\mathbf{A}^{-1}+\mathbf{U}\mathbf{C}^{-1}\mathbf{V}\right)^{-1} \mathbf{U} \mathbf{C}^{-1}=\mathbf{A}\mathbf{U}\left(\mathbf{C}+\mathbf{VAU}\right)^{-1}, \\ & \displaystyle \left(\mathbf{A}^{-1}+\mathbf{U}\mathbf{C}^{-1}\mathbf{V}\right)^{-1}=\mathbf{A}-\mathbf{A}\mathbf{U}\left(\mathbf{C}+\mathbf{VAU}\right)^{-1}\mathbf{VA}, \end{aligned}\end{split}\]

with \(\mathbf{A} = \mathbf{C}_{\mathrm{prior}}\), \(\mathbf{C} = \mathbf{C}_{\mathrm{obs}}\), \(\mathbf{U} = \mathbf{J}_{\ell}^{\mathrm{T}}\) and \(\mathbf{V} = \mathbf{J}_{\ell}\), which gives

(6)#\[\begin{split}\begin{aligned} \mathbf{s}_{\ell+1} & = \mathbf{s}_{\ell} - \gamma_{\ell} \Bigg[&&\mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\Big) \\ & && + \mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}} - \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \mathbf{J}_{\ell} \Big(\mathbf{s}_{j,l} - \mathbf{s}_{\mathrm{prior}}\Big) \Bigg] \\ & = \mathbf{s}_{\ell} - \gamma_{\ell} \Bigg[ && \mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}} + \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}} - \mathbf{J}_{\ell} \Big(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\Big)\Big) \Bigg]. \end{aligned}\end{split}\]

As previously stated, all implementations described below are derived from this last equation but differ in 1) the way \(\mathbf{J}_{\ell}\) is approximated, 2) the representation of \(\mathbf{C}_{\mathrm{prior}}\), 3) how matrix inversions are performed, and 4) how the step length \(\gamma\) is chosen.

Uncertainty quantification#

Minimizing \(\mathcal{J}\) allows to find the MAP \(\widehat{\mathbf{s}}\) i.e., the mean of the PDF but it does not answer the question of how to sample the full PDF (uncertainty quantification). A classic way relies on the linearization of the measurement operator (through the sensitivity matrix \(\mathbf{J}\)) which, under the assumptions made in sec_bayesian_framework, yields a local Gaussian for the posterior PDF. This local Gaussian is specified by its mean (the MAP) and the posterior covariance matrix \(\mathbf{C}_{\mathrm{post}}\) which can be approximated by the inverse of the Hessian of the negative log-likelihood of the posterior PDF computed at the MAP estimate [Kitanidis, 1995, Lepine et al., 1999, Tarantola, 2005]. Considering iteration \(\ell\) at which \(\mathbf{s}_{\ell} = \widehat{\mathbf{s}}\),

\[\mathbf{C}_{\mathrm{ss}, \ell} \approx \left(\mathbf{H}_{\ell}\right)^{-1} \approx \left(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\right)^{-1}.\]

Using the second lemma in (5), it gives

(7)#\[\mathbf{C}_{\mathrm{ss},\ell} \approx \mathbf{C}_{\mathrm{prior}} - \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \left(\mathbf{C}_{\mathrm{obs}} + \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \right)^{-1} \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}},\]

and using the forward operator linearization, the posterior covariance matrix on predictions is expressed following Lepine et al. [1999]

(8)#\[\mathbf{C}_{\mathrm{dd},\ell} \approx \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{ss},\ell} \mathbf{J}_{\ell}^{\mathbf{T}}.\]

For large-scale systems, computing and storing the approximation to \(\mathbf{C}_{\mathrm{ss}}\) is computationally infeasible because the prior covariance matrices arise from finely discretized fields and certain covariance kernels are dense [Saibaba and Kitanidis, 2015]. In addition, computing the dense measurement operator requires solving many forward PDE problems, which can be computationally intractable. Note also that when a quasi-Newton optimization is used such as L-BFGS-B, the BFGS approximation may not converge to the true Hessian matrix [Ren-pu and Powell, 1983], hence, this approximation can not be used as a posterior covariance matrix. To ovecome these issues, uncertainty analysis and optimization can be conducted using randomized sampling.

Randomized Maximum Likelihood (RML)#

Practically, a rigorous sampling procedure of the PDF, e.g., Markov chain Monte Carlo aka MCMC [Bonet-Cunha et al., 1996, Oliver et al., 1997] is untractable for large-scale problems and approximate sampling methods must be used [Emerick and Reynolds, 2013]. The Randomized Maximum Likelihood approach is one of them. It was introduced by both Kitanidis [1995] and Oliver et al. [1996] and consists of sampling both from the prior distribution \(\mathbf{s} \sim \mathcal{N}(\mathbf{s}_{\mathrm{prior}},\mathbf{C}_{\mathrm{prior}})\) and the measurements distribution \(\mathbf{d}_{\mathrm{uc}} \sim \mathcal{N}(\mathbf{d}_{\mathrm{obs}},\mathbf{C}_{\mathrm{obs}})\), forming a set of \(N_{e}\) couples of “perturbed” parameter and observation vectors {\(\mathbf{s}_{j}\), \(\mathbf{d}_{\mathrm{uc}, j}\)}, also called realizations. The subscript “uc” stands for unconditional because \(\mathbf{d}_{\mathrm{uc}_{j}} = \mathbf{d}_{\mathrm{obs}} + \mathbf{v}_{j}\), with \(\mathbf{v}_{j}\) being an unconditional realization of \(\mathbf{C}_{\mathrm{obs}}\) with zero mean. Instead of finding a single vector \(\widehat{\mathbf{s}}\) with \(\mathbf{s}_{0} = \mathbf{s}_{\mathrm{prior}}\) as an initial guess, RML requires solving \(N_{e}\) independent minimization problems – one for each draw which are used as initial guess and measurement vectors – with the following modified stochastic cost function

(9)#\[\mathcal{J}(\mathbf{s}_{j}) = \underbrace{\frac{1}{2} \left(\mathcal{F}(\mathbf{s}_{j}) - \mathbf{d}_{\mathrm{uc},j}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1}\left(\mathcal{F}(\mathbf{s}_{j}) - \mathbf{d}_{\mathrm{uc}, j}\right)}_{\mathrm{likelihood (data misfit)}} \underbrace{+ \frac{1}{2} \left(\mathbf{s}_{j} - \mathbf{s}_{\mathrm{prior}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s}_{j} - \mathbf{s}_{\mathrm{prior}}\right)}_{\mathrm{prior}},\]

where \(j\) denotes the \(j^{\mathrm{th}}\) draw. After optimizing the \(N_{e}\) problems, one obtains two ensembles of posterior parameter and prediction vectors that can be written under matrix form as

\[\begin{split}\begin{aligned} \mathbf{S} &= \begin{pmatrix}\mathbf{s}_{0}, & \mathbf{s}_{1}, & \dots, & \mathbf{s}_{N_{e}-1}\end{pmatrix}, \\ \mathbf{D} &= \begin{pmatrix}\mathbf{d}_{0}, & \mathbf{d}_{1}, & \dots, & \mathbf{d}_{N_{e}-1}\end{pmatrix}, \end{aligned}\end{split}\]

with shape (\(N_{\mathrm{s}} \times N_{e}\)) and (\(N_{\mathrm{obs}} \times N_{e}\)) respectively. Given two samples consisting of \(N_{e}\) independent realizations \(\mathbf{x}_{0}, ..., \mathbf{x}_{N_{e}-1}\) and \(\mathbf{y}_{0}, \dots, \mathbf{y}_{N_{e}-1}\) of \(N_{x}\) and \(N_{y}-\mathrm{dimensional}\) vectors \(\mathbf{X} \in \mathbb{R}^{N_{x}\times 1}\) and \(\mathbf{Y} \in \mathbb{R}^{N_{y}\times 1}\), an unbiased estimator of the covariance matrix

\[\mathrm{cov}(\mathbf{X},\mathbf{Y}) = \mathbb{E}\left[\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)\left(\mathbf{Y}-\mathbb{E}[\mathbf{Y}]\right)^{\mathrm{T}}\right],\]

is the empirical (or sample) covariance matrix denoted by a \(\sim\)

(10)#\[\widetilde{\mathbf{C}} = \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{x}_{j} - \overline{\mathbf{x}}\right)\left(\mathbf{x}_{j} - \overline{\mathbf{x}} \right)^{\mathrm{T}}.\]

The empirical auto-covariance matrices for \(\mathbf{s}\) and \(\mathbf{d}\) can then be computed from the ensembles \(\mathbf{S}\) and \(\mathbf{D}\) as

(11)#\[\begin{split}\begin{aligned} \widetilde{\mathbf{C}}_{\mathrm{SS}} & = \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{s}_{j} - \overline{\mathbf{s}}\right)\left(\mathbf{s}_{j} - \overline{\mathbf{s}} \right)^{\mathrm{T}} = \mathbf{A}_\mathrm{s}\mathbf{A}_\mathrm{s}^{\mathrm{T}}, \\ \widetilde{\mathbf{C}}^{f}_{\mathrm{DD}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{d}_{j} -\overline{\mathbf{d}} \right)\left(\mathbf{d}_{j} - \overline{\mathbf{d}} \right)^{\mathrm{T}} = \mathbf{A}_\mathrm{d}\mathbf{A}_\mathrm{d}^{\mathrm{T}}, \\ \end{aligned}\end{split}\]

with \(\mathbf{A}_\mathrm{s}\) and \(\mathbf{A}_\mathrm{d}\) the centered anomaly matrices with size (\(N_{\mathrm{s}} \times N_{e}\)) and (\(N_{\mathrm{obs}} \times N_{e}\)) defined as

(12)#\[\begin{split}\begin{aligned} \mathbf{A}_{\mathrm{s}} &= \dfrac{1}{\sqrt{N_{e}-1}}\mathbf{S} \left(\mathbf{I}_{N_{e}} - \dfrac{1}{N_{e}} \mathbf{11}^{\mathrm{T}} \right), \\ \mathbf{A}_{\mathrm{d}} &= \dfrac{1}{\sqrt{N_{e}-1}}\mathbf{D} \left(\mathbf{I}_{N_{e}} - \dfrac{1}{N_{e}} \mathbf{11}^{\mathrm{T}} \right). \end{aligned}\end{split}\]

where \(\mathbf{1} \in \mathbb{R}^{N_e}\) is defined as a column vector with all elements equal to 1, i.e., \(\mathbf{11}^{\mathrm{T}}\) is a (\(N_e \times N_e\)) matrix filled with one, \(\mathbf{I}_{N_e}\) is the \(N_e\mathrm{-dimensional}\) identity matrix, and the projection \(\left(\mathbf{I}_{N_{e}} - \dfrac{1}{N_{e}} \mathbf{1}^{\mathrm{T}} \right)\) subtracts the mean from the ensemble [Evensen et al., 2019]. Assuming that \(\overline{\mathbf{d}} = \mathcal{F}(\overline{\mathbf{s}})\), a first-order Taylor series expansion gives

(13)#\[\mathbf{d}_{j} - \overline{\mathbf{d}} = \mathcal{F}\left(\mathbf{s}_{j}\right) - \mathcal{F}(\overline{\mathbf{s}}) = \mathbf{J} \left(\mathbf{s}_{j} - \overline{\mathbf{s}}\right).\]

Injecting the previous results in the auto-covariance definitions (11) yields

\[\begin{split}\begin{aligned} \widetilde{\mathbf{C}}_{\mathrm{DD}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{d}_{j} -\overline{\mathbf{d}} \right)\left(\mathbf{d}_{j} - \overline{\mathbf{d}} \right)^{\mathrm{T}}, \\ & = \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1} \mathbf{J} \left(\mathbf{s}_{j} - \overline{\mathbf{s}}\right) \left(\mathbf{s}_{j} - \overline{\mathbf{s}}\right)^{\mathrm{T}} \mathbf{J}^{\mathrm{T}} = \mathbf{J} \widetilde{\mathbf{C}}_{\mathrm{SS}} \mathbf{J}^{\mathrm{T}}, \end{aligned}\end{split}\]

which is consistent with the analytical expression of \(\mathbf{C}_{\mathrm{dd}}\) given in (8). When used with Gauss-Newton or quasi-Newton optimization coupled to an adjoint state model, the method has been shown to provide correct sampling of the PDF, giving as good a data fit as that generated by MCMC, even for highly nonlinear models [Emerick and Reynolds, 2013]. However, the cost of the method is very high because each optimization problem is solved independently. We will see further that the ensemble-based methods used in this manuscript, ESMDA and SIES, rely on the same idea of randomized sampling but also use the ensemble to compute the sensitivity matrices, thereby avoiding the need for an adjoint state and drastically reducing the number of runs required, while maintaining a correct PDF sampling.

Ensemble smoothers (ES)#

From the Kalman filters to ensemble smoothers#

The Ensemble Smoother is a parameter estimation and data assimilation technique belonging to the family of Kalman filters (Fig. 1). Kalman filters were first introduced in a linear form by Kalman [1960] and have been successively improved. The first improvement was made to handle non-linear systems and resulted in the so-called Extended Kalman Filters, EKF [Schmidt, 1966]. The main limitation of EKF was the cost of explicitly computing and storing the covariance matrix (\(N_{y} \times N_{y}\)), where \(N_{y} = N_{\mathrm{s}} + N_{u}\) the size of the parameter-state vector. The cost of linearization is also important, making the Kalman filter unsuitable for applications to high-dimensional systems (typically, applications in geosciences). For this reason, a Monte Carlo implementation, namely the Ensemble Kalman Filter (EnKF) was introduced by Evensen [1994]. In this case, the covariance matrix is not computed directly but approximated from an ensemble of realizations (og physical fg states of the system), which is much cheaper. This is why these filters are still widely used in meteorological, petroleum and hydrogeological applications among others [Oliver and Chen, 2011]. Following the notation and introduction used by Emerick and Reynolds [2013], in the EnKF, \(\mathbf{y}^{n}\) denotes the sub parameter-state vector for the \(n^{\mathrm{th}}\) data assimilation time step (simulation time step for which there is at least one observation to assimilate). It is defined as

\[\begin{split}\mathbf{y}^{n} = \begin{bmatrix} \mathbf{s}^{n} \\ \mathbf{u}^{n} \end{bmatrix},\end{split}\]

where \(\mathbf{s}^{n}\) and \(\mathbf{u}^{n}\) are the \(N_{\mathrm{s}}\) and \(N_{u}-\mathrm{dimensional}\) vectors of model parameters and state of the dynamical system at iteration \(n\) respectively. In vectorial form, the analysis (update) equation can be written as

(14)#\[\mathbf{y}^{n, a}_{j} = \mathbf{y}^{n, f}_{j} + \widetilde{\mathbf{C}}_{\mathrm{YD}}^{n,f} \left(\widetilde{\mathbf{C}}_{\mathrm{DD}}^{n,f} + \mathbf{C}_{\mathrm{obs}}^{n} \right)^{-1} \left(\mathbf{d}_{\mathrm{uc}, j}^{n} - \mathbf{d}_{j}^{n,f}\right),\]

where \(j\) denotes the index of the member (realization) among the ensemble which is of size \(N_{e}\). The superscript \(a\) stands for og analyzed fg or og assimilated fg while \(f\) denotes the og forecasted fg state before the correction (assimilation). Practically, this means that when observations are available and an assimilation is performed at time index \(n\), \(\mathbf{y}^{n, a}_{j}\) becomes the new system state and it is used instead of \(\mathbf{y}^{n, f}_{j}\) to predict \(\mathbf{y}^{n+1, f}_{j}\). Thus, \(\widetilde{\mathbf{C}}_{\mathrm{YD}}^{n,f}\) is the \(N_{y} \times N_{d}^{n}\) empirical cross-covariance matrix between the forecast state vector and predicted data vector; \(\widetilde{\mathbf{C}}_{\mathrm{DD}}^{n,f}\) is the \(N_{d}^{n} \times N_{d}^{n}\) empirical auto-covariance matrix of predicted data \(\mathbf{d}^{n,f}\); \(\mathbf{C}_{\mathrm{obs}}^{n}\) is the \(N_{d}^{n} \times N_{d}^{n}\) covariance matrix of observed data measurement errors; \(N_{d}^{n}\) is the number of data points (observations) assimilated at the \(n^{\mathrm{th}}\) time index; \(\mathbf{d}_{\mathrm{uc},j}^{n,f}\) is the vector of og perturbed fg observations for time index \(n\), i.e., \(\mathbf{d}_{\mathrm{uc},j}^{n,f} \sim \mathcal{N}\left(\mathbf{d}_{\mathrm{obs}}^{n},\mathbf{C}_{\mathrm{obs}}^{n}\right)\); and \(\mathbf{d}^{n,f}\) is the corresponding \(N_{d}^{n}-\mathrm{dimensional}\) vector of predicted data.

This type of filter is particularly interesting for systems which present chaotic and unstable dynamics such as oceanic and atmospheric models [Evensen, 2007] or water quality models [Wang et al., 2023]. However, EnKF also presents some drawbacks [Evensen, 2007, Reynolds, 2017]:

The computation of empirical cross-covariance matrices \(\mathbf{C}_{\mathrm{YD}}^{n,f}\) and \(\mathbf{C}_{\mathrm{DD}}^{n,f}\) is performed a great number of times (at each new set of observations);
The size of \(\mathbf{C}_{\mathrm{YD}}^{n,f}\) varies linearly with \(N_{\mathrm{u}}\) and \(N_{\mathrm{s}}\) and can be problematic for large-scale system;
Sequential data assimilation with EnKF requires to updating both model parameters and primary variables (parameter-state-estimation problem) which translates to modifications of the forward model (consequently an access to the code is required), with the ability to stop and restart (checkpointing);
It does not necessarily produce a correct initial state as it may needs a few time iterations to reach a model in accordance with the observations. This is satisfactory for real time assimilation, but not suited to invert a field to its initial value, e.g., an uraninite field being dissolved in ISR RTM;
In practice, we may observe inconsistency between the updated parameters and states.

To solve these issues, the ensemble smoother (ES) formulation was introduced targeting primarily reservoir simulation history matching [van Leeuwen and Evensen, 1996]. Going from EnKF to ES, the main assumptions were that 1) reservoir simulation models are typically stable functions of the rock property fields; 2) model uncertainties could be neglected. Hence, when applying ES, the state-parameter estimation problem is downgraded to a simpler parameter-only estimation problem. In this case, all observations are assimilated at once and the analysis equation becomes

(15)#\[\mathbf{s}^{a}_{j} = \mathbf{s}^{f}_{j} + \widetilde{\mathbf{C}}_{\mathrm{SD}}^{f} \left(\widetilde{\mathbf{C}}_{\mathrm{DD}}^{f} + \mathbf{C}_{\mathrm{obs}} \right)^{-1} \left(\mathbf{d}_{\mathrm{uc},j} - \mathbf{d}_{j}^{f}\right).\]

Compared to (14), empirical covariance matrices must be computed only once. \(\widetilde{\mathbf{C}}_{\mathrm{SD}}^{f}\) is the empirical cross-covariance matrix between the prior vector of model parameters, \(\mathbf{s}^{f}\), and the vector of predicted data, \(\mathbf{d}^{f}\), with size \(N_{\mathrm{s}} \times N_{\mathrm{obs}}\); \(\widetilde{\mathbf{C}}_{\mathrm{DD}}^{f}\) is the \(N_{\mathrm{obs}} \times N_{\mathrm{obs}}\) auto-covariance matrix of predicted data and \(\mathbf{C}_{\mathrm{obs}}\) is the covariance matrix of (all) observed data measurement errors with the same \(N_{\mathrm{obs}} \times N_{\mathrm{obs}}\) size. The empirical cross-covariance matrices are computed according to (10), i.e., in the EnKF way [Evensen, 2007, Aanonsen et al., 2009] as

(16)#\[\begin{split}\begin{aligned} \widetilde{\mathbf{C}}^{f}_{\mathrm{SD}} & = \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{s}^{f}_{j} - \overline{\mathbf{s}^{f}_{j}}\right)\left(\mathbf{d}^{f}_{j} - \overline{\mathbf{d}^{f}} \right)^{\mathrm{T}}, \\ \widetilde{\mathbf{C}}^{f}_{\mathrm{DD}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{d}^{f}_{j} -\overline{\mathbf{d}^{f}_{j}} \right)\left(\mathbf{d}^{f}_{j} - \overline{\mathbf{d}^{f}} \right)^{\mathrm{T}}, \\ \widetilde{\mathbf{C}}^{f}_{\mathrm{SS}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{s}^{f}_{j} - \overline{\mathbf{s}^{f}_{j}}\right)\left(\mathbf{s}^{f}_{j} - \overline{\mathbf{s}^{f}_{j}}\right)^{\mathrm{T}} \approx \mathbf{C}_{\mathrm{prior}}. \end{aligned} `\end{split}\]

In this form, there is no inconsistency between states and parameters. ES is also efficient because like PCGA, it avoids an explicit computation and inversion of large covariance matrices (\(\mathbf{C}_{\mathrm{prior}}\)). It is also easy to implement with any simulator as it becomes a real black-box (this is not the case with EnKF) and it offers substantial reduction in computational cost compared to EnKF [Skjervheim et al., 2011] with a complete parallelization for equivalent results. In addition the emph{a posteriori} uncertainty in both parameters and predictions can be characterized using the assimilated ensemble

(17)#\[\begin{split}\begin{aligned} \widetilde{\mathbf{C}}^{a}_{\mathrm{DD}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{d}^{a}_{j} -\overline{\mathbf{d}^{a}_{j}} \right)\left(\mathbf{d}^{a}_{j} - \overline{\mathbf{d}^{a}} \right)^{\mathrm{T}}, \\ \widetilde{\mathbf{C}}^{a}_{\mathrm{SS}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{s}^{a}_{j} - \overline{\mathbf{s}^{a}_{j}}\right)\left(\mathbf{s}^{a}_{j} - \overline{\mathbf{s}^{a}_{j}}\right)^{\mathrm{T}}. \end{aligned}\end{split}\]

Interestingly, although it might seem different at first, ES has been proven to be an approximation to RML-Gauss-Newton equation with full step where all ensemble members are updated with the same average sensitivity matrix [Reynolds et al., 2006].

Equivalence between ES and RML-Gauss-Newton#

Considering an ensemble of \(N_{e}\) realizations, recalling the objective function in (1) and the associated update given in (6), a Gauss-Newton iteration for the \(j^{\mathrm{th}}\) ensemble member reads

\[\mathbf{s}_{\ell+1,j} = \mathbf{s}_{\ell,j} - \gamma_{\ell,j} \Bigg[ \mathbf{s}_{\ell,j} - \mathbf{s}_{\mathrm{prior}} + \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell,j}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell,j} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell,j}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell,j}) - \mathbf{d}_{\mathrm{uc},j} - \mathbf{J}_{\ell,j} \Big(\mathbf{s}_{\ell,j} - \mathbf{s}_{\mathrm{prior}}\Big)\Big) \Bigg].\]

Evaluating \(\mathbf{J}_{\ell, j}\) for each realization and at the local iterate requires the existence of the tangent linear operator of the non-linear forward model \(\mathcal{F}\), i.e., an adjoint state model, which is generally not available and still requires \(N_{\mathrm{obs}}\) forward runs at each local iterate and for each Gauss-Newton iteration. Therefore, we perform a linearization of the model around the ensemble mean and use the same gradient for all realizations through the average sensitivity matrix \(\overline{\mathbf{J}} \equiv \left( \nabla \mathcal{F} |_{\mathbf{s} = \overline{\mathbf{s}}} \right)^{\mathrm{T}} \in \mathcal{R}^{N_{\mathrm{s}}\times N_{\mathrm{obs}}}\). A common step length \(\gamma\) is also used, and the update now reads

(18)#\[\mathbf{s}_{\ell+1,j} = \mathbf{s}_{\ell,j} - \gamma_{\ell} \Bigg[\mathbf{s}_{\ell,j} - \mathbf{s}_{\mathrm{prior}} + \mathbf{C}_{\mathrm{prior}} \overline{\mathbf{J}}_{\ell}^{\mathrm{T}} \Bigg(\overline{\mathbf{J}}_{\ell} \mathbf{C}_{\mathrm{prior}} \overline{\mathbf{J}}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell,j}) - \mathbf{d}_{\mathrm{uc},j} - \overline{\mathbf{J}}_{\ell} \Big(\mathbf{s}_{\ell,j} - \mathbf{s}_{\mathrm{prior}}\Big)\Big) \Bigg].\]

Following Reynolds et al. [2006], we derive the ensemble smoother as an approximation to RML-Gauss-Newton equation with full step (\(\gamma=1\)), performing a single iteration with \(\ell = 0\) (initial guess) and using \(\mathbf{s}_{\mathrm{prior}} = \mathbf{s}_{0,j}\). As a consequence, \(\mathbf{s}_{0,j} - \mathbf{s}_{\mathrm{prior}} = 0\). Denoting \(\mathbf{s}_{1, j} = \mathbf{s}_{j}^{a}\), \(\mathbf{s}_{0, j} = \mathbf{s}_{j}^{f}\) and \(\mathcal{F}(\mathbf{s}_{0, j}) = \mathbf{d}^{f}_{j}\) then (18) becomes

(19)#\[\mathbf{s}_{j}^{a} = \mathbf{s}_{j}^{f} - \Bigg[\mathbf{C}_{\mathrm{prior}} \overline{\mathbf{J}}_{0}^{\mathrm{T}} \Bigg(\overline{\mathbf{J}}_{0} \mathbf{C}_{\mathrm{prior}} \overline{\mathbf{J}}_{0}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathbf{d}_{j}^{f} - \mathbf{d}_{\mathrm{uc},j} \Big) \Bigg].\]

Assuming that \(\overline{\mathbf{d}^{f}} = \mathcal{F}(\overline{\mathbf{s}}^{f})\), recall (13) from the previous section

\[\mathbf{d}_{j}^{f} - \overline{\mathbf{d}^{f}} = \mathcal{F}\left(\mathbf{s}_{j}^{f}\right) - \mathcal{F}(\overline{\mathbf{s}}^{f}) = \mathbf{J}|_{\overline{\mathbf{s}}^{f}}\left(\mathbf{s}_{j}^{f} - \overline{\mathbf{s}}^{f}\right) \equiv \overline{\mathbf{J}}_{0}\left(\mathbf{s}_{j}^{f} - \overline{\mathbf{s}}^{f}\right).\]

Injecting the previous results in the auto-covariances definitions yields

(20)#\[\begin{split}\begin{aligned} \widetilde{\mathbf{C}}^{f}_{\mathrm{SD}} & = \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{s}^{f}_{j} - \overline{\mathbf{s}^{f}_{j}}\right)\left(\mathbf{d}^{f}_{j} - \overline{\mathbf{d}^{f}} \right)^{\mathrm{T}}, \\ &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{s}^{f}_{j} - \overline{\mathbf{s}^{f}_{j}}\right)\left(\mathbf{s}_{j}^{f} - \overline{\mathbf{s}}^{f}\right)^{\mathrm{T}}\overline{\mathbf{J}}_{0}^{\mathrm{T}} = \mathbf{C}_{\mathrm{prior}} \overline{\mathbf{J}}_{0}^{\mathrm{T}}, \end{aligned}\end{split}\]

(21)#\[\begin{split}\begin{aligned} \widetilde{\mathbf{C}}^{f}_{\mathrm{DD}} &= \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1}\left(\mathbf{d}^{f}_{j} -\overline{\mathbf{d}^{f}_{j}} \right)\left(\mathbf{d}^{f}_{j} - \overline{\mathbf{d}^{f}} \right)^{\mathrm{T}}, \\ & = \frac{1}{N_{e} - 1} \sum_{j=0}^{N_{e}-1} \overline{\mathbf{J}}_{0} \left(\mathbf{s}_{j}^{f} - \overline{\mathbf{s}}^{f}\right) \left(\mathbf{s}_{j}^{f} - \overline{\mathbf{s}}^{f}\right)^{\mathrm{T}}\overline{\mathbf{J}}_{0}^{\mathrm{T}} = \overline{\mathbf{J}}_{0} \mathbf{C}_{\mathrm{prior}} \overline{\mathbf{J}}_{0}^{\mathrm{T}}. \end{aligned}\end{split}\]

Using (20) and (21) in (19) gives the following equation which is the ensemble smoother

\[\begin{split}\begin{aligned} \mathbf{s}_{j}^{a} & = \mathbf{s}_{j}^{f} - \widetilde{\mathbf{C}}^{f}_{\mathrm{SD}} \Bigg(\widetilde{\mathbf{C}}^{f}_{\mathrm{DD}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathbf{d}_{j}^{f} - \mathbf{d}_{\mathrm{uc},j} \Big), \\ & = \mathbf{s}_{j}^{f} + \widetilde{\mathbf{C}}^{f}_{\mathrm{SD}} \Bigg(\widetilde{\mathbf{C}}^{f}_{\mathrm{DD}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathbf{d}_{\mathrm{uc},j} - \mathbf{d}_{j}^{f} \Big). \end{aligned}\end{split}\]

As shown by Reynolds et al. [2006], ES (and EnKF) are indeed the same as updating each ensemble member by doing one iteration of the Gauss-Newton method and using the same average sensitivity matrix for all ensemble members. This partially explains the reasonable performance of EnKF when applied to history matching production data.

Ensemble size, inflation and need for localization#

In reservoir history matching, the computational cost of ES and of other derived iterative ensemble smoothers mainly depends on the ensemble size since the most expensive part is usually the computation of the prediction ensemble \(\mathbf{D}\) which requires \(N_{e}\) simulations at each assimilation. Consequently, the number of ensemble realizations is a trade-off between performance and accuracy. In the literature, the og magic fg range 30-100 is usual, and although several methods have been proposed [Yin et al., 2015, Leutbecher, 2019, Milinski et al., 2020], the choice seems to be application dependent and empirical. When the size of the ensemble chosen for inversion is too small, it might not be statistically representative of the variability of the unknowns: the low rank approximation of the error covariance matrix introduces sampling errors which decrease proportionally to \(1/\sqrt{N_{e}}\) [Evensen, 2003]. For small \(N_{e}\), long-range spurious correlations might appear causing the smoother divergence and collapse. Covariance localization and inflation are two techniques developed to overcome these issues [Evensen et al., 2022].

As detailed by Evensen et al. [2022], there are several ways to perform localization. In this work, we choose to follow the localization in observation space following the approximate localization scheme introduced by Houtekamer and Mitchell [2001]. (15) writes

(22)#\[\mathbf{s}^{a}_{j} = \mathbf{s}^{f}_{j} + \boldsymbol{\rho}_{\mathrm{SD}} \odot \widetilde{\mathbf{C}}_{\mathrm{SD}}^{f} \left(\boldsymbol{\rho}_{\mathrm{DD}} \odot \widetilde{\mathbf{C}}_{\mathrm{DD}}^{f} + \mathbf{C}_{\mathrm{obs}} \right)^{-1} \left(\mathbf{d}_{\mathrm{uc},j} - \mathbf{d}_{j}^{f}\right),\]

where \(\boldsymbol{\rho}_{\mathrm{SD}}\) is a damping (correlation) matrix with size (\(N_{\mathrm{s}} \times N_{\mathrm{obs}}\)); \(\boldsymbol{\rho}_{\mathrm{DD}}\) is a damping matrix with size (\(N_{\mathrm{obs}} \times N_{\mathrm{obs}}\)); and \(\odot\) is the Schur (or Hadamard) product performing element-wise multiplication of two matrices. These correlation matrices are typically built based on spatial and time distances from measurement and rely on damping functions such as proposed by Gaspari and Cohn [1999]. #+LATEX: Some have been implemented in pyesmda and a descriptive notebook is availablefootnote{https://pyesmda.readthedocs.io/en/stable/tutorials/correlation_matrices_building.html}. For large-scale applications, these matrices are probably sparse and sparse representation can be used to save memory and computation time.

As stated by Evensen et al. [2019] in reservoir models, as well as in ISR RTM, a consistent localization is difficult because there are both long-range correlations and transient phenomena in the reservoir, e.g., purely spatial localization can normally not be used. An alternative approach to distance-based localization is to perform adaptive localization based empirical correlations [Anderson, 2007, Bishop and Hodyss, 2007, Fertig et al., 2007]. The correlation is defined as

\[\mathrm{cor}(\mathbf{X},\mathbf{Y}) = \dfrac{\mathrm{cov}(\mathbf{X},\mathbf{Y})}{\sigma_{\mathbf{X}} \sigma_{\mathbf{Y}}}.\]

Using (10), the empirical correlation matrix is defined as

(23)#\[\begin{split}\begin{split} \widetilde{\boldsymbol{\rho}}_{\mathrm{DD}} &= \dfrac{\mathbf{A}_{\mathrm{d}}\mathbf{A}_{\mathrm{d}}^{\mathrm{T}}}{\mathbf{A}_{\mathrm{d}}\mathbf{A}_{\mathrm{d}}}, \\ \widetilde{\boldsymbol{\rho}}_{\mathrm{SD}} &= \dfrac{\mathbf{A}_{\mathrm{s}}\mathbf{A}_{\mathrm{d}}^{\mathrm{T}}}{\mathbf{A}_{\mathrm{s}}\mathbf{A}_{\mathrm{d}}}, \end{split}\end{split}\]

with the centered anomaly matrices \(\mathbf{A}_\mathrm{s}\) and \(\mathbf{A}_\mathrm{d}\) defined in (12). The initial idea was to perform a thresholding of the correlation matrices. Such approach imples that a model variable is updated by only using observations that have large enough correlations (in magnitudes) with it, regardless of the physical locations of these observations. More recently, the proposed approach by Luo et al. [2018], Luo and Bhakta [2020], Luo and Xia [2022] that combines truncation-based adaptive localization with tapering of each local update seems promising to overcome the small discontinuities in the updates noticed by Evensen [2009].

Comments on the ensemble smoothers#

Remembering (15), the ES analysis equation can be written as

\[\begin{split}\begin{split} \mathbf{s}^{a}_{j} &= \mathbf{s}^{f}_{j} + \mathbf{A}_{\mathrm{s}}^{f} (\mathbf{A}_{\mathrm{s}}^{f})^{\mathrm{T}} \left(\widetilde{\mathbf{C}}_{\mathrm{DD}}^{f} + \mathbf{C}_{\mathrm{obs}} \right)^{-1} \left(\mathbf{d}_{\mathrm{uc},j} - \mathbf{d}_{j}^{f}\right), \\ & = \mathbf{s}^{f}_{j} + \mathbf{A}_{\mathrm{S}}^{f} \mathbf{M}^{f}. \end{split}\end{split}\]

As stated by Reynolds [2017], since \(\mathbf{A}_{\mathrm{S}}^{f}\) is a linear combination of the initial ensemble of models, it indicates that every \(\mathbf{s}^{a}_{j}\) is a linear combination of the models in the initial ensemble, i.e., we are trying to find a linear combination of the initial ensemble members that matches data. Thus, if the models contain large structural features, it will be difficult to capture the geology by assimilating data with the ES.

Ensemble Smoother with Multiple Data Assimilation (ESMDA)#

With ES, all data are assimilated simultaneously, which means that a single Gauss-Newton correction is applied to condition the ensemble to all available data. Therefore, ES may not be able to provide acceptable data matches when applied to reservoir history-matching problems (several Gauss-Newton iterations are usually required). Moreover, in ES, there is a possibility of ensemble collapse (convergence to a local minimum) particularly when measurement errors \(\mathbf{C}_{\mathrm{obs}}\) are low. To overcome both of these problems, Emerick and Reynolds [2013], Emerick and Reynolds [2012] proposed to assimilate the same data multiple times with an inflated measurement error covariance (\(\mathbf{C}_{\mathrm{obs}}\)). The covariance inflation was inspired by the fact that for history-matching with gradient-based methods, some form of damping at early iterations is usually necessary to avoid getting trapped in an excessively rough model which does not give a very good history match [Li et al., 2003, Gao and Reynolds, 2006].

Thanks to its simple formulation, ESMDA is perhaps the most used iterative form of the ensemble smoother in geoscience applications. In ESMDA, the ES analysis equation from (15) becomes

(24)#\[\mathbf{s}_{\ell+1, j} = \mathbf{s}_{\ell,j} + \mathbf{C}_{\mathrm{SD}, \ell}\left(\mathbf{C}_{\mathrm{DD},\ell}+\alpha_{\ell} \mathbf{C}_{\mathrm{obs}}\right)^{-1} \left(\mathbf{d}_{\mathrm{obs}} + \alpha_{\ell} \mathbf{v}_{j} - \mathcal{F}(\mathbf{s}_{\ell,j}) \right),\]

with \(\mathbf{v}_{j}\) an unconditional realization of \(\mathcal{N}(\mathbf{0},\mathbf{C}_{\mathrm{obs}})\), and \(\alpha_{\ell}\) the inflation factors which should respect the following constraint

(25)#\[\sum_{\ell=0}^{N_{a}-1} \frac{1}{\alpha_{\ell}} = 1.\]

with \(N_{a}\) the number of assimilations and \(\ell\) the assimilation index. This condition is imposed because it makes single (ES) and multiple data assimilations (ESMDA) equivalent for the linear-Gaussian case [Emerick and Reynolds, 2013].

Localization#

As far as localization is concerned, it can be applied in the same way than for the ES as described in sec_inflation. The analysis step reads

(26)#\[\mathbf{s}_{\ell+1, j} = \mathbf{s}_{\ell,j} + \boldsymbol{\rho}_{\mathrm{SD}} \odot \mathbf{C}_{\mathrm{SD}, \ell}\left(\boldsymbol{\rho}_{\mathrm{DD}} \odot \mathbf{C}_{\mathrm{DD},\ell}+\alpha_{\ell} \mathbf{C}_{\mathrm{obs}}\right)^{-1} \left(\mathbf{d}_{\mathrm{obs}} + \alpha_{\ell} \mathbf{v}_{j} - \mathcal{F}(\mathbf{s}_{\ell,j}) \right).\]

Automatic hyperparameters tuning#

In addition to the number of realizations in the initial ensemble, the original form of ESMDA requires the definition of two hyperparameters, namely the number of assimilations \(N_{a}\), and the sequence of inflation factors \(\{\alpha_{0}, \alpha_{1}, \dots,\alpha_{N_{a}-1}\}\). Unfortunately, this form of ESMDA lacks the ability to continue “iterating” if the data mismatch is too large, and the optimal associated inflation factors could only be determined by trial and error. It becomes intractable for large-scale problems and tends to make the results dependent on the practitioner. It is also worth noting that the choice of the inflation factor sequence is critical, as it can both prevent near ensemble collapse and limit extreme values of rock property fields for complex problems. To avoid a purely empirical user-based choice, several adaptive procedures have been proposed in the literature. Two of them have been implemented in pyesmda and tested thoroughly in this work, namely the Restricted Step (ESMDA-RS) procedure established by Le et al. [2016], and the Data Misfit Controller (ESMDA-DMC) procedure developed by Iglesias and Yang [2021]. Pseudo-codes for these two procedures are given in Fig. 2, Fig. 3 respectively. Interestingly, both procedures result in strong initial inflation factors, that decrease over the course of iterations. This is consistent with the earlier idea from Li et al. [2003], Gao and Reynolds [2006] that some form of stronger damping is required at early iterations.

Regarding the number of realizations in the initial ensemble, Ranazzi and Sampaio [2019] investigated its impact on the number of iterations required with the restricted step automatic procedure. They found that the ensemble size influenced the reduction of the uncertainty of the posterior models, but it did not show any influence on the total number of assimilations and on the choice of the inflation factor (they used 100, 200, 300, 400 and 500 ensemble members).

_images/esmda_dmc.png — Fig. 3 ESMDA-DMC#

Efficient analysis step#

The main computational difficulty in the ESMDA algorithm is the inversion of the term \(\mathbf{C}_{\ell} = \mathbf{C}_{\mathrm{DD}, \ell}+\alpha_{\ell} \mathbf{C}_{\mathrm{obs}}\) (line 9 of Fig. 2 and line 13 of Fig. 3). Inspired by the implementation from Equinor, seven approaches have been implemented in pyesmda and are summarized in the Table 1.

The four first methods are exact but not efficient and not suited for large-scale inversion with big data sets. The last three methods are approximate or exact depending on \(N_e\), \(N_{\mathrm{obs}}\), and the truncation factor which gives the fraction of the highest singular values to keep (in the interval ]0, 1]):

if \(N_{e} \leq N_{\mathrm{obs}}\), then the solution is always approximated, regardless of the truncation;

if \(N_{e} > N_{\mathrm{obs}}\), then the solution is exact when truncation is 1.0.

Only the three first methods are compatible with the localization by \(\boldsymbol{\rho}_{\mathrm{DD}}\) because the others rely on \(\mathbf{A}_{\mathrm{d}} \mathbf{A}_{\mathrm{d}}^{\mathrm{T}}\) and the Shur product is not distributive in front of the matrix multiplication.

Table 1 Inversion approaches for the term \(\mathbf{C}_{\ell} = \mathbf{C}_{\mathrm{DD}, \ell} + \alpha_{\ell} \mathbf{C}_{\mathrm{obs}}\)#
Method	Description	LSC? [1]	Reference
Naive	Naive inversion of \(\mathbf{C}_{\ell}\)	NO
Exact - Cholesky	Solve \(\mathbf{C}_{\ell} \mathbf{x} = \mathbf{b}\) with a Cholesky factorization [2] of \(\mathbf{C}_{\ell}\)	NO
Exact - Lstsq	Solve \(\mathbf{C}_{\ell} \mathbf{x} = \mathbf{b}\) with least squares	NO
Exact - Woodbury	Use the second Woodbury lemma in (5)	NO
Rescaled	TSVD [3] of \(\mathbf{C}_{\ell}\) with a rescaling procedure to compensate for truncation of small singular values	YES	Appendix A.1 from Emerick (2012)
Subspace	TSVD [3] of \(\mathbf{A}_{\mathrm{d},\ell}\) instead of \(\mathbf{C}_{\ell}\), more efficient when \(N_e \ll N_{\mathrm{obs}}\)	YES	Appendix A.2 from Emerick (2012)
Subspace rescaled	Same as subspace but with a rescaling procedure	YES	Appendix A.2 from Emerick (2012)

Notes

[1] Large Scale Compatible?

[2] Because \(\mathbf{C}_{\mathrm{DD}, \ell}\) is real symmetric positive semi-definite, \(\mathbf{C}_{\ell}\) is positive definite if \(\mathbf{C}_{\mathrm{obs}}\) is positive definite.

[3] TSVD = Truncated Singular Value Decomposition