Existence and Consistency of a Nonparametric Estimator of Probability Measures in the Prohorov Metric Framework

We consider nonparametric estimation of probability measures for parameters in problems where only aggregate (population level) data are available. We summarize an existing computational method for the estimation problem which has been developed over the past several decades [3, 6, 15, 18, 20]. New theoretical results are presented which establish the existence and consistency of very general (ordinary, generalized and other) least squares estimates for the measure estimation problem.

Motivation and Problem Formulation

In a standard nonlinear regression problem, a mathematical model is proposed which links one or more states of interest to the independent variables (regressors) of an experiment and to a vector of parameters whose values are unknown to the experimenter. An experiment is then conducted on the physical or biological system and data is collected for one or more states of interest. The unknown parameters of interest are then estimated in an inverse or parameter estimation problem, the theory for which is well-established [14, 24, 26]. Yet in many situations physical, biological, or experimental limitations do not permit one to sample individual data directly. Rather, one obtains data at the aggregate level as multiple individuals are sampled. In this case, it is commonly assumed that while the states of interest for these individuals are described by a single mathematical framework, each individual is described by a unique set of parameters within that framework. For instance, the growth of mosquitofish [7, 15, 16] and shrimp [10, 12] have been shown to be described by a size-structured partial differential equation model in which the rate of individual growth is assumed to vary probabilistically across the population. HIV replication data has been shown to be accurately described by a cellular-level model in which intracellular delays vary from cell to cell [5].
The probabilistic distribution of parameters has also been observed in models of electromagnetic polarization [9, 17]. These examples and others are considered at greater length in the recent book [2].

More precisely, suppose that the quantities of interest for a single individual can be described by the mathematical model

dy/dt = g(t, y(t); q),  y(t_0) = y_0.  (1.1)

The parameter vector q ∈ R^r is specific to each individual within the population. The model solution is

y(t; θ) = Cf(t; q, y_0),  (1.2)

where θ = (q, y_0) ∈ R^{r+s} = R^p. It is assumed f(t; θ) ∈ R^s and C ∈ R^{l×s} so that y ∈ R^l. (In the notation that follows, we tacitly assume l = 1; this is only for convenience and all theory presented holds for vector observations.) It is assumed that θ ∈ Θ for all individuals in the population, where Θ is a set of admissible parameters.
For the aggregate data problem, one can consider n observations as random variables resulting from the direct sampling of the mean population state, but measured subject to random error. Then it is possible to define the random variables

V_j = v(t_j; P_0) + E_j  (1.3)

for j = 1, ..., n, where

v(t; P) = E[Cy(t; ·) | P] = ∫_Θ Cy(t; θ) dP(θ),

and the random variables E_j represent measurement noise, modeling error, microfluctuations, etc. Let E = (E_1, ..., E_n). It is assumed that the first two central moments of the random vector E are E[E] = 0 and Var(E) = R. Without loss of generality, it may be assumed (by transforming the data and model) that the random variables E_j are independent and identically distributed, so that R = σ²I_n, where I_n is the n × n identity matrix [24].
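To fix ideas, the observation model (1.3) can be simulated directly. The sketch below assumes, purely for illustration, an exponential-decay individual model y(t; q) = exp(-q t) and a two-point 'true' measure P_0; none of these choices come from the cited references.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical individual model y(t; q) = exp(-q t); the 'true' measure P0
# places mass 0.3 and 0.7 on the rates q = 0.5 and q = 2.0
nodes0 = np.array([0.5, 2.0])
w0 = np.array([0.3, 0.7])

def v(t, nodes, w):
    # population-level model v(t; P) = integral of y(t; q) dP(q)
    return np.exp(-np.outer(np.atleast_1d(t), nodes)) @ w

# aggregate observations V_j = v(t_j; P0) + E_j with iid N(0, sigma^2) noise
t_obs = np.linspace(0.0, 3.0, 25)
sigma = 0.05
data = v(t_obs, nodes0, w0) + sigma * rng.normal(size=t_obs.size)
```

Note that each observation mixes the states of many individuals: the data reflect the integral of the individual model against P_0, not any single parameter value.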
In the context of (1.3) one does not have information on a fixed, single parameter, but rather on the distribution of parameters which characterizes the behavior of the entire population. Given n realizations v_j of the random variables V_j (which we will sometimes write v and V for notational convenience), the goal of an inverse or parameter estimation problem is to produce an estimate of the hypothetical true measure P_0. Significantly, the data is sampled from the state space of the mathematical system and not from the parameter space; thus one does not sample directly from the distribution of interest.
The estimated measure should be one that best fits the data in some appropriate sense, so that one must first choose a framework in which to work. Given that choice of framework, one must establish a set of theoretical and computational tools with which to treat the parameter estimation problem. For the results presented here, we focus on a frequentist approach using least squares estimation. Theoretical results for likelihood estimation (also in a frequentist framework) can be established with little difficulty from the results presented here. We do not consider a Bayesian approach in this manuscript, except to remark that a comparison of the frequentist and Bayesian approaches to the estimation of the unknown distribution P_0 is an interesting avenue for future work.
For the least squares problem, define

J_n(V, P) = Σ_{j=1}^n (V_j − v(t_j; P))²  (1.4)

and the least squares estimator

P_n = arg min_{P ∈ P(Θ)} J_n(V, P).  (1.5)

We remark that P_n is itself a random variable in that it is a function of the random variables V_j (and hence E_j). This dependence is generally suppressed with the exception of the subscripted n, but should be carefully noted, particularly in the consideration of the existence and consistency of the estimator (see below). The inverse problem is then to use realizations v_j of the random variables V_j to compute

P̂_n = arg min_{P ∈ P(Θ)} J_n(v, P) = arg min_{P ∈ P(Θ)} Σ_{j=1}^n (v_j − v(t_j; P))².  (1.6)

However, one cannot typically compute P̂_n as defined. In most practical problems, the model v(t; P) cannot be computed exactly and must be approximated by some numerical scheme (e.g., finite difference methods, Galerkin methods, etc.). Similarly, the space P(Θ) has (uncountably) infinitely many elements so that it must also be approximated. Thus, given a set of realizations {v_j} of the random variables V_j, what one computes in practice is

P^N_{n,M} = arg min_{P ∈ P^M(Θ)} J^N_n(v, P),  (1.7)

where v^N denotes the numerical approximation of the model (giving rise to the approximate cost functional J^N_n) and P^M(Θ) is a finite-dimensional approximation to P(Θ). The immediate question of interest is how these formal definitions relate back to the actual quantity of interest, the unknown 'true' probability measure P_0. In answering this question, it must be shown that the least squares estimator P_n defined by (1.5) is well-defined and subsequently that P^N_{n,M} converges (in some sense) to P̂_n as M and N grow large. Of course, the answer to this question depends largely upon the approximation schemes used. For instance, one could define P^M(Θ) to be the subset of the space of probability measures consisting of those measures with a specific parametric form. While this technique has the advantage of creating a standard nonlinear estimation problem, it may lead to inaccurate and misleading results unless there is strong evidence to suggest a particular parametric form for the unknown measure. In this document, we are concerned with nonparametric estimation, so that only a minimal set of restrictions is placed on the class of admissible measures. Finally, it must be shown that the estimator P̂_n converges to P_0 as the number of observations n increases. This is a question of the consistency of the least squares estimator P_n.
In developing a framework to address these issues, one encounters a rich body of mathematical theory. In this manuscript, it will be shown that the Prohorov Metric Framework (PMF) provides a natural setting in which to work. While this framework has been used extensively for computational construction of nonparametric estimates [5, 7, 9, 10, 12, 15, 16, 17], several theoretical components have remained unresolved. Here, we provide a proof of the existence (as a measurable function) and consistency of the least squares estimator underlying the computational framework. Particularly in the latter case, this provides a basis for additional work in defining and constructing confidence 'intervals' for the associated estimates [9].

The Prohorov Metric Framework
We begin with several general definitions and theorems which are meant to motivate the PMF and provide some background. In the interest of brevity, no proofs are given here for this motivating material. Extensive proofs are provided in the Technical Report version of this manuscript [21].
First, the Riesz Representation Theorem [25, pg. 357-358] on the space of bounded continuous functions is stated. This theorem can be used to characterize the weak* topology on the continuous dual of the space of bounded continuous functions, which provides an intuitive motivation for the weak topology on the space of probability measures. Consider the metric space Θ with its metric d, which we can write together as (Θ, d). Define the space C_B(Θ) = {f : Θ → R | f bounded, continuous}.
Theorem 2.1 (Riesz). Assume (Θ, d) is a compact (Hausdorff) space. For every f* ∈ C_B(Θ)* (the continuous dual of the space C_B(Θ)), there exists a unique finite signed Borel measure µ such that

f*(f) = ∫_Θ f(θ) dµ(θ) for all f ∈ C_B(Θ).

Given this identification, we may write f* = f*_µ when convenient. We see that the set P(Θ) of probability measures on (Θ, d) can be identified with a subset of C_B(Θ)*, and convergence of the functionals, f*_{µ_M}(f) → f*_µ(f) for every f ∈ C_B(Θ), corresponds to weak* convergence of the associated measures. When viewed in the context of P(Θ) ⊂ C_B(Θ)*, this is the weak convergence of measures known from the theory of probability and stochastic processes.
With this motivation, we now turn to the problem of characterizing the weak topology of measures using the Prohorov metric. This metric can be shown to metrize the weak topology of measures, and can thus be used to establish several desirable properties of the space of probability measures.

Definition 2.2. Let (Θ, d) be any metric space (not necessarily compact) and define the set C_B(Θ) as above. Given any probability measure P ∈ P(Θ) and some ǫ > 0, an ǫ-neighborhood of P is

B_ǫ(P) = {Q ∈ P(Θ) : |∫_Θ f dQ − ∫_Θ f dP| < ǫ, f ∈ F}  (2.1)

for a finite collection F ⊂ C_B(Θ) of test functions.

Comparing the Riesz Representation Theorem (Theorem 2.1) with the definition of B_ǫ(P), there is a clear connection between the open balls on P(Θ) and the weak topology of measures. In fact, we may take the collection of all open balls as the definition of the weak topology of measures [23, pg. 236]. Alternatively, there are several equivalent characterizations of the weak topology.
The weak topology of measures, in turn, gives rise to notions of weak (topological) convergence of measures.

Definition 2.4. Given a sequence of measures P_M ∈ P(Θ), M = 1, 2, ..., we say P_M converges weakly to P, written P_M w*→ P, if any one (and hence all) of the following equivalent conditions holds:

1. ∫_Θ f dP_M → ∫_Θ f dP for all f ∈ C_B(Θ);
2. lim sup_M P_M(F) ≤ P(F) for all closed F ⊂ Θ;
3. lim inf_M P_M(G) ≥ P(G) for all open G ⊂ Θ;
4. P_M(A) → P(A) for all Borel sets A ⊂ Θ with P(∂A) = 0.

The equivalence of the above notions of convergence is often referred to as the portmanteau theorem [23, pgs. 11-12]. We remark that the notation P_M w*→ P is slightly abusive as it implies weak* convergence when what is meant is the weak convergence of measures. Yet it should be emphasized that the two notions are equivalent on the space of probability measures.
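A minimal numerical illustration of these conditions (assuming Θ = R with the usual metric, an example added here for concreteness): the measures P_M = δ_{1/M} converge weakly to δ_0, since ∫ f dP_M = f(1/M) → f(0) for every bounded continuous f, even though P_M({0}) = 0 for every M while δ_0({0}) = 1. There is no contradiction with condition 4, since the set A = {0} satisfies δ_0(∂A) = 1 ≠ 0.

```python
import math

# P_M = delta_{1/M} -> delta_0 weakly: f(1/M) -> f(0) for continuous bounded f
f = lambda x: math.cos(x) + x**2
errors = [abs(f(1.0 / M) - f(0.0)) for M in (1, 10, 100, 1000)]
# yet P_M({0}) = 0 for all M while delta_0({0}) = 1: convergence of measures
# of Borel sets fails exactly on sets whose boundary carries limit mass
```

The decreasing errors exhibit condition 1; the failure on the set {0} shows why condition 4 must exclude sets with boundary mass.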
The above definitions and theorem provide several characterizations of the weak* topology on the set of probability measures. While this characterization is mathematically sufficient, our discussions of approximation and convergence would be facilitated by some metric ρ defined on the space P(Θ) which metrizes the above notions of topological convergence. That is, given two probability measures P and Q, we would like ρ to have the property that Q ∈ B_ǫ(P) if and only if ρ(P, Q) < ǫ. Such a metric could then be used to establish intuitive measures of convergence, compactness, etc., in the space of probability measures. In fact, such a metric does exist, named for the Russian probabilist Y.V. Prohorov who first defined the metric and derived its properties.
Definition 2.6. Let (Θ, d) be a metric space and let P(Θ) be the set of all probability measures on Θ. For any two measures P, Q ∈ P(Θ), the Prohorov metric ρ is

ρ(P, Q) = inf{ǫ > 0 : P(F) ≤ Q(F^ǫ) + ǫ and Q(F) ≤ P(F^ǫ) + ǫ for all F ∈ Σ_Θ},

where F^ǫ = {θ ∈ Θ : d(θ, F) < ǫ} denotes the ǫ-neighborhood of the set F. While this definition is far from intuitive, it gives rise to a number of desirable properties, namely that (1) the Prohorov metric, as defined, is in fact a metric, and (2) the Prohorov metric metrizes the weak topology of measures:

Theorem 2.7. Let (Θ, d) be a separable metric space. Then ρ is a metric on P(Θ).

With these results, we have obtained the desired result: the weak topology of measures (weak* topology) is equivalent to the topology induced by the Prohorov metric on the space of probability measures over a separable metric space (Θ, d). It should be noted that in the definition of the Prohorov metric, it is sufficient to consider only sets F which are closed (see [27, Online Supplement] for a proof), so that the definitions and results presented here are in agreement with similar results obtained previously [3, 4, 10, 15, 20]. We now proceed to use the Prohorov metric to characterize the properties of the space (P(Θ), ρ) which will prove useful in establishing results for the nonparametric estimation of measures. Define

D = {δ_θ : θ ∈ Θ}.

That is, D is the space of Dirac measures on Θ, defined for all F ∈ Σ_Θ as

δ_θ(F) = 1 if θ ∈ F, and δ_θ(F) = 0 otherwise.

Proposition 2.9. Let (Θ, d) be a separable metric space and define D ⊂ P(Θ) as above. Then a sequence {θ_k}_{k=1}^∞ is Cauchy in (Θ, d) if and only if the sequence {δ_{θ_k}}_{k=1}^∞ is Cauchy in (P(Θ), ρ).

Corollary 2.11. Let (Θ, d) be a separable metric space and let the space D be defined as above. Then D is closed in (P(Θ), ρ). (That is, D is weak* closed in the space of probability measures.)

Definition 2.12. P ∈ P(Θ) is tight if for all ǫ > 0 there exists a compact set K ⊂ Θ such that P(K) > 1 − ǫ. A family of measures Π ⊂ P(Θ) is tight if for all P ∈ Π, P is tight.
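While ρ is rarely computed in closed form, for finitely supported measures the defining condition can be checked directly. The sketch below is an illustration added here, not part of the cited framework: it approximates ρ for two discrete measures on an assumed metric space (R, |·|) by bisection over ǫ, using the facts that ρ ≤ 1 for probability measures and that the defining condition is monotone in ǫ, and that it suffices to test sets F inside the supports.

```python
from itertools import combinations

def prohorov_discrete(sup_p, w_p, sup_q, w_q, d, tol=1e-4):
    # approximate rho(P, Q) for finitely supported P, Q by bisection over eps
    def ok(sup_a, w_a, sup_b, w_b, eps):
        # check P(F) <= Q(F^eps) + eps for every F inside the support of P
        for r in range(1, len(sup_a) + 1):
            for F in combinations(range(len(sup_a)), r):
                mass_a = sum(w_a[i] for i in F)
                # mass of the open eps-neighborhood F^eps under the other measure
                mass_b = sum(w for x, w in zip(sup_b, w_b)
                             if min(d(x, sup_a[i]) for i in F) < eps)
                if mass_a > mass_b + eps + 1e-12:
                    return False
        return True

    lo, hi = 0.0, 1.0        # rho <= 1 always holds for probability measures
    while hi - lo > tol:     # the defining condition is monotone in eps
        mid = 0.5 * (lo + hi)
        if ok(sup_p, w_p, sup_q, w_q, mid) and ok(sup_q, w_q, sup_p, w_p, mid):
            hi = mid
        else:
            lo = mid
    return hi

# sanity check: for Dirac measures, rho(delta_a, delta_b) = min{d(a, b), 1}
dist = prohorov_discrete([0.0], [1.0], [0.3], [1.0], lambda x, y: abs(x - y))
```

The Dirac example also previews Proposition 2.9: small distances between parameter values correspond exactly to small Prohorov distances between the associated Dirac measures.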
In particular, if (Θ, d) is compact, then the choice K = Θ gives P(K) = 1 for all P ∈ Π, so that any family Π ⊂ P(Θ) is tight.
Theorem 2.14 (Prohorov). Assume (Θ, d) is separable and let Π ⊂ (P(Θ), ρ). The following are equivalent:

1. Π is tight;
2. Π is relatively compact in (P(Θ), ρ), that is, every sequence in Π contains a subsequence converging weakly in P(Θ).

The compactness of the space (P(Θ), ρ) given the compactness of (Θ, d) is of vital importance for the theoretical framework. In effect, one need only show that the cost functional J_n(v, P) in (1.6) is a continuous function of P in order to be guaranteed the existence of a minimizer to the least squares estimation problem. We need one final result which will be useful in establishing computational tools for the parameter estimation problem.
Theorem 2.17. Assume (Θ, d) is a separable, compact metric space. Let {θ_k}_{k=1}^∞ be an enumeration of a countable dense subset of Θ. Take Q ⊂ R to be the set of all rational numbers. Define

P̃(Θ) = { Σ_{k=1}^M p_k δ_{θ_k} : M ∈ N, p_k ∈ Q ∩ [0, 1], Σ_{k=1}^M p_k = 1 }.

That is, P̃(Θ) is the collection of all convex combinations of Dirac measures on Θ with rational weights. Then P̃(Θ) is dense in P(Θ), and thus P(Θ) is separable.
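The density result can be observed numerically. In the sketch below (the truncated-Gaussian target on Θ = [−3, 3] and the test function are assumptions chosen for illustration), expectations under discrete measures supported on M grid nodes approach the expectation under the target measure as M grows, which is exactly weak convergence tested against a bounded continuous function.

```python
import numpy as np

# target measure P on Theta = [-3, 3]: truncated standard Gaussian (assumed
# here purely for illustration); approximate by M weighted Dirac measures
density = lambda t: np.exp(-t**2 / 2.0)
f = lambda t: np.cos(t)                    # bounded continuous test function

# reference value of E_P[f] from a very fine grid
t = np.linspace(-3.0, 3.0, 200001)
exact = float(np.sum(f(t) * density(t)) / np.sum(density(t)))

errors = []
for M in (5, 50, 500):
    nodes = np.linspace(-3.0, 3.0, M)      # M Dirac nodes in Theta
    w = density(nodes)
    w = w / w.sum()                        # discrete weights, summing to 1
    errors.append(abs(float(np.sum(w * f(nodes))) - exact))
```

Here the weights are not rational, but rounding them to rationals changes each expectation by an arbitrarily small amount, which is why the rational restriction in Theorem 2.17 costs nothing.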
Taken together, these results establish that if (Θ, d) is separable and compact, then (P(Θ), ρ) is also compact and separable. This, in turn, can be used to establish a number of useful results for the original estimation problem. These results are presented in the next two sections.

Existence of the Estimator
In this section we present new results for the existence of estimators in the Prohorov Metric Framework. We begin by proving the existence of P_n and P̂_n as measurable functions mapping a subset of R^n (that is, the data) into the space of probability measures on Θ. We remark that the statement of Theorem 3.1 concerns the estimate P̂_n obtained from the data realizations v ∈ R^n. This is sufficient to establish the existence of the estimator P_n as a measurable function as well, since the random vector V is by definition a measurable function from a probability triple into R^n, and the composition of measurable functions is measurable.

Theorem 3.1. Define the function J_n : R^n × P(Θ) → R according to Equation (1.6). Assume (Θ, d) is separable and compact and take the space of probability measures P(Θ) with the Prohorov metric ρ. Assume further that J_n(·, P) is a measurable function from R^n → R for each P ∈ P(Θ), and that J_n(v, ·) : P(Θ) → R is continuous for each v ∈ R^n. Then there exists a measurable function P̂_n : R^n → P(Θ) such that

J_n(v, P̂_n(v)) = inf_{P ∈ P(Θ)} J_n(v, P) for all v ∈ R^n.

Proof. Let {θ_k}_{k=1}^∞ be an enumeration of the countable dense subset of Θ. For each M ≥ 1, define

P^M = { Σ_{k=1}^M p_k δ_{θ_k} : p_k ∈ Q ∩ [0, 1], Σ_{k=1}^M p_k = 1 }.

(That is, P^M is the set of all discrete measures consisting of a convex combination of M Dirac measures weighted with rational coefficients.) Thus P^M is countable. Let {P^M_j}_{j=1}^∞ be an enumeration of the elements of P^M. (We remark that, because the M nodes θ_k are fixed in advance, the space P^M can be analogously considered as a subset of R^M, a fact which will be exploited in some of the notation below.) Finally, define P^M_J = {P^M_j}_{j=1}^J, the first J enumerated elements of P^M.
Fix J ≥ 1. Define the function P^M_J(v) implicitly as

J_n(v, P^M_J(v)) = min_{1 ≤ j ≤ J} J_n(v, P^M_j).

Such a function must exist because the minimum is being taken over a finite number of elements from a point set; if the minimum occurs at multiple elements of P^M_J, we may arbitrarily choose the element which comes first in the enumeration so that the function P^M_J(v) is well-defined. First, we show that P^M_J(v) is measurable. Let F ⊂ P^M_J. (Thus F is a finite point set.) We must show that the set B defined as

B = {v ∈ R^n : P^M_J(v) ∈ F}

is measurable. Since F is a finite point set, we can define for each P^M_j ∈ F the sets

B_j = {v ∈ R^n : P^M_J(v) = P^M_j}.

By assumption, the functions J_n(v, P^M_j) are measurable from R^n into R for all P^M_j, j ≥ 1. The minimum over a finite set of functions is also measurable, as is the test for equality between two measurable functions. Hence B_j ∈ Σ_{R^n}. Finally, B = ∪B_j, the union being over the finite number of sets B_j, hence B ∈ Σ_{R^n} and the function P^M_J(v) is measurable.

As mentioned previously, we can identify the function P^M_J(v) with its vector of weights (p^{M,J}_1(v), ..., p^{M,J}_M(v)) ∈ [0, 1]^M. Let p^{M,J}_1(v) be the first component of the vector representation for P^M_J(v) and consider the sequence {p^{M,J}_1(v)}_{J=1}^∞. Since each P^M_J(v) is a measurable function, so is each p^{M,J}_1(v). Also, since the space [0, 1]^M is compact, there must exist a convergent subsequence of (the vector representation of) P^M_{j_l}(v) to some vector (p̄^M_1(v), p̄^M_2(v), ..., p̄^M_M(v)), which can be identified with the measure P̄^M(v). Now

J_n(v, P̄^M(v)) = lim_{l→∞} J_n(v, P^M_{j_l}(v)) = lim_{J→∞} min_{1 ≤ j ≤ J} J_n(v, P^M_j) = inf_{P ∈ P^M} J_n(v, P).

The first equality comes from the definition of P̄^M and the continuity of the function J_n; the second equality comes from the definition of the probability measures P^M_{j_l}; the final equality arises from the density of {P^M_j} in P^M. Now, define (with some abuse of notation) J^{(J,M)}_n(v) = min_{1 ≤ j ≤ J} J_n(v, P^M_j). Applying the same arguments above inductively on J^{(J,M)}_n, we obtain a set of measurable functions p̄^M_1(v), ..., p̄^M_M(v) such that J_n(v, P̄^M(v)) = inf_{P ∈ P^M} J_n(v, P), and we have proven the existence of a measurable function P̄^M ∈ P^M mapping R^n → P(Θ) which minimizes the cost functional J_n over P^M. We conclude the proof by repeating the subsequence and continuity argument as M → ∞, using the density of ∪_M P^M in P(Θ) (Theorem 2.17) and the compactness of (P(Θ), ρ).

Consistency of the Estimator

We now turn to the consistency of the estimator P_n defined in (1.5). The following assumptions will be used throughout this section:

(A1) The random variables E_j, j = 1, 2, ..., are defined on a common probability triple (Ω, Σ_Ω, P_Ω).

(A2) The E_j are independent and identically distributed with E[E_j] = 0 and Var(E_j) = σ² < ∞.

(A3) The set Θ of admissible parameters is compact and separable.

(A4) The set T of values of the independent variable, with t_j ∈ T for all j, is compact.

(A5) The map (t, P) → v(t; P) is continuous on T × P(Θ).

(A6) There exists a measure µ on T such that

(1/n) Σ_{j=1}^n g(t_j) → ∫_T g(t) dµ(t) as n → ∞

for all g ∈ C(T).

(A7) The functional

J_0(P) = σ² + ∫_T (v(t; P_0) − v(t; P))² dµ(t)

is uniquely minimized over P(Θ) at P = P_0.
Assumption (A1) establishes the probability triple on which the error random variables E_j are assumed to be defined. As we will see, this probability triple will permit us to make probabilistic statements regarding the consistency of the estimator P_n. These assumptions as well as the two theorems below follow closely the theoretical results of [14] which establish the consistency of the ordinary least squares estimator for a traditional nonlinear least squares problem. The key idea is to first argue that the functions J_n(V; P) converge to J_0 as n increases; then the minimizer P_n of J_n should converge to the unique minimizer P_0 of J_0 [1].
Because the functions J_n are functions of the vector V, which itself depends on the random variables E_j, these functions are themselves random variables, as are the estimators P_n. Though we have generally refrained from doing so up to this point, it will occasionally be convenient to evaluate these functions at points in the underlying probability triple. Thus we may write J_n(V; P)(ω), E_j(ω), etc., whenever the particular value of ω is of interest.
The following result is stated without proof in [18]. We give a proof here for the sake of completeness.

Theorem 4.1. Under assumptions (A1)-(A7), there exists a set A ∈ Σ_Ω with P_Ω(A) = 1 such that for all ω ∈ A,

(1/n) J_n(V; P)(ω) → J_0(P)

as n → ∞ and for each P ∈ P(Θ). Moreover, the convergence is uniform on P(Θ).
Proof. As in [14], the proof will proceed in three parts. First, for any fixed element P ∈ P(Θ), a set A_P is constructed with P_Ω(A_P) = 1 such that the convergence statement holds. The sets A_P are then used to construct a set A as described. Finally, the uniform convergence is shown.
Let P ∈ P(Θ) be fixed. Since V_j = v(t_j; P_0) + E_j, we may rewrite

(1/n) J_n(V; P) = (1/n) Σ_{j=1}^n E_j² + (2/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P)) E_j + (1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P))².

We consider the three terms on the right. For the first term, define

B_1 = {ω ∈ Ω : (1/n) Σ_{j=1}^n E_j(ω)² → σ²}.

By the Strong Law of Large Numbers, P_Ω(B_1) = 1. For the third term, observe that

(1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P))² → ∫_T (v(t; P_0) − v(t; P))² dµ(t)

by assumption (A6) and the continuity of v(t; ·). (Note also that this convergence is independent of ω ∈ Ω.) For the second term, define Ẽ_j = (v(t_j; P_0) − v(t_j; P)) E_j. Then

Var(Ẽ_j) = (v(t_j; P_0) − v(t_j; P))² σ² ≤ Kσ²,

where the final inequality follows from the continuity of v and the compactness of T. Hence we have Σ_{j=1}^∞ Var(Ẽ_j)/j² < ∞, and therefore the set B_P defined by

B_P = {ω ∈ Ω : (1/n) Σ_{j=1}^n Ẽ_j(ω) → 0}

satisfies P_Ω(B_P) = 1 by Kolmogorov's Law of Large Numbers. Finally, we may define A_P = B_1 ∩ B_P. Then P_Ω(A_P) = 1 and (1/n) J_n(V; P)(ω) → J_0(P) for each ω ∈ A_P, which completes the first part of the proof.
For the second part of the proof, we must find a set A with P_Ω(A) = 1 such that (1/n) J_n(V; P)(ω) → J_0(P) for each ω ∈ A and for all P ∈ P(Θ). Naively, we desire A = ∩A_P, but this intersection is (in general) uncountable. Rather, we construct the set A using the dense countable subset of P(Θ) (Theorem 2.17). Define

A_1 = {ω ∈ Ω : (1/n) Σ_{j=1}^n E_j(ω)² → σ²}.

Again by the Strong Law of Large Numbers, P_Ω(A_1) = 1. Now define the set P̃(Θ) as before and set

A = A_1 ∩ ( ∩_{P̃ ∈ P̃(Θ)} A_P̃ ).

Since the intersection is taken over a countable number of sets, each having probability one (with respect to P_Ω), P_Ω(A) = 1. To complete the second part of the proof, we must show that A ⊂ A_P for all P ∈ P(Θ) (and not merely for all P ∈ P̃(Θ), which holds by the definition of A). If this is the case, then (1/n) J_n(V; P)(ω) → J_0(P) for all ω in A and for all P ∈ P(Θ).
Consider any P ∈ P(Θ) and take ω ∈ A, ǫ > 0. Since ω ∈ A, ω ∈ A_1 and we may choose n_1 such that for all n ≥ n_1,

|(1/n) Σ_{j=1}^n E_j(ω)² − σ²| < ǫ.

By the continuity of v and the density of P̃(Θ) in P(Θ), we may choose P^M ∈ P̃(Θ) such that

|v(t; P^M) − v(t; P)| < ǫ for all t ∈ T.
Finally, ω ∈ A implies ω ∈ A_{P^M}, which in turn implies ω ∈ B_{P^M}. Thus we may choose n_2 such that for all n ≥ n_2,

|(1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P^M)) E_j(ω)| < ǫ.

Then for n ≥ max{n_1, n_2},

|(1/n) J_n(V; P)(ω) − J_0(P)| ≤ |(1/n) Σ_{j=1}^n E_j(ω)² − σ²| + 2 |(1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P)) E_j(ω)| + |(1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P))² − ∫_T (v(t; P_0) − v(t; P))² dµ(t)|.

The first term goes to zero since ω ∈ A implies ω ∈ B_1. The final term goes to zero by assumptions (A5) and (A6). For the second term,

|(1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P)) E_j(ω)| ≤ |(1/n) Σ_{j=1}^n (v(t_j; P_0) − v(t_j; P^M)) E_j(ω)| + |(1/n) Σ_{j=1}^n (v(t_j; P^M) − v(t_j; P)) E_j(ω)| < ǫ + ǫ (1/n) Σ_{j=1}^n |E_j(ω)|,

and the latter sum is bounded for ω ∈ A_1 by the Cauchy-Schwarz inequality. Thus (1/n) J_n(V; P)(ω) → J_0(P) and thus ω ∈ A_P. Thus A ⊂ A_P for all P ∈ P(Θ) and the second part of the proof is complete.
Finally, we must show the convergence is uniform on P(Θ) for ω ∈ A. To do so we will show that the sequence of functions (1/n) J_n(V; P)(ω) is equicontinuous (viewed as functions of P) and then use the Arzela-Ascoli Theorem. For fixed ω ∈ A, let ǫ > 0 and take P ∈ P(Θ). By the continuity of v (A5) and compactness of T (A4), there exists a δ > 0 such that

sup_{t ∈ T} |v(t; P') − v(t; P)| < ǫ for all P' ∈ B_δ(P).

Since ω ∈ A, ω ∈ A_1 and we can choose N such that (1/n) Σ_{j=1}^n E_j(ω)² is bounded for all n ≥ N. Then for n ≥ N and for all P' ∈ B_δ(P),

|(1/n) J_n(V; P')(ω) − (1/n) J_n(V; P)(ω)| ≤ Cǫ

for a constant C independent of n. Thus the sequence of functions (1/n) J_n(V; P)(ω) is equicontinuous for each ω ∈ A and by the Arzela-Ascoli Theorem, (1/n) J_n(V; P)(ω) → J_0(P) uniformly on compact subsets of P(Θ), and hence on P(Θ) itself.

Theorem 4.2. Under assumptions (A1)-(A7), the estimators P_n w*→ P_0 as n → ∞ with probability 1. That is,

P_Ω({ω ∈ Ω : P_n(ω) w*→ P_0}) = 1.

Proof. Take the set A as in the previous theorem and fix ω ∈ A. Then by the previous theorem, (1/n) J_n(V; P)(ω) → J_0(P) for all P ∈ P(Θ). Let δ > 0 be arbitrary and define O = B_δ(P_0). Then O is open in P(Θ) (in the subspace topology) and O^C is compact (again, in the subspace topology). Since P_0 is the unique minimizer of J_0(P) by assumption (A7), there exists ǫ > 0 such that J_0(P) − J_0(P_0) > ǫ for all P ∈ O^C. By the previous theorem, there exists n_0 such that for n ≥ n_0,

|(1/n) J_n(V; P)(ω) − J_0(P)| < ǫ/4 for all P ∈ P(Θ).

Then for n ≥ n_0 and P ∈ O^C,

(1/n) J_n(V; P)(ω) > J_0(P) − ǫ/4 > J_0(P_0) + 3ǫ/4 > (1/n) J_n(V; P_0)(ω) + ǫ/2.

But J_n(V; P_n)(ω) ≤ J_n(V; P_0)(ω) by definition of P_n. Hence we must have P_n(ω) ∈ O for all n ≥ n_0, which implies P_n(ω) w*→ P_0 since δ > 0 was arbitrary.

Theorem 4.2 establishes the consistency of the estimator (1.5). Given a set of data v, it follows that the estimate P̂_n corresponding to the estimator P_n will converge to the true distribution P_0 under the stated assumptions. We remark that these assumptions are not overly restrictive (compare [14, 19, 24]), though some of the assumptions may be difficult to verify in practice. Assumptions (A3)-(A5) are mathematical in nature and may be verified directly for each specific
problem. Assumptions (A1) and (A2) describe the error process which is assumed to generate the collected data. While it is unlikely that one will be able to prove a priori that the error process satisfies these assumptions, posterior analysis such as residual plots [22, Ch. 3] can be used to investigate the appropriateness of the assumptions of the statistical model. Assumption (A6) reflects the manner in which data is sampled and, together with Assumption (A7), constitutes an identifiability condition for the model. The limiting sampling distribution function µ may be known if the experimenter has complete control over the values t_j of the independent variables (e.g., if the t_j are measurement times) but this is not always the case.
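The convergence (1/n)J_n(V; P) → J_0(P) at the heart of the consistency argument is easy to observe in simulation. The sketch below assumes, purely for illustration, an exponential-decay population model, a uniform sampling measure on T = [0, 2], and Gaussian noise; at P = P_0 the limit is J_0(P_0) = σ², so the mean squared residual should settle near σ² as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed population model v(t; P) = sum_k w_k exp(-q_k t) for a discrete P0
nodes0, w0 = np.array([0.5, 1.5]), np.array([0.5, 0.5])
v = lambda t, nodes, w: np.exp(-np.outer(t, nodes)) @ w

sigma = 0.1
devs = []
for n in (100, 100_000):
    t = rng.uniform(0.0, 2.0, size=n)       # draws from the sampling measure mu
    V = v(t, nodes0, w0) + sigma * rng.normal(size=n)
    Jn_over_n = np.mean((V - v(t, nodes0, w0))**2)
    devs.append(abs(float(Jn_over_n) - sigma**2))   # J0(P0) = sigma^2
```

Evaluating the same quantity at a measure P ≠ P_0 would instead approach σ² plus the strictly positive integral term in J_0, which is what separates P_0 from its competitors in Theorem 4.2.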

Computational Convergence
The novel results in the previous two sections establish the desirable property of consistency of the estimator P_n as a measurable function mapping the data observation process to the space of probability measures. However, it is generally not possible to directly solve the optimization problems (1.5) or (1.6) for P_n or P̂_n as a function of V or v. As a result, approximate (generally numerical) methods must be used in order to solve (1.7) and obtain an approximate estimate P^N_{n,M}. We must ascertain, then, how the approximate estimate P^N_{n,M} relates to the exact estimate P̂_n (for any fixed value of n). The following result establishes the computational convergence of the Prohorov Metric Framework.
Together with the results of the previous two sections, these results establish a comprehensive body of theory for the least squares estimation of the measure P 0 that is assumed to have generated the observed data.
Theorem 5.1. Let (Θ, d) be a compact, separable metric space and consider the space (P(Θ), ρ) of probability measures on Θ with the Prohorov metric, as before. Let P^M(Θ) be as defined after the proof of Theorem 3.1. Assume

1. the map P → J^N_n(v, P) is continuous for all n, N;
2. for any sequence of probability measures P_k → P in P(Θ), v^N(t; P_k) → v(t; P) as N, k → ∞;
3. v(t; P) is uniformly bounded for all t, P.
Then there exist minimizers P^N_{n,M} satisfying (1.7). Moreover, for fixed n, there exists a subsequence (as M, N → ∞) of the approximate estimates P^N_{n,M} which converges to some P*_n which satisfies (1.6).

This theorem provides a set of conditions under which a sequence of approximate estimates P^N_{n,M} converges to the estimate P̂_n of interest. This estimate is itself a realization (for a particular data set) of the estimator P_n which has been shown to exist and to be consistent, so that P_n → P_0 with probability one. Thus we are assured that a computed measure P^N_{n,M} is an accurate estimate of the true distribution P_0. The assumptions of Theorem 5.1 are not restrictive. In typical problems (and, indeed, in the assumptions of other theorems appearing in this document) it is assumed that the parameter space Θ as well as the independent variable space T are compact. In such a case, Assumptions 1 and 3 above are satisfied if the individual model solutions y(t; θ) are continuous on T × Θ. Assumption 2 is then simply a condition on the convergence of the numerical procedure used in obtaining model solutions.
Significantly, the Prohorov Metric Framework is computationally constructive. In practice, one does not construct a sequence of estimates for increasing values of M and N; rather, one fixes the values of M and N to be sufficiently large to attain a desired level of accuracy. By Theorem 2.17, we need only to have some enumeration of the elements of P^M(Θ) in order to compute an approximate estimate P^N_{n,M}. Practically, this is accomplished by selecting M nodes in Θ, {θ^M_k}_{k=1}^M. The optimization problem (1.7) is then reduced to a standard constrained estimation problem over Euclidean M-space in which one determines the values of the weights p^M_k corresponding to each node. Thus,

P^N_{n,M} = arg min_{P ∈ P^M(Θ)} J^N_n(v, P) = arg min_{p_k ≥ 0, Σ_k p_k = 1} Σ_{j=1}^n (v_j − Σ_{k=1}^M p_k v^N(t_j; θ^M_k))²,

where in the final line we seek the weights p̄^M_k. These are sufficient to characterize the approximating discrete estimate P^N_{n,M} since the nodes are assumed to be fixed in advance. Moreover, define H = A^T A and f = A^T v, where A is the matrix with entries A_{jk} = v^N(t_j; θ^M_k). Then one can equivalently compute [10]

p̄^M = arg min_{p_k ≥ 0, Σ_k p_k = 1} (p^T H p − 2 f^T p).  (5.1)

As M grows large, the quadratic optimization problem (5.1) becomes poorly conditioned [10]. Thus there is a trade-off: M must be chosen sufficiently large so that the computational approximation is accurate, but not so large that ill-conditioning leads to large numerical errors. The efficient choice of M as well as the choice of the nodes {θ_k}_{k=1}^M is an open research problem.
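The reduction to a constrained problem over the weights can be sketched concretely. Everything below is an assumed toy setup (node grid, exponential individual model, synthetic data), and the projected-gradient solver is one simple choice for the constrained minimization rather than the method of the cited references.

```python
import numpy as np

rng = np.random.default_rng(1)

def project_simplex(p):
    # Euclidean projection onto {p : p_k >= 0, sum_k p_k = 1}
    u = np.sort(p)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(p.size) + 1.0) > 0)[0][-1]
    return np.maximum(p - css[rho] / (rho + 1.0), 0.0)

def estimate_weights(A, v, iters=5000):
    # minimize ||A p - v||^2 over the simplex by projected gradient descent;
    # A[j, k] = v^N(t_j; theta_k) is the model evaluated at the k-th node
    p = np.full(A.shape[1], 1.0 / A.shape[1])
    step = 1.0 / np.linalg.norm(A.T @ A, 2)
    for _ in range(iters):
        p = project_simplex(p - step * (A.T @ (A @ p - v)))
    return p

# toy setup: y(t; q) = exp(-q t), true P0 = 0.3 delta_{0.5} + 0.7 delta_{2.0}
t = np.linspace(0.0, 3.0, 200)
nodes = np.linspace(0.1, 3.0, 30)            # M = 30 fixed nodes in Theta
A = np.exp(-np.outer(t, nodes))
truth = 0.3 * np.exp(-0.5 * t) + 0.7 * np.exp(-2.0 * t)
data = truth + 0.001 * rng.normal(size=t.size)

p_hat = estimate_weights(A, data)            # estimated weights over the nodes
```

Even in this small example the columns of A are highly correlated, so the fitted residual is small while the recovered weights may smear mass across neighboring nodes, which is precisely the ill-conditioning trade-off discussed above.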
It should be acknowledged that the uniqueness of the computational problem (i.e., when H is positive definite) is not sufficient to ensure the uniqueness of the limiting estimate P*_n in Theorem 5.1 (as there could be multiple convergent subsequences). However, if J_n(v; P) is uniquely minimized, then every subsequence of P^N_{n,M} which converges must converge to that unique minimizer. Moreover, under assumptions (A1)-(A7), it has been shown that (1/n) J_n(v, P) → J_0(P) (as n grows large) with probability one, and the function J_0(P) is assumed to be uniquely minimized by P_0.

Concluding Remarks
In this document we have defined a parameter estimation problem in which one has a mathematical model describing the dynamics of an individual biological or physical process but data which is sampled from a population of individuals. Because each individual is assumed to be described by a unique set of parameters, the data is described not by a single parameter but by the probability distribution (over all individuals) from which these parameters are sampled.
Theoretical results for the nonparametric measure estimation problem are presented which establish the existence and consistency of the nonparametric estimator. Combined with established computational/approximation techniques, these results form a comprehensive theoretical basis for the nonparametric least squares estimation of a probability measure.
Several open problems remain. First, while the computational scheme is simple, it is not clear how one should go about choosing the M nodes θ_k from the dense subset of Θ which are then used to estimate weights p_k. From a theoretical perspective, the nodes need only to be added so that they 'fill up' the parameter space in an appropriate way. In practice, however, rounding error and ill-conditioning can be quite problematic, particularly for a poor choice of nodes. A more complete computational algorithm will include information on how to optimally choose the M nodes θ_k (as well as the value of M). Some results in these directions can be found in [8, 10, 11].
Additionally, given the consistency of the estimator P_n, it would be desirable to place some measure of confidence on the estimated probability distribution. The traditional frequentist approach relies on either asymptotic theory or bootstrapping to construct such measures of confidence. In the former case, it is not clear how one might extend notions of sensitivity to the space of probability measures, which would require a notion of differentiability on the space of probability measures. In the latter case, the results provide some computational estimates but do not enjoy any theoretical rigor. Some work on these topics has been considered [8, 9, 11, 13] and is ongoing.
From this reformulation, it is clear that the approximate problem (1.7) has a unique solution if H is positive definite. If the individual mathematical model (1.2) is independent of P (see [2, Sec. 14.1.2] for a complete discussion), then the matrices H and f can be precomputed in advance. Then one can rapidly (and exactly) compute the gradient and Hessian of the objective function in a numerical optimization routine.
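The precomputation just described can be illustrated directly (A and v below are stand-in random arrays, not data from the references): with H = A^T A and f = A^T v formed once, ||Ap − v||² equals p^T H p − 2 f^T p plus the constant v^T v, so the gradient 2(Hp − f) and the Hessian 2H are exact and essentially free to evaluate.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 8))       # stand-in for A[j, k] = v^N(t_j; theta_k)
v = rng.normal(size=50)            # stand-in for the data vector

H, f = A.T @ A, A.T @ v            # precomputed once, reused every iteration

obj = lambda p: p @ H @ p - 2.0 * f @ p      # equals ||A p - v||^2 - v @ v
grad = lambda p: 2.0 * (H @ p - f)           # exact gradient; Hessian is 2H

p = rng.random(8)
direct = float(np.sum((A @ p - v) ** 2) - v @ v)   # residual form for comparison
```

Because the objective is quadratic, these closed-form derivatives are exact rather than approximate, which is what makes Newton-type routines fast for problem (5.1).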