NON-RECURSIVE ENUMERATION OF HETEROCHRONOUS TREES

Abstract: The enumeration formula of a non-contemporaneous genealogy with total sample size $n = n_1 + n_2$ requires a nested sum-product. The set of ancestral patterns in the non-contemporaneous genealogy yields a multiplicity factor that translates from the set of ancestral patterns in the isochronous genealogy. A computation formula of the multiplicity factor proves to be non-recursive. Evaluation of small sample sizes demonstrates the emergent complexity. Extension to the enumeration formula in the heterochronous genealogy with $m$ samples of total size $n = n_1 + \cdots + n_m$ yields a non-recursive nested sum-product. These enumeration formulae measure sample spaces of Bayesian prior distributions of trees relevant to theoretical and computational phylogenetics.


Introduction

Background
A tree is a connected graph that does not contain cycles, in which there exists a path from each vertex to every other vertex such that forward paths do not return to any vertex. Consider binary trees for which coalescence occurs pairwise only, such that two edges subtend every internal vertex (that is, every vertex excluding the tips) and one edge extends above every vertex except the root. This constraint remains normative in many areas of coalescent theory and its applications ([1], [2], [3]), including phylogenetics ([4], [5], [6], [7], [8], [9], [10]), although multicoalescent processes are also an active area of research [11]. The alternative class of phylogenetic models used in some of these works applies stochastic branching processes to model the growth of a tree, and these possess technical similarities and dissimilarities to the coalescent process (Section 9.2, [12], [13], [14]).
When the tips of a tree are all contemporaneous, an isochronous sample results. Extension of the coalescent process to heterochronous samples allowed multiple collections at successive epochs ([15], [16], [17], [18]). The enumeration of heterochronous genealogies proven in Sections 2 and 3 turns out to be iterative, in the form of nested sum-products, thereby disproving a recursion conjectured in the literature of computational biology (equation 1, [19]; Section 2.2, [20]). Recursions in theoretical computer science and coalescent theory are attractive for efficient algorithm design. There is a distinction between recursions for the enumeration of state spaces and those for probability distributions of conditional ancestral Markov chains given a genetic sample (see, for example, Table 1 in [5], which contains an application of a branching-coalescing ancestral recombination graph).
When differences in the genetic diversity of two samples drawn from the same population at different times are (in part) a consequence of mutation, that population is said to be measurably evolving [16]. That is, it is any population with mutation rates high enough (rapidly evolving pathogens) or with historical records deep enough (ancient DNA) that a statistically significant accumulation of substitutions between serially sampled data can be detected [18]. For isochronous data, the rate at which lineages coalesce is inversely proportional to the product of population size and a primitive mutation rate when time is measured in units of mutational substitutions. With heterochronous data, it is possible to decouple population size from mutation rate and estimate each separately [15]. A software package ([21], [22], [23]) that includes functionality for statistical inference on heterochronous genomic data is applied in the phylogenetic studies cited above ([4]-[9]). There is also a Bayesian simulation package [24] that generates heterochronous genetic data according to transition probabilities and inter-arrival times calculated with the heuristic possibility of double simultaneous pair-wise coalescences. Theoretical phylogenetics encompasses various techniques of enumeration and distance metrics on trees (see, for example, [25]) and is a modern research field of discrete mathematics [12]. Derivation of some recursive enumeration formulae [26] in Subsection 1.2 will enable description of useful elementary graph theoretical methods.

Theoretical Introduction
In the case of unlabelled rooted topologies, the enumeration formula is

$$a_n = \sum_{i=1}^{(n-1)/2} a_i\, a_{n-i}$$

for $n$ odd, and

$$a_n = \sum_{i=1}^{n/2-1} a_i\, a_{n-i} + \frac{a_{n/2}\,(a_{n/2}+1)}{2} \tag{1.1}$$

for $n$ even, where $a_n$ in (1.1) denotes the number of such tree shapes with $n$ tips. The criterion of recursion is the sum of the number of subtrees with $(n - i)$ and $i$ tips, for $i = 1, 2, \ldots, (n - 1)/2$ ($n$ odd) or $n/2$ ($n$ even). Proof follows by induction, with early terms readily verified with sketches that successively use the comb ($n = 3$) and balanced ($n = 2$) topology. Successive terms of the recursion can be seen by inspection until the symmetric shape, that is, when the topology is subdivided equally into two branches that split off from the root and each consist of $n/2$ tips. Two observations then hold: (i) the same shape in both subtrees occurs in $a_{n/2}$ ways; (ii) a different shape in each subtree occurs in $a_{n/2}(a_{n/2} - 1)$ ways, except that left and right subtree shapes do not count when interchanged, which introduces a factor of $1/2$. Therefore, their sum yields $a_{n/2}(a_{n/2} + 1)/2$, the total number of symmetric shapes.
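The recursion for unlabelled rooted shapes can be checked numerically. The following Python sketch is illustrative only: the function name `tree_shapes` is hypothetical, and the base case $a_1 = 1$ (a single tip) is an assumption needed to start the recursion.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tree_shapes(n: int) -> int:
    """Number a_n of unlabelled rooted binary tree shapes with n tips."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return 1  # assumed base case: a single tip
    # Root splits into subtrees with i and n - i tips, where i < n - i.
    total = sum(
        tree_shapes(i) * tree_shapes(n - i)
        for i in range(1, (n - 1) // 2 + 1)
    )
    if n % 2 == 0:
        # Symmetric split: both subtrees have n/2 tips, and swapping the
        # left and right subtrees does not give a new shape.
        half = tree_shapes(n // 2)
        total += half * (half + 1) // 2
    return total


print([tree_shapes(n) for n in range(1, 9)])
# prints [1, 1, 1, 2, 3, 6, 11, 23]
```

Memoisation with `lru_cache` stores each $a_i$ once, so every term of the sum is reused rather than recomputed.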
In the case of tip-labelled rooted topologies, the enumeration formula is

$$|T_n| = (2n-3)\,|T_{n-1}| = (2n-3)!! = \frac{(2n-2)!}{2^{n-1}\,(n-1)!}. \tag{1.2}$$

Here $T_n$ in (1.2) denotes the set of all such tree shapes with $n$ tips. The criterion of recursion is the number of edges, plus one corresponding to the root, in a tip-labelled topology. A new tip-labelled topology with $n$ tips is obtained by appending a new edge to the possible tip-labelled topologies with $n - 1$ tips. A new edge can be appended to any pre-existing edge or back from the root. A topology with $n - 1$ tips has $2n - 4$ edges, so there are $2n - 3$ positions for the new edge.
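As a small numerical check of the edge-appending argument, the sketch below builds the count of tip-labelled rooted topologies by multiplying the number of attachment positions at each step and compares it with the closed form $(2n-2)!/(2^{n-1}(n-1)!)$. The function names are hypothetical and the base case of one topology for a single tip is assumed.

```python
from math import factorial


def labelled_topologies(n: int) -> int:
    """Count tip-labelled rooted binary topologies by appending tips."""
    count = 1  # assumed base case: one topology with a single tip
    for k in range(2, n + 1):
        # A topology with k - 1 tips has 2k - 4 edges; adding one position
        # above the root gives 2k - 3 places to attach the k-th tip.
        count *= 2 * k - 3
    return count


def labelled_topologies_closed(n: int) -> int:
    """Closed form (2n - 2)! / (2^(n - 1) (n - 1)!)."""
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))


print([labelled_topologies(n) for n in range(1, 7)])
# prints [1, 1, 3, 15, 105, 945]
```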
Tip-labelled ranked trees, also called ordered histories or isochronous genealogies, describe the ancestral pattern from tips to root that identifies the order of internal vertices in the tree. In this case, the enumeration formula is

$$|R_n| = \binom{n}{2}\,|R_{n-1}| = \prod_{k=2}^{n}\binom{k}{2} = \frac{n!\,(n-1)!}{2^{n-1}}, \tag{1.3}$$

where $R_n$ in (1.3) denotes the set of all isochronous genealogies with $n$ tips. The criterion of recursion is that pair-wise coalescence events define the sequence of internal vertices of the genealogical tree by the lineages involved.
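The count of isochronous genealogies can likewise be sketched as a product over successive pair-wise coalescence events, with $\binom{k}{2}$ choices of pair while $k$ lineages remain. The function names below are hypothetical; the closed form $n!(n-1)!/2^{n-1}$ is used only as a cross-check.

```python
from math import comb, factorial


def ranked_genealogies(n: int) -> int:
    """Count isochronous genealogies with n tips."""
    count = 1
    for k in range(2, n + 1):
        # With k lineages present, any of the C(k, 2) pairs may coalesce.
        count *= comb(k, 2)
    return count


def ranked_genealogies_closed(n: int) -> int:
    """Closed form n! (n - 1)! / 2^(n - 1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


print([ranked_genealogies(n) for n in range(1, 7)])
# prints [1, 1, 3, 18, 180, 2700]
```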
Fully-ranked tip-labelled trees, also known as heterochronous genealogies, have non-contemporaneous samples. Multiple sample collection events increase the number of extant lineages in the genealogical tree. Pairs of lineages involved at coalescence events proceeding back from each collection are taken uniformly at random from all lineages that remain in the tree, due to the assumption of exchangeability. The lineages that coalesce during an interval between consecutive collections will not necessarily involve those lineages added to the tree at the latest collection. The ancestral pattern is realised irrespective of the constraints implied by a tree representation. In this sense, the nomenclature of 'tree' can be misleading. However, when the vertices exist within a complete graph the genealogy has no spatial structure.
The total number of heterochronous genealogies with $n$ tips is a property of the state space for a Bayesian prior distribution of phylogenetic inference. In Sections 2 and 3, the new results proven describe iterative enumeration formulae of heterochronous genealogical trees. These results on ancestral patterns, rather than branch lengths or waiting-time distributions, thus precisely relate to phylodynamics.

Sample Space Size
When the total number of tips of the first plus the second collection equals $n$, the possible tip configuration varies accordingly as $(i, n - i)$; $i = 1, 2, \ldots, n - 1$. In this case, the enumeration formula of the non-contemporaneous genealogy with $n$ tips in total is given by (2.1). To prove (2.1), consider the possible number of coalescence events that occur during the interval between the first and second collections. This number varies from zero to $n - 2$. The first term in (2.1) corresponds to zero coalescences and was described in (1.3). One coalescence during the interval between collections yields a further set of ancestral patterns, and so on, until $n - 2$ such coalescences yield the final set of ancestral patterns. Clearly, a factor of $2^{1-n}$ appears in (2.1).
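The counting argument above can be sketched in Python under explicitly stated assumptions, which are a reading of the text and not a transcription of (2.1): with $k$ coalescences in the interval after a first collection of `first` tips, the pairs are chosen in $\binom{l}{2}$ ways as the lineage count $l$ falls from `first`, and the remaining `first + second - k` lineages then coalesce as an isochronous genealogy. All function names are hypothetical.

```python
from math import comb, prod


def ranked_genealogies(n: int) -> int:
    """Isochronous genealogies with n tips: product of C(l, 2), l = 2..n."""
    return prod(comb(l, 2) for l in range(2, n + 1))


def two_sample_count(first: int, second: int) -> int:
    """Ancestral patterns for the tip configuration (first, second).

    Assumption: k = 0, 1, ..., first - 1 coalescences may occur before the
    second collection joins the genealogy.
    """
    total = 0
    for k in range(first):
        # Orderings of the k coalescences among the first-collection lineages.
        within = prod(comb(l, 2) for l in range(first - k + 1, first + 1))
        # Remaining lineages plus the second collection coalesce to the root.
        total += within * ranked_genealogies(first + second - k)
    return total


def total_two_sample(n: int) -> int:
    """Sum over all tip configurations (i, n - i), i = 1, ..., n - 1."""
    return sum(two_sample_count(i, n - i) for i in range(1, n))


print(two_sample_count(2, 1), total_two_sample(3), total_two_sample(4))
# prints 4 7 69
```

For the $(2, 1)$ configuration the four patterns are the three isochronous orderings of three lineages plus the single pattern in which the two first-collection tips coalesce before the third tip is sampled.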

Factorization of an Isochronous Genealogy
Rearranging (2.1) allows factorization of (1.3), as in (2.2).

Corollaries. (i) Enumeration of the $(n - 1, 1)$-genealogy yields (1.3) multiplied by $(n + 1)/3$. (ii) Enumeration of the $(n/2, n/2)$-genealogy does not exceed four thirds of (1.3), for all finite $n$. To prove this, consider the number of coalescence events that occur during the interval between the two collections: $0, 1, \ldots, n/2 - 1$. Sum the possible genealogical trees that arise in each case, to get (2.3). Equation (2.3) shows the way in which the corresponding series yields an exact value for finite $n$ (even). Every quotient coefficient of $n$ in (2.3) is less than one, and decreases with $n$. Therefore, this corollary does hold, albeit roughly, as a reduced partial sum of the infinite series that converges to $(1 - x)^{-1}$, where $x = 1/4$. This upper bound also holds for the $(i, n - i)$-genealogies, where $i = 2, 3, \ldots, n/2 - 1$.
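Both corollaries can be checked with exact rational arithmetic. The sketch below is a hedged illustration, not the form of (2.2) or (2.3): it assumes that the count for a configuration, divided by (1.3), is a sum over $k$ of products of ratios $\binom{i-t}{2}/\binom{n-t}{2}$, $t = 0, \ldots, k-1$, where $i$ is the size of the first collection. The name `relative_count` is hypothetical.

```python
from fractions import Fraction
from math import comb


def relative_count(first: int, n: int) -> Fraction:
    """Count for configuration (first, n - first), divided by (1.3)."""
    total = Fraction(0)
    term = Fraction(1)  # k = 0 coalescences in the interval
    for k in range(first):
        total += term
        # Extend the product of binomial ratios by one more coalescence.
        term *= Fraction(comb(first - k, 2), comb(n - k, 2))
    return total


# Corollary (i): the (n - 1, 1)-genealogy equals (1.3) times (n + 1) / 3.
print(all(relative_count(n - 1, n) == Fraction(n + 1, 3) for n in range(2, 30)))

# Corollary (ii): the (n/2, n/2)-genealogy stays below 4/3 of (1.3).
print(all(relative_count(n // 2, n) < Fraction(4, 3) for n in range(2, 60, 2)))
```

Under this assumption each successive ratio in the $(n/2, n/2)$ case is below $1/4$, which is why the partial sums stay under the geometric limit $(1 - 1/4)^{-1}$.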

Computation Formula of the Multiplicity Factor
A computation formula can be extracted from the expression within the curly brackets in (2.2), which provides a multiplicity translation from (1.3), the total number of isochronous genealogies with $n$ tips, to (2.1), the total number of non-contemporaneous genealogies with $n$ tips in total. One form of the formula applies for $n > 1$ odd; for $n > 2$ even, adjusted upper summation terminals result and the formula is otherwise the same.
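A hedged numerical sketch of the multiplicity factor follows. It does not reproduce the odd and even forms of the computation formula; it simply divides an assumed count of non-contemporaneous genealogies, summed over all configurations $(i, n - i)$, by the isochronous count (1.3). The counting rule and the name `multiplicity_factor` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb, prod


def multiplicity_factor(n: int) -> Fraction:
    """Ratio of non-contemporaneous to isochronous genealogy counts."""
    isochronous = prod(comb(l, 2) for l in range(2, n + 1))
    non_contemporaneous = 0
    for first in range(1, n):        # tip configuration (first, n - first)
        for k in range(first):       # coalescences between the collections
            within = prod(comb(l, 2) for l in range(first - k + 1, first + 1))
            after = prod(comb(l, 2) for l in range(2, n - k + 1))
            non_contemporaneous += within * after
    return Fraction(non_contemporaneous, isochronous)


for n in range(2, 6):
    print(n, multiplicity_factor(n))
# prints:
# 2 1
# 3 7/3
# 4 23/6
# 5 109/20
```

The loops are plain iteration over the two summation indices; no value of the factor at $n$ is obtained from its value at $n - 1$.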

Evaluation of the Enumeration Formula
Refer to Table 1 for a brief exploration of sample space size as $n$ increases, from the simplest patterns of topology through to the comparison of isochronous and non-contemporaneous genealogies. The enumerations (1.1)-(1.3) provide the standard pattern foundations that grow in complexity with successive pattern information. The multiplicity factor summarizes the level of increased complexity contained within the non-contemporaneous genealogy. Consideration of the sample space size thus yields a raw gauge of the increased computational requirements for bioinformatics, genomics and phylogenetics of non-contemporaneous data with only one further time-delayed sample collection, that is, when analysing the data with respect to its genealogical realization. In practice there may be a constraint on the format of data, and a sample configuration will be fixed.

Trees of m ≥ 2 Serial Samples
Let the total number of tips in the heterochronous genealogy be $n = n_1 + n_2 + \cdots + n_m$, where $n_i$ denotes the number of tips in the $i$-th sample collection, for constant $m$. Let $i_j$ denote the number of coalescence events that occur during the interval between the $j$-th and $(j+1)$-st samples. Thus, $i_1 = 0, 1, \ldots, (n_1 - 1)$; $i_j = 0, 1, \ldots, (n_1 + \cdots + n_j - i_1 - \cdots - i_{j-1} - 1)$ for $j = 2, \ldots, m - 1$; and, since the remaining lineages must all coalesce in the final interval, $i_m = n - 1 - i_1 - \cdots - i_{m-1}$. Note that $n_i \geq 1$; $i = 1, 2, \ldots, m$. Note also that with fixed $n$, the minimum $n_m = 1$ and the maximum $n_m = n - m + 1$, attained when every other sample consists of a single tip.
The number of possible tip configurations within the $m$-genealogy equals $\binom{n-1}{m-1}$ combinations, which distribute the total number of tips $n$ amongst the $m$ serial samples accordingly. Let $F(n_1, \ldots, n_m)$ denote the number of heterochronous $m$-genealogies with $n$ tips in total for a tip configuration $(n_1, \ldots, n_m)$, given by the nested sum-product (3.1) of the product terms $p_i$ in (3.2).
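The nested sum-product for $m$ serial samples can be sketched by direct iteration over the numbers of coalescence events per interval. This is an illustrative brute-force evaluation under the assumption that $i_j$ coalescences among $l$ lineages can be ordered in $\binom{l}{2}\binom{l-1}{2}\cdots$ ways; it is not the simplified form obtained from (3.1) and (3.2), and all names are hypothetical.

```python
from itertools import combinations, product
from math import comb


def tip_configurations(n, m):
    """All (n_1, ..., n_m) with n_i >= 1 and n_1 + ... + n_m = n."""
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0,) + cuts + (n,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def interval_patterns(lineages, events):
    """Orderings of `events` pair-wise coalescences among `lineages`."""
    ways = 1
    for l in range(lineages, lineages - events, -1):
        ways *= comb(l, 2)
    return ways


def count_m_genealogies(config):
    """Assumed count F(n_1, ..., n_m) by iterating over (i_1, ..., i_{m-1})."""
    n, m = sum(config), len(config)
    total = 0
    for events in product(range(n), repeat=m - 1):
        lineages, ways, valid = 0, 1, True
        for n_j, i_j in zip(config[:-1], events):
            lineages += n_j              # the j-th collection joins the tree
            if i_j > lineages - 1:       # at least one lineage must remain
                valid = False
                break
            ways *= interval_patterns(lineages, i_j)
            lineages -= i_j
        if valid:
            lineages += config[-1]       # final collection, then to the root
            total += ways * interval_patterns(lineages, lineages - 1)
    return total


print(len(list(tip_configurations(6, 3))), comb(5, 2))  # prints 10 10
print(count_m_genealogies((1, 1, 1)))                    # prints 4
```

Filtering tuples from `itertools.product` is wasteful for large $n$; it is used here only because it keeps the dependence of each index range on the earlier indices explicit.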
Alternative expressions for $p_{m-1}$ and $p_m$ in (3.2) are given in (3.3) and (3.4). Note that all these product terms apply the convention of being unity when the upper terminal is less than the lower terminal. The numerator of (3.3) has become independent of $i_{m-1}$. The denominator of (3.3) multiplied by (3.4) equals (3.5). The next product term simplifies identically, giving (3.6). Now, the numerator of (3.6) is independent of $i_{m-2}$. Multiply the denominator of (3.6) and the numerator of (3.3) to get (3.7). In consideration of (3.5) and (3.7), the products $p_i$ in (3.1) can be replaced, and the other such function replacements from (3.8) follow accordingly.
A recursion has been conjectured ([19], equation 1; [20], equation 2.4). The enumeration of the $m$-genealogy with $n$ tips in total derived in the present Section does not have such a form. The only similarity is found in the quotient of the denominator in (3.3) and (3.4); however, no mention of the quotient derived here actually appears in that conjecture.
Remarks. Equation (3.1) shows that the summation terms that precede the final summation term do not factor out of the final interval. Otherwise, the final summation would be ill-posed (undefined), since the values of $i_1, \ldots, i_{m-2}$ would be unknown. Therefore, the full enumeration of a heterochronous genealogy has the form of an extensive iterated sum-product. Efficiency can be gained by storage and reuse of certain product terms, although iteration through the nested sum-products determines where the reuse should occur. Increasing the total number of tips to $(n + 1)$ yields an identical enumeration formula adjusted for an additional tip placed within any one of the fixed number of $m$ samples. Therefore, enumeration remains non-recursive in that case. With $n$ fixed, add an $(m+1)$-st sample with $n_{m+1}$ tips and the tip configuration varies such that $n = \tilde{n}_1 + \cdots + \tilde{n}_m + n_{m+1}$, accordingly.
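One possible way to store and reuse product terms, offered as an assumption and not as the scheme intended above, is to tabulate the isochronous counts once and obtain every within-interval product as an exact quotient of two stored entries. The sketch applies this to the two-sample case; the names `ranked_table`, `interval_patterns` and `two_sample_count` are hypothetical.

```python
from math import comb


def ranked_table(n):
    """table[k] = product of C(l, 2) for l = 2..k (unity for k = 0, 1)."""
    table = [1, 1]
    for k in range(2, n + 1):
        table.append(table[-1] * comb(k, 2))
    return table[: n + 1]


def interval_patterns(table, lineages, events):
    """Product of C(l, 2) for l = lineages - events + 1 .. lineages.

    The stored prefix products divide exactly, so integer division is safe.
    """
    return table[lineages] // table[lineages - events]


def two_sample_count(first, second):
    """Assumed two-sample count, reusing the stored product terms."""
    table = ranked_table(first + second)
    return sum(
        interval_patterns(table, first, k) * table[first + second - k]
        for k in range(first)
    )


print(two_sample_count(2, 2), two_sample_count(3, 1), two_sample_count(4, 1))
# prints 21 30 360
```

The table is built once by iteration, and each summand then costs one division and one multiplication instead of a fresh product over binomial coefficients.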

Conclusion
Enumeration formulae of a heterochronous $m$-genealogy are derived and shown to be iterative, in the form of nested sum-products, with some efficiency gains available from pre-calculated product terms. The multiplicity factor derived translates the sample space size of all pair-wise coalescent ancestral patterns with a single sample of $n$ to the sample space size in a heterochronous genealogy with a total sample of $n = n_1 + n_2$. In the latter case, the sample configuration varies through $(i, n - i)$; $i = 1, 2, \ldots, n - 1$. In the general case, enumeration of the heterochronous genealogy requires an extended non-recursive procedure. In both cases, the probability distribution on genealogies is non-uniform, since the rate of coalescence events that determine the genealogy depends on the number of lineages, which varies with sample configuration.