Robustness of Selection and Timing Inference under Model Variation in Population Genetics

Javier Escabi¹,² and Sahand Hormoz¹,²,³*

    1Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA

    2Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, 02215, USA

    3Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA

    * Corresponding author; email: sahand_hormoz@hms.harvard.edu

      bioRxiv preprint DOI: https://doi.org/10.1101/2025.01.08.631974

      Posted: January 10, 2025, Version 1

      Copyright: This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/

      1 Abstract

      In population genetics, accurately inferring the selection coefficient and the time of onset of advantageous mutations from genetic data is fundamental for understanding evolutionary processes. Here, we investigate how mismatches between the true evolutionary process and the inference model—specifically in the reproductive variance (σ²) and the number of generations (L)—affect the posterior distributions of the selection coefficient and the time of onset. Using the Kolmogorov forward and backward equations, we model the stochastic dynamics of gene frequencies under selection and drift. We show that while the posterior distribution of the selection coefficient remains unaffected by changes in σ² and L, this invariance does not apply to the time of onset. By framing the problem as a first-passage-time problem, we derive explicit expressions for the offsets in the posterior mean and variance of the time of onset that result from incorrect assumptions about σ² and L. Our analysis reveals that these offsets are related to the mean and variance of the first passage time required for the allele frequency to reach a certain threshold, starting from an initial frequency determined by the model parameters. Under the assumption of a uniform prior for the time of onset, we find that the offset in the inferred mean is given by the difference (Δ₁ − Δ₂) in the effective generation duration between the true process and the inference model. We validate our theoretical findings through simulations, demonstrating that the empirical offsets closely match our predictions. Furthermore, we generalize our results to accommodate non-uniform prior distributions, such as exponential priors, and provide numerical methods for calculating offsets under arbitrary priors.
Stochastic fluctuations due to genetic drift, which are influenced by the reproductive variance and generational structure, can introduce significant biases in the posterior distribution of the time of onset of advantageous mutations. By quantifying these biases, our framework enables more accurate adjustments to inferences drawn from genetic data, thereby enhancing our understanding of evolutionary dynamics and improving the reliability of population genetic analyses.

      2 Introduction

      Population genetics models of drift and selection have been used to model real-world populations with different biological details, for example, variation in offspring number, timing between generations, population structure, and sexual reproduction. These models are used for inference because they are known to be robust. One way to define robustness is to consider the behavior of these models under certain limits. For example, many distinct models including the Wright-Fisher model [1, 2], the Moran model [3], and the Cannings models [4, 5] share the same diffusion limit: population size N → ∞, selection coefficient s → 0, such that Ns → C, where C is a nonzero constant. In this limit, the probability distribution of gene frequency at time t > 0 conditioned on an initial frequency converges to a universal description by the Kolmogorov forward equation under an effective population size [6–9]. This ensures a robustness where the time-courses of gene frequencies for the same initial conditions are identical across many different models as long as the effective population size is the same [9]. A similar type of robustness has been defined for the ancestral relationships of genes in addition to their relative frequencies in the population. In the same diffusion limit and in the neutral case, the gene genealogy of a finite random sample of genes backwards in time converges to the Kingman coalescent under an appropriate scaling of time according to the effective population size [10–12]. General conditions for convergence to the Kingman coalescent have been extended to include a changing population size, non-exchangeable offspring number distributions, selfing and sexual reproduction, etc., through a generalized time-scale transformation [10–17].
From the perspective of inference, these results guarantee a robustness of the Kingman coalescent in a different sense: only an effective population size, defined as the ratio of the true population size N to a model-specific reproductive variance σ², can be inferred from an observed set of coalescence times. However, inference of the population size N itself is not robust because it requires knowledge of σ².

      The idea of robustness historically has been centered around inferences of effective population size to uncover demographic histories that include population bottlenecks, population structure, and population growth [18–24], reviewed in [25, 26]. Inference of selection has become of interest more recently due to the drop in cost and advances in modern sequencing technologies, for example, inferences of population growth rates in human populations due to the increased statistical power to observe rare genetic variants [27–31], reviewed in [32], inferences of selection and timing of driver mutations in blood cancers from single-cell whole genome sequencing experiments [33, 34], and inferences of selection in the Drosophila genome [35–37]. On the theoretical side, models of selection have been extensively studied, including balancing selection [38], the ancestral selection graph [39], multiple merger models [40–43], and selective sweeps [43–51], reviewed in [52, 53]. However, robustness in the context of 1) inference of selection and 2) inference of the time at which the selectively advantageous mutation arose has received surprisingly little attention. This analysis is particularly important when selection is weak, for example, in the case of inferring selection and timing of the driver mutations of blood cancers, which undergo random fluctuations in population size in their early history [33, 34]. Importantly, the fluctuations in the trajectories of the population size of the selectively advantageous mutation created by random drift introduce uncertainty when inferring these two quantities from an observed set of coalescence times. This makes the analysis of the robustness of the inference of these two quantities subtly different from robustness of the coalescent statistics under an effective population size in the classical sense.

      We will now pose the central question of this paper: suppose we have a coalescent tree from a population under selection and infer a posterior distribution for a selection coefficient and the time at which the selectively advantageous mutation arose using a model. How will the inferred distributions change if the model were to change?

      We can consider other forms of data as well, for example, the site-frequency spectrum of a sample or time-course measurements of gene frequencies. In all of these cases, the results and equations of this paper apply. Throughout the paper, we will consider coalescent trees, since they add an additional layer of complexity, but we will point out how the results are a more general property of population genetics models and apply to other forms of data as well.

      We can frame the question more precisely. To do this, we will first define a class of models that include selection. We will begin by defining a modified version of the Wright-Fisher model with selection [1, 2], and then generalize the process into a class of Cannings models with selection [4, 5, 54]. The goal is to model a single mutation with a selective advantage conditional on fixation that begins with a population size of one, although these results can be generalized to multiple competing or subclonal mutations. We will then frame the inference part of the question more precisely after the model definitions.

      3 Model definitions

      We will begin by defining the Wright-Fisher model with selection. The model begins with a total population of N identical individuals (or genes), where N remains fixed over time. At each generation, the individuals at the current generation give birth to a new generation of N individuals and immediately die off. Each individual in the new generation can descend from any individual in the previous generation with equal probability. This process is iterated until at some generation one individual acquires a mutation that gives it a selective advantage s. We will refer to s as the selection coefficient. From this point on, the probabilities of descent change. Each individual in the new generation descends from wild-type individuals in the previous generation with probability p, and from mutant individuals in the previous generation with probability (1 + s)p. If n is the number of mutant individuals in the previous generation, then p can be derived from the condition that the probabilities must sum to 1: (N − n)p + n(1 + s)p = 1, which gives p = 1/(N + ns). (Eq. 1)

      We then define L to be the total number of generations of the Wright-Fisher process, and g to be the number of generations from the time at which the mutant individual arose all the way to the Lth generation at which the process terminates. Lastly, we define a to be the age of the population, or the amount of time (for example in years) that has elapsed from the first generation to the Lth. Note that the number of generations per unit time is then L/a, assuming that the duration of each generation remains the same over time.
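As a concrete illustration, the descent rule above can be simulated in a few lines; this is a minimal sketch in Python, and the function and argument names are ours, not the paper's:

```python
import random

def wf_with_selection(N, s, g, seed=None):
    """Simulate the mutant count over up to g generations of the
    Wright-Fisher model with selection, starting from a single mutant.

    Each of the N offspring picks a mutant parent with probability
    n*(1+s)*p and a wild-type parent otherwise, where p = 1/(N + n*s)
    makes the descent probabilities sum to one.
    """
    rng = random.Random(seed)
    n = 1                       # the mutation arises in one individual
    counts = [n]
    for _ in range(g):
        p = 1.0 / (N + n * s)
        p_mut = n * (1.0 + s) * p
        n = sum(1 for _ in range(N) if rng.random() < p_mut)
        counts.append(n)
        if n in (0, N):         # mutation lost or fixed
            break
    return counts
```

Conditioning on fixation, as the paper does, would correspond to discarding trajectories that terminate at n = 0.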

      We are interested in generalizing the Wright-Fisher model with selection in the Cannings sense [1, 2, 4, 5]. A rigorous treatment and definition of the Cannings models with selection and examples of these models can be found in [54, 55]. We will take a similar approach to [54], and construct the generalized Cannings version of our model for completeness.

      We will begin by defining a class of models that agrees with the Wright-Fisher model in terms of the average time-course of gene frequencies under selection, but differs in terms of the fluctuations in offspring number. To do this, there is a set of general characteristics of the Wright-Fisher model that we will retain while allowing other characteristics of the model to change. In particular, denote the set of random variables that represents the number of offspring produced by each individual in the Wright-Fisher model at a single generation t as ν₁(t), …, ν_N(t). For any fixed t before the selectively advantageous mutation arose (prior to generation L − g), the joint distribution of ν₁(t), …, ν_N(t) follows a multinomial distribution with N trials and descent probabilities p_i = 1/N, where i refers to the ith individual. Importantly, the offspring numbers at a given generation are exchangeable random variables, where exchangeable means that the marginal distributions are invariant under arbitrary permutations. In addition, ν₁(t) + ⋯ + ν_N(t) = N, since we assumed the population size is fixed. However, the ν_i(t) are not exchangeable random variables once the selectively advantageous mutation arises. At this point, the random variables are partitioned into two distinct classes with two different distributions: one class of individuals with a selectively advantageous mutation that on average produces more offspring, and another class of individuals that do not carry the selectively advantageous mutation. Here, the marginal distributions are invariant only if we permute the random variables that belong to the same class, or in other words, exchangeability holds only within each class. To construct the Cannings model, we will retain the condition of exchangeability within each class and the constraint ν₁(t) + ⋯ + ν_N(t) = N, but will allow the distributions of the ν_i(t) to deviate from the multinomial form at the level of the higher moments beyond the mean.
For example, one way to do this is to assume the ν_i(t) satisfy a multinomial form as in the Wright-Fisher model but draw the p_i from a given distribution [55].

      Next, we will assume that the variance of the offspring number distribution takes a common value σ² across all the individuals in each generation, and that the mean value of any ν_i(t), conditional on belonging to a particular class and conditional on the total fraction of individuals belonging to that class (to account for the decreasing effect of the selection coefficient when approaching saturation), matches the mean value of the corresponding ν_i(t) under the Wright-Fisher model with selection conditioned in the same way. Note that the Wright-Fisher model is a special case of a Cannings model with σ² = 1. By imposing that the two processes correspond at the level of the means, we are ensuring that the only difference between the Wright-Fisher model and the Cannings extension is the degree of random fluctuations.
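One concrete construction in this spirit (one option among many, following the idea of drawing the p_i from a distribution) gives each parent a random Gamma-distributed weight, which is equivalent to Dirichlet-distributed p_i. The calibration of the Gamma shape to a target σ² below is a large-N heuristic, and all names are ours:

```python
import random

def cannings_offspring(N, sigma2, is_mutant, s, rng):
    """One generation of a Cannings-type model with tunable reproductive
    variance sigma2 (> 1).

    Each parent i receives a random Gamma-distributed weight (equivalent
    to drawing the p_i from a Dirichlet distribution), mutant weights are
    boosted by (1 + s), and each of the N offspring then picks a parent
    with probability proportional to its weight.  Setting the Gamma shape
    to alpha = 1/(sigma2 - 1) makes the offspring-number variance roughly
    sigma2 for large N (a heuristic calibration, not an exact one).
    Returns the number of mutant offspring.
    """
    alpha = 1.0 / (sigma2 - 1.0)
    weights = [rng.gammavariate(alpha, 1.0) * ((1.0 + s) if m else 1.0)
               for m in is_mutant]
    parents = rng.choices(range(N), weights=weights, k=N)
    return sum(1 for i in parents if is_mutant[i])
```

The Wright-Fisher limit σ² → 1 corresponds to α → ∞, where the weights become effectively deterministic.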

      Lastly, we need a sampling procedure to obtain a set of coalescence times from these models. To do this, we can consider the coalescence times of a finite random sample of k selectively advantageous individuals at the final time-point. The only way to guarantee we are able to sample a set of k selectively advantageous individuals is to condition the mutation on having expanded to at least k individuals at the final time-point, which is equivalent to conditioning on fixation when the total number of generations L is large. Throughout the paper, we will express this assumption by conditioning on eventual fixation. Although we assume eventual fixation, we are interested in cases where the k individuals are sampled during the selective sweep or shortly after fixation. Importantly, the assumption that we have sampled from the class of selectively advantageous individuals is not necessary, and we could have assumed that the k individuals are randomly sampled from the entire pool of individuals and our results will not change.

      We can now state the question more precisely. Suppose we have observed a set of coalescence times generated by some Cannings model with selection that we defined above. Now suppose we infer a posterior distribution for the selection coefficient and the time of onset of the selectively advantageous mutation from the observed set of coalescence times using a different Cannings model with an arbitrary σ². What will happen to the inferred distributions if we change σ² in the model used for the inference?

      We can take a step back and ask what coalescent theory would predict. In particular, we can infer the effective population sizes from an observed set of coalescence times [9–12]. We can then consider the robustness of inferences of s. To begin, we will consider the Wright-Fisher model with selection that we defined (rather than an arbitrary Cannings model, for simplicity, but the concepts still hold), and then define n(t) as the number of individuals with a selectively advantageous mutation as a function of time in units of generations. Note that n(t) is a random variable, which produces a well-defined coalescent rate k(k − 1)/(2n(t)) [10–12], where k is the number of sampled individuals, after n(t) is realized. We can consider the inferred effective population size from a set of coalescence times at two different generations, n_e(t₁) and n_e(t₂), where n_e(t) is the effective population size as a function of time. The selection coefficient can be indirectly inferred from the estimator ŝ = (n_e(t₂)/n_e(t₁))^(1/(t₂ − t₁)) − 1, which does not depend on the reproductive variance, since σ² cancels in the ratio. Since σ² is not required to infer s, we can naively argue that inference of s is robust. However, this argument does not say anything about the fluctuations in the inferred value of s. The inference results in a posterior distribution of s, not a single value. It is conceivable that the larger the assumed σ² in the model, the larger the uncertainty in ŝ. However, we will show later that the inferred posterior distribution of s is indeed independent of σ².
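The cancellation of σ² in this ratio-based estimate can be checked numerically. The estimator form below is our reconstruction under geometric growth n(t) ∝ (1 + s)^t with n(t) = σ² n_e(t):

```python
def s_hat(ne_t1, ne_t2, t1, t2):
    """Estimate s from effective population sizes at two generations.

    Under geometric growth n(t) proportional to (1+s)**t, with
    n(t) = sigma2 * ne(t), sigma2 cancels in the ratio ne(t2)/ne(t1),
    so only the effective sizes are needed.
    """
    return (ne_t2 / ne_t1) ** (1.0 / (t2 - t1)) - 1.0

# Doubling sigma2 halves both inferred effective sizes for the same
# true trajectory, leaving the estimate of s unchanged:
base = s_hat(100.0, 200.0, 0.0, 10.0)
halved = s_hat(50.0, 100.0, 0.0, 10.0)
assert abs(base - halved) < 1e-12
```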

      We can also consider the robustness of the inferred time at which the selectively advantageous mutation arose. We can begin by considering the inferred population size trajectory, which is n(t) = σ²n_e(t), where σ² is the reproductive variance of the model used for the inference. When the Wright-Fisher model with selection is conditioned on fixation, as a crude approximation, the selectively advantageous mutation grows according to 1 + t on average until a population size of order 1/s is reached, at which point selection is stronger than drift and the mutation becomes destined for fixation. This is known from modeling average frequencies of mutations using birth-death processes [56, 57], as well as from the probability of fixation [58, 59], and from considering the scale at which different terms dominate, as in [38]. Using coalescent theory, we will infer a population size trajectory of n(t) ≈ σ²(1 + t) for small t values, on average, from an observed set of coalescence times generated from the Wright-Fisher model. Importantly, when σ² ≠ 1 in the model used for inference, the inferred population size reaches one at a different generation: by setting σ²(1 + t) = 1 we derive t = 1/σ² − 1, which is a number between −1 and 0 provided σ² ≥ 1. This is different from the time at which the true population size of the Wright-Fisher model (where σ² = 1) reaches one on average, which is at generation t = 0 by definition; the difference is at most the duration of one generation when σ² ≥ 1, but can be larger than one generation if σ² < 1. Again, we run into the same problem that arose for the inferred selection coefficient: this naive argument neglects the fluctuations in the inferred time of onset. Increasing σ² increases the random fluctuations in the model, which increases the uncertainty in the inferred time.
We will later derive explicit formulas describing the offset of the inferred posterior under different σ² that fully account for these random fluctuations.

      Up to this point, we have only considered the impact of varying σ² in the model, but have not considered the impact of varying the assumed L on the inferred posterior of s and g. To better understand the impact, it is most instructive to convert the branch lengths of the data tree to an absolute time-scale (such as units of minutes, days, or years). This is oftentimes done empirically if the branch lengths are in numbers of mutations and we know the mutation rate per unit time. The model assumes an arbitrary number of generations L that spans the tree, but the absolute amount of time spanning the tree is assumed to be a known, fixed quantity a. If a/L is the duration of a generation in the model, then the coalescent rate in units of absolute time is (L/a) · k(k − 1)/(2n(τ)). Note that we can define (N_e)_y = Na/(σ²L) as the effective population size with a σ² converted from variance per generation to variance per unit time, since L/a is a conversion factor. For simplicity, we can write (N_e)_y = N/(σ²)_y, where (σ²)_y = σ²L/a, and we will use the subscript y to denote parameters converted to units of an absolute time-scale throughout the paper. We can then define an estimator ŝ_y for the inferred selection coefficient in units of absolute time and note that the same arguments apply. In particular, the inferred s_y is independent of (σ²)_y, and consequently independent of L. Similarly, the inferred population size during drift of a tree generated from a Wright-Fisher model where the duration of a generation is assumed to be one is n(τ) = (σ²)_y(1 + τ). It follows that the inferred time is τ = 1/(σ²)_y − 1, which shows an impact of at most the duration of a generation for (σ²)_y ≥ 1, but can be larger than one generation whenever (σ²)_y < 1. Again, these arguments ignore the impact of random fluctuations due to a different (σ²)_y (or, equivalently, L and σ² in the model).
Similar to before, we will derive explicit formulas showing the offset of the inferred posterior under different L or σ² that fully account for the random fluctuations.

      The reader can verify that these arguments hold for varying L only when the inferred parameters have units of an absolute time-scale, and not when they have units of generations. For the remainder of the paper, we will derive the results on an absolute time-scale to be as general as possible. This is because any formula expressed in units of absolute time can be converted to units of generations by assuming a/L = 1. To derive explicit formulas, we will need to take a more rigorous approach and model the fluctuations in gene frequencies as a function of time. This requires invoking the Kolmogorov description of the Cannings model class.
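The unit conversions used throughout (s_y ≈ sL/a for |s| ≪ 1, (σ²)_y = σ²L/a, and (N_e)_y = N/(σ²)_y) can be collected in a small helper; a sketch with our own naming:

```python
def to_absolute_units(s, sigma2, N, L, a):
    """Convert per-generation parameters to an absolute time-scale.

    With L generations spanning an absolute duration a, each generation
    lasts a/L.  Growth per unit time is (1+s)**(L/a), which is about
    1 + s*L/a for |s| << 1; reproductive variance accumulates L/a times
    per unit time, and the effective population size rescales accordingly.
    """
    s_y = s * L / a
    sigma2_y = sigma2 * L / a
    Ne_y = N / sigma2_y            # equals (N / sigma2) * (a / L)
    return s_y, sigma2_y, Ne_y

# The scaling L -> c*L, s -> s/c, N -> c*N leaves s_y and Ne_y fixed:
c = 3.0
s_y, _, Ne_y = to_absolute_units(0.03, 1.0, 1000, 300, 30.0)
s_y2, _, Ne_y2 = to_absolute_units(0.03 / c, 1.0, 1000 * c, 300 * c, 30.0)
assert abs(s_y - s_y2) < 1e-12 and abs(Ne_y - Ne_y2) < 1e-9
```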

      4 Scaling limit of the Kolmogorov forward equation beginning from a population size of one

      In the previous section, we presented arguments for the robustness of the inferred s_y and g_y to the assumed σ² or L in the model. These arguments failed to account for the contributions of the random fluctuations of the model to the posterior distribution. To account for random fluctuations, we will use the Kolmogorov forward equation [6] to model the changes in gene frequencies through time. This approach is a diffusion approximation, first used by [60, 1] and later employed more explicitly in various influential works, including [61–64]; it is currently a standard approach in population genetics. We will consider a particular limit where the error in the posterior of s_y and g_y under different σ² or L completely vanishes and use it as an example to build intuition. This will give us a deeper understanding of the problem, help us identify where robustness breaks down, and point us towards an approach to carry out rigorous calculations.

      We can begin by considering the Kolmogorov forward equation for the Cannings model. Here φ represents the probability density of gene frequency x at generation t, where the remaining parameters have the same definitions as in the Cannings model defined above. In particular, the parameters of the equation are N, the total population size, σ², the variance in offspring number, and s, the selection coefficient. The fourth parameter is g, the number of generations that have elapsed from the initial time-point, when the clone size is one, to the final time-point. g is implicitly incorporated as the duration of time for which we run the forward Kolmogorov equation. Namely, φ(x, 0) = δ(x − 1/N), and k individuals are sampled at the final time-point g from a realization of the clone whose frequency is described by φ(x, t). As before, our goal is to infer s and g from the coalescent structure of the k individuals. The dynamics of φ(x, t) are given by

      ∂φ/∂t = −s ∂/∂x [x(1 − x)φ] + (σ²/(2N)) ∂²/∂x² [x(1 − x)φ]. (Eq. 2)

      As suggested before, we will convert the equation to units of absolute time. This can be done by using the differential dt = (L/a) dτ and the chain rule to obtain

      ∂φ/∂τ = (L/a) ∂φ/∂t, (Eq. 3)

      and substituting (Eq. 3) into (Eq. 2) to rewrite it as follows:

      ∂φ/∂τ = −(sL/a) ∂/∂x [x(1 − x)φ] + (σ²L/(2Na)) ∂²/∂x² [x(1 − x)φ]. (Eq. 4)

      We can write the above equation in terms of s_y and (N_e)_y. This can be done by first noting that if there are L/a generations per unit time, and the population grows by a factor of 1 + s each generation, then the population growth is (1 + s)^(L/a) per unit time. In addition, we can use s_y ≈ sL/a, and write N in terms of the effective population size (N_e)_y = Na/(σ²L). Substituting these values and assuming |s| ≪ 1, we obtain:

      ∂φ/∂τ = −s_y ∂/∂x [x(1 − x)φ] + (1/(2(N_e)_y)) ∂²/∂x² [x(1 − x)φ]. (Eq. 5)

      It is clear that this expression depends on L and σ² only through (N_e)_y and s_y. Therefore, we will obtain the same solution if we arbitrarily change L and σ² while keeping s_y and (N_e)_y constant. For example, if we let L → cL, then s → s/c so that s_y remains constant, and N → cN so that (N_e)_y remains constant. Similarly, for σ² → cσ² we take N → cN so that (N_e)_y remains constant, and s_y remains constant merely by the fact that it does not depend on σ².

      In addition, it follows that the distribution of coalescence times of a random sample of selectively advantageous individuals is also preserved as long as s_y and (N_e)_y are held fixed. For example, we can define the random variable x(τ) as the frequency (i.e., fraction of N) of selectively advantageous individuals as a function of time described by (Eq. 5). The effective population size of the fit subpopulation is then x(τ)(N_e)_y, and the rate of coalescence at each time-point is k(k − 1)/(2x(τ)(N_e)_y) after any realization of x(τ). Since the distribution of x(τ), and thereby of the coalescence rate, is preserved, the distribution of coalescence times is also preserved when changing L or σ² as long as s_y, g_y, and (N_e)_y are held fixed, where g_y is the duration of time in absolute time units we run x(τ) under (Eq. 5) before sampling.

      Although this scaling is well understood, it is useful to recall its biological meaning. In particular, increasing L → cL or σ² → cσ² increases the rate of diffusion of gene frequencies by increasing (σ²)_y. However, the scaling also requires N → cN to preserve (N_e)_y, which decreases the rate of diffusion, since the rate also depends on the absolute population size, thereby canceling the increase in (σ²)_y. In addition, changing L → cL requires s → s/c to preserve s_y, so the increase in the growth rate due to adding more generations is balanced by a decrease in selection at each generation. Put more simply, under this scaling we are preserving the balance of drift and selection on an absolute time-scale by requiring N and s to cancel any change in σ² or L. This scaling also ensures that the effective population size of the selectively advantageous subpopulation, x(τ)(N_e)_y, and hence the coalescence rate, is preserved.
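The invariance can be probed numerically by simulating the stochastic process whose density obeys (Eq. 5), i.e. dx = s_y x(1 − x) dτ + sqrt(x(1 − x)/(N_e)_y) dW. The Euler–Maruyama scheme below is a minimal sketch of ours, with naive boundary handling:

```python
import math
import random

def simulate_frequency(x0, s_y, Ne_y, T, dt=1e-3, seed=None):
    """Euler-Maruyama simulation in absolute time of
        dx = s_y * x(1-x) dtau + sqrt(x(1-x) / Ne_y) dW.
    Because the dynamics depend on (L, sigma2, N, s) only through s_y
    and Ne_y, any parameter combination with the same (s_y, Ne_y) and
    the same starting frequency yields the same law of trajectories.
    Returns the frequency at time T (or at absorption).
    """
    rng = random.Random(seed)
    x, t = x0, 0.0
    while t < T and 0.0 < x < 1.0:
        drift = s_y * x * (1.0 - x)
        noise = math.sqrt(max(x * (1.0 - x), 0.0) / Ne_y)
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = min(max(x, 0.0), 1.0)   # absorb at the boundaries
        t += dt
    return x
```

The subtlety discussed next is that the starting frequency x0 itself changes under the scaling when the process begins from a single individual, so trajectory laws are not automatically preserved.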

      Up until now, we have pointed out a standard scaling of the Kolmogorov forward equation, and how it coincides with the Kingman coalescent. It may be tempting at this point to assume that we have solved the problem. For example, we could naively argue that since any combination of the parameter values s_y, g_y, and (N_e)_y produces the same distribution of coalescence times, it follows that we can infer these parameter values independently of the choice of L or σ². Another way of saying this is that the posterior distribution will not change if we change the L or σ² in the model used for the inference. However, this argument is actually incorrect. In particular, the Kolmogorov forward equation is expressed in terms of a frequency x, and as a result, the initial condition is also expressed as a frequency. More precisely, the initial condition can be written as

      φ(x, 0) = δ(x − x₀), (Eq. 6)

      where the right-hand side of (Eq. 6) is a Dirac delta function, and x₀ is the starting frequency. However, our model does not assume a fixed initial frequency. It instead assumes an initial frequency of 1/N, since the process begins from a fixed population size of one. This implies that the initial condition changes with the scaling we considered above. For example, we can rewrite our initial frequency in terms of the parameters of the equation in absolute time units and σ² and L to obtain 1/N = 1/((σ²)_y(N_e)_y), and hence

      φ(x, 0) = δ(x − 1/((σ²)_y(N_e)_y)). (Eq. 7)

      It follows that the initial condition changes with σ² and L while holding s_y, g_y, and (N_e)_y fixed, although the equation itself is invariant, which implies that the solution of the Kolmogorov forward equation (i.e., the distribution of gene frequencies or x(τ)) changes under the scaling. As a result, such an invariance does not actually hold under these assumptions.

      We now consider the invariance of the Kolmogorov forward equation with the correct initial condition of a single individual. We will show that the invariance under changing L and σ² while keeping (N_e)_y and s_y fixed still holds when conditioned on eventual fixation in the limit L → ∞ or σ² → ∞. We will use this result as a starting point to think about what happens when L and σ² are finite.

      To begin, denote the density conditional on eventual fixation as φ(x | F). Using Bayes’ theorem, we can write

      φ(x, τ | F) = P(F | x) φ(x, τ) / P(F | x₀), (Eq. 8)

      where we have set P(F | x, τ) = P(F | x), since the probability of eventual fixation is time-independent. The probability of fixation of a mutation with frequency x within a sufficiently large population of size N and with selection coefficient s is given by [59]. Therefore, after rewriting the parameters in terms of s_y and (N_e)_y,

      P(F | x) = (1 − e^(−2(N_e)_y s_y x)) / (1 − e^(−2(N_e)_y s_y)), (Eq. 9)

      P(F | x₀) = (1 − e^(−2s_y/(σ²)_y)) / (1 − e^(−2(N_e)_y s_y)). (Eq. 10)

      Note that (Eq. 10) is (Eq. 9) with x = x₀ = 1/((σ²)_y(N_e)_y), since we always begin with one selectively advantageous individual. Substituting (Eq. 9) and (Eq. 10) back into (Eq. 8) gives us

      φ(x, τ | F) = ((σ²)_y/(2s_y)) (1 − e^(−2(N_e)_y s_y x)) φ(x, τ), (Eq. 11)

      where we have assumed that 2s_y/(σ²)_y ≪ 1, so that 1 − e^(−2s_y/(σ²)_y) ≈ 2s_y/(σ²)_y. This limit assumes that drift is stronger than selection initially, when the population size is small. The unconditional gene frequency density φ is given as the solution to the diffusion approximation [7], which in our notation is Embedded Image

      The primed summation is over even values of n whenever k is even, and odd values of n whenever k is odd. The polynomials appearing in (Eq. 12) are Gegenbauer polynomials of the first kind [65] and are defined as Embedded Image, where we have used Pochhammer notation to denote a rising factorial, defined as (a)_n = a(a + 1)(a + 2) ⋯ (a + n − 1), with (a)₀ = 1.
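For numerical work, the rising factorial and the Gegenbauer polynomials can be evaluated directly. The routine below uses the standard three-term recurrence for the general Gegenbauer polynomials C_n^λ; it is our general-purpose sketch and is not tied to the paper's specific indexing or normalization:

```python
def pochhammer(a, n):
    """Rising factorial (a)_n = a * (a+1) * ... * (a+n-1), with (a)_0 = 1."""
    out = 1.0
    for k in range(n):
        out *= a + k
    return out

def gegenbauer(n, lam, x):
    """Gegenbauer polynomial C_n^lam(x) via the standard recurrence
    n*C_n = 2x(n+lam-1)*C_{n-1} - (n+2lam-2)*C_{n-2}."""
    if n == 0:
        return 1.0
    c_prev, c = 1.0, 2.0 * lam * x
    for k in range(2, n + 1):
        c_prev, c = c, (2.0 * x * (k + lam - 1.0) * c
                        - (k + 2.0 * lam - 2.0) * c_prev) / k
    return c

# For lam = 1 these reduce to Chebyshev polynomials of the second kind,
# e.g. C_2^1(x) = 4x^2 - 1:
assert abs(gegenbauer(2, 1.0, 0.5) - (4 * 0.25 - 1)) < 1e-12
```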

      The (λ_k)_y are eigenvalues, and the accompanying constants depend on s_y, (N_e)_y, and k. The eigenvalues and constants arise through a related oblate spheroidal angle equation [66–68]. These values can be numerically approximated using the method of continued fractions or taken from tabulated values [68, 69].

      The C_k are determined by the initial condition δ(x − x₀) and are given by Embedded Image

      We are interested in the initial condition x₀ = 1/N = 1/((σ²)_y(N_e)_y). We can substitute this initial condition into (Eq. 15) and rewrite it in a way that will allow us to ignore second-order terms: Embedded Image

      By substituting (Eq. 16) into (Eq. 12) and distributing the sum, we obtain Embedded Image

      Lastly, we can substitute (Eq. 17) into (Eq. 11) and obtain Embedded Image

      Note that in this form, we can see that the sum converges as L → ∞ or σ² → ∞ to a limit function of x, τ, s_y, and (N_e)_y that is independent of the initial condition x₀ = 1/((σ²)_y(N_e)_y). This suggests that the process forgets the initial differences in frequency at τ = 0, while preserving the distribution of frequencies soon after. As before, this implies that the effective population size x(τ)(N_e)_y of the selectively advantageous subpopulation is preserved, and hence the distribution of coalescence times is also preserved, showing robustness to changes in the initial frequency caused by changes in L and σ². Another way of saying this is that the posterior distribution converges to the following limit function

      Embedded Image

      where P(s_y, g_y, (N_e)_y) is a prior distribution, Embedded Image is the likelihood function, C_k are a set of observed coalescence times, and we have ignored the constant of integration, or can think of it as being absorbed by f. The likelihood function on the right-hand side is denoted as being independent of the initial condition, and as a result so is the constant of integration.

      Such a result may not be too surprising upon further reflection. We can think about this particular scaling more carefully, beginning by writing the fitness in absolute time units as

      (1 + s)^(L/a) ≈ e^(s_y). (Eq. 20)

      If we let L → cL, then we must have s → s/c to keep s_y constant. However, conditional on fixation, the population size required for escaping extinction during drift (i.e., becoming established for fixation due to selection) goes from roughly σ²/s to cσ²/s. The two processes take on average 1/s and c/s generations to establish, respectively, or in absolute time units a duration of a/(sL), which remains constant under the scaling [56–59, 38]. This implies the population size increases on average by a factor of c. However, N → cN to keep (N_e)_y constant. Since the mutant population size and the total population size both increase by a factor of c, the frequency is preserved under this scaling. As we argued before, since x(τ) and (N_e)_y are preserved, the effective population size x(τ)(N_e)_y, and hence the distribution of coalescence times, is preserved. A similar argument can be made for σ² → cσ². Under this scaling, the population size at which the population is established is cσ²/s, while the time to establish is independent of σ² and constant. Since N → cN to keep (N_e)_y constant, x(τ) is preserved, and it follows that the distribution of coalescence times is also preserved.

      These dynamics are, of course, well understood. However, it is worth emphasizing a key distinction between the original scaling we considered, which assumes a fixed initial frequency, and the present one, which assumes an initial population size of one. In the first case, the scaling results in the mutant population size increasing by a factor of c, but only because the initial population size is increased by a factor of c by definition. In the second case, the mutant population size is scaled by a factor of c because the scaling results in an increased rate of extinction, which means the trajectories that are observed are the ones that fluctuated to a larger population size and quickly escaped drift early on. In other words, the rescaling of population size is related to the dynamics of the population, and not a result of arbitrarily changing the initial population size. Importantly, this rescaling of the mutant population size has nothing to do with the fact that N → cN; it is related only to s, L and σ2, which determine the time and population size at which drift is escaped. The reason N → cN is required in the Kolmogorov equation is to preserve the time at which saturation is reached, but the scaling of trajectories is fundamentally a result of the underlying dynamics of the selectively advantageous subpopulation.

      This invariance does not just apply to coalescent tree data. We have already established that the distribution of x(t) is invariant, so the results would apply if our data were a set of gene frequencies x(t) measured at time-points after the initial advantageous mutation arose. Similarly, the observed site-frequency spectrum of the selectively advantageous subpopulation is also invariant. This is because the factor of increase in population size over time is exactly offset by the equivalent factor of decrease in the probability of fixation, so the expected number of neutral mutations that arise at a given time-point and survive to the final time-point is the same in both processes (any process and its scaled-parameter version). Extensions to these other data forms should ultimately be calculated and studied more rigorously. We will continue to focus on coalescent trees since our goal is not to create a catalog of all possible forms of data under which these results apply, but to present a framework that can be easily adapted to other forms of data beyond coalescent trees.

      As we previously argued, the invariance under an initial population size of one is only true in the limit of L → ∞ or σ2 → ∞. In this limit, the scaled process when conditioned on fixation drifts to the exact initial frequency of the unscaled process with vanishing fluctuations, so that the two processes are described by the same solution to the Kolmogorov forward equation. For finite L and σ2, coalescence events that occur near the time at which the selectively advantageous mutation arose will have subtle differences in their distribution of coalescence times under different values of L and σ2. As a result, the coalescence times will retain their invariance only after some τ = ϵ, where ϵ → 0 in the limit as L → ∞ or σ2 → ∞. Although it is possible to consider the rate of convergence of ϵ directly, it is not a particularly useful quantity, since we are interested in explicitly calculating how the posterior distribution is impacted by this initial error, and from the rate of convergence alone it is neither clear what the impact should be nor how to compute it. What we can say is that the impact could in theory be significant. For example, consider the inferred times at which blood cancer mutations arise in individual patients [33, 34], where the time between generations is thought to be one year [70]. If the error happens to be a fixed number of ten generations, then the inferred age of onset of the mutation would be off by ten years, which could significantly change the interpretation of the results. Such an error would, of course, vanish as the number of generations per unit time becomes infinite. To directly compute the impact on the posterior distribution, we will need to introduce new concepts. In particular, we will need to map ensembles of trajectories from different models with different initial conditions in a way that preserves coalescence statistics. To do so, we will solve a first passage time problem.

      5 Robustness is a first passage time problem

      So far we have shown that under the appropriate scaling of parameter values and in certain limits, namely L → ∞ or σ2 → ∞ with (Ne)y and sy held constant, the posterior distribution converges to the following limit function Embedded Image which is independent of the initial condition. Recall that gy is the time of onset of the clone, or the absolute time from the occurrence of the selectively advantageous mutation (when the clone arises with population size one) to the time at which random individuals are sampled from the clone to construct their coalescent tree. Although the posterior becomes arbitrarily close in the limit, it will differ from the limit function for finite σ2 or L. To compute the impact, we will first discuss how to correctly establish a correspondence between trajectories across different scalings; in doing so, it will become clearer how to compute the error.

      We will consider again the first scaling in this paper, which assumed a fixed initial frequency rather than a fixed initial population size of one. We will consider two processes that differ by the scaling L → cL. We will assume that the processes have an initial population size of 1 and c (c > 1) for the first and second process respectively, to ensure the frequency remains fixed as N → cN. We will refer to the first process as the original or unscaled process and the second as the scaled process. The invariance of the Kolmogorov forward equation suggests the following: for any trajectory in the original process, there exists an equivalent trajectory in the scaled process that is precisely a factor of c larger in population size at every time-point and has an identical probability of occurring. Since the effective population size is preserved across both trajectories, the distribution of coalescence times from a random sample of either trajectory is the same.

      Now consider one such pair of trajectories (an original and its scaled counterpart). Since the scaled trajectory begins from a population size of c rather than 1, we can use it to create a new trajectory that begins from a population size of 1 by stitching it together with another trajectory that began at a previous time point but reaches a population size of c for the first time and terminates precisely at the time at which the scaled trajectory begins. We stress that it is important to only consider trajectories that reach c for the first time, thereby excluding those that have reached or exceeded a population size of c at any earlier time-point. Importantly, we can create an ensemble of trajectories out of the scaled trajectory, each with a different probability of occurring. For example, a trajectory that begins much farther back than the scaled trajectory is unlikely to reach c for the first time when the scaled trajectory begins, since reaching c before then is likelier. By design, each trajectory of the scaled ensemble assigns an identical probability to any Ck that the original/unscaled process can produce, since the effective population size matches the unscaled process after the time-point at which it begins.

      These ideas can be restated in terms of frequencies. In particular, the unscaled process has a starting frequency of Embedded Image while the scaled ensembles begin farther back from a starting frequency of Embedded Imageand reach Embedded Image precisely when the unscaled process begins. Conditional on reaching Embedded Image when the unscaled process begins, both processes will have the same trajectories of frequency from there on and hence the same distribution of coalescence times. For the remainder of the paper, we will use the language of frequencies because it simplifies the notation.

      Establishing this correspondence allows us to associate the trajectories of any process with an ensemble of trajectories from a scaled version of the process in a way that preserves coalescence statistics. This is useful because we can generate a posterior distribution of trajectories using a model with scaled parameter values with respect to an underlying process, and map them to trajectories from a posterior distribution generated using the correct parameter values. The mapped trajectories across the posteriors will only differ in that the ensembles from the scaled posterior will have begun farther back, which means that the impact of the scaling on the inferred gy is related to the time it takes the scaled process, beginning from a smaller frequency, to reach the starting frequency of the original process. This is known as a first passage time problem, a recurrent theme in the study of stochastic processes.

      Importantly, this offset will have no impact on the inferred posterior of the fitness sy. This is because the time it takes the scaled process to reach the frequency of the original process is independent of sy, since this occurs during the drift phase long before selection determines the dynamics of the trajectories. By the time sy impacts the dynamics of the trajectories, the scaled trajectory has reached the frequency corresponding to the start of the unscaled trajectory. Drift switches over to exponential growth at a population size of Embedded Image. We work in the regime Embedded Image where these assumptions are clearly satisfied, implying that the scaled trajectories match the starting frequency of the unscaled trajectory long before drift ends. Therefore, the posterior of sy will not change if L and σ2 change, as long as (Ne)y and sy are held constant.

      We will now make these intuitions rigorous and use them to formulate a more sophisticated definition of robustness. Our calculations will focus on the posterior of gy, which requires an explicit calculation. The invariance of the inferred sy follows as a direct consequence of the way we have associated trajectories across the different processes.

      6 Definition of Robustness

      We will now express the ideas in the previous section in rigorous mathematical terms. We will derive a set of equations that describe the impact on the posterior distribution under changes in the model assumptions σ2 and L, and take them as our definition of robustness. We will then derive, under certain conditions, a set of simple expressions relating robustness to the mean and variance of the time of first passage. As we will see, this approach is both conceptually useful and makes the assumptions more explicit.

      We will begin by deriving a posterior distribution using a (σ2)y assumed to match the true underlying process, and compare it to a posterior distribution using c(σ2)y. Note that the scaling (σ2)y → c(σ2)y is equivalent to either σ2 → cσ2 or L → cL, but we will now use the former notation as it encompasses both. The changing initial condition will now instead be expressed as Embedded Image. We can consider the marginal posterior distribution of the inferred time at which the selectively advantageous mutation arose (i.e. the time at which the population was originally one) for the unscaled and scaled process, which we will denote as Embedded Image and Embedded Image. More precisely, Embedded Image are random variables which denote a sampled value from these posteriors. We are interested in deriving an expression for Embedded Image in terms of Embedded Image, and Embedded Image in terms of Embedded Image, although generalizing to higher moments is possible.

      The distribution of Embedded Image we will denote by Embedded Image, where P (gy) is the prior distribution of the time of onset, and Embedded Image is the integration constant. For the scaled process, we will denote the distribution of Embedded Image as Embedded Image, where Embedded Image also denotes the prior distribution with respect to Embedded Image; we have assumed the prior distributions in both processes are the same, namely that Embedded Image whenever Embedded Image, and that Embedded Image. Note that Embedded Image and gy both refer to the time (in absolute time units) into the past at which the population was initially one in both processes. Next, we will denote the probability density that the scaled process with parameter c(σ2)y reaches frequency Embedded Image for the first time after a duration of time T, beginning from frequency Embedded Image, as Q(T). Importantly, as we stated in the previous section, the distribution of coalescence times will be the same across both processes if we condition on the frequency of the scaled process reaching Embedded Image when the original process begins. If we denote the trajectory of frequency of the scaled process as x*, where Embedded Image, then we can denote this assumption by conditioning the likelihood function and writing Embedded Image for all T > 0. This assumption will allow us to express Embedded Image in terms of f (Ck | gy) and Q(T).

      Before doing the calculation, we will take a quick digression to note that this assumption can break down. For example, if the original process begins after the final coalescence event (here, “after” means closer to the present), but the scaled process begins before it, then the original process will have a probability of zero of producing the observed tree while the scaled process will have a nonzero probability, yet we will have assumed that both processes have equal probability of producing the observed tree. However, this assumption appears not to significantly impact the results, as simulations will show in a later section. This is likely because coalescence events rarely occur in the time-interval before the population size reaches c. In addition, we will end up integrating the likelihood of coalescence across different times of onset, and most of the contribution to the integral will occur at time-points before the final coalescence event, which is when the assumption does not break down.

      We can now write down the posterior of Embedded Image, and use the law of total probability along with this assumption to obtain:Embedded Image

      Here we have used an integral to simplify the notation and avoid having to sum over fractional durations of time. Finally, note that the integration constant is Embedded Image

      We can use (Eqs. 22, 23) to compute the expectation and variance. We will begin with the expectation: Embedded Image

      Switching the order of integration we obtain:Embedded Image

      Next we do a change of variables Embedded Image in the inner integral, which can be done since gy is constant:Embedded Image

      We can also rewrite the integration constant in a similar manner by changing the order of integration and using the same change of variables Embedded Image:Embedded Image

      If we assume that P (gy + T) = P (gy), i.e. that the prior is a uniform distribution, then from (Eq. 27) we see that Z* = Z, and from (Eq. 26) we obtain Embedded Image or equivalently Embedded Image where Embedded Image is the expectation of the first passage time to Embedded Image using the scaled process and with starting frequency Embedded Image
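      Under a uniform prior, then, a draw from the scaled posterior is distributed as a draw from the unscaled posterior plus an independent first passage time T, so the mean and variance offsets are simply the mean and variance of T. The following minimal numerical sketch illustrates this relationship; the posterior and first-passage distributions below are arbitrary stand-ins chosen for illustration, not quantities inferred from data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Stand-ins: G ~ posterior of the time of onset under the correct
# parameters; T ~ first passage time of the scaled process from its
# smaller starting frequency up to the unscaled starting frequency.
G = rng.normal(loc=200.0, scale=15.0, size=n)
T = rng.gamma(shape=3.0, scale=1.0, size=n)  # E[T] = 3, var[T] = 3

# Under a uniform prior, a sample from the scaled posterior is G + T,
# so the mean and variance offsets are E[T] and var[T] respectively.
G_star = G + T

print(G_star.mean() - G.mean())  # close to E[T] = 3
print(G_star.var() - G.var())    # close to var[T] = 3
```

Because the shift is an independent additive term, the offsets do not depend on the shape of the (uniform-prior) posterior itself, only on the first passage time distribution.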

      We can then compute the second moment of Embedded Image in a similar fashion, again with the assumption P (gy + T) = P (gy), and use it together with the computed mean to obtain the variance:Embedded Image

      We will not compute the higher moments although it is possible to continue this procedure.

      The equations above show the relationship between robustness and the time of first passage. Furthermore, under the simplifying assumption of a uniform prior distribution, Embedded Image and Embedded Image are exactly the offsets of the mean and variance of the posterior distribution, respectively. Therefore, in this case, the problem of robustness reduces to finding the mean and variance of the time of first passage. We will consider non-uniform prior distributions in a later section. For now, we will compute the offset of the mean and variance assuming P (gy + T) = P (gy).

      It is important to note that these equations hold only for c ≥ 1, since we are implicitly assuming that the process with a larger (σ2)y, and hence smaller starting frequency, is begun farther back (in this case the process with parameter c(σ2)y). However, we can generalize the equations by considering the case 0 < c < 1. The key insight is to notice that these equations more generally show how inferred values differ between two inferences with different assumed values of (σ2)y, independent of whether either (σ2)y actually matches the variance in the underlying process. We considered the unscaled (σ2)y as the true underlying process in our derivations, but there is nothing inherent about this in the equations. More generally, the left side of the equation corresponds to the inference with a larger (σ2)y while the first term of the right hand side corresponds to an inference with a smaller (σ2)y. This allows us to write for 0 < c < 1: Embedded Image Embedded Image

      Or equivalentlyEmbedded ImageEmbedded Image

      We can now interpret (σ2)y as the correct parameter of the underlying process, and obtain an expression for the impact of using a model with 0 < c < 1. Apart from the obvious difference that the correction is negative in sign, the first passage correction is now computed using the parameter value of the underlying process (σ2)y and the time it takes to reach a frequency of Embedded Image beginning from a frequency of Embedded Image.

      7 Calculating the time of first passage

      In the previous section, we derived a set of equations to describe the offset of the mean and variance of the posterior of gy from assuming an incorrect (σ2)y in the model. Under the assumption of a uniform prior distribution, we showed that the offset is the expectation and variance of the time of first passage. We will now attempt to compute these quantities directly. In particular, we will compute Embedded Image and var Embedded Image for c ≥ 1, and Embedded Image and var Embedded Image for 0 < c < 1, where Embedded Image is the first passage time distribution from Embedded Image using the model parameter c(σ2)y, and Embedded Image is the first passage time distribution from Embedded Image using the underlying process parameter (σ2)y. To derive formulas that hold for both cases, we will denote the initial frequency of the process that is pushed back as p and the frequency that must be reached as f, and substitute the correct values later on. To calculate the mean and variance of the time of first passage, we will follow an approach that was used to compute the moments of the time until fixation by directly integrating the Kolmogorov backwards equation [71, 6], but with modified conditions. Other approaches that consider the same question are possible [16, 35, 9, 72, 73, 74, 75, 76, 77, 78, 79, 80]. We will begin by writing down the Kolmogorov backwards equation for the Cannings model [6, 59]. As an approximation, we will drop the selection term that depends on sy, since the first passage occurs during the drift phase before the mutant population is established. Embedded Image

      Here φ(p; f, τ) represents the probability that the frequency of the mutant subpopulation, beginning from p, has reached a value of f (0 < p < f ≤ 1) by generation τ, where f is an absorbing boundary. The most widely studied case is f = 1, where φ(p; 1, τ) represents the probability of an allele having reached fixation [59, 71, 72]. The actual Cannings trajectories do not remain at frequency f once f is reached for 0 < f < 1, but artificially introducing an absorbing boundary allows the process to retain information on the fraction of trajectories that reach f for the first time at any time-point. Importantly, the equation above does not explicitly depend on f because the absorbing boundary is specified in the boundary conditions, even in the case of f = 1. In particular, we must assume: Embedded Image

      For simplicity of notation, we will now write φ(p, τ), where it is implicit we are modeling the probability of reaching f.

      We are interested in computing the mean and variance of the distribution of the time it takes to reach f for the first time beginning from p, conditioned on the process eventually reaching f. The reason we exclude the trajectories that go extinct before reaching f is that only the trajectories under the scaled parameter values that reach a frequency f will have the same coalescence time distribution as the original process. Recall that we have chosen to model f as an absorbing boundary. The precise justification for modeling f this way is as follows: it allows us to first compute the probability that a Cannings trajectory reaches f at some point (rather than going extinct before reaching f) through limτ→∞ φ(p, τ) = φ(p), and then to compute the probability of reaching f by time τ conditional on the process eventually reaching f, Embedded Image. This quantity Embedded Image is important because it is the CDF of the first passage time conditional on f being eventually reached. This follows from the definition of an absorbing boundary, since any trajectories that reach f at any given time are by definition doing so for the first time. The probability density of the first passage time is then Embedded Image. We will first focus on computing φ(p) and then compute the mean and variance of the density. To compute φ(p), we will take the limit as τ → ∞ of the Kolmogorov backwards equation so that Embedded Image, which follows from the fact that 0 or f is eventually reached. We therefore obtain Embedded Image which implies Embedded Image

      We then integrate and apply the boundary conditions φ(f) = 1, and φ(0) = 0 to obtainEmbedded Image

      Note that when f = 1 we obtain the well-known probability of fixation of an allele with frequency p under drift and no selection.
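      This closed form, φ(p) = p/f in the neutral case, can be checked by direct simulation. The sketch below uses a neutral Wright-Fisher model (offspring variance close to 1) and treats a count of at least fN as first passage; all parameter values are illustrative, and the discrete model slightly overshoots the boundary, so agreement is approximate:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 1_000                      # population size
p, f = 0.01, 0.10              # starting and target frequencies, p < f << 1
start, target = round(p * N), round(f * N)
trials = 20_000

counts = np.full(trials, start)
active = np.ones(trials, dtype=bool)
hit = np.zeros(trials, dtype=bool)

# Neutral Wright-Fisher drift; a count of at least `target` is treated
# as first passage to f, and a count of 0 as extinction.
for _ in range(50_000):
    if not active.any():
        break
    counts[active] = rng.binomial(N, counts[active] / N)
    hit |= active & (counts >= target)
    active &= (counts > 0) & (counts < target)

phi_hat = hit.mean()
print(phi_hat)  # close to p / f = 0.1
```

The estimate follows from the martingale property of the neutral frequency: the stopped frequency has expectation p, and it equals roughly f on the trajectories that hit the boundary and 0 otherwise.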

      We will now consider the jth moments of the time until first passage, conditional on reaching f, which we denote as Embedded Image. Importantly, note that Embedded Image depends on f, but we have excluded f as a variable for simplicity of notation. According to these definitions, Embedded Image and Embedded Image, where it is implicit that Embedded Image and Embedded Image respectively. This relates the Embedded Image notation to the previous notation (Eqs. 29, 30, 33, 34). We will now compute the moments by first defining a useful quantity, as done in [71]: Embedded Image

      Mj(p) is the unnormalized version of Embedded Image and differs by a factor φ(p). That is,Embedded Image

      To find Embedded Image, we will rewrite the Kolmogorov backwards equation and its boundary conditions in terms of Mj(p), and then after solving the equation normalize Mj(p) to obtain Embedded Image. We begin with the first moment M1(p), and convert the backwards equation by differentiating both sides with respect to τ, multiplying by τ, and then integrating from 0 to ∞ to obtain:Embedded Image

      Through integration by parts, assuming Embedded Image and noting p < f we obtain:Embedded Image

      Substituting the ultimate probability of reaching f, Embedded Image, we obtain: Embedded Image

      To derive the boundary conditions, we assume Embedded Image since the expected time to reach frequency f is 0 assuming p = f, and that Embedded Image. The second boundary condition reflects the scaling limit in Section 4 where the frequency distribution becomes invariant as the initial frequency vanishes. Multiplying by Embedded Image shows these boundary conditions are equivalent to the conditions M1(f) = 0, and M1(p) → Kp asymptotically as p → 0, where K is some constant (see Eq. 41). We can solve the ODE (Eq. 44) through integration:Embedded Image

      We first apply M1(p) → Kp as p → 0, which is satisfied only if C2 = 0. This follows from the fact that the first term goes to 0 as Embedded Image, which can be seen from the Taylor expansion Embedded Image. Applying the second boundary condition M1(f) = 0 we obtain: Embedded Image

      As a validation, normalizing (Eq. 46) by Embedded Image (Eq. 41) and then taking the limit as f → 1 we obtain:Embedded Imagewhich is the mean time to fixation, found in [71] and previous sources but with a different approach. Next we will use the following Taylor expansion in (Eq. 46)Embedded Imageand assume p << 1 and f << 1 to let higher order terms vanishEmbedded Image

      Normalizing by Embedded Image we obtainEmbedded Image

      We will first consider the c ≥ 1 case and set Embedded Image and Embedded Image to obtainEmbedded Image

      We can now use (Eq. 51) along with (Eq. 29) and the fact that Embedded Image to obtain the expected offset in the posterior distribution:Embedded Image

      We will now take a step back to think about the result we have obtained. There are multiple but equivalent ways to interpret the equation; we suggest the following. First consider Embedded Image, which is equal to the duration of a generation divided by the reproductive variance. Recall that the average population size during the drift phase is 1 + (σ2)yτ conditional on fixation, and therefore Embedded Image is the time it takes the population size to increase on average by one. We refer to this quantity as the effective duration of a generation, by analogy with an effective population size, and use the notation Δ. The effective duration of a generation of a process can be thought of as the duration of a generation assigned to a Wright-Fisher model (holding sy, gy, (Ne)y fixed) that has overlapping trajectories of population size and the same distribution of coalescence times. If we let Embedded Image refer to the Δ of the underlying process, and Embedded Image to the Δ of the model, we can obtain a cleaner expression Embedded Image which holds for c ≥ 1. This change of notation in the sampled distributions reflects the fact that the formula turns out not to depend on c explicitly. This is because we can show that (Eq. 53) holds for 0 < c < 1 as well. To see this, we can revisit the offset formula for 0 < c < 1: Embedded Image

      We can then go back to the expression for Embedded Image in (Eq. 50) and substitute the modified initial conditions Embedded Image and Embedded Image to obtainEmbedded Image

      Substituting (Eq. 55) into (Eq. 54) we obtain Embedded Image or equivalently Embedded Image for 0 < c < 1, which is exactly the same expression we obtained for c ≥ 1. It therefore follows that these formulas hold for all c > 0.

      To obtain the offset in the variance under the parameter change, we will compute Embedded Image. We can take a similar approach: differentiate the backwards equation with respect to τ, multiply by τ2, and then integrate across all τ to obtain: Embedded Image

      Using integration by parts, assuming Embedded Image, and rewriting the expression we obtainEmbedded Image

      We must now substitute M1(p) into the equation. However, we are interested in an approximate solution under the assumption p << 1 and f << 1, so we will substitute (Eq. 49). Strictly, the derivation requires carrying the higher order terms of M1(p) and showing that they vanish in the final equation; to simplify the presentation, we omit them here and note that in the final answer they appear only as higher order products of f and p, which vanish. We now substitute our approximation for M1(p): Embedded Image

      Integrating once we obtain:Embedded Image

      Using the following first order approximations, and again noting that higher order terms will vanish, we can substitute Embedded Image into (Eq. 61) to obtain Embedded Image

      We then integrate a second time:Embedded Image

      As before, we can derive boundary conditions that reflect the scaling limit in Section 4. In particular, we assume Embedded Image becomes constant, or equivalently that M2(p) → Kp asymptotically, as p → 0. In addition, we assume Embedded Image, or equivalently M2(f) = 0. After applying both boundary conditions we obtain: Embedded Image

      Now applying the normalization φ(p) we obtain: Embedded Image

      Lastly, substituting Embedded Image and Embedded Image we obtainEmbedded Image

      We can now compute the variance from the first two moments:Embedded Image

      It therefore follows that the offset in the variance is:Embedded Imageor equivalentlyEmbedded Image

      Although this derivation assumes c ≥ 1, we can do a similar calculation as for the mean using the variance offset formula for 0 < c < 1 to show the formula holds more generally for all c > 0.

      8 Empirical Validation of Mean Offset Formulas Using Simulated Data

      In the previous sections, we showed that the mean offsets in the posteriors of gy and sy when using a set of correct vs. incorrect parameter values are Embedded Image Embedded Image where Δu and Δm are effective durations of a generation, and Sm and Su are random variables representing a sampled point from the posterior of sy. As stated before, the invariance of the inferred sy (Eq. 72) is a direct consequence of the way trajectories are mapped between the scaled and unscaled processes, and therefore does not require an explicit calculation. Subscripts m and u denote whether the posterior is computed using the model parameters or the underlying process parameters, respectively.

      To validate our theoretical results using simulations, we performed inference on simulated data and computed the mean offsets empirically. To do this, we generated a simulated data tree using the Wright-Fisher model with selection as defined above (more details can be found in [33]). The model parameters used were s = 0.06779, g = 200, L = 600, N = 10^9, with the Wright-Fisher process conditioned on non-extinction at the final time point. A mutation rate of 5 per generation was used to assign the number of mutations, drawn from a Poisson distribution, to each branch of the tree. This sets the average tree length to 3000 mutations. Twenty individuals were sampled at the final time-point, and the data tree was reconstructed as the coalescent tree of these individuals. The simulated tree used in our analysis is shown in (Fig. 1a).

      Figure 1:

      a. The simulated data tree, with the vertical axis representing the number of mutations accumulated along the branches. b. Two example posterior joint-distributions of sy and gy inferred from the data tree, using scaling factors of c = 1 and c = 4. c. The effect of varying the scaling factor c (which scales the model parameter L) on the inferred gy posterior. Blue dots represent the numerically computed means of the posterior distributions, with error bars indicating the standard error due to finite sampling. The red shaded region shows the predicted theoretical offsets calculated using Equation 73, relative to the ground truth at c = 1 (where the model L matches the true number of generations used to generate the data tree). d. Identical to (c), but showing results for the computed means of the sy posteriors and the corresponding theoretical offsets calculated using (Eq. 74).

      Different posterior distributions were inferred from the data tree using the same Wright-Fisher model, a fixed N = 10^9 that treats saturation as effectively infinite, and a fixed uniform prior on sy, gy, while varying the assumed number of generations in the model. The assumed number of generations for the different posteriors were cL for Embedded Image, where L = 600. To superimpose the processes on an absolute time-scale, we assumed that each generation corresponds to one unit of time so that a = 600, or in other words Embedded Image. The mutation rate was set to Embedded Image per generation to match the mutation rate per unit time in the data tree for all inferences.

      Approximate Bayesian Computation (ABC) was used to obtain the posterior distributions. ABC simulates trees based on the specified priors and model, retaining parameter values that generate simulated trees similar to the observed data tree, as measured by a distance metric. These retained parameter values represent samples from the posterior distribution. Further details and software regarding ABC inference from coalescent trees, as used in our simulations, can be found in [33].
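Schematically, rejection ABC works as in the sketch below, with a toy scalar statistic standing in for the simulated tree and its distance metric (the actual software used here is described in [33]; every name and value in this sketch is illustrative):

```python
import numpy as np

def abc_rejection(observed_stat, simulate, prior_sample, distance, eps, n_draws, rng):
    """Bare-bones ABC rejection: keep prior draws whose simulated
    summary statistic lands within eps of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if distance(simulate(theta, rng), observed_stat) < eps:
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(1)
# toy stand-in for tree simulation: statistic ~ Normal(theta, 1)
posterior = abc_rejection(
    observed_stat=3.0,
    simulate=lambda th, r: th + r.normal(),
    prior_sample=lambda r: r.uniform(0.0, 10.0),
    distance=lambda a, b: abs(a - b),
    eps=0.5,
    n_draws=20_000,
    rng=rng,
)
```

The accepted values approximate samples from the posterior; shrinking eps trades acceptance rate for accuracy.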

The number of points sampled from each posterior, corresponding to the different Embedded Image values, was n = 322135, 247253, 201252, 172164, 149576, 64180, 36946, and 25857, respectively. The empirical means of the marginals for gy and sy were computed for each posterior, and the standard errors of the means were computed by bootstrapping (resampling with replacement) the points sampled from each posterior.
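The bootstrap step can be sketched as follows (a generic resampling estimator, with synthetic Gaussian draws standing in for posterior samples):

```python
import numpy as np

def bootstrap_se_of_mean(samples, n_boot=2000, rng=None):
    """Standard error of the sample mean via bootstrap resampling
    with replacement."""
    rng = rng or np.random.default_rng(0)
    n = samples.size
    boot_means = np.array([
        samples[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])
    return boot_means.std(ddof=1)

rng = np.random.default_rng(2)
draws = rng.normal(loc=250.0, scale=30.0, size=5000)
se = bootstrap_se_of_mean(draws, rng=rng)
# the analytic standard error is scale / sqrt(n) = 30 / sqrt(5000)
```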

We then derived the theoretical offsets of the mean. To do this, we computed the effective duration of a generation in the underlying process. Since Embedded Image and Embedded Image for the data tree, Δu = 1. However, the model used for the inferences assumes cL generations with σ2 = 1, so Embedded Image. Therefore, the mean offset formulas are Embedded Image Embedded Image where 𝔼[Gu] and 𝔼[Su] are the means of Gu and Su, respectively, estimated empirically using ABC with the correct parameter values. The equations predict the offsets from these estimated means.
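A minimal numeric reading of these durations, assuming Δ is the absolute time spanned by one effective generation and that the reproductive variance enters multiplicatively (the helper function and its name are ours, not the paper's):

```python
def effective_generation_duration(a, n_generations, sigma2=1.0):
    """Absolute-time span of one effective generation (a hypothetical
    helper; for sigma2 = 1 this is just a divided by the generation count)."""
    return a / (n_generations * sigma2)

a, L = 600.0, 600
delta_u = effective_generation_duration(a, L)  # underlying process: 1.0
delta_m = {c: effective_generation_duration(a, c * L) for c in (1, 2, 4, 8)}
# the mean offsets vanish at c = 1, where delta_m equals delta_u
```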

We begin by plotting two example posterior distributions from the data tree using c = 1 and c = 4 (Fig. 1b, top and bottom, respectively). To precisely illustrate the results, we directly compare the empirical mean offsets with their expected theoretical values for all c (Fig. 1c). In this figure, the empirical mean calculated for each c is shown as a blue dot, with the bootstrapped standard error of the mean represented by error bars. The red shaded region represents the theoretical offset calculated using (Eq. 73). More specifically, the upper and lower boundaries of this region are determined by setting 𝔼[Gu] in (Eq. 73) to the upper and lower bounds of the error bar at c = 1, respectively. This provides a probable range for the true mean offsets. By definition, the red shaded region includes the blue dot at c = 1, as the offset should be zero when using the correct parameter values. Fig. 1d presents analogous results, but with blue dots representing the computed means for the posteriors of sy and the red shaded region representing the theoretical offsets calculated using (Eq. 74). Our theoretical predictions of the mean offsets match the empirically inferred values. We did not compare higher moments because the exceedingly small predicted effect sizes make them computationally impractical to resolve numerically.

      9 General robustness for nonuniform prior distributions

Up to this point, we have derived explicit formulas for the offsets of the mean and variance of the inferred posterior when varying (σ2)y, under the assumption of a uniform prior distribution. These formulas were derived from a set of general equations (the final forms of Eq. 26 and Eq. 27) that allow for arbitrary prior distributions. We rewrite the general equations below. In particular, the offset of the mean is: Embedded Image where Embedded Image

The offset of the second moment is given by: Embedded Image where Z* is also computed from (Eq. 76). The offset in the variance is computed from both (Eq. 75) and (Eq. 77). Offset formulas for arbitrary P(gy) are then derived by substituting the explicit form of the prior into these equations. Although our goal here is not to create a catalog of equations under different priors, we will demonstrate the method's utility with one case of interest where the solution can be computed explicitly. In particular, we assume an exponentially distributed prior Embedded Image, where α > 0. To begin, we compute Z* in terms of Z by substituting Embedded Image into (Eq. 76): Embedded Image

Substituting Embedded Image into (Eq. 75) we obtain: Embedded Image and then substituting (Eq. 78) into (Eq. 79) we obtain: Embedded Image which holds for c ≥ 1. The integrand in (Eq. 80), Embedded Image, is a well-defined probability distribution, since it is non-negative and integrates to one. Therefore, Embedded Image is an expectation. If we let Embedded Image be a random variable denoting a sample from this distribution, then we can write: Embedded Image

We can do a similar calculation for the 0 < c < 1 case to obtain: Embedded Image

As in previous calculations, the subscripts c(σ2)y or (σ2)y on Tα implicitly define both the starting frequency p and the absorbing boundary f of the first passage time process of the corresponding Q(T). In particular, the Q(T) used to derive the expressions for c ≥ 1 and for 0 < c < 1 are different: for c ≥ 1, Embedded Image and Embedded Image, while Embedded Image and Embedded Image for 0 < c < 1.

A similar expression exists for the variance, Embedded Image

We can then compute the unknown first-passage-time-dependent terms in (Eqs. 81–82) and (Eq. 83) using the integral definition and a Taylor expansion of the exponential factor: Embedded Image
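For the exponential factor, the expansion plausibly takes the following form (a hedged reconstruction consistent with the description above; the exact statement of Eq. 84 is not reproduced here):

```latex
\mathbb{E}\!\left[e^{-\alpha T}\right]
  = \int_0^\infty e^{-\alpha T}\, Q(T)\, dT
  = \int_0^\infty \sum_{j=0}^{\infty} \frac{(-\alpha)^j T^j}{j!}\, Q(T)\, dT
  = \sum_{j=0}^{\infty} \frac{(-\alpha)^j}{j!}\, \mathbb{E}\!\left[T^j\right],
```

where the last step exchanges the integral and the sum, the operation whose validity is established in the Supplement.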

The exchange of the integral with the sum in (Eq. 84) is a valid operation and converges whenever Embedded Image (see the Supplement for details, including Eq. 104 for the definition of Aj). The derivation of this result in the Supplement assumes Q(T) decays exponentially as T → ∞. Numerical simulations show Embedded Image approaches a value between Embedded Image and Embedded Image as j → ∞ (see Fig. 2 in the Supplement).

      Figure 2:

Different Ai for i = 1, 2, …, 800 are computed numerically and Embedded Image vs i is plotted. Different colors are used for each point for better visualization. The plot shows Embedded Image converging as i → ∞, which implies that the magnitude of Ai decays exponentially as i increases.

The expectation in (Eq. 84) is now expressed in terms of the moments of Q(T), which are given by the Kolmogorov backwards equation. A similar Taylor expansion can be done for the case 0 < c < 1, except that the exchange of the integral and sum is valid for Embedded Image, the difference being that Δu is replaced with Δm.

In the Supplement, we derive the general form of the jth moment of Q(T) in terms of effective generations using the Kolmogorov backwards equation. Using the Embedded Image notation in (Eq. 84), we can write Embedded Image where the Embedded Image are given by (Eq. 104) in the Supplement. Substituting (Eq. 85) into (Eqs. 81–82) we obtain: Embedded Image

The unknown term in the variance offset can be computed in a similar manner through Taylor expansions of Embedded Image which then gives us: Embedded Image

The offset of the variance is then obtained by substituting (Eq. 88) into (Eq. 83), and then substituting (Eq. 104) in the Supplement for the final form. Notice that in the numerator of the first term of (Eq. 88), the moments carry subscript j + 2 as opposed to j + 1. Lastly, as in the uniform prior case, these expressions can be written as expansions in Δ. This is a general feature of these formulas, since the conditions p and f can be written in terms of Δ.

Up to this point, we have considered two particular prior distributions for which the offset formulas can be solved explicitly: the uniform and the exponential distribution. In both cases, the offsets of the mean and variance of the posterior are impacted by an additive first passage time term. This occurs because the uniform and exponential distributions exhibit the properties P(gy + T) = P(gy) and P(gy + T) ∝ P(gy)P(T), respectively. In the former case, we made this assumption explicitly, while in the latter case implicitly. However, the uniform and exponential distributions are the only ones that satisfy either of these properties, so in general the relationship between the first passage time and the offsets of the posterior is not as simple. Although we cannot derive simple expressions in general, we can still compute the offsets numerically using the generalized equations.

One potential approach is to assume P(gy) is a smooth, entire function and use the Taylor expansion Embedded Image in (Eqs. 75–77) to obtain, for the case c ≥ 1: Embedded Image Embedded Image which show how the mean and second moment of the posterior change, respectively. Special care should be taken to ensure that exchanging the integral and sum to obtain (Eqs. 89–90) is valid, for example by establishing dominated convergence based on the properties of the assumed prior. The case 0 < c < 1 is more complicated: it requires not only using Embedded Image instead of Embedded Image due to the different conditions on p and f, but also swapping Embedded Image and Embedded Image so that the left-hand sides of (Eqs. 75–77) now refer to the expectation and second moment under the underlying (i.e., unscaled) process. The posterior of the model process (i.e., the scaled process) is then convolved with the first passage time distribution, requiring a transform or de-convolution to invert the equation. We will not propose a method for this case and instead leave it as an open problem.

(Eqs. 89–90) can be used to numerically approximate the offset formulas for arbitrary prior distributions, under the appropriate assumptions, given a sufficient number of points sampled from the inferred posterior.

This is possible because we can rewrite f(Ck|gy) in terms of the posterior as Embedded Image, since the normalization constant for Ck appears in both the numerator and the denominator. Therefore, given n points sampled from the posterior, say Embedded Image, we can approximate the offset formulas by taking weighted averages over (gy)i and Embedded Image for the sample mean and second moment, respectively, using the following estimators: Embedded Image Embedded Image which hold for c ≥ 1. As before, the variance offset estimator is obtained by combining the two equations (Eqs. 91–92). We can now use these estimators, along with points sampled from the posterior distribution generated by any Bayesian inference framework (such as Approximate Bayesian Computation) with a non-uniform prior, to compute the offsets of the mean and variance of the inferred gy due to changing the model parameters.
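Such estimators can be sketched as self-normalized weighted moments. The weight vector here is an abstract input (the exact weights in Eqs. 91–92 involve the first passage moment series and are not reproduced); with uniform weights the estimator reduces to the ordinary sample moments:

```python
import numpy as np

def weighted_posterior_moments(g_samples, weights):
    """Self-normalized weighted estimates of the mean and variance of
    g from posterior draws, re-weighted by an externally supplied factor."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # self-normalization
    g = np.asarray(g_samples, dtype=float)
    m1 = np.sum(w * g)         # weighted first moment
    m2 = np.sum(w * g**2)      # weighted second moment
    return m1, m2 - m1**2      # mean and variance under the re-weighting

rng = np.random.default_rng(3)
g = rng.exponential(scale=100.0, size=10_000)  # stand-in posterior draws
m1, var = weighted_posterior_moments(g, np.ones_like(g))  # uniform weights
```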

As mentioned before, (Eqs. 91–92) can break down when the prior distribution is not entire, since the integration to ∞ breaks down. In those cases, it may be possible to derive different estimators by centering the Taylor expansion around different points to obtain a larger radius of convergence, and then truncating the integral at a finite point as an approximation. In addition, if dominated convergence cannot be established on the interval [0, ∞) and exchanging the integral with the sum is not justified, but these conditions can be established on compact intervals, then truncating the integral at a finite point and letting the partial sums go to infinity could be a potential solution.

      10 Discussion

We derived a set of general equations that describe how the posterior distributions of the inferred time of onset and fitness of a beneficial mutation change with the reproductive variance (σ2)y (or equivalently the number of generations L) in Cannings models of population genetics. We also showed that the inferred posterior distribution of the selection coefficient, in units of absolute time, is independent of (σ2)y. Under the assumption of a uniform prior distribution, we derived the offsets of the mean and variance of the posterior of the time of onset and related them to the moments of a first passage time distribution. We solved for the moments of first passage using the Kolmogorov backwards equation and related them to the effective duration of a generation Embedded Image. The derived formulas were validated with simulations. We then showed that the generalized equations can be used to derive formulas for arbitrary prior distributions as well; as an example, we considered the case of an exponentially distributed prior. In addition, we derived two estimators to compute the offsets of the mean and variance numerically, but under restricted conditions and only for the case c ≥ 1, where the c(σ2)y of the model process is larger than the (σ2)y of the underlying process. All of these formulas were derived assuming Embedded Image and N ≫ 1, and that Embedded Image for all T > 0. If any of these assumptions breaks down significantly, the derived formulas will not apply.

In terms of future directions, we make the following suggestions. In this work, we focused only on the offsets of the mean and variance of the posterior. Extensions to higher moments are easily made by directly computing the offset of the jth moment (i.e., the arbitrary-jth-moment version of Eqs. 75–77), using a binomial expansion of (gy + T)^j in the derivation to break apart the integral. Under a uniform prior assumption, the expression reduces further to products of the moments of the original posterior with the correct parameter set and the moments of the time of first passage, which are given by (Eq. 104) in the Supplement. It would also be interesting to check whether simplified formulas exist for the centered moments, similar to the calculation for the variance. In addition, our framework allows for numerical computation of the posterior offsets under arbitrary prior distributions using the derived estimators (Eqs. 91–92) of the generalized equations (Eqs. 75–77). These estimators hold under restricted conditions and only for the c ≥ 1 case, but it may be possible to de-convolve the generalized equations, or possibly (Eq. 22), for the 0 < c < 1 case. We suggest improving the numerical approach, clarifying the conditions under which the approximations hold, and developing a de-convolution method for the remaining case. The concepts introduced here could also be extended to multiple competing or subclonal mutations, and it would be interesting to examine the limits where the equations or assumptions break down, for example when multiple-merger events occur. Lastly, our framework applies to forms of data besides coalescent trees, such as time-course measurements of gene frequencies or site-frequency spectra, and so more rigorous calculations for other data forms should also be carried out.

In summary, we developed a framework based on a first passage time approach that extends notions of robustness under an effective population size to inferences of selection and the time of onset of mutations. This approach is not only conceptually useful: it can be used to correct inferences under changing model assumptions or new information with simple formulas, instead of requiring the data to be reanalyzed. As a result, this work furthers our understanding of the robustness of population genetics models with drift and selection in a way that can be directly applied to real-world data.

      12 Supplement

The Supplement is divided into two parts. In the first part, we establish a general formula for computing arbitrary jth moments of the first passage time distribution. In the second part, we justify the exchange of the integral with the sum in (Eq. 84) using Lebesgue's dominated convergence theorem. In the derivation, we assume that Q(T) decays exponentially.

We will now derive the jth moments of the first passage time distribution. The formula is obtained by solving the moment version of the Kolmogorov backwards equation inductively. To begin, we consider the jth-moment transformed Kolmogorov backwards equation (similar to Eqs. 42 and 58 for the first two moments, but multiplying instead by τ^j) and assume p ≪ 1 so that the p² terms vanish (equivalently, replacing p(1 − p) with p) to obtain the following approximation Embedded Image

Recall that this equation computes the unnormalized moments. It turns out that solving the approximate form of the transformed equation (i.e., Eq. 93) is equivalent to solving the equation without the approximation and then letting the order-p² terms vanish in the solution. We use the approximation because it simplifies the calculations, and leave it to the reader to verify that both approaches are equivalent and produce the same answer.

Using integration by parts and assuming Embedded Image and p < f, we can rewrite (Eq. 93) as Embedded Image

Here the boundary conditions are Mj(f) = 0 and Mj(p) = Kp asymptotically as p → 0. We claim that the general solution to the boundary value problem is Embedded Image where Embedded Image

This solution was found by computing the first several moments recursively until the pattern emerged. For a rigorous proof, we use induction.

We begin with the base case j = 1 and show Embedded Image which matches (Eq. 49) and is hence the solution for j = 1.
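As a numerical sanity check of the first moment, one can simulate the small-p diffusion directly: in the weak-noise limit, the mean first passage time should approach the deterministic growth time (1/s) ln(f/p). The Euler-Maruyama scheme and all parameter values below are illustrative stand-ins, not the paper's method:

```python
import numpy as np

def first_passage_times(s, sigma2, N, p0, f, dt=0.1, n_paths=200,
                        t_max=10_000.0, rng=None):
    """Euler-Maruyama first passage times to reach frequency f from p0
    under the small-p diffusion dX = s X dt + sqrt(sigma2 X / N) dW."""
    rng = rng or np.random.default_rng(0)
    times = []
    for _ in range(n_paths):
        x, t = p0, 0.0
        while x < f and t < t_max:
            x += s * x * dt + np.sqrt(max(sigma2 * x / N, 0.0) * dt) * rng.normal()
            x = max(x, 0.0)  # frequencies cannot go negative
            t += dt
        if x >= f:
            times.append(t)
    return np.array(times)

# weak-noise regime (large N): the empirical mean sits near ln(f/p0)/s
T = first_passage_times(s=0.05, sigma2=1.0, N=1e8, p0=0.01, f=0.1)
```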

For the inductive step, we assume the formula holds for Mj(p). Using this assumption in (Eq. 94) for j + 1, we can write Embedded Image (Eq. 97) can then be integrated twice, producing Embedded Image where C1 and C2 are constants of integration. C2 vanishes after applying the asymptotic boundary condition Embedded Image

It is useful to rewrite Embedded Image, where Aj+1 is, at this point, an arbitrary constant: Embedded Image

After applying the remaining boundary condition Mj+1(f) = 0, we obtain Embedded Image or in other words, Embedded Image. The result then follows because Aj+1 has the same form as the Ai for i ≤ j, and the Aj+1 term can be absorbed into the sum and rewritten as Embedded Image

Now that we have established the formula for all j = 1, 2, 3, …, we can normalize the solution by Embedded Image to obtain the moments of first passage: Embedded Image

Lastly, we substitute the appropriate p and f values. Recall that there are two cases (see Section 7, Calculating the time of first passage). In particular, for c ≥ 1 we use Embedded Image and Embedded Image, and for 0 < c < 1 we use Embedded Image and Embedded Image. After substituting for both cases, and then using the definitions of Δm and Δu, we obtain the final form of the moments Embedded Image where Embedded Image

Simulated values of Embedded Image in (Fig. 2) show that it is monotonically decreasing and converges to a value between Embedded Image and Embedded Image. This implies that Ai decays exponentially as i increases.

The remainder of the Supplement shows that the integral in (Eq. 84) can be exchanged with the sum. This can be done through Lebesgue's dominated convergence theorem by showing that all of the partial sums, before distributing the integral, are bounded by an integrable function. First, however, we must consider the asymptotic tail behavior of Q(T): we must argue that Q(T) decays sufficiently fast as T → ∞ for the partial sums to remain bounded.

We can see from (Eq. 104) that for the case c ≥ 1: Embedded Image provided the limit exists (Fig. 2 shows the limit numerically and Fig. 3 shows Eq. 105). For the case 0 < c < 1, we simply replace Δu with Δm and Embedded Image with Embedded Image in (Eq. 105). The existence of the limit (Eq. 105) suggests that the first passage density Q(T) asymptotically has an exponentially decaying tail with rate Embedded Image, since the jth moments of Q(T) match the moments of the exponential distribution in the limit. Although a rigorous proof of this result should be established using large deviation theory, to make progress we take as an assumption that Q(T) is asymptotically exponential for the remainder of the Supplement, and denote the corresponding decay rate by β.

      Figure 3:

The limit in (Eq. 105) is shown numerically by computing Embedded Image for j = 1, 2, 3, …, 50, using the definitions provided in (Eq. 104), for different parameter choices of Δm and Δu. The simulated values are plotted against j for each parameter choice, all superimposed. Δu is always kept larger than Δm so that the condition c ≥ 1 is satisfied. The choices of Δm and Δu are shown in the figure legend, with dot colors identifying each parameter choice. The theoretical limit (the right-hand side of Eq. 105) is approximated numerically as Embedded Image for each curve using the definition provided in (Eq. 104), and superimposed as horizontal dotted lines. The theoretical limits depend only on Δu and are labeled to the right of their corresponding horizontal lines according to the range of Δu used. As expected, the plotted curves converge to their theoretical limits independently of the Δm used, showing that (Eq. 105) holds.

We now consider the partial sums of (Eq. 84). It is easy to show with the triangle inequality that Embedded Image where r is a non-negative integer. Next, we show that the right-hand side of (Eq. 106) is an integrable function. We consider the integral of (Eq. 106) and rewrite it as follows: Embedded Image where F(T) is the CDF of Q(T). Using F(T) = 1 − P(T_RV ≥ T), where T_RV is the random variable with distribution Q(T), we obtain Embedded Image which follows from the assumption that Q(T) is asymptotically exponential, that is, Embedded Image (with K constant) as T → ∞, so long as α < β. Under this assumption, dominated convergence is satisfied and the exchange of the integral and sum in (Eq. 84) is justified.
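One way to read the bound: every partial sum of the exponential series is dominated in absolute value by the full exponential, so that, under the assumed tail behavior of Q(T), a single integrable dominating function exists whenever α < β (a hedged reconstruction of the estimate behind Eqs. 106–108):

```latex
\left|\sum_{j=0}^{r} \frac{(-\alpha)^j T^j}{j!}\right| Q(T)
  \;\le\; \sum_{j=0}^{r} \frac{\alpha^j T^j}{j!}\, Q(T)
  \;\le\; e^{\alpha T} Q(T)
  \;\sim\; K\, e^{-(\beta - \alpha) T} \quad (T \to \infty),
```

which is integrable on [0, ∞) precisely when α < β, as required by Lebesgue's dominated convergence theorem.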

      11 Acknowledgments

      We acknowledge generous funding from the National Institutes of Health (NIH) –National Heart, Lung, and Blood Institute (NHLBI)– under grant R01HL158269 and its supplement, and the Leukemia & Lymphoma Society under grant 8041-24.

      References

      [80]. P. Narain. A note on the diffusion approximation for the variance of the number of generations until fixation of a neutral mutant gene. Genetics Research, 15(2):251–255, April 1970.

      [1]. Sewall Wright. Evolution in Mendelian populations. Genetics, 16(2):97–159, March 1931.

      [2]. R. A. Fisher. The genetical theory of natural selection. The genetical theory of natural selection. Clarendon Press, Oxford, England, 1930.

      [3]. P. A. P. Moran. Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society, 54(1):60–71, January 1958.

      [4]. C. Cannings. The Latent Roots of Certain Markov Chains Arising in Genetics: A New Approach, I. Haploid Models. Advances in Applied Probability, 6(2):260–290, 1974.

[5]. C. Cannings. The Latent Roots of Certain Markov Chains Arising in Genetics: A New Approach, II. Further Haploid Models. Advances in Applied Probability, 7(2):264–282, 1975.

      [6]. A. Kolmogoroff. über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104(1):415–458, December 1931.

      [7]. M. Kimura. Stochastic processes and distribution of gene frequencies under natural selection. Cold Spring Harbor Symposia on Quantitative Biology, 20:33–53, 1955.

      [8]. James Crow and Motoo Kimura. Some genetic problems in natural populations. In Jerzy Neymann, editor, Contributions to Biology and Problems of Health, pages 1–22. University of California Press, December 1956.

[9]. Motoo Kimura and James F. Crow. The Measurement of Effective Population Number. Evolution, 17(3):279–288, 1963.

      [10]. J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, September 1982.

[11]. J. F. C. Kingman. On the Genealogy of Large Populations. Journal of Applied Probability, 19:27–43, 1982.

      [12]. J. F. C. Kingman. Exchangeability and the Evolution of Large Populations. In G. Koch and F. Spizzichino, editors, Exchangeability in Probability and Statistics, pages 97–112. North-Holland Publishing Company, 1982.

[13]. M. Möhle. Robustness Results for the Coalescent. Journal of Applied Probability, 35(2):438–447, 1998.

[14]. M. Möhle. Weak Convergence to the Coalescent in Neutral Population Models. Journal of Applied Probability, 36(2):446–460, 1999.

[15]. M. Möhle. Coalescent Results for Two-Sex Population Models. Advances in Applied Probability, 30(2):513–520, 1998.

[16]. M. Möhle. A Convergence Theorem for Markov Chains Arising in Population Genetics and the Coalescent with Selfing. Advances in Applied Probability, 30(2):493–512, 1998.

      [17]. Bjarki Eldon and John Wakeley. Coalescent Processes When the Distribution of Offspring Number Among Individuals Is Highly Skewed. Genetics, 172(4):2621–2633, April 2006.

      [18]. L. Jin, M. L. Baskett, L. L. Cavalli-Sforza, L. A. Zhivotovsky, M. W. Feldman, and N. A. Rosenberg. Microsatellite evolution in modern humans: a comparison of two data sets from the same populations. Annals of Human Genetics, 64(2):117–134, March 2000.

      [19]. R. Nielsen. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154(2):931–942, February 2000.

      [20]. J. D. Wall and M. Przeworski. When did the human population size start increasing? Genetics, 155(4):1865–1874, August 2000.

      [21]. J. C. Stephens, J. A. Schneider, D. A. Tanguay, J. Choi, T. Acharya, S. E. Stanley, R. Jiang, C. J. Messer, A. Chew, J. H. Han, J. Duan, J. L. Carr, M. S. Lee, B. Koshy, A. M. Kumar, G. Zhang, W. R. Newell, A. Windemuth, C. Xu, T. S. Kalbfleisch, S. L. Shaner, K. Arnold, V. Schulz, C. M. Drysdale, K. Nandabalan, R. S. Judson, G. Ruano, and G. F. Vovis. Haplotype variation and linkage disequilibrium in 313 human genes. Science (New York, N.Y.), 293(5529):489–493, July 2001.

[22]. A. Polanski and M. Kimmel. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics, 165(1):427–436, September 2003.

      [23]. Alison M. Adams and Richard R. Hudson. Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms. Genetics, 168(3):1699–1712, November 2004.

      [24]. Gabor T. Marth, Eva Czabarka, Janos Murvai, and Stephen T. Sherry. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 166(1):351–372, January 2004.

      [25]. Joshua G. Schraiber and Joshua M. Akey. Methods and models for unravelling human evolutionary history. Nature Reviews. Genetics, 16(12):727–740, December 2015.

      [26]. Magnus Nordborg. Coalescent theory. Handbook of Statistical Genomics: Two Volume Set, pages 145–30, 2019.

      [27]. Alex Coventry, Lara M. Bull-Otterson, Xiaoming Liu, Andrew G. Clark, Taylor J. Maxwell, Jacy Crosby, James E. Hixson, Thomas J. Rea, Donna M. Muzny, Lora R. Lewis, David A. Wheeler, Aniko Sabo, Christine Lusk, Kenneth G. Weiss, Humeira Akbar, Andrew Cree, Alicia C. Hawes, Irene Newsham, Robin T. Varghese, Donna Villasana, Shannon Gross, Vandita Joshi, Jireh Santibanez, Margaret Morgan, Kyle Chang, Walker Hale IV, Alan R. Templeton, Eric Boerwinkle, Richard Gibbs, and Charles F. Sing. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nature Communications, 1:131, November 2010.

[28]. Matthew R Nelson, Daniel Wegmann, Margaret G Ehm, Darren Kessner, Pamela St. Jean, Claudio Verzilli, Judong Shen, Zhengzheng Tang, Silviu-Alin Bacanu, Dana Fraser, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science, 337(6090):100–104, 2012.

[29]. Jacob A. Tennessen, Abigail W. Bigham, Timothy D. O’Connor, Wenqing Fu, Eimear E. Kenny, Simon Gravel, Sean McGee, Ron Do, Xiaoming Liu, Goo Jun, Hyun Min Kang, Daniel Jordan, Suzanne M. Leal, Stacey Gabriel, Mark J. Rieder, Goncalo Abecasis, David Altshuler, Deborah A. Nickerson, Eric Boerwinkle, Shamil Sunyaev, Carlos D. Bustamante, Michael J. Bamshad, Joshua M. Akey, Broad GO, Seattle GO, and on behalf of the NHLBI Exome Sequencing Project. Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes. Science, 337(6090):64–69, July 2012.

[30]. Wenqing Fu, Timothy D. O’Connor, Goo Jun, Hyun Min Kang, Goncalo Abecasis, Suzanne M. Leal, Stacey Gabriel, Mark J. Rieder, David Altshuler, Jay Shendure, Deborah A. Nickerson, Michael J. Bamshad, NHLBI Exome Sequencing Project, and Joshua M. Akey. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature, 493(7431):216–220, January 2013.

[31]. Elodie Gazave, Li Ma, Diana Chang, Alex Coventry, Feng Gao, Donna Muzny, Eric Boerwinkle, Richard A. Gibbs, Charles F. Sing, Andrew G. Clark, and Alon Keinan. Neutral genomic regions refine models of recent rapid human population growth. Proceedings of the National Academy of Sciences, 111(2):757–762, January 2014.

      [32]. Feng Gao and Alon Keinan. Explosive genetic evidence for explosive human population growth. Current opinion in genetics & development, 41:130–139, 2016.

      [33]. Debra Van Egeren, Javier Escabi, Maximilian Nguyen, Shichen Liu, Christopher R. Reilly, Sachin Patel, Baransel Kamaz, Maria Kalyva, Daniel J. DeAngelo, Ilene Galinsky, Martha Wadleigh, Eric S. Winer, Marlise R. Luskin, Richard M. Stone, Jacqueline S. Garcia, Gabriela S. Hobbs, Fernando D. Camargo, Franziska Michor, Ann Mullally, Isidro Cortes-Ciriano, and Sahand Hormoz. Reconstructing the Lineage Histories and Differentiation Trajectories of Individual Cancer Cells in Myeloproliferative Neoplasms. Cell Stem Cell, 28(3):514–523.e9, March 2021.

[34]. Nicholas Williams, Joe Lee, Emily Mitchell, Luiza Moore, E. Joanna Baxter, James Hewinson, Kevin J. Dawson, Andrew Menzies, Anna L. Godfrey, Anthony R. Green, Peter J. Campbell, and Jyoti Nangalia. Life histories of myeloproliferative neoplasms inferred from phylogenies. Nature, 602(7895):162–168, February 2022.

[35]. K. R. Thornton, J. D. Jensen, C. Becquet, and P. Andolfatto. Progress and prospects in mapping recent selection in the genome. Heredity, 98(6):340–348, June 2007.

[36]. Matthew W. Hahn. Toward a Selection Theory of Molecular Evolution. Evolution, 62(2):255–265, 2008.

[37]. Guy Sella, Dmitri A. Petrov, Molly Przeworski, and Peter Andolfatto. Pervasive Natural Selection in the Drosophila Genome? PLOS Genetics, 5(6):e1000495, June 2009.

      [38]. Michael M. Desai and Daniel S. Fisher. Beneficial mutation selection balance and the effect of linkage on positive selection. Genetics, 176(3):1759–1798, July 2007.

      [39]. Stephen M Krone and Claudia Neuhauser. Ancestral Processes with Selection. Theoretical Population Biology, 51(3):210–237, June 1997.

[40]. Jim Pitman. Coalescents with Multiple Collisions. The Annals of Probability, 27(4):1870–1902, 1999.

      [41]. Serik Sagitov. The general coalescent with asynchronous mergers of ancestral lines. Journal of Applied Probability, 36(4):1116–1125, December 1999.

      [42]. Jason Schweinsberg. Coalescents with Simultaneous Multiple Collisions. Electronic Journal of Probability, 5:1–50, January 2000.

      [43]. Richard Durrett and Jason Schweinsberg. Approximating selective sweeps. Theoretical Population Biology, 66(2):129–138, September 2004.

      [44]. N. L. Kaplan, T. Darden, and R. R. Hudson. The Coalescent Process in Models with Selection. Genetics, 120(3):819–829, November 1988.

      [45]. R. R. Hudson and N. L. Kaplan. The coalescent process in models with selection and recombination. Genetics, 120(3):831–840, November 1988.

      [46]. N. L. Kaplan, R. R. Hudson, and C. H. Langley. The “hitchhiking effect” revisited. Genetics, 123(4):887–899, December 1989.

      [47]. T. H. Wiehe and W. Stephan. Analysis of a genetic hitchhiking model, and its application to DNA polymorphism data from Drosophila melanogaster. Molecular Biology and Evolution, 10(4):842–854, July 1993.

      [48]. M. Slatkin. Simulating genealogies of selected alleles in a population of variable size. Genetical Research, 78(1):49–57, August 2001.

      [49]. Yuseob Kim and Wolfgang Stephan. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics, 160(2):765–777, February 2002.

      [50]. Molly Przeworski. The signature of positive selection at randomly chosen loci. Genetics, 160(3):1179–1189, March 2002.

      [51]. Graham Coop and Robert C. Griffiths. Ancestral inference on gene trees under selection. Theoretical Population Biology, 66(3):219–232, November 2004.

      [52]. Yuseob Kim and Thomas Wiehe. Simulation of DNA sequence evolution under models of recent directional selection. Briefings in Bioinformatics, 10(1):84–96, January 2009.

      [53]. John Wakeley. Natural selection and coalescent theory. In Evolution since Darwin: The First 150 Years, pages 119–149, 2010.

      [54]. Sabin Lessard and Véronique Ladret. The probability of fixation of a single mutant in an exchangeable selection model. Journal of Mathematical Biology, 54(5):721–744, May 2007.

      [55]. Florin Boenkost, Adrián González Casanova, Cornelia Pokalyuk, and Anton Wakolbinger. Haldane’s formula in Cannings models: the case of moderately strong selection. Journal of Mathematical Biology, 83(6):70, December 2021.

      [56]. Willy Feller. Die Grundlagen der Volterraschen Theorie des Kampfes ums Dasein in wahrscheinlichkeitstheoretischer Behandlung. Acta Biotheoretica, 5(1):11–40, May 1939.

      [57]. David G. Kendall. On the Generalized “Birth-and-Death” Process. The Annals of Mathematical Statistics, 19(1):1–15, March 1948.

      [58]. J. B. S. Haldane. A mathematical theory of natural and artificial selection. Mathematical Proceedings of the Cambridge Philosophical Society, 23(5):607–615, January 1927.

      [59]. Motoo Kimura. On the Probability of Fixation of Mutant Genes in a Population. Genetics, 47(6):713–719, June 1962.

      [60]. R. A. Fisher. On the Dominance Ratio. Proceedings of the Royal Society of Edinburgh, 42:321–341, 1922.

      [61]. G. Malécot. Sur un problème de probabilités en chaîne que pose la génétique. Comptes Rendus de l’Académie des Sciences, 219:379–381, 1944.

      [62]. G. Malécot. La diffusion des gènes dans une population Mendélienne. Comptes Rendus de l’Académie des Sciences, 221:340–342, 1946.

      [63]. William Feller. Diffusion Processes in Genetics. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pages 227–247. University of California Press, January 1951.

      [64]. Motoo Kimura. Solution of a process of random genetic drift with a continuous model. Proceedings of the National Academy of Sciences, 41(3):144–150, March 1955.

      [65]. E.T. Whittaker and G.N. Watson. A Course of Modern Analysis. Cambridge University Press, 4th edition, 1927.

      [66]. M.J.O. Strutt. Lamesche, Mathieusche, und verwandte Funktionen in Physik und Technik. Springer, 1932.

      [67]. E.W. Hobson. Spheroidal and Ellipsoidal Harmonics. Cambridge University Press, 1931.

      [68]. J.A. Stratton, P.M. Morse, L.J. Chu, and R.A. Hutner. Elliptic Cylinder and Spheroidal Wave Functions. John Wiley, New York, 1941.

      [69]. E.L. Ince. 6:547–558, 1928.

      [70]. Henry Lee-Six, Nina Friesgaard Øbro, Mairi S Shepherd, Sebastian Grossmann, Kevin Dawson, Miriam Belmonte, Robert J Osborne, Brian JP Huntly, Inigo Martincorena, Elizabeth Anderson, et al. Population dynamics of normal human blood inferred from somatic mutations. Nature, 561(7724):473–478, 2018.

      [71]. Motoo Kimura and Tomoko Ohta. The Average Number of Generations until Fixation of a Mutant Gene in a Finite Population. Genetics, 61(3):763–771, March 1969.

      [72]. P. Narain. The Conditioned Diffusion Equation and Its Use in Population Genetics. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):258–266, 1974.

      [73]. Motoo Kimura. Some Problems of Stochastic Processes in Genetics. The Annals of Mathematical Statistics, 28(4):882–901, 1957.

      [74]. Motoo Kimura. Diffusion Models in Population Genetics. Journal of Applied Probability, 1(2):177–232, 1964.

      [75]. G. A. Watterson. Markov Chains with Absorbing States: A Genetic Example. The Annals of Mathematical Statistics, 32(3):716–729, 1961.

      [76]. G. A. Watterson. Some Theoretical Aspects of Diffusion Theory in Population Genetics. The Annals of Mathematical Statistics, 33(3):939–957, 1962.

      [77]. W. J. Ewens. Numerical Results and Diffusion Approximations in a Genetic Process. Biometrika, 50(3/4):241–249, 1963.

      [78]. W. J. Ewens. Population Genetics. Methuen, London, 1969.

      [79]. P. Narain. Response to selection in finite populations. Ph.D. thesis, University of Edinburgh, 1969.