The study of evolution has changed considerably since the days of Darwin's finches. Molecular biology has both identified DNA as the mechanism of heredity and opened a window into the complexity that evolutionary processes can produce. Still, evolution's gradual pace leaves researchers with few sources of data. Sequencing and genome mapping give a present-day snapshot and allow phylogenetic inference. The fossil record adds snapshots from the past. "Real-time" evolution can be seen in microorganisms, either experimentally or in the wild (e.g., with flu seasons), but many aspects of evolutionary dynamics remain poorly understood.

The energy landscape is a conceptual tool for describing dynamical behavior in physics. For a particle moving in 2D under conservative forces, the landscape is simply a plot of the potential $$ \phi $$ (the force is its negative gradient, $$ - \nabla \phi $$). For thermodynamic systems, the horizontal plane represents the space of all system configurations. The coordinates need not be Euclidean or even continuous. (For proteins, they might be rotation angles of all bonds along the polymer backbone.) Free energies (e.g., Helmholtz $$ A = U - TS $$) balance energetic and entropic contributions. Under biological conditions, an unfolded protein traverses its rough, funnel-shaped landscape, moving "downhill" toward low-free-energy folded configurations.
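To make the energy-versus-entropy balance in $$ A = U - TS $$ concrete, here is a minimal sketch of a toy two-state "protein". Everything in it is an illustrative assumption rather than data: a single folded configuration with a hypothetical energy of -10 kcal/mol, and a hypothetical $$ 10^6 $$ unfolded configurations at zero energy, with the entropy taken as $$ S = R \ln \Omega $$.

```python
import math

R_GAS = 0.0019872  # gas constant in kcal/(mol*K), i.e. Boltzmann's constant per mole


def helmholtz(energy, n_configs, temperature):
    """Return A = U - T*S, with S = R * ln(number of configurations)."""
    entropy = R_GAS * math.log(n_configs)
    return energy - temperature * entropy


def favored_state(temperature, u_folded=-10.0, u_unfolded=0.0, omega_unfolded=1e6):
    """Name the state with the lower free energy (all defaults are hypothetical)."""
    a_folded = helmholtz(u_folded, 1, temperature)  # one configuration, so S = 0
    a_unfolded = helmholtz(u_unfolded, omega_unfolded, temperature)
    return "folded" if a_folded < a_unfolded else "unfolded"


for temperature in (300.0, 400.0):
    print(temperature, favored_state(temperature))
```

With these made-up numbers the folded state wins at 300 K because its lower energy dominates, while at 400 K the entropic term $$ TS $$ of the many unfolded configurations takes over.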

Evolutionary dynamics can be framed in similar terms. The role of free energy is played by fitness, and the landscape is inverted (fitness is maximized). The horizontal coordinates could be phenotypes like expression levels, or genotypes, in the form of a sequence network. In the network pictured below, $$\mathrm{a}$$ and $$\mathrm{b}$$ might represent two of the four nucleotide bases, two classes of amino acids, or even two epigenetic states. Unfortunately, real fitness landscapes are difficult to access. Fitness, environment, and the genotype-to-phenotype map all depend on each other. (Synthetic-aesthetic landscapes, like the one in the banner above, are much more straightforward. ☺)
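A sequence network of this kind is easy to build explicitly for short sequences. The sketch below is one possible construction, assuming that two sequences are neighbors when they differ at exactly one site (a single point mutation); the function name `sequence_network` and the choice of length 3 are hypothetical.

```python
from itertools import product


def sequence_network(alphabet, length):
    """Map each sequence to the sequences that differ from it at exactly one site."""
    network = {}
    for letters in product(alphabet, repeat=length):
        seq = "".join(letters)
        neighbors = []
        for i, letter in enumerate(seq):
            for other in alphabet:
                if other != letter:
                    # Substitute one site to get a single-mutation neighbor.
                    neighbors.append(seq[:i] + other + seq[i + 1:])
        network[seq] = neighbors
    return network


net = sequence_network("ab", 3)
for seq, neighbors in net.items():
    print(seq, "->", neighbors)
```

For a two-letter alphabet the result is a hypercube: $$ 2^L $$ nodes, each with $$ L $$ neighbors. A fitness landscape then assigns a height to every node of this graph.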

[Figure: sequence network]

Evolutionary models are diverse in their purposes and in the assumptions they make (e.g., random mating). The Bak-Sneppen model, for instance, is of great theoretical interest; it shows how self-organized criticality might explain the power-law extinction statistics seen in the fossil record. Ewens's sampling formula, on the other hand, is frequently applied to real populations to look for signatures of neutral evolution. Our work (Khromov et al., 2018) generalizes Ewens's formula to arbitrary fitness landscapes. It applies to populations on large sequence networks when the evolutionary forces of selection, mutation, and genetic drift all balance, creating a steady-state, de-labeled allele frequency distribution.
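For reference, the classic (neutral) Ewens sampling formula gives the probability that a sample of $$ n $$ genes contains $$ a_j $$ alleles represented exactly $$ j $$ times: $$ P(a_1, \dots, a_n) = \frac{n!}{\theta^{(n)}} \prod_{j=1}^{n} \frac{\theta^{a_j}}{j^{a_j} \, a_j!} $$, where $$ \theta^{(n)} = \theta (\theta + 1) \cdots (\theta + n - 1) $$ and $$ \theta $$ is the scaled mutation rate. The sketch below evaluates only this neutral formula, not the generalization to fitness landscapes; the function name and the log-space layout are my own choices.

```python
import math


def ewens_log_prob(counts, theta):
    """Log-probability of a de-labeled sample under the neutral Ewens formula.

    counts[j - 1] is a_j, the number of alleles seen exactly j times.
    """
    n = sum(j * a for j, a in enumerate(counts, start=1))
    # log(n!) - log(rising factorial theta^(n)), using Gamma(theta + n) / Gamma(theta)
    log_p = math.lgamma(n + 1) - (math.lgamma(theta + n) - math.lgamma(theta))
    for j, a in enumerate(counts, start=1):
        log_p += a * math.log(theta) - a * math.log(j) - math.lgamma(a + 1)
    return log_p


# All three ways to partition a sample of n = 3 genes into allele classes.
partitions = [[3, 0, 0], [1, 1, 0], [0, 0, 1]]
total = sum(math.exp(ewens_log_prob(p, theta=2.5)) for p in partitions)
print(total)
```

Summing over every partition of the sample size should give a total probability of one, which is a convenient sanity check for any implementation.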

Not surprisingly, the generalization comes with more complicated mathematics. Even after coarse-graining to a two- or three-plane landscape, the sampling probabilities still contain nested sums and infinite series. My primary role was to write a theory code to efficiently compute our analytic results and validate them against simulations. Values for $$\mathcal{F}$$ (the generalized $$ _1F_1 $$) were computed via a matrix of Bell polynomials. The choice of parameter values affected how quickly the sums converged, so truncation had to be done carefully. The GNU Scientific Library was used extensively, especially its special-function routines, which check for overflow and underflow and facilitate calculations in $$ \log $$ space.
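The flavor of the log-space and truncation issues can be illustrated with the ordinary confluent hypergeometric function $$ _1F_1(a; b; x) = \sum_k \frac{(a)_k}{(b)_k} \frac{x^k}{k!} $$. The sketch below is not the project's code and does not compute the generalized $$\mathcal{F}$$; it is a standalone Python illustration, restricted by assumption to $$ a, b, x > 0 $$, of summing a series in log space with an explicit stopping rule.

```python
import math


def log_hyp1f1(a, b, x, rel_tol=1e-16, max_terms=1_000_000):
    """Return log(1F1(a; b; x)) for a, b, x > 0 by summing the series in log space."""
    if min(a, b, x) <= 0:
        raise ValueError("this sketch assumes a, b, x > 0")
    log_x = math.log(x)
    # Past k_safe every term is at most half the previous one, so the
    # discarded tail is bounded by the last term that was kept.
    k_safe = 2.0 * x * max(1.0, a / b)
    log_tol = math.log(rel_tol)
    log_term = 0.0  # the k = 0 term equals 1
    log_terms = [log_term]
    log_max = log_term
    for k in range(max_terms):
        # term_{k+1} / term_k = (a + k) / (b + k) * x / (k + 1)
        log_term += math.log(a + k) - math.log(b + k) + log_x - math.log(k + 1)
        log_terms.append(log_term)
        log_max = max(log_max, log_term)
        if k + 1 >= k_safe and log_term - log_max < log_tol:
            break
    else:
        raise RuntimeError("series did not converge within max_terms")
    # Log-sum-exp: factor out the largest term so no exponential overflows.
    return log_max + math.log(sum(math.exp(t - log_max) for t in log_terms))


# 1F1(a; a; x) = exp(x), so the log should be x even where exp(x) overflows a double.
print(log_hyp1f1(2.5, 2.5, 1000.0))
```

The number of terms needed grows with $$ x $$, which is one way that parameter values affect convergence: the terms first grow to a peak near $$ k \approx x $$ before they decay, so stopping at the first small term would be wrong.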