The study of evolution has changed considerably since the days of
Darwin's finches.
Molecular biology has both identified
DNA as the mechanism of
heredity and opened a window into the
complexity that evolutionary processes can produce. Still, evolution's gradual pace leaves researchers with few sources of data.
Sequencing and
genome mapping give a present-day snapshot and allow
phylogenetic inference. The
fossil record adds snapshots from the past. "Real-time" evolution can be seen in
microorganisms, either
experimentally or in the wild (e.g., with
flu seasons), but many aspects of evolutionary dynamics remain poorly understood.
The
energy landscape is a conceptual tool for describing
dynamical behavior in
physics. For a particle moving in 2D under
conservative forces, the landscape is simply a plot of the
potential $$ \phi $$ (and the force is its
negative gradient, $$ - \nabla \phi $$). For
thermodynamic systems, the horizontal plane represents the space of all system configurations. The coordinates need not be
Euclidean or even
continuous. (For
proteins, they might be rotation angles of all bonds along the
polymer backbone.)
Free energies (e.g.,
Helmholtz $$ A = U - TS $$) balance
energetic and
entropic forces. In biological conditions, an
unfolded protein traverses its rough,
funnel-shaped landscape, moving "downhill" toward low-free-energy folded configurations.
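
To make the 2D picture concrete, here is a minimal sketch that estimates the force on a particle as the negative gradient of a potential, using central finite differences. The bowl-shaped potential `bowl`, the function name `force`, and the step size `h` are illustrative assumptions, not anything taken from a specific physical system.

```python
def force(phi, x, y, h=1e-6):
    """Estimate the force -grad(phi) at (x, y) by central differences.

    phi is any callable phi(x, y) returning the potential energy.
    """
    dphi_dx = (phi(x + h, y) - phi(x - h, y)) / (2.0 * h)
    dphi_dy = (phi(x, y + h) - phi(x, y - h)) / (2.0 * h)
    # The force points "downhill": opposite to the gradient.
    return (-dphi_dx, -dphi_dy)


def bowl(x, y):
    """Hypothetical landscape: a simple parabolic well centered at the origin."""
    return x**2 + y**2


if __name__ == "__main__":
    fx, fy = force(bowl, 1.0, -2.0)
    print(fx, fy)  # both components point back toward the minimum at (0, 0)
```

For the parabolic well, the analytic force is $$ (-2x, -2y) $$, so the particle is always pushed toward the minimum.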
Evolutionary
dynamics can be framed in similar terms. The role of free energy is played by
fitness, and the landscape is inverted (fitness is
maximized). The horizontal coordinates could be
phenotypes like
expression levels, or
genotypes, in the form of a
sequence network. In the network pictured below, $$\mathrm{a}$$, $$\mathrm{b}$$, and $$\mathrm{c}$$ might represent three of the four
nucleotide bases, three classes of
amino acids, or even three
epigenetic states. Unfortunately, real fitness landscapes are difficult to access. Fitness, environment, and the
genotype-to-phenotype map all depend on each other. (Synthetic-aesthetic landscapes, like the
one in the banner above, are much more straightforward. ☺)
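
As a rough illustration of a sequence network, the sketch below builds one for a three-letter alphabet: each node is a sequence, and two nodes are linked when they differ at exactly one site (a single point mutation). The sequence length of 2 and the function name `sequence_network` are assumptions chosen for brevity; the pictured network may be laid out differently.

```python
from itertools import product


def sequence_network(alphabet, length):
    """Map each sequence to the sequences one point mutation away."""
    neighbors = {}
    for letters in product(alphabet, repeat=length):
        seq = "".join(letters)
        adjacent = []
        for site, current in enumerate(seq):
            for alt in alphabet:
                if alt != current:
                    # Substitute one letter at one site.
                    adjacent.append(seq[:site] + alt + seq[site + 1:])
        neighbors[seq] = adjacent
    return neighbors


if __name__ == "__main__":
    network = sequence_network("abc", 2)
    for seq, adjacent in sorted(network.items()):
        print(seq, "->", adjacent)
```

A fitness landscape over genotypes then amounts to assigning a fitness value to every node of this graph. The number of nodes grows exponentially with sequence length, which is one reason coarse-grained descriptions are attractive.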
Evolutionary models vary widely in their purposes and in the assumptions they make (e.g.,
random mating). The
Bak-Sneppen model, for instance, is of great theoretical interest; it shows how
self-organized criticality might explain
power-law extinction statistics seen in the fossil record.
Ewens's sampling formula, on the other hand, is frequently applied to real populations to look for signatures of
neutral evolution. Our work,
Khromov et al., 2018, generalizes Ewens's formula to arbitrary fitness landscapes. It applies to populations on large sequence networks when the evolutionary forces of
selection,
mutation, and
genetic drift all balance, creating a
steady-state, de-labeled
allele frequency distribution.
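
For reference, the classic neutral Ewens sampling formula gives the probability that a sample of $$ n $$ genes contains $$ a_j $$ alleles represented exactly $$ j $$ times: $$ P(a_1, \dots, a_n) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \frac{\theta^{a_j}}{j^{a_j}\, a_j!} $$, where $$ \theta $$ is the scaled mutation rate. The sketch below evaluates this neutral formula only, not the generalization to arbitrary fitness landscapes; the function name `ewens_log_prob` and the list-based input format are assumptions made for illustration.

```python
import math


def ewens_log_prob(counts, theta):
    """Log-probability of an allele-count configuration under neutrality.

    counts[j - 1] is the number of alleles seen exactly j times in the sample.
    theta is the scaled mutation rate and must be positive.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    n = sum(j * a for j, a in enumerate(counts, start=1))
    # log of n! / (theta * (theta + 1) * ... * (theta + n - 1))
    log_p = math.lgamma(n + 1) - (math.lgamma(theta + n) - math.lgamma(theta))
    for j, a in enumerate(counts, start=1):
        log_p += a * math.log(theta) - a * math.log(j) - math.lgamma(a + 1)
    return log_p


if __name__ == "__main__":
    theta = 2.0
    # Sample of two genes: either two distinct alleles or one allele seen twice.
    print(math.exp(ewens_log_prob([2, 0], theta)))
    print(math.exp(ewens_log_prob([0, 1], theta)))
```

For a sample of two genes, the formula reduces to $$ \theta/(1+\theta) $$ for two distinct alleles and $$ 1/(1+\theta) $$ for two copies of the same allele, which makes a convenient sanity check.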
Not surprisingly, the generalization comes with more complicated mathematics. Even after
coarse-graining to a two- or three-plane landscape, the
sampling probabilities still contain nested
sums and
infinite series. My primary role was to write a theory code to efficiently
compute our
analytic results and validate them against
simulations. Values for $$\mathcal{F}$$ (the generalized
$$ _1F_1 $$) were computed via a matrix of
Bell polynomials. The choice of
parameter values affected sum convergence, so truncation had to be done carefully. The
GNU Scientific Library was used extensively, especially its special-function routines, which check for
overflow/
underflow and facilitate calculations in $$ \log $$ space.
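
The idea of summing a series in $$ \log $$ space with a convergence-based cutoff can be sketched as follows. This is not the project's theory code: it uses only the Python standard library rather than the GNU Scientific Library, it treats the ordinary $$ _1F_1(a; b; z) = \sum_k \frac{(a)_k}{(b)_k} \frac{z^k}{k!} $$ rather than the generalized $$ \mathcal{F} $$, and it assumes $$ a, b, z > 0 $$ so that every term is positive. The names `log_add` and `log_hyp1f1`, the tolerance, and the stopping rule are all illustrative assumptions.

```python
import math


def log_add(log_x, log_y):
    """Return log(exp(log_x) + exp(log_y)) without overflowing."""
    hi, lo = max(log_x, log_y), min(log_x, log_y)
    return hi + math.log1p(math.exp(lo - hi))


def log_hyp1f1(a, b, z, rel_tol=1e-15, max_terms=100_000):
    """Log of the series for 1F1(a; b; z), assuming a, b, z > 0."""
    if a <= 0 or b <= 0 or z <= 0:
        raise ValueError("this sketch assumes a, b, z > 0")
    log_z = math.log(z)
    log_term = 0.0  # the k = 0 term equals 1
    log_sum = 0.0
    for k in range(max_terms):
        # Ratio of consecutive terms: (a + k) / (b + k) * z / (k + 1).
        log_ratio = math.log(a + k) - math.log(b + k) + log_z - math.log(k + 1)
        log_term += log_ratio
        log_sum = log_add(log_sum, log_term)
        # Heuristic cutoff: terms are shrinking and the latest one is negligible.
        if log_ratio < 0 and log_term - log_sum < math.log(rel_tol):
            return log_sum
    raise RuntimeError("series did not converge within max_terms")


if __name__ == "__main__":
    # 1F1(a; a; z) = exp(z); exp(800) overflows a double, but its log is fine.
    print(log_hyp1f1(2.5, 2.5, 800.0))
```

The stopping rule here is a heuristic: for parameter choices where the terms shrink, grow, and then shrink again, it could cut the sum off too early, which illustrates why the truncation has to be chosen with the parameter values in mind.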