Kingman and mathematical population genetics

Warren J. Ewens^{a,∗} and Geoffrey A. Watterson^{b}

a 324 Leidy Laboratories, Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA; wewens@sas.upenn.edu
b 15 Brewer Road, Brighton East, Victoria 3187, Australia; geoffreywmailbox-monashfriends@yahoo.com.au
∗ Corresponding author

arXiv:1005.4601v1 [math.PR] 25 May 2010

Abstract

Mathematical population genetics is only one of Kingman’s many research interests. Nevertheless, his contribution to this field has been crucial, and moved it in several important new directions. Here we outline some aspects of his work which have had a major influence on population genetics theory.

AMS subject classification (MSC2010) 92D25

1 Introduction

In the early years of the previous century, the main aim of population genetics theory was to validate the Darwinian theory of evolution, using the Mendelian hereditary mechanism as the vehicle for determining how the characteristics of any daughter generation depended on the corresponding characteristics of the parental generation. By the 1960s, however, that aim had been achieved, and the theory largely moved in a new, retrospective and statistical, direction.

This happened because, at that time, data on the genetic constitution of a population, or at least on a sample of individuals from that population, started to become available. What could be inferred about the past history of the population leading to these data? Retrospective questions of this type include: “How do we estimate the time at which mitochondrial Eve, the woman whose mitochondrial DNA is the most recent ancestor of the mitochondrial DNA currently carried in the human population, lived? How can contemporary genetic data be used to track the ‘Out of Africa’ migration?
How do we detect signatures of past selective events in our contemporary genomes?” Kingman’s famous coalescent theory became a central vehicle for addressing questions such as these. The very success of coalescent theory has, however, tended to obscure Kingman’s other contributions to population genetics theory. In this note we review his various contributions to that theory, showing how coalescent theory arose, perhaps naturally, from his earlier contributions.

2 Background

Kingman attended lectures in genetics at Cambridge in about 1960, and his earliest contributions to population genetics date from 1961. It was well known at that time that in a randomly mating population for which the fitness of any individual depended on his genetic make-up at a single gene locus, the mean fitness of the population increased from one generation to the next, or at least remained constant, if only two possible alleles, or gene types, often labelled $A_1$ and $A_2$, were possible at that gene locus. However, it was also well known that more than two alleles could arise at some loci (witness the ABO blood group system, admitting three possible alleles, A, B and O). Proving that in this case the mean population fitness is non-decreasing in time under random mating is far less easy. This was conjectured by Mandel and Hughes (1958) and proved in the ‘symmetric’ case by Scheuer and Mandel (1959) and Mulholland and Smith (1959), and more generally by Atkinson et al. (1960) and (very generally) Kingman (1961a,b). Despite this success, Kingman then focused his research in areas quite different from genetics for the next fifteen years. The aim of this paper is to document some of his work following his re-emergence into the genetics field, dating from 1976. Both of us were honoured to be associated with him in this work.
Neither of us can remember the precise details, but the three-way interaction between the UK, the USA and Australia, carried out mainly by the now out-of-date flimsy blue aerogrammes, must have started in 1976, and continued during the time of Kingman’s intense involvement in population genetics. This note is a personal account, focusing on this interaction: many others were working in the field at the same time.

One of Kingman’s research activities during the period 1961–1976 leads to our first ‘background’ theme. In 1974 he established (Kingman, 1975) a surprising and beautiful result, found in the context of storage strategies. It is well known that the symmetric $K$-dimensional Dirichlet distribution

$$\frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K}\,(x_1 x_2 \cdots x_K)^{\alpha-1}\,dx_1\,dx_2 \cdots dx_{K-1}, \tag{2.1}$$

where $x_i \ge 0$, $\sum x_j = 1$, does not have a non-trivial limit as $K \to \infty$, for given fixed $\alpha$. Despite this, if we let $K \to \infty$ and $\alpha \to 0$ in such a way that the product $K\alpha$ remains fixed at a constant value $\theta$, then the distribution of the order statistics $x_{(1)} \ge x_{(2)} \ge x_{(3)} \ge \cdots$ converges to a non-degenerate limit. (The parameter $\theta$ will turn out to have an important genetical interpretation, as discussed below.) Kingman called this the Poisson–Dirichlet distribution, but we suggest that its true author be honoured and that it be called the ‘Kingman distribution’. We refer to it by this name in this paper. So important has the distribution become in mathematics generally that a book has been written devoted entirely to it (Feng, 2010). This distribution has a rather complex form, and aspects of this form are given below.

The Kingman distribution appears, at first sight, to have nothing to do with population genetics theory. However, as we show below, it turns out, serendipitously, to be central to that theory.
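
As a rough numerical illustration of the limit described above, the following Python sketch draws symmetric Dirichlet vectors with $\alpha = \theta/K$ for increasing $K$ and averages the largest order statistic $x_{(1)}$. The function names, the choice $\theta = 1$, the seeds and the sample sizes are assumptions made for this illustration and are not taken from Kingman's paper. The Dirichlet vector is generated by normalizing independent gamma variates, which is a standard construction.

```python
import random


def symmetric_dirichlet_sorted(k, theta, rng):
    """Draw one symmetric Dirichlet vector with alpha = theta / k,
    returned as order statistics (largest component first)."""
    alpha = theta / k
    while True:
        # Normalizing independent Gamma(alpha, 1) variates gives a
        # Dirichlet(alpha, ..., alpha) vector.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(gammas)
        if total > 0.0:  # guard against every variate underflowing to 0
            break
    return sorted((g / total for g in gammas), reverse=True)


def mean_largest(k, theta, reps, seed=0):
    """Monte Carlo average of the largest component x_(1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += symmetric_dirichlet_sorted(k, theta, rng)[0]
    return total / reps


if __name__ == "__main__":
    # Illustrative values only: theta = 1, a few choices of K.
    for k in (10, 100, 1000):
        print(k, round(mean_largest(k, theta=1.0, reps=300), 3))
```

If the order statistics do converge as stated, the printed averages should settle down as $K$ grows with $K\alpha = \theta$ held fixed. This is only a sanity check on the first order statistic, not a demonstration of convergence in distribution.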
To see why this is so, we turn to our second ‘background’ theme, namely the development of population theory in the 1960s and 1970s.

The nature of the gene was discovered by Watson and Crick in 1953. For our purposes the most important of their results is the fact that a gene is in effect a DNA sequence of, typically, some 5000 bases, each base being one of four types, A, G, C or T. Thus the number of types, or alleles, of a gene consisting of 5000 bases is $4^{5000}$. Given this number, we may for many practical purposes suppose that there are infinitely many different alleles possible at any gene locus. However, gene sequencing methods took some time to develop, and little genetic information at the fundamental DNA level was available for several decades after Watson and Crick.

The first attempt at assessing the degree of genetic variation from one person to another in a population at a less fundamental level depended on the technique of gel electrophoresis, developed in the 1960s. In loose terms, this method measures the electric charge on a gene, with the charge levels usually thought of as taking integer values only. Genes having different electric charges are of different allelic types, but it can well happen that genes of different allelic types have the same electric charge. Thus there is no one-to-one relation between charge level and allelic type. A simple mutation model assumes that a mutant gene has a charge differing from that of its parent gene by $\pm 1$. We return to this model in a moment.

In 1974 Kingman travelled to Australia, and while there met Pat Moran (as it happens, the PhD supervisor of both authors of this paper), who was working at that time on this ‘charge-state’ model.
The two of them discussed the properties of a stochastic model involving a population of $N$ individuals, and hence $2N$ genes at any given locus. The population is assumed to evolve by random sampling: any daughter generation of genes is found by sampling, with replacement, from the genes from the parent generation. (This is the well-known ‘Wright–Fisher’ model of population genetics, introduced into the population genetics literature independently by Wright (1931) and Fisher (1922).) Further, each daughter generation gene is assumed to inherit the same charge as that of its parent with probability $1-u$, and with probability $u$ is a charge-changing mutant, the change in charge being equally likely to be $+1$ or $-1$.

At first sight it might seem that, as time progresses, the charge levels on the genes in future generations become dispersed over the entire array of positive and negative integers. But this is not so. Kingman recognized that there is a coherency to the locations of the charges on the genes brought about by common ancestry and the genealogy of the genes in any generation. In Kingman’s words (Kingman 1976), amended here to our terminology, “The probability that [two genes in generation $t$] have a common ancestor gene [in generation $s$, for $s < t$,] is $1-(1-(2N)^{-1})^{t-s}$, which is near unity when $(t-s)$ is large compared to $2N$. Thus the [locations of the charges in any generation] form a coherent group, . . . , and the relative distances between the [charges] remain stochastically bounded”. We do not dwell here on the elegant theory that Kingman developed for this model, and note only that in the above quotation we see the beginnings of the idea of looking backward in time to discuss properties of genetic variation observed in a contemporary generation.
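
The common-ancestry probability in the quotation above can be checked with a short simulation. The Python sketch below evaluates $1-(1-(2N)^{-1})^{t-s}$ and compares it with a Monte Carlo estimate obtained by tracing two gene lineages backward through a Wright–Fisher population of $2N$ genes. The function names, the seed and the parameter values are illustrative assumptions, not part of Kingman's treatment.

```python
import random


def prob_common_ancestor(two_n, gens):
    """Formula from the quotation: probability that two genes share an
    ancestor gene within `gens` = t - s generations back."""
    return 1.0 - (1.0 - 1.0 / two_n) ** gens


def simulate_common_ancestor(two_n, gens, reps, seed=0):
    """Trace two lineages backward under Wright-Fisher sampling and
    return the fraction of replicates in which they have merged."""
    rng = random.Random(seed)
    merged = 0
    for _ in range(reps):
        for _ in range(gens):
            # Each lineage picks its parent gene uniformly at random,
            # with replacement, from the 2N genes one generation back.
            if rng.randrange(two_n) == rng.randrange(two_n):
                merged += 1
                break
    return merged / reps


if __name__ == "__main__":
    two_n = 20  # hypothetical population of N = 10 individuals
    for gens in (1, 20, 100):
        exact = prob_common_ancestor(two_n, gens)
        estimate = simulate_common_ancestor(two_n, gens, reps=10000)
        print(gens, round(exact, 4), round(estimate, 4))
```

With these assumed values the probability should approach one once $t-s$ is several multiples of $2N$, in line with the statement that it is near unity when $(t-s)$ is large compared to $2N$.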
This viewpoint is central to Kingman’s concept of the coalescent, discussed in detail below.

Parenthetically, the question of the mean number of ‘alleles’, or occupied charge states, in a population of size $N$ ($2N$ genes) is of some mathematical interest. This depends on the mutation rate $u$ and the population size $N$. It was originally conjectured by Kimura and Ohta (1978) that this mean remains bounded as $N \to \infty$. However, Kesten (1980a,b) showed that it increases indefinitely as $N \to \infty$, but at an extraordinarily slow rate. More exactly, he found the following astounding result. Define $\gamma_0 = 1$, $\gamma_{k+1} = e^{\gamma_k}$, $k = 0, 1, 2, 3, \ldots$, and $\lambda(2N)$ as the largest $k$ such that $\gamma_k < 2N$. Suppose that $4Nu = 0.2$. Then the random number of ‘alleles’ in the population divided by $\lambda(2N)$ converges in probability to a constant whose value is approximately 2 as $N \to \infty$. Some idea of the slowness of the divergence of the mean number of alleles can be found by observing that if $2N = 10^{1656520}$, then $\lambda(2N) = 3$.

In a later paper (Kingman 1977a), Kingman extended the theory to the multi-dimensional case, where it is assumed that data are available on a vector of measurements on each gene. Much of the theory for the one-dimensional charge-state model carries through more or less immediately to the multi-dimensional case. As the number of dimensions increases, some of this theory established by Kingman bears on the ‘infinitely many alleles’ model discussed in the next paragraph, although as Kingman himself noted, the geometrical structure inherent in the model implies that a convergence of his results to those of the infinitely-many-alleles model does not occur, since the latter model has no geometrical structure.

The infinitely-many-alleles model, introduced in the 1960s, forms the second background development that we discuss. This model has two components. The first is a purely demographic, or genealogical, model of the population.
There are many such models, and here we consider only the Wright–Fisher model referred to above. (In the contemporary literature many other such models are discussed in the context of the infinitely-many-alleles model, particularly those of Moran (1958) and Cannings (1974), discussed in Section 4.) The second component refers to the mutation assumption, superimposed on this model. In the infinitely-many-alleles model this assumption is that any new mutant gene is of an allelic type never before seen in the population. (This is motivated by the very large number of alleles possible at any gene locus, referred to above.) The model also assumes that the probability that any gene is a mutant is some fixed value $u$, independent of the allelic type of the parent and of the type of the mutant gene.

From a practical point of view, the model assumes a technology (relevant to the 1960s) which is able to assess whether any two genes are of the same or are of different allelic types (unlike the charge-state model, which does not fully possess this capability), but which is not able to