Genealogical typing of Neisseria meningitidis

Despite the increasing popularity of multilocus sequence typing (MLST), the most appropriate method for characterizing bacterial variation and facilitating epidemiological investigations remains a matter of debate. Here, we propose that different typing schemes should be compared on the basis of their power to infer clonal relationships and investigate the utility of sequence data for genealogical reconstruction by exploiting new statistical tools and data from 20 housekeeping loci for 93 isolates of the bacterial pathogen Neisseria meningitidis. Our analysis demonstrated that all but one of the hyperinvasive isolates established by multilocus enzyme electrophoresis and MLST were grouped into one of six genealogical lineages, each of which contained substantial variation. Due to the confounding effect of recombination, evolutionary relationships among these lineages remained unclear, even using 20 loci. Analyses of the seven loci in the standard MLST scheme using the same methods reproduced this classification, but were unable to support finer inferences concerning the relationships between the members within each complex.


INTRODUCTION
The demonstration that some bacterial pathogen populations are structured into lineages that differ in their propensity to cause disease represents a major achievement of molecular epidemiology (Musser, 1996). A striking example is Neisseria meningitidis, the meningococcus, which is a commensal inhabitant of the human nasopharynx that occasionally invades the host, causing devastating septicaemia and meningitis (Stephens, 2009). Whereas asymptomatic carriage of meningococci is common, the incidence of meningococcal disease is low, and most cases are caused by a limited number of genetic types known as the hyperinvasive lineages (Caugant, 2008;Yazdankhah et al., 2004). There are several practical applications of these observations. Public health interventions can be targeted at particular lineages, as achieved with strainspecific meningococcal vaccines deployed in several countries (Jodar et al., 2002). Furthermore, such interventions should avoid changing the ecology of the meningococcus in ways that might favour the spread of vaccine escape variants or more invasive genotypes that would increase the overall disease burden . Finally, this population structuring enables genetic differences between invasive and non-invasive meningo-cocci to be defined, helping to elucidate mechanisms of pathogenicity (Bille et al., 2008(Bille et al., , 2005Hotopp et al., 2006).
Populations of bacteria contain identifiable phenotypic lineages due to clonal descent (Levin, 1981). Horizontal genetic exchange via homologous recombination reassorts variation progressively, so that isolates sharing a sufficiently recent common ancestor share a majority of genetic traits, including those that govern pathogenicity. In the case of the meningococcus, disease-associated, or 'hyperinvasive', lineages are apparent from the characterization of isolate collections by multilocus enzyme electrophoresis (MLEE), multilocus sequence typing (MLST) and, to an extent, serological isolate characterization (Caugant et al., 1987;Maiden et al., 1998;Urwin et al., 2004). The principal aim of any microbial typing methodology is to find, within a sample of isolates, groups that share a recent common ancestor (van Belkum et al., 2001). If all such clonal groups were known, along with an estimate of the relative times at which they shared common ancestry, then this knowledge could be represented as a tree called the clonal genealogy (Guttman, 1997). Reconstructing a clonal group of isolates is made possible by identifying the genetic modifications that occurred on the branch of the clonal genealogy directly above the group, which are likely to differentiate the members of the group from the rest of the isolates. However, two difficulties may occur when attempting to reconstruct clonal groups: first, there may be little genetic difference among the members of a group and their closest relatives, especially for recent groups (Achtman, 1996); and second, horizontal genetic exchange with related bacteria outside the group may obscure the signal of clonal inheritance, especially for older groups (Holmes et al., 1999;Schierup & Hein, 2000).
The level of genotypic characterization necessary for a given isolate collection depends on the goals of the study (Struelens, 1998) and how often the phenotypes of interest, e.g. antigenic variants or pathogenicity, change. Since recombination reduces phenotypic correlations among related isolates, the inability to resolve the deep phylogeny does not affect many epidemiological analyses. Conversely, closely related isolates are likely to be phenotypically uniform, unless there is strong natural selection favouring repeated changes of particular traits. Therefore, fineresolution dissection of these relationships is unlikely in any particular instance to provide substantial insight into the relationship between phenotype and genotype. It is, however, necessary to resolve these parts of the genealogy, for example to investigate recent outbreaks Feavers et al., 1999).
In this study, an extended meningococcal MLST dataset comprising nucleotide sequence data from 20 housekeeping loci was analysed with the model-based method CLONALFRAME (Didelot & Falush, 2007), which attempts to reconstruct the clonal genealogy while taking into account the statistical uncertainty of individual groups and the confounding effect of recombination. This analysis enabled an assessment of the accuracy and limitations of the MLST scheme in terms of the number of evolutionary relationships that it can correctly infer. The data show that the widely applied seven-locus MLST approach successfully identifies relationships over a narrow age range, corresponding to previously described clonal complexes, but has no power to detect longer-term relationships among these complexes. The resolution of closely related meningococcal isolates by seven-locus MLST can be poor and may need to be supplemented with additional information in certain circumstances.

METHODS
Sequence data. The 107 isolates used to develop the N. meningitidis MLST scheme were investigated (Maiden et al., 1998). This collection was designed to represent the global diversity of meningococcal disease in the latter half of the 20th century, and included multiple representatives of the hyperinvasive lineages identified at the time the collection was assembled. For a subset of 95 isolates, chosen to include all of the representatives of the major hyperinvasive lineages, published data from 12 gene fragments (Holmes et al., 1999;Maiden et al., 1998) were analysed together with data from an additional eight loci (Supplementary Table S1, available with the online version of this paper), giving a total of 20 loci. The additional loci were all annotated as encoding enzymes of cellular metabolism: aspA (aspartate ammonia-lyase); carB (carbamoyl phosphate synthase large subunit); dhpS (dihydropteroate synthase); glnA (glutamine synthetase); gpm (phosphoglycerate mutase); pyk (pyruvate kinase); rpiA (ribose-5phosphate isomerase A); and tal (transaldolase) ( Table 1). With the exception of the glnA, aroE and mtg loci, which were all located within a region of the genome 8.8 kbp in size, the 20 loci were distributed around the meningoccal FAM 18 genome (Bentley et al., 2007) and were at least 27 kbp from each other (Supplementary Table S2). With the exception of the three loci mentioned, they were therefore highly unlikely to be co-inherited in a single transformation event, as the average recombination fragment size in meningococci has been estimated as between 1.1 kbp  and 5 kbp (Linz et al., 2000).
Target gene fragments were amplified by the PCR and sequenced using specific oligonucleotide primers. The primers used for the aspA, carB, dhpS and glnA gene fragments have been described previously (Tondella et al., 1999). The primers used for the remaining gene fragments were as follows: rpiA, 59-CGA CAC AAG ACG AAC TCA AGC GC-39 and 59-GGC AGG GAT AGA TGA CTT TCG C-39; tal, 59-ATC GGA CGT TAA AGC ATT AGG-39 and 59-TTA AAC CAA CGG TGC GAG CAG-39; gpm, 59-CCA CTT TCA CAC ATT GGA GAC G-39 and 59-CGA TGA CTT TCA ATT GTC GTC C-39; pgk, 59-TTA TTT GAC GCG CAG CAC TTC C-39 and 59-TTA TTT GAC GCG CAG CAC TTC C-39. Nucleotide sequencing was done as described previously (Jolley et al., 2000). Briefly, genes were amplified by the PCR and purified using PEG 8000 /NaCl precipitation. The purified amplicons were used in extension reactions with ABI BigDye dye terminators, precipitated and separated on an ABI 3730 DNA Analyser. The sequences were assembled with the STADEN suite of computer software (Staden, 1996) and aligned using the GCG package (Womble, 2000). The sequences have been deposited in the Neisseria Multi Locus Sequence Typing Database at http://pubmlst.org/ neisseria/ .
Inference of clonal relationships. The principal analysis tool was CLONALFRAME version 1.1 (Didelot & Falush, 2007), a statistical algorithm that infers clonal relationships while taking into account the effect of homologous recombination. CLONALFRAME draws inference using a Monte-Carlo Markov chain, and therefore requires an assessment of the convergence and mixing of its results. Performing 10 independent runs of CLONALFRAME of 200 000 iterations each achieved this. The first half of each run was discarded as burn-in, and the parameters were recorded every 200 iterations in the second half to produce samples of size 500 from the posterior distribution of clonal genealogies. The results from the 10 runs were compared for convergence by manual examination of the trace of the parameters and likelihood (Gelman, 1996), using the Gelman and Rubin statistic (Brooks & Gelman, 1998;Gelman & Rubin, 1992), and using the CLONALFRAME genealogy comparison tool (Didelot & Falush, 2007). The convergence was judged satisfactory in all cases, and the samples from the 10 runs were combined to achieve maximum robustness. Statistical support for any grouping of isolates was assessed by the proportion of clonal genealogies exhibiting this grouping in the combined sample. This approach was independently applied to the extended (all 20 loci) dataset and to the MLST dataset (i.e. the seven fragments from genes abcZ, adk, aroE, fumC, gdh, pdhC and pgm). Missing data, namely the sequences of the eight novel gene fragments for 14 isolates as well as that of gpm for isolate BZ 147, were represented by a string of Ns in the CLONALFRAME input file. The values of r/m (the ratio of rates at which a given nucleotide becomes substituted through recombination and mutation) and r/h (the ratio of rates at which recombination and mutation events occur), as well as the relative age of the groupings, were estimated using CLONALFRAME.
Comparison of genealogies inferred with seven and 20 loci. The quality of the genealogy reconstructed on the basis of the seven MLST loci (corresponding to~1.5 % of the entire genome of N. meningitidis) was assessed by comparison with the reconstruction based on the nucleotide sequences from all 20 loci (~4.3 % of the entire genome). Assuming the genealogy reconstructed using the 20 loci to be predominantly correct, this gave a measure of inferential power of the seven-locus dataset. We also considered the support given by the 20-locus analysis to the clusters found by the seven-locus analysis. Under the same assumption, this gave a measure of the accuracy of the inference based on the seven-locus (i.e. MLST) dataset. Taken together, these two measures, power and accuracy, revealed the quality of the genealogical reconstruction based on the seven MLST gene fragments.

Novel nucleotide sequences
Complete sequences for the gene fragments at the eight additional loci were obtained from 92 of the 95 isolates examined. Isolates 371 (ST-1 complex) and N45/96 (ST-41/44 complex) failed at all loci for technical reasons and no sequence was obtained for the gpm locus of BZ 147, possibly due to the absence of this gene in this isolate. The sequence gene fragments ranged in size from 366 bp (dhps) to 561 bp (rpi) in length with between 11 (aspA and glnA) and 33 (dhps) alleles per locus in the 93 isolates examined (Table 1). Of the new loci, the least diverse was gpm (5 % polymorphic sites) and the most diverse was dhps (21.2 % polymorphic sites); alternations in this gene confer resistance to sulphonamides (Kristiansen et al., 1990). All novel gene fragments, including dhps, were subject to stabilizing selection, with d N /d S ratios ranging from 0.071 (pyk) to 0.273 (tal). Allele numbers were assigned to each unique gene sequence and the data are available at http:// pubmlst.org/neisseria (Table 1, Supplementary Table S1).

Genealogical inference
The genealogy inferred by CLONALFRAME using 20 gene fragments clustered 78 of the 107 isolates into six lineages (Fig. 1). These corresponded to the major hyperinvasive groups identified by MLST and MLEE analyses of the same isolate collection (Achtman, 1994;Maiden et al., 1998). For ease of comparison with other analyses, the CLONALFRAME lineages were named according to the seven-locus MLST sequence type (ST) that they predominantly contained, which replicated MLST clonal complex designations. All of the isolates previously thought to belong to hyperinvasive lineages were located in one of these lineages, with the exception of isolate 322/85 (ST-2), which was characterized as lineage VI by MLEE (Fig. 1). CLONALFRAME grouped the MLEE serogroup A subgroups I, II, V and VII into the ST-1 complex, together with an additional serogroup B strain isolated in the Netherlands in 1977. On the basis of this analysis, MLEE subgroup I should not be differentiated from subgroups II, V and VII  (Wang et al., 1992). No relationships were detected between ST-1 complex and the remaining serogroup A isolates (designated subgroups III, VII, IV-1 and IV-2 by MLEE and ST-4 and ST-5 complexes by MLST), which were clustered into a single lineage with two major branches, referred to here as the ST-4/5 complex. MLEE gave an identical branching order for these subgroups (Achtman, 1994). The ST-41 isolates were closely related to those with ST-44; this close relationship was consistent with the MLST designation of the ST-41/44 complex, which contains both these STs. CLONALFRAME could not reconstruct the relationship of the six lineages with each other and with the remaining isolates in the collection. Among these, few phylogenetic relationships were found, reflecting their high overall genetic diversity as compared to the isolates within each of the six lineages.
The credibility intervals for the relative ages of the six hyperinvasive lineages all included the value 0.2, indicating that they were of a similar age and that none represented a markedly recent epidemic clone (Smith et al., 1993) ( Table 2). Since the foundation of each hyperinvasive lineage, the members had mutated at up to 0.14 % of nucleotides, and 7-38 % of the genes analysed had been affected by homologous recombination, resulting in up to 1.5 % of nucleotides being substituted (Table 2). Although none of the complexes were young, two showed evidence for rapid expansion (Fig. 2), indicated by a higher ratio of external to internal branch lengths (Fiala & Sokal, 1985) than expected under the neutral coalescent model (Kingman, 1982a). This observation was consistent with these two lineages undergoing rapid population expansion early in their history, perhaps due to a fitness advantage, and explained the high number of polymorphic sites and polymorphic fragments observed in these two complexes.
The branching patterns observed in the other four complexes showed some evidence of divergence from coalescent expectation too, but not at a statistically significantly level ( Table 2).

Effect of recombination
Nucleotide sequence data gathered by MLST can be used to estimate the relative importance of recombination and mutation. A parameter that is often used for this purpose is r/h, the ratio of rates at which recombination and mutation events occur (Milkman & Bridges, 1990). The estimates for r/h computed here were similar from one lineage to another, ranging from 1.5 to 7.7 (Table 2), indicating that recombination occurred up to eight times as often as point mutation in the evolution of N. meningitidis lineages. This was consistent with the estimates found by a number of previous studies: 3.6 Fig. 1. Majority-rule consensus tree for the clonal genealogies inferred by CLONALFRAME when using all 20 gene fragments. The isolates are represented according to their MLEE designations, as shown in the key, and with no symbol meaning that no MLEE designation corresponds to the electrophoresis type of that isolate. ST numbers are shown for the isolates that do not belong to any of the six hyperinvasive complexes; see Fig. 3 for the others. (Feil et al., 1999), 4.75 (Feil et al., 2001), 1.1 (Fraser et al., 2005) and between 0.16 and 1.83 . However, since recombination is likely to substitute more than one nucleotide at a time, the net effect of recombination over mutation is greater than this. This can be measured using r/m, the ratio of rates at which a given nucleotide becomes substituted through recombination and mutation (Guttman & Dykhuizen, 1994). Our estimates of r/m ranged from 14.4 to 108.9 (Table 2). This was slightly lower than previous estimates of r/m based on counting techniques, which were between 100 and 275 (Feil et al., 1999(Feil et al., , 2001Jolley et al., 2000), but not as low as a previous population genetics estimate, which was between 6.2 and 16.8 . The variation of r/m observed across lineages might account for part of the differences between previously reported values since these were based on datasets with different sampling frames.
Visual inspection of the evolutionary events inferred by CLONALFRAME in the divergence of the ST-8 complex indicated that recombination happened significantly more frequently than mutation, and that when it did it introduced a large number of substitutions, resulting in a much higher net effect of recombination than mutation in complex diversification (Fig. 3). The level of divergence of the imports, often above 1 %, demonstrated that they were likely to come from other meningococci, implying that the ST-8 complex was not sexually isolated from the rest of the species. Furthermore, a number of instances where only a fraction of a gene has been imported were observed, for example a fragment of the fumC gene in ST-66 (branch B, Fig. 3). Similar observations were made for the five other hyperinvasive lineages (data not shown).

Lineage definition based on seven loci
Comparison of the CLONALFRAME genealogies inferred from seven and 20 loci revealed a number of differences (Fig. 4).  (Fig. 4); however, all four assignments were not upheld if a more stringent rule was applied for the construction of the consensus tree, as indicated by their non-maximal support (8/10, 8/10, 5/ 10 and 8/10 respectively). In each case, the 20-locus analysis rejected the clusters (support of 0/10). MLEE designations provided further support for rejecting these clusters; for example, the MLEE designation of the isolate B6116/77 (ST-10) was cluster A4, whereas those of all isolates of the ST-11 complex were ET-37. The 20-locus analysis clustered this isolate within the ST-8 complex, with support 9/10, consistent with the fact that all isolates of the ST-8 complex were designated cluster A4 by MLEE. When a conservative 95 % cut-off value was employed, equivalent to keeping only the nodes with support equal to 10/10 in Fig. 4, the seven-locus analysis identified 7/8, 10/10, 10/10, 13/14 and 23/23 of the isolates from the ST-8, ST-32, ST-11, ST-1 and ST-4/5 complexes respectively. Only the definition of ST-41/44 complex was left unclear with a support of 6/10 for the 13 isolates. Inference based on the seven-locus dataset therefore had good power to define lineages, and a good accuracy, as no misassignments were made when using a 95 % confidence cut-off.

Relationships within lineages
The seven-locus analysis defined four subgroups within the hyperinvasive lineages when a 95 % cut-off value was employed: three in the ST-1 complex and one in the ST-4/5 complex. Of these, two were confirmed by the 20-locus analysis, one was rejected and one remained unclear. The grouping that was rejected was the separation of the two ST-3 isolates from the ST-1 isolates. The one that remained unclear (support of 4/10 in the 20-locus analysis) was the monophyly of ST-4. A less conservative cut-off for the interpretation of the seven-locus results generated several misassignments. Furthermore, many relationships found with strong confidence by the 20-locus analysis were not inferred based on only seven loci, with 16 phylogenetic relationships not found at all (support of 0/10) and five found with weak statistical support (between 1/10 and 4/ 10). The seven-locus analysis therefore had low power to infer relationships within the six hyperinvasive lineages represented in this dataset.
The number of clusters found by the 20-locus analysis approximately followed the coalescent expectation (Kingman, 1982a, b), except for a deficit of recent as well as ancient clusters (Fig. 5). The missing recent clusters could be identified on the consensus tree (Fig. 4), where many branching orders were left unresolved for recently diverged clones, especially within the ST-32, ST-41 and ST-4 complexes, because of a lack of events distinguishing close relatives. The missing ancient clusters corresponded to an inability to reconstruct the branching order of the six hyperinvasive lineages among themselves and with the rest of the isolates, because the clonal signal was too disrupted ( Fig. 1). When using only the seven MLST loci, there was sufficient statistical power to infer a large proportion of the expected clusters within only a narrow age range, centred approximately on 0.12 coalescent unit (Fig. 5), corresponding to the average estimated age of the six hyperinvasive lineages.

Implications for epidemiological analyses
The utility of pathogen typing methods principally depends on the genealogical information they provide. Investigation of both disease outbreaks and longer-term epidemiological phenomena relies on distinguishing isolates that share a common ancestor from those that do not. For short-range epidemiology, typing methods are ideally highly discriminatory (Struelens, 1998), so that isolates that share a very recent common ancestor are grouped, while those from unrelated outbreaks are distinguishable. This requires that types evolve rapidly, which in turn entails that even very closely related isolates will occasionally have different types. In meningococci, for example, a number of approaches have been used, mostly based on variation in surface antigens and their genes (Jolley et al., 2007) and are highly effective in outbreak detection and characterization (Elias et al., 2006). Patterns of national or global spread, however, require the grouping of isolates that share more distant common ancestors (Caugant & Maiden, 2009). Thus, to be reliable and informative, typing methods need to be discriminatory, but also allow estimation of relationships between isolates that exhibit different types. DNA sequence data are particularly well suited for this, and stochastic models of sequence evolution have been developed to make such analysis possible, starting with the pioneering work of Jukes & Cantor (1969).
The unambiguous nature of MLST means that the spread of particular types can be tracked accurately, regardless of when or where the typing was performed (Maiden, 2006;Maiden et al., 1998;Urwin & Maiden, 2003). Furthermore, sequence data taken from a variety of loci contain more information about genealogical relationships than most other data types, for example a single large sequenced region, although it poses the question how best to extract that information, given the confounding effect of recombination. Analyses of MLST data often assume that isolates sharing the same sequence type constitute an evolutionary unit. The extended dataset presented here shows that this assumption can be an oversimplification. The 20-locus analysis shows, for example, that grouping all isolates with MLST type ST-8 would be incorrect (Fig. 3). If only the data from the seven MLST loci are considered, ST-66 is a single locus variant of ST-8 (Fig. 3); however, the additional loci revealed several recombination and mutation events differentiating the ST-8 isolates from each other (lines A, D, I, J and G in Fig. 3). The recombination events in genes dhps and gln were also found in ST-66, and  together provided strong evidence that the ST-8 strain above branch A was more closely related to ST-66 than it is to any of the other ST-8 strains in the sample (Fig. 3).
Closely related bacterial isolates contain few differences and thus relationships amongst them are difficult to resolve without the examination of a large number of potentially informative characters. On the other hand, deeper relationships are often obscured by extensive recombination (Holmes et al., 1999). For this reason, intermediate branches in the genealogy will generally be the easiest to resolve. Depending on the topology of the tree, some branches will be intrinsically easier to resolve than others, whatever methods are used. This analysis shows that in N. meningitidis, the seven loci employed by routine MLST have substantial power to reconstruct clades in a narrow age range (Fig. 5). This age range corresponds to the age of the hyperinvasive lineages in N. meningitidis, and more generally to the concept of clonal complexes that has emerged from both MLEE and MLST studies (Caugant et al., 1987;Urwin & Maiden, 2003). The clonal complexes are further characterized by gene content (Hotopp et al., 2006), including the presence and absence of virulence factors (Bille et al., 2008), antigenic properties, including capsules (Caugant & Maiden, 2009) and outer-membrane protein variants (Callaghan et al., 2006;Urwin et al., 2004), and the propensity to invade (Yazdankhah et al., 2004). The addition of further loci provides limited improvements on the definition of these complexes (Fig. 4), which is expected given that the number of inferred groupings in this age range is in line with coalescent predictions (Fig. 5).
The 20-locus analysis was, however, able to identify a large number of additional younger clades as well as some older clades (Fig. 5). This partially filled the gap in the overall distribution of ages for the reconstructed clades with that expected in a neutrally evolving population, although the youngest and oldest subdivisions remained unresolved (Fig. 5).

Conclusions
These analyses demonstrate that the clonal complex concept captures an appreciable proportion of the information on genealogical relationships amongst N. meningitidis isolates available from the seven-locus MLST scheme. However, the seven-locus MLST data do not provide information on longer timescales, and the interrelationships among the lineages corresponding to clonal complexes remain unresolved, even with 20-locus data. This contrasts with the more extensive inferences that can be made from seven-locus MLST studies of some other genera, for example the Bacillus cereus group, where MLST data provide information on clonal relationships at multiple timescales (Didelot & Falush, 2007;Didelot et al., 2009). These differences in the level of information contained in MLST datasets may due to variation in the homologous recombination rates across bacterial species (Feil et al., 2001;Vos & Didelot, 2009).
The observation that clonal complexes exist in N. meningitidis is, in itself, consistent with neutral evolution and does not necessarily require any special non-neutral process such as the emergence of epidemic clones (Smith et al., 1993); however, examination of the pattern of variation within and amongst complexes shows some evidence of deviations from neutrality, and several explanations have been invoked to accommodate them (Buckee et al., 2008;Fraser et al., 2005;Jolley et al., 2005). The data presented here are consistent with such deviations (Fig. 2).
Finally, although seven-locus MLST and similar data are most powerful for resolving relationships at the level of clonal complex, many correlations of interest between genotype and phenotype occur at higher and at lower levels and this variation is a potentially rich source of biological inference. Variation in phenotype between closely related strains is particularly informative, since it occurs on a largely isogenic background (Falush & Bowden, 2006;Roumagnac et al., 2006;Zhu et al., 2001). Of particular interest is the ST-41/40 clonal complex, which is the most diverse of all meningococcal clonal complexes, representing 1895 of the 7606 STs (25 %) present in the Neisseria MLST database at the time of writing (July 2009) (Caugant & Maiden, 2009): members of this clonal complex exhibit diversity in their invasive phenotype, and will be particularly amenable to this type of analysis (Harrison et al., 2009). Genomic data on the population scale, examining sufficient isolates that have carefully defined phenotypes, will allow relationships among meningococci to be resolved at multiple levels, allowing a richer description of epidemiological and evolutionary processes within bacterial populations (Maiden, 2008).