README file for SITES v 1.1                              Feb 20, 1997

* Copyright 1996,1997 by Jody Hey
* Rutgers University, Piscataway, NJ 08855
*
* This computer program and documentation may be freely copied
* and used by anyone, provided no fee is charged for it.

To obtain SITES:

For most PC operating systems:
    download and unzip the file 'sites1-1.zip'

For unix users, you may prefer to uncompress and tar:
    download 'sites1-1.z'
    then type:     uncompress sites1-1.z
    followed by:   tar xf sites1-1

For Macintosh users:
    A compiled and ready to run executable has been prepared by
    Mark Jensen. It is available as sitesmac.hqx. You should also
    get a copy of sites1-1.doc, the documentation file, from either
    the zip or z archives.

______________________
Files in .zip and .z Packages
______________________

sites1-1.doc  - documentation
sites.c      |
siteread.c   |
siteutil.c   |- program files
sitemod.c    |
sites.h      |
period_p.dat |
period_s.dat |- sample data files
period_s.sit  - sample output file
sitesdos.exe  - compiled executable for DOS.

______________________
Overview
_______________________

SITES is a computer program for the analysis of comparative DNA
sequence data. It is primarily intended for data sets with multiple
closely related sequences. It is especially useful when multiple
sequences have been obtained from each of one or several closely
related populations or species.

SITES is written in ANSI C. It should be possible to compile and run
the program on a wide variety of platforms and operating systems.

If you find SITES useful, and if it is used for analyses that end up
as part of a publication, I would appreciate it if you would cite the
following reference, which should be published sometime in early 1997.
The program is mentioned in this paper.

    Hey, J. and J. Wakeley. 1997. A coalescent estimator of the
    population recombination rate. GENETICS 145: 833-846.

_______________________
New Features
_______________________

There are two main improvements in this latest version of the program.
Also, the output file has been slightly modified, and some bugs have
been fixed.

* New Command Line Options

  It is now possible to specify all data set limitation options on the
  command line. If only a subset of the data is to be examined (i.e. a
  subset of the sequences, the rows; or a subset of the DNA positions,
  the columns), the specifications can be given in a command line
  argument. The main benefit is that this simplifies cases in which
  many runs are to be carried out, each on a subset of the data.
  Probably the best example is a sliding window analysis, with SITES
  run repeatedly, each time on a short portion of the sequence. Now a
  batch file can be set up with repeated calls to SITES, each call
  containing all necessary information.

* Historical Population Model Fitting

  A new Analysis Option applies the models and analyses described in
  Wakeley, J. & J. Hey (1997, Estimating ancestral population
  parameters. Genetics, in press). There are two models: an isolation
  speciation model, with four parameters; and a model of recent
  population size change, with three parameters. For most data sets
  these models will not be very useful, as data that do not correspond
  closely to the assumptions of the model tend not to yield very
  informative parameter estimates.

_______________________
Analyses
_______________________

* Polymorphism Table
  SITES identifies polymorphic sites and generates a table summarizing
  polymorphism information. Polymorphisms are characterized with
  regard to several categories:
      Noncoding or Coding;
      Synonymous, Replacement or Ambiguous;
      Transversion or Transition;
      Insertion/Deletion (Indels)

* Indel Table
  SITES identifies the boundaries of insertions/deletions (indels) and
  generates a polymorphism table for indel variation.

* Codon Usage Tables (assuming the standard genetic code)

* Numbers of synonymous and replacement base positions.
  These can be interpreted as the relative proportions of random
  mutations that would be expected to cause synonymous or replacement
  changes.

* Number of pairwise differences among sequences.

* GC content for complete sequences and each codon position.

* Group Comparisons.
  SITES works with groups of sequences, so that groups can be compared
  with one another. SITES determines the average number of pairwise
  differences among all pairs of groups, as well as the net divergence
  among all pairs of groups. SITES also determines the numbers of
  shared and fixed differences among groups.

* Polymorphism Analyses
  SITES conducts several kinds of polymorphism analysis that can be
  applied to the entire data set or to multiple subsets of sequences.
  - Generates the two most commonly used estimates of the neutral
    mutation parameter 4Nu: pi (Nei, 1987, p. 256) and Theta
    (Watterson, 1975).
  - Determines several measures of non-neutrality in the site
    frequency distribution, including Tajima's D (Tajima, 1989),
    Fu and Li's D and Fu and Li's D* (Fu and Li, 1993).
  - Determines the site frequency distribution for each group.
  - Also carries out some historical population model fitting; see
    New Features.

* Historical Population Model Fitting
  SITES carries out the analyses described in Wakeley, J. & J. Hey
  (1997, Estimating ancestral population parameters. Genetics, in
  press). The data are fit to two different models.
  - an isolation speciation model, with four parameters
  - a model of recent population size change, with three parameters

* Recombination analyses.
  SITES conducts several kinds of recombination analysis on each
  group, or the entire data set.
  - a table of site by site congruency
  - a table of the minimum set of recombination intervals
    (Hudson & Kaplan, 1985)
  - Hey & Wakeley's (1997) gamma, an estimate of 4Nc
  - Hudson's (1987) estimate of 4Nc

* The data set can be partitioned in numerous ways, without changing
  the data file.
  - the groupings of sequences can be changed at runtime
  - sequences can be dropped from the analysis at runtime
  - analyses can be limited to specific base pairs or specific intervals
  - specific base positions can be dropped from the analysis
  - analyses can be limited to certain categories of polymorphic sites

______________________________________________________________________________
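
Illustration (not part of SITES)

For readers unfamiliar with the two estimates of 4Nu listed under
Polymorphism Analyses, the following is a minimal sketch of how pi (the
average number of pairwise differences) and Watterson's Theta (the
number of segregating sites divided by the harmonic sum 1 + 1/2 + ... +
1/(n-1)) can be computed for a small alignment. It is written in
Python purely for illustration and is not the SITES source code. The
function names and the toy alignment are hypothetical, the values are
per region rather than per base pair, and the sketch assumes aligned
sequences of equal length with no indels or ambiguous bases. How SITES
itself treats such cases is described in sites1-1.doc.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def pairwise_differences(a, b):
    """Count positions at which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def pi_estimate(seqs):
    """Average number of pairwise differences over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


def watterson_theta(seqs):
    """Segregating sites divided by a_n = sum of 1/i for i = 1 .. n-1."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


# Hypothetical toy alignment: four sequences, eight aligned positions.
toy_alignment = [
    "ACGTACGT",
    "ACGTACGA",
    "ACCTACGA",
    "ACCTACGT",
]

if __name__ == "__main__":
    print("segregating sites:", segregating_sites(toy_alignment))
    print("pi:", round(pi_estimate(toy_alignment), 4))
    print("Theta:", round(watterson_theta(toy_alignment), 4))
```

In this toy case two columns are polymorphic and the six sequence
pairs differ at eight positions in total, so pi is 8/6 and Theta is
2 divided by (1 + 1/2 + 1/3).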