Documentation

SNiPlay combines two major components in a common Web interface: a set of pipelined programs and a relational database.

Pipeline components


The automated process consists of seven main modules and requires many additional in-house Perl modules to create junctions between the main programs (format compatibility) and to output the results in various formats (e.g. allelic files).


1) SNP / indel detection

This treatment represents the core of the system, and all subsequent analyses are based on it.
Various input data types and formats are supported depending on the entry point:
    1. Trace files

    Users are requested to provide a collection of ABI sequence electrophoregrams (archive containing all trace files).

    When the input consists of trace files, the first step calls the Polymorfind program, which detects SNPs and indels from PCR product sequencing in heterozygous organisms. Polymorfind itself includes a complete pipeline combining the Phred/Phrap/Consed software suite and the Polyscan program for managing polymorphisms. Polymorfind notably has the ability to detect heterozygous indels through a combination of two passes of Polyscan with contrasting stringency.



    2. Alignments

    Three input formats are accepted as input:

    • Standard FASTA alignments: SNiPlay uses a home-made Perl module to detect SNPs and insertion/deletion events and to extract allelic information for each polymorphic position. The detection consists of a simple parsing of the alignment in order to determine positions, i.e. columns, exhibiting variation, either a base variation or a heterozygous IUPAC code for SNP, or the presence of gaps for indel. In a second time, alleles are assigned to genotypes for each variant position.



    • Note that in case of FASTA alignments, each sequence must have the same length.

    • Text output from Staden: Staden may have been previously used to align, trim and edit raw Sanger sequencing data



    • Text output from Genalys


    3. Genotyping file

    Genotyping files must be provided in the “.ped” format, which is the widely used format for linkage pedigree data required for Gevalt or Plink.

    A parser extracts allelic data, but the subsequent mapping, annotation and diversity analysis steps are disabled because sequence information is lacking.



Whatever the entry point, the next step consists of reporting the statistical information for each polymorphic position, namely, variation, number of alleles, frequency of the most and least frequent alleles, number of readable accessions, and number of homozygous vs. heterozygous accessions for each polymorphism. These SNP statistics are summarized into Excel files available for download. Additionally, this module generates a formatted SNP file for specific design of Illumina’s SNP chip.

A FASTA alignment and a consensus sequence showing which nucleotides are most abundant at each position are generated for each set of sequences. This consensus sequence offers a good representation of the initial dataset and will be used subsequently for mapping, functional annotation, and diversity analysis.



Allelic files are generated in various formats so that users can load genotyping data and locally run certain widely used external programs dedicated to haplotype reconstruction, association mapping, diversity and phylogenetic analyses and population genetic structure analysis. Indels are automatically binary coded in genotyping file so that they can be interpreted as SNP and treated by the phasing programs.




Options
    i) Alignments or genotyping files can be provided either as a single file or as a zipped collection to analyze several genes in the same run.
    The user has to select the appropriate option:



    Electrophoregrams should be provided in a single archive containing all the trace files. If several amplicons are included, each will then be automatically detected using the file names.

    ii) In order to enable the program to extract gene/amplicon and accession names, users are requested to specify under which form the sequence identifiers are provided.



    iii) External data such as population, geographic or phenotypic data, can be provided optionally and may be used for subsequent group comparisons.



    Venn diagrams are displayed to report the number of SNPs shared between the different groups (A) as well as the number of SNPs newly formed when crossing groups (B). Common or group-specific SNPs can then be retrieved in distinct files generated for each diagram section.


2) Mapping to a reference genome

An automatic BLAST-based mapping is performed using the consensus sequence as query sequence and a reference genome as subject. Three complete genomes are currently available (grapevine, sorghum and maize), but other reference genomes may easily be added in the future.
The first best hit is used to localize the amplicon within the genome and a table displays the number of hits to reveal the presence of duplicate genes or pseudo-genes.

The genome-associated GFF3 annotation is then used to obtain the corresponding gene name and gene structure and then localize polymorphic positions in the CDS, introns or UTRs of the genes. If a SNP is located within a CDS, its sequence is extracted to identify the putative protein and to determine whether or not the SNP is synonymous. Additionally, the proportion of coding sequence within the amplicon is calculated.


3) Functional annotation

A simple BLAST is performed against the Uniprot and TrEMBL databases to further annotate the genomic region and to assign a putative function to the gene. If the mapping reveals that the region corresponds to a predicted gene, the putative protein is then used for BLASTP; otherwise, a BLASTX is launched using the consensus nucleotide sequence as query.

4) Phasing / haplotype reconstruction

A haplotype is the sequence of SNPs in one of the two homologous chromosomes of a specific region, and it cannot be directly established from Sanger sequencing techniques, which cannot tell which copy of a pair of homologous chromosomes each allele belongs to. Haplotype reconstruction consists of inferring the most probable combination of alleles on each homologous chromosome from genotyping data. Moreover, this method allows missing data to be inferred. The pipeline implements different programs to perform this task, using the default values for parameters: Gevalt using the Gerbil executable, PHASE or Shape-IT.


5) Linkage disequilibrium

Two classical measures of linkage disequilibrium (r2, D') are defined as the non-random association of alleles between two loci, and their significance can be calculated from the reconstructed haplotypes. HaploView then produces a plot of LD and haplotype blocks along each region/amplicon downloadable as a PNG image.



Alternatively, three new linkage disequilibrium measures correcting for biases due to population structure (r2S), relatedness between individuals (r2K) or both (r2SK) are calculated from genotypic data by an adjusted algorithm implemented as an R module. For each pair of markers, the estimates of the new r2S, r2K or r2SK measures are provided in addition to the classical r2 measure defined for unphased genotypes. The population structure and/or kinship matrices have to be supplied by the user.

6) Diversity analysis

Haplotypes are completed by non-polymorphic bases taken from consensus sequence to obtain complete phased sequences analyzed by the SeqLib library. It provides diversity indices including nucleotide diversity (Pi), number of segregating sites (Theta), number of haplotypes (H), haplotype diversity (Hd), and Tajima’s D test of neutral evolution for each gene fragment.
If a collection of genes is involved and if mapping has been previously performed, a diversity map is generated to display a color representation of Tajima’s D values along chromosomes.


7) Haplotype network and gene phylogeny

The Haplophyle program based on the "Median Joining Network" algorithm linked to the Graphviz graph visualization tool is used to construct an artificial neural network. The sizes of circles are proportional to haplotype frequency, and the lengths of connecting lines are proportional to the number of mutational steps between haplotypes. A downloadable PNG image representing a putative phylogenetic network is produced for each gene.
If external information associated with genotypes has been previously specified, the circles can be replaced by colored pies showing the subgroup frequency for each haplotype.


SNiPlay database

1) Database "feeding"

The SNiPlay database has been augmented after launching the first two modules of the pipeline (SNP detection + mapping).
It includes polymorphisms and genotyping data produced by public grapevine projects.

2) Hosted data types

The database was designed to store SNPs, indels and associated statistical values, genotyping data, sequences, taxa and accessions, SNP genomic locations and annotations.
According to the technology used, the database anticipates two main sources of genotyping data and stores their specificities:
  • Sequence data, such as the position in the consensus sequence and flanking regions.
  • Illumina VeraCode technology, including OPA name and scoring
It also allows accessions to be linked to predefined characteristics such as population structure, cultivar, phenotype or any criteria that biologists may subsequently use to filter accessions.

3) SNP queries

SNiPlay allows one to easily explore data and retrieve SNPs and indels using various filters (such as genomic position, percent of missing data, polymorphism type, allele frequency, synonymous/non-synonymous, Illumina scores or indel size) to compare SNP patterns between populations and to export genotyping data or sequences in different formats.

After having selected a project, the user has to choose genotypes on the left side (A) and genomic regions or genes on the right side (B) and adjust the filters; the system then extracts SNPs for this selection and calculates corresponding frequencies and consensus sequences. If the subset of genotypes corresponds to the sample initially submitted to the database, SNPs are pre-calculated and thus promptly reported.

For a SNP with a given position in the genome, the database can collect and combine genotyping data obtained in distinct experiments, from sequencing and/or Illumina VeraCode projects. These positions are highlighted in the results table in green if no difference appears between different experiments or in yellow if differences are detected. In this latter case, the conflict is replaced by missing data when exporting allelic files or calculating SNP frequencies (C).

With respect to this concept, the interface can accommodate different “merging levels”, from global synthesis (high merging level) to the experimental level (low merging level) using a specific program. The sample can be composed of varieties, accessions or DNA samples, while the sequence can be a gene, an amplicon or a submission batch with the latter being the electrophoregram reading from a specific curator. Depending on the merging level, users can decide to focus on a specific experiment (which will not imply any merging) or force the system to combine allelic data between experiments.

Request results can be finally exported as different files: genotyping file, alignment file or submission file for Illumina VeraCode technology. SNP with accompanying files can also be exported for submission to GnpSNP, the INRA SNP repository.


4) General statistics for a given project

The “project overview” option summarizes the data for each project, and calculates general statistics that can easily be obtained from the database: number of mutations in coding vs. non-coding sequences, number of synonymous vs. non-synonymous substitutions, average SNP density along the genome.

This section also provides graphical outputs comparing the average heterozygosity of accessions (A), the frequency of alternative nucleotide substitution (B), and the distribution of indel sizes (C).





southgreen IRD INRA