Documentation

    VariCarta is a web-based database developed with the goal of collecting, reconciling and consistently cataloguing literature-derived genomic variants found in ASD subjects using primarily whole exome or whole genome sequencing. We put emphasis on precise, systematic curation of the data, standardized processing and reporting, identification of overlaps, and comprehensive annotation.

    VariCarta can be used to query variants by gene, genomic location, publication or overlap between the studies (using the clickable heatmap on the Statistics page).

    A preprint manuscript for VariCarta is available on bioRxiv: https://doi.org/10.1101/608356

    The variant database can be queried through the search bar displayed at the top of the Home page. Start typing a gene symbol under Gene Symbol, choose the desired gene from the offered list of suggestions and click Submit.

    Alternatively, you can search for variants by genomic region. We currently use the hg19/GRCh37 genome assembly. A search by Region requires a chromosome, a start coordinate, and a stop coordinate. For example, to search chromosome 1 from coordinate 1000000 to 2000000, enter chr1:1000000-2000000.
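
    As a rough illustration of the region format described above, the sketch below splits a region string into chromosome, start and stop. It is not VariCarta's own code: the names Region and parse_region, the optional "chr" prefix and the validation rules are assumptions made for this example.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    stop: int


# Assumed pattern: optional "chr" prefix, then a one- or two-digit number,
# X, Y or M, followed by plain integer start and stop coordinates.
REGION_RE = re.compile(r"^(?:chr)?(\d{1,2}|X|Y|M):(\d+)-(\d+)$", re.IGNORECASE)


def parse_region(text: str) -> Region:
    """Parse a string such as 'chr1:1000000-2000000' into its three parts."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    chrom = match.group(1)
    start, stop = int(match.group(2)), int(match.group(3))
    # Coordinates are 1-based, and the start must not exceed the stop.
    if start < 1 or stop < start:
        raise ValueError(f"start must be >= 1 and <= stop: {text!r}")
    return Region("chr" + chrom.upper(), start, stop)


region = parse_region("chr1:1000000-2000000")
```

    The sketch does not check that the chromosome number exists or that the coordinates fall within the hg19/GRCh37 chromosome length.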

    Searching for variants from specific publications.

    Our Publications page lists descriptions of the studies incorporated in our database. Click on the book icon in the Details column to see details of a specific publication. You can retrieve all the variants from the paper that are included in VariCarta by clicking the number under Variant Event Count.

    Example: The De Rubeis, 2014 study reports 1694 variants included in our database.

    Searching for variants overlapping different publications.

    It is possible to look up the overlapping variants between two different publications through our Statistics page. Simply go to the Variant Event Overlap heatmap, and click the grid cell with the intersecting papers of interest.

    Variant table

    Any kind of variant search will return a variant table that shows the summarized information about all the variants found in the query region. It also provides links to gene information from the Ensembl and NCBI databases, genomic annotation of the region using the UCSC Genome browser, information about papers that reported the variants and original, published variant data. The variants in the table are represented either as variant events or complex variant events.

    VariCarta’s variant event is a unique combination of a reference allele, its genomic location and an alternative allele, all belonging to a single individual. This allows us to identify and group together the same variants from the same individuals that have been reported in multiple studies. Each variant event is displayed only once in VariCarta’s variant result table, with the Source column listing the IDs of all publications reporting it. The publication IDs are tagged to indicate the scope of the sequencing study (G: whole genome sequencing, E: whole exome sequencing, T: targeted sequencing). Here is an example of a variant event.
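
    The definition above can be sketched as a grouping by key. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the record layout, the sample data and the function name group_variant_events are hypothetical, and the location is reduced to a chromosome and start coordinate for brevity.

```python
from collections import defaultdict

# Hypothetical reports: (publication_id, subject_id, chrom, start, ref, alt).
reports = [
    ("paperA", "subj1", "chr1", 151377903, "T", "TCGTCATCA"),
    ("paperB", "subj1", "chr1", 151377903, "T", "TCGTCATCA"),
    ("paperB", "subj2", "chr1", 151377903, "T", "TCGTCATCA"),
]


def group_variant_events(reports):
    """Collapse reports into variant events keyed by individual, location and alleles."""
    sources = defaultdict(set)
    for publication, subject, chrom, start, ref, alt in reports:
        # The same allele change in the same individual is one event,
        # however many publications report it.
        sources[(subject, chrom, start, ref, alt)].add(publication)
    return {key: sorted(pubs) for key, pubs in sources.items()}


events = group_variant_events(reports)
```

    In this example the two reports for subj1 collapse into one event with two source publications, while the identical allele change in subj2 remains a separate event.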

    VariCarta’s complex variant event is a grouping of two or more variant events from the same individual that differ but have overlapping or adjacent genomic coordinates. This indicates that the grouped variant events might be describing the same underlying genotype but are incongruent due to the heterogeneity of formats used across papers. A complex event is initially displayed as one row that can be expanded to show the information about each grouped variant event. Here is an example of a complex variant event.
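
    One way to picture this grouping is as an interval merge per individual. The sketch below is an illustration only and may differ from how VariCarta actually builds complex events: the tuple layout and the function name group_complex_events are hypothetical, coordinates are assumed to be 1-based and inclusive, and "adjacent" is taken to mean that one event starts at most one base after the previous one ends.

```python
from collections import defaultdict


def group_complex_events(events):
    """Group (subject_id, chrom, start, stop) tuples that overlap or are adjacent.

    Input events are assumed to be distinct variant events. Returns a list of
    groups; a group with two or more members corresponds to a complex event.
    """
    by_subject_chrom = defaultdict(list)
    for event in events:
        by_subject_chrom[(event[0], event[1])].append(event)

    groups = []
    for key in sorted(by_subject_chrom):
        current, current_stop = [], None
        for event in sorted(by_subject_chrom[key], key=lambda e: (e[2], e[3])):
            start, stop = event[2], event[3]
            if current and start <= current_stop + 1:
                # Overlapping or directly adjacent: extend the running group.
                current.append(event)
                current_stop = max(current_stop, stop)
            else:
                if current:
                    groups.append(current)
                current, current_stop = [event], stop
        if current:
            groups.append(current)
    return groups


example = [
    ("subj1", "chr1", 100, 102),
    ("subj1", "chr1", 103, 103),
    ("subj1", "chr1", 200, 200),
    ("subj2", "chr1", 101, 101),
]
complex_groups = [g for g in group_complex_events(example) if len(g) >= 2]
```

    Here only the first two subj1 events are grouped, because they are adjacent; the subj2 event overlaps them in coordinates but belongs to a different individual.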

    The original variant data from the source publication can be accessed by clicking on the magnifying glass icon. All query results are available for download in plain text (CSV) format via the spreadsheet icon in the table navigation header or footer bar.

    Publications

    Studies currently included in VariCarta are listed in a table on the Publications page. This table contains basic publication information as well as some details about the scope of the study (e.g. whole genome/exome vs targeted sequencing as given in the Technology column) and characteristics of the studied subjects (number of subjects, cohorts used, diagnosis).

    Clicking on a book icon in the Details column opens a page with more detailed information about the study, including methodology used, size and type of the cohort, and types of variants reported. We also provide curation notes, which detail issues that had to be resolved during the curation stage, or other noteworthy information regarding the study. Clicking on the Variant Event Count will display all the variants from that publication that are available in VariCarta.

    Statistics

    The Statistics page offers basic database statistics, including the total number of subjects with unique VariCarta subject IDs, the number of all variant events, as well as counts of some specific classes of variant events (de novo, LoF). We also include several gene rankings based on different criteria. These rankings are not intended to be used for ASD candidate gene prioritization because they do not exclude common variants and variants from targeted studies (non-coding variants are excluded from gene ranking calculations).

    The page also shows the distribution of variants across publications, functional effects and genomic features. Finally, it includes a heatmap of the variant overlap between publications, which illustrates the extent of variant double-reporting across the literature. The number in each grid cell is the number of identical variant events reported in both of the intersecting papers. Hovering over a cell produces a tooltip that lists the two intersecting papers and the exact number of overlapping variants. The overlapping variants can be accessed by clicking on the number in the grid cell.
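
    The value shown in each heatmap cell can be thought of as the size of a set intersection. The following is a minimal sketch under that assumption, not the code behind the Statistics page; the function name overlap_counts and the toy input are hypothetical, and any hashable variant event key would work in place of the integers used here.

```python
from itertools import combinations


def overlap_counts(events_by_publication):
    """Count shared variant events for every pair of publications.

    events_by_publication maps a publication ID to a set of variant event keys.
    """
    counts = {}
    for pub_a, pub_b in combinations(sorted(events_by_publication), 2):
        shared = events_by_publication[pub_a] & events_by_publication[pub_b]
        counts[(pub_a, pub_b)] = len(shared)
    return counts


# Toy input: integers stand in for variant event keys.
toy = {"paperA": {1, 2, 3}, "paperB": {2, 3, 4}, "paperC": {5}}
counts = overlap_counts(toy)
```

    Each unordered pair appears once, keyed in alphabetical order, which corresponds to one triangle of a symmetric heatmap.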

    Literature search and curation

    Initially, we searched the literature for publications reporting genomic variants (SNVs and indels) found in subjects with an ASD diagnosis. The ongoing data collection relies on customized Google Scholar and PubMed citation alerts using the keywords (and their variants) “autism”, “genetic”, “variants”, and “sequencing”. We exclude a paper if it has one of the following issues:

    • it does not provide a clear ASD diagnosis for at least a subset of its subjects
    • it does not provide enough variant information to determine genomic coordinates and allele change unambiguously
    • it does not associate variants with subject IDs

    We prioritize whole-genome and whole-exome studies over candidate gene studies and tend to process studies with a higher number of subjects first. Papers currently included in VariCarta are listed on the Publications page.

    For each publication, we apply the following curation procedure, with the intermediate goal of organizing all relevant variant information in a tabular format that is ready for import. The first step is to copy the relevant text from the source file (typically a supplementary file) as-is into a template import document. The completed document is composed of a set of predefined worksheets, which contain the publication’s metadata, the variant data and a description of the steps needed to automatically extract, transform and load the data into a uniform variant data model. This document is parsed by a computational pipeline, which validates the data and stores it in a relational database. A link to the pipeline’s source code and an example import document are available on the Downloads page.

    Availability and usage

    Data downloads and licensing information are available on the Download page. The source code for the web application and the variant processing pipeline is open source (Apache 2 license) and available on GitHub under https://github.com/PavlidisLab/ndb.

    Variant nomenclature

    To make this project possible, we had to consolidate different variant reporting conventions, coordinate schemes and genome assemblies.

    Conventions used in VariCarta

    • Allele change: In cases of insertions or deletions, we use a one-base anchor to disambiguate the allele change. For example, an insertion of CGTCATCA on chromosome 1 at coordinate 151377903 would be listed as T, TCGTCATCA, chr1:151377903-151377903 for its reference allele (REF), alternate allele (ALT), and genomic location. For single nucleotide variants (SNVs), we simply display the before and after nucleotides as REF and ALT.
    • Coordinate system: The index is 1-based, meaning that a variant from 10,000-10,003 would affect 4 bases.
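
    The two conventions above can be reproduced with a few lines of code. This is a sketch for illustration only: the function names anchored_insertion and span_length are hypothetical, and the anchor is assumed to be the reference base at the listed coordinate, as in the example above.

```python
def anchored_insertion(anchor_base, inserted, chrom, position):
    """Return (REF, ALT, location) for an insertion written with a one-base anchor."""
    ref = anchor_base
    alt = anchor_base + inserted  # the anchor is kept in front of the inserted bases
    location = f"{chrom}:{position}-{position}"
    return ref, alt, location


def span_length(start, stop):
    """Number of bases covered by a 1-based, inclusive coordinate range."""
    return stop - start + 1


ref, alt, location = anchored_insertion("T", "CGTCATCA", "chr1", 151377903)
affected = span_length(10000, 10003)
```

    With these inputs the first call reproduces the T, TCGTCATCA, chr1:151377903-151377903 example, and the second returns 4, matching the four bases covered by 10,000-10,003.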

    Some definitions

    A de novo variant is a type of DNA mutation that emerges for the first time in a family. This type of mutation arises in a germ cell (egg or sperm) of one of the parents or in the fertilized egg itself, and it is not present in the parental genomes of an individual. De novo mutations are usually identified using trio or quad family DNA sequencing. Some publications primarily report de novo variants, while others also report parentally inherited variants. In VariCarta, the Inheritance column in the variant table indicates whether a variant is known to be de novo.

    Loss of Function (LoF) variants are DNA mutations that are predicted to disrupt the function of a protein-coding gene. Mutations that introduce an early stop codon (nonsense variants), insertions or deletions (indels) that change the reading frame (frameshift variants) and splicing variants (DNA changes in the close vicinity of an exon-intron boundary that affect splicing of pre-mRNA) are examples of LoF variants. The following types of variants are considered to be LoF in VariCarta: stopgain, stoploss, splicing, frameshift insertion, frameshift deletion and frameshift substitution. The variant type is indicated in the Effect column in the variant table, where the LoF variants are highlighted in red.
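
    For readers working with downloaded data, the LoF list above can be applied as a simple membership test. The sketch below is an assumption-laden example rather than VariCarta's own logic: the name is_lof is hypothetical, and the exact spelling and capitalization of values in the Effect column may differ from the strings used here.

```python
# Effect labels treated as LoF, taken from the list in the text above.
LOF_EFFECTS = {
    "stopgain",
    "stoploss",
    "splicing",
    "frameshift insertion",
    "frameshift deletion",
    "frameshift substitution",
}


def is_lof(effect):
    """Return True if an Effect value is one of the listed LoF types."""
    # Normalizing case and surrounding whitespace is an assumption of this sketch.
    return effect.strip().lower() in LOF_EFFECTS


flags = [is_lof(e) for e in ("stopgain", "Frameshift Deletion", "synonymous SNV")]
```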

    Contact

    To contact us, please send an e-mail to pavlab-support@msl.ubc.ca.