Documentation

    VariCarta is a web-based database developed with the goal of collecting, reconciling and consistently cataloguing literature-derived genomic variants found in ASD subjects using primarily whole exome or whole genome sequencing. We put emphasis on precise, systematic curation of the data, standardized processing and reporting, identification of overlaps, and comprehensive annotation.

    VariCarta can be used to query variants by gene, genomic location, publication or overlap between the studies (using the clickable heatmap on the Statistics page).

    VariCarta is described in a paper published in Autism Research: Belmadani M, Jacobson M, Holmes N, Phan M, Nguyen T, Pavlidis P, Rogic S. "VariCarta: A Comprehensive Database of Harmonized Genomic Variants Found in Autism Spectrum Disorder Sequencing Studies." Autism Res. 2019 Dec;12(12):1728-1736. doi: 10.1002/aur.2236

    The variant database can be queried through the search bar displayed at the top of the Home page. Start typing a gene symbol under Gene Symbol, choose the desired gene from the offered list of suggestions and click Submit.

    Alternatively, you can search for variants by genomic region. We are currently using the Hg19/GRCh37 genome assembly. To search by Region, you need a chromosome, a start coordinate, and a stop coordinate. For example, to search chromosome 1 from coordinate 1000000 to 2000000, enter chr1:1000000-2000000.
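
    The region format described above can be checked before submitting a query. The following is a minimal sketch, not VariCarta's own parser: the names Region and parse_region are hypothetical, and the accepted chromosome names (1-22, X, Y, M with a "chr" prefix) are an assumption.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


# Assumed pattern: "chr" prefix, chromosome name, then START-STOP as plain integers.
_REGION_RE = re.compile(r"^chr([0-9]{1,2}|X|Y|M):([0-9]+)-([0-9]+)$")


def parse_region(text: str) -> Region:
    """Parse a region string such as 'chr1:1000000-2000000'."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Not a valid region: {text!r}")
    chromosome, start, end = match.groups()
    start_pos, end_pos = int(start), int(end)
    # Coordinates are 1-based, so 0 is not a valid start; stop must not precede start.
    if start_pos < 1 or end_pos < start_pos:
        raise ValueError(f"Invalid coordinates in region: {text!r}")
    return Region(f"chr{chromosome}", start_pos, end_pos)
```

    For the example above, parse_region("chr1:1000000-2000000") returns Region(chromosome='chr1', start=1000000, end=2000000).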

    Searching for variants from specific publications.

    Our Publications page lists descriptions of the studies incorporated in our database. Click on the book icon in the Details column to see details of a specific publication. You can retrieve all the variants from the paper that are included in VariCarta by clicking the number under Variant Event Count.

    Example: The De Rubeis, 2014 study reports 1694 variants included in our database.

    Searching for variants overlapping different publications.

    It is possible to look up the overlapping variants between two different publications through our Statistics page. Simply go to the Variant Event Overlap heatmap, and click the grid cell with the intersecting papers of interest.

    Variant table

    Any kind of variant search will return a variant table that shows the summarized information about all the variants found in the query region. It also provides links to gene information from the Ensembl and NCBI databases, genomic annotation of the region using the UCSC Genome browser, information about papers that reported the variants and original, published variant data. The variants in the table are represented either as variant events or complex variant events.

    A VariCarta variant event is a unique combination of a reference allele, its genomic location and an alternative allele, belonging to a single individual. This allows us to identify and group together the same variants from the same individuals that have been reported in multiple studies. Each variant event is displayed only once in VariCarta’s variant result table, with the Source column listing the IDs of all publications reporting it. The publication IDs are tagged to indicate the scope of the sequencing study (G: whole genome sequencing, E: whole exome sequencing, T: targeted sequencing). Here is an example of a variant event reported in two different papers with different sequencing methods.
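
    The definition of a variant event given above can be illustrated with a short grouping sketch. It is an assumption-laden illustration rather than VariCarta's pipeline code: ReportedVariant, group_variant_events and the field names are hypothetical, and each record is assumed to already carry a harmonized subject identifier and Hg19 coordinates.

```python
from collections import defaultdict
from typing import NamedTuple


class ReportedVariant(NamedTuple):
    subject_id: str   # harmonized subject identifier (assumed available)
    chromosome: str
    start: int
    end: int
    ref: str
    alt: str
    source: str       # ID of the publication reporting the variant


def group_variant_events(records):
    """Map each variant event key to the list of publications reporting it."""
    events = defaultdict(list)
    for record in records:
        # Same individual + same location + same REF/ALT = same variant event.
        key = (record.subject_id, record.chromosome, record.start,
               record.end, record.ref, record.alt)
        if record.source not in events[key]:
            events[key].append(record.source)
    return dict(events)
```

    Two records that differ only in their source collapse into a single event whose source list contains both publication IDs, which mirrors how one table row can list several papers in its Source column.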

    A VariCarta complex variant event is a grouping of two or more variant events from the same individual that differ but have overlapping or adjacent genomic coordinates. This indicates that the grouped variant events might be describing the same underlying genotype but are incongruent due to different reporting conventions. A complex event is initially displayed as one row that can be expanded to show the information about each grouped variant event. Here is an example of a complex variant event.
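
    One way to picture the grouping into complex variant events is an interval-merging pass per individual. The sketch below is an assumption, not the documented VariCarta algorithm: VariantEvent and find_complex_events are hypothetical names, and "adjacent" is taken to mean that one event starts at most one base after the previous cluster ends.

```python
from itertools import groupby
from typing import NamedTuple


class VariantEvent(NamedTuple):
    subject_id: str
    chromosome: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    ref: str
    alt: str


def find_complex_events(events):
    """Return clusters of 2+ distinct events per subject that overlap or touch."""
    clusters = []
    # Sorting the tuples orders by subject, chromosome, start, end, ref, alt.
    ordered = sorted(set(events))
    same_subject_chrom = lambda e: (e.subject_id, e.chromosome)
    for _, group in groupby(ordered, key=same_subject_chrom):
        cluster, cluster_end = [], 0
        for event in group:
            if cluster and event.start <= cluster_end + 1:
                # Overlapping or adjacent (assumed definition): extend the cluster.
                cluster.append(event)
                cluster_end = max(cluster_end, event.end)
            else:
                if len(cluster) >= 2:
                    clusters.append(cluster)
                cluster, cluster_end = [event], event.end
        if len(cluster) >= 2:
            clusters.append(cluster)
    return clusters
```

    Events from different individuals are never grouped, and an isolated event stays a plain variant event because its cluster has only one member.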

    The original variant data from the source publication can be accessed by clicking on the magnifying glass icon. All query results are available for download in plain text/CSV format from the spreadsheet icon in the table navigation header or footer bar.

    Column description

    Below is a description of each column found in the variant table.

    Column name: Description
    Gene: HUGO gene symbol. Clicking the symbol link will search the database for that gene. Icons for NCBI and Ensembl are also displayed and link to the corresponding gene pages. Internally, genes in VariCarta are anchored on NCBI gene IDs.
    ID: Subject/sample identifier. We display the string identifier used in the literature (column sample_id in the exported data), and hovering your mouse cursor over the text will display the internal VariCarta subject identifier (subject_id in the exported data).
    Location: Genomic coordinates on the Hg19 human genome reference assembly. The format is CHROMOSOME:START_POSITION-END_POSITION. Coordinates use 1-based indexing.
    REF: Reference DNA nucleotides.
    ALT: Variant DNA nucleotides.
    Context: Functional context of the variant determined by ANNOVAR. Values can include: downstream, exonic, intergenic, intronic, ncRNA_exonic, ncRNA_intronic, ncRNA_splicing, splicing, upstream, UTR3, UTR5.
    ☑ (Validation): Indicates whether the variant was validated using an orthogonal method, such as Sanger sequencing. Hovering your mouse cursor over the checkmark in individual variant table rows will display the validation method reported in the study.
    Inheritance: Reported inheritance status for the variant.

        Values can be:
        d - "De novo" - Variant was not found in the parents.
        i - "Inherited" - Variant is reported to be inherited.
        p - "Paternal" - Variant is reported to be inherited from the paternal side.
        m - "Maternal" - Variant is reported to be inherited from the maternal side.
        b - "Both parents" - Both parents have the variant.
        u - "Unknown" - Inheritance is reported as unknown.
        mo - "Mosaic" - Variant is reported to be of mosaic origins.
        mm - "Mosaic Mat." - Variant is reported to be mosaic and phased from the maternal side.
        mp - "Mosaic Pat." - Variant is reported to be mosaic and phased from the paternal side.
        mi - "Mosaic Inh." - Variant is reported to be inherited but may have been mosaic in the parent.
        mb - "Mosaic Both" - Variant is reported to be inherited and appears to be mosaic in both parents.

        Please consider that our curators are using their best judgement based on the available information. Some studies report mosaicism based on a qualitative confidence (High or Low), while others report an observed percentage of mosaic alleles.
    Effects: Transcriptional consequence of the mutation. Values can include: frameshift, nonframeshift, nonsynonymous, splicing, stopgain, stoploss, synonymous, unknown.
    Transcript: RefSeq transcripts affected. Each transcript is on a separate line, matched by row in the following cDNA and Protein columns. Provided by ANNOVAR.
    cDNA: HGVS cDNA variant for the transcript aligned in the "Transcript" column. Provided by ANNOVAR.
    Protein: HGVS protein variant for the transcript aligned in the "Transcript" column. Provided by ANNOVAR.
    CADD-Phred v1.0: The phred-scaled CADD score for prediction of deleterious variants. Higher values indicate a higher predicted likelihood of being deleterious.

        Note that CADD 1.3 is available in the exported data. Only coding variants for which a CADD prediction is available are scored, so some indels may also lack a score.
    ExAC v0.3: ExAC v0.3 population frequencies derived from over 60,000 exomes.
    Sources: Publications where this variant was reported. An icon displays whether the study was Genomic (G), Exomic (E) or Targeted (T).

    Clicking on the paper link in the variant table will display information about the paper, while clicking the magnifying glass will return the original source/spreadsheet data for that variant.
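
    The inheritance codes listed in the column description can be decoded with a small lookup. This is a minimal sketch: the code-to-label pairs come from the table above, while the names INHERITANCE_LABELS, describe_inheritance and is_mosaic are hypothetical, and it is an assumption that a given data export uses exactly these short codes.

```python
# Short code -> label, as listed for the Inheritance column.
INHERITANCE_LABELS = {
    "d": "De novo",
    "i": "Inherited",
    "p": "Paternal",
    "m": "Maternal",
    "b": "Both parents",
    "u": "Unknown",
    "mo": "Mosaic",
    "mm": "Mosaic Mat.",
    "mp": "Mosaic Pat.",
    "mi": "Mosaic Inh.",
    "mb": "Mosaic Both",
}

# All two-letter codes describe some form of mosaicism; "m" alone is Maternal.
MOSAIC_CODES = frozenset({"mo", "mm", "mp", "mi", "mb"})


def describe_inheritance(code: str) -> str:
    """Return the display label for an inheritance code."""
    normalized = code.strip().lower()
    if normalized not in INHERITANCE_LABELS:
        raise ValueError(f"Unknown inheritance code: {code!r}")
    return INHERITANCE_LABELS[normalized]


def is_mosaic(code: str) -> bool:
    """True if the code is one of the mosaic categories."""
    return code.strip().lower() in MOSAIC_CODES
```

    Note that "m" and "mo" are easy to confuse: the first means maternally inherited, the second mosaic.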

    Publications

    Studies currently included in VariCarta are listed in a table on the Publications page. This table contains basic publication information as well as some details about the scope of the study (e.g. whole genome/exome vs targeted sequencing as given in the Technology column) and characteristics of the studied subjects (number of subjects, cohorts used, diagnosis).

    Clicking on a book icon in the Details column opens a page with more detailed information about the study, including methodology used, size and type of the cohort, and types of variants reported. We also provide curation notes, which detail issues that had to be resolved during the curation stage, or other noteworthy information regarding the study. Clicking on the Variant Event Count will display all the variants from that publication that are available in VariCarta.

    Statistics

    The Statistics page offers basic database statistics, including the total number of subjects with unique VariCarta subject IDs, the number of all variant events, as well as counts of some specific classes of variant events (de novo, LoF). We also include several gene rankings based on different criteria. These rankings are not intended to be used for ASD candidate gene prioritization because they do not exclude common variants and variants from targeted studies (non-coding variants are excluded from gene ranking calculations).

    The page also shows the distribution of variants across publications, functional effects and genomic features. Finally, it includes a heatmap of the variant overlap between publications, which illustrates the extent of variant double-reporting across the literature. The number in each grid cell represents the number of identical variant events that are reported in both of the intersecting papers. Hovering over a cell will produce a tooltip that lists the two intersecting papers and the exact number of overlapping variants. The overlapping variants can be accessed by clicking on the number in the grid cell.
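
    The numbers behind such an overlap heatmap can be derived from a mapping of variant events to the publications that report them. The following is a sketch under assumptions: overlap_counts is a hypothetical helper, the input layout (event key mapped to a list of publication IDs) is invented for illustration, and the publication IDs in any usage are placeholders.

```python
from collections import Counter
from itertools import combinations


def overlap_counts(event_sources):
    """Count, for every pair of publications, the variant events they share.

    event_sources maps a variant event key to the publication IDs reporting it.
    Returns a Counter keyed by alphabetically ordered (paper_a, paper_b) pairs.
    """
    counts = Counter()
    for sources in event_sources.values():
        # Each event reported by n papers contributes to n*(n-1)/2 grid cells.
        for pair in combinations(sorted(set(sources)), 2):
            counts[pair] += 1
    return counts
```

    Because pairs are stored in sorted order, each pair of papers corresponds to one cell of the heatmap triangle, and an event reported by a single paper contributes to no cell.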

    Literature search and curation

    Initially we searched the literature for publications reporting genomic variants, SNVs and InDels, found in subjects with an ASD diagnosis. The ongoing data collection relies on customized Google Scholar and PubMed citation alerts using keywords (and their variants) “autism”, “genetic”, “variants”, and “sequencing”. We exclude a paper if it has one of the following issues:

    • it does not provide a clear ASD diagnosis for at least a subset of its subjects
    • it does not provide enough variant information to determine genomic coordinates and allele change unambiguously
    • it does not associate variants with subject IDs

    We prioritize whole-genome and whole-exome studies over candidate gene studies and tend to process studies with a higher number of subjects first. Papers currently included in VariCarta are listed on the Publications page.

    For each publication, we apply the following curation procedures, with an intermediate goal of organizing all relevant variant information in a tabular format that is ready for import. The first step in this process is to copy the relevant text from the source file (typically a supplementary file) as-is into a template import document. The completed document is composed of a set of predefined worksheets, which contain the publication’s metadata, variant data and a description of the steps needed to automatically extract, transform and load the data into a uniform variant data model. This document is parsed by a computational pipeline, which validates the data and stores it in a relational database. The link to the pipeline’s source code as well as an example import document are found on the Downloads page.

    Availability and usage

    Data downloads and licensing information are available on the Download page. The source code for the web application and the variant processing pipeline is open source (Apache 2 license) and available on GitHub under https://github.com/PavlidisLab/ndb.

    Variant nomenclature

    To make this project possible, we had to consolidate different variant reporting conventions, coordinate schemes and genome assemblies.

    Conventions used in VariCarta

    • Allele change: In cases of insertions or deletions, we use a one-base anchor to disambiguate the allele change. For example, an insertion of CGTCATCA on chromosome 1, at coordinate 151377903, would be listed as T, TCGTCATCA, chr1:151377903-151377903 for its reference allele (REF), alternate allele (ALT), and genomic location. For single nucleotide variants (SNVs), we simply display the before and after nucleotides as REF and ALT.
    • Coordinate system: Coordinates are 1-based and include both the start and end positions, meaning that a variant from 10,000-10,003 would affect 4 bases.
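
    The two conventions above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names (affected_bases, anchored_insertion); it assumes the listed coordinate is the position of the anchor base, which matches the example given but is not stated explicitly.

```python
def affected_bases(start: int, end: int) -> int:
    """Number of bases covered by a 1-based interval with inclusive ends."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return end - start + 1


def anchored_insertion(chromosome: str, position: int,
                       anchor_base: str, inserted: str):
    """Build (REF, ALT, location) for an insertion using a one-base anchor.

    position is assumed to be the coordinate of the anchor base.
    """
    ref = anchor_base                 # the anchor alone
    alt = anchor_base + inserted      # the anchor followed by the inserted bases
    location = f"{chromosome}:{position}-{position}"
    return ref, alt, location
```

    With the values from the text, affected_bases(10000, 10003) gives 4, and anchored_insertion("chr1", 151377903, "T", "CGTCATCA") gives ("T", "TCGTCATCA", "chr1:151377903-151377903").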

    Some definitions

    A de novo variant is a type of DNA mutation that emerges for the first time in a family. This type of mutation arises in a germ cell (egg or sperm) of one of the parents or in the fertilized egg itself, and it is not present in the parental genomes of an individual. De novo mutations are usually identified using trio or quad family DNA sequencing. Some publications primarily report de novo variants, while others also report parentally inherited variants. In VariCarta, the Inheritance column in the variant table indicates whether a variant is known to be de novo.

    Loss of Function (LoF) variants are DNA mutations that are predicted to disrupt the function of a protein coding gene. Mutations that introduce an early stop codon (nonsense variant), insertions or deletions (indels) that change the reading frame (frameshift variants) and splicing variants (DNA changes in the close vicinity of an exon-intron boundary affecting splicing of pre-mRNA) are examples of LoF variants. The following types of variants are considered to be LoF in VariCarta: stopgain, stoploss, splicing, frameshift insertion, frameshift deletion and frameshift substitution. The variant type is indicated in the Effects column in the variant table, where the LoF variants are highlighted in red.
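
    The LoF definition above amounts to a membership test on the effect label. The sketch below is illustrative only: LOF_EFFECTS and is_lof are hypothetical names, and treating a bare "frameshift" label (as listed for the Effects column) as LoF is an assumption.

```python
# Effect labels considered LoF, as listed in the definition above.
LOF_EFFECTS = frozenset({
    "stopgain",
    "stoploss",
    "splicing",
    "frameshift insertion",
    "frameshift deletion",
    "frameshift substitution",
})


def is_lof(effect: str) -> bool:
    """True if an effect label falls into one of the LoF categories."""
    # Normalize case, underscores and repeated whitespace before comparing.
    normalized = " ".join(effect.strip().lower().replace("_", " ").split())
    # Assumption: a bare "frameshift" label stands for any frameshift subtype.
    return normalized in LOF_EFFECTS or normalized == "frameshift"
```

    Labels such as "nonframeshift deletion" or "synonymous" are not matched, since they are compared as whole strings rather than by substring.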

    Contact

    To contact us, please send an e-mail to pavlab-support@msl.ubc.ca.