Introduction


Purpose

The ExpressionData resource was built to provide biologists and bioinformaticians with a source of quality-controlled, globally normalized, and manually annotated gene expression datasets for a variety of biological contexts. The data is provided by Nebion free of charge and was extracted from the GENEVESTIGATOR database. Typically, a data matrix consists of expression vectors, each summarizing the results from many individual experiments. The datasets can be used for a variety of purposes, such as testing new analysis methods or comparing and benchmarking other experiments.

Source of data

All data from the ExpressionData resource is public data that was manually curated and annotated, quality controlled, and normalized by the GENEVESTIGATOR Team. Each of these processes is explained below. In brief, our curators perform high-level curation using controlled vocabularies, and the software builds on search engine technology to process thousands of microarrays simultaneously and generate aggregated information. The schema below shows the difference between a solution that imports experimental data directly from repositories (i.e. without additional curation) and a solution based on deep data integration, such as GENEVESTIGATOR.

Schema Data Integration


Sample annotation

All samples curated in GENEVESTIGATOR are annotated using controlled vocabularies for tissue types, stages of development, perturbations, genotypes, and neoplasms. Data made available through expressiondata.org is derived from the GENEVESTIGATOR annotated database.

Quality control

All microarray hybridizations are quality controlled using a variety of Bioconductor packages. Low-quality arrays are excluded, so all summarized datasets made available here contain only high-quality data.

Data normalization

GENEVESTIGATOR applies a normalization scheme that includes two levels:
  • Intra-experiment: RMA
  • Inter-experiment: global scaling to a target value of the trimmed mean of each experiment
See our Application Note on data normalization for more details.
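
The inter-experiment step can be illustrated with a minimal sketch. It is not the GENEVESTIGATOR implementation: the trim fraction, the target value, the pooling of all arrays of an experiment, and the use of linear-scale intensities are assumptions made for illustration only, and the function names are hypothetical.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping the lowest and highest trim_fraction of values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_experiment(arrays, target=1000.0, trim_fraction=0.02):
    """Scale every array of one experiment by a single factor so that the
    experiment's trimmed mean equals `target`.

    Assumptions (not from the source): `arrays` holds linear-scale
    intensities, and `target` and `trim_fraction` are placeholder values.
    """
    pooled = [value for array in arrays for value in array]
    factor = target / trimmed_mean(pooled, trim_fraction)
    return [[value * factor for value in array] for array in arrays]
```

Because one factor is applied to the whole experiment, the relative differences between arrays within that experiment (already normalized by RMA) are left unchanged; only the overall level is aligned across experiments.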

Data summarization

Some of the datasets we make available here were summarized from thousands of samples and were combined to represent a particular biological dimension, such as tissue types or responses to perturbations. There are two types of data summarization:
  • Summarizing absolute data from identical categories. This is done only for data from anatomical parts, stages of development, and neoplasms. In the summarization process, we calculate a representative expression vector from all samples that are annotated with the same category, e.g. liver or thymus. Our analysis reveals that such vectors are highly representative of the given tissue type (see the figure below showing the results of a PCA).
  • Collecting relative data from many experiments. In this case, it is not possible to calculate an average expression vector, as the underlying conditions may be different. Instead, the summarization of responses consists of building a compendium of various responses. Each response is a comparison between an experimental group and a control group from the same study.
The schema below shows how data summarization is performed on anatomical data. In brief, a mean expression vector is calculated from all samples across all experiments that refer to the same anatomical part (as annotated by our curators using controlled vocabularies).

Data summarization
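
The two types of summarization described above can be sketched as follows. This is an illustration under stated assumptions, not the GENEVESTIGATOR code: the function names are hypothetical, values are assumed to be log2-scaled, the mean is taken as the arithmetic mean of those log2 values, and a response is assumed to be expressed as the difference of group means (a log2 ratio).

```python
from collections import defaultdict


def column_means(vectors):
    """Per-probe-set arithmetic mean over a list of equally long vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def summarize_by_category(samples):
    """Absolute data: one mean expression vector per annotated category.

    `samples` is a list of (category, expression_vector) pairs, pooled
    across all experiments, e.g. ("liver", [8.1, 5.3, ...]).
    """
    grouped = defaultdict(list)
    for category, vector in samples:
        grouped[category].append(vector)
    return {category: column_means(vectors) for category, vectors in grouped.items()}


def response(experimental_group, control_group):
    """Relative data: experimental vs. control group from the same study.

    Assumption: with log2 values, the comparison is the difference of the
    two group means, i.e. a log2 ratio per probe set.
    """
    experimental_mean = column_means(experimental_group)
    control_mean = column_means(control_group)
    return [e - c for e, c in zip(experimental_mean, control_mean)]
```

A compendium of responses would then simply be a collection of such response vectors, one per experimental-versus-control comparison, rather than an average across them.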


Data from this type of summarization is highly robust and representative of biological processes. The figure below shows a proof of concept using a Principal Component Analysis (PCA) of mouse tissues based on 2,900 Affymetrix Mouse U74Av2 array hybridizations. Each dot represents an individual tissue type summarized from many different experiments. Tissues involved in closely related biological processes are projected close to each other, showing that the summarization process delivers biologically representative expression vectors.

Data summarization


Format

Datasets have the following format:
  • Header Rows: contain sample description information, usually a category of a given ontology (e.g. name of tissue type)
  • Data Rows: contain measurement data (usually average signal intensities in log2-scale)
  • Header Column: contains probe set identifiers
Currently, one type of dataset is available, containing absolute expression values (e.g. expression in different tissue types, neoplasms, or stages of development). The expression values are log2-scaled.
Schema Absolute

Absolute values are symbolized by blue-white heat maps.
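
As a rough illustration of the layout described above, the following sketch reads a dataset into header rows and a mapping from probe set identifier to values. The delimiter (tab), the number of header rows (one), and the function name are assumptions; check the downloaded file before relying on them.

```python
import csv
import io


def read_expression_matrix(text, n_header_rows=1, delimiter="\t"):
    """Split a delimited dataset into header rows and data rows.

    Assumptions (not from the source): tab-delimited text and a single
    header row. The first column holds the probe set identifiers.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    # Header rows: drop the corner cell above the probe set column.
    header_rows = [row[1:] for row in rows[:n_header_rows]]
    # Data rows: probe set identifier -> log2 expression values.
    matrix = {row[0]: [float(value) for value in row[1:]] for row in rows[n_header_rows:]}
    return header_rows, matrix


# Usage with made-up identifiers and values:
example = "\tliver\tthymus\nprobe_1\t10.5\t8.25\nprobe_2\t7.0\t9.5\n"
headers, matrix = read_expression_matrix(example)
```

Here `headers[0]` lists the categories (one per column) and `matrix["probe_1"]` returns the values of that probe set in the same column order.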