The ExpressionData resource was built to provide biologists and
bioinformaticians with a source of quality-controlled, globally
normalized, and manually annotated gene expression datasets for a
variety of biological contexts. The data is provided by Nebion free
of charge and was extracted from the GENEVESTIGATOR database.
Typically, a data matrix consists of expression vectors, each
summarizing the results from many individual experiments. The
datasets can be used for a variety of purposes, such as testing new
analysis methods or comparing or benchmarking other experiments.
Source of data
All data from the ExpressionData resource is public data
that was manually curated and annotated, quality controlled, and
normalized by the GENEVESTIGATOR team. Each of these processes is
explained below. In brief, our curators perform high-level curation
using controlled vocabularies, and the software builds on search
engine technology to process thousands of microarrays
simultaneously and generate aggregated information. The scheme
below shows the difference between a solution that directly imports
experimental data from repositories (i.e. without additional
curation) and a solution based on deep data integration.
All samples curated in GENEVESTIGATOR are annotated using
controlled vocabularies for tissue types, stages of development,
perturbations, genotypes, and neoplasms. Data made available
through expressiondata.org is derived from the
GENEVESTIGATOR annotated database.
All microarray hybridizations are quality controlled
using a variety of Bioconductor packages. Low-quality arrays are
excluded. All summarized datasets made available here contain only
high-quality data.
GENEVESTIGATOR applies a two-part normalization scheme:
- Intra-experiment: RMA
- Inter-experiment: global scaling to a target value of the
trimmed mean of each experiment
See our Application Note on data normalization
for more details.
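The inter-experiment step can be illustrated with a minimal sketch. It is not the GENEVESTIGATOR implementation: it assumes signal values on a linear scale, a trim fraction of 10% at each tail, and a target value of 1000, none of which are specified in the text above. The function names are hypothetical.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding trim_fraction of the values at each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_experiment(arrays, target=1000.0, trim_fraction=0.1):
    """Rescale all arrays of one experiment by a single factor so that
    the experiment-wide trimmed mean equals target (assumed value)."""
    pooled = [v for array in arrays for v in array]
    factor = target / trimmed_mean(pooled, trim_fraction)
    return [[v * factor for v in array] for array in arrays]


# Two toy experiments (each inner list is one array) that differ
# only in overall intensity.
exp_a = [[100.0, 400.0, 900.0], [120.0, 380.0, 950.0]]
exp_b = [[v * 2.5 for v in array] for array in exp_a]

scaled_a = scale_experiment(exp_a)
scaled_b = scale_experiment(exp_b)
```

After scaling, the two toy experiments share the same trimmed mean, so their values become directly comparable despite the original difference in overall intensity.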
Some of the datasets we make available here were summarized from
thousands of samples and were combined to represent a particular
biological dimension, such as tissue types or responses to
perturbations. There are two types of data summarization:
- Summarizing absolute data from identical categories. This
is done only for data from anatomical parts, stages of
development and neoplasms. In the summarization process, we
calculate a representative expression vector from all samples
that are annotated with the same category, e.g. liver or thymus.
Our analysis reveals that such vectors are highly representative
of the given tissue type (see figure below showing the results
of a PCA).
- Collecting relative data from many experiments. In this
case, it is not possible to calculate an
average expression vector, as the underlying conditions may be
different. The summarization of responses consists of building a
compendium of various responses. Each response is a comparison
between an experimental group and a control group from the same
experiment.
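A compendium of responses can be sketched as follows. This is an illustration under assumptions, not the actual pipeline: it assumes log2-scale values, so that a response is the difference between the mean of the experimental group and the mean of the control group, and all names and numbers are hypothetical.

```python
def mean_vector(samples):
    """Element-wise mean over samples; each sample holds one value per probe set."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def response(experimental, control):
    """Log2 ratio per probe set: experimental group mean minus control group mean."""
    exp_mean = mean_vector(experimental)
    ctl_mean = mean_vector(control)
    return [e - c for e, c in zip(exp_mean, ctl_mean)]


# Hypothetical compendium: every entry is a comparison made within
# one experiment, so responses are collected rather than averaged.
compendium = {
    "experiment_1: treatment vs. control": response(
        experimental=[[8.0, 5.0], [9.0, 5.0]],
        control=[[7.0, 5.0], [7.0, 6.0]],
    ),
    "experiment_2: mutant vs. wild type": response(
        experimental=[[6.0, 4.0]],
        control=[[6.0, 7.0]],
    ),
}
```

Keeping each response as its own entry reflects the point made above: the underlying conditions differ between experiments, so the vectors are not pooled into a single average.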
The schema below shows how data summarization is
performed on anatomical data. In brief, a mean expression vector is
calculated from all samples across all experiments that refer to
the same anatomical part (as annotated by our curators using
controlled vocabularies).
Data from this type of summarization is highly robust and
representative of biological processes. The figure below shows a
proof of concept using a principal component analysis (PCA) of
mouse tissues based on 2,900 Affymetrix Mouse U74Av2 array
hybridizations. Each dot represents an individual tissue type
summarized from many different experiments. Tissues of closely
related biological processes are projected close to each other,
showing that the summarization process delivers biologically
representative expression vectors.
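The summarization of anatomical data described above, in which one mean expression vector is calculated per anatomical part, can be sketched as follows. The sample layout and the function name are assumptions made for illustration; the values are toy data.

```python
from collections import defaultdict


def summarize_by_category(samples):
    """samples: (category, expression_vector) pairs pooled across experiments.
    Returns a dict mapping each category to its mean expression vector."""
    grouped = defaultdict(list)
    for category, vector in samples:
        grouped[category].append(vector)
    return {
        category: [sum(column) / len(vectors) for column in zip(*vectors)]
        for category, vectors in grouped.items()
    }


# Toy samples from different experiments, already annotated with an
# anatomical part; each vector holds one value per probe set.
samples = [
    ("liver", [10.0, 4.0, 6.0]),
    ("thymus", [5.0, 9.0, 6.0]),
    ("liver", [12.0, 4.0, 8.0]),
]
summary = summarize_by_category(samples)
```

The result holds one vector per category, regardless of how many experiments contributed samples to that category.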
Datasets have the following format:
- Header Rows: sample description information, usually a
category of a given ontology (e.g. name of tissue type).
- Data Rows: measurement data (usually average
signal intensities on a log2 scale).
- Header Column: probe set identifiers.
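A file with this layout could be read along the following lines. The sketch assumes a tab-delimited text file with a single header row, which the description above does not state; the probe set identifiers and values in the example are made up.

```python
import csv
import io


def read_dataset(handle, n_header_rows=1):
    """Parse a tab-delimited matrix: header rows on top,
    probe set identifiers in the first column."""
    rows = list(csv.reader(handle, delimiter="\t"))
    # Header rows: drop the corner cell, keep one annotation per data column.
    headers = [row[1:] for row in rows[:n_header_rows]]
    # Data rows: probe set identifier -> list of log2 expression values.
    data = {
        row[0]: [float(value) for value in row[1:]]
        for row in rows[n_header_rows:]
        if row
    }
    return headers, data


# Hypothetical file content for illustration only.
example = (
    "probe_set\tliver\tthymus\n"
    "probe_0001\t11.2\t6.4\n"
    "probe_0002\t4.1\t9.8\n"
)
headers, data = read_dataset(io.StringIO(example))
```

If a dataset carries more than one header row, the `n_header_rows` parameter would need to be adjusted accordingly.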
Currently, one type of dataset is available, using absolute
expression values (e.g. expression in different tissue types,
neoplasms, or stages of development). The expression values are log2-scaled.
Absolute values are
symbolized by blue-white