June 3, 2004
Agnes Paquet (2), Andrea Barczak (2), (Jean) Yee Hwa Yang (1,2)
1. Department of Medicine, http://www.biostat.ucsf.edu/jean
2. Functional Genomics Core Facility, apaquet@medsfgh.ucsf.edu
To install the arrayQuality package for Windows operating systems, first start R and make sure you are connected to the Internet. Then, select “Packages” from the menu, and click on “Install package(s) from Bioconductor…”. Select arrayQuality in the list in the pop-up window, and click on “OK”.
This section provides a quick start guide to using the package. Three main functions automatically read in data, generate diagnostic plots, and save them as images in png format. To use them, follow the steps below:
To load the package in your R session, type:
> library(arrayQuality)
This will also automatically load the marray and limma packages.
> PRv9mers(prname="12Mm")
where prname is the name of the print-run.
> PRvQCHyb(prname="9Mm")
where prname is the name of the print-run.
This function supports Mouse (Mm) genome only at the moment.
> results <- gpQuality(organism="Mm")
This function supports Mouse (Mm) and Human (Hs) genomes only.
6. Each function will generate specific plots and files. For more details about each function’s arguments and results, please refer to the corresponding section of this document and to the online help.
Most users will not be required to use PRv9mers and PRvQCHyb. The key function for quality control is gpQuality.
A microarray experiment is composed of several steps, including experimental design, sample preparation, and various statistical analyses. These steps are represented in the microarray lifecycle (Figure 1). As microarray technology is complex and sensitive, it is important to assess the performance of each step before going on to the next one. Checking each step also makes it easier to trace a problem back through the cycle to its potential upstream causes.
Figure 1: Microarray experiment lifecycle
For spotted array experiments, quality controls can be summarized into 4 steps:
- 9mers hybridization
- Quality Control hybridization
Each step must be performed in sequential order, as represented in Figure 2.
Figure 2: Quality Control for spotted arrays experiment
Our package provides graphical tools to look at two of these components: print-run quality and array hybridization quality.
A) Print quality:
This component is highly tailored to the Shared Genomics Core Facility at UCSF, but the framework can be adapted to other Core facilities or laboratories printing their own arrays. It is an essential component of a printed array experiment, as any print pin, probe or slide surface defect will affect the quality of hybridization to the slide, and this cannot be fixed by statistics. Only prints that pass the quality control check will be used for actual hybridization.
B) Hybridization quality.
This is a global assessment of the hybridization performance. It helps determine for example any problem with the dyes, or uneven hybridization. Then, once you have determined that your hybridization is good, you can look at each individual spot quality, remove bad spots, and perform statistical analysis.
When a print-run is completed, it is necessary to verify the quality of the resulting arrays. This can be done by using two kinds of hybridization to the new slides. The first type of hybridization, which we term “9mers hyb”, uses small oligonucleotides (random 9-mers), which will hybridize to each probe. This hybridization will help to determine the quality of spot morphology as well as the presence or absence of spotted oligonucleotides. The resulting data will be used to create a list of all missing spots.
The second type of hybridization, which we will term Quality Control Hybridization (QCHyb), uses mRNA from predefined cell lines (e.g. liver vs. pool, K562 vs. Human Universal Reference pool from Stratagene). These hybridizations can be used as a more quantitative description of the slides. The same comparison hybridizations are done for different print-runs to assess their reproducibility. QCHybs are also used to verify the accuracy of GAL files, the number of missing spots, binding capacity, background signal intensity…
The arrayQuality package provides specific tools to help assess quality of slides for both 9-mers and QC hybridization.
In the package, the graphical function to assess 9mers hybridization quality is PRv9mers(). It runs from a single command line. To use it:
- Copy all 9-mers hybridizations gpr files from the SAME print-run (same GAL file) to a directory.
- Change R working directory to the one containing your gpr files as described in section 1.
- Type:
> PRv9mers(prname="12Mm")
The prname argument represents the name of your print-run. For more details about other arguments, please refer to the online manual.
PRv9mers() provides the following results:
1. Diagnostic plots as image in .png format for each tested slide
2. An Excel file (typically named 9Mm9mer.xls, where 9Mm is the name of your print-run, as passed to prname) containing for each spot on the slide:
- Name and ID of the spot
- The probability of being present or absent (p from EM algorithm). If several files are tested together, you will have a probability of being present/absent for each file. A spot is considered absent if p < 0.5.
- The average probability of being present or absent.
- The raw signal intensity (Signal column) or average raw signal intensity if several files are tested together for each spot.
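As a minimal sketch of how the per-file probabilities described above could be summarized, the following base R code applies the p < 0.5 rule to a small matrix of present-probabilities. The matrix `prob`, its values, and the rule used to build `missingSpots` (absent on every tested slide) are assumptions made for illustration; they are not the output or the internal logic of PRv9mers().

```r
# Hypothetical present-probabilities for 4 spots on 2 slides of one print-run
# (made-up values for illustration, not PRv9mers output).
prob <- matrix(c(0.98, 0.95,
                 0.10, 0.30,
                 0.70, 0.40,
                 0.02, 0.05),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("spotA", "spotB", "spotC", "spotD"),
                               c("slide1", "slide2")))

# Per-slide call: a spot is considered absent if p < 0.5
absent <- prob < 0.5

# Average probability across slides for each spot
avgProb <- rowMeans(prob)

# Assumption: list a spot as missing only if it is absent on every slide
missingSpots <- rownames(prob)[rowSums(absent) == ncol(prob)]
print(missingSpots)
```

A stricter or looser rule (for example, thresholding the average probability instead) may suit other print-runs better.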
Figure 3 shows an example from a typical 9-mers hybridization. This image is divided into 5 plots.
This example uses 9-mer hybridization data performed in the Functional Genomics Core Facility in UCSF. This print-run was created using Operon Version 2 Mouse oligonucleotides.
> library(arrayQuality)
> datadir <- system.file("data", package="arrayQuality")
> PRv9mers(fnames="12Mm250.gpr", path=datadir, prname="12Mm")
Figure 3: Example of diagnostic plot for 9-mers hybridization
9-mers hybridizations help verify that oligonucleotides have been spotted properly on the slides. The next print-run quality control step will be:
1. Detect any difference in overall signal intensity compared to other print-runs
a. 70-mers oligonucleotides hybridizations
b. Selection of several test slides to ensure that the same quantity of material was spotted across the platter, as a print-run will generate 255 slides using the same well for one probe. QCHybs are performed using one slide from the beginning of the print, one from the middle, one from the end (e.g. numbers 20,100 and 255 in the Functional Genomics Core Facility).
2. Check if the GAL file was generated properly, i.e. check that no error was made with ordering or orientation of the plates during the print.
3. Reproducibility:
A good way to verify the quality of a new print is to hybridize known samples to new slides. We can then compare signal intensity from the new slides to existing data, and check that there is no loss in signal. Log ratios (M) for known samples should be similar across print-runs. Examples of samples used for QCHybs include Human Reference pool, Mouse liver, and Mouse lung, with dye swaps.
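The reproducibility check described above can be sketched in base R as a comparison of log ratios for the same probes on two print-runs. The vectors `Mold` and `Mnew` and their values are hypothetical, and the use of a correlation and a median difference as summaries is an assumption made for illustration, not the method implemented in the package.

```r
# Hypothetical log ratios (M) for the same 7 probes and the same known
# samples, hybridized to an existing print-run and to a new print-run.
Mold <- c(-2.1, -1.0, -0.3, 0.0, 0.4, 1.2, 2.5)
Mnew <- Mold + c(0.05, -0.10, 0.02, 0.08, -0.04, 0.10, -0.06)

# Agreement between print-runs: correlation of M values
corM <- cor(Mold, Mnew)

# Systematic shift between print-runs: median of the probe-wise differences
shiftM <- median(Mnew - Mold)

print(c(correlation = corM, medianShift = shiftM))
```

A correlation close to 1 and a shift close to 0 would be consistent with log ratios that are similar across print-runs.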
The function in the package which performs the quality assessment for QCHybs is PRvQCHyb().
- Copy the QCHybs gpr files from the SAME print-run (same GAL file) in a directory.
- Change R working directory to the one containing your gpr files as described in section 1.
- Type:
> PRvQCHyb(prname="9Mm")
where prname is the name of the print-run. For more details about its arguments, please refer to the online manual.
PRvQCHyb() returns a diagnostic plot as an image in .png format for each tested slide.
Throughout our document, we will be using the color code described in Table 1 to highlight control spots.
Spot type | Color
Positive controls | Red
Empty controls | Blue
Negative controls | Navy Blue
Probes | Green
Missing spots | White
Currently, PRvQCHyb() supports Mouse genome (Mm) only. We will add Human data as soon as it becomes available.
Figure 4 shows an example of a good print-run QCHyb. Data for this example was provided by the Functional Genomics Core Facility in UCSF. We have tested slide number 137 from print-run 9Mm. This print-run uses Operon Version 2 Mouse oligos. Results are shown in Figure 4.
> library(arrayQuality)
> datadir <- system.file("data", package="arrayQuality")
> PRvQCHyb(fnames="9Mm137.gpr", path=datadir, prname="9Mm")
Figure 4: Diagnostic plot for print-run Quality Control hybridization
This component is aimed at verifying the performance of your hybridization, given the good quality of the slide, before any preprocessing steps or further quality assessment on individual spots. This is where you determine if your experiment quality is good enough to be included in your dataset. For example, you will need to remove any hybridization with very low SNR, or large spatial artifacts.
Our package provides two kinds of quality control plots. The first one is a qualitative quality control measurement in the form of a diagnostic plot. It is a quick visual way to determine hybridization quality, gathering information from several statistical tools. More details on individual diagnostic plots can be found in the vignette “marrayPlots” in the package marray. The second one is a more quantitative comparison of slide quality. We extract some statistical measures from the test slide and compare them against results obtained for a collection of slides of “good quality” to assess the quality of the hybridization. This comparison is visualized through a comparative boxplot. Results are displayed in an HTML report. Figure 5 shows a screen shot of a typical HTML report. Users can click on each image to obtain a higher resolution plot.
Diagnostic plots can be generated from GenePix format files (.gpr files) or from marrayRaw or RGList objects. Most arguments can also be customized to match your own data: which probes are used as controls, which column of the gpr file is used to define your spot types... You can also specify your own collection of good quality slides using the functions globalQuality and qualRefTable. For more details about these functions, please refer to the online help and the example at the end of this Section.
- Copy the gpr files from the SAME print-run (same GAL file) in a directory.
- Change R working directory to the one containing your gpr files as described in Section 1.
- To generate both diagnostic plots and comparative boxplots on all files in the directory, run:
> result <- gpQuality(organism="Mm")
- To generate diagnostic plots only, run:
> result <- gpQuality(organism="Mm", compBoxplot=FALSE)
In this case, quantitative quality measures will not be calculated and the HTML report will not be generated.
- To write your quantitative quality measures and your normalized data to a file, set output = TRUE when calling gpQuality:
> result <- gpQuality(organism="Mm", output=TRUE)
- To generate diagnostic plots from a marrayRaw/RGList object named rawdata, type:
> maQualityPlots(rawdata)
gpQuality() outputs:
- Two plots for each test slide (a diagnostic plot and a comparative boxplot)
- An HTML quality report
- A marrayRaw object describing all tested slides
- A quality measures matrix: this matrix contains all comparison measure values extracted for each test slide. Each column of the matrix represents a different slide.
For each slide, the report shows how many of your slide’s results are below the recommended range. To store the results in a specific directory, set the argument resdir accordingly. For more details about gpQuality arguments, please refer to the online manual.
Figure 5: Example of HTML report generated by gpQuality
gpQuality calls two key functions, maQualityPlots and qualBoxplot. qualBoxplot supports Mouse (Mm) and Human (Hs) genomes only. To generate quality plots for other genomes, you need to set the gpQuality argument compBoxplot = FALSE. In this case, only the diagnostic plots will be generated.
Figure 6 represents an example of a good hybridization diagnostic plot.
Figure 7 shows an example of a comparative boxplot.
We have chosen a wide range of measures to quantify the quality of a typical hybridization: single channel measures (range of foreground signal, MAD of background, signal to noise ratio…), two channel measures (median A values for each type of controls, amount of normalization needed…), percentage of flagged spots... Some measures have been negated so that, for every measure, the quality scale increases from problematic to good quality. Negated measures are marked with a leading minus sign in the list below.
For each measure, we have represented the following on the graph:
- Boxplot of the reference slides values.
- 1st and 3rd quartiles before scaling for each boxplot.
- Y-axis on the right: for each measure, we have printed 2 values. The first one is the percentage of reference slide measures under your slide’s result. The second one is your slide’s value for this measure before scaling.
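The two values printed on the right-hand axis can be illustrated with a short base R sketch. The reference values and the test slide value below are hypothetical, and the code only shows the arithmetic of "percentage of reference slides under your slide's result"; it is not taken from qualBoxplot.

```r
# Hypothetical values of one quality measure on 8 reference slides
refValues <- c(1.2, 1.5, 1.7, 1.9, 2.0, 2.3, 2.6, 3.1)

# Hypothetical value of the same measure on the test slide
testValue <- 1.8

# Percentage of reference slides whose value is below the test slide's value
pctBelow <- 100 * mean(refValues < testValue)

# 1st and 3rd quartiles of the reference values (default quantile type)
refQuartiles <- quantile(refValues, probs = c(0.25, 0.75))

print(pctBelow)
print(refQuartiles)
```

Because measures are oriented so that larger is better, a low percentage here would suggest that the test slide compares poorly with the reference collection for that measure.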
1. rangeRf: Range of Cy5 foreground, where the range is defined by:
rangeRf = max(log2(median Cy5 foreground)) - min(log2(median Cy5 foreground))
where median Cy5 foreground corresponds to the "F635 Median" column of the gpr file.
2. rangeGf: Range of Cy3 foreground, where the range is defined by:
rangeGf = max(log2(median Cy3 foreground)) - min(log2(median Cy3 foreground))
where median Cy3 foreground corresponds to the "F532 Median" column of the gpr file.
3. -RbMad: Cy5 background MAD
RbMad = mad[log2(Cy5 background)]
where:
- Cy5 background corresponds to the "B635 Median" column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal
4. -GbMad: Cy3 background MAD
GbMad = mad[log2(Cy3 background)]
where:
- Cy3 background corresponds to the "B532 Median" column of the gpr file
5. Median RS2N: Median Signal To Noise log-ratio for Cy5
RS2N = log2( mean Cy5 foreground / median Cy5 background )
RS2Nmedian = median(RS2N)
where:
- mean Cy5 foreground is the "F635 Mean" column of the gpr file
- median Cy5 background is the "B635 Median" column of the gpr file
6. Median GS2N: Median Signal To Noise log-ratio for Cy3
GS2N = log2( mean Cy3 foreground / median Cy3 background )
GS2Nmedian = median(GS2N)
where:
- mean Cy3 foreground is the "F532 Mean" column of the gpr file
- median Cy3 background is the "B532 Median" column of the gpr file
7. -Median A for empty control:
A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
Median A for empty control = median( A(Empty controls) )
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- Empty controls are the probes labelled "Empty"
8. -Median A for negative control:
A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
Median A for negative control = median( A(Negative controls) )
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- Negative controls are the probes labelled "Negative"
9. Median A values for Positive controls:
A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
Median A for positive control = median( A(Positive controls) )
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- Positive controls are the probes labelled "Positive"
10. Difference between A values for Positive controls and A values for Negative controls
11. -varRepA: variance of replicate spots A values
varRepA = var[ A(replicates) ]
where:
- A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
12. -msePtip: MSE of M values by print-tip group, no background subtraction
M = log2(median Cy5 foreground) - log2(median Cy3 foreground)
msePtip = MSE( mean M by print-tip )
where:
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- MSE(X) = E( (X-t)^2 ), with t a parameter and X an estimator of t.
13. -mseFit: MSE of lowess curve
fit = lowess(A, M)
mseFit = MSE(fit$y)
where:
- A = [ log2(median Cy5 foreground) + log2(median Cy3 foreground) ] / 2
- M = log2(median Cy5 foreground) - log2(median Cy3 foreground)
- median Cy5 foreground corresponds to the "F635 Median" column of the gpr file
- median Cy3 foreground corresponds to the "F532 Median" column of the gpr file
- MSE(X) = E( (X-t)^2 ), with t a parameter and X an estimator of t.
14. -Percentage of flagged spots
[ number of spots with flag < 0 / number of spots ] * 100
where flag is the information from the "Flags" column of the gpr file. Only spots with a flag less than 0 are taken into account.
15. -M values MMRmad
MMR = Mmean - Mmedian
MMRmad = MAD(MMR)
where:
- Mmean = log2(mean Cy5 foreground) - log2(mean Cy3 foreground): M values calculated using mean signal
- Mmedian = log2(median Cy5 foreground) - log2(median Cy3 foreground): M values calculated using median signal
- mean Cy5 (Cy3) foreground is the "F635 Mean" ("F532 Mean") column of the gpr file
- median Cy5 (Cy3) foreground is the "F635 Median" ("F532 Median") column of the gpr file
- MAD = median{ |Y - mu| }, when Y is normal
16. -Percentage of spots with abs[MMR] > 0.5
where:
- MMR = Mmean - Mmedian
- Mmean = log2(mean Cy5 foreground) - log2(mean Cy3 foreground): M values calculated using mean signal
- Mmedian = log2(median Cy5 foreground) - log2(median Cy3 foreground): M values calculated using median signal
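To make the definitions above concrete, here is a minimal base R sketch that computes several of the measures on a small table using the GenePix column names quoted in the list. The data frame `gpr` and all of its numbers are hypothetical, and the code is an illustration of the formulas rather than the package's own implementation. The sign flips applied to negated measures (for example -RbMad) are not applied here. Note that R's mad() multiplies the raw median absolute deviation by 1.4826 by default, which makes it consistent with the standard deviation under normality.

```r
# Hypothetical spot-level values with GenePix column names (made-up numbers)
gpr <- data.frame(
  "F635 Median" = c(1200, 300, 5000, 150, 800, 90),
  "F532 Median" = c(1000, 350, 2500, 160, 900, 100),
  "F635 Mean"   = c(1300, 320, 5200, 170, 850, 95),
  "F532 Mean"   = c(1050, 360, 2600, 165, 950, 110),
  "B635 Median" = c(60, 55, 70, 58, 62, 57),
  "B532 Median" = c(80, 75, 85, 78, 82, 79),
  "Flags"       = c(0, 0, 0, -50, 0, -100),
  check.names = FALSE
)

# Log2 median foreground in each channel
log2R <- log2(gpr[["F635 Median"]])
log2G <- log2(gpr[["F532 Median"]])

# Measures 1 and 2: range of log2 foreground
rangeRf <- max(log2R) - min(log2R)
rangeGf <- max(log2G) - min(log2G)

# Measures 3 and 4 (before negation): MAD of log2 background
RbMad <- mad(log2(gpr[["B635 Median"]]))
GbMad <- mad(log2(gpr[["B532 Median"]]))

# Measures 5 and 6: median signal to noise log-ratio
RS2Nmedian <- median(log2(gpr[["F635 Mean"]] / gpr[["B635 Median"]]))
GS2Nmedian <- median(log2(gpr[["F532 Mean"]] / gpr[["B532 Median"]]))

# A and M values from median foreground
A <- (log2R + log2G) / 2
Mmedian <- log2R - log2G

# Measures 15 and 16 (before negation): mean-based vs median-based M values
Mmean <- log2(gpr[["F635 Mean"]]) - log2(gpr[["F532 Mean"]])
MMR <- Mmean - Mmedian
MMRmad <- mad(MMR)
pctMMR <- 100 * mean(abs(MMR) > 0.5)

# Measure 14 (before negation): percentage of spots with flag < 0
pctFlagged <- 100 * mean(gpr[["Flags"]] < 0)

print(round(c(rangeRf = rangeRf, rangeGf = rangeGf,
              RS2Nmedian = RS2Nmedian, GS2Nmedian = GS2Nmedian,
              pctFlagged = pctFlagged, pctMMR = pctMMR), 3))
```

With real data, the same column names could be read from a gpr file, and the control-specific measures would additionally need the spot type of each probe.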
Data for this example was provided by the Functional Genomics Core Facility in UCSF. We have tested slide number "137" from print-run "9Mm". This array was fabricated using Operon Version 2 Mouse oligos and the hybridization measures differential gene expression in two RNA samples, Mouse Liver and Mouse Reference Pool. Results are shown in Figure 5 and Figure 6.
To generate diagnostic plots, comparative boxplots, HTML report and to write your quality measure and normalized data to a file in a directory named "Results":
> library(arrayQuality)
> datadir <- system.file("data", package="arrayQuality")
> result <- gpQuality(fnames = "9Mm137.gpr", path = datadir,
Figure 7: Comparative boxplot
Pattern | Name
Buffer | Buffer
Empty | Empty
EMPTY | Empty
AT | Negative
M200009348 | Positive
M200003425 | Positive
NLG | con
> controlCode <- readcontrolCode(file="mySpotTypes.txt", controlId="ID")
- Find which column of the gpr file can be used to identify your new spot types. It is typically the "ID" or the "Name" column.
- To generate both types of plots, call gpQuality, specifying your new controlCode matrix in controlMatrix and the column used to define your spot types in controlId.
> result <- gpQuality(controlMatrix = controlCode, controlId="ID")
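A spot types file such as mySpotTypes.txt can be prepared in any text editor, but it can also be written from R. The sketch below builds the Pattern/Name pairs from the table above and writes them as a tab-delimited file with base R. The tab-delimited layout with a header row is an assumption about what readcontrolCode expects, so check the online help before relying on it; the file is written to a temporary directory so nothing in your working directory is overwritten.

```r
# Pattern/Name pairs taken from the spot types table above
spotTypes <- data.frame(
  Pattern = c("Buffer", "Empty", "EMPTY", "AT", "M200009348", "M200003425"),
  Name    = c("Buffer", "Empty", "Empty", "Negative", "Positive", "Positive"),
  stringsAsFactors = FALSE
)

# Assumption: a tab-delimited text file with a header row
spotFile <- file.path(tempdir(), "mySpotTypes.txt")
write.table(spotTypes, file = spotFile, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to check that it round-trips as intended
check <- read.delim(spotFile, stringsAsFactors = FALSE)
print(check)
```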