
PuMaQC: Public Microarray Data Quality Control

Joana P. Corte-Real, Petr V. Nazarov, 2011


  Introduction

      Motivation. Data-driven studies, such as inference of gene regulatory networks and translational cancer research, normally require large amounts of transcriptomic data. One simple and cost-free solution is to import microarray data from public repositories such as the NCBI Gene Expression Omnibus (GEO), which integrates hundreds of thousands of experiments. Despite the existence of the MIAME guidelines for standard microarray information, there is still a lack of information about the quality of submitted data. Because low-quality samples can add noise and impair the statistical and biological significance of microarray analysis, quality control and quality assessment (QC/QA) become an important step when using public microarray data. With this in mind, we have developed the R-based PuMaQC (Public Microarray Quality Control) pipeline.

      PuMaQC is a robust, easy to use, all-in-one pipeline for public microarray data handling based on 3 sequential steps:
      i) search for raw Affymetrix data in GEO;
      ii) import and preprocessing of CEL files;
      iii) QC/QA with identification and removal of low quality arrays.

      Methods. The pipeline incorporates functions from the GEOmetadb and arrayQualityMetrics R/Bioconductor packages and uses Affymetrix Power Tools (APT) for raw data extraction and normalization. We have included the option to filter out unwanted samples at step (i), and a platform dictionary that broadens the sample search to several related GEO platforms (GPL).

  Downloads


  Installation and Requirements

      Hardware and OS. The pipeline is not very sensitive to hardware performance. However, a small amount of RAM may hamper the simultaneous analysis of large data sets (>500 arrays). PuMaQC runs on the same operating systems as the R language and programming environment.

      Software prerequisites. The following software should be installed:
      1) Affymetrix Power Tools, Affymetrix (free registration required)
      2) R programming language, www.r-project.org
      3) R/Bioconductor core packages, Bioconductor
      The other required packages (GEOmetadb, GEOquery, arrayQualityMetrics) can be installed by the pipeline.
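
    To see in advance which of these packages are already present, a check along the following lines can be run in R. This is only a sketch using base R; it is not part of PuMaQC, and the commented BiocManager call is the current standard Bioconductor installer rather than something the pipeline is known to use.

```r
# Packages named in the text above.
pkgs <- c("GEOmetadb", "GEOquery", "arrayQualityMetrics")

# TRUE for each package found in the local R library, FALSE otherwise.
installed <- setNames(pkgs %in% rownames(installed.packages()), pkgs)
print(installed)

missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # One way to install them by hand (needs internet access):
  # if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  # BiocManager::install(missing_pkgs)
}
```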

      Running. Before running the pipeline, you need to create a text ini-file that contains the parameters of your study. The structure of the ini-file is described below. You can download example files: singlesearch.ini or combisearch.ini.
    After this, simply type in your R console:

      source("http://sablab.net/PuMaQC/PuMaQC.r")
      PuMaQC()

    Alternatively, you can directly specify the location of your ini-file and/or the source code:

      PuMaQC(ini.file = "full path to ini-file", src.path = "path to source")

  Description file (ini-file)

    General structure. The ini-file has 4 parts: Project, Search, Import, QC. Use ";" or "#" characters to start comments. Section names are given in "[...]".
    Parameters are followed by "=". Keep all parameters. If you do not use one, simply remove the value after "=", e.g. Exclude =
    In file paths, use either "\" or "/"; the separator is corrected automatically. File paths can be either absolute or relative to the current ini-file.
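
    The rules above can be illustrated with a small base-R reader. This is a minimal sketch and not PuMaQC's own parser: the function name read_ini and the sample values are hypothetical, only whole-line comments are handled, and for simplicity backslashes are converted in every value rather than only in paths.

```r
# Hypothetical ini content that follows the rules described above.
ini_text <- c(
  "; study description",
  "[Project]",
  "Title = Lung demo",
  "# search settings",
  "[Search]",
  "Platform = GPL570",
  "Exclude =",
  "[Import]",
  "DownloadTo = data\\cel"
)

read_ini <- function(lines) {
  out <- list()
  section <- NULL
  for (ln in lines) {
    ln <- trimws(ln)
    if (!nzchar(ln) || grepl("^[;#]", ln)) next       # skip blanks and comments
    if (grepl("^\\[.*\\]$", ln)) {                     # section header "[...]"
      section <- sub("^\\[(.*)\\]$", "\\1", ln)
      out[[section]] <- list()
    } else if (!is.null(section) && grepl("=", ln, fixed = TRUE)) {
      key <- trimws(sub("=.*$", "", ln))               # text before the first "="
      val <- trimws(sub("^[^=]*=", "", ln))            # text after the first "="
      out[[section]][[key]] <- gsub("\\", "/", val, fixed = TRUE)
    }
  }
  out
}

cfg <- read_ini(ini_text)
str(cfg)
```

    An unused parameter such as Exclude is kept with an empty string as its value, which matches the advice to keep all parameters in the file.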

      [Project]
      Title = Title of the study (optional)
      Description = General description of the project or study content (optional)

      [Search]
      GEOmetadb = Path to GEOmetadb.sqlite file. If you do not have it, it will be automatically downloaded and stored in the working directory. It is better to use the same GEOmetadb.sqlite for all your searches.
      Platform = Put here either GPL (e.g. GPL570) or one of the following platform IDs:
      HG-U133_Plus_2, HG-U133A_2, Mouse430_2, Rat230_2, HuEx-1_0-st, MoEx-1_0-st, RaEx-1_0-st, HuGene-1_0-st, MoGene-1_0-st, RaGene-1_0-st.
      Leave it empty for exploratory analysis of existing data.
      Organism = Optional parameter, which is needed only for exploratory search.
      Include = (only for single search, for combined, see IncludeX)
      List of include keywords to be searched. Logical "OR" is used for Include.
      Exclude = (only for single search, for combined, see ExcludeX)
      Optional list of exclude keywords. Logical "AND" is used for Exclude.
      IncludeX = (only for combined search, for single, see Include)
      List of include keywords to be searched in the query X. X can be replaced by any name or number.
      ExcludeX = (only for combined search, for single, see Exclude)
      Optional list of exclude keywords. Logical "AND" is used for Exclude.
      Expression = (only for combined search, omit for a single search)
      Combine the results of several queries using a logical expression. Use the AND, OR, NOT logical operators to intersect, unite or exclude the results. Alternatively, &, |, ! can be used.
        Example: let Include1, Include2 and Include3 be specified.
        Then Expression = (1 OR 2) AND 3 gives the intersection of search result 3 with the union of results 1 and 2.
      Results = File with final search results. You can manually curate this file to exclude unwanted arrays.
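
    The Expression logic corresponds to ordinary set operations on the lists of arrays returned by each query. The sketch below shows this with base R; the sample accessions are invented for illustration, and it does not reproduce how PuMaQC evaluates the expression internally.

```r
# Hypothetical sample accessions returned by three separate queries.
res1 <- c("GSM001", "GSM002", "GSM003")
res2 <- c("GSM003", "GSM004")
res3 <- c("GSM002", "GSM004", "GSM005")

# Expression = (1 OR 2) AND 3
# OR corresponds to union(), AND to intersect().
combined <- intersect(union(res1, res2), res3)
print(combined)

# Expression = 3 AND NOT 1
# "AND NOT" corresponds to setdiff().
only3 <- setdiff(res3, res1)
print(only3)
```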

      [Import]
      DownloadTo = The folder where the downloaded CEL files will be stored.
      Note: this path must not contain spaces (APT restriction)
      CDF = Path and name of the corresponding CDF file. You can download the CDF from www.affymetrix.com (after free registration).
      Note: this path must not contain spaces (APT restriction)
      APT = Path to the Affymetrix Power Tools (APT) executables. APT is available from the Affymetrix website.
      Results = File with the table of successfully downloaded and preprocessed data.
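
    Because of the APT restriction on spaces, it can help to check the DownloadTo and CDF paths before starting a long run. The following is a small sketch; the helper name path_ok and the example paths are hypothetical.

```r
# TRUE if the path contains no whitespace, FALSE otherwise.
path_ok <- function(path) !grepl("[[:space:]]", path)

# Hypothetical values for the two parameters affected by the restriction.
paths <- c(DownloadTo = "C:/data/cel_files",
           CDF        = "C:/My Documents/HG-U133_Plus_2.cdf")

ok <- vapply(paths, path_ok, logical(1))
print(ok)

if (!all(ok)) {
  message("Paths containing spaces: ", paste(names(ok)[!ok], collapse = ", "))
}
```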

      [QC]
      Alpha = Quality control threshold between 0 and 1, which is the "alpha" parameter of a z-test. The larger the value of alpha, the more arrays are filtered out.
      arrayQualityMetrics = Set to "TRUE" to additionally run Kauffmann and Huber's arrayQualityMetrics pipeline.
      Results = Text file with the results for the good-quality arrays.
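
    The effect of Alpha can be illustrated with a toy z-test on one quality score per array. This is only a sketch under stated assumptions: the scores are invented, and the robust z-score (median and MAD) with a one-sided test is chosen for illustration; the statistic that PuMaQC actually computes is not described here.

```r
# Hypothetical quality scores, one per array; larger means more deviant.
score <- c(0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 0.95, 1.05, 3.0)
names(score) <- paste0("array", seq_along(score))

# Flag arrays whose score is significantly above the bulk of the data.
flag_outliers <- function(x, alpha) {
  z <- (x - median(x)) / mad(x)        # robust z-score (assumption)
  p <- pnorm(z, lower.tail = FALSE)    # one-sided p-value
  names(x)[p < alpha]
}

strict  <- flag_outliers(score, alpha = 0.01)
lenient <- flag_outliers(score, alpha = 0.10)
print(strict)
print(lenient)
```

    With these scores the larger alpha flags every array that the smaller alpha flags plus at least one more, which is the behaviour described for the Alpha parameter.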

  Demos

      Single search. Look for arrays from healthy lung tissues. See the results and reports here
      Combined search. Look for lung samples from smokers with cancer and smokers without cancer. See the results and reports here

  Authors

      Joana P. Corte-Real   joanacrgmail.com

      dr. Petr V. Nazarov   petr.nazarovcrp-sante.lu

    Microarray Center, CRP-Sante
    84 Val Fleuri,
    L-1526
    Luxembourg

    We would like to thank the Head of the Microarray Center, dr. Laurent Vallar, for supporting and inspiring this work.

In case of mistakes or possible copyright violations, please contact the webmaster. Last update: 03-11-2011