Input: Gene Annotations


File information: gene annotations

ErmineJ is organized around gene annotations. Annotations enter ErmineJ in two ways: through the main annotation file you provide at startup, and through your “custom” gene sets that live in your ermineJ.data directory.

When ErmineJ starts up, you are asked to provide an “annotation file”. The annotation file provides three types of information important to the operation of ermineJ:

  • It provides mappings between probes and genes on microarray platforms. This assumes that your input data files are keyed by probe. If they are not, you won’t be using this feature, but you must still provide an annotation file.
  • It provides human-readable descriptions for the probes or genes. This is needed even if you aren’t using a microarray.
  • It provides Gene Ontology annotations for the genes. (These can be omitted if you are supplying some other gene groups.)

Note: Even if you are not using GO, you must still provide at least a minimal annotation file that lists the genes you are using.

Note: Gene sets that have only one gene will not be loaded. In addition, by default the user interface hides sets that are empty, even though the software knows about them (such as GO terms). You can reveal these with the context (pop-up) menu in the table or tree view.

You can use two types of annotation files: ones we provide on our web site, or ones provided by Affymetrix or Agilent for their microarray platforms. Requests for support of other formats will gladly be entertained. You can also use files you have created, so long as they follow one of the supported formats. The files can be gzipped or zipped; there is no need to unpack them.

You tell ermineJ which format you are using with the pull-down menu on the startup screen. As of ErmineJ 3.0, you can get annotation files from within the software by clicking on the “Get from Gemma” button on the startup screen. See also the startup screen documentation.

The list of platforms provided includes those in Gemma which have annotations available. Any files for platforms not from Gemma will be found here.

Using files we provide

In ermineJ, we refer to these as “ermineJ format”, but the files are very simple and useful in other contexts.

You can download annotation files here. Note: For ermineJ, we recommend using the annotation files that have “No parents” (parent GO terms are not listed explicitly in the file). This is because ermineJ rapidly computes parents for each term when you load the file. Using the “No parents” versions will result in faster startup times for ermineJ.

We provide annotation files for some popular (and many not-so-popular) microarray platforms, provided by Gemma. These files contain the probe (or probe set) identifiers, the gene symbols and names, and GO membership information. For our current annotations, this means that the Gene Ontology terms associated with each gene are listed. For each term, the ‘parent’ terms are also implicitly included, so that genes associated with very specific terms are also included in the less specific categories.
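The implicit inclusion of parent terms can be illustrated with a small sketch. This is not ErmineJ's own code; it only assumes that the term hierarchy is available as a child-to-parents mapping without cycles. The function names and the toy term identifiers are hypothetical and chosen for illustration.

```python
def all_ancestors(term, parents, cache):
    """Return every ancestor of `term` given a child -> parents mapping."""
    if term in cache:
        return cache[term]
    found = set()
    for parent in parents.get(term, ()):
        found.add(parent)
        found |= all_ancestors(parent, parents, cache)
    cache[term] = found
    return found


def expand_with_parents(gene_to_terms, parents):
    """Add all ancestor terms to each gene's directly annotated terms."""
    cache = {}
    expanded = {}
    for gene, terms in gene_to_terms.items():
        full = set(terms)
        for term in terms:
            full |= all_ancestors(term, parents, cache)
        expanded[gene] = full
    return expanded


# Hypothetical three-level hierarchy: specific -> middle -> root.
toy_parents = {"TERM:specific": ["TERM:middle"], "TERM:middle": ["TERM:root"]}
toy_genes = {"GENE1": ["TERM:specific"], "GENE2": ["TERM:middle"]}
expanded = expand_with_parents(toy_genes, toy_parents)
```

In this toy example, GENE1 is annotated only with the most specific term but ends up as a member of all three sets, which is the behaviour described above.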

If you are not using an expression array platform, we provide some “generic” annotation files that are keyed to official gene symbols.

For species or platforms we don’t support, ask us for assistance or set it up yourself. The files are not hard to prepare if you have Gene Ontology (or other gene set descriptor) annotations available.

For a new platform in a species we already support, you can often create a new annotation file by pulling information out of our existing files with a simple Perl script.
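As a rough sketch of that idea (written here in Python rather than Perl), the following code looks up each new probe's gene symbol in an existing annotation file and reuses the description and GO field. It assumes the four-column, tab-delimited layout described under "Description of the format", and that every line for a given symbol in the existing file carries the same annotations. The function names, header labels, and sample identifiers are hypothetical.

```python
import io


def index_by_symbol(handle):
    """Map gene symbol -> (description, GO field) from an existing file."""
    header = next(handle).rstrip("\r\n")
    by_symbol = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # no usable gene symbol on this line
        fields += [""] * (4 - len(fields))  # tolerate missing trailing columns
        # Keep the first entry seen for each symbol.
        by_symbol.setdefault(fields[1], (fields[2], fields[3]))
    return header, by_symbol


def build_annotation(probe_to_symbol, header, by_symbol):
    """Produce annotation file text for a new probe -> symbol mapping."""
    lines = [header]
    for probe_id, symbol in probe_to_symbol.items():
        # Symbols absent from the existing file get blank annotations.
        description, go_field = by_symbol.get(symbol, ("", ""))
        lines.append("\t".join([probe_id, symbol, description, go_field]))
    return "\n".join(lines) + "\n"


# Made-up existing file and new platform mapping, for illustration only.
existing = io.StringIO(
    "ProbeName\tGeneSymbols\tGeneNames\tGOTerms\n"
    "old_1\tGENE1\tExample gene one\tGO:0000001|GO:0000002\n"
    "old_2\tGENE2\tExample gene two\t\n"
)
header, by_symbol = index_by_symbol(existing)
new_file_text = build_annotation({"new_A": "GENE1", "new_B": "GENE3"}, header, by_symbol)
```

Probes whose gene symbol is not found in the existing file (new_B here) are still written, with blank description and GO columns, so they remain listed in the annotation file.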

Description of the format

  • The file is tab-delimited text. Comma-delimited files or Excel spreadsheets (for example) are not supported.
  • There is a one-line header included in the file for readability.
  • The first column contains the probe identifier. The probe IDs must exactly match the ones you provide in your Gene score file. Any probes not having an entry will be ignored. If you are not using probes, this will probably contain gene symbols. The main requirement here is that it matches the identifiers you provide in your input data files.
  • The second column usually contains a gene symbol. This should not be blank. If the gene name is not known, a sequence identifier or arbitrary code can be used instead. This is used to determine whether a gene has more than one probe, as well as providing information for display purposes.
  • The third column contains the gene name (or description). This can be blank. It is only used for display purposes.
  • The fourth column contains a delimited list of GO identifiers. These include the “GO:” prefix. Thus they read “GO:0494494” and not “494494”. The ids within this field can be delimited by spaces, commas, or pipe (‘|’) symbols. This field can be blank if there are no GO annotations (or if you aren’t using GO).
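The column layout above can be read with a few lines of code. The following is a minimal sketch, not ErmineJ's parser: the function names, header labels, and sample rows are hypothetical, and the GO ids are placeholders used only to show the three permitted delimiters.

```python
import io
import re

# GO ids may be separated by spaces, commas, or pipe symbols.
GO_ID_SPLIT = re.compile(r"[\s,|]+")


def parse_annotation_line(line):
    """Split one data line into (probe_id, symbol, description, go_ids)."""
    fields = line.rstrip("\r\n").split("\t")
    # Pad so that blank trailing columns (description, GO ids) are tolerated.
    fields += [""] * (4 - len(fields))
    probe_id, symbol, description, go_field = fields[:4]
    go_ids = [t for t in GO_ID_SPLIT.split(go_field.strip()) if t]
    return probe_id, symbol, description, go_ids


def read_annotations(handle):
    """Read annotation lines keyed by probe id, skipping the header line."""
    next(handle, None)  # the one-line header is there for readability only
    records = {}
    for line in handle:
        if not line.strip():
            continue
        probe_id, symbol, description, go_ids = parse_annotation_line(line)
        records[probe_id] = (symbol, description, go_ids)
    return records


# Made-up sample content, for illustration only.
sample = (
    "ProbeName\tGeneSymbols\tGeneNames\tGOTerms\n"
    "probe_1\tGENE1\tExample gene one\tGO:0000001|GO:0000002\n"
    "probe_2\tGENE2\t\tGO:0000003, GO:0000004 GO:0000005\n"
    "probe_3\tGENE1\tExample gene one\t\n"
)
annotations = read_annotations(io.StringIO(sample))
```

Here probe_2 has a blank description and mixes comma and space delimiters, and probe_3 has no GO ids at all; both are acceptable under the rules listed above.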

Using files you create

Annotation files that you create can be used so long as they adhere to one of the accepted formats. There are a few things to consider:

  • The probe IDs must exactly match the ones you provide in your input data files (gene scores and raw data). Any probes not having an entry will be ignored.
  • The gene symbols are used internally by the software to decide which genes are present on the array more than once. Therefore, if two probes refer to the same gene, make sure the symbol you use is the same for both probes. (It doesn’t actually matter what the symbol is).
  • The gene names or descriptions are optional, and blank values will just show up as “No description” or something similar.
  • In the ermineJ format, the GO ids must be in “long” format (with the GO: prefix). The GO term names themselves should be omitted. The parents of all terms listed are automatically included in the analysis (subject to other constraints such as the maximum Gene Set size you set in the analysis), so there is no need to list these explicitly.
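Before loading a file you made yourself, it can help to check the first two points mechanically. The sketch below is only an illustration under stated assumptions: it treats "exactly match" as plain case-sensitive string equality, and the function name and sample identifiers are hypothetical.

```python
from collections import defaultdict


def check_annotations(annotation_rows, data_probe_ids):
    """Check (probe_id, symbol) pairs against the probe ids used in the data.

    Returns the data probes with no annotation entry, the symbols shared by
    more than one probe, and the probes whose symbol is blank.
    """
    probes_by_symbol = defaultdict(list)
    annotated = set()
    blank_symbols = []
    for probe_id, symbol in annotation_rows:
        annotated.add(probe_id)
        if not symbol.strip():
            blank_symbols.append(probe_id)
        else:
            probes_by_symbol[symbol].append(probe_id)
    # Exact, case-sensitive comparison of identifiers.
    missing = sorted(set(data_probe_ids) - annotated)
    multi = {s: p for s, p in probes_by_symbol.items() if len(p) > 1}
    return missing, multi, blank_symbols


# Made-up identifiers, for illustration only.
rows = [("probe_1", "GENE1"), ("probe_2", "GENE2"), ("probe_3", "GENE1")]
data_ids = ["probe_1", "probe_3", "Probe_2", "probe_4"]
missing, multi, blank = check_annotations(rows, data_ids)
```

In this example "Probe_2" is reported as missing because its capitalization differs from "probe_2" in the annotation rows, and GENE1 is reported as having two probes because both rows use the same symbol.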

 

Using files from the Affymetrix web site

This only helps you if you are using an Affymetrix GeneChip. (The writers of ermineJ are not affiliated with Affymetrix in any way.)

The files we tested our software on were obtained from the Affymetrix site and are in CSV (comma-separated value) format. As of 2011 ermineJ worked with these files (for example HG-U133_Plus_2.na31.annot.csv.zip, NetAffx account required for access), but if Affymetrix changes the format (as it has in the past), the files may no longer work with ermineJ.

To use these files, select the “Affy CSV” option on the startup dialog.

Using files from the Agilent web site

As with the Affymetrix files, ermineJ supports the use of annotation files from Agilent. These must be the text-formatted files downloaded from https://earray.chem.agilent.com/earray/ (registration required). If you have trouble locating the file for your microarray, contact your Agilent technical support representative, as we have no control over the availability of these files.