Gene Sets


Gene set files

User-defined gene sets can be created from within ErmineJ, or imported using several different formats. This way you can define gene sets yourself or use other schemes such as KEGG by providing an appropriate file for ErmineJ.

Knowing the formats lets you define gene sets outside of ErmineJ. When ErmineJ starts up, it looks for these files in a predefined location (see below) and loads them.

Gene set files created by ErmineJ are saved in the directory ermineJ.data/genesets (e.g., C:/Documents and Settings/[your user name]/ermineJ.data/genesets). You should place your own “handmade” gene set files in this location so they are automatically visible to the software.

Note: If you create a gene set while using one platform and then switch to another the next time you run ErmineJ, ErmineJ will still try to load your old gene sets. If any probes from the previous platform match identifiers on the current one, the gene set will be loaded to the extent possible. We may change this in a future version of ErmineJ to provide more species and platform information in each file. Let us know if this is important to you.

Note: Gene sets that have only one gene will not be shown. In addition, by default the user interface hides sets that are empty (such as some GO terms), even though the software knows about them. You can reveal these with the context (pop-up) menu in the table or tree view.


Formats

You can use the ErmineJ-native format, a second format with one gene set defined per tab-delimited line, or import a simple list of genes.

Option 1: ErmineJ-native format

This simple format lets you store one or more gene sets in a single file. The members of each set are identified either by probe (handy for mapping to expression arrays) or by gene symbols that match the ones in your annotation file. Here is a sample:

# this is a comment
probe
MyGeneSet
Genes I Like
36495_at
271_s_at
37983_at
34071_at
128_at
129_g_at
206_at
38466_at
32017_at
346_s_at
32018_at
====
probe
MySecondGeneSet
More genes I like
37983_at
34071_at
128_at
129_g_at
206_at
38466_at
32017_at
346_s_at

(or download the sample as a file)

The full description of the format is as follows.

  • The file is plain text (ASCII)
  • There can be more than one gene set defined, demarcated by “===” on a line by itself.
  • Lines beginning with “#” are ignored.
  • Blank lines are ignored.

Within each gene set definition, you must declare at least four non-blank, non-comment lines:

  • The first line describes the type of identifier in the file and is either “probe” or “gene”. With “probe”, the identifiers must match those in the first column of your annotation file. With “gene”, they should match the symbols in the second column of your annotation file.
  • The second line is the unique ID or name of the gene set. This name must be distinct from other groups used in the session (including GO terms).
  • The third line is a longer description of the gene set. There is no limit to the length of this description but in practice it should be just a few words.
  • The fourth and subsequent lines are the identifiers (probe ids or official gene names).
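
The rules above can be sketched as a small Python reader, which may be handy for checking a handmade file before placing it in ermineJ.data/genesets. This is only an illustration, not ErmineJ's own parser: the names GeneSet and parse_native are made up for this example, and because the sample uses “====” while the description says “===”, the sketch assumes that any line made up of three or more “=” characters separates two sets.

```python
import re
from typing import List, NamedTuple


class GeneSet(NamedTuple):
    """One gene set from a native-format file (hypothetical helper type)."""
    id_type: str        # "probe" or "gene"
    name: str           # unique ID or name of the set
    description: str    # longer description
    members: List[str]  # probe IDs or gene symbols


# Assumption: a separator is a line of three or more "=" characters.
SEPARATOR = re.compile(r"^={3,}$")


def parse_native(text: str) -> List[GeneSet]:
    """Parse native-format text into a list of GeneSet records."""
    sets: List[GeneSet] = []
    block: List[str] = []

    def flush() -> None:
        # Turn the lines collected so far into one gene set.
        if not block:
            return
        if len(block) < 4:
            raise ValueError(f"gene set needs at least four lines, got {block}")
        id_type, name, description, *members = block
        if id_type not in ("probe", "gene"):
            raise ValueError(f"first line must be 'probe' or 'gene': {id_type}")
        sets.append(GeneSet(id_type, name, description, members))
        block.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if SEPARATOR.match(line):
            flush()
            continue
        block.append(line)
    flush()  # the last set has no trailing separator
    return sets
```

Calling parse_native on the text of the sample file above should return two records, named MyGeneSet and MySecondGeneSet, each with id_type "probe".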


Option 2: A tab-delimited file with one set per line (e.g., MolSigDB)

This is a slightly simpler alternative to the native format, with the limitation that only gene symbols are supported. Your annotation file will be used to figure out which probes are relevant. Here is a sample. Like the files described for Option 1, these files should be placed in your ermineJ.data/genesets directory, where they will automatically be detected and imported by the software.

The format is essentially the same as the MolSigDB “gmt” format. In fact, you can use a file obtained directly from MolSigDB (the gene symbol version).

  • The file is tab-delimited ASCII text
  • There is one gene set defined per line
  • On each line, the fields are:
    1. A unique gene set identifier
    2. A description (can be blank, but cannot be omitted)
    3. The remaining fields are interpreted as gene symbols (keyed to the second column of your annotation file).
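
As an illustration of the field layout above, here is a minimal Python sketch that writes and reads one such line. The function names format_set_line and parse_set_line are hypothetical, not part of ErmineJ. Note how a blank description still occupies its own field, which produces two consecutive tabs. Dropping empty trailing symbol fields on reading is a choice made for this sketch, not documented ErmineJ behavior.

```python
from typing import List, Sequence, Tuple


def format_set_line(set_id: str, description: str, symbols: Sequence[str]) -> str:
    """Build one tab-delimited line: identifier, description, then gene symbols."""
    fields = [set_id, description, *symbols]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


def parse_set_line(line: str) -> Tuple[str, str, List[str]]:
    """Split one line back into (identifier, description, symbols)."""
    # Strip only the line ending, so that tabs (even empty fields) survive.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("the description field may be blank but not omitted")
    set_id, description, *symbols = fields
    # Sketch-level choice: skip empty symbol fields.
    symbols = [s for s in symbols if s]
    return set_id, description, symbols


line = format_set_line("MY_SET", "", ["ALOX12", "ALOX15"])
print(repr(line))  # 'MY_SET\t\tALOX12\tALOX15'
```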

MolSigDB provides files for KEGG, Biocarta and other gene classification schemes here. A file of gene-disease relationships from Phenocarta is available for mouse or human.

Option 3: Import files containing lists of genes using the “Define new gene set” menu item

This method has the benefit of requiring a very simple format, but you must load the files one at a time using ErmineJ’s graphical interface. (If this is a pain, a simple Perl or Python script can convert the lists into one of the other formats.)

The file in this case is just a list of genes, with one on each line. The names must be the gene symbols that are used in your gene annotation file. Other symbols will be ignored. Here’s an example with just three genes:

alox12b
ALOX15
alox12

A full description of the file format is:

  • Each file describes just one gene set.
  • The file is plain text (ASCII)
  • Each line contains the official symbol of one gene
  • Capitalization is ignored
  • Blank lines and symbols not found in the current array design are ignored

On loading in, the list of genes is converted to a list of probes. You will be given the chance to edit the gene list and give it a name before finalizing it.
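
If loading many lists one at a time is tedious, the conversion mentioned above can be sketched in Python as follows. This is a minimal example under stated assumptions, not an official tool: the function names are hypothetical, each list file's base name is used as the gene set identifier, the description field is left blank, and symbols are copied unchanged. The output is a single tab-delimited file in the layout described under Option 2.

```python
from pathlib import Path
from typing import Iterable, List, Union


def read_gene_list(path: Path) -> List[str]:
    """Read one gene symbol per line, skipping blank lines."""
    symbols = []
    for raw in path.read_text(encoding="ascii").splitlines():
        symbol = raw.strip()
        if symbol:
            symbols.append(symbol)
    return symbols


def lists_to_set_file(list_paths: Iterable[Union[str, Path]],
                      out_path: Union[str, Path]) -> int:
    """Write one tab-delimited line per gene list; return the number of sets written."""
    lines = []
    for item in list_paths:
        path = Path(item)
        symbols = read_gene_list(path)
        if not symbols:
            continue  # nothing to write for an empty list
        # Assumption: the file name without extension serves as the set ID,
        # and the description field is left blank (it cannot be omitted).
        lines.append("\t".join([path.stem, "", *symbols]))
    Path(out_path).write_text("".join(line + "\n" for line in lines),
                              encoding="ascii")
    return len(lines)
```

For example, lists_to_set_file(["alox.txt"], "my_sets.txt") would write one line for a set named "alox". The resulting file can then be placed in your ermineJ.data/genesets directory.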