HOWTO: An ontology-based query using SGDI
| Date: | 1 November 2007 |
|---|---|
| Author: | Jeff Gentry |
Introduction
This document is a brief tutorial on performing a query with the SGDI software, using an ontology to help select samples. While this tutorial is not exhaustive in showing potential functionality, this type of query should be common for many users. This example will display gene expression information from three datasets (Miller 2005, Sorlie 2001 and Sorlie 2003) for genes in the KEGG apoptosis pathway, restricted to patients who are ER positive and over the age of 80.
Logging In
Your first step is to point your browser to the URL for the SGDI instance you wish to use. For this example we will use http://sgdi-dev.fhcrc.org/sgdi_public_breast/index.html. This instance is designed to provide a small number of breast cancer datasets which have multiple common terms mapped via the SGDI breast ontology. Upon arriving at the URL, you will see some basic information about this instance, and a button that says Log In. Press this button to authenticate with the server; you will receive an authentication challenge. Enter your name and password (provided by your administrator) and press Ok.
Selecting a Workspace
Here you will see a screen that lets you select a workspace to use. An SGDI workspace should be viewed as one line of thought. It represents a set of samples from a set of experiments and a corresponding set of reporters (e.g. genes), which can then be used to extract the appropriate data. If you have no workspaces defined, a message will alert you to this effect; otherwise your current workspaces will be listed here. You can also create a new workspace on this screen. Create a new workspace here and select it to move on.
Selecting Experiments
You will now come to the basic SGDI page. On the left side you will have a variety of options, many related to workspace management but the first several devoted to selection (which are also repeated in the main frame).
The hierarchy of:
- Experiment Selection Page
- Select Samples Via Ontology
- Gene Expression Reporter Collapse Method
- Gene Expression Reporter Selection Page
- Display Selected Data
should be viewed as the appropriate order in which to do these steps. For now, we need to select experiments to explore, so click the "Experiment Selection Page" link.
Here you will see a listing of all the datasets that you have permission to see, along with descriptions of the type of tissue they relate to, the microarray platform, and the number of samples. These three columns can be filtered (e.g. show only experiments from the 'hgu95av2' platform) or sorted (by clicking on the hyperlink in the column heading). Check the boxes for the experiments that you wish to look at. For this example, we will use miller2005Hgu133a, miller2005Hgu133b, sorlie2001 and sorlie2003, as they come standard as example packages in every SGDI install. Before hitting Update Workspace & Return, there are two options at the bottom that you should make note of.
The first option specifies if you want to allow the use of any reporter from any of the platforms used or only the ones that they have in common. What this means is that if a particular gene is not represented on one of the platforms (e.g. the sorlie2001 spotted array chip) but is on the others it will not be included if you select the option to only use shared reporters. The default behavior is to use all reporters from all platforms.
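The difference between the two settings can be pictured as a set union versus a set intersection. The following is only an illustrative sketch, not SGDI's implementation: the platform names, gene symbols, and the `usable_reporters` helper are all hypothetical.

```python
# Hypothetical reporter (gene) sets per platform; invented for illustration.
platform_reporters = {
    "platform_a": {"CHUK", "BAX", "TP53"},
    "platform_b": {"CHUK", "BAX", "CASP3"},
    "platform_c": {"CHUK", "TP53"},
}


def usable_reporters(platforms, shared_only=False):
    """Return the reporters available for selection.

    shared_only=False mimics "use all reporters from all platforms" (union);
    shared_only=True mimics "only use shared reporters" (intersection).
    """
    sets = list(platforms.values())
    if not sets:
        return set()
    if shared_only:
        return set.intersection(*sets)
    return set.union(*sets)


all_reporters = usable_reporters(platform_reporters)
shared_reporters = usable_reporters(platform_reporters, shared_only=True)
```

In this toy data, BAX is missing from `platform_c`, so it appears in `all_reporters` but not in `shared_reporters`.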
The other option controls which samples are selected by default: all or none. The default is for no samples to be automatically included in these datasets, which is convenient when selecting samples via an ontology. Press the Update Workspace & Return button.
Selecting Assays and Samples
For a description of selecting assays and selecting samples without ontologies, please see the section of the same name in PerformingBasicQuery.
You will find that you are back at the main page, but with some key differences. The experiments you selected will now be listed on the main page, along with the number of currently selected samples. You will also have tabs across the main frame listing the selected experiments. Clicking on either one of these tabs or the hyperlinks from the selected experiments list will bring you to a specific page for that experiment. Within this page one can select specific samples for this experiment, but in this tutorial we wish to select samples using our breast ontology.
Click on the Select Samples Via Ontology link. You will see a listing of which selected datasets match your current ontology (in this case all four of them) and which do not. Further down, you will be presented with two pairs of radio buttons. These buttons apply AND/OR and NOT operations when chaining multiple selections together. Any selections made will be based on these operators. The default is to use OR only, but one can select AND and toggle the NOT operator.
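One way to think about chaining is as set algebra over sample identifiers. The sketch below is an assumption about the semantics, not a description of how SGDI works internally: the `chain_selection` helper, the sample names, and the reading of NOT as "complement of the new matches" are all hypothetical.

```python
def chain_selection(current, matches, universe, op="OR", negate=False):
    """Combine the current selection with a new set of matching samples.

    op="OR" takes the union, op="AND" the intersection. negate=True first
    replaces the new matches with their complement (an assumed meaning of NOT).
    """
    if negate:
        matches = universe - matches
    if op == "OR":
        return current | matches
    if op == "AND":
        return current & matches
    raise ValueError(f"unknown operator: {op!r}")


# Invented sample identifiers and term matches, for illustration only.
universe = {"s1", "s2", "s3", "s4", "s5"}
er_positive = {"s1", "s2", "s3"}
over_80 = {"s2", "s5"}

selection = set()                                                    # nothing selected
selection = chain_selection(selection, er_positive, universe, "OR")  # first request
selection = chain_selection(selection, over_80, universe, "AND")     # narrow it down
```

With the default OR, the first request simply adds the ER positive samples to an empty selection; switching to AND for the second request keeps only the samples that satisfy both criteria.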
We wish to find the samples in all four of these datasets which have positive values for estrogen receptor immunohistochemistry and are also over 80 years old at diagnosis. To do this, scroll down and click on the + next to phenotype and then again the one for histopathologic_phenotype. You will see a listing of terms in this particular branch, which will also list how many of your sets map this term. At the bottom of the list you will see estrogen_receptor_immunohistochemistry and can see that all four sets map this term. Click on this link.
Here you will see a page which describes information particular to the current term. There will be a listing of which sets map to this term (in this case all of them) as well as which sets do not. Depending on whether or not the current term involves discrete values, the screen might have either a display with checkboxes or an area to type in free text. The estrogen_receptor_immunohistochemistry term has only a distinct set of allowed values, which are displayed as the columns of a grid, with the sets as the rows. Each row details how many samples from that set map to each value. Check the Positive box and then press the Submit Value button. You will be brought to a page showing how many samples you've selected with this request. Click the Continue Sample Selection button to move on.
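The grid for a discrete term can be thought of as a table of counts, one row per set and one column per allowed value. The following sketch builds such a table from invented data; the set names, the values, and the `value_grid` helper are hypothetical and are not taken from SGDI.

```python
from collections import Counter

# Invented ER calls per sample for two hypothetical sets; not real SGDI data.
er_status = {
    "set_a": ["Positive", "Positive", "Negative", "Unknown"],
    "set_b": ["Negative", "Positive"],
}


def value_grid(values_by_set):
    """Return (columns, grid): the distinct values and per-set sample counts."""
    columns = sorted({v for values in values_by_set.values() for v in values})
    grid = {}
    for name, values in values_by_set.items():
        counts = Counter(values)
        # A Counter returns 0 for values that never occur in this set.
        grid[name] = [counts[c] for c in columns]
    return columns, grid


columns, grid = value_grid(er_status)
```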
You will find yourself back at the ontology sample selection page, but you can see that each dataset now has samples selected. Now we wish to subselect, from the previous selection, the samples which involve patients over the age of 80 at diagnosis. Select the AND radio button, which will turn this into an intersection operation. Next, click the + for phenotype and then again for clinical_phenotype. You will see the term age_at_diagnosis_years. Click this link.
The age_at_diagnosis_years term uses a different entry method, as the potential values are not discrete. You can enter any number in the box as well as some basic logic (greater than, less than, etc). In this case we want people over 80, so type > 80 in the text box and hit Submit Value and then the Continue Sample Selection button again. We're brought back to the sample selection screen, but you can see at the top that we are now down to a very small selection of samples compared to before. If you look at one of the individual experiment pages, you should see that the samples which fit our criteria are selected.
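As a rough illustration of how an entry such as > 80 can be turned into a filter, the sketch below parses a comparison operator and a number and applies them to a set of ages. The exact syntax SGDI accepts is not documented here, so the supported operators, the `parse_condition` helper, and the sample ages are all assumptions.

```python
import operator
import re

# Assumed set of supported comparisons; SGDI's actual syntax may differ.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def parse_condition(text):
    """Turn text like '> 80' into a predicate; a bare number means equality."""
    match = re.fullmatch(r"\s*(>=|<=|>|<|=)?\s*(-?\d+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse condition: {text!r}")
    compare = _OPS[match.group(1) or "="]
    threshold = float(match.group(2))
    return lambda value: compare(value, threshold)


# Invented ages at diagnosis, for illustration only.
ages = {"s1": 85, "s2": 80, "s3": 62, "s4": 91}
is_over_80 = parse_condition("> 80")
selected = sorted(sample for sample, age in ages.items() if is_over_80(age))
```

Note that a strict > 80 excludes a patient who is exactly 80, which is why `s2` is not selected in this toy example.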
Reporter Selection
At this point you should select the reporters that you wish to investigate across your selected samples. Clicking on the Reporter Selection Page link will take you to a page which lists different mechanisms for selecting reporters as well as the now familiar AND/OR and NOT options. You can currently select reporters by gene symbol (e.g. CHUK), by chromosome ("all genes on chromosome 13"), by GO term ("all genes associated with digestion") or by KEGG pathway ("all genes in the apoptosis pathway"). You are also allowed to upload this data by text file. Using the AND/OR and NOT functionality you can chain together complex constructs to get the exact reporter selection you wish ("All genes on chromosome 13 AND in the apoptosis pathway").
For this example, select KEGG Pathway by pushing the radio button. The format of the KEGG selection screen is the same as the other options. There is a text box to enter in a specific example if you know what pathway you would like to use, or you can select a pathway (or pathways) by hitting the Select from list link, and finally there is an option to upload from a local file. For this example, type apoptosis into the text box and hit the Submit button. You will see a description of the number of reporters you have selected (which you can see in more detail by hitting that hyperlink), and the ability to apply your selection or to cancel it. Select Apply Selection.
Reporter Collapsing
In many platforms, multiple reporters might map to the same term (currently that would mean a single Entrez GeneID might be represented multiple times per chip). For some purposes, one might only want to look at a one-to-one mapping, and for this you will need to collapse your reporters. Currently, only one method for collapsing is supported, which is to use the reporter with the largest variance. Click on Reporter Collapse Methods and select the Largest Variance option.
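The idea behind largest-variance collapsing can be sketched as follows: for each gene, keep the single reporter whose expression values vary most across samples. This is a minimal sketch under stated assumptions, not SGDI's code; the reporter names, gene labels, expression values, and the use of the sample variance are all hypothetical.

```python
from statistics import variance

# Invented reporter-to-gene mapping and expression values across three samples.
reporter_to_gene = {
    "probe_a": "gene_1",
    "probe_b": "gene_1",
    "probe_c": "gene_2",
}
expression = {
    "probe_a": [1.0, 1.1, 0.9],
    "probe_b": [0.5, 2.0, 3.5],
    "probe_c": [2.0, 2.5, 1.5],
}


def collapse_by_largest_variance(reporter_to_gene, expression):
    """Return {gene: reporter}, keeping the highest-variance reporter per gene."""
    best = {}  # gene -> (reporter, variance)
    for reporter, gene in reporter_to_gene.items():
        var = variance(expression[reporter])  # sample variance; needs >= 2 values
        if gene not in best or var > best[gene][1]:
            best[gene] = (reporter, var)
    return {gene: reporter for gene, (reporter, _) in best.items()}


collapsed = collapse_by_largest_variance(reporter_to_gene, expression)
```

Here `probe_b` is kept for `gene_1` because its values spread far more widely than those of `probe_a`, giving a one-to-one mapping between genes and reporters.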
Displaying Data
After clicking the Display Selected Data link, you will have multiple options. You can choose to extract a CSV file of the data or display in a browser (and if you choose the latter, you can export to a CSV later). You can choose to display with samples as columns or reporters as columns (this becomes important due to restrictions on the number of columns in many spreadsheet applications), and you can choose to view either a combined dataset or an individual dataset. You can also choose which datasets you wish to display the results for. Individually select the datasets or use the top checkbox to select (or unselect) all of them. Select Display In Browser, Display By Samples and use the top checkbox to select all experiments.
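The two display orientations are transposes of one another, which matters when one dimension is much larger than the other. The sketch below builds both layouts from the same invented data and serialises one of them to CSV text; the `build_rows` and `to_csv` helpers, the reporter and sample names, and the values are hypothetical, and the layout of SGDI's actual CSV export may differ.

```python
import csv
import io

# Invented expression values: reporter -> sample -> value.
data = {
    "probe_a": {"s1": 1.5, "s2": 2.0},
    "probe_b": {"s1": 0.5, "s2": 0.25},
}


def build_rows(data, samples_as_columns=True):
    """Lay the data out with either samples or reporters as the columns."""
    reporters = sorted(data)
    samples = sorted({s for row in data.values() for s in row})
    if samples_as_columns:
        header = ["reporter"] + samples
        body = [[r] + [data[r].get(s, "") for s in samples] for r in reporters]
    else:
        header = ["sample"] + reporters
        body = [[s] + [data[r].get(s, "") for r in reporters] for s in samples]
    return [header] + body


def to_csv(rows):
    """Serialise rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


by_samples = build_rows(data, samples_as_columns=True)
by_reporters = build_rows(data, samples_as_columns=False)
csv_text = to_csv(by_samples)
```

With a handful of selected samples and a long reporter list, putting samples in the columns keeps the column count small, which is the concern raised above about spreadsheet column limits.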
You will now be presented with a listing of every unique clinical variable across all of the selected experiments. If you wish to display the values of these variables, select the ones you wish to see. You may recall that we selected samples by ER status and age at diagnosis, and we may wish to see these values alongside the sample information. Select these two variables and click the Select Variables & Display Output button.
You will now be presented with an HTML table displaying your output. If you wish to download a CSV file with this information, click on the Download To CSV File button.
