Notes on Bioconductor Core Classes

Date: 2006-02-15

Introduction

This page documents the proposed design of eSet and its subclasses. See the background discussion page for context. Possible modifications are available in svn, at

https://hedgehog.fhcrc.org/bioconductor/branches/mtm/Biobase-eSet/

The section What is an eSet contains a sketch of the current (22 March, 2006) implementation, and previous discussion. The section Expected Subclasses of eSet outlines some subclasses, but help here would be appreciated.

What is an eSet?

The eSet class provides a generic structure for representing data arising from high-throughput genomic and proteomic assay experiments. The "e" in eSet stands for "everything"; however, subclassing will almost always be needed to obtain useful representations for particular technologies.

eSet

Here is the current structure of eSet.

class:eSet
assayData:AssayData - the main data storage.
phenoData:AnnotatedDataFrame - contains sample data, and metadata describing the sample data.
experimentData:MIAME - description of the experiment and relevant metadata.
annotation:character - a string identifying separately maintained annotation data.
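
As a rough illustration of the slot layout above, here is a minimal sketch using only the methods package. All class names (SketchAssayData, SketchAnnotatedDataFrame, SketchESet) are hypothetical stand-ins and are not the definitions in Biobase; a plain list stands in for the MIAME object.

```r
library(methods)

# Hypothetical stand-ins; none of these definitions are taken from Biobase.
setClassUnion("SketchAssayData", c("list", "environment"))

setClass("SketchAnnotatedDataFrame",
         representation(data = "data.frame", varMetadata = "data.frame"))

setClass("SketchESet",
         representation(assayData = "SketchAssayData",
                        phenoData = "SketchAnnotatedDataFrame",
                        experimentData = "list",   # stand-in for a MIAME object
                        annotation = "character"))

# Toy data: 3 features (rows) by 2 samples (columns).
exprs <- matrix(1:6, nrow = 3,
                dimnames = list(c("f1", "f2", "f3"), c("s1", "s2")))

pd <- new("SketchAnnotatedDataFrame",
          data = data.frame(sex = c("F", "M"), age = c(31, 42),
                            row.names = c("s1", "s2")),
          varMetadata = data.frame(
            varLabelsDescription = c("Sex of donor", "Age in years"),
            row.names = c("sex", "age")))

obj <- new("SketchESet",
           assayData = list(exprs = exprs),
           phenoData = pd,
           experimentData = list(title = "toy experiment"),
           annotation = "toyChip")
```

Note that the sample names appear twice, as the column names of the assay matrix and as the row names of the sample data; keeping the two aligned is the container's job.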

Each slot has accessor and replacement methods. In addition, the following methods are defined (many with replacement methods):

pData:data slot of phenoData
varMetadata:varMetadata slot of phenoData
sampleNames:common colnames of assayData elements
featureNames:common rownames of assayData elements
varLabels:varLabels of pData
dim:common dimensions of objects in assayData
dims:all dimensions of objects in assayData
ncol:common column number in assayData
nrow:common row number in assayData
"[":simultaneously subset assayData and phenoData
"$":column names of data slot of phenoData
show:
combine:(incomplete)
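
To make the "[" entry concrete, here is a minimal sketch of subsetting that keeps the assay elements and the sample data aligned. ToySet is a hypothetical toy container (a list of matrices plus a data.frame), not the Biobase class, and the method body is only one plausible way to do this.

```r
library(methods)

# Hypothetical toy container, not the Biobase class definition.
setClass("ToySet", representation(assayData = "list", pData = "data.frame"))

setMethod("[", "ToySet", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- TRUE   # no feature index: keep all rows
  if (missing(j)) j <- TRUE   # no sample index: keep all columns
  # Subset every assay matrix by feature (i) and sample (j) ...
  x@assayData <- lapply(x@assayData, function(m) m[i, j, drop = FALSE])
  # ... and keep the per-sample rows of the sample data aligned.
  x@pData <- x@pData[j, , drop = FALSE]
  x
})

m <- matrix(1:6, nrow = 3,
            dimnames = list(c("f1", "f2", "f3"), c("s1", "s2")))
ts <- new("ToySet",
          assayData = list(exprs = m),
          pData = data.frame(sex = c("F", "M"), row.names = c("s1", "s2")))

sub <- ts[1:2, "s2"]   # first two features, sample s2 only
```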

Discussion

Q: (from WH) How about renaming the phenoData slot to SampleData?

I think there is significant history to indicate the non-use of history, notes, and I would like to have MIAMElike capture the description, notes, annotation information. history is of interest but I think this should be a superclass-resident aspect, so that a class that extends withHistory gets the audit trail management infrastructure. This particular class should not be concerned with auditing. -- VJC

I think these are all good suggestions. So we remove notes, history and put everything in a MIAMElike description slot. Also, your comments on a withHistory superclass are in line with how I've thought about this -- SYF

Do we have any indication that this is a winning design? It seems we have used very often a thin reporterInfo, just a character tag, and the metadata are farmed out to annotation packages. I propose scrapping the reporterInfo, the tags must live in the rownames of each component of assayData. -- VJC

Isn't there a use-case where there is experiment-specific reporterInfo data to be tracked? I agree that identifying the quality control spots is chip data that belongs with the annotation/cdf/probe and not with the actual data. -- SYF

Indeed -- unique cDNA designs could benefit from this. Any chip that does not require a separate annotation package could use this. But such chips seem to be in the minority, when you reckon the whole universe of chips we are dealing with. So I would propose that such chips do not drive the design of the basic container. If we want annotation on the container, then we add reporterInfo. I am not digging in my heels on this, but I suspect reporterInfo will be underused. -- VJC

I just spoke with Rafa and he agreed that reporterInfo is an exception rather than the rule and thus is an appropriate addition to a subclass (and there, I might suggest reporterData to mirror phenoData). -- SYF

  • Rename phenoData to AnnotatedDataFrame (other name suggestions welcome; the point is that this same structure is useful for the reporterInfo).

[VJC: The phenoData class should go away, but the term is still used for the slot.]

  • Use an AnnotatedDataFrame for the reporterInfo slot.

[VJC: Consider eliminating reporterInfo in recognition of the key role of annotation packages for this functionality. Any class that needs rich reporterInfo can introduce this by extension.]

OK by me. If a number of subclasses show us that something belongs at this level, we can add it later. -- SYF

  • Consider removing sampleNames and reporterNames slots from eSet. This data is the responsibility of the phenoData and reporterInfo slots, respectively.

[VJC: yes]

  • Would it be worth having the assayData slot be of class AssayData, which for now would be listOrEnv? The thing I'm worried about is that we might want to develop a data container that uses C-based tricks to avoid copying, à la Biostrings. The slots of a subclass that are shared with its superclass have to maintain an is-a relationship.

    I guess this isn't needed as long as we can create a subclass of listOrEnv that has its data as external pointer and an interface that makes it look like a list/environment.

[VJC: I would wait for a good set of use cases. Is anyone capitalizing on listOrEnv yet?]

Not that I know of. I'm not suggesting that we go off and try to implement something in C. But committing to listOrEnv means that all future subclasses have to have an assayData slot that is a subclass of listOrEnv. Perhaps that isn't all that restrictive as it only says that we must look like a list/env in terms of "[", "[[", and supported methods. -- SYF

AssayData

This is a class union, currently defined as listOrEnv. Because it is a class union, it is virtual and cannot be instantiated with 'new'; there is a (non-exported) constructor assayDataNew that subclasses of eSet can use to return a list or environment containing specific bindings.
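
The following sketch illustrates why a constructor function is needed. ToyAssayData and sketchAssayDataNew are hypothetical names, and the argument list of the real assayDataNew may differ from what is assumed here.

```r
library(methods)

# Hypothetical class union mirroring the listOrEnv idea.
setClassUnion("ToyAssayData", c("list", "environment"))

# A class union is virtual, so new() refuses to instantiate it.
isVirtualClass("ToyAssayData")
res <- tryCatch(new("ToyAssayData"),
                error = function(e) "cannot instantiate")

# Hypothetical constructor: returns a list or an environment that
# holds the supplied named elements.
sketchAssayDataNew <- function(storage = c("environment", "list"), ...) {
  storage <- match.arg(storage)
  elts <- list(...)
  if (storage == "list") return(elts)
  env <- new.env(parent = emptyenv())
  for (nm in names(elts)) assign(nm, elts[[nm]], envir = env)
  env
}

ad <- sketchAssayDataNew("environment", exprs = matrix(0, 2, 2))
is(ad, "ToyAssayData")   # an environment is a member of the union
```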

When it is an environment, AssayData imposes some controls to avoid surprising the naive user. In particular, modifying a variable in the environment generally triggers a copy, which avoids the pointer-like (reference) behavior that would otherwise occur.
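
A minimal sketch of the copy-on-modify idea, in plain R. The helpers copyEnv and setElement are hypothetical and only illustrate the behavior described; they are not the mechanism used in Biobase.

```r
# Hypothetical helper: copy all bindings into a fresh environment.
copyEnv <- function(env) {
  out <- new.env(parent = parent.env(env))
  for (nm in ls(env, all.names = TRUE))
    assign(nm, get(nm, envir = env), envir = out)
  out
}

# Hypothetical replacement helper: modify a copy, never the original.
setElement <- function(env, name, value) {
  out <- copyEnv(env)
  assign(name, value, envir = out)
  out
}

a <- new.env()
assign("exprs", matrix(0, 2, 2), envir = a)

b <- setElement(a, "exprs", matrix(1, 2, 2))
# 'a' still holds the zeros; only 'b' sees the new values.
```

Without the copy, assigning into the environment would silently change every object that refers to it, which is the surprise the text refers to.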

Each subclass of eSet specifies bindings in AssayData, and checks that those bindings are present; other bindings may also occur.
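
One way a subclass might check for its required bindings is a validity function, sketched below. MiniAssayData, MiniESet, MiniExprSet and the helper assayNames are hypothetical names, and the required element "exprs" is an assumption chosen for illustration.

```r
library(methods)

# Hypothetical stand-ins, not the Biobase definitions.
setClassUnion("MiniAssayData", c("list", "environment"))
setClass("MiniESet", representation(assayData = "MiniAssayData"))

# Element names: names() for a list, ls() for an environment.
assayNames <- function(x) if (is.environment(x)) ls(x) else names(x)

setClass("MiniExprSet", contains = "MiniESet",
         validity = function(object) {
           missingElts <- setdiff("exprs", assayNames(object@assayData))
           if (length(missingElts))
             paste("assayData is missing required element:", missingElts)
           else TRUE
         })

# Extra bindings (here 'se') are allowed alongside the required one.
ok <- new("MiniExprSet",
          assayData = list(exprs = matrix(0, 2, 2), se = matrix(0, 2, 2)))

# Omitting 'exprs' fails validation.
bad <- tryCatch(new("MiniExprSet", assayData = list(se = matrix(0, 2, 2))),
                error = function(e) "invalid")
```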

Some common generic functions (e.g., 'dim') are implemented as regular functions (assayDataDim) because otherwise the generic would apply to lists or environments that are not of class AssayData.
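
A plain-function version of 'dim' could look like the sketch below. The name sketchAssayDataDim is hypothetical, and the actual assayDataDim may behave differently, for example in how it treats elements with differing dimensions.

```r
# Hypothetical plain function: works on a list or an environment of
# matrices and returns their common dimensions.
sketchAssayDataDim <- function(x) {
  elts <- if (is.environment(x)) mget(ls(x), envir = x) else x
  dims <- unique(lapply(elts, dim))
  if (length(dims) != 1L)
    stop("assayData is empty or its elements have differing dimensions")
  dims[[1L]]
}

asList <- list(exprs = matrix(0, 3, 2), se = matrix(1, 3, 2))
asEnv <- list2env(asList, envir = new.env())

sketchAssayDataDim(asList)
sketchAssayDataDim(asEnv)
```

Because this is an ordinary function rather than a 'dim' method, calling dim() on an unrelated list or environment is unaffected.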

Methods include:

sampleNames:common colnames of members
featureNames:common rownames of members
show:

AnnotatedDataFrame

Somewhat like the current phenoData class, but with a more generic name and with varLabels incorporated as a column varLabelsDescription in varMetadata.

class:AnnotatedDataFrame
data:data.frame - rows correspond to samples, columns are variables describing the samples.
varMetadata:data.frame - metadata describing the columns of the data frame in the data slot. This will always have a column varLabelsDescription that provides a description of each column of the data slot.
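
The two-slot structure can be sketched with a validity check for the mandatory column. SketchADF is a hypothetical name; the second check (that the rows of varMetadata match the columns of data) is an assumption made for this illustration and is not stated above.

```r
library(methods)

# Hypothetical sketch, not the Biobase definition.
setClass("SketchADF",
         representation(data = "data.frame", varMetadata = "data.frame"),
         validity = function(object) {
           msg <- character()
           if (!"varLabelsDescription" %in% colnames(object@varMetadata))
             msg <- c(msg, "varMetadata must have a 'varLabelsDescription' column")
           # Assumption: one varMetadata row per column of data, same order.
           if (!identical(rownames(object@varMetadata), colnames(object@data)))
             msg <- c(msg, "varMetadata rows must match the columns of data")
           if (length(msg)) msg else TRUE
         })

adf <- new("SketchADF",
           data = data.frame(sex = c("F", "M"), age = c(31, 42),
                             row.names = c("s1", "s2")),
           varMetadata = data.frame(
             varLabelsDescription = c("Sex of donor", "Age in years"),
             row.names = c("sex", "age")))

# A varMetadata without the mandatory column is rejected.
badAdf <- tryCatch(
  new("SketchADF",
      data = data.frame(sex = c("F", "M"), age = c(31, 42)),
      varMetadata = data.frame(units = c("none", "years"),
                               row.names = c("sex", "age"))),
  error = function(e) "invalid")
```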

Exported methods are mostly like the methods for eSet. There is also a setAs method to convert phenoData to AnnotatedDataFrame.

dim:
nrow:
ncol:
pData:
sampleNames:
varLabels:
varMetadata:
"[":
"$":
"[[":
show:

Discussion

I think the phenoData class is mis-named. It is, in my opinion, more general. I suggest we rename it to AnnotatedDataFrame, but am open to other suggestions. -- SYF

We could condense this somewhat and promote better use of varMetadata, by eliminating varLabels and letting "labels" be a mandatory column of varMetadata. So AnnotatedDataFrame has representation(frame="data.frame", varMetadata="data.frame") -- VJC

Sure. Not really sure I see the advantage, but maybe it would encourage further use of varMetadata. And "description" might be better than "label". -- SYF

The discussion of feature level has moved to http://wiki.fhcrc.org/bioc/Core_Bioconductor_Classes_Discussion/FeatureLevelData

Core Bioconductor Classes Discussion (last edited 2007-08-20 16:38:32 by SethFalcon)