1. Multi-genome representation

    Develop a compact way of representing multiple genomes, or large genomic regions, from individuals of the same species. The representation needs to be able to represent SNPs, indels, and rearrangements. One use case is to query a collection of sequences for genomic features, for instance 'all SNPs in the coding regions of gene XYZ', 'all rearrangements involving the sequence spanning coordinates xx, yy of hg19', 'indels within 5kb of the transcription start site of any gene'. A second use case is to reconstruct the sequence of related individuals. Compatibility with existing genome-level operations defined in the BSgenome package is desirable. Representative data might include the 1000 genomes data, or a large collection of cancer cell lines.

    Multi Genome SNP Matrix
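
    One possible shape for such a representation, sketched in Python purely for illustration: store the reference once and keep, per individual, only the differences from it. The `Variant` record, `reconstruct`, and `variants_in` are hypothetical names, not part of BSgenome or any existing package, and rearrangements are deliberately left out of this minimal sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One difference from the reference (hypothetical record layout)."""
    start: int  # 0-based start position on the reference
    ref: str    # reference bases replaced ("" for a pure insertion)
    alt: str    # bases in the individual ("" for a pure deletion)

    @property
    def kind(self) -> str:
        if len(self.ref) == 1 and len(self.alt) == 1:
            return "SNP"
        if len(self.ref) != len(self.alt):
            return "indel"
        return "substitution"


def reconstruct(reference: str, variants: list) -> str:
    """Rebuild an individual's sequence by applying non-overlapping variants."""
    pieces, cursor = [], 0
    for v in sorted(variants, key=lambda v: v.start):
        if v.start < cursor:
            raise ValueError("overlapping variants")
        if reference[v.start:v.start + len(v.ref)] != v.ref:
            raise ValueError("ref allele does not match the reference")
        pieces.append(reference[cursor:v.start])  # unchanged stretch
        pieces.append(v.alt)                      # individual's allele
        cursor = v.start + len(v.ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def variants_in(variants, start, end, kind=None):
    """Variants starting in the half-open reference interval [start, end)."""
    return [v for v in variants
            if start <= v.start < end and (kind is None or v.kind == kind)]


# Toy data: one shared reference, two individuals stored as differences only.
reference = "ACGTACGTAC"
individuals = {
    "sample1": [Variant(2, "G", "T"), Variant(5, "CG", "")],
    "sample2": [Variant(4, "", "TT")],
}
sample1_sequence = reconstruct(reference, individuals["sample1"])
sample1_snps = variants_in(individuals["sample1"], 0, 5, kind="SNP")
```

    Storage grows with the number of differences rather than with genome length, which is the property that makes the approach attractive for many individuals of one species. A query such as 'all SNPs in the coding regions of gene XYZ' would call `variants_in` once per coding interval.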

  2. Add support for multigraphs to the graph package

    More details on expected uses are required. Multigraph Requirements

  3. Gapped alignments to a reference

    Develop a representation of gapped alignments to a reference, e.g., of short reads to hg19. The representation must be compact, e.g., allowing manipulation of 20 million alignments on a laptop with performance characteristics that facilitate exploratory interactive analysis. The representation must allow indels and mismatches. The representation must allow computation, e.g., the depth of coverage over particular genomic coordinates. The representation must allow interval operations, e.g., selecting alignments that overlap transcribed genomic regions. The representation does NOT have to represent sequence-level or quality information; it MAY include a key establishing a relationship with a more complete representation (e.g., index into a BAM file). Efficient data input is not a primary concern. The implementation will reuse existing (IRanges) data structures and generics where appropriate.

    Multiple alignment rep v1, Alignment Use Cases
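
    As a rough illustration of these requirements, the Python sketch below stores each alignment as a start position plus a CIGAR string (the notation used in SAM/BAM files) and derives coverage and overlap from it. The function names are hypothetical, and the choice of CIGAR as the gap encoding is an assumption rather than something this item prescribes. A genuinely compact implementation would hold positions and operations in packed arrays rather than Python tuples.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Half-open reference intervals covered by aligned bases (M, =, X).

    D and N advance along the reference without covering it; I, S, H and P
    do not advance along the reference at all.
    """
    blocks, ref = [], pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            blocks.append((ref, ref + n))
            ref += n
        elif op in "DN":
            ref += n
    return blocks


def coverage(alignments, start, end):
    """Per-position depth over [start, end), built from a difference array."""
    diff = [0] * (end - start + 1)
    for pos, cigar in alignments:
        for b_start, b_end in aligned_blocks(pos, cigar):
            lo, hi = max(b_start, start), min(b_end, end)
            if lo < hi:
                diff[lo - start] += 1
                diff[hi - start] -= 1
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth


def overlapping(alignments, start, end):
    """Alignments with at least one aligned block overlapping [start, end)."""
    return [a for a in alignments
            if any(s < end and e > start for s, e in aligned_blocks(*a))]


# Toy alignments as (0-based start, CIGAR): ungapped, deletion, insertion.
alignments = [(0, "5M"), (2, "3M2D3M"), (4, "2M1I2M")]
depth = coverage(alignments, 0, 10)
hits = overlapping(alignments, 5, 7)
```

    Here overlap is decided on aligned blocks, so a read whose gap spans the query region does not count as overlapping it. Testing against the read's full extent instead would be an equally defensible choice.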

  4. Transcript-level annotation

    Develop an interface that provides a consistent way of retrieving transcript-level information from diverse data sources. The details of data sources (e.g., UCSC, biomaRt, NCBI) must not be a primary element of the interface. The implementation must be flexible enough to allow easy, modular addition of additional data sources as they become available. Retrieval must include the opportunity for input from local instances (e.g., serialized R objects). Retrieved data will be represented in a common data structure, regardless of data source.

    The retrieved information must contain sufficient metadata to unambiguously identify the originating source of the data. The representation in R must contain sufficient information to associate exons with transcripts, and vice versa, and to summarize all transcripts associated with a specified genomic region.
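
    A minimal sketch of what such a common structure could look like, written in Python for illustration only. The class and method names (`TranscriptAnnotation`, `exons_of`, `transcripts_of`, `transcripts_in`) are hypothetical, as are the identifiers and the source label in the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Exon:
    exon_id: str
    chrom: str
    start: int  # 0-based, half-open interval [start, end)
    end: int


@dataclass
class TranscriptAnnotation:
    """Source-independent container; provenance travels with the data."""
    source: str   # where the data came from (hypothetical label)
    version: str  # release or retrieval date of that source
    exons: dict = field(default_factory=dict)     # exon_id -> Exon
    tx_exons: dict = field(default_factory=dict)  # tx_id -> ordered exon ids

    def add_transcript(self, tx_id, exons):
        for exon in exons:
            self.exons[exon.exon_id] = exon
        self.tx_exons[tx_id] = [exon.exon_id for exon in exons]

    def exons_of(self, tx_id):
        return list(self.tx_exons[tx_id])

    def transcripts_of(self, exon_id):
        return sorted(tx for tx, ids in self.tx_exons.items() if exon_id in ids)

    def transcripts_in(self, chrom, start, end):
        """Transcripts with at least one exon overlapping [start, end)."""
        return sorted(
            tx for tx, ids in self.tx_exons.items()
            if any(self.exons[i].chrom == chrom
                   and self.exons[i].start < end
                   and self.exons[i].end > start
                   for i in ids))


# Hypothetical identifiers and coordinates, for illustration only.
e1 = Exon("e1", "chr1", 100, 200)
e2 = Exon("e2", "chr1", 300, 400)
e3 = Exon("e3", "chr1", 500, 600)
annotation = TranscriptAnnotation(source="local-file", version="2010-01")
annotation.add_transcript("tx1", [e1, e2])
annotation.add_transcript("tx2", [e1, e3])
```

    Code that consumes `annotation` never needs to know which backend produced it. A new data source would only have to supply a loader that fills the same structure.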

    The information must be in a form suitable for computation. Use cases include use of transcript data in aggregation (e.g., numbers of short reads overlapping each transcript) and overlap (e.g., subset AlignedRead objects to contain only those reads occurring in specific transcripts).

    Another desirable feature is the representation of alternative transcripts of a single genomic region as a graph, with nodes representing exons and edge weights representing evidence to support associations between exons.
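
    The graph view can be sketched as follows, again in Python and under stated assumptions: `splice_graph` and `supported_paths` are hypothetical helpers, and the edge weight is taken to be a simple count of supporting observations (for example, reads spanning the junction between two exons), which is only one possible definition of "evidence".

```python
def splice_graph(transcripts, junction_support):
    """Build {exon: {next_exon: weight}} from ordered exon lists.

    transcripts:      tx_id -> exon ids in transcript order
    junction_support: (exon_a, exon_b) -> evidence count (assumed measure)
    """
    graph = {}
    for exon_ids in transcripts.values():
        for exon in exon_ids:
            graph.setdefault(exon, {})
        for a, b in zip(exon_ids, exon_ids[1:]):
            graph[a][b] = junction_support.get((a, b), 0)
    return graph


def supported_paths(graph, node, end, min_weight=1):
    """All exon paths from node to end using only edges >= min_weight.

    Assumes the graph is acyclic, which holds when exons are ordered
    along the genome.
    """
    if node == end:
        return [[node]]
    found = []
    for nxt, weight in graph[node].items():
        if weight >= min_weight:
            for rest in supported_paths(graph, nxt, end, min_weight):
                found.append([node] + rest)
    return found


# Two alternative transcripts of one region; e2 is skipped in tx2.
transcripts = {"tx1": ["e1", "e2", "e3"], "tx2": ["e1", "e3"]}
support = {("e1", "e2"): 12, ("e2", "e3"): 9, ("e1", "e3"): 2}
graph = splice_graph(transcripts, support)
well_supported = supported_paths(graph, "e1", "e3", min_weight=5)
```

    Raising `min_weight` prunes weakly supported exon associations, so the same graph can answer both "which isoforms are annotated" and "which isoforms does the data support".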

    More details are available in AnnotationObjects.

    Transcript Annotation v1
  1. Improve large graph support in the graph package

    The immediate need is to improve the performance of the following operations for graphs with thousands of nodes and tens of thousands of edges:

    • unions
    • intersections
    • edge removal based on edge weight threshold value
    • access to algorithms in RBGL
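
    To make the first three operations concrete, here is a small Python sketch that treats a graph as a hash map from directed edge to weight, so each operation takes time roughly linear in the number of edges. This is an illustration only, not the graph package's internal representation. The function names are hypothetical, and the rule for combining weights when an edge occurs in both graphs (`max` for union, `min` for intersection) is an assumption.

```python
def union(g1, g2, combine=max):
    """Edges present in either graph; shared edges get combine(w1, w2)."""
    merged = dict(g1)
    for edge, weight in g2.items():
        merged[edge] = combine(merged[edge], weight) if edge in merged else weight
    return merged


def intersection(g1, g2, combine=min):
    """Edges present in both graphs, with combined weights."""
    return {edge: combine(g1[edge], g2[edge]) for edge in g1.keys() & g2.keys()}


def drop_light_edges(graph, threshold):
    """Keep only edges whose weight is at least the threshold."""
    return {edge: w for edge, w in graph.items() if w >= threshold}


# Toy graphs keyed by (from_node, to_node).
g1 = {("a", "b"): 0.9, ("b", "c"): 0.2}
g2 = {("b", "c"): 0.5, ("c", "d"): 0.7}
both = union(g1, g2)
shared = intersection(g1, g2)
strong = drop_light_edges(both, 0.6)
```

    Isolated nodes are not tracked in this edge-only form. A fuller version would carry a node set alongside the edge map.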
  2. Stream processing

    Many large-data operations may effectively be represented as processing over a stream, and typically result in a transformation of very large data (e.g., aligned reads) to data with much more modest space requirements (e.g., coverage). This motivates development of a reusable infrastructure for stream processing. The infrastructure must support different mechanisms of data input (e.g., scanning a text file, retrieving successive portions of a netCDF file, retrieving successive rows of a database). The infrastructure must allow specification of block-level input filters (e.g., exclude all reads with average quality < X) and transformations (e.g., from data.frames to S4 objects), and the collation of transformed blocks into coherent user-specified structures. The filter likely requires specification of an interface that defines required components of filters and the data they operate on. A desirable additional capability is the ability to specify C-level filtering and transformation routines that can act in a record-at-a-time fashion. Another desirable feature is the ability to pipe streamed processes together.
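
    The overall shape of such an infrastructure can be sketched with Python generators, purely as an illustration of the block-wise pattern: a source yields blocks, each stage maps a stream of blocks to a stream of blocks, and a final collation step reduces the stream to a small result. All names here (`read_blocks`, `parse_reads`, `min_quality`, `pipeline`, `count_by_position`) and the three-column text format are hypothetical.

```python
import io
from itertools import islice


def read_blocks(lines, block_size):
    """Source: yield successive lists of at most block_size lines."""
    iterator = iter(lines)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


def parse_reads(blocks):
    """Transformation: text lines -> (name, position, quality) records."""
    for block in blocks:
        yield [(name, int(pos), float(qual))
               for name, pos, qual in (line.split() for line in block)]


def min_quality(threshold):
    """Filter factory: drop records whose quality is below the threshold."""
    def stage(blocks):
        for block in blocks:
            yield [record for record in block if record[2] >= threshold]
    return stage


def pipeline(source, *stages):
    """Pipe a block stream through stages; nothing runs until consumed."""
    for stage in stages:
        source = stage(source)
    return source


def count_by_position(blocks):
    """Collation: reduce the stream to a small position -> count table."""
    counts = {}
    for block in blocks:
        for _, pos, _ in block:
            counts[pos] = counts.get(pos, 0) + 1
    return counts


# Toy input standing in for a large file; only one block is held at a time.
text = "r1 10 30.0\nr2 10 12.5\nr3 11 35.0\nr4 12 40.0\nr5 12 38.0\n"
blocks = pipeline(read_blocks(io.StringIO(text), 2), parse_reads, min_quality(20.0))
result = count_by_position(blocks)
```

    Because every stage shares one signature (block stream in, block stream out), new input mechanisms, filters, and transformations can be added without touching the others. That shared signature is the kind of interface the filter requirement above seems to call for.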

  3. Public presence

    The Bioconductor project requires a modernized public presence. This includes an updated web site, but also identification and use of appropriate 'Web 2.0' technologies. The public presence must present simple and organized access to the software, training, and community (e.g., mail archives) resources developed by the project. The presence must encourage user participation, including easing uptake by new users, and fostering advanced statistical and software development skills necessary for leading-edge package development.

Backlog (last edited 2010-01-17 17:33:59 by MartinMorgan)