Transcriptome

  • Gerstein et al., Comparative analysis of the transcriptome across distant species, Nature 512:445–448, 2014. 

    The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discovered co-expression modules shared across animals, many of which are enriched in developmental genes. We used expression patterns to align the stages in worm and fly development, finding a novel pairing between worm embryo and fly pupae, in addition to the expected embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we found that the extent of non-canonical, non-coding transcription is similar in each organism, per base-pair. Finally, we found that in all three organisms the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a “universal model” based on a single set of organism-independent parameters.

Additional details on data generation and analysis can be found in the paper supplement.

1. Protein-Coding Gene Annotation

Description

The human, worm and fly protein-coding gene annotations are from GENCODE 10, extensions of WormBase WS220 and FlyBase 5.45, respectively.

Data Source

2. Fly Strict Non-coding Gene

Description

The fly non-coding gene annotation was developed as an extension of FlyBase 5.45.

Data Source

3. Comparable and Non-comparable Non-coding RNA Annotations

Description

Compressed GTF files of non-coding RNA (ncRNA) gene annotations. For each species, one compressed file contains the comparable annotations (miRNA, tRNA, snoRNA, snRNA, pri-miRNA) and another contains the ncRNA annotations that are not comparable between organisms. The comparable annotations are further separated by biotype.
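As an illustration of how such GTF records could be split by biotype, here is a minimal sketch. It assumes the standard nine-column, tab-separated GTF layout and that the biotype is stored under an attribute key named gene_type; that key, the function name and the example records are assumptions for illustration, so check the attribute column of the actual files.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "g1"; gene_type "miRNA";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def group_by_biotype(gtf_lines, biotype_key="gene_type"):
    """Group GTF records by biotype.

    biotype_key is an assumption; the real files may use another key.
    """
    groups = defaultdict(list)
    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip records that are not valid nine-column GTF
        chrom, _source, feature, start, end, _score, strand, _frame, attrs = fields
        attributes = dict(ATTR_RE.findall(attrs))
        biotype = attributes.get(biotype_key, "unknown")
        groups[biotype].append((chrom, feature, int(start), int(end), strand))
    return dict(groups)


# Made-up records, for illustration only.
example = [
    'chr1\texample\tgene\t100\t180\t.\t+\t.\tgene_id "g1"; gene_type "miRNA";',
    'chr1\texample\tgene\t500\t572\t.\t-\t.\tgene_id "g2"; gene_type "tRNA";',
    'chr2\texample\tgene\t900\t1030\t.\t+\t.\tgene_id "g3"; gene_type "miRNA";',
]
by_biotype = group_by_biotype(example)
```

A compressed file could be streamed into the same function with gzip.open(path, "rt") from the Python standard library, where path is whichever file was downloaded.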

Data Source

4. Human-Worm-Fly Ortholog Lists

Description

We have compiled a complete list of ~28k triplets of orthologous genes among human, worm and fly (6353 unique genes in human, 5083 unique genes in worm, 4839 unique genes in fly). The list was merged from the MIT list and Ensembl. It contains all one-to-one, one-to-many and many-to-many orthologous relationships.
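The relationship between the number of triplets and the number of unique genes per species can be made concrete with a small sketch. It assumes each triplet is a (human, worm, fly) tuple of gene identifiers; the identifiers below are invented, and the per-pair classification rule is a simplification chosen for illustration, not necessarily the one used to build the list.

```python
from collections import Counter, defaultdict

SPECIES = ("human", "worm", "fly")


def unique_gene_counts(triplets):
    """Count distinct genes per species across all triplets."""
    return {name: len({t[i] for t in triplets}) for i, name in enumerate(SPECIES)}


def pair_relationships(triplets, i, j):
    """Classify each distinct gene pair between species columns i and j."""
    pairs = {(t[i], t[j]) for t in triplets}
    partners_i = defaultdict(set)  # gene in species i -> partners in species j
    partners_j = defaultdict(set)  # gene in species j -> partners in species i
    for a, b in pairs:
        partners_i[a].add(b)
        partners_j[b].add(a)
    counts = Counter()
    for a, b in pairs:
        many_a = len(partners_i[a]) > 1
        many_b = len(partners_j[b]) > 1
        if many_a and many_b:
            counts["many-to-many"] += 1
        elif many_a or many_b:
            counts["one-to-many"] += 1
        else:
            counts["one-to-one"] += 1
    return dict(counts)


# Invented identifiers, for illustration only.
triplets = [
    ("H1", "W1", "F1"),
    ("H2", "W2", "F2"),
    ("H2", "W3", "F2"),
    ("H3", "W4", "F3"),
    ("H4", "W4", "F3"),
    ("H3", "W5", "F3"),
]
counts = unique_gene_counts(triplets)
human_worm = pair_relationships(triplets, 0, 1)
```

Because one gene can appear in several triplets, the number of triplets can greatly exceed the number of unique genes in any one species, as in the list described above.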

Data Source

5. Table Summarizing Processed Expression Values for all Annotated Genes

Description

These tables summarize all annotations with processed expression values associated with protein-coding genes in human, worm and fly. They also include TF prediction power, orthology and other features. Details of the values and features are provided in the Excel sheet headers.

Data Source

6. Transcriptionally Active Regions (TARs)

Description

TARs are regions of non-canonical transcription, i.e. transcription outside protein-coding exons, annotated ncRNAs and pseudogenes. Listed below are the locations (chromosome, start and stop) of all TARs in the genome of each species, called at 90% and 98% exon discovery rate thresholds. Details of TAR calling are in the paper supplement.
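The exclusion step in this definition can be sketched as an interval filter. This is not the TAR-calling procedure from the supplement; it only illustrates removing candidate transcribed regions that overlap annotated elements. The (chromosome, start, stop) tuples, the half-open coordinate convention, the function names and the example numbers are all assumptions.

```python
def exclude_annotated(candidates, annotated):
    """Keep candidate regions that overlap no annotated interval.

    Regions are (chrom, start, stop) tuples; half-open coordinates
    [start, stop) are assumed, so touching intervals do not overlap.
    """
    by_chrom = {}
    for chrom, start, stop in annotated:
        by_chrom.setdefault(chrom, []).append((start, stop))
    kept = []
    for chrom, start, stop in candidates:
        overlaps = any(
            start < a_stop and a_start < stop
            for a_start, a_stop in by_chrom.get(chrom, [])
        )
        if not overlaps:
            kept.append((chrom, start, stop))
    return kept


def covered_fraction(regions, genome_size):
    """Fraction of the genome covered, assuming non-overlapping regions."""
    return sum(stop - start for _chrom, start, stop in regions) / genome_size


# Invented coordinates, for illustration only.
candidates = [("chrI", 100, 200), ("chrI", 300, 400), ("chrII", 50, 150)]
annotated = [("chrI", 180, 260), ("chrII", 150, 300)]
tars = exclude_annotated(candidates, annotated)
fraction = covered_fraction(tars, genome_size=2000)
```

Normalizing the covered bases by genome size, as in covered_fraction, is one simple way to compare the extent of such transcription between organisms on a per-base-pair basis.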

Data Source

7. Enhancers

Description

Human enhancers are from Yip et al., “Classification of genomic regions based on experimentally-determined binding sites of more than 100 transcription-related factors in the whole human genome”, Genome Biol 13:R48. Worm and fly enhancers were identified using enhancer-specific histone marks; see Ho et al. 2014.

Data Source

  • Human enhancers used for TAR analysis: Enhancers
  • Worm and fly enhancers used for TAR analysis: Enhancers
  • Alternative human enhancers: Enhancers

8. Clustering of ncRNAs and TARs with Co-expression Modules

Description

For each species, we mapped the ncRNAs and TARs to modules based on co-expression correlations. ncRNAs and TARs that map strongly to a module may have functions related to those of the module's genes, so we can annotate them based on the module's functions.
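One simple way to picture this mapping is to correlate each ncRNA or TAR expression profile with a per-module summary profile and keep the best match above a cutoff. The sketch below does exactly that with Pearson correlation. The use of a mean module profile, the 0.8 cutoff, the function names and the expression values are assumptions for illustration; the actual mapping criteria are described in the paper supplement.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def module_profile(gene_profiles):
    """Mean expression across a module's genes, per sample."""
    n = len(gene_profiles)
    return [sum(column) / n for column in zip(*gene_profiles)]


def assign_to_module(profile, modules, min_r=0.8):
    """Return the best-correlated module name, or None if below min_r."""
    best_name, best_r = None, min_r
    for name, gene_profiles in modules.items():
        r = pearson(profile, module_profile(gene_profiles))
        if r >= best_r:
            best_name, best_r = name, r
    return best_name


# Invented expression values across four samples, for illustration only.
modules = {
    "M1": [[1, 2, 3, 4], [2, 3, 4, 6]],
    "M2": [[4, 3, 2, 1], [5, 3, 2, 1]],
}
rising_ncrna = [10, 20, 30, 45]   # tracks the M1 mean profile
noisy_ncrna = [1, 5, 1, 5]        # tracks neither module well
hit = assign_to_module(rising_ncrna, modules)
miss = assign_to_module(noisy_ncrna, modules)
```

An element assigned to a module in this way would then inherit the module's functional annotation as a hypothesis, while elements below the cutoff stay unannotated.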

Data Source

  • incRNAs and TARs associated with the 16 modules in three species, tarball of txt files: 16_module_ncRNA.tar.gz

9. Supervised ncRNA predictions (novel ncRNA fragments)

Description

We applied the machine learning method incRNA (Lu et al., “Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data”, Genome Res. 21:245-54) to predict ncRNAs in the genomes of human, worm and fly.

Data Source

10. Gene Co-expression Modules

Description

We built co-expression modules by combining across-species orthology and within-species co-expression relationships between protein-coding genes. In the resulting multilayer network we searched for dense subgraphs, using simulated annealing. We used the OrthoClust methodology; see Yan et al., Genome Biol. (2014) 15:R100. To focus on the cross-species conserved functions, we restricted the clustering to orthologs, arriving at 16 conserved modules, which are enriched in a variety of functions, ranging from morphogenesis to chromatin remodeling.
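To give a feel for the search described above, here is a toy simulated-annealing sketch on a two-species network. It is not the OrthoClust implementation: the objective is a deliberately simplified one in which co-expressed genes of the same species are rewarded for sharing a module, non-co-expressed ones are penalized by gamma, and ortholog pairs sharing a module earn kappa. The objective, parameter values, cooling schedule and gene names are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def score(labels, coexpr, orthologs, species, gamma=0.5, kappa=1.0):
    """Simplified objective: higher means denser, ortholog-consistent modules."""
    total = 0.0
    for u, v in combinations(sorted(labels), 2):
        if labels[u] != labels[v]:
            continue
        pair = frozenset((u, v))
        if species[u] == species[v]:
            total += (1.0 if pair in coexpr else 0.0) - gamma
        elif pair in orthologs:
            total += kappa
    return total


def anneal(nodes, coexpr, orthologs, species, n_modules=2,
           steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Relabel one gene at a time, accepting worse moves with falling probability."""
    rng = random.Random(seed)
    labels = {n: rng.randrange(n_modules) for n in nodes}
    current = initial = score(labels, coexpr, orthologs, species)
    best_labels, best = dict(labels), current
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))  # geometric cooling
        node = rng.choice(nodes)
        old = labels[node]
        labels[node] = rng.randrange(n_modules)
        new = score(labels, coexpr, orthologs, species)
        delta = new - current
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = new
            if current > best:
                best, best_labels = current, dict(labels)
        else:
            labels[node] = old  # reject the move
    return best_labels, best, initial


# Toy network with invented gene names: two co-expressed pairs per species.
nodes = ["h1", "h2", "h3", "h4", "w1", "w2", "w3", "w4"]
species = {n: ("human" if n.startswith("h") else "worm") for n in nodes}
coexpr = {frozenset(p) for p in [("h1", "h2"), ("h3", "h4"), ("w1", "w2"), ("w3", "w4")]}
orthologs = {frozenset(p) for p in [("h1", "w1"), ("h2", "w2"), ("h3", "w3"), ("h4", "w4")]}
best_labels, best_score, initial_score = anneal(nodes, coexpr, orthologs, species)
```

In this toy network the highest possible score is 6.0, reached when h1, h2, w1 and w2 form one module and the remaining four genes form the other. The penalty gamma is what prevents the trivial solution of placing every gene in a single module.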

Data Source

11. Developmental Stage Mapping between Worm and Fly

Description

We used expression patterns to align the stages in worm and fly development, finding a novel pairing between the worm embryo and fly pupa stages, in addition to the expected embryo-to-embryo and larvae-to-larvae pairings. See Li et al., Genome Res. (2014) 24:1086-101 for more details.

Data Source

  • Embryo-specific worm genes aligned with fly genes in both the embryo and pupa stages: wf_dual_mapping.xls

12. Raw Data for the “Comparative Analysis of the Transcriptome across Distant Species”

Description

Description of all datasets in different formats, and links to raw data (reads).

Data Source