Regulation
The files below cover all stages of the uniformly processed ENCODE and modENCODE data for human, worm (C. elegans) and fly (D. melanogaster), up to the July 2012 freeze used in the joint modENCODE/ENCODE analysis papers.
Available data
Human
0. Dataset reference
1. Genome Assembly
2. Raw Alignments
3. Unique Alignments
4. MetaData and Data Quality Metrics
5. Blacklist regions
6. IDR Uniform Peak Calls
7. Blacklist Filtered Peak Calls (for primary data analysis)
8. Unthresholded Peak Calls
9. Signal Tracks (input normalized)
10. Unique Mappability track
11. TF Target Predictions
12. HOT Regions
Worm
0. Dataset reference
1. Genome Assembly
2. Raw Alignments
3. Unique Alignments
4. MetaData and Data Quality Metrics
5. Blacklist regions
6. IDR Uniform Peak Calls
7. Blacklist Filtered Peak Calls (for primary data analysis)
8. Unthresholded Peak Calls
9. Signal Tracks (input normalized)
10. Unique Mappability track
11. TF Target Predictions
12. HOT Regions
Fly
0. Dataset reference
1. Genome Assembly
2. Raw Alignments
3. Unique Alignments
4. MetaData and Data Quality Metrics
5. Blacklist regions
6. IDR Uniform Peak Calls
7. Blacklist Filtered Peak Calls (for primary data analysis)
8. Unthresholded Peak Calls
9. Signal Tracks (input normalized)
10. Unique Mappability track
11. TF Target Predictions
12. HOT Regions
Human
0. Dataset Reference
Description
All human datasets were processed uniformly for the steps below. Here we provide a reference file to map all datasets to their originating ENCODE Accession ID as well as appropriate metadata. Downloadable files in the sections below map to either a File Prefix, a UCSC Accession ID or an EncodeDCC Accession ID.
Data Source
View on Google Docs or (Download)
Data Format
EXCEL/TSV/GoogleDoc
1. Genome Assembly (hg19)
Description
Per-chromosome FASTA files (random contigs are not used for mapping or for computing unique mappability).
Data Source
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/referenceSequences/
Data Format
FASTA
2. Raw Alignments
Description
FASTQ and BAM files can be downloaded from the URLs below. Different labs used different mappers and mapping strategies, so these files should be filtered to standardize them.
Data Source
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeOpenChromChip/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUchicagoTfbs/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwTfbs/
Data Format
BAM
3. Unique Alignments
Description
The BAM files above were filtered to keep only uniquely mapping reads (tagAlign/ directory). Duplicate reads were then removed (only one read per position). The resulting files are in the distinctTagAlign/ directory.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/alignments/distinctTagAlign/
Data Format
TAGALIGN
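As a rough illustration of the "only one read per position" step, the sketch below removes duplicates from tagAlign records. It assumes the standard six tab-separated tagAlign columns (chrom, start, end, sequence, score, strand) and treats a "position" as the combination of chromosome, start, end and strand; the exact key used in the original pipeline is not stated here, so that choice and the function name `distinct_tags` are assumptions. The preceding unique-mapping filter depends on each mapper's flags and is not shown.

```python
def distinct_tags(lines):
    """Yield the first tagAlign record seen at each (chrom, start, end, strand).

    Assumption: a "position" is defined by chromosome, start, end and strand.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        key = (fields[0], int(fields[1]), int(fields[2]), fields[5])
        if key in seen:
            continue  # a read at this position was already kept
        seen.add(key)
        yield line


# Toy records standing in for a real tagAlign file.
example = [
    "chr1\t100\t136\tN\t1000\t+",
    "chr1\t100\t136\tN\t1000\t+",  # exact duplicate: dropped
    "chr1\t100\t136\tN\t1000\t-",  # same coordinates, opposite strand: kept
    "chr2\t500\t536\tN\t1000\t+",
]
kept = list(distinct_tags(example))
```

For real data the same generator can be fed lines from an opened (and, if needed, gunzipped) tagAlign file.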
4. MetaData and Data Quality Metrics
Description
Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics.
Data Source
https://docs.google.com/spreadsheet/ccc?key=0Am6FxqAtrFDwdHdRcHNQUy03SjBoSVMxdUNyZV9Rdnc#gid=9
Data Format
EXCEL/TSV/GoogleDoc
5. Blacklist Regions
Description
The blacklisted regions aim to identify a comprehensive set of regions in the human genome that have anomalous, unstructured, high signal/read counts in next-generation sequencing experiments, independent of cell line and type of experiment. These regions tend to have a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability. Some of these regions overlap pathological repeat elements such as satellite, centromeric and telomeric repeats. However, simple mappability-based filters do not account for most of these regions. Hence, it is recommended to use this blacklist alongside mappability filters.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/blacklist/
Data Format
BED
6. IDR Uniform Peak Calls
Description
The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. An IDR threshold of 0.02 was used. See https://sites.google.com/site/anshulkundaje/projects/idr for details.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/peakCalls/uniformPk/
Data Format
NARROWPEAK
7. Blacklist Filtered Peak Calls
Description
IDR Peak calls are filtered against blacklists. THESE ARE THE HUMAN PEAK CALLS FOR PRIMARY ANALYSIS.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/peakCalls/finalPk/
Data Format
NARROWPEAK
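The blacklist filtering described above can be sketched as a simple interval-overlap test. This is a minimal illustration, not the pipeline's actual code: it assumes that a peak is discarded if it overlaps a blacklisted interval by at least one base, that both files use half-open BED-style coordinates, and the function names `index_blacklist` and `filter_peaks` are hypothetical.

```python
from collections import defaultdict


def index_blacklist(bed_lines):
    """Group blacklist intervals (BED: chrom, start, end) by chromosome."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2])))
    return by_chrom


def filter_peaks(peak_lines, blacklist):
    """Keep peaks that do not overlap any blacklisted interval.

    Assumption: any overlap of one base or more removes the peak.
    """
    kept = []
    for line in peak_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        hit = any(start < b_end and b_start < end
                  for b_start, b_end in blacklist.get(chrom, ()))
        if not hit:
            kept.append(line)
    return kept


# Toy data: one blacklisted interval and three narrowPeak-style records.
blacklist = index_blacklist(["chr1\t1000\t2000"])
peaks = [
    "chr1\t900\t1001\t.\t0\t.\t5.0\t-1\t-1\t50",   # overlaps by one base
    "chr1\t2000\t2100\t.\t0\t.\t7.0\t-1\t-1\t40",  # abuts the interval only
    "chr2\t1000\t2000\t.\t0\t.\t9.0\t-1\t-1\t60",  # different chromosome
]
final_peaks = filter_peaks(peaks, blacklist)
```

The linear scan per peak is fine for short blacklists; for very large interval sets a sorted index or a tool such as bedtools would be the more practical choice.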
8. Unthresholded Peak Calls
Description
This is a large set of unthresholded peak calls (up to 300K peaks) from SPP. It is useful for analyses that focus on low-signal peaks.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/peakCalls/unthresholdPk/
Data Format
NARROWPEAK
9. Signal Tracks
Description
Signal tracks are generated for each dataset using MACSv2’s signal processing module. Signal tracks represent ChIP signal compared to input control signal.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/signal/foldChange/
Data Format
BIGWIG
10. Unique Mappability track
Description
A position ‘i’ on a particular genomic strand ‘s’ is considered uniquely mappable for a read length ‘k’ if the k-mer starting at ‘i’ on strand ‘s’ maps uniquely, i.e. only to position ‘i’ on strand ‘s’ (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an “optimistic”, idealized mappability mask that does not account for mismatches.
A whole-genome index was created (except for the human female mask, for which chrY was excluded from the index), and the Bowtie mapper was used to map each k-mer against both strands of the genome.
Each globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome)
(a) The files are in uint8 (unsigned 8 bit integers) binary formats (saves disk space)
(b) Each file is basically a vector of unsigned 8bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54)
(c) A value of ‘x’ at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand
(d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54)
(e) In order to obtain the uniqueness map for a particular read-length ‘k’, simply perform the following operation on each element of the vector (vector > 0) & (vector <= k)
(f) In order to obtain the uniqueness map for the - strand, you simply need to right-shift the vector by
How to read the files in a programming language such as MATLAB/Octave
% First gunzip and untar the globalmap_k20tok54.tgz file.
% You will see one file for each chromosome, e.g. chr1.uint8.unique
% Read the files as a contiguous binary vector of unsigned 8-bit integers
tmp_uMap = fopen('chr1.uint8.unique','r');
uMapdata = fread(tmp_uMap,'*uint8');
fclose(tmp_uMap);
% You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers.
% Convert to doubles if you like (although this is a waste of memory), or write it out as a text file if you prefer.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/mappability/
Data Format
BINARY
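For readers who do not use MATLAB/Octave, here is a minimal Python sketch of the same read, together with rule (e) for deriving the + strand uniqueness map at a given read length. It uses only the standard library; the function names `read_uniqueness_vector` and `uniqueness_mask` are illustrative, and the toy vector stands in for a real chromosome file such as chr1.uint8.unique.

```python
def read_uniqueness_vector(path):
    """Read a *.uint8.unique file as raw bytes (one unsigned 8-bit value per base)."""
    with open(path, "rb") as handle:
        return handle.read()


def uniqueness_mask(vector, k):
    """Return a bytes object with 1 where the position is unique for read length k.

    Implements rule (e): (value > 0) and (value <= k), for the + strand.
    """
    if not 20 <= k <= 54:
        raise ValueError("k must be between 20 and 54")
    # Lookup table: byte value -> 1 if unique at read length k, else 0.
    table = bytes(1 if 0 < value <= k else 0 for value in range(256))
    return vector.translate(table)


# Toy vector standing in for a chromosome: values are 0 or 20..54.
toy = bytes([0, 20, 36, 54, 0, 25])
mask36 = uniqueness_mask(toy, 36)  # positions unique for 36-mers
```

Using `bytes.translate` keeps memory at one byte per base, which matters for chromosome-length vectors; `sum(mask36)` then gives the number of uniquely mappable start positions.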
11. TF Target Predictions
Description
The TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore those results, and treat the score=0 results cautiously.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/TFtargets/
Data Format
TIP
12. HOT Regions
Description
High-occupancy target (HOT) regions and extreme-occupancy target (XOT) regions from human (hs). HOT and XOT regions are called using the regulator-only peak sets (no polymerase datasets) for each organism, and using only datasets from the species contexts. HOT and XOT regions have higher density of binding than would be expected at a 5% significance threshold (HOT) or 1% significance threshold (XOT) based on 1000x simulations of clustered binding. Please note that the HOT regions include the XOT regions. Concordantly, ubiquitously HOT and ubiquitously XOT regions in each organism are defined as the regions that are HOT or XOT across all of the main contexts, respectively.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Human/hotRegions/
Data Format
BED
Worm
0. Dataset Reference
Description
All worm datasets were processed uniformly for the steps below. Here we provide a reference file to map all datasets to their originating ENCODE Accession ID as well as appropriate metadata. Downloadable files in the sections below map to either a File Prefix or a modEncodeDCC Accession ID.
Data Source
View on Google Docs or Download
Data Format
EXCEL/TSV/GoogleDoc
1. Genome Assembly (ce10)
Description
Per-chromosome FASTA files (random contigs are not used for mapping or for computing unique mappability).
Data Source
http://hgdownload.cse.ucsc.edu/goldenPath/ce10/chromosomes/
Data Format
FASTA
2. Raw Alignments
Description
FASTQ and BAM files can be downloaded from the URL below. Different labs used different mappers and mapping strategies, so these files should be filtered to standardize them.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/alignments/
Data Format
BAM
3. Unique Alignments
Description
The BAM files above were filtered to keep only uniquely mapping reads (tagAlign/ directory). Duplicate reads were then removed (only one read per position). The resulting files are in the distinctTagAlign/ directory.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/alignments/distinctTagAlign/
Data Format
TAGALIGN
4. MetaData and Data Quality Metrics
Description
Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics.
Data Source
https://docs.google.com/spreadsheet/ccc?key=0Algk3BSZDYzgdDlYNU00d2p3azJyZWlrZ09OQXNXTGc#gid=0
Data Format
EXCEL/TSV/GoogleDoc
5. Blacklist Regions
Description
The blacklisted regions aim to identify a comprehensive set of regions in the worm genome that have anomalous, unstructured, high signal/read counts in next-generation sequencing experiments, independent of cell line and type of experiment. These regions tend to have a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability. Some of these regions overlap pathological repeat elements such as satellite, centromeric and telomeric repeats. However, simple mappability-based filters do not account for most of these regions. Hence, it is recommended to use this blacklist alongside mappability filters.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/blacklist/
Data Format
BED
6. IDR Uniform Peak Calls
Description
The SPP peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. An IDR threshold of 0.05 was used. chrM peaks were removed, as these were unreliable in most cases. See https://sites.google.com/site/anshulkundaje/projects/idr for details.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/uniformPk/
Data Format
NARROWPEAK
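The chrM removal mentioned for the worm IDR peak calls amounts to a filter on the first column of each narrowPeak record. The sketch below shows one way to reproduce it on other peak files; the function name `drop_mito_peaks` is hypothetical, and the chromosome label "chrM" follows the UCSC ce10 naming.

```python
def drop_mito_peaks(peak_lines, mito_name="chrM"):
    """Return peak records whose chromosome column is not the mitochondrial one."""
    return [line for line in peak_lines
            if line.strip() and line.split("\t", 1)[0] != mito_name]


# Toy narrowPeak-style records using ce10 chromosome names.
worm_peaks = [
    "chrI\t1000\t1200\t.\t0\t.\t8.0\t-1\t-1\t100",
    "chrM\t300\t500\t.\t0\t.\t50.0\t-1\t-1\t90",
    "chrX\t7000\t7250\t.\t0\t.\t6.5\t-1\t-1\t120",
]
nuclear_peaks = drop_mito_peaks(worm_peaks)
```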
7. Blacklist Filtered Peak Calls
Description
IDR Peak calls are filtered against blacklists. THESE ARE THE WORM PEAK CALLS FOR PRIMARY ANALYSIS.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/finalPk/
Data Format
NARROWPEAK
8. Unthresholded Peak Calls
Description
This is a large set of unthresholded peak calls (up to 300K peaks) from SPP. It is useful for analyses that focus on low-signal peaks.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/peakCalls/unthresholdPk/
Data Format
NARROWPEAK
9. Signal Tracks
Description
Signal tracks are generated for each dataset using MACSv2’s signal processing module. Signal tracks represent ChIP signal compared to input control signal.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/signal/foldChange/
Data Format
BIGWIG/BEDGRAPH
10. Unique Mappability track
Description
A position ‘i’ on a particular genomic strand ‘s’ is considered uniquely mappable for a read length ‘k’ if the k-mer starting at ‘i’ on strand ‘s’ maps uniquely, i.e. only to position ‘i’ on strand ‘s’ (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an “optimistic”, idealized mappability mask that does not account for mismatches.
A whole-genome index was created, and the Bowtie mapper was used to map each k-mer against both strands of the genome.
Each globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome)
(a) The files are in uint8 (unsigned 8 bit integers) binary formats (saves disk space)
(b) Each file is basically a vector of unsigned 8bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54)
(c) A value of ‘x’ at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand
(d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54)
(e) In order to obtain the uniqueness map for a particular read-length ‘k’, simply perform the following operation on each element of the vector (vector > 0) & (vector <= k)
(f) In order to obtain the uniqueness map for the - strand, you simply need to right-shift the vector by
How to read the files in a programming language such as MATLAB/Octave
% First gunzip and untar the globalmap_k20tok54.tgz file.
% You will see one file for each chromosome, e.g. chr1.uint8.unique
% Read the files as a contiguous binary vector of unsigned 8-bit integers
tmp_uMap = fopen('chr1.uint8.unique','r');
uMapdata = fread(tmp_uMap,'*uint8');
fclose(tmp_uMap);
% You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers.
% Convert to doubles if you like (although this is a waste of memory), or write it out as a text file if you prefer.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/mappability/
Data Format
BINARY
11. TF Target Predictions
Description
The TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore those results, and treat the score=0 results cautiously.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/TFtargets/
Data Format
TIP
12. HOT Regions
Description
High-occupancy target (HOT) regions and extreme-occupancy target (XOT) regions from worm (ce). HOT and XOT regions are called using the regulator-only peak sets (no polymerase datasets) for each organism, and using only datasets from the species contexts. HOT and XOT regions have higher density of binding than would be expected at a 5% significance threshold (HOT) or 1% significance threshold (XOT) based on 1000x simulations of clustered binding. Please note that the HOT regions include the XOT regions. Concordantly, ubiquitously HOT and ubiquitously XOT regions in each organism are defined as the regions that are HOT or XOT across all of the main contexts, respectively.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Worm/hotRegions/
Data Format
BED
Fly
0. Dataset Reference
Description
All fly datasets were processed uniformly for the steps below. Here we provide a reference file to map all datasets to their originating ENCODE Accession ID as well as appropriate metadata. Downloadable files in the sections below map to either a File Prefix or a modEncodeDCC Accession ID.
Data Source
View on Google Docs or (Download)
Data Format
EXCEL/TSV/GoogleDoc
1. Genome Assembly (dm3)
Description
Per-chromosome FASTA files (random contigs are not used for mapping or for computing unique mappability).
Data Source
http://hgdownload.cse.ucsc.edu/goldenPath/dm3/chromosomes/
Data Format
FASTA
2. Raw Alignments
Description
FASTQ and BAM files can be downloaded from the URL below. Different labs used different mappers and mapping strategies, so these files should be filtered to standardize them.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/alignments/
Data Format
BAM
3. Unique Alignments
Description
The BAM files above were filtered to keep only uniquely mapping reads (tagAlign/ directory). Duplicate reads were then removed (only one read per position). The resulting files are in the distinctTagAlign/ directory.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/alignments/distinctTagAlign/
Data Format
TAGALIGN
4. MetaData and Data Quality Metrics
Description
Measures of enrichment, signal-to-noise ratios, library complexity and peak calling statistics.
Data Source
https://docs.google.com/spreadsheet/ccc?key=0Algk3BSZDYzgdDU3cXVVMHdQeHRTUWtnYk1aSG13NEE&pli=1#gid=6
Data Format
EXCEL/TSV/GoogleDoc
5. Blacklist Regions
Description
The blacklisted regions aim to identify a comprehensive set of regions in the fly genome that have anomalous, unstructured, high signal/read counts in next-generation sequencing experiments, independent of cell line and type of experiment. These regions tend to have a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability. Some of these regions overlap pathological repeat elements such as satellite, centromeric and telomeric repeats. However, simple mappability-based filters do not account for most of these regions. Hence, it is recommended to use this blacklist alongside mappability filters.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/blacklist/
Data Format
BED
6. IDR Uniform Peak Calls
Description
The MACSv2 peak caller was used along with the IDR framework for calling peaks and thresholding based on reproducibility. An IDR threshold of 0.05 was used.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/peakCalls/uniformPk/
Data Format
NARROWPEAK
7. Blacklist Filtered Peak Calls
Description
IDR Peak calls are filtered against blacklists. THESE ARE THE FLY PEAK CALLS FOR PRIMARY ANALYSIS.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/peakCalls/finalPk/
Data Format
NARROWPEAK
8. Unthresholded Peak Calls
Description
This is a large set of unthresholded peak calls from MACSv2. It is useful for analyses that focus on low-signal peaks.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/peakCalls/unthresholdPk/
Data Format
NARROWPEAK
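One way to pull low-signal peaks out of an unthresholded narrowPeak file is to parse the standard ten narrowPeak columns and select on signalValue (column 7). The sketch below is only an illustration: the cutoff `max_signal` is an arbitrary, hypothetical value rather than a recommended threshold, and the names `Peak`, `parse_narrowpeak` and `low_signal_peaks` are introduced here.

```python
from collections import namedtuple

# The ten standard narrowPeak columns, with short field names.
Peak = namedtuple(
    "Peak", "chrom start end name score strand signal p_value q_value summit")


def parse_narrowpeak(line):
    """Parse one tab-separated narrowPeak record into a Peak tuple."""
    f = line.rstrip("\n").split("\t")
    return Peak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def low_signal_peaks(lines, max_signal):
    """Return peaks with signalValue <= max_signal, weakest first."""
    peaks = (parse_narrowpeak(line) for line in lines if line.strip())
    return sorted((p for p in peaks if p.signal <= max_signal),
                  key=lambda p: p.signal)


# Toy records using dm3 chromosome names.
records = [
    "chr2L\t100\t300\t.\t0\t.\t12.5\t-1\t-1\t90",
    "chr2L\t900\t1100\t.\t0\t.\t2.0\t-1\t-1\t100",
    "chr3R\t50\t250\t.\t0\t.\t3.5\t-1\t-1\t80",
]
weak = low_signal_peaks(records, max_signal=4.0)  # hypothetical cutoff
```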
9. Signal Tracks
Description
Signal tracks are generated for each dataset using MACSv2’s signal processing module. Signal tracks represent ChIP signal compared to input control signal.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/signal/foldChange/
Data Format
BIGWIG
10. Unique Mappability track
Description
A position ‘i’ on a particular genomic strand ‘s’ is considered uniquely mappable for a read length ‘k’ if the k-mer starting at ‘i’ on strand ‘s’ maps uniquely, i.e. only to position ‘i’ on strand ‘s’ (no mismatches allowed). There are other ways to define mappability, e.g. allowing for mismatches, but this is an “optimistic”, idealized mappability mask that does not account for mismatches.
A whole-genome index was created, and the Bowtie mapper was used to map each k-mer against both strands of the genome.
Each globalmap_k20tok54.tgz file contains binary files representing uniqueness maps for each chromosome for all read-lengths ranging from 20 to 54 (encoded in a single file for each chromosome)
(a) The files are in uint8 (unsigned 8 bit integers) binary formats (saves disk space)
(b) Each file is basically a vector of unsigned 8bit integers that is the length of the chromosome. The elements of the vector are >= 0 (taking values 0 or 20 to 54)
(c) A value of ‘x’ at a position means that position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand
(d) A value of 0 at a position means that position is not unique for any of the k-mer lengths (k=20 to 54)
(e) In order to obtain the uniqueness map for a particular read-length ‘k’, simply perform the following operation on each element of the vector (vector > 0) & (vector <= k)
(f) In order to obtain the uniqueness map for the - strand, you simply need to right-shift the vector by
How to read the files in a programming language such as MATLAB/Octave
% First gunzip and untar the globalmap_k20tok54.tgz file.
% You will see one file for each chromosome, e.g. chr1.uint8.unique
% Read the files as a contiguous binary vector of unsigned 8-bit integers
tmp_uMap = fopen('chr1.uint8.unique','r');
uMapdata = fread(tmp_uMap,'*uint8');
fclose(tmp_uMap);
% You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers.
% Convert to doubles if you like (although this is a waste of memory), or write it out as a text file if you prefer.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/mappability/
Data Format
BINARY
11. TF Target Predictions
Description
The TIP algorithm for predicting TF target genes was applied to the input-normalized ChIP-seq tracks; these are the output files of that method. Note that TIP was run on all ChIP-seq datasets, including those with score -1. For most applications you should ignore those results, and treat the score=0 results cautiously.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/TFtargets/
Data Format
TIP
12. HOT Regions
Description
High-occupancy target (HOT) regions and extreme-occupancy target (XOT) regions from fly (dm). HOT and XOT regions are called using the regulator-only peak sets (no polymerase datasets) for each organism, and using only datasets from the species contexts. HOT and XOT regions have higher density of binding than would be expected at a 5% significance threshold (HOT) or 1% significance threshold (XOT) based on 1000x simulations of clustered binding. Please note that the HOT regions include the XOT regions. Concordantly, ubiquitously HOT and ubiquitously XOT regions in each organism are defined as the regions that are HOT or XOT across all of the main contexts, respectively.
Data Source
http://encodedcc.stanford.edu/ftp/modENCODE_VS_ENCODE/Regulation/Fly/hotRegions/
Data Format
BED