Overview

Goal

This is an example peak affinity analysis, looking for differences in peak affinity in the hindbrain and the forebrain of P0 mice, using two datasets from ENCODE ENCSR312LQX from the hindbrain and ENCSR310MLB from the forebrain. Each of these datasets has a single replicate.

sample region replicate
hindbrain_rep2 hindbrain_rep2 hindbrain rep2
hindbrain_rep1 hindbrain_rep1 hindbrain rep1
forebrain_rep1 forebrain_rep1 forebrain rep1
forebrain_rep2 forebrain_rep2 forebrain rep2

Cleanup

We need to do a little it of cleanup before contining. We need to remove peaks that appear in regions known to be false positive machines, called the blacklist regions. We also are interested in regions that could be affecting expression via chromatin accessability, so we’ll only consider peaks within a window around the transcription start sites of genes.

Blacklist region removal

A few peaks overlap the ENCODE blacklist regions, so we removed them from the analysis. There were 66 that overlapped with the blacklist regions.

Open region annotation

Here we annotate the peaks that we called with contextual genomic information. We can also see that some peaks are annotated in multiple regions, this can’t be helped as genes overlap so there are areas where multiple features of genes overlap.

>> preparing features information...         2020-05-13 14:16:22 
>> identifying nearest features...       2020-05-13 14:16:22 
>> calculating distance from peak to TSS...  2020-05-13 14:16:23 
>> assigning genomic annotation...       2020-05-13 14:16:23 
>> assigning chromosome lengths          2020-05-13 14:16:26 
>> done...                   2020-05-13 14:16:26 

Remove peaks not near TSS

Generally we only care about peaks that are close to gene, here we are pretty lenient and consider peaks that are within 1000 bases of a transcription start site of a gene.

Most peaks fall close to the TSS, but there are some that are very far away from the TSS. We’ll remove peaks that aren’t anywhere near the TSS for a gene from the analysis. We’ll keep only peaks within 1000 bases of a TSS. This leaves us with 11350 peaks to consider.

PCA

We can see there is quite a bit of variation between these samples that is not associated with the region the sample is from. hindbrain_rep2 is clustered far away from the other samples, and has almost double the input reads as the other samples. forebrain_rep1 is also off on its own, and that sample was an outlier for the percent of alignments in peaks, around 40% vs 20-30% for the other samples.

## mean-variance relationship Here we can observe a mean-variance relationship in the data, similar to other count-based NGS methods. At low counts we have a higher dispersion, meaning it is harder to estimate the true counts at low count levels.

Since we have a small number of replicates, we borrow information from peaks close to the same counts as other peaks to better estimate the variation of individual peaks. This plot shows the variation of individual peaks as the black dots, the red line fitting the mean-variance relationship and the final result of weighting the individual peak variation with the overall mean-variance line, which are the blue dots.

Differential affinity analysis

Here we look at the region_forebrain_vs_hindbrain coefficient, which will show us the peak affinity differences between the mouse forebrain and the mouse hindbrain samples.

>> preparing features information...         2020-05-13 14:16:35 
>> identifying nearest features...       2020-05-13 14:16:35 
>> calculating distance from peak to TSS...  2020-05-13 14:16:36 
>> assigning genomic annotation...       2020-05-13 14:16:36 
>> assigning chromosome lengths          2020-05-13 14:16:37 
>> done...                   2020-05-13 14:16:37 

There are 32 peaks using an adjusted p-value cutoff of 0.1. Here positive log2 fold changes are peaks with more counts in the forebrain than the hindbrain samples.

MA-plot

Volcano plot

Heatmap

Description of output files

  • start: start of peak
  • end: end of peak
  • width: width of peak
  • strand: strand of peak
  • seqname: chromosome of peak
  • peak: ID of peak
  • baseMean: mean count of reads at peak for all samples
  • estimate: log2 fold change of comparison
  • stderror: standard error of the log2 fold change
  • statistic: value of test statistic (Wald test)
  • p.value: unadjusted p-value
  • p.adjusted: adjusted p-value by BH correction
  • annotation: proximity to nearest gene feature
  • geneChr: chromosome of gene
  • geneStart: start coordinate of gene
  • geneEnd: end coordinate of gene
  • geneId: Entrez gene ID of gene
  • transcriptId: transcript ID
  • distanceToTSS: distance to nearest transcription start site
  • mgi_symbol: gene symbol from MGI