Overview

Goal

This is an example peak affinity analysis, looking for differences in peak affinity in the hindbrain and the forebrain of P0 mice, using two datasets from ENCODE ENCSR312LQX from the hindbrain and ENCSR310MLB from the forebrain. Each of these datasets has a single replicate.

	sample	region	replicate
hindbrain_rep2	hindbrain_rep2	hindbrain	rep2
hindbrain_rep1	hindbrain_rep1	hindbrain	rep1
forebrain_rep1	forebrain_rep1	forebrain	rep1
forebrain_rep2	forebrain_rep2	forebrain	rep2

Cleanup

We need to do a little it of cleanup before contining. We need to remove peaks that appear in regions known to be false positive machines, called the blacklist regions. We also are interested in regions that could be affecting expression via chromatin accessability, so we’ll only consider peaks within a window around the transcription start sites of genes.

Blacklist region removal

A few peaks overlap the ENCODE blacklist regions, so we removed them from the analysis. There were 66 that overlapped with the blacklist regions.

Open region annotation

Here we annotate the peaks that we called with contextual genomic information. We can also see that some peaks are annotated in multiple regions, this can’t be helped as genes overlap so there are areas where multiple features of genes overlap.

>> preparing features information...         2020-05-13 14:16:22 
>> identifying nearest features...       2020-05-13 14:16:22 
>> calculating distance from peak to TSS...  2020-05-13 14:16:23 
>> assigning genomic annotation...       2020-05-13 14:16:23 
>> assigning chromosome lengths          2020-05-13 14:16:26 
>> done...                   2020-05-13 14:16:26

Remove peaks not near TSS

Generally we only care about peaks that are close to gene, here we are pretty lenient and consider peaks that are within 1000 bases of a transcription start site of a gene.

Most peaks fall close to the TSS, but there are some that are very far away from the TSS. We’ll remove peaks that aren’t anywhere near the TSS for a gene from the analysis. We’ll keep only peaks within 1000 bases of a TSS. This leaves us with 11350 peaks to consider.

PCA

We can see there is quite a bit of variation between these samples that is not associated with the region the sample is from. hindbrain_rep2 is clustered far away from the other samples, and has almost double the input reads as the other samples. forebrain_rep1 is also off on its own, and that sample was an outlier for the percent of alignments in peaks, around 40% vs 20-30% for the other samples.

## mean-variance relationship Here we can observe a mean-variance relationship in the data, similar to other count-based NGS methods. At low counts we have a higher dispersion, meaning it is harder to estimate the true counts at low count levels.

Since we have a small number of replicates, we borrow information from peaks close to the same counts as other peaks to better estimate the variation of individual peaks. This plot shows the variation of individual peaks as the black dots, the red line fitting the mean-variance relationship and the final result of weighting the individual peak variation with the overall mean-variance line, which are the blue dots.

Differential affinity analysis

Here we look at the region_forebrain_vs_hindbrain coefficient, which will show us the peak affinity differences between the mouse forebrain and the mouse hindbrain samples.

>> preparing features information...         2020-05-13 14:16:35 
>> identifying nearest features...       2020-05-13 14:16:35 
>> calculating distance from peak to TSS...  2020-05-13 14:16:36 
>> assigning genomic annotation...       2020-05-13 14:16:36 
>> assigning chromosome lengths          2020-05-13 14:16:37 
>> done...                   2020-05-13 14:16:37

There are 32 peaks using an adjusted p-value cutoff of 0.1. Here positive log2 fold changes are peaks with more counts in the forebrain than the hindbrain samples.

MA-plot

Volcano plot

Heatmap

Description of output files

start: start of peak
end: end of peak
width: width of peak
strand: strand of peak
seqname: chromosome of peak
peak: ID of peak
baseMean: mean count of reads at peak for all samples
estimate: log2 fold change of comparison
stderror: standard error of the log2 fold change
statistic: value of test statistic (Wald test)
p.value: unadjusted p-value
p.adjusted: adjusted p-value by BH correction
annotation: proximity to nearest gene feature
geneChr: chromosome of gene
geneStart: start coordinate of gene
geneEnd: end coordinate of gene
geneId: Entrez gene ID of gene
transcriptId: transcript ID
distanceToTSS: distance to nearest transcription start site
mgi_symbol: gene symbol from MGI

Differential peak calling in forebrain/hindbrain of mice