This lesson is still being designed and assembled (Pre-Alpha version)

Gene Expression QTL Analysis

Introduction

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • What are expression quantitative trait loci (eQTL)?

  • How are eQTL used in genetic studies?

Objectives
  • Describe how an expression quantitative trait locus (eQTL) impacts gene expression.

  • Describe how eQTL are used in genetic studies.

Genome variability influences differential disease risk among individuals. Identifying the effects of genome variants is key to understanding disease biology or organismal phenotype. The effects of variants in many single-gene disorders, such as cystic fibrosis, are generally well-characterized and their disease biology well understood. In cystic fibrosis for example, mutations in the coding region of the CFTR gene alters the three-dimensional structure of resulting chloride channel proteins in epithelial cells, affecting not only chloride transport but also sodium and potassium transport in the lungs, pancreas and skin. The path from gene mutation to altered protein to disease phenotype is relatively simple and well understood.

Single-gene diseases like cystic fibrosis are relatively well understood. In cystic fibrosis, mutations in the coding region of the CFTR gene result in a defective protein, leading to excess mucus production that can damage the lungs and digestive system.

The most common human disorders, however, involve many genes interacting with the environment and with one another, a far more complicated path to follow than the path from a single gene mutation to its resulting protein to a disease phenotype. Cardiovascular disease, Alzheimer’s disease, arthritis, diabetes and cancer involve a complex interplay of genes with environment, and their mechanisms are not well understood. Genome-wide association studies (GWAS) associate genetic loci with disease traits, yet most GWAS variants for common diseases like diabetes are located in non-coding regions of the genome. These variants are therefore likely to be involved in gene regulation.

GWAS variants such as SNPs are often in non-coding regions of the genome, indicating that they regulate gene expression. Here a non-coding SNP influences expression of a gene, which in turn affects a disease phenotype or other outcome of interest.

Gene regulation controls the quantity, timing and locale of gene expression. Analyzing genome variants through cell or tissue gene expression is known as expression quantitative trait locus (eQTL) analysis. An eQTL is a locus associated with expression of a gene or genes. An eQTL explains some of the variation in gene expression. Specifically, genetic variants underlying eQTL explain variation in gene expression levels. eQTL studies can reveal the architecture of quantitative traits, connect DNA sequence variation to phenotypic variation, and shed light on transcriptional regulation and regulatory variation. Traditional analytic techniques like linkage and association mapping can be applied to thousands of gene expression traits (transcripts) in eQTL analysis, such that gene expression can be mapped in much the same way as a physiological phenotype like blood pressure or heart rate. Joining gene expression and physiological phenotypes with genetic variation can uncover genes with variants affecting disease phenotypes.

To the simple diagram above we’ll add two more details. Non-coding SNPs can regulate gene expression from nearby on the same chromosome (in cis):

Genetic variants like SNPs often affect gene expression locally near the gene that they regulate (in cis).

SNPs that affect gene expression from afar, often from a different chromosome from the gene that they regulate are called trans regulators.

Alternatively, SNPs often affect gene expression distally from the gene that they regulate (in trans), often from a different chromosome altogether.

In this lesson we revisit genetic mapping of quantitative traist and apply its methods to gene expression. The examples are from Genetic Drivers of Pancreatic Islet Function by Keller, et al. This study offers supporting evidence for type 2 diabetes-associated loci in human GWAS, most of which affect pancreatic islet function. The study assessed pancreatic islet gene expression in Diversity Outbred mice on either a regular chow or high-fat high-sugar diet. Islet mRNA abundance was quantified and analyzed, and the study identified more than 18,000 eQTL.

What are eQTL?

With a partner, discuss

  1. your understanding of eQTL.
  2. how mutations result in eQTL and ultimately in disease phenotypes.
  3. the methods that can be used for eQTL analysis.

Solution

Key Points

  • An expression quantitative trait locus (eQTL) explains part of the variation in gene expression.

  • Traditional linkage and association mapping can be applied to gene expression traits (transcripts).

  • Genetic variants, such as single nucleotide polymorphisms (SNPs), that underlie eQTL illuminate transcriptional regulation and variation.


Genetic Drivers of Pancreatic Islet Function

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How I can learn more from an example eQTL study?

  • What is the hypothesis of an example eQTL study?

Objectives
  • Describe an example eQTL study in Diversity Outbred mice.

  • State the hypothesis from an example eQTL study in Diversity Outbred mice.

Genome-wide association studies (GWAS) often identify variants in non-coding regions of the genome, indicating that regulation of gene expression predominates in common diseases like type II diabetes. Most of the more than 100 genetic loci associated with type II diabetes affect the function of pancreatic islets, which produce insulin for regulating blood glucose levels. Susceptibility to type II diabetes (T2D) increases with obesity, such that T2D-associated genetic loci operate mainly under conditions of obesity (See Keller, Mark P et al. “Genetic Drivers of Pancreatic Islet Function.” Genetics vol. 209,1 (2018): 335-356). Like most GWAS loci, the T2D-associated genetic loci identified from genome-wide association studies (GWAS) have very small effect sizes and odds ratios just slightly more than 1.

This study explored islet gene expression in diabetes. The authors hypothesized that gene expression changes in response to dietary challenge would reveal signaling pathways involved in stress responses. The expression of many genes often map to the same locus, indicating that expression of these genes is controlled in common. If their mRNAs encode proteins with common physiological functions, the function of the controlling gene(s) is revealed. Variation in expression of the controlling gene(s), rather than a genetic variant, can be tested as an immediate cause of a disease-related phenotype.

In this study, Diversity Outbred (DO) mice were fed a high-fat, high-sugar diet as a stressor, sensitizing the mice to develop diabetic traits. Body weight and plasma glucose, insulin, and triglyceride measurements were taken biweekly. Food intake could be measured since animals were individually housed. A glucose tolerance test at 18 weeks of age provided measurement of dynamic glucose and insulin changes at 5, 15, 30, 60 and 120 minutes after glucose administration. Area under the curve (AUC) was determined from these time points for both plasma glucose and insulin levels. Homeostatic model assessment (HOMA) of insulin resistance (IR) and pancreatic islet function (B) were determined after the glucose tolerance test was given. Islet cells were isolated from pancreas, and RNA extracted and libraries constructed from isolated RNA for gene expression measurements.

Genome scans were performed with the leave-one-chromosome-out (LOCO) method for kinship correction. Sex and experimental cohort (DO wave) were used as covariates. The results of one scan for insulin area under the curve (AUC) is shown below with a strong peak on chromosome 11. In this lesson, we will look into genes located under this peak.

LOD plot for insulin area under the curve

Key Points

  • .

  • .


Load and explore the data

Overview

Teaching: 15 min
Exercises: 30 min
Questions
  • What data are required for eqtl mapping?

Objectives
  • To provide an example and exploration of data used for eqtl mapping.

Load the libraries.

library(tidyverse)
library(knitr)
library(corrplot)
# the following analysis is from File S1 Attie_eQTL_paper_physiology.Rmd 
# compliments of Daniel Gatti. See Data Dryad entry for more information.

Physiological Phenotypes

The complete data used in these analyses are available from Data Dryad.

Load in the clinical phenotypes.

# load the data
load("../data/attie_DO500_clinical.phenotypes.RData")

See the data dictionary to see a description of each of these phenotypes. You can also view a table of the data dictionary.

pheno_clin_dict %>% 
  select(description, formula) %>% 
  kable()
  description formula
mouse Animal identifier. NA
sex Male (M) or female (F). NA
sac_date Date when mouse was sacrificed; used to compute days on diet, using birth dates. NA
partial_inflation Some mice showed a partial pancreatic inflation which would negatively effect the total number of islets collected from these mice. NA
coat_color Visual inspection by Kathy Schuler on coat color. NA
oGTT_date Date the oGTT was performed. NA
FAD_NAD_paired A change in the method that was used to make this measurement by Matt Merrins’ lab. Paired was the same islet for the value at 3.3mM vs. 8.3mM glucose; unpaired was where averages were used for each glucose concentration and used to compute ratio. NA
FAD_NAD_filter_set A different filter set was used on the microscope to make the fluorescent measurement; may have influenced the values. NA
crumblers Some mice store food in their bedding (hoarders) which would be incorrectly interpreted as consumed. NA
birthdate Birth date NA
diet_days Number of days. NA
num_islets Total number of islets harvested per mouse; negatively impacted by those with partial inflation. NA
Ins_per_islet Amount of insulin per islet in units of ng/ml/islet. NA
WPIC Derived number; equal to total number of islets times insulin content per islet. Ins_per_islet * num_islets
HOMA_IR_0min glucose*insulin/405 at time t=0 for the oGTT Glu_0min * Ins_0min / 405
HOMA_B_0min 360 * Insulin / (Glucose - 63) at time t=0 for the oGTT 360 * Ins_0min / (Glu_0min - 63)
Glu_tAUC Area under the curve (AUC) calculation without any correction for baseline differences. complicated
Ins_tAUC Area under the curve (AUC) calculation without any correction for baseline differences. complicated
Glu_6wk Plasma glucose with units of mg/dl; fasting. NA
Ins_6wk Plasma insulin with units of ng/ml; fasting. NA
TG_6wk Plasma triglyceride (TG) with units of mg/dl; fasting. NA
Glu_10wk Plasma glucose with units of mg/dl; fasting. NA
Ins_10wk Plasma insulin with units of ng/ml; fasting. NA
TG_10wk Plasma triglyceride (TG) with units of mg/dl; fasting. NA
Glu_14wk Plasma glucose with units of mg/dl; fasting. NA
Ins_14wk Plasma insulin with units of ng/ml; fasting. NA
TG_14wk Plasma triglyceride (TG) with units of mg/dl; fasting. NA
food_ave Average food consumption over the measurements made for each mouse. complicated
weight_2wk Body weight at indicated date; units are gm. NA
weight_6wk Body weight at indicated date; units are gm. NA
weight_10wk Body weight at indicated date; units are gm. NA
DOwave Wave (i.e., batch) of DO mice NA
# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)

Figure 1 Boxplots

pheno_clin %>%
  select(mouse, sex, starts_with("weight")) %>%
  gather(week, value, -mouse, -sex) %>%
  separate(week, c("tmp", "week")) %>%
  mutate(week = factor(week, levels = c("2wk", "6wk", "10wk"))) %>%
  ggplot(aes(week, value, fill = sex)) +
    geom_boxplot() +
    scale_y_log10() +
    labs(title = "Body Weight", y = "Weight")

plot of chunk bw_boxplot

pheno_clin %>%
  select(mouse, sex, starts_with("Glu")) %>%
  select(mouse, sex, ends_with("wk")) %>%
  gather(week, value, -mouse, -sex) %>%
  separate(week, c("tmp", "week")) %>%
  mutate(week = factor(week, levels = c("6wk", "10wk", "14wk"))) %>%
  ggplot(aes(week, value, fill = sex)) +
    geom_boxplot() +
    scale_y_log10() +
    labs(title = "Glucose", y = "Glucose")

plot of chunk glucose_boxplot

pheno_clin %>%
  select(mouse, sex, starts_with("Ins")) %>%
  select(mouse, sex, ends_with("wk")) %>%
  gather(week, value, -mouse, -sex) %>%
  separate(week, c("tmp", "week")) %>%
  mutate(week = factor(week, levels = c("6wk", "10wk", "14wk"))) %>%
  ggplot(aes(week, value, fill = sex)) +
    geom_boxplot() +
    scale_y_log10() +
    labs(title = "Insulin", y = "Insulin")

plot of chunk insulin_boxplot

pheno_clin %>%
  select(mouse, sex, starts_with("TG")) %>%
  select(mouse, sex, ends_with("wk")) %>%
  gather(week, value, -mouse, -sex) %>%
  separate(week, c("tmp", "week")) %>%
  mutate(week = factor(week, levels = c("6wk", "10wk", "14wk"))) %>%
  ggplot(aes(week, value, fill = sex)) +
    geom_boxplot() +
    scale_y_log10() +
    labs(title = "TG", y = "TG")

plot of chunk trig_boxplot

pheno_clin %>%
  select(mouse, sex, num_islets:Ins_tAUC, food_ave) %>%
  gather(phenotype, value, -mouse, -sex) %>%
  ggplot(aes(sex, value, fill = sex)) +
    geom_boxplot() +
    scale_y_log10() +
    facet_wrap(~phenotype, scales = "free_y")

plot of chunk fig1_boxplots

QA/QC

Log transform and standardize each phenotype. Consider setting points that are more than 5 std. dev. from the mean to NA. Only do this if the final distribution doesn’t look skewed.

pheno_clin_log = pheno_clin %>%
                   mutate_if(is.numeric, log)
pheno_clin_std = pheno_clin_log %>%
                   select(mouse, num_islets:weight_10wk) %>%
                   mutate_if(is.numeric, scale)
pheno_clin_std %>%
  select(num_islets:weight_10wk) %>%
  gather(phenotype, value) %>%
  ggplot(aes(x = phenotype, y = value)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(title = "Distribution of Standardized Phenotypes")

plot of chunk pheno_std

Tests for sex, wave and diet_days.

tmp = pheno_clin_log %>%
        select(mouse, sex, DOwave, diet_days, num_islets:weight_10wk) %>%
        gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(phenotype) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ)
Error in `mutate()`:
! Problem while computing `summ = map(model, tidy)`.
ℹ The error occurred in group 1: phenotype = "food_ave".
Caused by error in `as_mapper()`:
! object 'tidy' not found
# kable(tmp, caption = "Effects of Sex, Wave & Diet Days on Phenotypes")
tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~phenotype) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)
Error in `filter()`:
! Problem while computing `..1 = term != "(Intercept)"`.
ℹ The error occurred in group 1: phenotype = "food_ave".
Caused by error in `mask$eval_all_filter()`:
! object 'term' not found

Weight vs. Food Intake

pheno_clin_log %>%
  select(mouse, sex, food_ave:weight_10wk) %>%
  gather(phenotype, value, -mouse, -sex, -food_ave) %>%
  separate(phenotype, c("phenotype", "week")) %>%
  mutate(week = factor(week, levels = c("2wk", "6wk", "10wk"))) %>%
  ggplot(aes(food_ave, value, color = sex)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Food Intake vs. Body Weight", y = "log(Body Weight)") + 
    facet_wrap(~week)

plot of chunk bw_vs_food

model_fxn = function(df) { lm(value ~ sex*food_ave, data = df) }
tmp = pheno_clin_log %>%
  select(mouse, sex, food_ave:weight_10wk) %>%
  gather(phenotype, value, -mouse, -sex, -food_ave) %>%
  separate(phenotype, c("phenotype", "week")) %>%
  mutate(week = factor(week, levels = c("2wk", "6wk", "10wk"))) %>%
  group_by(week) %>%
  nest() %>%
  mutate(model = map(data, model_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) %>%
  filter(term != "(Intercept)") %>%
  mutate(p.adj = p.adjust(p.value))
Error in `mutate()`:
! Problem while computing `summ = map(model, tidy)`.
ℹ The error occurred in group 1: week = 2wk.
Caused by error in `as_mapper()`:
! object 'tidy' not found
# kable(tmp, caption = "Effects of Sex and Food Intake on Body Weight")

Correlation Plots

Females

tmp = pheno_clin_log %>% 
        filter(sex == "F") %>%
        select(starts_with(c("Ins", "Glu", "TG")))
tmp = cor(tmp, use = "pairwise")
corrplot.mixed(tmp, upper = "ellipse", lower = "number", 
               main = "Female Clinical Phenotype Correlation")
corrplot.mixed(tmp, upper = "ellipse", lower = "number", 
               main = "Female Clinical Phenotype Correlation")

plot of chunk female_corr_plot

Males

tmp = pheno_clin_log %>% 
        filter(sex == "M") %>%
        select(starts_with(c("Ins", "Glu", "TG")))
tmp = cor(tmp, use = "pairwise")
corrplot.mixed(tmp, upper = "ellipse", lower = "number", 
               main = "Male Clinical Phenotype Correlation")
corrplot.mixed(tmp, upper = "ellipse", lower = "number", 
               main = "Male Clinical Phenotype Correlation")

plot of chunk male_corr_plot

Gene Expression Phenotypes

# load the expression data along with annotations and metadata
load("../data/dataset.islet.rnaseq.RData")
names(dataset.islet.rnaseq)
[1] "annots"        "covar"         "covar.factors" "datatype"     
[5] "display.name"  "expr"          "lod.peaks"     "raw"          
[9] "samples"      
# look at gene annotations
dataset.islet.rnaseq$annots[1:6,]
                              gene_id symbol chr     start       end strand
ENSMUSG00000000001 ENSMUSG00000000001  Gnai3   3 108.10728 108.14615     -1
ENSMUSG00000000028 ENSMUSG00000000028  Cdc45  16  18.78045  18.81199     -1
ENSMUSG00000000037 ENSMUSG00000000037  Scml2   X 161.11719 161.25821      1
ENSMUSG00000000049 ENSMUSG00000000049   Apoh  11 108.34335 108.41440      1
ENSMUSG00000000056 ENSMUSG00000000056   Narf  11 121.23725 121.25586      1
ENSMUSG00000000058 ENSMUSG00000000058   Cav2   6  17.28119  17.28911      1
                      middle nearest.marker.id        biotype      module
ENSMUSG00000000001 108.12671       3_108090236 protein_coding   darkgreen
ENSMUSG00000000028  18.79622       16_18817262 protein_coding        grey
ENSMUSG00000000037 161.18770       X_161182677 protein_coding        grey
ENSMUSG00000000049 108.37887      11_108369225 protein_coding greenyellow
ENSMUSG00000000056 121.24655      11_121200487 protein_coding       brown
ENSMUSG00000000058  17.28515        6_17288298 protein_coding       brown
                   hotspot
ENSMUSG00000000001    <NA>
ENSMUSG00000000028    <NA>
ENSMUSG00000000037    <NA>
ENSMUSG00000000049    <NA>
ENSMUSG00000000056    <NA>
ENSMUSG00000000058    <NA>
# look at raw counts
dataset.islet.rnaseq$raw[1:6,1:6]
      ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000037
DO021              10247                108                 29
DO022              11838                187                 35
DO023              12591                160                 19
DO024              12424                216                 30
DO025              10906                 76                 21
DO026              12248                110                 34
      ENSMUSG00000000049 ENSMUSG00000000056 ENSMUSG00000000058
DO021                 15                120                703
DO022                 18                136                747
DO023                 18                275               1081
DO024                 81                160                761
DO025                  7                163                770
DO026                 18                204                644
# look at sample metadata
# summarize mouse sex, birth dates and DO waves
table(dataset.islet.rnaseq$samples[, c("sex", "birthdate")])
   birthdate
sex 2014-05-29 2014-10-15 2015-02-25 2015-09-22
  F         46         46         49         47
  M         45         48         50         47
table(dataset.islet.rnaseq$samples[, c("sex", "DOwave")])
   DOwave
sex  1  2  3  4  5
  F 46 46 49 47  0
  M 45 48 50 47  0

In order to make reasonable gene comparisons between samples, the count data need to be normalized. In the quantile-quantile (Q-Q) plot below, count data for the first gene are plotted over a diagonal line tracing a normal distribution for those counts. Notice that most of the count data values lie off of this line, indicating that these gene counts are not normally distributed.

plot of chunk view_manual_qqplot_raw

The same plot can be drawn as shown below. The diagonal line represents a standard normal distribution with mean 0 and standard deviation 1. Count data values are plotted against this standard normal distribution.

qqnorm(dataset.islet.rnaseq$raw[,1], 
     xlab="Theoretical normal percentiles", 
     ylab="Count percentiles",
     main="Count distribution for gene ENSMUSG00000000001")
qqline(dataset.islet.rnaseq$raw[,1]) 

plot of chunk view_qqplot_raw

Q-Q plots for the first six genes show that count data for these genes are not normally distributed. They are also not on the same scale. The y-axis values for each subplot range to 20,000 counts in the first subplot, 250 in the second, 80 in the third, and so on.

par(mfrow=c(2,3))
for (i in 1:6) {
  qqnorm(dataset.islet.rnaseq$raw[,i])
  qqline(dataset.islet.rnaseq$raw[,i])
  }

plot of chunk view_qqplots_raw

Q-Q plots of the normalized expression data for the first six genes show that the data values match the diagonal line well, meaning that they are now normally distributed. They are also all on the same scale now as well.

par(mfrow=c(2,3))
for (i in 1:6) {
  qqnorm(dataset.islet.rnaseq$expr[,i])
  qqline(dataset.islet.rnaseq$expr[,i])
  }

plot of chunk view_qqplots_normalized

Boxplots of raw counts for 5 example genes are shown at left below. Notice that the median count values (horizontal black bar in each boxplot) are not comparable between the genes because the counts are not on the same scale. At right, boxplots for the same 5 genes show normalized count data on the same scale.

par(las=2, mfrow=c(1,2))
boxplot(dataset.islet.rnaseq$raw[,c(5:9)], 
        main="Raw count distributions for 5 example genes")
boxplot(dataset.islet.rnaseq$expr[,c(5:9)], 
        main="Normalized count distributions for the same 5 example genes")

plot of chunk view_example_boxplots

# look at normalized counts
dataset.islet.rnaseq$expr[1:6,1:6]
      ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000037
DO021        -0.09858728         0.19145703          0.5364855
DO022         0.97443030         2.41565800          1.1599398
DO023         0.75497867         1.62242658         -0.4767954
DO024         1.21299720         2.79216889          0.7205211
DO025         1.62242658        -0.27918258          0.3272808
DO026         0.48416010         0.03281516          0.8453674
      ENSMUSG00000000049 ENSMUSG00000000056 ENSMUSG00000000058
DO021         0.67037652         -1.4903750          0.6703765
DO022         1.03980755         -1.1221937          0.9744303
DO023         0.89324170          0.2317220          1.8245906
DO024         2.15179362         -0.7903569          0.9957498
DO025        -0.03281516         -0.2587715          1.5530179
DO026         0.86427916         -0.3973634          0.2791826

The expression data loaded provides LOD peaks for the eQTL analyses performed in this study. As a preview of what you will be doing next, extract the LOD peaks for chromosome 11.

# look at LOD peaks
dataset.islet.rnaseq$lod.peaks[1:6,]
            annot.id   marker.id chrom        pos       lod
1 ENSMUSG00000037922 3_136728699     3 136.728699 11.976856
2 ENSMUSG00000037926 2_164758936     2 164.758936  7.091543
3 ENSMUSG00000037926 5_147178504     5 147.178504  6.248598
4 ENSMUSG00000037933 5_147253583     5 147.253583  8.581871
5 ENSMUSG00000037933  13_6542783    13   6.542783  6.065497
6 ENSMUSG00000037935 11_99340415    11  99.340415  8.089051
# look at chromosome 11 LOD peaks
chr11_peaks <- dataset.islet.rnaseq$annots %>% 
   select(gene_id, chr) %>% 
   filter(chr=="11") %>%
   left_join(dataset.islet.rnaseq$lod.peaks, 
             by = c("chr" = "chrom", "gene_id" = "annot.id")) 

# look at the first several rows of chromosome 11 peaks
head(chr11_peaks)
             gene_id chr    marker.id       pos       lod
1 ENSMUSG00000000049  11 11_109408556 109.40856  8.337996
2 ENSMUSG00000000056  11 11_121434933 121.43493 11.379449
3 ENSMUSG00000000093  11  11_74732783  74.73278  9.889119
4 ENSMUSG00000000120  11  11_95962181  95.96218 11.257591
5 ENSMUSG00000000125  11         <NA>        NA        NA
6 ENSMUSG00000000126  11  11_26992236  26.99224  6.013533
# how many rows?
dim(chr11_peaks)
[1] 1810    5
# how many rows have LOD scores?
chr11_peaks %>% filter(!is.na(lod)) %>% dim()
[1] 1208    5
# sort chromosome 11 peaks by LOD score
chr11_peaks %>% arrange(desc(lod)) %>% head()
             gene_id chr   marker.id      pos      lod
1 ENSMUSG00000020268  11 11_55073940 55.07394 189.2332
2 ENSMUSG00000020333  11 11_53961279 53.96128 147.8134
3 ENSMUSG00000017404  11 11_98062374 98.06237 136.0430
4 ENSMUSG00000093483  11 11_83104215 83.10421 135.1067
5 ENSMUSG00000058546  11 11_78204410 78.20441 132.3365
6 ENSMUSG00000078695  11 11_97596132 97.59613 131.8005
# range of LOD scores and positions
range(chr11_peaks$lod, na.rm = TRUE)
[1]   6.000975 189.233211
range(chr11_peaks$pos, na.rm = TRUE)
[1]   3.126009 122.078650
# view LOD scores by position
chr11_peaks %>% arrange(desc(lod)) %>% 
  ggplot(aes(pos, lod)) + geom_point()
Warning: Removed 602 rows containing missing values (geom_point).

plot of chunk lod_peaks

Key Points


Review Mapping Steps

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • What are the steps involved in running QTL mapping in Diversity Outbred mice?

Objectives
  • Reviewing the QTL mapping steps learnt in the first two days

Before we begin to run QTL mapping on gene expression data to find eQTLs, let’s review the main QTL mapping steps that we learnt in the QTL mapping course. As a reminder, we are using data from the Keller et al. paper that are freely available to download.

Make sure that you are in your main directory. If you’re not sure where you are working right now, you can check your working directory with getwd(). If you are not in your main directory, run setwd("code") in the Console or Session -> Set Working Directory -> Choose Directory in the RStudio menu to set your working directory to the code directory.

Load Libraries

Below are the neccessary libraries that we require for this review. They are already installed on your machines so go ahead an load them using the following code:

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)

Load Data

The data for this tutorial has been saved as several R binary files which contain several data objects, including phenotypes and mapping data as well as the genoprobs.

Load the data in now by running the following command in your new script.

##phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

##mapping data
load("../data/attie_DO500_mapping.data.RData")

##genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

We loaded in a few data objects today. Check the Environment tab to see what was loaded. You should see a phenotypes object called pheno_clin with 500 observations (in rows) of 32 variables (in columns), an object called map (physical map), and an object called probs (genotype probabilities).

Phenotypes

In this data set, we have 20 phenotypes for 500 Diversity Outbred mice. pheno_clin is a data frame containing the phenotype data as well as covariate information. Click on the triangle to the left of pheno_clin in the Environment pane to view its contents. Run names(pheno_clin) to list the variables.

pheno_clin_dict is the phenotype dictionary. This data frame contains information on each variable in pheno_clin, including name,short name, pheno_type, formula (if used) and description.

Since the paper is interested in type 2 diabetes and insulin secretion, we will choose insulin tAUC (area under the curve (AUC) which was calculated without any correction for baseline differences) for this review

Many statistical models, including the QTL mapping model in qtl2, expect that the incoming data will be normally distributed. You may use transformations such as log or square root to make your data more normally distributed. Here, we will log transform the data.

Here is a histogram of the untransformed data.

hist(pheno_clin$Ins_tAUC, main = "Insulin Area Under the Curve")

plot of chunk hist_untransformed

Now, let’s apply the log() function to this data to correct the distribution.

pheno_clin$Ins_tAUC_log <- log(pheno_clin$Ins_tAUC)

Now, let’s make a histogram of the log-transformed data.

hist(pheno_clin$Ins_tAUC_log, main = "insulin tAUC (log-transformed)")

plot of chunk hist_log_transform

This looks much better!

The Marker Map

The marker map for each chromosome is stored in the map object. This is used to plot the LOD scores calculated at each marker during QTL mapping. Each list element is a numeric vector with each marker position in megabases (Mb). Here we are using the 69K grid marker file. Often when there are numerous genotype arrays used in a study, we interoplate all to a 69k grid file so we are able to combine all samples across different array types.

Look at the structure of map in the Environment tab by clicking the triangle to the left or by running str(map) in the Console.

Genotype probabilities

Each element of probs is a 3 dimensional array containing the founder allele dosages for each sample at each marker on one chromosome. These are the 8 state alelle probabilities (not 32) using the 69k marker grid for same 500 DO mice that also have clinical phenotypes. We have already calculated genotype probabilities for you, so you can skip the step for calculating genotype probabilities and the optional step for calculating allele probabilities.

Next, we look at the dimensions of probs for chromosome 1:

dim(probs[[1]])
[1]  500    8 4711
plot_genoprob(probs, map, ind = 1, chr = 1)

plot of chunk geno_plot

In the plot above, the founder contributions, which range between 0 and 1, are colored from white (= 0) to black (= 1.0). A value of ~0.5 is grey. The markers are on the X-axis and the eight founders (denoted by the letters A through H) on the Y-axis. Starting at the left, we see that this sample has genotype GH because the rows for G & H are grey, indicating values of 0.5 for both alleles. Moving along the genome to the right, the genotype becomes HH where where the row is black indicating a value of 1.0. This is followed by CD, DD, DG, AD, AH, CE, etc. The values at each marker sum to 1.0.

Kinship Matrix

The kinship matrix has already been calculated and loaded in above

n_samples <- 50
heatmap(K[[1]][1:n_samples, 1:n_samples])

plot of chunk kinship_probs

The figure above shows kinship between all pairs of samples. Light yellow indicates low kinship and dark red indicates higher kinship. Orange values indicate varying levels of kinship between 0 and 1. The dark red diagonal of the matrix indicates that each sample is identical to itself. The orange blocks along the diagonal may indicate close relatives (i.e. siblings or cousins).

Covariates

Next, we need to create additive covariates that will be used in the mapping model. First, we need to see which covariates are significant. In the data set, we have sex, DOwave (Wave (i.e., batch) of DO mice) and diet_days (number of days on diet) to test whether there are any gender, batch or diet effects.

First we are going to select out the covariates and phenotype we want from pheno_clin data frame. Then reformat these selected variables into a long format (using the gather command) grouped by the phenotypes (in this case, we only have Ins_tAUC_log)

### Tests for sex, wave and diet_days.

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, Ins_tAUC_log) %>%
        gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(phenotype) %>%
        nest()

Let’s create a linear model function that will regress the covariates and the phenotype.

mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}

Now let’s apply that function to the data object tmp that we created above.

tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ)

tmp
# A tibble: 7 × 8
# Groups:   phenotype [1]
  phenotype    data               model  term  estimate std.e…¹ stati…²  p.value
  <chr>        <list>             <list> <chr>    <dbl>   <dbl>   <dbl>    <dbl>
1 Ins_tAUC_log <tibble [500 × 5]> <lm>   (Int…  4.49    0.447    10.1   1.10e-21
2 Ins_tAUC_log <tibble [500 × 5]> <lm>   sexM   0.457   0.0742    6.17  1.48e- 9
3 Ins_tAUC_log <tibble [500 × 5]> <lm>   DOwa… -0.294   0.118    -2.49  1.33e- 2
4 Ins_tAUC_log <tibble [500 × 5]> <lm>   DOwa… -0.395   0.118    -3.36  8.46e- 4
5 Ins_tAUC_log <tibble [500 × 5]> <lm>   DOwa… -0.176   0.118    -1.49  1.38e- 1
6 Ins_tAUC_log <tibble [500 × 5]> <lm>   DOwa… -0.137   0.118    -1.16  2.46e- 1
7 Ins_tAUC_log <tibble [500 × 5]> <lm>   diet…  0.00111 0.00346   0.322 7.48e- 1
# … with abbreviated variable names ¹​std.error, ²​statistic
tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~phenotype) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

plot of chunk covariates_sig_function_plots

We can see that sex and DOwave (especially the third batch) are significant. Here DOwave is the group or batch number as not all mice are in the experient at the same time. Because of this, we now have to correct for it.

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave, data = pheno_clin)

REMEMBER: the sample IDs must be in the rownames of pheno, addcovar, genoprobs and K. qtl2 uses the sample IDs to align the samples between objects.

Performing a genome scan

At each marker on the genotyping array, we will fit a model that regresses the phenotype (insulin secretion AUC) on covariates and the founder allele proportions. Note that this model will give us an estimate of the effect of each founder allele at each marker. There are eight founder strains that contributed to the DO, so we will get eight founder allele effects.

Now let’s perform the genome scan, using the scan1 function.

qtl = scan1(genoprobs = probs, 
            pheno = pheno_clin[,"Ins_tAUC_log", drop = FALSE], 
            kinship = K, 
            addcovar = covar)

Next, we plot the genome scan.

plot_scan1(x = qtl, map = map, lodcolumn = "Ins_tAUC_log")
  abline(h = 6, col = 2, lwd = 2)

plot of chunk qtl_plot

We can see a very strong peak on chromosome 11 with no other distibuishable peaks.

Finding LOD peaks

We can find all of the peaks above the significance threshold using the find_peaks function.

The support interval is determined using the Bayesian Credible Interval and represents the region most likely to contain the causative polymorphism(s). We can obtain this interval by adding a prob argument to find_peaks. We pass in a value of 0.95 to request a support interval that contains the causal SNP 95% of the time.

In case there are multiple peaks are found on a chromosome, the peakdrop argument allows us to the find the peak which has a certain LOD score drop between other peaks.

Let’s find LOD peaks. Here we are choosing to find peaks with a LOD score greater than 6.

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl, map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, 
                   prob = 0.95)

kable(peaks %>% dplyr::select (-lodindex) %>% 
        arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn chr pos lod ci_lo ci_hi
Ins_tAUC_log 11 83.59467 11.25884 83.58553 84.95444

Challenge

Now choose another phenotype in pheno_clin and perform the same steps.
1). Check the distribution. Does it need transforming? If so, how would you do it? 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship provided.
4). Plot the genome scan for this phenotype.
5). Find the peaks above LOD score of 6.

Solution

Replace <pheno name> with your choice of phenotype

#1).

hist(pheno_clin$<pheno name>)

pheno_clin$<pheno name>_log <- log(pheno_clin$<pheno name>)

hist(pheno_clin$<pheno name>_log)


#2).

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, <pheno name>) %>%
        gather(phenotype, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(phenotype) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 

tmp

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~phenotype) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

#3).

qtl = scan1(genoprobs = probs, 
           pheno = pheno_clin[,"<pheno name>", drop = FALSE], 
        kinship = K, 
        addcovar = covar)

#4).

plot_scan1(x = qtl, map = map, lodcolumn = "<pheno name>")
abline(h = 6, col = 2, lwd = 2)

#5). 

peaks = find_peaks(scan1_output = qtl, map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)

Key Points

  • To understand the key steps running a QTL mapping analysis


Mapping A Single Gene Expression Trait

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How do I map one gene expression trait?

Objectives
  • QTL mapping of an expression data set

In this tutorial, we are going to use the gene expression data as our phenotype to map QTLs. That is, we are looking to find any QTLs (cis or trans) that are responsible for variation in gene expression across samples.

We are using the same directory as our previous tutorial, which you can check by typing getwd() in the Console. If you are not in the correct directory, type in setwd("code") in the console or Session -> Set Working Directory -> Choose Directory in the RStudio menu to set your working directory to the main directory.

If you have started a new R session, you will need to load the libraries again. If you haven’t, the libraries listed below are already loaded.

Load Libraries

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)

Load Data

In this lesson, we are loading in gene expression data for 21,771 genes: attie_DO500_expr.datasets.RData that includes normalised and raw gene expression data. dataset.islet.rnaseq is the dataset you can download directly from Dryad. Again, if you are using the same R session, you will not need to the load the mapping, phenotypes and genotype probabilities data, again.

# expression data
load("../data/attie_DO500_expr.datasets.RData")

# data from paper
load("../data/dataset.islet.rnaseq.RData")

# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

# mapping data
load("../data/attie_DO500_mapping.data.RData")

# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

Expression Data

Raw gene expression counts are in the counts data object. These counts have been normalised and saved in the norm data object. More information is about normalisation is here.

To view the contents of either one of these data , click the table on the right hand side either norm or counts in the Environment tab. If you type in names(counts), you will see the names all start with ENSMUSG. These are Ensembl IDs. If we want to see which gene these IDs correspond to, type in dataset.islet.rnaseq$annots[dataset.islet.rnaseq$annots, which gives information about each gene, including ensemble id, gene symbol as well as start & stop location of the gene and chromsome on which the gene lies.

Because we are working with the insulin tAUC phenotype, let’s map the expression counts for Hnf1b which is known to influence this phenotype is these data. First, we need to find the Ensembl ID for this gene:

dataset.islet.rnaseq$annots[dataset.islet.rnaseq$annots$symbol == "Hnf1b",]
                              gene_id symbol chr    start      end strand
ENSMUSG00000020679 ENSMUSG00000020679  Hnf1b  11 83.85006 83.90592      1
                     middle nearest.marker.id        biotype       module
ENSMUSG00000020679 83.87799       11_84097611 protein_coding midnightblue
                   hotspot
ENSMUSG00000020679    <NA>

We can see that the ensembl ID of Hnf1b is ENSMUSG00000020679. If we check the distribution for Hnf1b expression data between the raw and normalised data, we can see there distribution has been corrected. Here is the distribution of the raw counts:

hist(counts$ENSMUSG00000020679, main = "Hnf1b")

plot of chunk hist_untransformed

and here is the distribution of the normalised counts:

hist(norm$ENSMUSG00000020679, main = "Hnf1b")

plot of chunk hist_transformed

The histogram indicates that distribution of these counts are normalised.

The Marker Map

We are using the same marker map as in the previous lesson

Genotype probabilities

We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.

Kinship Matrix

We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.

Covariates

Now let’s add the necessary covariates. For Hnf1b expression data, let’s see which covariates are significant.

###merging covariate data and expression data to test for sex, wave and diet_days.

cov.counts <- merge(covar, norm, by=c("row.names"), sort=F)

#testing covairates on expression data

tmp = cov.counts %>%
        dplyr::select(mouse, sex, DOwave, diet_days, ENSMUSG00000020679) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 
#  kable(tmp, caption = "Effects of Sex, Wave & Diet Days on Expression")

tmp
# A tibble: 4 × 8
# Groups:   expression [1]
  expression         data     model  term      estimate std.e…¹ stati…²  p.value
  <chr>              <list>   <list> <chr>        <dbl>   <dbl>   <dbl>    <dbl>
1 ENSMUSG00000020679 <tibble> <lm>   (Interce… -1.70    0.506     -3.36 8.72e- 4
2 ENSMUSG00000020679 <tibble> <lm>   sexM      -0.199   0.0824    -2.42 1.60e- 2
3 ENSMUSG00000020679 <tibble> <lm>   DOwave     0.510   0.0370    13.8  3.73e-35
4 ENSMUSG00000020679 <tibble> <lm>   diet_days  0.00403 0.00386    1.04 2.97e- 1
# … with abbreviated variable names ¹​std.error, ²​statistic
tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression) +
    labs(title = "Significance of Sex, Wave & Diet Days on Expression") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

plot of chunk covariatesa_sig

We can see that sex and DOwave are significant. Here DOwave is the group or batch number as not all mice were submitted for genotyping at the same time. Because of this, we now have to correct for it.

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)
pheno_clin$diet_days = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)

Performing a genome scan

Now let’s perform the genome scan!

qtl = scan1(genoprobs = probs, 
            pheno = norm[,"ENSMUSG00000020679", drop = FALSE], 
            kinship = K, 
            addcovar = covar, 
            cores = 2)

Let’s plot it

plot_scan1(x = qtl, 
           map = map, 
           lodcolumn = "ENSMUSG00000020679", 
           main = colnames(qtl))
abline(h = 6, col = 2, lwd = 2)

plot of chunk qtl_plot

Finding LOD peaks

Let’s find LOD peaks

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl, 
                   map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, prob = 0.95)
kable(peaks %>% 
        dplyr::select(-lodindex) %>% 
        arrange(chr, pos), caption = "Phenotype QTL Peaks with LOD >= 6")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn chr pos lod ci_lo ci_hi
ENSMUSG00000020679 9 109.08150 6.402594 106.66772 111.56048
ENSMUSG00000020679 11 84.40138 36.878942 83.64714 84.40138
ENSMUSG00000020679 15 68.91018 6.121254 67.82232 70.38004
ENSMUSG00000020679 18 70.89669 6.042135 33.80428 71.96869

Challenge

Now choose another gene expression trait in norm data object and perform the same steps.
1). Check the distribution of the raw counts and normalised counts 2). Are there any sex, batch, diet effects?
3). Run a genome scan with the genotype probabilities and kinship provided.
4). Plot the genome scan for this gene.
5). Find the peaks above LOD score of 6.

Solution

Replace <ensembl id> with your choice of gene expression trait

#1).

hist(pheno_clin$<ensembl id>)

pheno_clin$<ensembl id>_log <- log(pheno_clin$<ensembl id>)

hist(pheno_clin$<ensembl id>_log)


#2).

tmp = pheno_clin %>%
        dplyr::select(mouse, sex, DOwave, diet_days, <ensembl id>) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 

tmp

tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression) +
    labs(title = "Significance of Sex, Wave & Diet Days on Phenotypes") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

#3).

qtl = scan1(genoprobs = probs, 
           pheno = pheno_clin[,"<ensembl id>", drop = FALSE], 
        kinship = K, 
        addcovar = covar)

#4).

plot_scan1(x = qtl, map = map, lodcolumn = "<ensembl id>")
abline(h = 6, col = 2, lwd = 2)

#5). 

peaks = find_peaks(scan1_output = qtl, map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)

Key Points

  • To run a QTL analysis for expression data


Mapping Many Gene Expression Traits

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How do I map many genes?

Objectives
  • To map several genes at the same time

Load Libraries

library(tidyverse)
library(knitr)
library(broom)
library(qtl2)
library(qtl2ggplot)
library(RColorBrewer)

source("../code/gg_transcriptome_map.R")
source("../code/qtl_heatmap.R")

Before we begin this lesson, we need to create another directory called results in our main directory. You can do this by clicking on the “Files” tab and navigate into the main directory. Then select “New Folder” and name it “results”.

Load Data

# expression data
load("../data/attie_DO500_expr.datasets.RData")

# data from paper
load("../data/dataset.islet.rnaseq.RData")

# phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

# mapping data
load("../data/attie_DO500_mapping.data.RData")

# genotype probabilities
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

Data Selection

For this lesson, lets choose a random set of 50 gene expression phenotypes.

genes = colnames(norm)

sams <- sample(length(genes), 50, replace = FALSE, prob = NULL)
genes <- genes[sams]

gene.info <- dataset.islet.rnaseq$annots[genes,]
rownames(gene.info) = NULL
kable(gene.info)
gene_id symbol chr start end strand middle nearest.marker.id biotype module hotspot
ENSMUSG00000048865 Arhgap30 1 171.388954 171.410298 1 171.399626 1_171385294 protein_coding darkorange NA
ENSMUSG00000035910 Dcdc2a 13 25.056004 25.210706 1 25.133355 13_25119107 protein_coding brown chr7
ENSMUSG00000022536 Glyr1 16 5.013906 5.049910 -1 5.031908 16_5060435 protein_coding yellow NA
ENSMUSG00000014980 Tsen15 1 152.370735 152.386688 -1 152.378712 1_152383052 protein_coding turquoise NA
ENSMUSG00000022668 Gtpbp8 16 44.736768 44.746363 -1 44.741566 16_44738910 protein_coding green NA
ENSMUSG00000086968 4933431E20Rik 3 107.888850 107.896213 -1 107.892532 3_107868965 processed_transcript grey NA
ENSMUSG00000027720 Il2 3 37.120523 37.125959 -1 37.123241 3_37117746 protein_coding grey NA
ENSMUSG00000044948 Wdr96 19 47.737561 47.919287 -1 47.828424 19_47839103 protein_coding grey NA
ENSMUSG00000027570 Col9a3 2 180.597790 180.622189 1 180.609990 2_180571050 protein_coding magenta NA
ENSMUSG00000068015 Lrch1 14 74.754674 74.947877 -1 74.851276 14_74853309 protein_coding skyblue3 NA
ENSMUSG00000021749 Oit1 14 8.348948 8.378763 -1 8.363856 14_8142958 protein_coding yellowgreen NA
ENSMUSG00000026103 Gls 1 52.163448 52.233232 -1 52.198340 1_52214099 protein_coding turquoise NA
ENSMUSG00000091277 Gm1818 12 48.554182 48.559971 -1 48.557076 12_48556291 protein_coding grey NA
ENSMUSG00000028882 Ppp1r8 4 132.826929 132.843169 -1 132.835049 4_132829993 protein_coding turquoise NA
ENSMUSG00000027506 Tpd52 3 8.928626 9.004723 -1 8.966674 3_8963976 protein_coding darkgreen NA
ENSMUSG00000037577 Ephx3 17 32.183770 32.189549 -1 32.186660 17_32195476 protein_coding yellow NA
ENSMUSG00000022565 Plec 15 76.170974 76.232574 -1 76.201774 15_76166226 protein_coding midnightblue NA
ENSMUSG00000032551 1110059G10Rik 9 122.945089 122.951000 -1 122.948044 9_122947512 protein_coding turquoise NA
ENSMUSG00000028988 Ctnnbip1 4 149.518236 149.566437 1 149.542336 4_149549101 protein_coding grey60 NA
ENSMUSG00000029632 Ndufa4 6 11.900373 11.907446 -1 11.903910 6_11911091 protein_coding grey NA
ENSMUSG00000040697 Dnajc16 4 141.760189 141.790931 -1 141.775560 4_141789962 protein_coding darkturquoise NA
ENSMUSG00000021250 Fos 12 85.473890 85.477273 1 85.475582 12_85483011 protein_coding darkolivegreen NA
ENSMUSG00000074800 Gm4149 13 75.644864 75.645226 1 75.645045 13_75598665 pseudogene grey NA
ENSMUSG00000022449 Adamts20 15 94.270163 94.465418 -1 94.367790 15_94372429 protein_coding grey NA
ENSMUSG00000006301 Tmbim1 1 74.288247 74.305622 -1 74.296934 1_74253876 protein_coding grey NA
ENSMUSG00000096183 Gm3727 14 7.255889 7.264714 -1 7.260302 14_7247401 protein_coding grey NA
ENSMUSG00000036390 Gadd45a 6 67.035096 67.037457 -1 67.036276 6_67037091 protein_coding lightgreen NA
ENSMUSG00000037652 Phc3 3 30.899295 30.969415 -1 30.934355 3_30935517 protein_coding pink NA
ENSMUSG00000087679 C330006A16Rik 2 26.136814 26.140521 -1 26.138668 2_26091626 lincRNA black NA
ENSMUSG00000086391 1700042O10Rik 11 11.868123 11.885321 1 11.876722 11_11852460 antisense grey NA
ENSMUSG00000039116 Gpr126 10 14.402585 14.545036 -1 14.473810 10_14453739 protein_coding cyan NA
ENSMUSG00000039713 Plekhg5 4 152.072498 152.115400 1 152.093949 4_152082355 protein_coding purple NA
ENSMUSG00000039682 Lap3 5 45.493374 45.512674 1 45.503024 5_45529351 protein_coding salmon NA
ENSMUSG00000091844 Gm8251 1 44.055952 44.061936 -1 44.058944 1_44093079 protein_coding grey NA
ENSMUSG00000042737 Dpm3 3 89.259358 89.267079 1 89.263218 3_89265523 protein_coding yellow NA
ENSMUSG00000052310 Slc39a1 3 90.248172 90.253612 1 90.250892 3_90257553 protein_coding royalblue chr13
ENSMUSG00000097321 1700028E10Rik 5 151.368709 151.414083 1 151.391396 5_151520204 lincRNA grey NA
ENSMUSG00000060044 Tmem26 10 68.723746 68.782654 1 68.753200 10_68735167 protein_coding grey NA
ENSMUSG00000044676 Zfp612 8 110.079764 110.090181 1 110.084972 8_110136810 protein_coding lightgreen NA
ENSMUSG00000040712 Camta2 11 70.669463 70.688105 -1 70.678784 11_70661194 protein_coding black NA
ENSMUSG00000046753 Ccdc66 14 27.482412 27.508460 -1 27.495436 14_27610895 protein_coding pink NA
ENSMUSG00000022656 Pvrl3 16 46.387706 46.498525 -1 46.443116 16_46416309 protein_coding turquoise NA
ENSMUSG00000097610 A930012L18Rik 18 44.661665 44.676271 1 44.668968 18_44532060 lincRNA grey NA
ENSMUSG00000071519 Prss3 6 41.373795 41.377678 -1 41.375736 6_41406513 protein_coding darkgrey NA
ENSMUSG00000048701 Ccdc6 10 70.097121 70.193200 1 70.145160 10_70202299 protein_coding grey NA
ENSMUSG00000094574 Gm15027 X 13.179443 13.179580 1 13.179512 X_13155556 pseudogene grey NA
ENSMUSG00000000168 Dlat 9 50.634633 50.659780 -1 50.647206 9_50632541 protein_coding grey NA
ENSMUSG00000041073 Nacad 11 6.597823 6.606053 -1 6.601938 11_6600691 protein_coding purple NA
ENSMUSG00000029250 Polr2b 5 77.310147 77.349324 1 77.329736 5_77264293 protein_coding turquoise NA
ENSMUSG00000030031 Kbtbd8 6 95.117240 95.129718 1 95.123479 6_95084044 protein_coding turquoise NA

Expression Data

Lets check the distribution for the first 20 gene expression phenotypes. If you would like to check the distribution of all 50 genes, change for(gene in genes[1:20]) in the code below to for(gene in genes).

par(mfrow=c(3,4))
for(gene in genes[1:20]){
  hist(norm[,gene], main = gene)
  }

plot of chunk hist_untransformedplot of chunk hist_untransformed The histogram indicates that distribution of these counts are normalised (as they should be).

The Marker Map

We are using the same marker map as in the previous lesson

Genotype probabilities

We have explored this earlier in th previous lesson. But, as a reminder, we have already calculated genotype probabilities which we loaded above called probs. This contains the 8 state genotype probabilities using the 69k grid map of the same 500 DO mice that also have clinical phenotypes.

Kinship Matrix

We have explored the kinship matrix in the previous lesson. It has already been calculated and loaded in above.

Covariates

Now let’s add the necessary covariates. For these 50 gene expression data, let’s see which covariates are significant.

###merging covariate data and expression data to test for sex, wave and diet_days.

cov.counts <- merge(covar, norm[,genes], by=c("row.names"), sort=F)

#testing covairates on expression data

tmp = cov.counts %>%
        dplyr::select(mouse, sex, DOwave, diet_days, names(cov.counts[,genes])) %>%
        gather(expression, value, -mouse, -sex, -DOwave, -diet_days) %>%
        group_by(expression) %>%
        nest()
mod_fxn = function(df) {
  lm(value ~ sex + DOwave + diet_days, data = df)
}
tmp = tmp %>%
  mutate(model = map(data, mod_fxn)) %>%
  mutate(summ = map(model, tidy)) %>%
  unnest(summ) 
#  kable(tmp, caption = "Effects of Sex, Wave & Diet Days on Expression")

tmp
# A tibble: 200 × 8
# Groups:   expression [50]
   expression         data     model  term      estimate std.e…¹ stati…² p.value
   <chr>              <list>   <list> <chr>        <dbl>   <dbl>   <dbl>   <dbl>
 1 ENSMUSG00000048865 <tibble> <lm>   (Interce…  0.628   0.620     1.01   0.311 
 2 ENSMUSG00000048865 <tibble> <lm>   sexM      -0.178   0.101    -1.76   0.0791
 3 ENSMUSG00000048865 <tibble> <lm>   DOwave    -0.0464  0.0454   -1.02   0.308 
 4 ENSMUSG00000048865 <tibble> <lm>   diet_days -0.00327 0.00474  -0.690  0.491 
 5 ENSMUSG00000035910 <tibble> <lm>   (Interce…  0.227   0.616     0.369  0.713 
 6 ENSMUSG00000035910 <tibble> <lm>   sexM       0.167   0.100     1.66   0.0972
 7 ENSMUSG00000035910 <tibble> <lm>   DOwave    -0.0359  0.0451   -0.795  0.427 
 8 ENSMUSG00000035910 <tibble> <lm>   diet_days -0.00172 0.00470  -0.365  0.715 
 9 ENSMUSG00000022536 <tibble> <lm>   (Interce… -0.188   0.512    -0.368  0.713 
10 ENSMUSG00000022536 <tibble> <lm>   sexM      -0.0204  0.0834   -0.245  0.806 
# … with 190 more rows, and abbreviated variable names ¹​std.error, ²​statistic
tmp %>%
  filter(term != "(Intercept)") %>%
  mutate(neg.log.p = -log10(p.value)) %>%
  ggplot(aes(term, neg.log.p)) +
    geom_point() +
    facet_wrap(~expression, ncol=10) +
    labs(title = "Significance of Sex, Wave & Diet Days on Expression") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
rm(tmp)

plot of chunk covariates_sig

We can see that DOwave is the most significant. However, given that a few are influenced by sex and diet_days, we will have to correct for those as well.

# convert sex and DO wave (batch) to factors
pheno_clin$sex = factor(pheno_clin$sex)
pheno_clin$DOwave = factor(pheno_clin$DOwave)
pheno_clin$diet_days = factor(pheno_clin$DOwave)

covar = model.matrix(~sex + DOwave + diet_days, data = pheno_clin)

Performing a genome scan

Now lets perform the genome scan! We are also going to save our qtl results in an Rdata file to be used in further lessons.

QTL Scans

qtl.file = "../results/gene.norm_qtl_cis.trans.Rdata"

if(file.exists(qtl.file)) {
  load(qtl.file)
  } else {
    qtl = scan1(genoprobs = probs, 
                pheno = norm[,genes, drop = FALSE],
                kinship = K, 
                addcovar = covar, 
                cores = 2)
    save(qtl, file = qtl.file)
    }

QTL plots

Let’s plot the first 20 gene expression phenotypes. If you would like to plot all 50, change for(i in 1:20) in the code below to for(i in 1:ncol(qtl)).

par(mfrow=c(3,4))
for(i in 1:20) {
  plot_scan1(x = qtl, 
             map = map, 
             lodcolumn = i, 
             main = colnames(qtl)[i])
  abline(h = 6, col = 2, lwd = 2)
  }

plot of chunk qtl_plotsplot of chunk qtl_plots

QTL Peaks

We are also going to save our peak results so we can use these again else where. First, lets get out peaks with a LOD score greater than 6.

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl, 
                   map = map, 
                   threshold = lod_threshold, 
                   peakdrop = 4, 
                   prob = 0.95)

We will save these peaks into a csv file.

kable(peaks %>% 
        dplyr::select(-lodindex) %>% 
        arrange(chr, pos), caption = "Expression QTL (eQTL) Peaks with LOD >= 6")

write_csv(peaks, "../results/gene.norm_qtl_peaks_cis.trans.csv")

Table: Phenotype QTL Peaks with LOD >= 6

lodcolumn chr pos lod ci_lo ci_hi
ENSMUSG00000033389 1 24.116007 6.296328 18.001823 32.48454
ENSMUSG00000041189 1 38.681019 8.009293 37.690607 38.99625
ENSMUSG00000015112 1 74.157715 6.390019 73.792993 190.01604
ENSMUSG00000058420 1 190.002808 6.531293 135.481336 191.89737
ENSMUSG00000026637 1 192.137668 13.183821 191.655992 192.51205
ENSMUSG00000019945 2 72.077418 6.609965 17.713445 169.28966
ENSMUSG00000081046 2 129.184651 6.497957 12.915206 135.45041
ENSMUSG00000097283 2 153.419672 8.793228 150.991127 153.83765
ENSMUSG00000027487 2 156.136279 6.608979 154.085478 156.75363
ENSMUSG00000027487 2 161.576254 7.091103 161.309135 162.89119
ENSMUSG00000054455 2 165.445995 8.888555 163.582672 166.47560
ENSMUSG00000097283 2 178.833418 30.515193 176.416737 178.90319
ENSMUSG00000027718 3 33.606276 31.217791 32.304940 33.65661
ENSMUSG00000027718 3 37.165286 75.126515 36.852209 37.32921
ENSMUSG00000027722 3 37.374870 6.214390 34.262328 114.46680
ENSMUSG00000048416 3 67.791179 7.134414 66.785770 73.56933
ENSMUSG00000000340 3 114.948141 9.484962 111.607392 114.97076
ENSMUSG00000037994 3 135.232817 49.079161 135.232817 135.23282
ENSMUSG00000021127 4 3.798433 6.132635 3.000000 149.94046
ENSMUSG00000070980 4 54.766817 11.218926 54.527023 55.60324
ENSMUSG00000096177 4 65.427487 8.792047 63.528309 78.76958
ENSMUSG00000033389 4 111.238045 6.214717 109.522381 116.12848
ENSMUSG00000050854 4 118.471245 35.618326 118.397472 118.54342
ENSMUSG00000029804 4 143.211132 6.035045 138.750051 148.34659
ENSMUSG00000029321 4 154.874097 6.069940 20.129417 156.21626
ENSMUSG00000081046 5 98.997378 6.773936 92.854978 103.17849
ENSMUSG00000029321 5 103.614608 21.562401 103.586587 105.67241
ENSMUSG00000099095 5 108.046168 6.777903 102.956536 112.75631
ENSMUSG00000032305 6 37.349577 6.720697 16.376879 48.92068
ENSMUSG00000071519 6 41.406513 17.315115 40.920539 41.40651
ENSMUSG00000002797 6 54.997148 39.222855 54.855033 55.18362
ENSMUSG00000029804 6 58.488862 8.546978 40.993625 60.76170
ENSMUSG00000089634 6 86.133168 11.672479 85.420854 86.55060
ENSMUSG00000048416 6 125.226964 6.154786 45.277826 125.94593
ENSMUSG00000031532 6 125.553788 6.312708 63.210018 126.65001
ENSMUSG00000070980 6 127.078807 6.130341 126.722486 135.05159
ENSMUSG00000080968 6 133.867522 6.004287 129.390996 139.21454
ENSMUSG00000015112 6 134.185817 6.123813 129.110159 135.65721
ENSMUSG00000070980 6 145.839910 6.202777 145.391545 146.31853
ENSMUSG00000025557 7 36.613646 6.506206 35.767556 133.15669
ENSMUSG00000038738 7 44.445619 19.010804 43.844466 44.86733
ENSMUSG00000080383 7 57.912994 6.407887 50.696529 101.62734
ENSMUSG00000058420 7 118.237771 29.916509 117.782005 118.45922
ENSMUSG00000021676 8 22.086620 6.149473 18.457359 115.32557
ENSMUSG00000031532 8 33.827024 17.787609 33.095603 34.20777
ENSMUSG00000002797 8 95.702894 6.560175 94.937120 103.87985
ENSMUSG00000049307 9 14.764262 117.447471 14.703558 14.84658
ENSMUSG00000054455 9 27.198455 7.336424 20.908490 110.55064
ENSMUSG00000097554 9 41.186050 28.140130 40.780764 41.20384
ENSMUSG00000039542 9 49.615275 75.098477 49.494008 50.10869
ENSMUSG00000038738 9 50.108691 6.657000 49.051211 78.22393
ENSMUSG00000038086 9 51.898259 9.842327 51.472947 52.50449
ENSMUSG00000032305 9 57.362395 14.166389 57.359763 57.96257
ENSMUSG00000039542 9 58.357965 24.965759 58.264792 58.51857
ENSMUSG00000026637 9 66.202095 6.128876 61.127900 118.85930
ENSMUSG00000039542 9 107.647626 6.709880 103.046146 108.28455
ENSMUSG00000032305 9 117.664006 7.429687 116.937976 118.80569
ENSMUSG00000044795 10 22.909158 7.195228 21.125750 23.14018
ENSMUSG00000019945 10 68.436555 36.340070 68.296752 68.55129
ENSMUSG00000019945 10 68.966184 32.570583 68.966184 69.22692
ENSMUSG00000071519 10 75.574081 6.462753 73.705382 83.04843
ENSMUSG00000029321 10 125.591103 7.538025 125.539735 126.86601
ENSMUSG00000080968 11 16.903986 6.044069 9.065479 41.96490
ENSMUSG00000069910 11 34.822220 26.757661 34.701511 34.82222
ENSMUSG00000033389 11 65.063246 9.777574 64.648612 68.88967
ENSMUSG00000041189 11 67.078257 6.065475 64.368932 70.61624
ENSMUSG00000044795 11 68.892129 10.613848 65.096598 69.41636
ENSMUSG00000027722 11 73.111862 6.349757 73.005652 77.67011
ENSMUSG00000039542 11 78.151256 6.402459 5.561303 105.51410
ENSMUSG00000027718 11 79.152432 6.886785 70.782062 82.07630
ENSMUSG00000051452 11 84.107102 10.077904 83.647144 84.68618
ENSMUSG00000025557 11 98.275118 6.140239 90.223909 99.34041
ENSMUSG00000080383 11 109.423954 11.546496 109.406840 111.19709
ENSMUSG00000027487 12 6.347016 6.708630 5.846368 11.34456
ENSMUSG00000033373 12 76.788953 32.214804 76.548097 76.88841
ENSMUSG00000021127 12 83.238468 6.982385 82.129546 86.61000
ENSMUSG00000058420 12 84.148297 6.072393 80.160035 117.40074
ENSMUSG00000021271 12 87.166517 6.136045 56.572766 88.46608
ENSMUSG00000049307 12 107.833410 6.639038 106.974113 109.52447
ENSMUSG00000021271 12 110.197950 33.653196 110.180714 111.45223
ENSMUSG00000099095 13 15.611917 8.154411 13.283182 18.32901
ENSMUSG00000015112 13 35.457506 6.362942 34.416849 105.15802
ENSMUSG00000027487 13 64.344624 9.185658 63.548321 67.69218
ENSMUSG00000097283 13 65.206454 10.783729 64.841763 66.93469
ENSMUSG00000021676 13 95.733655 100.243082 95.733655 95.85903
ENSMUSG00000069910 13 101.786348 6.927969 101.137104 102.83022
ENSMUSG00000050854 14 31.435803 6.178650 28.932629 94.70510
ENSMUSG00000086341 14 48.126291 7.138599 46.678372 51.99293
ENSMUSG00000025557 14 99.566266 6.659106 13.014489 101.18782
ENSMUSG00000025557 14 121.571691 35.458616 121.480418 121.67373
ENSMUSG00000022292 15 38.227976 7.784003 36.256559 41.45437
ENSMUSG00000015112 15 58.990004 7.527785 58.658042 60.07672
ENSMUSG00000060487 15 60.990009 6.909902 58.857152 67.06317
ENSMUSG00000022292 15 64.643518 6.404510 62.181647 72.90648
ENSMUSG00000024085 15 66.679768 7.227881 63.659243 68.92043
ENSMUSG00000080968 15 100.424363 59.731554 99.800294 100.43483
ENSMUSG00000026637 16 94.789233 6.874482 3.000000 94.97204
ENSMUSG00000015112 17 43.242630 7.536525 34.179599 43.56276
ENSMUSG00000024085 17 64.655048 55.404509 64.600441 64.72895
ENSMUSG00000069910 17 69.244748 6.998414 68.506902 70.46603
ENSMUSG00000086219 17 78.830855 7.823507 77.910515 80.14648
ENSMUSG00000070980 17 89.875434 6.190791 78.362313 94.98443
ENSMUSG00000032305 18 68.177134 6.075548 65.041330 68.20743
ENSMUSG00000002797 19 38.421678 6.062762 36.662639 40.14056
ENSMUSG00000044795 19 40.204814 6.610261 38.774165 40.22262
ENSMUSG00000021676 X 12.322085 7.621003 11.895188 12.99006
ENSMUSG00000086219 X 49.017962 7.209342 48.174569 50.20412
ENSMUSG00000009079 X 53.140129 7.474459 51.892232 59.08406
ENSMUSG00000067878 X 56.774258 16.144863 53.969865 56.80714
ENSMUSG00000015112 X 150.513256 10.317359 145.335742 157.49709
ENSMUSG00000058420 X 153.682008 7.479411 146.728544 163.71076
ENSMUSG00000022292 X 158.566592 6.854800 155.383862 160.34933
ENSMUSG00000071519 X 166.335626 8.361242 164.415283 167.22935

QTL Peaks Figure

qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)

plot of chunk qtl_heatmap

Challenge

What do the qtl scans for all gene exression traits look like? Note: Don’t worry, we’ve done the qtl scans for you!!! You can read in this file, ../data/gene.norm_qtl_all.genes.Rdata, which are the scan1 results for all gene expression traits.

Solution

load("../data/gene.norm_qtl_all.genes.Rdata")

lod_threshold = 6
peaks = find_peaks(scan1_output = qtl, 
               map = map, 
               threshold = lod_threshold, 
               peakdrop = 4, 
               prob = 0.95)
write_csv(peaks, "../results/gene.norm_qtl_all.genes_peaks.csv")

## Heat Map
qtl_heatmap(qtl = qtl, map = map, low.thr = 3.5)

Key Points


Creating A Transcriptome Map

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How do I create and interpret a transcriptome map?

Objectives
  • Describe a transcriptome map.

  • Interpret a transcriptome map.

Load Libraries

library(tidyverse)
library(qtl2)
library(qtl2convert)
library(GGally)
library(broom)
library(corrplot)
library(RColorBrewer)
library(qtl2ggplot)

source("../code/gg_transcriptome_map.R")
source("../code/qtl_heatmap.R")

Load Data

#expression data
load("../data/attie_DO500_expr.datasets.RData")

##mapping data
load("../data/attie_DO500_mapping.data.RData")

##genoprobs
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

##phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")

expr.mrna <- norm

In this lesson we are going to learn how to create a transcriptome map. At transcriptome map shows the location of the eQTL peak compared its gene location, giving information about cos and trans eQTLs.

We are going to use the file that we created in the previous lesson with the lod scores greater than 6: gene.norm_qtl_peaks_cis.trans.csv.

You can load it in using the following code:

lod_summary = read.csv("../results/gene.norm_qtl_peaks_cis.trans.csv")
lod_summary$marker.id <- paste0(lod_summary$chr,
                                "_",
                                lod_summary$pos * 1000000)
ensembl = get_ensembl_genes()
snapshotDate(): 2022-04-25
loading from cache
Importing File into R ..
id    = ensembl$gene_id
chr   = seqnames(ensembl)
start = start(ensembl) * 1e-6
end   = end(ensembl)   * 1e-6
df = data.frame(ensembl = id, 
                gene_chr = chr, 
                gene_start = start, 
                gene_end = end,
                stringsAsFactors = F)
colnames(lod_summary)[colnames(lod_summary) == "lodcolumn"] = "ensembl"
colnames(lod_summary)[colnames(lod_summary) == "chr"] = "qtl_chr"
colnames(lod_summary)[colnames(lod_summary) == "pos"] = "qtl_pos"
colnames(lod_summary)[colnames(lod_summary) == "lod"] = "qtl_lod"
lod_summary = left_join(lod_summary, df, by = "ensembl")
lod_summary = mutate(lod_summary, 
                     gene_chr = factor(gene_chr, levels = c(1:19, "X")),
                     qtl_chr = factor(qtl_chr, levels = c(1:19, "X")))
rm(df)

We can summarise which eQTLs are cis or trans. Remember a cis qtl is… and trans eqtl is….

lod_summary$cis.trans <- ifelse(lod_summary$qtl_chr == lod_summary$gene_chr, "cis", "trans")
table(lod_summary$cis.trans)

  cis trans 
   47    65 

Plot Transcriptome Map

lod_summary = mutate(lod_summary, cis = (gene_chr == qtl_chr) & (abs(gene_start - qtl_pos) < 4))
out.plot = ggtmap(data = lod_summary %>% filter(qtl_lod >= 7.18), cis.points = TRUE, cis.radius = 4)

plot of chunk unnamed-chunk-2

pdf("../results/transcriptome_map_cis.trans.pdf", width = 10, height = 10)
out.plot
dev.off()
quartz_off_screen 
                2 
out.plot

plot of chunk unnamed-chunk-2

Key Points

  • Transcriptome maps aid in understanding gene expression regulation.


Transcriptome Map of cis and trans eQTL

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • How do I create a full transcriptome map?

Objectives

Load Libraries

library(tidyverse)
library(qtl2)
library(qtl2convert)
#library(qtl2db)
library(GGally)
library(broom)
library(knitr)
library(corrplot)
library(RColorBrewer)
library(qtl2ggplot)

source("../code/gg_transcriptome_map.R")
source("../code/qtl_heatmap.R")

Load Data

#expression data
load("../data/attie_DO500_expr.datasets.RData")

##loading previous results
load("../data/dataset.islet.rnaseq.RData")

##mapping data
load("../data/attie_DO500_mapping.data.RData")

##genoprobs
probs = readRDS("../data/attie_DO500_genoprobs_v5.rds")

##phenotypes
load("../data/attie_DO500_clinical.phenotypes.RData")
lod_summary = dataset.islet.rnaseq$lod.peaks

ensembl = get_ensembl_genes()
id    = ensembl$gene_id
chr   = seqnames(ensembl)
start = start(ensembl) * 1e-6
end   = end(ensembl)   * 1e-6
df = data.frame(ensembl = id, gene_chr = chr, gene_start = start, gene_end = end,
     stringsAsFactors = F)
colnames(lod_summary)[colnames(lod_summary) == "annot.id"] = "ensembl"
colnames(lod_summary)[colnames(lod_summary) == "chrom"] = "qtl_chr"
colnames(lod_summary)[colnames(lod_summary) == "pos"] = "qtl_pos"
colnames(lod_summary)[colnames(lod_summary) == "lod"] = "qtl_lod"
lod_summary = left_join(lod_summary, df, by = "ensembl")
lod_summary = mutate(lod_summary, gene_chr = factor(gene_chr, levels = c(1:19, "X")),
                     qtl_chr = factor(qtl_chr, levels = c(1:19, "X")))
rm(df)

## summary:

lod_summary$cis.trans <- ifelse(lod_summary$qtl_chr == lod_summary$gene_chr, "cis", "trans")
table(lod_summary$cis.trans)

  cis trans 
14549 24299 

Plot Transcriptome Map

lod_summary = mutate(lod_summary, cis = (gene_chr == qtl_chr) & (abs(gene_start - qtl_pos) < 4))
out.plot = ggtmap(data = lod_summary %>% filter(qtl_lod >= 7.18), cis.points = TRUE, cis.radius = 4)

plot of chunk unnamed-chunk-2

out.plot

plot of chunk unnamed-chunk-2

QTL Density Plot

breaks = matrix(c(seq(0, 200, 4), seq(1, 201, 4), seq(2, 202, 4), seq(3, 203, 4)), ncol = 4)
tmp = as.list(1:ncol(breaks)) 
for(i in 1:ncol(breaks)) {
tmp[[i]] = lod_summary %>%
             filter(qtl_lod >= 7.18 & cis == FALSE) %>%
             arrange(qtl_chr, qtl_pos) %>%
             group_by(qtl_chr) %>%
             mutate(win = cut(qtl_pos, breaks = breaks[,i])) %>%
             group_by(qtl_chr, win) %>% 
             summarize(cnt = n()) %>%
             separate(win, into = c("other", "prox", "dist")) %>%
             mutate(prox = as.numeric(prox), 
                    dist = as.numeric(dist), 
                    mid = 0.5 * (prox + dist)) %>%
             dplyr::select(qtl_chr, mid, cnt)
}

trans = bind_rows(tmp[[1]], tmp[[1]], tmp[[3]], tmp[[4]])
rm(tmp)

out.plot = ggplot(trans, aes(mid, cnt)) +
             geom_line() +
             geom_hline(aes(yintercept = 100), linetype = 2, color = "grey50") +
             facet_grid(.~qtl_chr, scales = "free") +
             theme(panel.background = element_blank(),
             panel.border = element_rect(fill = 0, color = "grey70"),
             panel.spacing = unit(0, "lines"),
             axis.text.x = element_text(angle = 90)) +
             labs(title = "trans-eQTL Histogram", x = "Mb", y = "Number of Transcripts")

out.plot

plot of chunk trans_eqtl_sliding_window

breaks = matrix(c(seq(0, 200, 4), seq(1, 201, 4), seq(2, 202, 4), seq(3, 203, 4)), ncol = 4)
tmp = as.list(1:ncol(breaks)) 
for(i in 1:ncol(breaks)) {
tmp[[i]] = lod_summary %>%
             filter(qtl_lod >= 7.18 & cis == TRUE) %>%
             arrange(qtl_chr, qtl_pos) %>%
             group_by(qtl_chr) %>%
             mutate(win = cut(qtl_pos, breaks = breaks[,i])) %>%
             group_by(qtl_chr, win) %>% 
             summarize(cnt = n()) %>%
             separate(win, into = c("other", "prox", "dist")) %>%
             mutate(prox = as.numeric(prox), 
                    dist = as.numeric(dist), 
                    mid = 0.5 * (prox + dist)) %>%
             dplyr::select(qtl_chr, mid, cnt)
}

cis = bind_rows(tmp[[1]], tmp[[2]], tmp[[3]], tmp[[4]])
rm(tmp1, tmp2, tmp3, tmp4)

out.plot = ggplot(cis, aes(mid, cnt)) +
             geom_line(color = "#4286f4") +
             geom_hline(aes(yintercept = 100), linetype = 2, color = "grey50") +
             facet_grid(.~qtl_chr, scales = "free") +
             theme(panel.background = element_blank(),
                   panel.border = element_rect(fill = 0, color = "grey70"),
                   panel.spacing = unit(0, "lines"),
                   axis.text.x = element_text(angle = 90)) +
           labs(title = "cis-eQTL Histogram", x = "Mb", y = "Number of Transcripts")

out.plot

plot of chunk cis_eqtl_sliding_window

tmp = lod_summary %>%
        filter(qtl_lod >= 7.18) %>%
        group_by(cis) %>%
        count()
kable(tmp, caption = "Number of cis- and trans-eQTL")

Table: Number of cis- and trans-eQTL

cis n
FALSE 5617
TRUE 12677
NA 526
rm(tmp)

Islet RNASeq eQTL Hotspots

Select eQTL Hotspots

Select trans-eQTL hotspots with more than 100 genes at the 7.18 LOD thresholds. Retain the maximum per chromosome.

hotspots = trans %>%
             group_by(qtl_chr) %>%
             filter(cnt >= 100) %>%
             summarize(center = median(mid)) %>%
             mutate(proximal = center - 2, distal = center + 2)
kable(hotspots, caption = "Islet trans-eQTL hotspots")

Table: Islet trans-eQTL hotspots

qtl_chr center proximal distal
2 165.5 163.5 167.5
5 146.0 144.0 148.0
7 46.0 44.0 48.0
11 71.0 69.0 73.0
13 112.5 110.5 114.5
cis.hotspots = cis %>%
             group_by(qtl_chr) %>%
             filter(cnt >= 100) %>%
             summarize(center = median(mid)) %>%
             mutate(proximal = center - 2, distal = center + 2)
kable(cis.hotspots, caption = "Islet cis-eQTL hotspots")

Table: Islet cis-eQTL hotspots

qtl_chr center proximal distal
7 29.0 27.0 31.0
17 34.5 32.5 36.5

Given the hotspot locations, retain all genes with LOD > 7.18 and trans-eQTL within +/- 4Mb of the mid-point of the hotspot.

hotspot.genes = as.list(hotspots$qtl_chr)
names(hotspot.genes) = hotspots$qtl_chr
for(i in 1:nrow(hotspots)) {
  hotspot.genes[[i]] = lod_summary %>% 
                         filter(qtl_lod >= 7.18) %>%
                         filter(qtl_chr == hotspots$qtl_chr[i] & 
                           qtl_pos >= hotspots$proximal[i] & 
                           qtl_pos <= hotspots$distal[i] &
                           (gene_chr != hotspots$qtl_chr[i] |
                           (gene_chr == hotspots$qtl_chr[i] &
                            gene_start > hotspots$distal[i] + 1 &
                            gene_end < hotspots$proximal[i] - 1)))
  write_csv(hotspot.genes[[i]], file = paste0("../results/chr", names(hotspot.genes)[i], "_hotspot_genes.csv"))
}

Number of genes in each hotspot.

hotspots = data.frame(hotspots, count = sapply(hotspot.genes, nrow))
kable(hotspots, caption = "Number of genes per hotspot")

Table: Number of genes per hotspot

  qtl_chr center proximal distal count
2 2 165.5 163.5 167.5 147
5 5 146.0 144.0 148.0 182
7 7 46.0 44.0 48.0 123
11 11 71.0 69.0 73.0 126
13 13 112.5 110.5 114.5 104
cis.hotspot.genes = as.list(cis.hotspots$qtl_chr)
names(cis.hotspot.genes) = cis.hotspots$qtl_chr
for(i in 1:nrow(cis.hotspots)) {
  cis.hotspot.genes[[i]] = lod_summary %>% 
                             dplyr::select(ensembl, marker.id, qtl_chr, qtl_pos, qtl_lod) %>%
                             filter(qtl_lod >= 7.18) %>%
                             filter(qtl_chr == cis.hotspots$qtl_chr[i] & 
                                    qtl_pos >= cis.hotspots$proximal[i] & 
                                    qtl_pos <= cis.hotspots$distal[i])
  write_csv(cis.hotspot.genes[[i]], file = paste0("../results/chr", names(cis.hotspot.genes)[i], "_cis_hotspot_genes.csv"))
}

Number of genes in each cis-hotspot.

cis.hotspots = data.frame(cis.hotspots, count = sapply(cis.hotspot.genes, nrow))
kable(cis.hotspots, caption = "Number of genes per cis-hotspot")

Table: Number of genes per cis-hotspot

  qtl_chr center proximal distal count
7 7 29.0 27.0 31.0 120
17 17 34.5 32.5 36.5 188

Get the expression of genes that map to each hotspot.

for(i in 1:length(hotspot.genes)) {
  tmp = data.frame(ensembl = hotspot.genes[[i]]$ensembl, t(expr.mrna[,hotspot.genes[[i]]$ensembl]))
  hotspot.genes[[i]] = left_join(hotspot.genes[[i]], tmp, by = "ensembl")
  write_csv(hotspot.genes[[i]], file = paste0("../results/chr", names(hotspot.genes)[i], "_hotspot_genes.csv"))
}
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 't': object 'expr.mrna' not found

Hotspot Gene Correlation

breaks = -100:100/100
colors = colorRampPalette(rev(brewer.pal(11, "Spectral")))(length(breaks) - 1)
for(i in 1:length(hotspot.genes)) {
  chr = names(hotspot.genes)[i]
  tmp = hotspot.genes[[i]] %>%
    dplyr::select(starts_with("DO")) %>%
    t() %>%
    as.matrix() %>%
    cor()
  dimnames(tmp) = list(hotspot.genes[[i]]$ensembl, hotspot.genes[[i]]$ensembl)
  side.colors = cut(hotspot.genes[[i]]$qtl_lod, breaks = 100)
  side.colors = colorRampPalette(rev(brewer.pal(9, "YlOrRd")))(length(levels(side.colors)))[as.numeric(side.colors)]
  names(side.colors) = rownames(tmp)

  heatmap(tmp, symm = TRUE, scale = "none", main = paste("Chr", chr, "Gene Correlation"), breaks = breaks, col = colors, RowSideColors = side.colors, ColSideColors = side.colors)
}
Error in hclustfun(distfun(x)): NA/NaN/Inf in foreign function call (arg 10)

Hotspot Principal Components

hotspot.pcs = as.list(names(hotspot.genes))
names(hotspot.pcs) = names(hotspot.genes)
do.wave = pheno_clin[rownames(expr.mrna),"DOwave",drop=F]
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'expr.mrna' not found
wave.col = as.numeric(as.factor(do.wave[,1]))
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.factor': object 'do.wave' not found
#hotspot.genes.rns <- list()
for(i in 1:length(hotspot.genes)) {
  tmp = hotspot.genes[[i]] %>%
          dplyr::select(starts_with("DO")) %>%
          as.matrix() %>%
          t() %>%
          prcomp()
  hotspot.pcs[[i]] = tmp$x
  #rownames(hotspot.pcs[[i]]) <- gsub("\\.x","", rownames(hotspot.pcs[[i]]))
  #rownames(hotspot.pcs[[i]]) <- gsub("\\.y","", rownames(hotspot.pcs[[i]]))
  tmp = gather(data.frame(mouse = rownames(hotspot.pcs[[i]]), hotspot.pcs[[i]]), pc, value, -mouse)
  tmp$mouse <- gsub("\\.x","", tmp$mouse)
  tmp$mouse <- gsub("\\.y","", tmp$mouse)
  tmp = left_join(tmp, pheno_clin %>% dplyr::select(mouse, sex, DOwave, diet_days), by = "mouse")
  print(tmp %>%
    filter(pc %in% paste0("PC", 1:4)) %>%
    mutate(DOwave = factor(DOwave)) %>%
    ggplot(aes(DOwave, value, fill = sex)) +
    geom_boxplot() +
    facet_grid(pc~.) +
    labs(title = paste("Chr", names(hotspot.genes)[i], "Hotspot")))
}
Error in svd(x, nu = 0, nv = k): a dimension is zero

Key Points

  • .


Maximum eQTL Peaks and Nearby Genes

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How do I display maximum eQTL peaks and nearby genes?

Objectives
  • .

LOD plot for gene Hnf1b.

SNP association plot for gene Hnf1b.

Key Points

  • .


Interpreting qtl2 results

Overview

Teaching: 10 min
Exercises: 2 min
Questions
  • How do I interpret qtl2 results?

Objectives
  • Interpret the relationship between sequence, expression and phenotype variation from qtl2 mapping results.

Attie et al, 2018 Supplementary Figure S7

Challenge: Interpreting qtl2 results

Refer to the figure above.

1). What does panel A show? What conclusions could you draw from panel A?
2). What does panel B show? What conclusions could you draw from panel B?
3). What does panel C show? What conclusions could you draw from panel C?
4). How are panels A through C related to one another? What story do they tell together?

Solution

Key Points

  • Edit the .Rmd files not the .md files


Mediation Analysis

Overview

Teaching: 10 min
Exercises: 2 min
Questions
  • What is mediation analysis?

  • How is mediation analysis used in genetics and genomics?

  • How do I explore causal relations with mediation analysis?

Objectives
  • Describe mediation analysis as applied in genetics and genomics.

A non-coding SNP affects expression of gene 1 in cis. Gene 1 mediates expression of gene 2.

Chromosome 13 gene Akr1e1 is affected by expression in both cis and trans by genes on chromosomes 13 and 4.

Chromosome 11 gene Myo15b is affected by expression of a trans gene in the chromosome 2 hotspot.

Mediating expression of Myo15b identifies Hnf4a as the gene that drops the LOD score from greater than 70 to less than 50.

Challenge:

Solution

Key Points

  • Mediation analysis investigates an intermediary between an independent variable and its effect on a dependent variable.

  • Mediation analysis is used in high-throughput genomics studies to identify molecular phenotypes, such as gene expression or methylation traits, that mediate the effect of genetic variation on disease phenotypes or other outcomes of interest.