Investigating large datasets in R

Overview

Teaching: min
Exercises: min

Questions

How can we evaluate direct relationships between two variables?

How can we evaluate more complex relationships between multiple, interacting variables?

Objectives

Explore a large dataset graphically

Develop simple models with lm()

Choose new packages and interpret package descriptions

Analyze complex models using Information Theoretic (IT) Model Averaging (arm and MuMIn packages)

Setup

Let’s begin by loading the data and packages necessary to start this lesson.

# Load packages
library(ggplot2)

# Load data
phys_data<-read.csv("data/Physiology_Environmental_Data.csv")

Exploring the dataset

In previous lessons we’ve worked with the plant physiology dataset from O’Keefe and Nippert 2018. As a reminder, these data were collected to address the following objectives:

How do nocturnal and daytime transpiration vary among coexisting grasses, forbs, and shrubs in a tallgrass prairie?
What environmental variables drive nocturnal transpiration and do these differ from the drivers of daytime transpiration?
Are nocturnal transpiration and stomatal conductance associated with daytime physiological processes?

Let’s use this dataset to explore the relationships between leaf transpiration and potential drivers of transpiration

Transpiration parameters:

Nocturnal transpiration (Trmmol_night)
Nocturnal stomatal conductance (Cond_night)
Daytime transpiration (Trmmol_day)
Daytime stomatal conductance (Cond_night)

Potential drivers:

Nocturnal vapor pressure deficit (VPD_N)
Nocturnal air temperature (TAIR_N)
Daytime vapor pressure deficit (VPD_D)
Daytime air temperature (TAIR_D)
Daily average soil moisture (Soil_moisture)
Predawn leaf water potential (PD)
Midday leaf water potential (MD)
Photosynthesis (Photo)
Plant functional group (Fgroup)

Challenge 1

Use ggplot to create a scatterplot of the relationship between a transpiration parameter and a potential driver.
Solution to Challenge 1
# Nocturnal transpiration vs. nocturnal VPD
ggplot(phys_data, aes(x=VPD_N, y=Trmmol_night)) +
  geom_point() +
  xlab("Nocturnal VPD (kPa)") +
  ylab(expression(paste(italic('E')[night]," ", "(mmol"," ",m^-2,s^-1,")"))) +
 theme_bw() 

Challenge 2

How does this relationship differ between the different plant functional groups (grass, forb, woody)?
Solution to Challenge 2
ggplot(phys_data, aes(x=VPD_N, y=Trmmol_night, color=Fgroup)) +
 geom_point() +
 xlab("Nocturnal VPD") +
 ylab(expression(paste(italic('E')[night]," ", "(mmol"," ",m^-2,s^-1,")"))) +
 theme_bw()

Using a Simple Model

As you can see, plotting our data can be informative, but it doesn’t necessarily tell us much about the statistical relationships between variables. We are going to start investigate the relationships between these data by using a simple linear regression. Linear regression is used to predict the value of a dependent variable Y based on one or more input independent (predictor) variables X, with the following basic equation:

$y = \beta_1 + \beta_2 X + \epsilon$

In this equation, $y$ is the dependent variable, $\beta_1$ is the intercept, $\beta_2$ is the slope, and $\epsilon$ is the error term.

To build this equation in R, we’ll use the lm() function. lm() is included in the base R package, so you don’t have to load it into your workspace to use. lm() takes three basic arguments: the dependent variable, the independent variable, and the dataframe from which the data is used. The model is specified using a particular format, and is typically assigned to an object:

model_name <- lm(dependent_variable ~ independent_variable, data=dataframe_name )

We can then view our model output with the summary() function.

fit_vpd <- lm(Trmmol_night ~ VPD_N, data=phys_data)
summary(fit_vpd)

Call:
lm(formula = Trmmol_night ~ VPD_N, data = phys_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.57289 -0.27516 -0.07538  0.15691  1.36512 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.64929    0.04480  14.493  < 2e-16 ***
VPD_N       -0.21335    0.04867  -4.384 1.65e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3783 on 283 degrees of freedom
Multiple R-squared:  0.06359,	Adjusted R-squared:  0.06028 
F-statistic: 19.22 on 1 and 283 DF,  p-value: 1.646e-05

From this output, we’re usually interested in the following results:

(Intercept) Estimate: This is the y-intercept of our regression
VPD_N Estimate: This is the slope of our regression
Std. Error: The standard error for our intercept and slope estimates
Pr(>|t|): These are the p-values associated with the intercept and our independent variable
Multiple R-squared: The $r^2$ value that indicates model fit
F-statistic and p-value: Indicate overall model significance

More Complex Models

Now that we know how to test for the effect of one variable on the dependent variable, let’s test for the effect of multiple variables on the dependent variable. You can add additional variables to your model by using the following formula:

model_name <- lm(dependent_variable ~ independent_variable_1 * independent_variable_2, data=dataframe_name )

Note that adding an asterisk * between the two independent variables indicates that we will be testing for the effects of independent variable 1, independent variable 2, and their interaction on the dependent variable. If we were not interested in testing for interactions among independent variables, we could replace the * with a + sign.

Challenge 3

Create a model that tests for the effects of four factors on either transpiration or stomatal conductance.

Solution to Challenge 3

fit_all <- lm(Trmmol_night ~ VPD_N * TAIR_N * Soil_moisture * Fgroup, data=phys_data)
summary(fit_all)

Call:
lm(formula = Trmmol_night ~ VPD_N * TAIR_N * Soil_moisture * 
   Fgroup, data = phys_data)

Residuals:
   Min      1Q  Median      3Q     Max 
-0.7984 -0.2043 -0.0572  0.1351  1.4284 

Coefficients:
                                       Estimate Std. Error t value
(Intercept)                             -3.96407    3.44780  -1.150
VPD_N                                    2.06121    6.10586   0.338
TAIR_N                                   0.25750    0.18189   1.416
Soil_moisture                           22.02982   22.42903   0.982
Fgroupgrass                             -2.56106    4.49212  -0.570
Fgroupwoody                              2.52347    4.45109   0.567
VPD_N:TAIR_N                            -0.14744    0.26059  -0.566
VPD_N:Soil_moisture                     -0.99095   45.52504  -0.022
TAIR_N:Soil_moisture                    -1.31389    1.17040  -1.123
VPD_N:Fgroupgrass                        7.51010    7.89466   0.951
VPD_N:Fgroupwoody                        1.51518    7.88263   0.192
TAIR_N:Fgroupgrass                       0.14667    0.23783   0.617
TAIR_N:Fgroupwoody                      -0.16006    0.23482  -0.682
Soil_moisture:Fgroupgrass               30.54712   29.24017   1.045
Soil_moisture:Fgroupwoody              -11.37981   28.95575  -0.393
VPD_N:TAIR_N:Soil_moisture               0.40143    1.91713   0.209
VPD_N:TAIR_N:Fgroupgrass                -0.34894    0.33756  -1.034
VPD_N:TAIR_N:Fgroupwoody                -0.02159    0.33642  -0.064
VPD_N:Soil_moisture:Fgroupgrass        -81.58945   58.85944  -1.386
VPD_N:Soil_moisture:Fgroupwoody        -22.69473   58.77257  -0.386
TAIR_N:Soil_moisture:Fgroupgrass        -1.52112    1.53236  -0.993
TAIR_N:Soil_moisture:Fgroupwoody         0.78695    1.51098   0.521
VPD_N:TAIR_N:Soil_moisture:Fgroupgrass   3.57627    2.48279   1.440
VPD_N:TAIR_N:Soil_moisture:Fgroupwoody   0.67813    2.47500   0.274
                                      Pr(>|t|)
(Intercept)                               0.251
VPD_N                                     0.736
TAIR_N                                    0.158
Soil_moisture                             0.327
Fgroupgrass                               0.569
Fgroupwoody                               0.571
VPD_N:TAIR_N                              0.572
VPD_N:Soil_moisture                       0.983
TAIR_N:Soil_moisture                      0.263
VPD_N:Fgroupgrass                         0.342
VPD_N:Fgroupwoody                         0.848
TAIR_N:Fgroupgrass                        0.538
TAIR_N:Fgroupwoody                        0.496
Soil_moisture:Fgroupgrass                 0.297
Soil_moisture:Fgroupwoody                 0.695
VPD_N:TAIR_N:Soil_moisture                0.834
VPD_N:TAIR_N:Fgroupgrass                  0.302
VPD_N:TAIR_N:Fgroupwoody                  0.949
VPD_N:Soil_moisture:Fgroupgrass           0.167
VPD_N:Soil_moisture:Fgroupwoody           0.700
TAIR_N:Soil_moisture:Fgroupgrass          0.322
TAIR_N:Soil_moisture:Fgroupwoody          0.603
VPD_N:TAIR_N:Soil_moisture:Fgroupgrass    0.151
VPD_N:TAIR_N:Soil_moisture:Fgroupwoody    0.784

Residual standard error: 0.3598 on 261 degrees of freedom
Multiple R-squared:  0.219,	Adjusted R-squared:  0.1502 
F-statistic: 3.182 on 23 and 261 DF,  p-value: 3.459e-06

How to Work with Complex Models: IT Model Averaging

As you can see, we have a lot of potential interactions with just four independent variables in our model! This becomes incredibly difficult to interpret, so we often have to look for other methods to analyze complex relationships in our data.

Although there are many different and valid ways to approach a statistical problem, today we are going to use Information Theoretic (IT) Model Averaging. Rather than creating a model with all possible variables and interactions, IT Model Averaging:

Compares mulitple competing models using information criteria
Ranks and weights each competing model
Averages a top model set to produce a final model that only include predictor variables represented in the top model set

IT Model Averaging quantifies multiple competing hypotheses and is better able to avoid over-parameterization than traditional methods using a single model. To get started, we need to create a “global model” that includes all possible predictor variables. To simplify things even further, lets only consider pairwise interactions among predictors. For this we need to use a + sign between variables and square the variables with ()^2.

model_name <- lm(dependent_variable ~ (independent_variable_1 + independent_variable_2)^2, data=dataframe_name )

# Global model
fit_all <- lm(Trmmol_night ~ (VPD_N + TAIR_N + Soil_moisture + Fgroup)^2, data=phys_data)

IT Model Averaging requires the following packages: arm: Includes the standardize() function that standardizes the input variables MuMIn: The dredge() function creates a full submodel set, the get.models() function creates a top model set, and model.avg() creates the average model and importance() calculates relative importance. Let’s install and load these packages before we begin:

# Install packages
install.packages("arm")
install.packages("MuMIn")

# Load packages
library(arm)
library(MuMIn)

Before we start using these functions, let’s take a look at the package descriptions for arm and MuMIn

Note: When reading package descriptions, it is important to:

Read the overall package description. This will give you a good overall idea of what this package does, and if it would be useful for you.
Identify particular functions of interest. You don’t necessarily need to read the documentation for all of the functions within a package (typically, you’ll only use a few functions).
For individual functions, read through the basic description, usage, argument descriptions, details / notes, and examples. This will tell you what syntax to use and what the syntax means.
You can access package descriptions online (google search), from the CRAN website, or by using the help() function in R.

First, let’s standardize our input variables with the standardize() function in the arm package. standardize() rescales numeric variables that take on more than two values to have a mean of 0 and a standard deviation of 0.5. To do this, we just need to specify the object to standardize (our global model, fit_all):

# Standardize the global model

stdz.model<-standardize(fit_all)
summary(stdz.model)

Next, we create the full submodel set with the dredge() function in the MuMIn package by specifying the object that dredge() will evaluate (our standardized model, stdz.model). We also need to change the default “na.omit” to prevent models from being fitted to different datasets in case of missing values using options(na.action=na.fail) :

options(na.action=na.fail) 
model.set<-dredge(stdz.model)

The get.models() function will then create a top model set. First, we need to specify the object that get.models will evaluate (our model set, model.set), and then we need to specify the subset of models to include all models within 4AICcs with subset=delta<4:

top.models<-get.models(model.set, subset=delta<4)
top.models

Finally, we’ll use the model.avg() function to create our average model and calculates relative importance. We need to specify the object that model.avg() will evaluate (in this case our top model set, top.models):

average_model<-model.avg(top.models)

To check out our final average model, use the summary() function:

summary(average_model)

This is a lot of information! We can see which component models were chosen as the top model set and then averaged, as well as their associated ranking information. We can also see the model-averaged coefficients for both the full average model and conditional average model.

The conditional average only averages over the models where the parameter appears. Conversely, the full average assumes that a variable is included in every model, but in some models the corresponding coefficient (and its respective variance) is set to zero. Unlike the conditional average, the full average does not bias the value away from zero.

We can also check out the relative importance of each factor within the model with the importance() function. Relative importance is a unitless metric ranging from 0 (doesn’t contribute to the model at all) to 1 (contributes heavily to the model). This function will also give use the number of models within the top model set that contain each factor.

importance(average_model)

Challenge 4

Using IT Model Averaging, create a model that tests for the effects of four factors on either transpiration or stomatal conductance. Then, create a plot that best illustrates the main finding of your top model. Ultimately, your results should be able to address one of these questions:

What environmental variables drive nocturnal transpiration / stomatal conductance, and do these differ from the drivers of daytime transpiration / stomatal conductance?

Are nocturnal transpiration and stomatal conductance associated with daytime physiological processes?

Key Points

Linear regression is a quick and easy way to evaluate the direct relationship between two variables.

IT Model Averaging is one approach to evaluate more complex relationships between variables.

There are often many valid approaches to address a question in R, and understanding how to interpret R packages can help determine which approach might be most appropriate.

previous episode

R for plant science

lesson home

Investigating large datasets in R

Overview

Setup

Exploring the dataset

Challenge 1

Solution to Challenge 1

Challenge 2

Solution to Challenge 2