Investigating large datasets in R
Overview
Teaching: min
Exercises: minQuestions
How can we evaluate direct relationships between two variables?
How can we evaluate more complex relationships between multiple, interacting variables?
Objectives
Explore a large dataset graphically
Develop simple models with
lm()
Choose new packages and interpret package descriptions
Analyze complex models using Information Theoretic (IT) Model Averaging (
arm
andMuMIn
packages)
Setup
Let’s begin by loading the data and packages necessary to start this lesson.
# Load packages
library(ggplot2)
# Load data
phys_data<-read.csv("data/Physiology_Environmental_Data.csv")
Exploring the dataset
In previous lessons we’ve worked with the plant physiology dataset from O’Keefe and Nippert 2018. As a reminder, these data were collected to address the following objectives:
- How do nocturnal and daytime transpiration vary among coexisting grasses, forbs, and shrubs in a tallgrass prairie?
- What environmental variables drive nocturnal transpiration and do these differ from the drivers of daytime transpiration?
- Are nocturnal transpiration and stomatal conductance associated with daytime physiological processes?
Let’s use this dataset to explore the relationships between leaf transpiration and potential drivers of transpiration
Transpiration parameters:
- Nocturnal transpiration (Trmmol_night)
- Nocturnal stomatal conductance (Cond_night)
- Daytime transpiration (Trmmol_day)
- Daytime stomatal conductance (Cond_night)
Potential drivers:
- Nocturnal vapor pressure deficit (VPD_N)
- Nocturnal air temperature (TAIR_N)
- Daytime vapor pressure deficit (VPD_D)
- Daytime air temperature (TAIR_D)
- Daily average soil moisture (Soil_moisture)
- Predawn leaf water potential (PD)
- Midday leaf water potential (MD)
- Photosynthesis (Photo)
- Plant functional group (Fgroup)
Challenge 1
Use ggplot to create a scatterplot of the relationship between a transpiration parameter and a potential driver.
Solution to Challenge 1
# Nocturnal transpiration vs. nocturnal VPD ggplot(phys_data, aes(x=VPD_N, y=Trmmol_night)) + geom_point() + xlab("Nocturnal VPD (kPa)") + ylab(expression(paste(italic('E')[night]," ", "(mmol"," ",m^-2,s^-1,")"))) + theme_bw()
Challenge 2
How does this relationship differ between the different plant functional groups (grass, forb, woody)?
Solution to Challenge 2
ggplot(phys_data, aes(x=VPD_N, y=Trmmol_night, color=Fgroup)) + geom_point() + xlab("Nocturnal VPD") + ylab(expression(paste(italic('E')[night]," ", "(mmol"," ",m^-2,s^-1,")"))) + theme_bw()
Using a Simple Model
As you can see, plotting our data can be informative, but it doesn’t necessarily tell us much about the statistical relationships between variables. We are going to start investigate the relationships between these data by using a simple linear regression. Linear regression is used to predict the value of a dependent variable Y based on one or more input independent (predictor) variables X, with the following basic equation:
$y = \beta_1 + \beta_2 X + \epsilon$
In this equation, $y$ is the dependent variable, $\beta_1$ is the intercept, $\beta_2$ is the slope, and $\epsilon$ is the error term.
To build this equation in R, we’ll use the lm()
function. lm()
is included in the base R package, so you don’t have to load it into your workspace to use. lm()
takes three basic arguments: the dependent variable, the independent variable, and the dataframe from which the data is used. The model is specified using a particular format, and is typically assigned to an object:
model_name <- lm(dependent_variable ~ independent_variable, data=dataframe_name )
We can then view our model output with the summary()
function.
fit_vpd <- lm(Trmmol_night ~ VPD_N, data=phys_data)
summary(fit_vpd)
Call:
lm(formula = Trmmol_night ~ VPD_N, data = phys_data)
Residuals:
Min 1Q Median 3Q Max
-0.57289 -0.27516 -0.07538 0.15691 1.36512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.64929 0.04480 14.493 < 2e-16 ***
VPD_N -0.21335 0.04867 -4.384 1.65e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3783 on 283 degrees of freedom
Multiple R-squared: 0.06359, Adjusted R-squared: 0.06028
F-statistic: 19.22 on 1 and 283 DF, p-value: 1.646e-05
From this output, we’re usually interested in the following results:
-
(Intercept) Estimate
: This is the y-intercept of our regression -
VPD_N Estimate
: This is the slope of our regression -
Std. Error
: The standard error for our intercept and slope estimates -
Pr(>|t|)
: These are the p-values associated with the intercept and our independent variable -
Multiple R-squared
: The $r^2$ value that indicates model fit -
F-statistic
andp-value
: Indicate overall model significance
More Complex Models
Now that we know how to test for the effect of one variable on the dependent variable, let’s test for the effect of multiple variables on the dependent variable. You can add additional variables to your model by using the following formula:
model_name <- lm(dependent_variable ~ independent_variable_1 * independent_variable_2, data=dataframe_name )
Note that adding an asterisk *
between the two independent variables indicates that we will be testing for the effects of independent variable 1, independent variable 2, and their interaction on the dependent variable. If we were not interested in testing for interactions among independent variables, we could replace the *
with a +
sign.
Challenge 3
Create a model that tests for the effects of four factors on either transpiration or stomatal conductance.
Solution to Challenge 3
fit_all <- lm(Trmmol_night ~ VPD_N * TAIR_N * Soil_moisture * Fgroup, data=phys_data) summary(fit_all)
Call: lm(formula = Trmmol_night ~ VPD_N * TAIR_N * Soil_moisture * Fgroup, data = phys_data) Residuals: Min 1Q Median 3Q Max -0.7984 -0.2043 -0.0572 0.1351 1.4284 Coefficients: Estimate Std. Error t value (Intercept) -3.96407 3.44780 -1.150 VPD_N 2.06121 6.10586 0.338 TAIR_N 0.25750 0.18189 1.416 Soil_moisture 22.02982 22.42903 0.982 Fgroupgrass -2.56106 4.49212 -0.570 Fgroupwoody 2.52347 4.45109 0.567 VPD_N:TAIR_N -0.14744 0.26059 -0.566 VPD_N:Soil_moisture -0.99095 45.52504 -0.022 TAIR_N:Soil_moisture -1.31389 1.17040 -1.123 VPD_N:Fgroupgrass 7.51010 7.89466 0.951 VPD_N:Fgroupwoody 1.51518 7.88263 0.192 TAIR_N:Fgroupgrass 0.14667 0.23783 0.617 TAIR_N:Fgroupwoody -0.16006 0.23482 -0.682 Soil_moisture:Fgroupgrass 30.54712 29.24017 1.045 Soil_moisture:Fgroupwoody -11.37981 28.95575 -0.393 VPD_N:TAIR_N:Soil_moisture 0.40143 1.91713 0.209 VPD_N:TAIR_N:Fgroupgrass -0.34894 0.33756 -1.034 VPD_N:TAIR_N:Fgroupwoody -0.02159 0.33642 -0.064 VPD_N:Soil_moisture:Fgroupgrass -81.58945 58.85944 -1.386 VPD_N:Soil_moisture:Fgroupwoody -22.69473 58.77257 -0.386 TAIR_N:Soil_moisture:Fgroupgrass -1.52112 1.53236 -0.993 TAIR_N:Soil_moisture:Fgroupwoody 0.78695 1.51098 0.521 VPD_N:TAIR_N:Soil_moisture:Fgroupgrass 3.57627 2.48279 1.440 VPD_N:TAIR_N:Soil_moisture:Fgroupwoody 0.67813 2.47500 0.274 Pr(>|t|) (Intercept) 0.251 VPD_N 0.736 TAIR_N 0.158 Soil_moisture 0.327 Fgroupgrass 0.569 Fgroupwoody 0.571 VPD_N:TAIR_N 0.572 VPD_N:Soil_moisture 0.983 TAIR_N:Soil_moisture 0.263 VPD_N:Fgroupgrass 0.342 VPD_N:Fgroupwoody 0.848 TAIR_N:Fgroupgrass 0.538 TAIR_N:Fgroupwoody 0.496 Soil_moisture:Fgroupgrass 0.297 Soil_moisture:Fgroupwoody 0.695 VPD_N:TAIR_N:Soil_moisture 0.834 VPD_N:TAIR_N:Fgroupgrass 0.302 VPD_N:TAIR_N:Fgroupwoody 0.949 VPD_N:Soil_moisture:Fgroupgrass 0.167 VPD_N:Soil_moisture:Fgroupwoody 0.700 TAIR_N:Soil_moisture:Fgroupgrass 0.322 TAIR_N:Soil_moisture:Fgroupwoody 0.603 VPD_N:TAIR_N:Soil_moisture:Fgroupgrass 0.151 VPD_N:TAIR_N:Soil_moisture:Fgroupwoody 0.784 Residual standard error: 0.3598 on 261 degrees of freedom Multiple R-squared: 0.219, Adjusted R-squared: 0.1502 F-statistic: 3.182 on 23 and 261 DF, p-value: 3.459e-06
How to Work with Complex Models: IT Model Averaging
As you can see, we have a lot of potential interactions with just four independent variables in our model! This becomes incredibly difficult to interpret, so we often have to look for other methods to analyze complex relationships in our data.
Although there are many different and valid ways to approach a statistical problem, today we are going to use Information Theoretic (IT) Model Averaging. Rather than creating a model with all possible variables and interactions, IT Model Averaging:
-
Compares mulitple competing models using information criteria
-
Ranks and weights each competing model
-
Averages a top model set to produce a final model that only include predictor variables represented in the top model set
IT Model Averaging quantifies multiple competing hypotheses and is better able to avoid over-parameterization than traditional methods using a single model. To get started, we need to create a “global model” that includes all possible predictor variables. To simplify things even further, lets only consider pairwise interactions among predictors. For this we need to use a +
sign between variables and square the variables with ()^2
.
model_name <- lm(dependent_variable ~ (independent_variable_1 + independent_variable_2)^2, data=dataframe_name )
# Global model
fit_all <- lm(Trmmol_night ~ (VPD_N + TAIR_N + Soil_moisture + Fgroup)^2, data=phys_data)
IT Model Averaging requires the following packages:
arm
: Includes the standardize()
function that standardizes the input variables
MuMIn
: The dredge()
function creates a full submodel set, the get.models()
function creates a top model set, and model.avg()
creates the average model and importance()
calculates relative importance. Let’s install and load these packages before we begin:
# Install packages
install.packages("arm")
install.packages("MuMIn")
# Load packages
library(arm)
library(MuMIn)
Before we start using these functions, let’s take a look at the package descriptions for arm and MuMIn
Note: When reading package descriptions, it is important to:
-
Read the overall package description. This will give you a good overall idea of what this package does, and if it would be useful for you.
-
Identify particular functions of interest. You don’t necessarily need to read the documentation for all of the functions within a package (typically, you’ll only use a few functions).
-
For individual functions, read through the basic description, usage, argument descriptions, details / notes, and examples. This will tell you what syntax to use and what the syntax means.
-
You can access package descriptions online (google search), from the CRAN website, or by using the
help()
function in R.
First, let’s standardize our input variables with the standardize()
function in the arm
package. standardize()
rescales numeric variables that take on more than two values to have a mean of 0 and a standard deviation of 0.5. To do this, we just need to specify the object to standardize (our global model, fit_all
):
# Standardize the global model
stdz.model<-standardize(fit_all)
summary(stdz.model)
Next, we create the full submodel set with the dredge()
function in the MuMIn
package by specifying the object that dredge()
will evaluate (our standardized model, stdz.model
). We also need to change the default “na.omit” to prevent models from being fitted to different datasets in case of missing values using options(na.action=na.fail)
:
options(na.action=na.fail)
model.set<-dredge(stdz.model)
The get.models()
function will then create a top model set. First, we need to specify the object that get.models
will evaluate (our model set, model.set
), and then we need to specify the subset of models to include all models within 4AICcs with subset=delta<4
:
top.models<-get.models(model.set, subset=delta<4)
top.models
Finally, we’ll use the model.avg()
function to create our average model and calculates relative importance. We need to specify the object that model.avg()
will evaluate (in this case our top model set, top.models
):
average_model<-model.avg(top.models)
To check out our final average model, use the summary()
function:
summary(average_model)
This is a lot of information! We can see which component models were chosen as the top model set and then averaged, as well as their associated ranking information. We can also see the model-averaged coefficients for both the full average model and conditional average model.
The conditional average only averages over the models where the parameter appears. Conversely, the full average assumes that a variable is included in every model, but in some models the corresponding coefficient (and its respective variance) is set to zero. Unlike the conditional average, the full average does not bias the value away from zero.
We can also check out the relative importance of each factor within the model with the importance()
function. Relative importance is a unitless metric ranging from 0 (doesn’t contribute to the model at all) to 1 (contributes heavily to the model). This function will also give use the number of models within the top model set that contain each factor.
importance(average_model)
Challenge 4
Using IT Model Averaging, create a model that tests for the effects of four factors on either transpiration or stomatal conductance. Then, create a plot that best illustrates the main finding of your top model. Ultimately, your results should be able to address one of these questions:
- What environmental variables drive nocturnal transpiration / stomatal conductance, and do these differ from the drivers of daytime transpiration / stomatal conductance?
- Are nocturnal transpiration and stomatal conductance associated with daytime physiological processes?
Key Points
Linear regression is a quick and easy way to evaluate the direct relationship between two variables.
IT Model Averaging is one approach to evaluate more complex relationships between variables.
There are often many valid approaches to address a question in R, and understanding how to interpret R packages can help determine which approach might be most appropriate.