rm(list=ls())
library("sfsmisc") # mathematical integration through data points
try(dev.off())
Practical 1: Statistical modeling and likelihood
We learn about the likelihood function and do some didactic examples of maximum likelihood estimation (MLE) with the optim() function.
Example 1: Survival rate
Data from 3 habitats and their deer populations.
We generate the population sizes before and after the winter.
data = data.frame(total = c(18, 25, 20),
survived = c(9, 10, 8))
Question: What is the average survival rate?
Deterministic part: \(\theta\) (just fitting a mean, or the intercept if you will)
Stochastic part: \(survived \sim \text{Binomial}(total,\theta)\)
Naively, we could just calculate the survival rate for each habitat and compute their mean.
data$survived / data$total
[1] 0.5 0.4 0.4
mean(data$survived / data$total)
[1] 0.4333333
But we can do better. Use the statistical model.
Probability distribution \(p(y|\theta)\)
If we know the survival rate, we can compute probability for each possible outcome
x = 0:20
y = dbinom(x=x, size=20, prob=0.3)
sum(y)
[1] 1
barplot(y~x, xlab="Survived out of 20", ylab="Probability")
Same for a different value of the survival rate
x = 0:20
y = dbinom(x=x, size=20, prob=0.75)
sum(y)
[1] 1
barplot(y~x, xlab="Survived out of 20", ylab="Probability")
Likelihood \(L(\theta)=p(y|\theta)\)
Likelihood is probability in reverse. For a given datapoint, the likelihood is the probability of that observation, viewed as a function of the parameter value (survival probability \(\theta\)).
theta = seq(0,1, length.out=100)
head(theta)
[1] 0.00000000 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
Likelihood function for parameter theta and 1st datapoint
lik = dbinom(x=9, size=18, prob=theta)
plot(theta, lik, type="l", xlab="theta", ylab="Likelihood")
Likelihood function for parameter theta and 2nd datapoint
lik = dbinom(x=10, size=25, prob=theta)
plot(theta, lik, type="l", xlab="theta", ylab="Likelihood")
Attention: Likelihood \(L(\theta)\) is not a probability density function for theta. It does not integrate to 1. This means we can use it for maximum likelihood, but not for statistical inference without knowing the full integral.
integral = integrate.xy(theta, lik)
integral
[1] 0.03846154
However, if we scale (divide) \(L(\theta)\) by its integral, the resulting function integrates to 1. But this is infeasible in most applications (>1 parameter, where the integral becomes high-dimensional).
plot(theta, lik/integral, type="l", xlab="theta")
integrate.xy(theta, lik/integral)
[1] 1
OK, so far we have only calculated the likelihood of single datapoints. Assuming the observations are independent, the likelihood function for the whole dataset is the product of these likelihoods.
lik = dbinom(x=data$survived[1], size=data$total[1], prob=theta)*
dbinom(x=data$survived[2], size=data$total[2], prob=theta)*
dbinom(x=data$survived[3], size=data$total[3], prob=theta)
plot(theta, lik, type="l", xlab="theta", ylab="Likelihood")
Maximum likelihood means finding the parameter value for which the observed data is most likely to have occurred. Here we use GLM to find that value.
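Before handing this to glm, a quick added check: since we already evaluated the likelihood on a grid of theta values, which.max() can read an approximate MLE straight off the grid:
theta[which.max(lik)] # grid point with highest likelihood, close to the true MLE 27/63 = 0.4286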
model1 = glm( cbind(survived, total-survived) ~ 1, data=data,
family=binomial)
summary(model1)
Call:
glm(formula = cbind(survived, total - survived) ~ 1, family = binomial,
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2877 0.2546 -1.13 0.258
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.52207 on 2 degrees of freedom
Residual deviance: 0.52207 on 2 degrees of freedom
AIC: 12.975
Number of Fisher Scoring iterations: 3
What is happening? Does GLM estimate a negative survival probability? No, the binomial family uses the logit link by default, so the intercept is reported on the logit scale. We can override the default by specifying an identity link function (the parameter is then estimated on its original scale). More on that in Lesson 5.
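Before refitting, a quick added check (plogis() is R's inverse-logit function): back-transforming the logit-scale intercept recovers the survival probability.
plogis(coef(model1)) # inverse-logit of -0.2877 gives about 0.4286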
model1 = glm( cbind(survived, total-survived) ~ 1, data=data,
family=binomial(link="identity"))
summary(model1)
Call:
glm(formula = cbind(survived, total - survived) ~ 1, family = binomial(link = "identity"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.42857 0.06235 6.874 6.25e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.52207 on 2 degrees of freedom
Residual deviance: 0.52207 on 2 degrees of freedom
AIC: 12.975
Number of Fisher Scoring iterations: 3
confint(model1)
    2.5 %    97.5 % 
0.3110495 0.5517955 
Here, we get a mean survival probability of 0.43 with 95%-CI [0.31, 0.55]. CIs are not computed from the full likelihood, but rely on a local approximation.
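A side note we add here: for this intercept-only binomial model, the ML estimate is exactly the pooled proportion of survivors, which differs from the naive mean of per-habitat rates computed earlier.
sum(data$survived) / sum(data$total) # 27/63 = 0.4286, matches the identity-link GLM
mean(data$survived / data$total)     # 0.4333, naive mean of per-habitat rates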
We used glm to automatically find the parameter value that maximizes the likelihood. Next, we see how to code the likelihood as a function, which we can use for manual search or iterative optimization.
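As a preview, here is a minimal sketch of such a function for Example 1 (the names nll, surv, and tot are ours, not part of the original practical). With only one parameter, R's one-dimensional optimize() already suffices:
nll = function(theta, surv, tot) -sum(dbinom(surv, size=tot, prob=theta, log=TRUE))
optimize(nll, interval=c(0,1), surv=data$survived, tot=data$total)$minimum # about 0.4286, matching the GLM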
Example 2: Mean body size of a mammal population
Now, body size is a continuous response. We generate the weights of 7 individuals.
data = data.frame(weight = c(104, 120, 118, 115, 99, 110, 102))
boxplot(data$weight, ylab="Weight")
Question: What's the mean body size?
Deterministic part: \(\mu\) (just fitting a mean, or the intercept if you will)
Stochastic part: \(weight \sim \text{Normal}(\mu,\sigma)\)
First we assume \(\sigma\) is known and estimate just one parameter, \(\mu\).
sigma=10
Probability distribution of the data for given parameters \(\mu\) and \(\sigma\)
x = seq(50, 150, length.out=100)
y = dnorm(x=x, mean=100, sd=sigma)
plot(x,y, type="l")
or, for another value for the mean \(\mu\)
y = dnorm(x=x, mean=110, sd=sigma)
plot(x,y, type="l")
Likelihood, in reverse, is the probability density of the given data as a function of a parameter
E.g., likelihood function for 1 datapoint
mu = seq(60, 130, length.out=100)
lik = dnorm(data$weight[1], mean=mu, sd=sigma)
plot(mu, lik, type="l")
Likelihood function for all datapoints is the product of all datapoints’ likelihood functions
lik = dnorm(data$weight[1], mean=mu, sd=sigma)*
dnorm(data$weight[2], mean=mu, sd=sigma)*
dnorm(data$weight[3], mean=mu, sd=sigma)*
dnorm(data$weight[4], mean=mu, sd=sigma)*
dnorm(data$weight[5], mean=mu, sd=sigma)*
dnorm(data$weight[6], mean=mu, sd=sigma)*
dnorm(data$weight[7], mean=mu, sd=sigma)
plot(mu, lik, type="l")
In our statistical model, there is a second parameter \(\sigma\). Likelihood is a function of both parameters \(L(\mu,\sigma)\)
We write the likelihood as a function, but use the negative log-likelihood (NLL). This makes things easier, since \(L\) can be VERY small and numerically unstable (a demonstration follows after the manual search below). Minimizing the NLL is the same as maximizing \(L\).
likelihood = function(parameters, weights){
lik = dnorm(weights, mean=parameters[1], sd=parameters[2], log=TRUE)
return(-sum(lik))
}
Try guessing values to minimize the NLL
likelihood(c(110,10), data$weight)
[1] 24.60067
likelihood(c(111,10), data$weight)
[1] 24.65567
likelihood(c(109,10), data$weight)
[1] 24.61567
likelihood(c(108,10), data$weight)
[1] 24.70067
likelihood(c(109,10), data$weight)
[1] 24.61567
likelihood(c(109,11), data$weight)
[1] 24.92445
likelihood(c(109,9), data$weight)
[1] 24.36252
likelihood(c(109,8), data$weight)
[1] 24.21522
likelihood(c(109,7), data$weight)
[1] 24.26823
likelihood(c(109,8), data$weight)
[1] 24.21522
likelihood(c(110,8), data$weight)
[1] 24.19179
likelihood(c(108,8), data$weight)
[1] 24.34804
likelihood(c(109,8), data$weight)
[1] 24.21522
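Why the log scale? A demonstration we added, with made-up numbers: the raw likelihood is a product of many small densities and underflows to zero for larger datasets, while the log-likelihood stays finite.
prod(dnorm(rep(110, 500), mean=100, sd=10))          # underflows to 0
sum(dnorm(rep(110, 500), mean=100, sd=10, log=TRUE)) # finite, about -1861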
We can do better, with an iterative algorithm that searches for the optimum \((\mu,\sigma)\) from the initial guess (100,10), using optim(). It is a mathematical convention to find minima instead of maxima; therefore optim minimizes the NLL.
ml = optim(fn = likelihood,
par = c(100,10),
weights = data$weight) # the data
ml
$par
[1] 109.715724   7.647299

$value
[1] 24.17355

$counts
function gradient 
      53       NA 

$convergence
[1] 0

$message
NULL
lm-solution for comparison
model2 = lm(weight ~ 1, data=data)
summary(model2)
Call:
lm(formula = weight ~ 1, data = data)
Residuals:
Min 1Q Median 3Q Max
-10.7143 -6.7143 0.2857 6.7857 10.2857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.714 3.122 35.14 3.54e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.261 on 6 degrees of freedom
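Note that lm's residual standard error (8.261) differs from the ML estimate of sigma (7.647). An added cross-check: the ML estimate divides the summed squared deviations by n = 7, while lm divides by the residual degrees of freedom, n - 1 = 6.
mean(data$weight)                                # 109.7143, estimate of mu in both approaches
sqrt(sum((data$weight - mean(data$weight))^2)/7) # 7.647, the ML estimate of sigma
sqrt(sum((data$weight - mean(data$weight))^2)/6) # 8.261, lm's residual standard error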
Visualization of 2-dimensional likelihood and plot ML solution
mu.plot = seq(from=80, to=140, length.out=200)
sigma.plot = seq(from=5, to=15, length.out=200)
test = expand.grid(mu=mu.plot, sigma=sigma.plot)
test$lik = NA
for(i in 1:nrow(test)){
test$lik[i] = likelihood(c(test$mu[i], test$sigma[i]), data$weight)
}
lik.plot = matrix(test$lik, nrow=length(mu.plot), ncol=length(sigma.plot))
image(mu.plot, sigma.plot, exp(-lik.plot), # exp(-NLL)=likelihood, since NLL=-log(likelihood)
xlab="mu", ylab="sigma")
points(ml$par[1], ml$par[2], col="white", lwd=2)
Example 3: body size vs age
Now we have a continuous predictor age. Generate data for 7 individuals.
data = data.frame(weight = c(104, 120, 118, 115, 99, 110, 102),
age = c(10, 12, 11, 11, 9, 11, 10))
plot(data$age, data$weight)
Question: What’s the average growth per year? (Slope in age)
Deterministic part: \(\mu=a+b\cdot age\)
Stochastic part: \(weight \sim \text{Normal}(\mu,\sigma)\)
We could easily fit this model with LM
model3 = lm(weight ~ age, data=data)
summary(model3)
Call:
lm(formula = weight ~ age, data = data)
Residuals:
1 2 3 4 5 6 7
-1.2 -1.0 4.9 1.9 1.7 -3.1 -3.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.200 14.423 1.817 0.12899
age 7.900 1.359 5.811 0.00213 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.25 on 5 degrees of freedom
Multiple R-squared: 0.871, Adjusted R-squared: 0.8452
F-statistic: 33.77 on 1 and 5 DF, p-value: 0.002129
coef(model3)
(Intercept)         age 
       26.2         7.9 
intercept = coef(model3)[1]
slope = coef(model3)[2]
But for didactic purposes, we're doing it the hard way: maximum likelihood. We code the likelihood function (the probability density of all datapoints for a given set of parameters \(a,b,\sigma\)).
likelihood=function(parameters, weights, ages){
lik = dnorm(weights,
mean=parameters[1] + parameters[2]*ages, # regression line a+b*age
sd=rep(parameters[3],length(weights)), # identical sigma for all
log=TRUE)
return(-sum(lik)) # negative log likelihood NLL
}
This is the lm-solution and its likelihood. Other solutions fit the data worse and have a higher NLL (= lower likelihood) than the lm-solution.
likelihood(c(intercept, slope, 3.25), data$weight, data$age)
[1] 17.18256
likelihood(c(26, 7.5, 3.25), data$weight, data$age)
[1] 23.72457
likelihood(c(35, 7.5, 3.25), data$weight, data$age)
[1] 24.15061
likelihood(c(45, 6.0, 3.25), data$weight, data$age)
[1] 18.70682
plot(data$age, data$weight)
abline(intercept, slope)
abline(26, 7.5, col="grey")
abline(35, 7.5, col="grey")
abline(45, 6.0, col="grey")
Max-likelihood: iterative algorithm to search for optimum \((a,b,\sigma)\) with initial guess (100,5,8)
ml = optim(fn = likelihood,
par = c(100,5,8),
weights = data$weight, # data
ages = data$age) # data
ml
$par
[1] 26.193382  7.900498  2.745227

$value
[1] 17.00468

$counts
function gradient 
     172       NA 

$convergence
[1] 0

$message
NULL
That’s VERY close to the lm-solution
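An optional extra (our sketch, not part of the original practical): with hessian=TRUE, optim also returns the Hessian of the NLL at the optimum. Its inverse approximates the parameter covariance matrix, yielding approximate standard errors.
ml = optim(fn = likelihood,
           par = c(100,5,8),
           weights = data$weight,
           ages = data$age,
           hessian = TRUE) # same fit as above, additionally returns the Hessian
sqrt(diag(solve(ml$hessian))) # approximate standard errors for a, b, sigma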
Visualize 2-dimensional likelihood (intercept & slope), \(\sigma\) fixed. Plot ML solution
int.plot = seq(from=15, to=35, length.out=200)
slope.plot = seq(from=5, to=10, length.out=200)
test = expand.grid(int=int.plot, slope=slope.plot)
test$lik = NA
head(test)
       int slope lik
1 15.00000 5 NA
2 15.10050 5 NA
3 15.20101 5 NA
4 15.30151 5 NA
5 15.40201 5 NA
6 15.50251 5 NA
for(i in 1:nrow(test)){
test$lik[i] = likelihood(c(test$int[i], test$slope[i], 5.0), data$weight, data$age)
}
lik.plot = matrix(test$lik, nrow=length(int.plot), ncol=length(slope.plot))
image(int.plot, slope.plot, exp(-lik.plot), # exp(-NLL)=likelihood, since NLL=-log(likelihood)
xlab="Intercept", ylab="Slope")
points(ml$par[1], ml$par[2], col="white", lwd=2)
There is a strong negative correlation between Intercept and Slope. This is common if the predictor is not centered (mean=0). In extreme cases, that can cause numerical problems and wrong ML solutions.
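We can quantify this correlation (an added check) from the parameter covariance matrix of the lm fit; cov2cor() converts it to a correlation matrix.
cov2cor(vcov(model3)) # intercept-slope correlation is close to -1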
Example 4: body size vs age (centered)
We repeat the same analysis, but this time the predictor is centered
data = data.frame(weight = c(104, 120, 118, 115, 99, 110, 102),
age = c(10, 12, 11, 11, 9, 11, 10))
data$agecenter = data$age-mean(data$age)
plot(data$agecenter, data$weight)
model4 = lm(weight ~ agecenter, data=data)
summary(model4)
Call:
lm(formula = weight ~ agecenter, data = data)
Residuals:
1 2 3 4 5 6 7
-1.2 -1.0 4.9 1.9 1.7 -3.1 -3.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.714 1.228 89.326 3.33e-09 ***
agecenter 7.900 1.359 5.811 0.00213 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.25 on 5 degrees of freedom
Multiple R-squared: 0.871, Adjusted R-squared: 0.8452
F-statistic: 33.77 on 1 and 5 DF, p-value: 0.002129
coef(model4)
(Intercept)   agecenter 
   109.7143      7.9000 
intercept = coef(model4)[1]
slope = coef(model4)[2]
likelihood=function(parameters, weights, ages){
lik = dnorm(weights,
mean=parameters[1] + parameters[2]*ages, # regression line a+b*age
sd=rep(parameters[3],length(weights)), # identical sigma for all
log=TRUE)
return(-sum(lik))
}
likelihood(c(intercept, slope, 3.25), data$weight, data$agecenter)[1] 17.18256
plot(data$agecenter, data$weight)
abline(intercept, slope)
abline(105, 7.5, col="grey")
abline(115, 7.5, col="grey")
abline(110, 6.0, col="grey")
ml = optim(fn = likelihood,
par = c(100,5,8),
weights = data$weight, # data
ages = data$agecenter) # data
ml
$par
[1] 109.712504   7.900602   2.746666

$value
[1] 17.00468

$counts
function gradient 
     116       NA 

$convergence
[1] 0

$message
NULL
Visualize 2-dimensional likelihood (intercept & slope), \(\sigma\) fixed
int.plot = seq(from=100, to=120, length.out=200)
slope.plot = seq(from=5, to=10, length.out=200)
test = expand.grid(int=int.plot, slope=slope.plot)
test$lik = NA
head(test)
       int slope lik
1 100.0000 5 NA
2 100.1005 5 NA
3 100.2010 5 NA
4 100.3015 5 NA
5 100.4020 5 NA
6 100.5025 5 NA
for(i in 1:nrow(test)){
test$lik[i] = likelihood(c(test$int[i], test$slope[i], 5.0), data$weight, data$agecenter)
}
lik.plot = matrix(test$lik, nrow=length(int.plot), ncol=length(slope.plot))
image(int.plot, slope.plot, exp(-lik.plot), # exp(-NLL)=likelihood, since NLL=-log(likelihood)
xlab="Intercept", ylab="Slope")
points(ml$par[1], ml$par[2], col="white", lwd=2)
By using a centered predictor, the correlation between intercept & slope is resolved.
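The same added check as in Example 3 confirms this numerically.
cov2cor(vcov(model4)) # intercept-slope correlation is now essentially 0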
Exercise 1: two predictors
Now we have more data and a second predictor, the average annual temperature.
Deterministic part: \(\mu=a+b\cdot age + c\cdot temperature\)
Stochastic part: \(weight \sim \text{Normal}(\mu,\sigma)\)
Write a model that includes both predictors for LM and ML. Test different initial guesses for the optim function and see if they all converge to the same result.
data = data.frame(weight = c(65,80,129,41,77,133,75,75,87,109,136,134,91,49,127,120,108,126,81,112),
age = c(7,7,14,6,10,15,9,8,10,13,14,15,9,7,15,13,13,13,7,12),
temperature = c(5.9,6.9,1.5,9.6,9.0,6.9,8.0,10.2,4.8,7.6,6.2,11.2,7.3,11.4,4.1,5.1,3.7,9.5,6.4,8.3))
model5 = lm(weight ~ age+temperature, data=data)
summary(model5)
Call:
lm(formula = weight ~ age + temperature, data = data)
Residuals:
Min 1Q Median 3Q Max
-13.1864 -7.3443 -0.4815 6.2809 15.6250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.5811 12.3765 0.855 0.404
age 8.5746 0.7572 11.325 2.43e-09 ***
temperature -0.8169 0.9275 -0.881 0.391
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.854 on 17 degrees of freedom
Multiple R-squared: 0.8998, Adjusted R-squared: 0.888
F-statistic: 76.32 on 2 and 17 DF, p-value: 3.219e-09
likelihood=function(pars, df){
lik = dnorm(df$weight,
mean=pars[1] + pars[2]*df$age + pars[3]*df$temperature,
sd=rep(pars[4],nrow(df)),
log=TRUE)
return(-sum(lik))
}
ml = optim(fn=likelihood, par=c(100,5,0,8), df=data)
ml$par
[1] 10.5594067  8.5795082 -0.8206534  9.0902816
ml = optim(fn=likelihood, par=c(20,1,1,10), df=data)
ml$par
[1] 10.5764548  8.5744169 -0.8169554  9.0831255
ml = optim(fn=likelihood, par=c(200,10,10,10), df=data)
ml$par
[1] 10.5576618  8.5724384 -0.8132537  9.0897939
Here, all initial guesses converge to the same solution (more or less) with the maximum likelihood approach.
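If the small run-to-run differences bother you, a hedged suggestion: tighten the stopping rule of optim's default Nelder-Mead search via its control list (reltol and maxit are standard optim controls).
ml = optim(fn=likelihood, par=c(100,5,0,8), df=data,
           control=list(reltol=1e-12, maxit=5000))
ml$par # different starting values now agree to more digits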