-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathestimation.qmd
2335 lines (1725 loc) · 88 KB
/
estimation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Model Estimation and Optimization {#sec-estimation}
![](img/chapter_gp_plots/gp_plot_3.svg){width=75% fig-align='center'}
```{r}
#| label: setup-estimation
#| include: false
#|
# not sure what's going on to need this
filter = dplyr::filter
```
```{python}
#| label: py-setup-estimation
#| echo: false
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.optimize import minimize
np.set_printoptions(formatter={'float': lambda x: '{0:0.4f}'.format(x)})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# df_reviews = pd.read_csv('data/movie_reviews.csv').dropna()
# model_reviews = sm.load('linear_models/data/model_reviews.pickle') # pkl later
# model_reviews = smf.ols('rating ~ word_count_sc', data = df_reviews).fit()
```
In our initial linear model, the coefficients for each feature are the key parameters. But how do we know what the coefficients are, and how did we come to those values? When we explore a linear model using some package function, they appear magically, but it's worth knowing a little bit about how they come to be, so let's try and dive a little deeper. As we do so, we'll end up going a long way into 'demystifying' the modeling process.
**Model estimation** is the process of finding the parameters associated with a model that allow us to reach a particular modeling goal. Different types of models will have different parameters to estimate, and there are different ways to estimate them. In general though, the goal is the same, find the set of parameters that will lead to the best predictions under the current data modeling context.
With model estimation, we can break things down into the following steps:
1. Start with an **initial guess** for the parameters.
2. Calculate the **prediction error** based on those parameters, or some function of it, or some other value that represents our model's **objective**.
3. **Update** the guess.
4. Repeat steps 2 & 3 until we find a 'best' guess.
Pretty straightforward, right? Well, it's not always so simple, but this is the general idea in most applications, so keep it in mind! In this chapter, we'll show how to do this ourselves to take away the mystery a bit from when you run standard model functions in typical contexts. Hopefully then you'll gain more confidence when you do use them!
## Key Ideas {#sec-estim-key-ideas}
A few concepts we'll keep using here are fundamental to understanding estimation and optimization. We should note that we're qualifying our present discussion of these topics to typical linear models, machine learning, and similar settings, but they are much more broad and general than presented here. We've seen some of this before, but we'll be getting a bit cozier with the concepts now.
- **Parameters** are the values associated with a model that we have to estimate.
- **Estimation** is the process of finding the parameters associated with a model.
- The **objective (loss) function** takes input and produces a value that we want to maximize or minimize.
- **Prediction error** is the difference between the observed value of the target and the predicted value, and is often used to calculate the objective function.
- **Optimization** is the specific process of finding the parameters that maximize or minimize some objective function.
- **Model Selection** is the process of choosing the best model from a set of models.
:::{.callout-note title='Estimation vs. Optimization' collapse='true'}
We can use **estimation** as general term for finding parameters, while **optimization** can be seen as a term for finding parameters that maximize or minimize some **objective function**, or even a combination of objectives. In some cases, we can estimate parameters without optimization, because there is a known way of solving the problem, but in most modeling situations we are going to use some optimization approach to find a 'best' set of parameters.
:::
### Why this matters {#sec-estim-why}
When it comes to modeling, even knowing just a little bit about what goes on behind the scenes is a great demystifier. And if models are less of a mystery, you'll feel more confident using them. Much of what you see here is part of almost every common model used for statistics and machine learning, and adding this knowledge provides you a more solid foundation for expanding your modeling skills.
### Helpful context {#sec-estim-good-to-know}
This chapter is more involved and technical than most of the others, so might be more suited for those who like to get their hands dirty. It's all about DIY, and so we'll be doing a lot of the work ourselves. If you're not one of those types of people that gets much out of that, that's ok, you can skip this chapter and still get a lot out of the rest of the book. But if you're curious about how models work, or you want to be able to do more than just run a canned function, then we think you'll find the following useful. You'd want to at least have your linear model basics down (@sec-foundation).
## Data Setup {#sec-estim-data-setup}
For the examples here, we'll use the world happiness dataset for the year 2018. We'll use the happiness score as our target. Let's take an initial look at the data here, but for more information see the appendix @sec-dd-world-happiness-report.
```{r}
#| echo: false
#| eval: false
#| label: fig-r-happiness-data
#| caption: World Happiness Data Summary
# use this for the img
Happiness = read_csv('data/world_happiness_2018.csv') |>
drop_na() |>
select(
country,
happiness = happiness_score,
`life expectancy` = healthy_life_expectancy_at_birth,
`GDP per cap (log)` = log_gdp_per_capita,
`perc. of corruption` = perceptions_of_corruption
)
tbl_happy = Happiness |>
gtExtras::gt_plt_summary()
# CROP AFTER ('expand' does not appear affect unnecessary whitespace)
gtsave_extra(tbl_happy, 'img/happiness_data_summary.png', expand = 0, zoom = 3)
```
![World Happiness Data Summary](img/happiness_data_summary.png){#fig-r-happiness-data width=100%}
Our happiness score has values from around 3-7, life expectancy and gdp appear to have some notable variability, and corruption perception is skewed toward lower values. We can also see that the features and target are correlated with each other, which is not surprising.
\small
```{r}
#| echo: false
#| label: tbl-r-happiness-data-correlation
#| tbl-cap: Correlation Matrix for World Happiness Data
read_csv('data/world_happiness_2018.csv') |>
drop_na() |>
select(
country,
happiness = happiness_score,
life_exp = healthy_life_expectancy_at_birth,
log_gdp_pc = log_gdp_per_capita,
corrupt = perceptions_of_corruption
) |>
corrr::correlate(diagonal = 1) |>
gt()
htmltools::br()
```
\normalsize
For our purposes here, we'll drop any rows with missing values, and we'll use scaled features so that they have the same variance, which, as noted in the [data chapter](./data.html#sec-data-transformations), can help make estimation easier.
:::{.panel-tabset}
##### R
```{r}
#| label: r-happiness-data-setup
df_happiness = read_csv('https://tinyurl.com/worldhappiness2018') |>
drop_na() |>
rename(happiness = happiness_score) |>
select(
country,
happiness,
contains('_sc')
)
```
##### Python
```{python}
#| label: py-happiness-data-setup
import pandas as pd
df_happiness = (
pd.read_csv('https://tinyurl.com/worldhappiness2018')
.dropna()
.rename(columns = {'happiness_score': 'happiness'})
.filter(regex = '_sc|country|happ')
)
```
:::
### Other Setup
For the R examples, after the above nothing beyond base R is needed. For Python examples, the following should be enough to get you through the examples.
```{python}
#| label: py-imports-estimation
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.optimize import minimize
from scipy.stats import norm
```
## Starting Out by Guessing {#sec-estim-guessing}
So, we'll start with a model in which we predict a country's level of happiness by their life expectancy. If you can expect to live longer, maybe you're probably in a country with better health care, higher incomes, and other important stuff. As such we can probably expect higher happiness scores with higher life expectancy. We'll stick with a linear regression model as well to start out.
As a starting point we can just guess what the parameter should be. But how would we know what to guess? How would we know which guesses are better than others? Let's try a couple guesses and see what happens. Let's say that we think all countries start at the bottom on the happiness scale (around 3), but life expectancy makes a big impact- for every standard deviation of life expectancy we go up a whole point on happiness[^lifexpcoef]. We can plug this into the model and see what we get:
[^lifexpcoef]: Since life expectancy is scaled, a standard deviation is 1, and in this case represents about 7 years.
$$
\text{prediction} = 3.3 + 1\cdot\text{life\_exp}
$$
For a different model, we'll say countries start with a mean of happiness score, and moving up a standard deviation of life expectancy would move us up a half point of happiness.
$$
\text{prediction} = \overline{\text{happiness}} + .5\cdot\text{life\_exp}
$$
How do we know which is better? Let's find out!
## Prediction Error {#sec-estim-prediction-error}
We've seen that a key component to model assessment involves comparing the predictions from the model to the actual values of the target. This difference is known as the **prediction error**, or **residuals** in more statistical contexts. We can express this as:
$$
\epsilon = y - \hat{y}
$$
$$
\text{error} = \text{target} - \text{model-based guess}
$$
This prediction error tells us how far off our model prediction is from the observed target values, but it also gives us a way to compare models. How? With our measure of prediction error, we can calculate a total error for all observations/predictions (@sec-knowing-model-metrics), or similarly, the average error. If one model or parameter set has less total or average error, we can say it's a better model than one that has more (@sec-knowing-model-compare). Ideally we'd like to choose a model with the least possible error, but we'll see that this is not always possible[^neverbest].
However, if we just take the average of our errors from a linear regression model, you'll see that it is roughly zero! This is by design for many common models, and is even made explicit in their mathematical depiction. So, to get a meaningful error metric, we need to use the squared error value or the absolute value. These also allow errors of similar value above and below the observed value to cost the same[^absloss]. As we've done elsewhere, we'll use squared error here, and we'll calculate the mean of the squared errors for all our predictions, i.e., the **mean squared error (MSE)**.
[^absloss]: We don't have to do it this way, but it's the default in most scenarios. As an example, maybe for your situation overshooting is worse than undershooting, and so you might want to use an approach that would weight those errors more heavily.
[^neverbest]: It turns out that our error metric is itself an *estimate* of the true error. We'll get more into this later, but for now this means that we can't ever know the true error, and so we can't ever really know the best or true model. However, we can still choose a good or better model relative to others based on our error estimate.
:::{.panel-tabset}
###### R
```{r}
#| label: r-error
#| results: hide
y = df_happiness$happiness
# Calculate the error for the guess of 4
prediction = min(df_happiness$happiness) + 1*df_happiness$life_exp_sc
mse_model_A = mean((y - prediction)^2)
# Calculate the error for our other guess
prediction = mean(y) + .5 * df_happiness$life_exp_sc
mse_model_B = mean((y - prediction)^2)
tibble(
Model = c('A', 'B'),
MSE = c(mse_model_A, mse_model_B)
)
```
###### Python
```{python}
#| label: py-error
#| results: hide
y = df_happiness['happiness']
# Calculate the error for the guess of four
prediction = np.min(df_happiness['happiness']) + \
1 * df_happiness['life_exp_sc']
mse_model_A = np.mean((y - prediction)**2)
# Calculate the error for our other guess
prediction = y.mean() + .5 * df_happiness['life_exp_sc']
mse_model_B = np.mean((y - prediction)**2)
pd.DataFrame({
'Model': ['A', 'B'],
'MSE': [mse_model_A, mse_model_B]
})
```
:::
Now let's look at our MSE, and we'll also inspect the square root of it, or the **Root Mean Squared Error**, as that puts things back on the original target scale, and tells us the standard deviation of our prediction errors. We also add the **Mean Absolute Error** (MAE) as another metric with straightforward interpretation.
```{r}
#| echo: false
#| label: tbl-compare-error
#| tbl-cap: Comparison of Error Metrics for Two Models
mse_tab = tibble(
Model = c('A', 'B'),
MSE = c(mse_model_A, mse_model_B),
RMSE = sqrt(MSE),
MAE = c(
mean(abs(y - 4)),
mean(abs(y - (mean(y) + 1 * df_happiness$life_exp_sc)))
),
`RMSE % drop` = c(NA, round((RMSE[2] - RMSE[1]) / RMSE[1] * 100, 1)),
`MAE % drop` = c(NA, round((MAE[2] - MAE[1]) / MAE[1] * 100, 1))
) |>
mutate(,
`RMSE % drop` = c('', scales::percent(`RMSE % drop`[2] / -100)),
`MAE % drop` = c('', scales::percent(`MAE % drop`[2] / -100))
)
mse_tab |>
gt() |>
cols_align('center')
```
Inspecting the metrics, we can see that we are off on average by over a point for model A (MAE), and a little over half a point on average for model B. So we can see that model B is not only better, but results in a `r mse_tab[['RMSE % drop']][2]` drop in RMSE, and similar for MAE. We'd definitely prefer it over model A. This approach is also how we can compare models in a general fashion.
Now all of this is useful, and at least we can say one model is better than another. But you're probably hoping there is an easier way to get a good guess for our model parameters, especially when we have possibly dozens of features and/or parameters to keep track of, and there is!
## Ordinary Least Squares {#sec-estim-ols}
In a simple linear model, we often use the **Ordinary Least Squares (OLS)** method to estimate parameters. This method finds the coefficients that minimize the sum of the squared differences between the predicted and actual values[^notamodel]. In other words, it finds the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, which is what we just did in our previous example. The sum of the squared errors is also called the **residual sum of squares** (RSS), as opposed to the 'total' sums of squares (i.e., the variance of the target), and the part explained by the model ('model' or 'explained' sums of squares). We can express this as follows, where $y_i$ is the observed value of the target for observation $i$, and $\hat{y_i}$ is the predicted value from the model.
[^notamodel]: Some disciplines seem to confuse models with estimation methods and link functions. It doesn't really make sense, nor is it informative, to call something an OLS model or a logit model. Many models are estimated using a least squares objective function, even deep learning, and different types of models use a logit link, from logistic regression, to beta regression, to activation functions used in deep learning.
$$
\text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2
$$ {#eq-ols}
It's called *ordinary* least squares because there are other least squares methods - generalized least squares, weighted least squares, and others, but we don't need to worry about that for now. The sum or mean of the squared errors is our **objective value**. The **objective function** takes the predictions and observed target values as inputs, and returns the objective value as an output. We can use this value to find the best parameters for a specific model, as well as compare models with different parameters.
Now let's calculate the OLS estimate for our model. We need our own function to do this, but it doesn't take much to create one. We need to map our inputs to our output, which are the model predictions. We then calculate the error, square it, and then average the squared errors to provide the mean squared error.
:::{.panel-tabset}
##### R
```{r}
#| label: r-ols
# for later comparison
model_lr_happy = lm(happiness ~ life_exp_sc, data = df_happiness)
ols = function(X, y, par) {
# add a column of 1s for the intercept
X = cbind(1, X)
# Calculate the predicted values
y_hat = X %*% par # %*% is matrix multiplication
# Calculate the error
error = y - y_hat
# Calculate the value as mean squared error
value = sum(error^2) / nrow(X)
# Return the objective value
return(value)
}
```
##### Python
```{python}
#| label: py-ols
# for later comparison
model_lr_happy = smf.ols('happiness ~ life_exp_sc', data = df_happiness).fit()
def ols(par, X, y):
# add a column of 1s for the intercept
X = np.c_[np.ones(X.shape[0]), X]
# Calculate the predicted values
y_hat = X @ par # @ is matrix multiplication
# Calculate the mean of the squared errors
value = np.mean((y - y_hat)**2)
# Return the objective value
return value
```
:::
We'll want to make a bunch of guesses for the parameters, so let's create data for those guesses. We'll then choose the guess that gives us the lowest objective value. But before getting carried away, try it out for just one guess to see how it works!
:::{.panel-tabset}
##### R
```{r}
#| label: grid-guess-r
# create a grid of guesses
guesses = crossing(
b0 = seq(1, 7, 0.1),
b1 = seq(-1, 1, 0.1)
)
# Example for one guess
ols(
X = df_happiness$life_exp_sc,
y = df_happiness$happiness,
par = unlist(guesses[1, ])
)
```
##### Python
```{python}
#| label: grid-guess-py
# create a grid of guesses
from itertools import product
guesses = pd.DataFrame(
product(
np.arange(1, 7, 0.1),
np.arange(-1, 1, 0.1)
),
columns = ['b0', 'b1']
)
# Example for one guess
ols(
par = guesses.iloc[0,:],
X = df_happiness['life_exp_sc'],
y = df_happiness['happiness']
)
```
:::
Now we'll calculate the loss for each guess and find which one gives us the smallest function value.
:::{.panel-tabset}
##### R
```{r}
#| label: r-ols-get-best-guess
# Calculate the function value for each guess, mapping over
# each combination of b0 and b1
guesses = guesses |>
mutate(
objective = map2_dbl(
guesses$b0,
guesses$b1,
\(b0, b1) ols(
par = c(b0, b1),
X = df_happiness$life_exp_sc,
y = df_happiness$happiness
)
)
)
min_loss = guesses |> filter(objective == min(objective))
min_loss
```
##### Python
```{python}
#| label: py-ols-get-best-guess
# Calculate the function value for each guess, mapping over
# each combination of b0 and b1
guesses['objective'] = guesses.apply(
lambda x: ols(
par = x,
X = df_happiness['life_exp_sc'],
y = df_happiness['happiness']
),
axis = 1
)
min_loss = guesses[guesses['objective'] == guesses['objective'].min()]
min_loss
```
:::
The following plot shows the objective value for each guess in a smooth fashion as if we had a more continuous space. Darker values indicate we're getting closer to our smallest objective value. The star notes our best guess.
```{r}
#| echo: false
#| label: r-ols-get-mse
model_lr_happy_life = lm(happiness ~ life_exp_sc, data = df_happiness)
int_happy = round(coef(model_lr_happy_life)[1], 2)
coef_happy = round(coef(model_lr_happy_life)[2], 2)
mse_happy_life = round(performance::performance_rmse(model_lr_happy_life)^2, 4)
```
```{r}
#| echo: false
#| eval: false
#| label: fig-r-ols-apply
predictions = min_loss$b0 + min_loss$b1 * df_happiness$life_exp_sc
p_ols = guesses |>
ggplot(aes(x = b0, y = b1)) +
geom_raster(aes(fill = objective), show.legend = FALSE) +
geom_contour(aes(z = objective), color = 'white', show.legend = FALSE) +
geom_point(
x = model_lr_happy_life$coefficients[1],
y = model_lr_happy_life$coefficients[2],
size = 5,
pch = 8,
color = 'white',
alpha = 1
) +
# annotate(
# geom = 'text',
# # aes(
# x = min_loss$b0-1,
# y = min_loss$b1,
# label = glue('Best guess at ({round(min_loss$b0, 2)}, {round(min_loss$b1, 3)})
# Objective value: {round(min_loss$objective, 4)}'),
# size = 3,
# hjust = 0,
# color = '#ffffff'
# ) +
# annotate(
# geom = 'text',
# x = int_happy + .66,
# y = coef_happy,
# label = glue('OLS estimate at ({int_happy}, {coef_happy})
# Objective value: {mse_happy_life}'),
# size = 3,
# # hjust = c(-0.1, -0.1),
# color = '#ffffff'
# ) +
# coord_cartesian(
# # xlim = c(-1, 1),
# ylim = c(-1, 1)
# ) +
scico::scale_fill_scico(begin = 0, end = 1, direction = 1) +
labs(
x = 'b0',
y = 'b1',
title = 'Objective value (loss) for different guesses of b0 and b1'
)
# p_ols
ggsave(p_ols, filename = 'img/estim-ols_guesses.svg', width = 8, height = 6)
```
![Results of parameter search](img/estim-ols_guesses.svg){#fig-r-ols-apply width=75%}
Now let's run the model using standard functions.
:::{.panel-tabset}
##### R
```{r}
#| label: r-ols-lm
#| results: hide
model_lr_happy_life = lm(happiness ~ life_exp_sc, data = df_happiness)
# not shown
c(
coef(model_lr_happy_life),
performance::performance_mse(model_lr_happy_life)
)
```
##### Python
```{python}
#| label: py-ols-lr
#| results: hide
model_lr_happy_life = sm.OLS(
df_happiness['happiness'],
sm.add_constant(df_happiness['life_exp_sc'])
).fit()
# not shown
model_lr_happy_life.params, model_lr_happy_life.scale
```
:::
If we inspect our results from the standard functions, we had estimates of `r int_happy` and `r coef_happy` for our coefficients versus the best guess from our approach of `r min_loss$b0` and `r min_loss$b1`. These are very similar but not exactly the same, but this is mostly due to the granularity of our guesses. Even so, in the end we can see that we get pretty dang close to what our basic `lm` or `statsmodels` functions would get us. Pretty neat!
::: {.callout-note title="Estimation as 'Learning'" collapse='true'}
Estimation and/or optimization can be seen as the process of a model *learning* which parameters will best allow the predictions to match the observed data, and hopefully, predict as-yet-unseen future data. This is a very common way to think about estimation in machine *learning*, and it is a useful way to think about our simple linear model also.
One thing to keep in mind is that it is not a magical process. It takes good data, a good idea (model), and an appropriate estimation method to get good results.
:::
## Optimization {#sec-estim-opt}
Before we get into other objective functions, let's think about a better way to find good parameters for our model. Rather than just guessing, we can use a more systematic approach, and thankfully, there are tools out there to help us. We just use a function like our OLS function, give it a starting point, and let the algorithms do the rest! These tools eventually arrive at a pretty good set of parameters, and are optimized for speed.
Previously we created a set of guesses, and tried each one. This approach is often called a **grid search**, as we search over a grid of possible parameters as in @fig-r-ols-apply, and it is a bit of a brute force approach to finding the best fitting model. You can maybe imagine a couple of unfortunate scenarios for this approach. One is just having a very large number of parameters to search, a common issue in deep learning. Or it may be that our range of guesses doesn't allow us to find the right set of parameters. Or perhaps we specify a very large range, but the best fitting model is within a very narrow part of that, so that it takes a long time to find them. In any of these cases we waste a lot of time or may not find an optimal solution.
In general, we can think of **optimization** as employing a smarter, more efficient way to find what you're looking for. Here's how it works:
- **Start with an initial guess** for the parameters.
- **Calculate the objective function** given the parameters.
- **Update the parameters** to a new guess (that hopefully improves the objective function).
- **Repeat**, until the improvement is small enough to not be worth continuing, or an arbitrary maximum number of iterations is reached.
With optimization, the key aspect is how we *update* the old parameters with a new guess at each iteration. Different **optimization algorithms** use different approaches to find those new guesses. The process stops when the improvement is smaller than a set **tolerance** level, or the maximum number of iterations is reached. If the objective is met, we say that our model has **converged**. Sometimes, the number of iterations is not enough for us to reach convergence in terms of tolerance, and we have to try again with a different set of parameters, a different algorithm, maybe use some data transformations, or something else.
So, let's try it out! We start out with several inputs:
- the objective function
- the initial guess for the parameters to get things going
- other related inputs to the objective function, such as the data
- options for the optimization process, e.g., algorithm, maximum number of iterations, etc.
With these inputs, we'll let the chosen optimization function do the rest of the work. We'll again compare our results to the standard functions to make sure we're on the right track.
:::{.panel-tabset}
##### R
We'll use the `optim` function in R.
```{r}
#| label: r-optim-ols
#| results: 'hide'
our_ols_optim = optim(
par = c(1, 0), # initial guess for the parameters
fn = ols,
X = df_happiness$life_exp_sc,
y = df_happiness$happiness,
method = 'BFGS', # optimization algorithm
control = list(
reltol = 1e-6, # tolerance
maxit = 500 # max iterations
)
)
our_ols_optim
```
##### Python
We'll use the `minimize` function in Python.
```{python}
#| label: py-optim-ols
#| results: 'hide'
from scipy.optimize import minimize
our_ols_optim = minimize(
fun = ols,
x0 = np.array([1., 0.]), # initial guess for the parameters
args = (
np.array(df_happiness['life_exp_sc']),
np.array(df_happiness['happiness'])
),
method = 'BFGS', # optimization algorithm
tol = 1e-6, # tolerance
options = {
'maxiter': 500 # max iterations
}
)
our_ols_optim
```
:::
Optimization functions typically return multiple values, including the best parameters found, the value of the objective function at that point, and sometimes other information like the number of iterations it took to reach the returned value and whether or not the process converged. This can be quite a bit of stuff, so we don't show the raw output, but we definitely encourage you to inspect it closely. The following table shows the estimated parameters and the objective value for our model, and we can compare it to the standard functions to see how we did.
```{r}
#| echo: false
#| label: tbl-r-optim-ols
#| tbl-cap: Comparison of Our Results to a Standard Function
our_result_tbl = tibble(
'Parameter' = c('Intercept', 'Life Exp. Coef.', 'Objective/MSE'),
`Standard` = c(coef(model_lr_happy_life), performance::performance_mse(model_lr_happy_life)),
`Our Result` = c(our_ols_optim$par, our_ols_optim$value)
) |>
mutate(
`Standard` = round(`Standard`, 3),
`Our Result` = round(`Our Result`, 3)
)
our_result_tbl |>
gt(decimals = 4)
```
So, our little function and the right tool allows us to come up with the same thing as base R and `statsmodels`! I hope you're feeling pretty good at this point because you should! You just proved you could do what seemed before to be like magic, but really all it took is just a little knowledge about some key concepts to demystify the process. So, let's keep going!
:::{.callout-note title='A Note on Terminology' collapse='true'}
The objective function is often called the **loss function**, and sometimes the **cost function**. However, these both imply that we are trying to minimize the function, which is not always the case[^sklearnnomax], and it's arbitrary whether you want to minimize or maximize the function.
The term **metric**, such as the MSE or AUC we've seen elsewhere, is a value that you might also want to use to evaluate the model. Some metrics are also used as an objective function. For instance, we might minimize MSE as our objective, but also calculate other metrics like Adjusted R-squared or Mean Absolute Error to evaluate the model. It's fine to use MSE as the sole objective/metric as well.
This can be very confusing when starting out! We'll try to stick to the term *metric* for additional values that we might want to examine, apart from the *objective function value*, which is specifically used for estimating model parameters.
:::
[^sklearnnomax]: You may find that some packages will *only* minimize (or maximize) a function, even to the point of reporting nonsensical things like negative squared values, so you'll need to take care when implementing your own metrics.
## Maximum Likelihood {#sec-estim-maxlike}
In our example, we've been minimizing the mean of the squared errors to find the best parameters for our model. But let's think about this differently. Now we'd like you to think about the **data generating process**. Ignoring the model, imagine that each happiness value is generated by some random process, like drawing from a normal distribution. So, something like this would describe it mathematically:
$$
\text{happiness} \sim N(\text{mean}, \text{sd})
$$
where the `mean` is just the mean of happiness, and `sd` is its standard deviation. In other words, we can think of happiness as a random variable that is drawn from a normal distribution with mean and standard deviation as the parameters of that distribution.
Let's apply this idea to our linear model setting. In this case, the mean is a function of life expectancy, and we're not sure what the standard deviation is, but we can go ahead and write our model as follows.
$$
\text{mean} = \beta_0 + \beta_1 * \text{life\_exp}
$$
$$
\text{happiness} \sim N(\text{mean}, \text{sd})
$$
Now, our model is estimating the parameters of the normal distribution. We have an extra parameter to estimate - the standard deviation, which is similar to our RMSE.
In our analysis, instead of merely comparing the predicted happiness score to the actual score by looking at their difference, we do something a little different to get a sense of their correspondence. We consider how *likely* it is to observe the actual happiness score based on our prediction. The value known as the **likelihood**, depends on our model's parameters. The likelihood is determined by the statistical distribution we've selected for our analysis. We can write this as:
$$
\text{Pr}(\text{happiness} \mid \text{life\_exp}, \beta_0, \beta_1, \text{sd})
$$
$$
\text{Pr}(\text{happiness} \mid \text{mean}, \text{sd})
$$
Thinking more generally, the likelihood gives us the probability of the observed data given the parameter estimates.
$$
\text{Pr}(\text{Data} \mid \text{Parameters})
$$
The following shows how to calculate a likelihood for our model. The values you see are called **probability density** values. They're not exactly probabilities, but they show the **relative likelihood** of each observation[^probzero]. You can think of them like you do for probabilities, but remember that likelihoods are slightly different.
[^probzero]: The actual probability of a *specific value* in this setting is 0, but the probability of a range of values is greater than 0. You can find out more about likelihoods and probabilities at the [stackexchange discussion on the the difference between likelihood and probability](https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability), but many traditional statistical texts will cover this also.
:::{.panel-tabset}
##### R
```{r}
#| label: r-demo-Likelihood
# two example life expectancy scores, at the mean (0) and 1 sd above
life_expectancy = c(0, 1)
# observed happiness scores
happiness = c(4, 5.2)
# predicted happiness with rounded coefs
mu = 5 + 1 * life_expectancy
# just a guess for sigma
sigma = .5
# likelihood for each observation
L = dnorm(happiness, mean = mu, sd = sigma)
L
```
##### Python
```{python}
#| label: py-demo-Likelihood
from scipy.stats import norm
# two example life expectancy scores, at the mean (0) and 1 sd above
life_expectancy = np.array([0, 1])
# observed happiness scores
happiness = np.array([4, 5.2])
# predicted happiness with rounded coefs
mu = 5 + 1 * life_expectancy
# just a guess for sigma
sigma = .5
# likelihood for each observation
L = norm.pdf(happiness, loc = mu, scale = sigma)
L
```
:::
With a guess for the parameters and an assumption about the data's distribution, we can calculate the likelihood of each data point, which is what our final result `L` shows. We eventually get a total likelihood for all observations, similar to how we added squared errors previously. But unlike errors, we want *more* likelihood, not less. In theory, we'd multiply each likelihood, but in practice we sum the *log of the likelihood*, otherwise values would get too small for our computers to handle. We can also turn our problem into a minimization problem by calculating the negative log-likelihood, and then minimizing that value, which many optimization algorithms are designed to do[^nll].
[^nll]: The negative log-likelihood is often what is reported in the model output as well.
The following is a function we can use to calculate the likelihood of the data given our parameters. This value, just like MSE, can also be used to compare models with different parameter guesses[^lowerbound]. For the estimate of sigma, note that we are estimating its log value by exponentiating the parameter guess. This will keep it positive, as the standard deviation must be positive. We'll hold off with our result until the table shown next.
[^lowerbound]: Those who have experience here will notice we aren't putting a lower bound on sigma. You typically want to do this otherwise you may get nonsensical results by not keeping sigma positive. You can do this by setting a specific argument for an algorithm that uses boundaries, or more simply by exponentiating the parameter so that it can only be positive. In the latter case, you'll have to exponentiate the final parameter estimate to get back to the correct scale. We leave this detail out of the code for now to keep things simple.
:::{.panel-tabset}
##### R
```{r}
#| label: r-likelihood
#| results: 'hide'
max_likelihood = function(par, X, y) {
# setup
X = cbind(1, X) # add a column of 1s for the intercept
beta = par[-1] # coefficients
sigma = exp(par[1]) # error sd, exp keeps positive
N = nrow(X)
LP = X %*% beta # linear predictor
mu = LP # identity link in the glm sense
# calculate (log) likelihood
ll = dnorm(y, mean = mu, sd = sigma, log = TRUE)
value = -sum(ll) # negative for minimization
return(value)
}
our_max_like = optim(
par = c(1, 0, 0), # first param is sigma
fn = max_likelihood,
X = df_happiness$life_exp_sc,
y = df_happiness$happiness
)
our_max_like
# logLik(model_lr_happy_life) # one way to extract the LL from a model
```
##### Python
```{python}
#| label: py-likelihood
#| results: 'hide'
def max_likelihood(par, X, y):
# setup
X = np.c_[np.ones(X.shape[0]), X] # add a column of 1s for the intercept
beta = par[1:] # coefficients
sigma = np.exp(par[0]) # error sd, exp keeps positive
N = X.shape[0]
LP = X @ beta # linear predictor
mu = LP # identity link in the glm sense
# calculate (log) likelihood
ll = norm.logpdf(y, loc = mu, scale = sigma)
value = -np.sum(ll) # negative for minimization
return value
our_max_like = minimize(
fun = max_likelihood,
x0 = np.array([1, 0, 0]), # first param is sigma
args = (
np.array(df_happiness['life_exp_sc']),
np.array(df_happiness['happiness'])
)
)
our_max_like
# model_lr_happy_life.llf # one way to extract the log likelihood from a model
```
:::
We can compare our result to a built-in function that has capabilities beyond OLS, and the table shows we're duplicating the basic result. We show more decimal places on the log likelihood estimate to prove we aren't getting *exactly* the same result.
```{r}
#| echo: false
#| label: tbl-r-likelihood
#| tbl-cap: Comparison of Our Results to a Standard Function
glm_happy_life = glm(happiness ~ life_exp_sc, data = df_happiness)
our_result_tbl = tibble(
'Parameter' = c('Intercept', 'Life Exp. Coef.', 'Sigma', 'LogLik (neg)'),
`Standard` = c(coef(glm_happy_life), sqrt(summary(glm_happy_life)$dispersion), -logLik(model_lr_happy_life)),
`Our Result` = c(our_max_like$par[-1], exp(our_max_like$par[1]), our_max_like$value)
) |>
gt(decimals = 3) |>
gt::tab_footnote(
footnote = "Parameter estimate is exponentiated for the by-hand approach",
locations = cells_body(
columns = vars(`Parameter`),
rows = `Parameter` == 'Sigma'
),
placement = 'right'
) |>
tab_options(
footnotes.font.size = 10
)
our_result_tbl
```
To use a maximum likelihood approach for linear models, you can use functions like `glm` in R or `GLM` in Python, which is the reference used in the table above. We can also use different likelihoods corresponding to the binomial, poisson and other distributions. Still other packages would allow even more distributions for consideration. In general, we choose a distribution that we feel best reflects the data generating process. For binary targets for example, we typically would feel a bernoulli or binomial distribution is appropriate. For count data, we might choose a poisson or negative binomial distribution. For targets that fall between 0 and 1, we might go for a beta distribution. You can see some of these demonstrated in [@sec-glm].
There are many distributions to choose from, and the best one depends on your data. Sometimes, even if one distribution seems like a better fit, we might choose another one because it's easier to use. Some distributions are special cases of others, or they might become more like a normal distribution under certain conditions. For example, the exponential distribution is a special case of the gamma distribution, and a t distribution with many degrees of freedom looks like a normal distribution. Here is a visualization of the relationships among some of the more common distributions (@wikipedia_relationships_2023).
<!-- {{< pagebreak >}} -->
<!-- \blandscape -->
```{r}
#| echo: false
#| label: fig-distribution-relationships
#| fig-width: 6
#| fig-cap: Relationships Among Some Probability Distributions. Source Wikipedia.
# knitr::include_graphics('img/distribution_relationships.jpg')
```
![Relationships Among Some Probability Distributions](img/distribution_relationships.jpg){#fig-distribution-relationships width=100%}
<!-- \elandscape -->
<!-- {{< pagebreak >}} -->
When you realize that many distributions are closely related, it's easier to understand why we might choose one over another, but also why we might use a simpler option even if it's not the best fit - you likely won't come to a different conclusion about your model. Ultimately, you'll get a better feel for this as you work with different types of data and models.
Here are examples of standard GLM functions, which just require an extra argument for the family of the distribution.
:::{.panel-tabset}
##### R
```{r}
#| eval: false
#| label: r-glm
glm(happiness ~ life_exp_sc, data = df_happiness, family = gaussian)
glm(binary_target ~ x1 + x2, data = some_data, family = binomial)
glm(count ~ x1 + x2, data = some_data, family = poisson)
```
##### Python
```{python}
#| eval: false
#| label: py-glm
import statsmodels.formula.api as smf
smf.glm(
'happiness ~ life_exp_sc',
data = df_happiness,
family = sm.families.Gaussian()
)
smf.glm(
'binary_target ~ x1 + x2',
data = some_data,
family = sm.families.Binomial()
)
smf.glm(
'count ~ x1 + x2',
data = some_data,
family = sm.families.Poisson()
)
```
:::
### Diving deeper {#sec-estim-likelihood-deeper}
Let's think more about what's going on here, as it applies to estimation and optimization in general. It turns out that our objective function defines a 'space' or 'surface'. You can imagine the process as searching for the lowest point on a landscape, with each guess a point on this landscape. Let's start to get a sense of this with the following visualization, based on a single parameter. The following visualization shows this for a single parameter. The data comes from a variable with a true average of 5. As our guesses get closer to 5, the likelihood increases. However, with more and more data, the final guess converges on the true value. Model estimation finds that maximum on the curve, and optimization algorithms are the means to find it.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-r-likelihood-plot
#| fig-cap: Likelihood Function for One Parameter
#| out-width: 100%
set.seed(1234)
y = rpois(100000, lambda = 5)
mus = seq(3, 8, l = 100)