HDtweedie: lasso tutorial

`HDtweedie` package notes

These functions are modified based on the functions from the glmnet package. The glmnet package was authored by Robert Tibshirani, who in 1996 introduced the Lasso method.

load the data

This data has been previously transformed, indicated in Qian, W., Yang, Y., Yang, Y. and Zou, H. (2013), âTweedie’s Compound Poisson Model With Grouped Elastic Net,â submitted to Journal of Computational and Graphical Statistics.

The y response value (aggregate claim loss) has been transformed to y = y / $1000
The numerical values have been scaled (within each factor) to have a mean of 0 and a standard deviation of 1.

library(HDtweedie)

data("auto") # modified auto insurance dataset, where x is a matrix of 2812 policy records with 56 predictors and y is the aggreagate claim loss

dim(auto)

## NULL

class(auto)

## [1] "list"

str(auto) # two lists (where x is a matrix of the predictor variables)

## List of 2
##  $ y: num [1:2812] 0 0 19.2 0 0 ...
##  $ x: num [1:2812, 1:56] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:56] "CAR_TYPE_2" "CAR_TYPE_3" "CAR_TYPE_4" "CAR_TYPE_5" ...

fit a tweedie distribution for lasso regression

This shows nonzero coefficients (Df) and the value of lambda. The function contains many other parameters to specify: group, weights, elasticnet mixing parameter, penalty factor …

fit <- HDtweedie(x = auto$x , y = auto$y, p =1.50) 
print(fit)

## 
## Call:  HDtweedie(x = auto$x, y = auto$y, p = 1.5) 
## 
##     Df   Lambda
## s0   0 47.98000
## s1   1 44.74000
## s2   1 41.73000
## s3   1 38.92000
## s4   1 36.29000
## s5   1 33.85000
## s6   1 31.57000
## s7   1 29.44000
## s8   1 27.45000
## s9   1 25.60000
## s10  1 23.88000
## s11  1 22.27000
## s12  1 20.77000
## s13  1 19.37000
## s14  1 18.06000
## s15  1 16.85000
## s16  1 15.71000
## s17  1 14.65000
## s18  1 13.66000
## s19  1 12.74000
## s20  1 11.88000
## s21  1 11.08000
## s22  1 10.34000
## s23  1  9.64000
## s24  1  8.99000
## s25  1  8.38400
## s26  1  7.81900
## s27  1  7.29200
## s28  1  6.80100
## s29  1  6.34200
## s30  1  5.91500
## s31  1  5.51600
## s32  1  5.14400
## s33  1  4.79800
## s34  1  4.47400
## s35  1  4.17300
## s36  1  3.89200
## s37  1  3.62900
## s38  1  3.38500
## s39  1  3.15700
## s40  1  2.94400
## s41  2  2.74500
## s42  2  2.56000
## s43  2  2.38800
## s44  2  2.22700
## s45  1  2.07700
## s46  1  1.93700
## s47  1  1.80600
## s48  1  1.68500
## s49  2  1.57100
## s50  2  1.46500
## s51  2  1.36600
## s52  2  1.27400
## s53  2  1.18800
## s54  2  1.10800
## s55  2  1.03400
## s56  2  0.96400
## s57  2  0.89900
## s58  2  0.83840
## s59  3  0.78190
## s60  3  0.72920
## s61  3  0.68010
## s62  3  0.63420
## s63  3  0.59150
## s64  4  0.55160
## s65  4  0.51440
## s66  4  0.47980
## s67  5  0.44740
## s68  5  0.41730
## s69  5  0.38920
## s70  5  0.36290
## s71  5  0.33850
## s72  5  0.31570
## s73  5  0.29440
## s74  5  0.27450
## s75  5  0.25600
## s76  5  0.23880
## s77  5  0.22270
## s78  6  0.20770
## s79  6  0.19370
## s80  7  0.18060
## s81  7  0.16850
## s82  7  0.15710
## s83  7  0.14650
## s84  7  0.13660
## s85  7  0.12740
## s86  8  0.11880
## s87  8  0.11080
## s88  8  0.10340
## s89 10  0.09640
## s90 10  0.08990
## s91 10  0.08384
## s92 10  0.07819
## s93 10  0.07292
## s94 11  0.06801
## s95 13  0.06342
## s96 13  0.05915
## s97 12  0.05516
## s98 14  0.05144
## s99 13  0.04798

actual coefficients for fitted value

Computes the coefficients at the requested values for lambda (s = size; can be multiple $\lambda$ values) from a fitted HDtweedie object.

coef(fit, s = 0.1)

##                         1
## (Intercept)  0.1878435590
## CAR_TYPE_2   0.0000000000
## CAR_TYPE_3   0.0000000000
## CAR_TYPE_4   0.0000000000
## CAR_TYPE_5   0.0000000000
## CAR_TYPE_6   0.0000000000
## JOBCLASS_3   0.0000000000
## JOBCLASS_4   0.0000000000
## JOBCLASS_5   0.0000000000
## JOBCLASS_6   0.0000000000
## JOBCLASS_7   0.0000000000
## JOBCLASS_8   0.0000000000
## JOBCLASS_9   0.0000000000
## MAX_EDUC_2   0.0000000000
## MAX_EDUC_3   0.0000000000
## MAX_EDUC_4   0.0000000000
## MAX_EDUC_5   0.0000000000
## KIDSDRIV     0.0000000000
## KIDSDRIV2    0.0000000000
## KIDSDRIV3    0.0000000000
## TRAVTIME     0.0000000000
## TRAVTIME2    0.0000000000
## TRAVTIME3   -0.0001836337
## BLUEBOOK    -0.0164009025
## BLUEBOOK2    0.0000000000
## BLUEBOOK3    0.0000000000
## NPOLICY      0.0000000000
## NPOLICY2     0.0000000000
## NPOLICY3    -0.0019804218
## MVR_PTS      0.2679335031
## MVR_PTS2     0.0104853437
## MVR_PTS3    -0.0038720361
## AGE          0.0000000000
## AGE2         0.0000000000
## AGE3         0.0000000000
## HOMEKIDS     0.0000000000
## HOMEKIDS2    0.0000000000
## HOMEKIDS3    0.0000000000
## YOJ          0.0000000000
## YOJ2         0.0000000000
## YOJ3         0.0000000000
## INCOME       0.0000000000
## INCOME2      0.0000000000
## INCOME3      0.0022316426
## HOME_VAL    -0.0019443085
## HOME_VAL2    0.0000000000
## HOME_VAL3    0.0000000000
## SAMEHOME     0.0000000000
## SAMEHOME2    0.0000000000
## SAMEHOME3    0.0000000000
## CAR_USE      0.0000000000
## RED_CAR      0.0000000000
## REVOLKED     1.1829812495
## GENDER       0.0000000000
## MARRIED      0.0000000000
## PARENT1      0.0000000000
## AREA         0.3881338630

visualize the fit

Each curve corresponds to each variable. The curve shows the path of the coefficient against the $\lambda_{1}$ -norm (regularization parameter) of the whole coefficient vector as $\lambda$ varies, so as Lambda approaches zero, the loss function of the model approaches the OLS (ordinary least squares which seeks to find the $\beta$ that minimizes the differences between the observed and predicted response). Therefore, when lambda is very small, the LASSO solution should be very close to the OLS solution, and all of your coefficients are in the model. As lambda grows, the regularization term has greater effect and you will see fewer variables in your model (because more and more coefficients will be zero valued).

Source: http://stats.stackexchange.com/questions/68431/interpretting-lasso-variable-trace-plots

plot(fit)

n-fold cross validation

This plot includes the cross-validation curve (red dotted line) and the upper and lower standard deviation curves along the $\lambda$ sequence (error bars). The vertical black dotted lines are the two selected lmbda values.

The coefficients displayed are based on the minimal $\lambda$ which gives the minimum mean cross-validated error are essetially the variable selection section of the analysis which shows the included variables (coef != 0) and the exluded variables (coef == 0).

# 5-fold cross validation using the lasso (default = 5)
cv_fit <- cv.HDtweedie(auto$x, auto$y)
plot(cv_fit)

cv_fit$lambda.min

## [1] 0.04797758

coef(cv_fit, s = "lambda.min")

##                        1
## (Intercept) -0.055919922
## CAR_TYPE_2   0.000000000
## CAR_TYPE_3   0.000000000
## CAR_TYPE_4   0.000000000
## CAR_TYPE_5   0.000000000
## CAR_TYPE_6   0.000000000
## JOBCLASS_3   0.000000000
## JOBCLASS_4   0.000000000
## JOBCLASS_5   0.000000000
## JOBCLASS_6   0.000000000
## JOBCLASS_7   0.000000000
## JOBCLASS_8   0.000000000
## JOBCLASS_9   0.000000000
## MAX_EDUC_2   0.000000000
## MAX_EDUC_3   0.000000000
## MAX_EDUC_4   0.000000000
## MAX_EDUC_5   0.000000000
## KIDSDRIV     0.000000000
## KIDSDRIV2    0.000000000
## KIDSDRIV3    0.000000000
## TRAVTIME     0.000000000
## TRAVTIME2   -0.022868351
## TRAVTIME3    0.000000000
## BLUEBOOK    -0.058530567
## BLUEBOOK2   -0.022798853
## BLUEBOOK3    0.000000000
## NPOLICY      0.000000000
## NPOLICY2     0.000000000
## NPOLICY3    -0.001885975
## MVR_PTS      0.290668624
## MVR_PTS2     0.000000000
## MVR_PTS3    -0.003345931
## AGE          0.015615244
## AGE2         0.000000000
## AGE3         0.000000000
## HOMEKIDS     0.000000000
## HOMEKIDS2    0.000000000
## HOMEKIDS3    0.000000000
## YOJ          0.000000000
## YOJ2         0.000000000
## YOJ3         0.000000000
## INCOME       0.000000000
## INCOME2      0.000000000
## INCOME3      0.006162290
## HOME_VAL    -0.038507332
## HOME_VAL2    0.000000000
## HOME_VAL3    0.000000000
## SAMEHOME    -0.002501553
## SAMEHOME2    0.000000000
## SAMEHOME3    0.000000000
## CAR_USE      0.000000000
## RED_CAR      0.000000000
## REVOLKED     1.313816847
## GENDER      -0.009220928
## MARRIED      0.000000000
## PARENT1      0.000000000
## AREA         0.633743866

# define group index (i.e. all factor levels are group with the factor regardless of individual columns)
group1 <- c(rep(1,5),rep(2,7),rep(3,4),rep(4:14,each=3),15:21)

# 5-fold cross validation using the grouped lasso (all or nothing for categorical groups)
cv1 <- cv.HDtweedie(x=auto$x,y=auto$y,group=group1,p=1.5,nfolds=5)
plot(cv1)

model predictions / scoring

Predictions can be made based on the fitted cv.HDtweedie() object. newx is the new input matrix (for this example they just score it against the first 5 observations used to contruct the model) and s is the value(s) of $\lambda$ at which predictions are made.

predict(cv_fit, newx = auto$x[1:5, ], s = "lambda.min")

##              1
## [1,]  1.682658
## [2,]  3.113788
## [3,] 14.424567
## [4,]  1.776053
## [5,]  1.885848

This tutorial has been abstracted based on the glmnet

http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

fin.

HDtweedie: lasso tutorial

Jasmine Dumas

May 2, 2016

Introduction to Tweedie

Introduction to LASSO

`HDtweedie` package notes

load the data

fit a tweedie distribution for lasso regression

actual coefficients for fitted value

visualize the fit

n-fold cross validation

model predictions / scoring

HDtweedie: lasso tutorial

Jasmine Dumas

May 2, 2016

Introduction to Tweedie

Introduction to LASSO

HDtweedie package notes

load the data

fit a tweedie distribution for lasso regression

actual coefficients for fitted value

visualize the fit

n-fold cross validation

model predictions / scoring

`HDtweedie` package notes