r - Why is caret train taking up so much memory?


When I train a model using glm(), it works and I don't come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train() is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this... any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search, because there's no grid for GLMs, I believe.)

The problem is two-fold: i) train() doesn't just fit the model once via glm(), it bootstraps that model, and with the defaults train() uses 25 bootstrap samples, which, coupled with problem ii), is the (or a) source of the problem, and ii) train() calls the glm() function with its defaults. Those defaults store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame style. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also has a copy of the actual data.
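
For reference, you can inspect those defaults directly. A minimal sketch, assuming a caret version in which trainControl() exposes the method and number arguments (the exact defaults may differ between versions):

require(caret)
## trainControl() with no arguments shows what train() falls back on
ctrl <- trainControl()
ctrl$method  # expected to be "boot" (bootstrap resampling)
ctrl$number  # expected to be 25, i.e. 25 refits of the model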

At this point, simply running glm() via train() will be producing 25 copies of the expanded model.frame plus the original data, all of which need to be held in memory during the resampling process - whether these are held concurrently or consecutively is not immediately clear from a quick look at the code, as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.

Once the resampling is finished, the returned object will contain 2 copies of the raw data and a full copy of the model.frame. If your training data is large relative to the available RAM, or contains many factors to be expanded in the model.frame, you could easily be using huge amounts of memory just carrying copies of the data around.

If you add model = FALSE to your train() call, it might make a difference. Here is a small example using the clotting data from ?glm:

clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)

then

> m1 <- train(lot1 ~ log(u), data = clotting, family = Gamma, method = "glm",
+             model = TRUE)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data = clotting, family = Gamma, method = "glm",
+             model = FALSE)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## an ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data = clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data = clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes
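
To see where those copies end up, you can poke at the fitted objects directly. A rough sketch using m1 and m2 from above; $trainingData and $finalModel are components of the object train() returns, and $model is the model frame a "glm" object carries when model = TRUE (component names could differ across caret versions):

object.size(m1$trainingData)      # the copy of the data that train() keeps
object.size(m1$finalModel$model)  # the model frame stored because model = TRUE
object.size(m2$finalModel$model)  # should be essentially nothing with model = FALSE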

So there is a size difference in the returned object, and memory use during training will be lower. How much lower depends on whether the internals of train() keep all the copies of the model.frame in memory during the resampling process.
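
Beyond model = FALSE, another option (an aside, not something tested here) is to ask for fewer resamples via trainControl(), for example 5-fold cross-validation instead of the default 25 bootstrap samples; fewer resamples means fewer temporary model frames, though it also changes how the performance estimates are computed:

require(caret)
ctrl <- trainControl(method = "cv", number = 5)  # 5 resamples instead of 25
m5 <- train(lot1 ~ log(u), data = clotting, family = Gamma, method = "glm",
            trControl = ctrl, model = FALSE)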

The object returned by train() is also larger than that returned by glm() - as mentioned by @DWin in the comments, below.

To take this further, either study the code more closely, or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.


