H2O Framework를 활용한 AutoML 사용법 | Yoon's Lab Note

H2O Framework를 활용한 AutoML 사용법

Preview


이번 포스팅에서는 오픈소스 프레임워크 ‘h2o’을 정리해보는 시간을 갖겠습니다. h2o를 활용하면 누구나 쉽게 Grid Search를 통한 모델 비교 및 최적의 모델을 찾아 적용할 수 있습니다.


Load Libraries

1
2
3
4
5
6
7
8
9
list.of.packages <- c("DMwR", "Rcpp", "caTools", "tidyverse", "h2o", "dummies")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(DMwR)
library(caTools)
library(tidyverse)
library(h2o)
library(dummies)

Init h2o

h2o의 기본 호스트는 localhost:54321입니다. h2o.init의 옵션설정을 통해 변경 할 수 있습니다.

  • Source Code

1
2
3
4
h2o.init(ip = "localhost", port = 54321,
         startH2O = TRUE, nthreads = -1,
         max_mem_size = '10g', min_mem_size = '8g')
h2o.removeAll()
  • Result

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

Starting H2O JVM and connecting: . Connection successful!

  R is connected to the H2O cluster:
  H2O cluster uptime:         2 seconds 514 milliseconds
  H2O cluster timezone:       Asia/Seoul
  H2O data parsing timezone:  UTC
  H2O cluster version:        3.20.0.6
  H2O cluster version age:    3 months and 18 days !!!
  H2O cluster name:           H2O_started_from_R_penta_qcg948
  H2O cluster total nodes:    1
  H2O cluster total memory:   56.89 GB
  H2O cluster total cores:    12
  H2O cluster allowed cores:  12
  H2O cluster healthy:        TRUE
  H2O Connection ip:          localhost
  H2O Connection port:        54321
  H2O Connection proxy:       NA
  H2O Internal Security:      FALSE
  H2O API Extensions:         Algos, AutoML, Core V3, Core V4
  R Version:                  R version 3.5.1 (2018-07-02)

Pre-processing

1
2
3
4
5
6
7
8
9
10
df <- read.csv('df_fin_2nd_draft_190214.csv', colClasses = "character", na.strings = c('', ' ', 'NA', '#N/A'), header=T)

df <- df2

target_colname <- "CT실시여부"

target_col <- df[names(df) == target_colname]
df <- df[names(df) != target_colname]
df <- cbind(target_col, df)
glimpse(df, width=80)

Get Set Colnames

PRO TIP !
h2o에서는 빠른 연산을 위해 data frame을 hexbin형식으로 변환하여 연산합니다.변환하는 과정에서 한글 column이 들어가면 에러가 나고 멈추는 현상이 나타납니다. 그렇기 때문에 원본 colname을 Column Number로 변환합니다.

1
2
3
4
5
6
7
8
9
10
11
12
origin_colname <- names(df)
names(df) <- c('target', paste0('C', 2:length(df)))
names(df)
new_colname <- names(df)

colname_df <- cbind(origin_colname, new_colname)
colname_df

write.csv(colname_df, 'colname.csv')

# new_colname 적용
names(df) <- new_colname

Split Data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
idx <- sample(x = c("train", "valid", "test"),
              size = nrow(df),
              replace = TRUE,
              prob = c(6, 2, 2))


train <- df[idx == "train",]
valid <- df[idx == "valid",]
test <- df[idx == "test",]


# train/valid/test dimension 확인
dim(train)
dim(valid)
dim(test)

Run AutoML

  • Source Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# dataframe h2o 형식으로 변환
train <- as.h2o(train)
valid <- as.h2o(valid)
test <- as.h2o(test)

# h2o engine에 .hex라는 이름으로 저장
train <- h2o.assign(train, "train.hex")
valid <- h2o.assign(valid, "valid.hex")
test <- h2o.assign(test, "test.hex")

# target과 response variable 이름 지정
target <- "target"
response <- setdiff(names(train), target)

# classification을 위해 factor형으로 다시 한번 변환
train[,target] <- as.factor(train[,target])
valid[,target] <- as.factor(valid[,target])
test[,target] <- as.factor(test[,target])

# modeling
aml <- h2o.automl(x = response,
                  y = target,
                  training_frame  = train,
                  validation_frame = valid,
                  leaderboard_frame = test,
                  exclude_algos = "DeepLearning",
                  max_runtime_secs = 1000)
  • Result

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# leader board 확인
lb <- aml@leaderboard
lb

# leader board에서 가장 높은 auccuracy를 기록한 model을 aml_leader로 지정
aml_leader <- aml@leader
aml_leader

# performance 확인 (newdata안에 모델 적용하고 싶은 데이터 지정)
perf <- h2o.performance(aml_leader, newdata = test)
perf

# train confusion matrix 확인
train_perf <- h2o.performance(aml_leader, newdata = train)
h2o.confusionMatrix(train_perf)

# valid confusion matrix 확인
valid_perf <- h2o.performance(aml_leader, newdata = valid)
h2o.confusionMatrix(valid_perf)

Predict & Save Model

1
2
3
4
5
pred <- h2o.predict(aml_leader, newdata = test)

h2o.saveModel(aml_leader, 'bestmodel')

predict.reg <- as.data.frame(h2o.predict(aml_leader, test))

Close H2O

1
h2o.shutdown(prompt=FALSE)

Session Info

  • Source Code

1
sessionInfo()
  • Result

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
  [1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949    LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C                 LC_TIME=Korean_Korea.949

attached base packages:
  [1] grid      stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] dummies_1.5.6    dplyr_0.7.6      caTools_1.17.1.1 DMwR_0.4.1       lattice_0.20-35  h2o_3.20.0.6

loaded via a namespace (and not attached):
  [1] magic_1.5-8        ddalpha_1.3.4      tidyr_0.8.1        sfsmisc_1.1-2      jsonlite_1.5       splines_3.5.1      foreach_1.4.4      gtools_3.8.1
  [9] prodlim_2018.04.18 assertthat_0.2.0   TTR_0.23-3         stats4_3.5.1       DRR_0.0.3          robustbase_0.93-2  ipred_0.9-7        pillar_1.3.0
  [17] backports_1.1.2    glue_1.3.0         colorspace_1.3-2   recipes_0.1.3      Matrix_1.2-14      plyr_1.8.4         timeDate_3043.102  pkgconfig_2.0.2
  [25] CVST_0.2-2         broom_0.5.0        caret_6.0-80       purrr_0.2.5        scales_1.0.0       gdata_2.18.0       gower_0.1.2        lava_1.6.3
  [33] tibble_1.4.2       ggplot2_3.0.0      withr_2.1.2        ROCR_1.0-7         nnet_7.3-12        lazyeval_0.2.1     quantmod_0.4-13    survival_2.42-3
  [41] magrittr_1.5       crayon_1.3.4       gplots_3.0.1       nlme_3.1-137       MASS_7.3-50        xts_0.11-0         dimRed_0.1.0       class_7.3-14
  [49] tools_3.5.1        data.table_1.11.4  stringr_1.3.1      kernlab_0.9-27     munsell_0.5.0      bindrcpp_0.2.2     pls_2.7-0          compiler_3.5.1
  [57] RcppRoll_0.3.0     rlang_0.2.2        RCurl_1.95-4.11    iterators_1.0.10   rstudioapi_0.7     bitops_1.0-6       geometry_0.3-6     gtable_0.2.0
  [65] ModelMetrics_1.2.0 codetools_0.2-15   abind_1.4-5        curl_3.2           reshape2_1.4.3     R6_2.2.2           zoo_1.8-3          lubridate_1.7.4
  [73] bindr_0.1.1        KernSmooth_2.23-15 stringi_1.1.7      Rcpp_0.12.18       rpart_4.1-13       DEoptimR_1.0-8     tidyselect_0.2.4
0%