Author: Huang Tianyuan (黄天元), PhD candidate at Fudan University, passionate about data science and open-source tools (R), committed to using data science to rapidly build up domain expertise and scientific insight. Interests include, but are not limited to, informetrics, machine learning, data visualization, applied statistical modelling, and knowledge graphs. Author of the book 《R语言数据高效处理指南》 (A Guide to Efficient Data Processing in R). Zhihu column: R语言数据挖掘 (Data Mining with R). Email: huang.tian-yuan@qq.com. Collaboration and exchange are welcome.
Recap:
In the mlr package, once you have defined a task (classification, regression, etc.) and a model (a learner), training takes nothing more than a single call to the train function. As simple as that:
library(mlr)
# Generate the task
task = makeClassifTask(data = iris, target = "Species")
# Generate the learner
lrn = makeLearner("classif.lda")
# Train the learner
mod = train(lrn, task)
mod
## Model for learner.id=classif.lda; learner.class=classif.lda
## Trained on: task.id = iris; obs = 150; features = 4
## Hyperparameters:
Above, we first defined the classification task task on R's built-in iris data set, then chose the LDA (linear discriminant analysis) learner, called the train function to fit it, and stored the trained model in mod. If you only want the learner's default settings, you can skip makeLearner and pass the learner's name to train directly, e.g.:
mod = train("classif.lda", task)
mod
## Model for learner.id=classif.lda; learner.class=classif.lda
## Trained on: task.id = iris; obs = 150; features = 4
## Hyperparameters:
The trained model is itself just another object. The names function shows what it contains, and each component can be accessed directly with $. As an example, let's train an unsupervised clustering model:
# The ruspini data set lives in the cluster package
data(ruspini, package = "cluster")
# Generate the task
ruspini.task = makeClusterTask(data = ruspini)
# Generate the learner
lrn = makeLearner("cluster.kmeans", centers = 4)
# Train the learner
mod = train(lrn, ruspini.task)
mod
## Model for learner.id=cluster.kmeans; learner.class=cluster.kmeans
## Trained on: task.id = ruspini; obs = 75; features = 2
## Hyperparameters: centers=4
# Peek into mod
names(mod)
## [1] "learner" "learner.model" "task.desc" "subset"
## [5] "features" "factor.levels" "time" "dump"
mod$learner
## Learner cluster.kmeans from package stats,clue
## Type: cluster
## Name: K-Means; Short name: kmeans
## Class: cluster.kmeans
## Properties: numerics,prob
## Predict-Type: response
## Hyperparameters: centers=4
mod$features
## [1] "x" "y"
# Extract the fitted model
getLearnerModel(mod)
## K-means clustering with 4 clusters of sizes 23, 17, 20, 15
##
## Cluster means:
## x y
## 1 43.91304 146.0435
## 2 98.17647 114.8824
## 3 20.15000 64.9500
## 4 68.93333 19.4000
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##
## Within cluster sum of squares by cluster:
## [1] 3176.783 4558.235 3689.500 1456.533
## (between_SS / total_SS = 94.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
During training you can also pass the subset argument to train on only part of the data, keeping the remaining rows for validation. subset accepts either an integer vector (the row indices to train on) or a logical vector (convenient for condition-based filtering). Similarly, the weights argument assigns observation weights, which can be used, for example, to up-weight the minority class and correct for class imbalance. These details will be covered in later parts of the series.
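As a preview, here is a minimal sketch of both arguments on the iris task defined earlier; the 2/3 split and the inverse-class-size weights are illustrative choices, and the weights example switches to classif.rpart because the underlying learner has to support observation weights:
# subset: train on a random 2/3 of the rows, keeping the rest for validation
n = getTaskSize(task)
train.set = sample(n, size = round(n * 2/3))  # integer vector of row indices
mod.sub = train("classif.lda", task, subset = train.set)
# weights: weight each observation inversely to the size of its class (illustrative)
target = getTaskTargets(task)
w = as.numeric(1 / table(target)[as.character(target)])
mod.w = train("classif.rpart", task, weights = w)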