In the previous article we got a rough picture of where Gradient Boosting comes from and its main mathematical ideas. In this article we will walk through how GBDT is implemented, using sklearn's Gradient Boosting as our reference (the source code is here). I hope you get something out of it.
There is a lot of code ahead, so please be patient; let's work through it together.
GBDT stands for Gradient Boosting Decision Trees. As the name suggests, it uses regression trees as the base learners inside the gradient boosting framework (explained in detail in the previous article, link here). Classification trees cannot simply be added together: what would male + female mean? It follows that parameter tuning has two sides (a quick illustration follows this list):
1. Parameters of the gradient boosting framework itself;
2. Parameters of the individual decision trees.
You can refer to the sklearn User Guide for details.
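To make the two groups concrete, here is a minimal sketch using sklearn's public GradientBoostingRegressor API; the parameter names are real, but the values are purely illustrative, not tuning advice.

from sklearn.ensemble import GradientBoostingRegressor

# boosting-framework parameters vs. per-tree parameters (illustrative values only)
gbr = GradientBoostingRegressor(
    # gradient boosting framework
    n_estimators=100,     # number of boosting stages (trees)
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    subsample=0.8,        # < 1.0 enables stochastic gradient boosting + OOB estimates
    loss='ls',            # squared loss (LeastSquaresError, discussed below)
    # individual regression trees
    max_depth=3,
    min_samples_leaf=5,
    max_features=None,
)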
Why use decision trees as the base learners?
The strength of decision trees is interpretability: they quickly produce rules we can understand, and the cost of building a single tree is relatively modest (roughly O(m·n·log n) for m features and n samples).
But decision trees have a serious weakness: high variance and instability. Because they are grown with greedy heuristics, a small change in an early split can lead to a completely different final tree. This makes them very demanding of the features, yet real-world features are often noisy.
That is exactly what makes them typical weak learners.
When the base learner is a regression tree with $J$ leaf nodes, the regression tree model can be written as

$h(x) = \sum_{j=1}^{J} b_j \, \mathbf{1}(x \in R_j)$

where the $R_j$ are disjoint regions whose union covers the input space, and $b_j$ is the value of leaf $j$.
With regression trees as the base model, the final output of the algorithm is therefore

$F_M(x) = F_0(x) + \sum_{m=1}^{M} \nu \sum_{j=1}^{J_m} b_{jm} \, \mathbf{1}(x \in R_{jm})$

where $\nu$ is the learning rate (shrinkage) and $F_0$ is the initial estimator.
Intuitively it is just this: each stage's prediction gets added on top of the previous ones. Put plainly, the final prediction is the initial guess plus the sum of every stage's prediction. Not hard to grasp, right?
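To make the "predictions just add up" point concrete, here is a minimal sketch of what the final model computes; init_value, trees and nu are hypothetical names standing in for the initial estimate, the fitted regression trees and the learning rate.

import numpy as np

def boosted_predict(X, init_value, trees, nu):
    """Sum the initial guess and the shrunken prediction of every regression tree."""
    score = np.full(X.shape[0], init_value, dtype=np.float64)  # F_0(x)
    for tree in trees:                                          # F_m = F_{m-1} + nu * h_m
        score += nu * tree.predict(X)
    return score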
Now that we know the basic construction, let's look at how GBDT is implemented in code.
# (imports from the context of sklearn's gradient_boosting.py)
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted

class MeanEstimator(BaseEstimator):
    """An estimator predicting the mean of the training targets."""

    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            self.mean = np.mean(y)
        else:
            self.mean = np.average(y, weights=sample_weight)

    def predict(self, X):
        check_is_fitted(self, 'mean')
        y = np.empty((X.shape[0], 1), dtype=np.float64)
        y.fill(self.mean)
        return y
If there are no sample weights, fit simply takes the plain mean of y; if sample weights are given, it takes the weighted mean.
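A quick usage sketch, assuming the MeanEstimator class above (with its imports) is in scope; the numbers are made up for illustration.

import numpy as np

X = np.zeros((4, 2))                      # features are ignored by MeanEstimator
y = np.array([1.0, 2.0, 3.0, 6.0])
w = np.array([1.0, 1.0, 1.0, 3.0])

est = MeanEstimator()
est.fit(X, y)                             # self.mean = 3.0  (simple mean)
est.fit(X, y, sample_weight=w)            # self.mean = 4.0  (weighted mean)
print(est.predict(X).ravel())             # [4. 4. 4. 4.]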
Besides the mean, sklearn's module provides other initial estimators (the objects returned by init_estimator), such as QuantileEstimator, LogOddsEstimator / ScaledLogOddsEstimator, PriorProbabilityEstimator and ZeroEstimator.
class LeastSquaresError(RegressionLossFunction):

    def init_estimator(self):
        return MeanEstimator()

    def __call__(self, y, pred, sample_weight=None):
        if sample_weight is None:
            return np.mean((y - pred.ravel()) ** 2.0)
        else:
            return (1.0 / sample_weight.sum() *
                    np.sum(sample_weight * ((y - pred.ravel()) ** 2.0)))

    def negative_gradient(self, y, pred, **kargs):
        return y - pred.ravel()

    def update_terminal_regions(self, tree, X, y, residual, y_pred,
                                sample_weight, sample_mask,
                                learning_rate=1.0, k=0):
        """Least squares does not need to update terminal regions.

        But it has to update the predictions.
        """
        # update predictions
        y_pred[:, k] += learning_rate * tree.predict(X).ravel()

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        pass
We can see that a loss-function class exposes a few key methods: init_estimator (returns the estimator used for the initial prediction F_0), __call__ (computes the loss itself), negative_gradient (computes the pseudo-residuals the next tree will be fit on), and update_terminal_regions / _update_terminal_region (recompute the leaf values and add the new tree's contribution to the running predictions).
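A tiny numeric sketch of the first few methods for the squared loss, assuming the sklearn-internal LeastSquaresError shown above is importable in your version (it takes the number of outputs, 1, as its constructor argument).

import numpy as np

y = np.array([3.0, -1.0, 2.0])
pred = np.array([[2.5], [0.0], [2.0]])     # current model output F_{m-1}(x)

loss = LeastSquaresError(1)
print(loss(y, pred))                       # mean squared error: (0.5**2 + 1**2 + 0) / 3
print(loss.negative_gradient(y, pred))     # plain residuals: [ 0.5 -1.   0. ]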
Besides the squared loss, the regressor also supports other loss functions: least absolute deviation (LeastAbsoluteError, loss='lad'), Huber loss (HuberLossFunction, loss='huber') and quantile loss (QuantileLossFunction, loss='quantile').
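For reference, these losses are selected through the public API's loss parameter; the option names below are sklearn's, and the alpha values are only illustrative.

from sklearn.ensemble import GradientBoostingRegressor

gbr_huber = GradientBoostingRegressor(loss='huber', alpha=0.9)        # robust to outliers
gbr_quantile = GradientBoostingRegressor(loss='quantile', alpha=0.5)  # predicts a quantile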
class ExponentialLoss(ClassificationLossFunction):
    """Exponential loss function for binary classification.

    Same loss as AdaBoost.

    References
    ----------
    Greg Ridgeway, Generalized Boosted Models: A guide to the gbm package, 2007
    """
    def __init__(self, n_classes):
        if n_classes != 2:
            raise ValueError("{0:s} requires 2 classes.".format(
                self.__class__.__name__))
        # we only need to fit one tree for binary clf.
        super(ExponentialLoss, self).__init__(1)

    def init_estimator(self):
        return ScaledLogOddsEstimator()

    def __call__(self, y, pred, sample_weight=None):
        pred = pred.ravel()
        if sample_weight is None:
            return np.mean(np.exp(-(2. * y - 1.) * pred))
        else:
            return (1.0 / sample_weight.sum() *
                    np.sum(sample_weight * np.exp(-(2 * y - 1) * pred)))

    def negative_gradient(self, y, pred, **kargs):
        y_ = -(2. * y - 1.)
        return y_ * np.exp(y_ * pred.ravel())

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        terminal_region = np.where(terminal_regions == leaf)[0]
        pred = pred.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        sample_weight = sample_weight.take(terminal_region, axis=0)

        y_ = 2. * y - 1.

        numerator = np.sum(y_ * sample_weight * np.exp(-y_ * pred))
        denominator = np.sum(sample_weight * np.exp(-y_ * pred))

        # prevents overflow and division by zero
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator

    def _score_to_proba(self, score):
        proba = np.ones((score.shape[0], 2), dtype=np.float64)
        proba[:, 1] = expit(2.0 * score.ravel())
        proba[:, 0] -= proba[:, 1]
        return proba

    def _score_to_decision(self, score):
        return (score.ravel() >= 0.0).astype(np.int)
Unlike the regressor's loss functions, a classifier's loss-function class has two extra methods: _score_to_proba and _score_to_decision.
P.S. These two methods exist because, as explained above, every base model in GBDT is a regression tree, so its continuous output has to be turned into probabilities and discrete class labels.
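A tiny numeric sketch of what those two methods compute for the exponential loss, using scipy's expit as in the source; the score values are made up and the probabilities are rounded.

import numpy as np
from scipy.special import expit

score = np.array([-1.2, 0.3, 2.0])         # raw regression-tree output, one column
proba_pos = expit(2.0 * score)             # _score_to_proba: P(y = 1)
decision = (score >= 0.0).astype(int)      # _score_to_decision: class index

print(proba_pos)   # roughly [0.083 0.646 0.982]
print(decision)    # [0 1 1]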
Besides the exponential loss, the classifier also has the deviance (log-loss) losses, BinomialDeviance and MultinomialDeviance, selected with loss='deviance' (the default).
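In the public API the choice looks like this; 'deviance' and 'exponential' are sklearn's option names, and no other parameters are shown.

from sklearn.ensemble import GradientBoostingClassifier

clf_dev = GradientBoostingClassifier(loss='deviance')      # (multinomial) log-loss, the default
clf_exp = GradientBoostingClassifier(loss='exponential')   # AdaBoost-style, binary only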
class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble)):
    """Abstract base class for Gradient Boosting."""

    @abstractmethod
    def __init__(...):
        ...

    ...

    def _fit_stages(self, X, y, y_pred, sample_weight, random_state,
                    begin_at_stage=0, monitor=None, X_idx_sorted=None):
        """Iteratively fits the stages.

        For each stage it computes the progress (OOB, train score)
        and delegates to ``_fit_stage``.
        Returns the number of stages fit; might differ from ``n_estimators``
        due to early stopping.
        """
1->     n_samples = X.shape[0]
        do_oob = self.subsample < 1.0
        sample_mask = np.ones((n_samples,), dtype=np.bool)
        n_inbag = max(1, int(self.subsample * n_samples))
        loss_ = self.loss_

        # Set min_weight_leaf from min_weight_fraction_leaf
        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
            min_weight_leaf = (self.min_weight_fraction_leaf *
                               np.sum(sample_weight))
        else:
            min_weight_leaf = 0.

        if self.verbose:
            verbose_reporter = VerboseReporter(self.verbose)
            verbose_reporter.init(self, begin_at_stage)

        X_csc = csc_matrix(X) if issparse(X) else None
        X_csr = csr_matrix(X) if issparse(X) else None

        # perform boosting iterations
        i = begin_at_stage
2->     for i in range(begin_at_stage, self.n_estimators):

            # subsampling
            if do_oob:
3->             sample_mask = _random_sample_mask(n_samples, n_inbag,
                                                  random_state)
                # OOB score before adding this stage
4->             old_oob_score = loss_(y[~sample_mask],
                                      y_pred[~sample_mask],
                                      sample_weight[~sample_mask])

            # fit next stage of trees
5->         y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                                     sample_mask, random_state, X_idx_sorted,
                                     X_csc, X_csr)

            # track deviance (= loss)
6->         if do_oob:
                self.train_score_[i] = loss_(y[sample_mask],  # update train_score_
                                             y_pred[sample_mask],
                                             sample_weight[sample_mask])
                self.oob_improvement_[i] = (
                    old_oob_score - loss_(y[~sample_mask],
                                          y_pred[~sample_mask],
                                          sample_weight[~sample_mask]))
7->         else:
                # no need to fancy index w/ no subsampling
                self.train_score_[i] = loss_(y, y_pred, sample_weight)

            # update verbose_reporter
8->         if self.verbose > 0:
                verbose_reporter.update(i, self)

            # update monitor
9->         if monitor is not None:
                early_stopping = monitor(i, self, locals())
                if early_stopping:
                    break
        return i + 1
Following the numbered markers in the code, let's walk through it step by step:
1. Set up the bookkeeping: the sample count, whether subsampling (and hence OOB scoring) is enabled, the initial all-ones sample mask, the in-bag size and the loss object.
2. Loop over the boosting stages, from begin_at_stage up to n_estimators.
3. If subsample < 1.0, draw a fresh random in-bag mask for this stage.
4. Evaluate the loss on the out-of-bag samples before this stage is added.
5. Fit one stage of trees via _fit_stage and get the updated predictions.
6. With subsampling, record the in-bag training score and the OOB improvement for this stage.
7. Without subsampling, record the training score on the full data.
8. Let the verbose reporter print progress if requested.
9. Call the monitor callback, which can trigger early stopping.
Finally, the method returns the number of stages actually fitted.
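To illustrate steps 3 and 4 (subsampling and the OOB score), here is a hypothetical pure-Python stand-in for the Cython helper _random_sample_mask; the function name and implementation are mine, only the idea comes from the source.

import numpy as np

def random_sample_mask(n_samples, n_inbag, random_state):
    """Return a boolean mask with exactly n_inbag True (in-bag) entries."""
    mask = np.zeros(n_samples, dtype=bool)
    idx = random_state.choice(n_samples, size=n_inbag, replace=False)
    mask[idx] = True
    return mask

rng = np.random.RandomState(0)
mask = random_sample_mask(10, 8, rng)   # subsample=0.8 on 10 samples
# the stage is trained on X[mask]; the loss on X[~mask] gives the OOB estimate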
Curious how step 5, _fit_stage, is implemented? Let's look at it next.
    def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
                   random_state, X_idx_sorted, X_csc=None, X_csr=None):
        """Fit another stage of ``n_classes_`` trees to the boosting model."""

        assert sample_mask.dtype == np.bool  # raises an AssertionError if False

        loss = self.loss_
        original_y = y

        for k in range(loss.K):
            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

1->         residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
2->         tree = DecisionTreeRegressor(
                criterion=self.criterion,
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                min_impurity_split=self.min_impurity_split,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)

3->         if self.subsample < 1.0:
                # no inplace multiplication!
                sample_weight = sample_weight * sample_mask.astype(np.float64)

4->         if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)

            # update tree leaves
5->         if X_csr is not None:
                loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)
            else:
                loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)

6->         # add tree to ensemble
            self.estimators_[i, k] = tree

7->     return y_pred
I'll assume you already know how a decision tree is grown and used for prediction, so this article won't cover that. If you want to see basic decision-tree code, have a look at this link (the code comes from Machine Learning in Action; I added some comments).
The function _fit_stage implements one round of GBDT training. It consists of seven parts, matching the markers:
1. Compute the negative gradient (the pseudo-residuals) of the loss at the current predictions.
2. Create a new DecisionTreeRegressor with the tree-level hyperparameters.
3. If subsampling is on, zero out the weights of the out-of-bag samples (without modifying sample_weight in place).
4. Fit the tree on the residuals (using the CSC representation when the input is sparse).
5. Let the loss function recompute the leaf values and update the running predictions y_pred.
6. Store the fitted tree in estimators_[i, k].
7. Return the updated predictions.
Note the for loop over k: it exists for classification (binomial and multinomial). For multi-class problems the model rebuilds y for each class label k as a one-vs-rest indicator vector (samples of class k become 1, everything else 0) and trains one tree per class in every round, as sketched below.
For regression, n_classes_ (loss.K) is 1, so the loop body runs only once per stage (binary classification also fits a single tree per stage, as we saw in ExponentialLoss.__init__).
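A small sketch of that relabelling for a made-up 3-class problem, mirroring the line `y = np.array(original_y == k, dtype=np.float64)` inside the loop.

import numpy as np

original_y = np.array([0, 2, 1, 2, 0])

for k in range(3):                                   # loss.K == n_classes_
    y_k = np.array(original_y == k, dtype=np.float64)
    print(k, y_k)
# 0 [1. 0. 0. 0. 1.]
# 1 [0. 0. 1. 0. 0.]
# 2 [0. 1. 0. 1. 0.]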
Not too complicated, right? The details worth paying attention to in the whole process are the subsampling and how each tree is trained.
Of course there are still more details, which I'll leave for you to dig into on your own; understanding them will pay off.
Prediction works differently for the classifier and the regressor, but the BaseGradientBoosting class contains the functions they both rely on.
1-> def _init_decision_function(self, X):
        self._check_initialized()
        X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
                self.n_features_, X.shape[1]))
        score = self.init_.predict(X).astype(np.float64)
        return score

2-> def _decision_function(self, X):
        score = self._init_decision_function(X)
        predict_stages(self.estimators_, X, self.learning_rate, score)
        return score
Note: predict_stages() lives in _gradient_boosting.pyx and is written in Cython. It simply loops over every stage's base models and accumulates their learning-rate-scaled predictions into score. As mentioned before, all the fitted base models are stored in estimators_. We won't go deeper into the Cython code here; a rough Python equivalent is sketched below.
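Conceptually it does something like the following; predict_stages_py is a hypothetical pure-Python sketch of mine, and the real Cython version additionally handles sparse input and other details.

def predict_stages_py(estimators, X, learning_rate, score):
    """Accumulate every fitted tree's shrunken prediction into `score` in place."""
    n_stages, n_classes = estimators.shape
    for i in range(n_stages):
        for k in range(n_classes):
            score[:, k] += learning_rate * estimators[i, k].predict(X)
    return score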
Coming back to the two functions above: _init_decision_function validates the input and returns the initial estimator's prediction F_0(x) as the starting score, and _decision_function takes that score and lets predict_stages add every tree's contribution on top of it.
Next, let's look at prediction in the classifier and in the regressor separately.
The classifier:
class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
    ...
    ...
    def decision_function(self, X):
        """Compute the decision function of ``X``."""
        X = check_array(X, dtype=DTYPE, order="C", accept_sparse='csr')
        score = self._decision_function(X)
        if score.shape[1] == 1:
            return score.ravel()
        return score

    def predict(self, X):
        """Predict class for X."""
        score = self.decision_function(X)
        decisions = self.loss_._score_to_decision(score)
        return self.classes_.take(decisions, axis=0)
The regressor:
class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
    ...
    ...
    def predict(self, X):
        """Predict regression target for X."""
        X = check_array(X, dtype=DTYPE, order="C", accept_sparse='csr')
        return self._decision_function(X).ravel()
The classifier's prediction has one extra step, decision_function. As explained above, during training the classifier keeps one column of scores per class, so this step just puts the raw predicted scores into the right shape for classification (a single column gets flattened).
Of course, the classifier's final answer still has to be discretized; the function it calls is the loss function's _score_to_decision() method we saw earlier, and predict then maps those class indices back onto the original labels via classes_.take.
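Putting it all together, a minimal end-to-end usage sketch on toy data; the dataset and the parameter values are made up purely for illustration.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.RandomState(0).rand(100, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=2)
clf.fit(X, y)
print(clf.predict(X[:5]))          # discrete labels, via _score_to_decision
print(clf.predict_proba(X[:5]))    # probabilities, via _score_to_proba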
And that's about it for our walk through the GBDT source code. For more details, go and read the source itself in depth.
If anything in this article is off, please point it out. My email address is at the bottom of the page; thank you for your feedback!
The next article will cover the other branch of ensemble learning: bagging. See you then.
E-mail: liangyaorong1995@outlook.com