XGBoost for newbies: how does a single input travel through the trees?

Summary

XGBoost (gbtree) predicts by passing the input through many trees; each tree returns a leaf_value, and those values are summed.
With binary:logistic, what you get by summing the trees yourself is the score/margin; to get a probability, pass it through the sigmoid.
In credit scoring, ML score ≠ decision: the policy/cutoff is what actually steers.

Illustration (open in a browser)

HTML/SVG diagram: /diagrams/xgboost_flow.html

1) What is an input?

In credit scoring, one input is typically one customer at decision time (point-in-time). After feature engineering, you have a vector:

\[ x = [x_0, x_1, \dots, x_{d-1}] \]

In a tree dump you will see conditions such as f27 < 0.1424, which means the model is checking “is feature #27 less than 0.1424?”.
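To make the f27 notation concrete, here is a minimal sketch that parses one split line and looks up the matching entry of the feature vector. The dump line, the helper name `parse_split`, and the 30-feature applicant are made up for illustration, and the sketch assumes default feature names of the form `f<index>`.

```python
import re

# Hypothetical split line, written in the style of a plain-text tree dump.
line = "0:[f27<0.1424] yes=1,no=2,missing=1"

# Captures the feature index and the split threshold inside the brackets.
SPLIT_RE = re.compile(r"\[f(\d+)<([^\]]+)\]")


def parse_split(dump_line):
    """Return (feature_index, threshold) for a split line (illustrative helper)."""
    match = SPLIT_RE.search(dump_line)
    if match is None:
        raise ValueError("not a split line: " + dump_line)
    return int(match.group(1)), float(match.group(2))


feature_index, threshold = parse_split(line)

# Illustrative applicant with 30 features; only x[27] matters for this split.
x = [0.0] * 30
x[27] = 0.09

goes_yes = x[feature_index] < threshold
print(feature_index, threshold, goes_yes)
```

The point is only that "f27" is an index into x; if the model was trained with named columns, the dump would show those names instead.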

2) How does one tree (decision tree) process an input?

A tree is a sequence of questions of the form:

text

if x[j] is missing: go Missing
else if x[j] < split: go Yes
else: go No

You start at the root node and keep branching until you land in a leaf. The leaf returns a real number:

\[ \mathrm{tree}_i(x) \rightarrow \mathrm{leaf\_value} \]

Key point: leaf_value is not 0/1. It is one partial contribution to a running sum.
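The traversal described above can be sketched in a few lines. This is a toy, hand-written tree with hypothetical node ids, thresholds, and leaf values; it is not how XGBoost stores trees internally, only a picture of the yes/no/missing routing.

```python
import math

# Hand-written toy tree (hypothetical values, not from a trained model).
# Internal nodes hold a feature index, a split threshold, and child ids.
# Leaves hold only a leaf_value.
TREE = {
    0: {"feature": 27, "split": 0.1424, "yes": 1, "no": 2, "missing": 1},
    1: {"leaf_value": -0.35},
    2: {"feature": 3, "split": 50.0, "yes": 3, "no": 4, "missing": 4},
    3: {"leaf_value": 0.10},
    4: {"leaf_value": 0.42},
}


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def walk_tree(tree, x):
    """Follow yes/no/missing branches from the root until a leaf is reached."""
    node = tree[0]
    while "leaf_value" not in node:
        value = x[node["feature"]]
        if is_missing(value):
            next_id = node["missing"]
        elif value < node["split"]:
            next_id = node["yes"]
        else:
            next_id = node["no"]
        node = tree[next_id]
    return node["leaf_value"]


x = [0.0] * 30
x[27] = 0.5   # not < 0.1424, so the root sends us to node 2
x[3] = 20.0   # < 50.0, so node 2 sends us to leaf 3
print(walk_tree(TREE, x))
```

Note that the returned number is a signed real value, not a class label, which is exactly why it can be summed with the outputs of other trees.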

Reference (official): the meaning of Split, Yes, No, and Missing when parsing a dump.
https://xgboost.readthedocs.io/en/latest/r_docs/R-package/docs/reference/xgb.model.dt.tree.html

3) XGBoost: many trees summed → score (margin)

Suppose the model has \(n\) trees:

\[ \mathrm{score}(x) = \mathrm{base\_score} + \sum_{i=1}^{n} \mathrm{tree}_i(x) \]

Expressed as code:

score = base_score
for tree_i in trees:
    score += tree_i(x)
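A runnable version of the loop above, as a rough sketch: each toy "tree" is just a function returning a leaf value. The three trees, their thresholds, and the input are invented for illustration, and BASE_SCORE is expressed directly in margin space, which is an assumption of this sketch.

```python
# Each toy "tree" maps an input vector to one leaf value (hypothetical numbers).
def tree_0(x):
    return -0.35 if x[0] < 0.3 else 0.42


def tree_1(x):
    return -0.10 if x[1] < 24.0 else 0.15


def tree_2(x):
    return 0.05 if x[2] < 1.0 else 0.30


TREES = [tree_0, tree_1, tree_2]
BASE_SCORE = 0.0  # intercept, assumed to already be on the margin scale


def score(x, trees=TREES, base_score=BASE_SCORE):
    """Start from base_score and add the leaf value returned by every tree."""
    total = base_score
    for tree in trees:
        total += tree(x)
    return total


x = [0.5, 12.0, 0.0]
s = score(x)
print(round(s, 4))
```

Here the input lands on leaves 0.42, -0.10, and 0.05, so the margin is their sum plus the base score.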

4) Score vs. probability (where newbies often set the threshold on the wrong scale)

With a logistic objective, you usually want:

\[ \mathrm{prob}(x) = \sigma(\mathrm{score}) = \frac{1}{1 + e^{-\mathrm{score}}} \]

So:

Threshold on the probability: prob > 0.5
Equivalent threshold on the margin: score > 0 (because sigmoid(0) = 0.5)
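The equivalence between the two thresholds can be checked numerically. The sketch below uses a handful of made-up margins; it also shows that any other probability cutoff corresponds to a margin cutoff through the logit, the inverse of the sigmoid. The 0.12 value is only an example.

```python
import math


def sigmoid(score):
    """Map a margin (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def logit(prob):
    """Inverse of the sigmoid: map a probability back to a margin."""
    return math.log(prob / (1.0 - prob))


scores = [-2.0, -0.3, 0.0, 0.4, 1.5]  # made-up margins
for s in scores:
    p = sigmoid(s)
    # Thresholding the probability at 0.5 and the margin at 0 give the same answer.
    assert (p > 0.5) == (s > 0.0)
    print(f"score={s:+.1f}  prob={p:.3f}")

# A different probability cutoff has its own equivalent margin cutoff.
prob_cutoff = 0.12
margin_cutoff = logit(prob_cutoff)
print(round(margin_cutoff, 4))
```

The practical consequence: a cutoff is only meaningful together with the scale it was defined on, so comparing a raw margin against 0.5 is a bug.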

Reference (official): output_margin skips the transformation and returns the raw margin.
https://xgboost.readthedocs.io/en/stable/prediction.html

5) What happens if you “drop one tree”?

Because the score is a sum:

\[ \mathrm{score}'(x) = \mathrm{score}(x) - \mathrm{tree}_k(x) \]

dropping tree \(k\) changes the score by exactly that tree's contribution.
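A small numeric check of that identity, assuming we already know which leaf one applicant reached in each tree. The leaf values are hypothetical. The sketch also prints the change in probability, which is not simply the dropped leaf value because the sigmoid is nonlinear.

```python
import math

BASE_SCORE = 0.0  # assumed to be on the margin scale
# Hypothetical leaf values that one applicant landed on, one per tree.
leaf_values = [0.42, -0.10, 0.05, 0.21]


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def score_without(leaf_values, k=None, base_score=BASE_SCORE):
    """Sum all leaf values, skipping tree k when k is given."""
    return base_score + sum(v for i, v in enumerate(leaf_values) if i != k)


full = score_without(leaf_values)
dropped = score_without(leaf_values, k=0)

# On the margin scale the change equals the dropped tree's leaf value.
print(round(full - dropped, 6))
# On the probability scale the change depends on where the score sits.
print(round(sigmoid(full) - sigmoid(dropped), 6))
```

So "one tree's contribution" is a clean notion on the margin scale only; on the probability scale the same tree can matter more or less depending on the rest of the sum.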

To demo this properly, XGBoost offers iteration_range (model slicing), and there is an official example that accumulates the prediction tree by tree.

https://xgboost.readthedocs.io/en/latest/python/examples/individual_trees.html

5b) Practical code: see how the trees contribute

python

import xgboost as xgb
import numpy as np

# Train a simple model for illustration.
# Assumes X_train, y_train, and X_test are already prepared (feature matrices and 0/1 labels).
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    objective='binary:logistic',
    random_state=42
)
model.fit(X_train, y_train)

# --- Approach 1: get the raw margin (log-odds), WITHOUT the sigmoid ---
# output_margin=True returns the score before the sigmoid; this is the quantity that accumulates tree by tree
margins = model.get_booster().predict(xgb.DMatrix(X_test), output_margin=True)

# Convert to a probability by hand
probs_manual = 1 / (1 + np.exp(-margins))
# Compare with predict_proba: the two should agree up to floating-point tolerance
probs_api = model.predict_proba(X_test)[:, 1]
assert np.allclose(probs_manual, probs_api, atol=1e-5), "Manual sigmoid should match predict_proba"

# --- Approach 2: accumulate tree by tree, illustrating "boosting = summation" ---
booster = model.get_booster()

# Use iteration_range to get the margin after the first k trees
margins_10  = booster.predict(xgb.DMatrix(X_test), output_margin=True,
                               iteration_range=(0, 10))
margins_50  = booster.predict(xgb.DMatrix(X_test), output_margin=True,
                               iteration_range=(0, 50))
margins_100 = booster.predict(xgb.DMatrix(X_test), output_margin=True,
                               iteration_range=(0, 100))

# Watch one applicant's score evolve with the number of trees
print(f"After  10 trees: margin={margins_10[0]:.4f}  → prob={1/(1+np.exp(-margins_10[0])):.3f}")
print(f"After  50 trees: margin={margins_50[0]:.4f}  → prob={1/(1+np.exp(-margins_50[0])):.3f}")
print(f"After 100 trees: margin={margins_100[0]:.4f} → prob={1/(1+np.exp(-margins_100[0])):.3f}")
# The score usually settles as trees are added: each new tree corrects what the previous ones got wrong

# --- Approach 3: set the threshold properly for credit ---
# Do NOT default to 0.5; choose the cutoff from the target approval rate.
# Assumes class 1 = bad (default), so probs_api is a PD estimate and the lowest values get approved.
target_approval_rate = 0.70  # want to approve 70% of applicants
cutoff = np.percentile(probs_api, target_approval_rate * 100)
approved = probs_api <= cutoff
print(f"Cutoff for {target_approval_rate:.0%} approval rate: {cutoff:.4f}")
# This cutoff might be something like 0.12 rather than 0.5

6) Credit scoring note: ML score ≠ policy decision

In practice:

The model outputs a prob/score (a PD estimate or a ranking)
The policy (cutoff + rules) decides approve/reject/limit

Don't treat prob = 0.5 as the approve/reject line: the cutoff is usually chosen from a trade-off (approval rate, bad rate, expected loss, exposure…).
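One way to picture the split between model and policy is a small decision function that takes the model output as just one of its inputs. Everything here is hypothetical: the `Applicant` fields, the 0.12 cutoff, the limit cap, and the hard rule are invented to illustrate the separation, not taken from any real policy.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    pd_estimate: float        # model output, read as probability of default
    requested_limit: float
    has_active_default: bool  # hypothetical input to a hard policy rule


PD_CUTOFF = 0.12         # illustrative; set by policy, not by the model
MAX_AUTO_LIMIT = 50_000  # illustrative cap on automatically granted limits


def decide(applicant):
    """Apply hard rules first, then the PD cutoff, then the limit rule."""
    if applicant.has_active_default:
        return "reject"
    if applicant.pd_estimate > PD_CUTOFF:
        return "reject"
    if applicant.requested_limit > MAX_AUTO_LIMIT:
        return "approve_with_reduced_limit"
    return "approve"


print(decide(Applicant(pd_estimate=0.05, requested_limit=20_000, has_active_default=False)))
print(decide(Applicant(pd_estimate=0.30, requested_limit=20_000, has_active_default=False)))
```

Changing PD_CUTOFF or adding a rule changes decisions without retraining anything, which is the sense in which the policy, not the model, steers.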
