How to create a prediction model in Python, R style, using multiple categorical variables

Do you know how to create a prediction model for ensemble methods, specifically classifiers, in R style, like this:

ded.fit(formula="X ~ Y + Z**2", data=fed)

Currently the code looks like this:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10,
                               random_state=1)
model.fit(x_train, y_train)

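You might ask why I need this:

  1. I need it to add more variables than just X and Y; I also need Z, P, Q and R.
  2. I need to see and experiment, as I would in R, whether adding an exponent to a particular variable, or multiplying or dividing it by some value, increases or decreases the prediction accuracy, as in the formulas below: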

    "X ~ Y + Z^2" or "X ~ Y + Z + (P*2) + Q**2"

Any answer would be appreciated. Thanks in advance.

Something like the following should work:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
X = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('XZ'))
y = np.random.randint(2, size=100)  # labels for binary classification
X['Z2'] = X.Z**2  # add more features
print(X.head())  # note the added feature Z^2
# Example output (the data is random, so values differ between runs):
#    X   Z    Z2
#0  88  90  8100
#1  49  63  3969
#2  27  23   529
#3  47  71  5041
#4  21  98  9604
train_samples = 80  # samples used for training the model
X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Confusion matrix via sklearn.metrics (pandas_ml.ConfusionMatrix offers a similar
# table, but may not import with recent scikit-learn versions)
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows: actual class, columns: predicted class
# [[3 4]
#  [4 9]]
ConfusionMatrixDisplay(cm).plot()
plt.show()
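
To check whether a derived feature actually helps, as the question asks, one option is to score several feature sets with cross-validation and compare the mean accuracy. The following is only a sketch under assumptions: the data is synthetic, the column names Y, Z and P are placeholders, and `candidates` and `mean_cv_accuracy` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Synthetic stand-in data; replace with your own DataFrame and labels
data = pd.DataFrame(rng.randint(0, 100, size=(200, 3)), columns=list("YZP"))
labels = rng.randint(2, size=200)

# Each entry plays the role of one R formula: a label plus the columns it implies
candidates = {
    "Y + Z": data[["Y", "Z"]],
    "Y + Z^2": data[["Y"]].assign(Z2=data.Z ** 2),
    "Y + Z + Y*P": data[["Y", "Z"]].assign(YP=data.Y * data.P),
}

def mean_cv_accuracy(features, target, folds=5):
    """Return the mean accuracy over `folds` cross-validation splits."""
    model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10,
                                   random_state=1)
    return cross_val_score(model, features, target,
                           cv=folds, scoring="accuracy").mean()

scores = {name: mean_cv_accuracy(feats, labels)
          for name, feats in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

One caveat on the design: a random forest splits on thresholds, so a monotonic transform of a single column (squaring non-negative values, multiplying by a constant) generally leaves its predictions unchanged. Features that combine columns, such as products or ratios, are more likely to make a difference.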

I would try it like this, using a hypothetical pandas df with three columns holding your categorical variables and one column holding your target {cat1, cat2, cat3, target}:

import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Note: sklearn expects numeric inputs, so string-valued categories must be encoded first
predictors = df[["cat1", "cat2", "cat3"]]
target = df["target"]

# Let sklearn do your training/testing split
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=.30)

# Create model with pre-pruning; play with the parameters, consulting the documentation
numtrees = 50
classifier = RandomForestClassifier(n_estimators=numtrees, min_samples_leaf=10,
                                    max_leaf_nodes=25)
model = classifier.fit(pred_train, tar_train)
predictions = model.predict(pred_test)

# To test the results
print('\n********* confusion matrix **********\n')
print("TRUE NEG   FALSE POS")
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print("FALSE NEG   TRUE POS")

print('\n============ Accuracy =============')
print(sklearn.metrics.accuracy_score(tar_test, predictions))

Keep in mind that I am not an experienced programmer, but the code above worked for me.
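
For the R-style formula itself, one possibility is the patsy library, which turns a formula string into design matrices that scikit-learn can consume. This is a sketch, not a definitive recipe: it assumes patsy is installed, and the data frame, its column names (cat1, cat2, Z, target) and the formula are made up for illustration. In patsy, `C(...)` marks a column as categorical (it is then dummy-coded), and `I(...)` protects ordinary arithmetic, because `*` and `**` have special formula meanings.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 200
# Hypothetical data: two string-valued categorical columns, one numeric column
df = pd.DataFrame({
    "cat1": rng.choice(["a", "b", "c"], size=n),
    "cat2": rng.choice(["low", "high"], size=n),
    "Z": rng.randint(0, 100, size=n),
    "target": rng.randint(2, size=n),
})

# Left of ~ is the label, right of ~ are the predictors
formula = "target ~ C(cat1) + C(cat2) + I(Z**2)"
y, X = dmatrices(formula, data=df, return_type="dataframe")

# X holds an intercept column, the dummy-coded categories and the squared term
print(X.columns.tolist())

model = RandomForestClassifier(n_estimators=50, min_samples_leaf=10,
                               random_state=1)
model.fit(X, y.values.ravel())  # flatten the one-column label frame
predictions = model.predict(X)
```

Changing the model then only means editing the formula string, which is close to the R workflow described in the question. The constant intercept column that patsy adds carries no information for a tree-based model and can be left in or dropped.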