Returning a trained scikit-learn (random forest) model from a function?
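I am training a random forest model and find that returning the trained model object from a function consistently results in different .predict behavior. Is this intended or not?
I believe this is fully reproducible code. The input data is just 1000 rows by 6 columns of floats: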
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd

def as_a_function():
    df = pd.read_csv(...)  # read file (path omitted)
    lcscols = ...  # just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    selcol = ...  # y 'real' data
    # testsize is not defined in this snippet; note that frac is the fraction of rows kept for TRAINING
    train_df = df.sample(frac=testsize, random_state=42)
    test_df = df.drop(train_df.index)  # test/train split
    rfmodel = RF_model(train_df, test_df, selcol, lcscols)  # RF_model returns only the fitted model
    tempdf = df.copy(deep=True)  # new copy, not totally necessary but helpful in edge cases
    tempdf.dropna(inplace=True)
    selcolname = selcol + '_cal'
    # new df just made from a .predict call
    # note that input order of columns matters, needs to be identical to training order??
    mid_cal = pd.DataFrame(data=rfmodel.predict(tempdf[lcscols]), index=tempdf.index, columns=[selcolname])
    return mid_cal

def RF_model(train_df, test_df, ycol, xcols):
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    rfmodel = rf.fit(train_df[xcols], train_df[ycol])  # fit() returns the fitted estimator itself
    y_pred_test = rfmodel.predict(test_df[xcols])
    # missing code to test predicted values of testing set
    return rfmodel

#################################

def inline():
    df = pd.read_csv(...)  # read file (path omitted)
    lcscols = ...  # just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    refcol = ...  # 'true' data
    X = df[lcscols].values
    y = df[[refcol]].values
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    ramp = rf.fit(x_train, y_train.flatten())
    y_pred_test = ramp.predict(x_test)
    # missing code to check prediction on test values
    tempdf = df.copy(deep=True)[lcscols]
    tempdf.dropna(axis=1, how='all', inplace=True)
    tempdf.dropna(axis=0, inplace=True)
    df_cal = pd.DataFrame(data=ramp.predict(tempdf), index=tempdf.index, columns=['name'])
    return df_cal

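The problem is that rfmodel.predict(tempdf[lcscols]) produces different output from ramp.predict(tempdf).
I expected some difference, since pd.DataFrame.sample will not split the data exactly the same way as train_test_split, but the r^2 value is 0.98 when .predict is called on the trained model inside the same function, compared to r^2 = 0.5 when .predict is called on the returned model object. That seems too different to attribute to the different split methods?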
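
A minimal sketch for separating the two possible causes (returning the model from a function versus using a different split), assuming synthetic data in place of the CSV. The column names `x0`..`x5` and `y`, the helper `fit_in_function`, and the 0.8 training fraction are hypothetical and not taken from the real code; `n_estimators` is reduced to 50 only to keep the run short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV: 1000 rows, 6 float feature columns.
rng = np.random.default_rng(0)
xcols = [f"x{i}" for i in range(6)]
df = pd.DataFrame(rng.normal(size=(1000, 6)), columns=xcols)
df["y"] = df[xcols].sum(axis=1) + rng.normal(scale=0.1, size=1000)


def fit_in_function(train_df, xcols, ycol):
    """Fit a forest inside a function and return the fitted estimator."""
    rf = RandomForestRegressor(n_estimators=50, random_state=42)
    return rf.fit(train_df[xcols], train_df[ycol])


# Check 1: same training rows and seed; one model is returned from a
# function, the other is fitted inline.
train_df = df.sample(frac=0.8, random_state=42)
returned_model = fit_in_function(train_df, xcols, "y")
inline_model = RandomForestRegressor(n_estimators=50, random_state=42)
inline_model.fit(train_df[xcols], train_df["y"])

pred_returned = returned_model.predict(df[xcols])
pred_inline = inline_model.predict(df[xcols])
same_predictions = np.allclose(pred_returned, pred_inline)
print("identical predictions:", same_predictions)

# Check 2: how many training rows do the two split methods have in common?
sample_idx = set(train_df.index)
tts_train, tts_test = train_test_split(df, test_size=0.2, random_state=42)
shared = len(sample_idx & set(tts_train.index))
print("rows in both training sets:", shared, "of", len(sample_idx))
```

If the first check reports identical predictions, returning the estimator from a function is not what changes the output, and the remaining differences between the two code paths (which rows are used for training, and how many) are the place to look. In particular, `frac` in `df.sample` is the fraction of rows kept, so a value such as 0.2 would train on 20% of the data rather than 80%.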