Returning a trained scikit-learn (random forest) model from a function?

乘风 machine-learning 212

Original title: Returning a trained sci-kit learn (random forest) model from a function?

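I am training a random forest model and have found that returning the trained model object from a function consistently leads to different .predict behaviour. Is this intended or not?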

I think this is fully reproducible code. The input data is just 1000 rows by 6 columns of floats:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd

def as_a_function():
    df = pd.read_csv(...)  # read the input file (path omitted in the original)
    lcscols = [...]  # list of the 6 (of 12) CSV columns used to build the model (ignores timestamp, etc.)
    selcol = ...  # name of the column holding the 'real' y data

    # train/test split; testsize is not defined in the snippet as posted
    train_df = df.sample(frac=testsize, random_state=42)
    test_df = df.drop(train_df.index)
        
    rfmodel, fitvals_mid = RF_model(train_df,test_df,selcol, lcscols)

    tempdf = df.copy(deep=True)  # fresh copy; not strictly necessary but helpful in edge cases
    tempdf.dropna(inplace=True)
    selcolname = selcol + '_cal'
    # new DataFrame built from a .predict call
    # note: does the input column order matter, i.e. must it match the training order?
    mid_cal = pd.DataFrame(data=rfmodel.predict(tempdf[lcscols]), index=tempdf.index, columns=[selcolname])
    return mid_cal

def RF_model(train_df, test_df, ycol, xcols):
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    rfmodel = rf.fit(train_df[xcols], train_df[ycol])
    y_pred_test = rfmodel.predict(test_df[xcols])
    # code that evaluates the test-set predictions is omitted here
    return rfmodel, y_pred_test  # the caller unpacks two values

#################################

def inline():
    df = pd.read_csv(...)  # read the input file (path omitted in the original)
    lcscols = [...]  # list of the 6 (of 12) CSV columns used to build the model (ignores timestamp, etc.)
    refcol = ...  # name of the column holding the 'true' data

    X = df[lcscols].values
    y = df[[refcol]].values
        
    x_train,x_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
    ramp = rf.fit(x_train, y_train.flatten())
    y_pred_test = ramp.predict(x_test)
    #missing code to check prediction on test values
        
    tempdf = df.copy(deep=True)[lcscols]
    tempdf.dropna(axis=1,how='all',inplace=True)
    tempdf.dropna(axis=0,inplace=True)
    df_cal = pd.DataFrame(data=ramp.predict(tempdf),index=tempdf.index,columns=['name'])

    return df_cal

The problem is that rfmodel.predict(tempdf[lcscols]) produces different output from ramp.predict(tempdf).

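I expected some difference, because pd.DataFrame.sample will not split the data exactly as train_test_split does, but r^2 is 0.98 when .predict is called on the trained model inside the same function, compared with r^2 = 0.5 when .predict is called on the returned model object. That seems too large a difference to attribute to the different splitting methods?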
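One thing that may be worth checking is how much data each split actually trains on. The sketch below uses synthetic data and assumes a hypothetical `testsize = 0.2` (the question never shows its value). Under that assumption, `df.sample(frac=testsize)` assigns only 20% of the rows to `train_df`, while `train_test_split(..., test_size=0.2)` holds out 20% and trains on the remaining 80%.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1000 x 6 float input (an assumption, not the original data).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 6)), columns=list("abcdef"))

testsize = 0.2  # hypothetical value; it is not defined in the question

# Split style of as_a_function(): the sampled fraction becomes the TRAINING set.
train_df = df.sample(frac=testsize, random_state=42)
test_df = df.drop(train_df.index)

# Split style of inline(): test_size is the fraction that is HELD OUT.
x_train, x_test = train_test_split(df, test_size=0.2, random_state=42)

print("sample/drop      train, test:", len(train_df), len(test_df))
print("train_test_split train, test:", len(x_train), len(x_test))
```

If the real `testsize` is a small fraction, the two models are fitted on very different amounts of data, which could account for part of the gap in r^2 regardless of where .predict is called.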

Original link: https://stackoverflow.com//questions/71470896/returning-a-trained-sci-kit-learn-random-forest-model-from-a-function

Replies

  • Sarthak Thakur commented

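    Try calling np.random.seed(42) before you call the methods, and make sure numpy is imported first. Every time the model predicts it uses random values, and without a fixed seed it uses different random values on each run of your code. With np.random.seed(42) set, the model uses the same random values every time you run the code.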

    2 years ago · 0 comments
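A quick way to test whether returning a fitted model from a function changes its predictions is to fit the same estimator both ways on identical data and compare the outputs. The sketch below is only illustrative: it uses synthetic data from `make_regression`, a reduced `n_estimators=50`, and a hypothetical helper `fit_in_function`, none of which come from the question. The seed is fixed through the estimator's own `random_state` argument.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 6 features (an assumption standing in for the real CSV).
X, y = make_regression(n_samples=200, n_features=6, noise=0.1, random_state=0)

def fit_in_function(X, y):
    """Hypothetical helper: fit a forest and return the fitted model."""
    rf = RandomForestRegressor(n_estimators=50, random_state=42)
    return rf.fit(X, y)

# Model returned from a function vs. model fitted inline, same data and seed.
returned_model = fit_in_function(X, y)
inline_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

pred_returned = returned_model.predict(X)
pred_inline = inline_model.predict(X)

# Compare the two models, and compare two predict calls on the same model.
same_across_models = np.allclose(pred_returned, pred_inline)
same_across_calls = np.allclose(pred_returned, returned_model.predict(X))
print(same_across_models, same_across_calls)
```

If both comparisons come out true on your own data as well, the difference in r^2 is more likely to come from the data each model was trained or evaluated on than from passing the model object out of a function.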