Custom Python Models

Notebook provides a flexible environment for algorithm modeling, in which users can implement custom models in code.

Building a Single Python Model

After a custom Python model is deployed, the platform calls the model's "predict" interface, so the custom Python model must implement predict. To keep the API uniform, you also need to specify the column names used for prediction, and predict must accept a pandas DataFrame. In summary:

  1. Define a class (any class name);

  2. Implement a predict interface;

  3. predict must accept a pandas DataFrame;

For example:

import pandas as pd
from aiworks_plugins.hdfs_plugins import save_python_model_to_hdfs


class TestModel:
    def __init__(self):
        pass

    def predict(self, data: pd.DataFrame):
        pass


my_model = TestModel()
save_python_model_to_hdfs(my_model)

You can implement the model yourself, or build it on other Python libraries. A recommended pattern:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

"""
Suppose you want to model iris with sklearn's GBDT, using only the first three feature columns of the iris dataset.
"""
iris = load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

gbdt = GradientBoostingClassifier()
gbdt.fit(x[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']], y)


class TestModel:
    def __init__(self, model, col):
        self.model = model
        self.col = col

    def predict(self, data: pd.DataFrame):
        return self.model.predict(data[self.col])


my_model = TestModel(model=gbdt, col=['sepal length (cm)',
                                      'sepal width (cm)', 'petal length (cm)'])

# Save the model
from aiworks_plugins.hdfs_plugins import save_python_model_to_hdfs
save_python_model_to_hdfs(my_model)
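Before saving, it can help to call predict locally the same way the serving layer presumably will, i.e. with a pandas DataFrame. A standalone sketch that rebuilds the model above and sanity-checks the wrapper:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)
cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']

gbdt = GradientBoostingClassifier()
gbdt.fit(x[cols], iris.target)


class TestModel:
    def __init__(self, model, col):
        self.model = model
        self.col = col

    def predict(self, data: pd.DataFrame):
        return self.model.predict(data[self.col])


my_model = TestModel(model=gbdt, col=cols)

# predict is called with a DataFrame; extra columns are fine,
# because the wrapper selects self.col itself.
preds = my_model.predict(x.head(3))
print(len(preds))  # 3
```

Because the wrapper stores its own column list, callers can pass the full incoming DataFrame without pre-selecting columns.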

Building a Python Pipeline

Building a pipeline from Python code in Notebook works as follows:

1. Wrapping feature engineering

Feature engineering operates on specific columns, so each step needs to be wrapped, and the wrapper must implement a transform interface. The steps are:
ⅰ. Define a class (any class name);
ⅱ. Implement a transform interface;
ⅲ. transform must accept a pandas DataFrame;

Examples:

(1) Wrapping a hand-written feature-engineering step, e.g. a quadratic transform:

class FeatureSquare:

    def __init__(self, col):
        self.col = col

    def transform(self, data):
        for col in self.col:
            data[f'{col}_pow'] = data[col].apply(lambda x: x**2)
        return data

feat_square = FeatureSquare(['sepal length (cm)', 'sepal width (cm)'])
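To see what the wrapper above does, run transform on a small DataFrame (a standalone repetition of the class, with toy values, for illustration):

```python
import pandas as pd


class FeatureSquare:
    def __init__(self, col):
        self.col = col

    def transform(self, data):
        # Add a squared copy of every configured column
        for col in self.col:
            data[f'{col}_pow'] = data[col].apply(lambda x: x ** 2)
        return data


df = pd.DataFrame({'sepal length (cm)': [2.0, 3.0],
                   'sepal width (cm)': [1.0, 4.0]})
out = FeatureSquare(['sepal length (cm)', 'sepal width (cm)']).transform(df)
print(out['sepal length (cm)_pow'].tolist())  # [4.0, 9.0]
```

Note that transform returns the same DataFrame with the new `_pow` columns appended; the original columns are kept, since later stages still need them.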

(2) Wrapping a feature-engineering step from another library, e.g. feature scaling:

class FeatureScale:
    def __init__(self, model, col):
        self.col = col
        self.model = model

    def transform(self, data: pd.DataFrame):
        data[self.col] = self.model.transform(data[self.col])
        return data

from sklearn.preprocessing import MinMaxScaler
stdscale = MinMaxScaler()
stdscale.fit(x[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow']])
std_feature_model = FeatureScale(stdscale, ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                                             'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow'])
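Chaining the two wrappers reproduces the intended flow: square first, then scale. A standalone sketch (repeating both classes) that checks the scaled columns land in [0, 1]:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler


class FeatureSquare:
    def __init__(self, col):
        self.col = col

    def transform(self, data):
        for col in self.col:
            data[f'{col}_pow'] = data[col].apply(lambda x: x ** 2)
        return data


class FeatureScale:
    def __init__(self, model, col):
        self.col = col
        self.model = model

    def transform(self, data: pd.DataFrame):
        data[self.col] = self.model.transform(data[self.col])
        return data


iris = load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 1: add the squared columns
x = FeatureSquare(['sepal length (cm)', 'sepal width (cm)']).transform(x)

# Step 2: fit a scaler on all six columns and apply it via the wrapper
cols = list(x.columns)
scaler = MinMaxScaler().fit(x[cols])
x = FeatureScale(scaler, cols).transform(x)

print(float(x[cols].min().min()), float(x[cols].max().max()))  # 0.0 1.0
```

The scaler must be fitted after FeatureSquare has run, because it also scales the new `_pow` columns.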

2. Wrapping the prediction step

After a custom Python model is deployed, the platform calls the model's "predict" interface, so the custom Python model must implement predict. To keep the API uniform, you also need to specify the column names used for prediction, and predict must accept a pandas DataFrame. In summary:
ⅰ. Define a class (any class name);
ⅱ. Implement a predict interface;
ⅲ. predict must accept a pandas DataFrame;

Example:

from sklearn.ensemble import RandomForestClassifier


class PredictModel:
    def __init__(self, model, col):
        self.model = model
        self.col = col

    def predict(self, data: pd.DataFrame):
        return self.model.predict(data[self.col])


rf = RandomForestClassifier()
rf.fit(x[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow']], y)
predict_model = PredictModel(model=rf, col=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow'])

3. Assembling the pipeline

pipeline_model = [feat_square, std_feature_model, predict_model]

from aiworks_plugins.hdfs_plugins import save_python_pipeline_to_hdfs

save_python_pipeline_to_hdfs(pipeline_model)
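The document does not show how the serving side consumes this list, but a natural reading is: call transform on every stage except the last, then predict on the last. A hypothetical sketch of that loop with toy stages (run_pipeline, AddOne, and SumPredictor are illustrative names, not platform APIs):

```python
import pandas as pd


class AddOne:
    # Toy feature stage: stands in for FeatureSquare/FeatureScale
    def transform(self, data: pd.DataFrame):
        data['a'] = data['a'] + 1
        return data


class SumPredictor:
    # Toy final stage: stands in for PredictModel
    def predict(self, data: pd.DataFrame):
        return data['a'] + data['b']


def run_pipeline(pipeline, data: pd.DataFrame):
    # Assumed serving loop: transform through every stage,
    # then predict with the final stage.
    for stage in pipeline[:-1]:
        data = stage.transform(data)
    return pipeline[-1].predict(data)


df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
result = run_pipeline([AddOne(), SumPredictor()], df)
print(result.tolist())  # [12, 23]
```

This is why the list order matters: each stage sees the columns produced by the stages before it.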

4. Complete example

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Modeling
iris = load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Feature step 1; col must not include the label column
class FeatureSquare:

    def __init__(self, col):
        self.col = col

    def transform(self, data):
        for col in self.col:
            data[f'{col}_pow'] = data[col].apply(lambda x: x**2)
        return data

feat_square = FeatureSquare(['sepal length (cm)', 'sepal width (cm)'])

x = feat_square.transform(x)

# Feature step 2, using a model from sklearn; col must not include the label column
class FeatureScale:
    def __init__(self, model, col):
        self.col = col
        self.model = model

    def transform(self, data: pd.DataFrame):
        data[self.col] = self.model.transform(data[self.col])
        return data

from sklearn.preprocessing import MinMaxScaler
stdscale = MinMaxScaler()
stdscale.fit(x[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow']])
std_feature_model = FeatureScale(stdscale, ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                                             'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow'])
x = std_feature_model.transform(x)


# Build the prediction model; col must not include the label column
class PredictModel:
    def __init__(self, model, col):
        self.model = model
        self.col = col

    def predict(self, data: pd.DataFrame):
        return self.model.predict(data[self.col])


rf = RandomForestClassifier()
rf.fit(x[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow']], y)
predict_model = PredictModel(model=rf, col=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                'petal width (cm)', 'sepal length (cm)_pow', 'sepal width (cm)_pow'])

# Build the pipeline list, in the order the feature steps are applied
pipeline_model = [feat_square, std_feature_model, predict_model]

from aiworks_plugins.hdfs_plugins import save_python_pipeline_to_hdfs
save_python_pipeline_to_hdfs(pipeline_model, model_name='rf_pipeline')

Once this pipeline is deployed, you can test the deployed model by submitting either of the following inputs:

# Example input for the deployed prediction endpoint
{"sepal length (cm)" : 5.9, "sepal width (cm)": 3.0, "petal length (cm)": 5.1, "petal width (cm)": 1.8}
# or
{"sepal length (cm)" : 5.1, "sepal width (cm)": 3.5, "petal length (cm)": 1.4, "petal width (cm)": 0.2}
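Each test input is a single JSON object; on the serving side it presumably becomes a one-row DataFrame before reaching predict (the conversion step itself is an assumption about the platform, not shown in this document). A sketch:

```python
import json

import pandas as pd

payload = ('{"sepal length (cm)": 5.9, "sepal width (cm)": 3.0, '
           '"petal length (cm)": 5.1, "petal width (cm)": 1.8}')

# One JSON object -> one-row DataFrame, ready for transform/predict
record = json.loads(payload)
row = pd.DataFrame([record])
print(row.shape)  # (1, 4)
```

Note the keys must match the raw iris column names: the pipeline's first stage creates the `_pow` columns itself, so the caller only supplies the four original features.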