# Tutorial - Implementing a custom mixer in Lightwood


## Introduction

Mixers are the centerpiece of lightwood: they are tasked with learning the mapping between the encoded feature representation and the encoded target representation.


## Objective

In this tutorial we'll implement an sklearn random forest as a mixer that handles categorical and binary targets.

## Step 1: The Mixer Interface

The Mixer interface is defined by the `BaseMixer` class. A mixer needs methods for four tasks:
* fitting (`fit`)
* predicting (`__call__`)
* construction (`__init__`)
* partial fitting (`partial_fit`), though this one is optional
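
To make these four responsibilities concrete, here is a minimal sketch of a class with the same shape. `SketchBaseMixer` and `MajorityClassMixer` are hypothetical stand-ins written for illustration only: they do not import lightwood, and the real `BaseMixer` works on encoded datasets rather than plain Python lists.

```python
from collections import Counter
from typing import List, Optional


class SketchBaseMixer:
    """Hypothetical stand-in for the interface; NOT lightwood's BaseMixer."""

    def __init__(self, stop_after: float):
        # Construction: store settings, here a time budget in seconds
        self.stop_after = stop_after

    def fit(self, train_data: List[tuple], dev_data: List[tuple]) -> None:
        raise NotImplementedError

    def __call__(self, data: List[tuple]) -> List[str]:
        raise NotImplementedError

    def partial_fit(self, train_data: List[tuple], dev_data: List[tuple]) -> None:
        # Optional: mixers that cannot be updated incrementally leave this as a no-op
        pass


class MajorityClassMixer(SketchBaseMixer):
    """Toy mixer that always predicts the most frequent training label."""

    def __init__(self, stop_after: float):
        super().__init__(stop_after)
        self.majority: Optional[str] = None

    def fit(self, train_data, dev_data):
        # Use both splits, since this toy model needs no early stopping
        labels = [y for _, y in list(train_data) + list(dev_data)]
        self.majority = Counter(labels).most_common(1)[0][0]

    def __call__(self, data):
        return [self.majority for _ in data]


# Each row is a (features, label) pair; the values are made up for illustration
train = [([63, 1], 'sick'), ([41, 0], 'healthy'), ([57, 1], 'sick')]
dev = [([35, 0], 'healthy'), ([60, 1], 'sick')]

mixer = MajorityClassMixer(stop_after=10)
mixer.fit(train, dev)
predictions = mixer([([50, 1], None), ([29, 0], None)])
print(predictions)
```

The random forest mixer in the next step follows the same pattern, with the heavy lifting delegated to sklearn.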

## Step 2: Writing our mixer

I'm going to create a file called `random_forest_mixer.py` inside `/etc/lightwood_modules`, which is where lightwood sources custom modules from.

Inside of it I'm going to write the following code:

In [1]:
%%writefile random_forest_mixer.py

from lightwood.mixer import BaseMixer
from lightwood.api.types import PredictionArguments
from lightwood.data.encoded_ds import EncodedDs, ConcatedEncodedDs
from type_infer.dtype import dtype
from lightwood.encoder import BaseEncoder

import torch
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class RandomForestMixer(BaseMixer):
    clf: RandomForestClassifier

    def __init__(self, stop_after: int, dtype_dict: dict, target: str, target_encoder: BaseEncoder):
        super().__init__(stop_after)
        self.target_encoder = target_encoder
        # Fail early if someone tries to use this mixer for a problem that's not classification.
        # It would fail anyway, but this way the error message is more intuitive.
        if dtype_dict[target] not in (dtype.categorical, dtype.binary):
            raise Exception(f'This mixer can only be used for classification problems! Got target dtype {dtype_dict[target]} instead!')

        # We could also initialize this in `fit` if some of the parameters depended on the input data, since `fit` is called exactly once
        self.clf = RandomForestClassifier(max_depth=30)

    def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
        X, Y = [], []
        # By default, mixers get some train data and a bit of dev data on which to do early stopping or hyperparameter optimization.
        # This mixer doesn't need dev data, so we concatenate the two in order to get more training data.
        # Then we turn them into an sklearn-friendly format.
        for x, y in ConcatedEncodedDs([train_data, dev_data]):
            X.append(x.tolist())
            Y.append(y.tolist())
        self.clf.fit(X, Y)

    def __call__(self, ds: EncodedDs,
                 args: PredictionArguments = PredictionArguments()) -> pd.DataFrame:
        # Turn the data into an sklearn-friendly format
        X = []
        for x, _ in ds:
            X.append(x.tolist())

        Yh = self.clf.predict(X)

        # Lightwood encoders are meant to decode torch tensors, so we have to cast the predictions first
        decoded_predictions = self.target_encoder.decode(torch.Tensor(Yh))

        # Finally, turn the decoded predictions into a dataframe with a single column called `prediction`.
        # This is the standard behaviour all lightwood mixers use.
        ydf = pd.DataFrame({'prediction': decoded_predictions})

        return ydf

    # We'll skip implementing `partial_fit`, thus making this mixer unsuitable for online training tasks

Writing random_forest_mixer.py
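
The data handling in `fit` and `__call__` boils down to two steps: flattening an iterable of `(features, target)` pairs into parallel lists, and mapping encoded predictions back to labels. The sketch below shows that flow with plain Python lists. `ToyEncodedDs`, `to_sklearn_format` and `decode_one_hot` are hypothetical helpers, and the one-hot target layout is an assumption made for illustration; the real encoders may lay out their vectors differently.

```python
from itertools import chain
from typing import Iterator, List, Tuple

Row = Tuple[List[float], List[int]]


class ToyEncodedDs:
    """Hypothetical stand-in for an encoded dataset: yields (x, y) pairs."""

    def __init__(self, rows: List[Row]):
        self.rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self.rows)


def to_sklearn_format(*datasets: ToyEncodedDs) -> Tuple[List[List[float]], List[List[int]]]:
    """Flatten one or more datasets into parallel feature and target lists."""
    X, Y = [], []
    # chain() plays the role of concatenating the train and dev splits
    for x, y in chain(*datasets):
        X.append(list(x))
        Y.append(list(y))
    return X, Y


def decode_one_hot(vectors: List[List[int]], classes: List[str]) -> List[str]:
    """Map each one-hot vector back to its class label via argmax."""
    return [classes[max(range(len(v)), key=v.__getitem__)] for v in vectors]


# Made-up encoded rows: two features each, one-hot target over classes '0' and '1'
train = ToyEncodedDs([([0.63, 1.0], [0, 1]), ([0.41, 0.0], [1, 0])])
dev = ToyEncodedDs([([0.57, 1.0], [0, 1])])

X, Y = to_sklearn_format(train, dev)
labels = decode_one_hot(Y, classes=['0', '1'])
print(len(X), labels)
```

In the actual mixer, the decoding step is handled by the target encoder's `decode` method, so the mixer never needs to know how the target was encoded.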


## Step 3: Using our mixer

We're going to use our mixer for diagnosing heart disease using this dataset: [https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv](https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv)

First, since we don't want to bother writing a Json AI for this dataset from scratch, we're going to let lightwood auto-generate one.

In [2]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, load_custom_module
import pandas as pd

# load the code
load_custom_module('random_forest_mixer.py')

# read dataset
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/heart_disease/data.csv')

# define the predictive task
pdef = ProblemDefinition.from_dict({
    'target': 'target',  # column you want to predict
})

# generate the Json AI intermediate representation from the data and its corresponding settings
json_ai = json_ai_from_problem(df, problem_definition=pdef)

# Print it (you can also put it in a file and edit it there)
print(json_ai.to_json())

INFO:lightwood-2524:No torchvision detected, image helpers not supported.
INFO:lightwood-2524:No torchvision/pillow detected, image encoder not supported
INFO:type_infer-2524:Analyzing a sample of 298
INFO:type_infer-2524:from a total population of 303, this is equivalent to 98.3% of your data.
INFO:type_infer-2524:Infering type for: age
INFO:type_infer-2524:Column age has data type integer
INFO:type_infer-2524:Infering type for: sex
INFO:type_infer-2524:Column sex has data type binary
INFO:type_infer-2524:Infering type for: cp
INFO:type_infer-2524:Column cp has data type categorical
INFO:type_infer-2524:Infering type for: trestbps
INFO:type_infer-2524:Column trestbps has data type integer
INFO:type_infer-2524:Infering type for: chol
INFO:type_infer-2524:Column chol has data type integer
INFO:type_infer-2524:Infering type for: fbs
INFO:type_infer-2524:Column fbs has data type binary
INFO:type_infer-2524:Infering type for: restecg
INFO:type_infer-2524:Column restecg has data type categorical
INFO:type_infer-2524:Infering type for: thalach
INFO:type_infer-2524:Column thalach has data type integer
INFO:type_infer-2524:Infering type for: exang
INFO:type_infer-2524:Column exang has data type binary
INFO:type_infer-2524:Infering type for: oldpeak
INFO:type_infer-2524:Column oldpeak has data type float
INFO:type_infer-2524:Infering type for: slope
INFO:type_infer-2524:Column slope has data type categorical
INFO:type_infer-2524:Infering type for: ca
INFO:type_infer-2524:Column ca has data type categorical
INFO:type_infer-2524:Infering type for: thal
INFO:type_infer-2524:Column thal has data type categorical
INFO:type_infer-2524:Infering type for: target
INFO:type_infer-2524:Column target has data type binary
INFO:dataprep_ml-2524:Starting statistical analysis
INFO:dataprep_ml-2524:Finished statistical analysis


{
    "encoders": {
        "target": {
            "module": "BinaryEncoder",
            "args": {
                "is_target": "True",
                "target_weights": "$statistical_analysis.target_weights"
            }
        },
        "age": {
            "module": "NumericEncoder",
            "args": {}
        },
        "sex": {
            "module": "BinaryEncoder",
            "args": {}
        },
        "cp": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "trestbps": {
            "module": "NumericEncoder",
            "args": {}
        },
        "chol": {
            "module": "NumericEncoder",
            "args": {}
        },
        "fbs": {
            "module": "BinaryEncoder",
            "args": {}
        },
        "restecg": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "thalach": {
            "module": "NumericEncoder",
            "args": {}
        },
        "exang": {
            "module": "BinaryEncoder",
            "args": {}
        },
        "oldpeak": {
            "module": "NumericEncoder",
            "args": {}
        },
        "slope": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "ca": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "thal": {
            "module": "OneHotEncoder",
            "args": {}
        }
    },
    "dtype_dict": {
        "age": "integer",
        "sex": "binary",
        "cp": "categorical",
        "trestbps": "integer",
        "chol": "int

Now we have to edit the `mixers` key of this Json AI to tell lightwood to use our custom mixer. We can use it together with the others, and have it ensembled with them at the end, or use it standalone. In this case I'm going to replace all existing mixers with this one.

In [3]:
json_ai.model['args']['submodels'] = [{
    'module': 'random_forest_mixer.RandomForestMixer',
    'args': {
        'stop_after': '$problem_definition.seconds_per_mixer',
        'dtype_dict': '$dtype_dict',
        'target': '$target',
        'target_encoder': '$encoders[self.target]'
    }
}]
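
The difference between the standalone and the ensembled setup comes down to whether the submodel list is replaced or extended. The sketch below uses a plain dictionary as a hypothetical stand-in for `json_ai.model`; the `BestOf` and `Neural` entries are illustrative placeholders for whatever lightwood generated, not a copy of its real output.

```python
import copy

# Hypothetical stand-in for `json_ai.model`; the real structure is generated by lightwood
model = {
    'module': 'BestOf',
    'args': {
        'submodels': [
            {'module': 'Neural', 'args': {}},  # placeholder for an auto-generated mixer
        ],
    },
}

custom_mixer = {
    'module': 'random_forest_mixer.RandomForestMixer',
    'args': {
        'stop_after': '$problem_definition.seconds_per_mixer',
        'dtype_dict': '$dtype_dict',
        'target': '$target',
        'target_encoder': '$encoders[self.target]',
    },
}

# Standalone: replace every existing submodel (what this tutorial does)
standalone = copy.deepcopy(model)
standalone['args']['submodels'] = [custom_mixer]

# Together with the others: keep the generated submodels and add ours to the list
combined = copy.deepcopy(model)
combined['args']['submodels'].append(custom_mixer)

print([m['module'] for m in standalone['args']['submodels']])
print([m['module'] for m in combined['args']['submodels']])
```

With the combined variant, the custom mixer competes with the generated ones when the ensemble is built, which is a quick way to check whether it improves on the defaults.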

Then we'll generate some code, and finally turn that code into a predictor object and fit it on the original data.

In [4]:
from lightwood.api.high_level import code_from_json_ai, predictor_from_code

code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)

In [5]:
predictor.learn(df)

INFO:dataprep_ml-2524:[Learn phase 1/8] - Statistical analysis
INFO:dataprep_ml-2524:Starting statistical analysis
INFO:dataprep_ml-2524:Finished statistical analysis
DEBUG:lightwood-2524: `analyze_data` runtime: 0.03 seconds
INFO:dataprep_ml-2524:[Learn phase 2/8] - Data preprocessing
INFO:dataprep_ml-2524:Cleaning the data
DEBUG:lightwood-2524: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2524:[Learn phase 3/8] - Data splitting
INFO:dataprep_ml-2524:Splitting the data into train/test
DEBUG:lightwood-2524: `split` runtime: 0.01 seconds
INFO:dataprep_ml-2524:[Learn phase 4/8] - Preparing encoders
DEBUG:dataprep_ml-2524:Preparing sequentially...
DEBUG:dataprep_ml-2524:Preparing encoder for age...
DEBUG:dataprep_ml-2524:Preparing encoder for sex...
DEBUG:dataprep_ml-2524:Preparing encoder for cp...
DEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0
DEBUG:dataprep_ml-2524:Preparing encoder for trestbps...
DEBUG:dataprep_ml-2524:Preparing encoder for chol...
DEBUG:dataprep_ml-2524:Preparing encoder for fbs...
DEBUG:dataprep_ml-2524:Preparing encoder for restecg...
DEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0
DEBUG:dataprep_ml-2524:Preparing encoder for thalach...
DEBUG:dataprep_ml-2524:Preparing encoder for exang...
DEBUG:dataprep_ml-2524:Preparing encoder for oldpeak...
DEBUG:dataprep_ml-2524:Preparing encoder for slope...
DEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0
DEBUG:dataprep_ml-2524:Preparing encoder for ca...
DEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0
DEBUG:dataprep_ml-2524:Preparing encoder for thal...
DEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0
DEBUG:lightwood-2524: `prepare` runtime: 0.02 seconds
INFO:dataprep_ml-2524:[Learn phase 5/8] - Feature generation
INFO:dataprep_ml-2524:Featurizing the data
DEBUG:lightwood-2524: `featurize` runtime: 0.09 seconds
INFO:dataprep_ml-2524:[Learn phase 6/8] - Mixer training
INFO:dataprep_ml-2524:Training the mixers
DEBUG:lightwood-2524: `fit_mixer` runtime: 0.12 seconds
INFO:dataprep_ml-2524:Ensembling the mixer
INFO:lightwood-2524:Mixer: RandomForestMixer got accuracy: 0.798
INFO:lightwood-2524:Picked best mixer: RandomForestMixer
DEBUG:lightwood-2524: `fit` runtime: 0.13 seconds
INFO:dataprep_ml-2524:[Learn phase 7/8] - Ensemble analysis
INFO:dataprep_ml-2524:Analyzing the ensemble of mixers
INFO:lightwood-2524:The block ICP is now running its analyze() method
INFO:lightwood-2524:The block ConfStats is now running its analyze() method
INFO:lightwood-2524:The block AccStats is now running its analyze() method
INFO:lightwood-2524:The block PermutationFeatureImportance is now running its analyze() method
INFO:lightwood-2524:[PFI] Using a random sample (1000 rows out of 31).
INFO:lightwood-2524:[PFI] Set to consider first 10 columns out of 10: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak'].
DEBUG:lightwood-2524: `analyze_ensemble` runtime: 0.27 seconds
INFO:dataprep_ml-2524:[Learn phase 8/8] - Adjustment on validation requested
INFO:dataprep_ml-2524:Updating the mixers
DEBUG:lightwood-2524: `adjust` runtime: 0.04 seconds
DEBUG:lightwood-2524: `learn` runtime: 0.62 seconds


Finally, we can use the trained predictor to make some predictions, or save it to a pickle file for later use.

In [6]:
predictions = predictor.predict(pd.DataFrame({
    'age': [63, 15, None],
    'sex': [1, 1, 0],
    'thal': [3, 1, 1]
}))
print(predictions)

predictor.save('my_custom_heart_disease_predictor.pickle')

INFO:dataprep_ml-2524:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2524:Cleaning the data
DEBUG:lightwood-2524: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2524:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2524:Featurizing the data


DEBUG:lightwood-2524: `featurize` runtime: 0.02 seconds
INFO:dataprep_ml-2524:[Predict phase 3/4] - Calling ensemble
DEBUG:lightwood-2524: `_timed_call` runtime: 0.01 seconds
INFO:dataprep_ml-2524:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2524:The block ICP is now running its explain() method
INFO:lightwood-2524:The block ConfStats is now running its explain() method
INFO:lightwood-2524:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2524:The block AccStats is now running its explain() method
INFO:lightwood-2524:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2524:The block PermutationFeatureImportance is now running its explain() method
INFO:lightwood-2524:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-2524: `explain` runtime: 0.01 seconds
DEBUG:lightwood-2524: `predict` runtime: 0.05 seconds

   original_index prediction  confidence
0               0          1    0.073676
1               1          0    0.250612
2               2          0    0.462595


That's all it takes to solve a predictive problem with lightwood using your own custom mixer.