Custom Encoder: Rule-Based

Lightwood uses “Encoders” to convert preprocessed (cleaned) data into features. Encoders represent the feature engineering step of the data science pipeline; they can either have a set of instructions (“rule-based”) or a learned representation (trained on data).

In the following notebook, we will experiment with creating a custom encoder that creates Label Encoding.

For example, imagine we have the following set of categories:

MyColumnData = ["apple", "orange", "orange", "banana", "apple", "dragonfruit"]

There are 4 categories to consider: “apple”, “banana”, “orange”, and “dragonfruit”.

Label encoding allows you to refer to these categories as if they were numbers. For example, consider the mapping (arranged alphabetically):

1 - apple 2 - banana 3 - dragonfruit 4 - orange

Using this mapping, we can convert the above data as follows:

MyFeatureData = [1, 4, 4, 2, 1, 3]

In the following notebook, we will design a LabelEncoder for Lightwood for use on categorical data. We will be using the Kaggle “Used Car” dataset. We’ve provided a link for you to automatically access this CSV. This dataset describes various details of cars on sale - with the goal of predicting how much this car may sell for.

Let’s get started.

[1]:
import pandas as pd

# Lightwood modules
import lightwood as lw
from lightwood import ProblemDefinition, \
                      JsonAI, \
                      json_ai_from_problem, \
                      code_from_json_ai, \
                      predictor_from_code
INFO:lightwood-2551:No torchvision detected, image helpers not supported.
INFO:lightwood-2551:No torchvision/pillow detected, image encoder not supported

1) Load your data

Lightwood works with pandas.DataFrames; load data via pandas as follows:

[2]:
filename = 'https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/used_car_price/data.csv'
df = pd.read_csv(filename)
df.head()
[2]:
model year price transmission mileage fuelType tax mpg engineSize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0

We can see a handful of columns above, such as model, year, price, transmission, mileage, fuelType, tax, mpg, engineSize. Some columns are numerical whereas others are categorical. We are going to specifically only focus on categorical columns.

2) Generate JSON-AI Syntax

We will make a LabelEncoder as follows:

  1. Find all unique examples within a column

  2. Order the examples in a consistent way

  3. Label (python-index of 0 as start) each category

  4. Assign the label according to each datapoint.

First, let’s generate a JSON-AI syntax so we can automatically identify each column.

[3]:
# Create the Problem Definition
pdef = ProblemDefinition.from_dict({
    'target': 'price', # column you want to predict
    #'ignore_features': ['year', 'mileage', 'tax', 'mpg', 'engineSize']
})

# Generate a JSON-AI object
json_ai = json_ai_from_problem(df, problem_definition=pdef)
INFO:type_infer-2551:Analyzing a sample of 6920
INFO:type_infer-2551:from a total population of 10668, this is equivalent to 64.9% of your data.
INFO:type_infer-2551:Using 3 processes to deduct types.
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

If you really know what your doing, you can silence this warning with the warning module
or by setting POLARS_ALLOW_FORKING_THREAD=1.

  self.pid = os.fork()
INFO:type_infer-2551:Infering type for: year
INFO:type_infer-2551:Infering type for: price
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

If you really know what your doing, you can silence this warning with the warning module
or by setting POLARS_ALLOW_FORKING_THREAD=1.

  self.pid = os.fork()
INFO:type_infer-2551:Infering type for: model
INFO:type_infer-2551:Column price has data type integer
INFO:type_infer-2551:Column year has data type integer
INFO:type_infer-2551:Infering type for: transmission
INFO:type_infer-2551:Infering type for: mileage
INFO:type_infer-2551:Column mileage has data type integer
INFO:type_infer-2551:Infering type for: fuelType
INFO:type_infer-2551:Column model has data type categorical
INFO:type_infer-2551:Infering type for: tax
INFO:type_infer-2551:Column tax has data type integer
INFO:type_infer-2551:Infering type for: mpg
INFO:type_infer-2551:Column mpg has data type float
INFO:type_infer-2551:Infering type for: engineSize
INFO:type_infer-2551:Column engineSize has data type float
INFO:type_infer-2551:Column fuelType has data type categorical
INFO:type_infer-2551:Column transmission has data type categorical
INFO:dataprep_ml-2551:Starting statistical analysis
INFO:dataprep_ml-2551:Finished statistical analysis

Let’s take a look at our JSON-AI and print to file.

[4]:
print(json_ai.to_json())
{
    "encoders": {
        "price": {
            "module": "NumericEncoder",
            "args": {
                "is_target": "True",
                "positive_domain": "$statistical_analysis.positive_domain"
            }
        },
        "model": {
            "module": "CategoricalAutoEncoder",
            "args": {
                "stop_after": "$problem_definition.seconds_per_encoder"
            }
        },
        "year": {
            "module": "NumericEncoder",
            "args": {}
        },
        "transmission": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "mileage": {
            "module": "NumericEncoder",
            "args": {}
        },
        "fuelType": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "tax": {
            "module": "NumericEncoder",
            "args": {}
        },
        "mpg": {
            "module": "NumericEncoder",
            "args": {}
        },
        "engineSize": {
            "module": "NumericEncoder",
            "args": {}
        }
    },
    "dtype_dict": {
        "model": "categorical",
        "year": "integer",
        "price": "integer",
        "transmission": "categorical",
        "mileage": "integer",
        "fuelType": "categorical",
        "tax": "integer",
        "mpg": "float",
        "engineSize": "float"
    },
    "dependency_dict": {},
    "model": {
        "module": "BestOf",
        "args": {
            "submodels": [
                {
                    "module": "Neural",
                    "args": {
                        "fit_on_dev": true,
                        "stop_after": "$problem_definition.seconds_per_mixer",
                        "search_hyperparameters": true
                    }
                },
                {
                    "module": "XGBoostMixer",
                    "args": {
                        "stop_after": "$problem_definition.seconds_per_mixer",
                        "fit_on_dev": true
                    }
                },
                {
                    "module": "Regression",
                    "args": {
                        "stop_after": "$problem_definition.seconds_per_mixer"
                    }
                },
                {
                    "module": "RandomForest",
                    "args": {
                        "stop_after": "$problem_definition.seconds_per_mixer",
                        "fit_on_dev": true
                    }
                }
            ]
        }
    },
    "problem_definition": {
        "target": "price",
        "pct_invalid": 2,
        "unbias_target": true,
        "seconds_per_mixer": 21384.0,
        "seconds_per_encoder": 85536.0,
        "expected_additional_time": 11.148000478744507,
        "time_aim": 259200,
        "target_weights": null,
        "positive_domain": false,
        "timeseries_settings": {
            "is_timeseries": false,
            "order_by": null,
            "window": null,
            "group_by": null,
            "use_previous_target": true,
            "horizon": null,
            "historical_columns": null,
            "target_type": "",
            "allow_incomplete_history": true,
            "eval_incomplete": false,
            "interval_periods": []
        },
        "anomaly_detection": false,
        "use_default_analysis": true,
        "embedding_only": false,
        "dtype_dict": {},
        "ignore_features": [],
        "fit_on_all": true,
        "strict_mode": true,
        "seed_nr": 1
    },
    "identifiers": {},
    "imputers": [],
    "accuracy_functions": [
        "r2_score"
    ]
}

3) Create your custom encoder (LabelEncoder).

Once our JSON-AI is filled, let’s make our LabelEncoder. All Lightwood encoders inherit from the BaseEncoder class, found here.

BaseEncoder

The BaseEncoder has 5 expected calls:

  • __init__: instantiate the encoder

  • prepare: Train or create the rules of the encoder

  • encode: Given data, convert to the featurized representation

  • decode: Given featurized representations, revert back to data

  • to: Use CPU/GPU (mostly important for learned representations)

From above, we see that “model”, “transmission”, and “fuelType” are all categorical columns. These will be the ones we want to modify.

LabelEncoder

The LabelEncoder should satisfy a couple of rules

  1. For the __init__ call:

  • Specify the only argument is_target; this asks whether the encoder aims to represent the target column.

  • Set is_prepared=False in the initialization. All encoders are prepared using their prepare() call, which turns this flag on to True if preparation of the encoders is successful.

  • Set output_size=1; the output size refers to how many options the represented encoder may adopt.

  1. For the prepare call:

  • Specify the only argument priming_data; this provides the pd.Series of the data column for the encoder.

  • Find all unique categories in the column data

  • Make a dictionary representing label number to category (reserves 0 as Unknown) and the inverse dictionary

  • Set is_prepared=True

  1. The encode() call will convert each data point’s category name into the encoded label.

  2. The decode() call will convert a previously encoded label into the original category name.

Given this approach only uses simple dictionaries, there is no need for a dedicated to() call (although this would inherit BaseEncoder’s implementation).

This implementation would look as follows:

[5]:
%%writefile LabelEncoder.py

"""
2021.10.13

Create a LabelEncoder that transforms categorical data into a label.
"""
import pandas as pd
import torch

from lightwood.encoder import BaseEncoder
from typing import List, Union
from lightwood.helpers.log import log


class LabelEncoder(BaseEncoder):
    """
    Create a label representation for categorical data. The data will rely on sorted to organize the order of the labels.

    Class Attributes:
    - is_target: Whether this is used to encode the target
    - is_prepared: Whether the encoder rules have been set (after ``prepare`` is called)

    """  # noqa

    is_target: bool
    is_prepared: bool

    is_timeseries_encoder: bool = False
    is_trainable_encoder: bool = True

    def __init__(self, is_target: bool = False, stop_after = 10) -> None:
        """
        Initialize the Label Encoder

        :param is_target:
        """
        self.is_target = is_target
        self.is_prepared = False

        # Size of the output encoded dimension per data point
        # For LabelEncoder, this is always 1 (1 label per category)
        self.output_size = 1

    def prepare(self, train_data: pd.Series, dev_data: pd.Series) -> None:
        """
        Create a LabelEncoder for categorical data.

        LabelDict creates a mapping where each index is associated to a category.

        :param priming_data: Input column data that is categorical.

        :returns: Nothing; prepares encoder rules with `label_dict` and `ilabel_dict`
        """

        # Find all unique categories in the dataset
        categories = train_data.unique()

        log.info("Categories Detected = " + str(self.output_size))

        # Create the Category labeller
        self.label_dict = {"Unknown": 0}  # Include an unknown category
        self.label_dict.update({cat: idx + 1 for idx, cat in enumerate(categories)})
        self.ilabel_dict = {idx: cat for cat, idx in self.label_dict.items()}

        self.is_prepared = True

    def encode(self, column_data: Union[pd.Series, list]) -> torch.Tensor:
        """
        Convert pre-processed data into the labeled values

        :param column_data: Pandas series to convert into labels
        """
        if isinstance(column_data, pd.Series):
            enc = column_data.apply(lambda x: self.label_dict.get(x, 0)).tolist()
        else:
            enc = [self.label_dict.get(x, 0) for x in column_data]

        return torch.Tensor(enc).int().unsqueeze(1)

    def decode(self, encoded_data: torch.Tensor) -> List[object]:
        """
        Convert torch.Tensor labels into categorical data

        :param encoded_data: Encoded data in the form of a torch.Tensor
        """
        return [self.ilabel_dict[i.item()] for i in encoded_data]

Writing LabelEncoder.py

Some additional notes: (1) The encode() call should be able to intake a list of values, it is optional to make it compatible with pd.Series or pd.DataFrame (2) The output of encode() must be a torch tensor with dimensionality \(N_{rows} x N_{output}\).

Now that the LabelEncoder is complete, move this to ~/lightwood_modules and we’re ready to try this out!

[6]:
from lightwood import load_custom_module
load_custom_module('LabelEncoder.py')

4) Edit JSON-AI

Now that we have our LabelEncoder script, we have two ways of introducing this encoder:

  1. Change all categorical columns to our encoder of choice

  2. Replace the default encoder (Categorical.OneHotEncoder) for categorical data to our encoder of choice

In the first scenario, we may not want to change ALL columns. By switching the encoder on a Feature level, Lightwood allows you to control how representations for a given feature are handled. However, suppose you want to replace an approach entirely with your own methodology - Lightwood supports overriding default methods to control how you want to treat a data type as well.

Below, we’ll show both strategies:

The first strategy requires just specifying which features you’d like to change. Once you have your list, you can manually set the encoder “module” to the class you’d like. This is best suited for a few columns or if you only want to override a few particular columns as opposed to replacing the ``Encoder`` behavior for an entire data type. #### Strategy 1: Change the encoders for the features directly

for ft in ["model", "transmission", "fuelType"]: # Features you want to replace
    # Set each feature to the custom encoder
    json_ai.encoders[ft]['module'] = 'LabelEncoder.LabelEncoder'

Suppose you have many columns that are categorical- you may want to enforce your approach explicitly without naming each column. This can be done by examining the data_dtype of JSON-AI’s features. For all features that are type categorical (while this is a str, it’s ideal to import dtype and explicitly check the data type), replace the default Encoder with your encoder. In this case, this is LabelEncoder.LabelEncoder. #### Strategy 2: Programatically change all encoder assignments for a data type

from lightwood.api import dtype
for i in json_ai.dtype_dict:
    if json_ai.dtype_dict[i] == dtype.categorical:
        json_ai.encoders[i]['module'] = 'LabelEncoder.LabelEncoder'

We’ll go with the first approach for simplicity:

[7]:
for ft in ["model", "transmission", "fuelType"]: # Features you want to replace
    # Set each feature to the custom encoder
    json_ai.encoders[ft]['module'] = 'LabelEncoder.LabelEncoder'

5) Generate code and your predictor from JSON-AI

Now, let’s use this JSON-AI object to generate code and make a predictor. This can be done in 2 simple lines, below:

[8]:
#Generate python code that fills in your pipeline
code = code_from_json_ai(json_ai)

# Turn the code above into a predictor object
predictor = predictor_from_code(code)

Now, let’s run our pipeline. To do so, let’s first:

  1. Perform a statistical analysis on the data (this is important in preparing Encoders/Mixers as it populates the StatisticalAnalysis attribute with details some encoders need).

  2. Clean our data

  3. Prepare the encoders

  4. Featurize the data

[9]:
# Perform Stats Analysis
predictor.analyze_data(df)

# Pre-process the data
cleaned_data = predictor.preprocess(data=df)

# Create a train/test split
split_data = predictor.split(cleaned_data)

# Prepare the encoders
predictor.prepare(split_data)

# Featurize the data
ft_data = predictor.featurize(split_data)
INFO:dataprep_ml-2551:Starting statistical analysis
INFO:dataprep_ml-2551:Finished statistical analysis
DEBUG:lightwood-2551: `analyze_data` runtime: 0.42 seconds
INFO:dataprep_ml-2551:Cleaning the data
DEBUG:lightwood-2551: `preprocess` runtime: 0.13 seconds
INFO:dataprep_ml-2551:Splitting the data into train/test
DEBUG:lightwood-2551: `split` runtime: 0.0 seconds
DEBUG:dataprep_ml-2551:Preparing sequentially...
DEBUG:dataprep_ml-2551:Preparing encoder for year...
DEBUG:dataprep_ml-2551:Preparing encoder for mileage...
DEBUG:dataprep_ml-2551:Preparing encoder for tax...
DEBUG:dataprep_ml-2551:Preparing encoder for mpg...
DEBUG:dataprep_ml-2551:Preparing encoder for engineSize...
INFO:lightwood-2551:Categories Detected = 1
INFO:lightwood-2551:Categories Detected = 1
INFO:lightwood-2551:Categories Detected = 1
DEBUG:lightwood-2551: `prepare` runtime: 0.02 seconds
INFO:dataprep_ml-2551:Featurizing the data
DEBUG:lightwood-2551: `featurize` runtime: 0.55 seconds

The splitter creates 3 data-splits, a “train”, “dev”, and “test” set. The featurize command from the predictor allows us to convert the cleaned data into features. We can access this as follows:

[10]:
# Pick a categorical column name
col_name = "fuelType"

# Get the encoded feature data
enc_ft = ft_data["train"].get_encoded_column_data(col_name).squeeze(1) #torch tensor (N_rows x N_output_dim)

# Get the original data from the dataset
orig_data = ft_data["train"].get_column_original_data(col_name) #pandas dataframe

# Create a pandas data frame to compare encoded data and original data
compare_data = pd.concat([orig_data, pd.Series(enc_ft, name="EncData")], axis=1)
compare_data.head()
[10]:
fuelType EncData
0 Diesel 1.0
1 Diesel 1.0
2 Diesel 1.0
3 Petrol 2.0
4 Diesel 1.0

We can see what the label mapping is by inspecting our encoders as follows:

[11]:
# Label Name -> Label Number
print(predictor.encoders[col_name].label_dict)
{'Unknown': 0, 'Diesel': 1, 'Petrol': 2, 'Hybrid': 3}

For each category above, the number associated in the dictionary is the label for each category. This means “Diesel” is always represented by a 1, etc.

With that, you’ve created your own custom Encoder that uses a rule-based approach! Please checkout more tutorials for other custom approach guides.