{ "cells": [ { "cell_type": "markdown", "id": "smooth-philip", "metadata": {}, "source": [ "### Custom Encoder: Rule-Based\n", "\n", "Lightwood uses \"Encoders\" to convert preprocessed (cleaned) data into **features**. Encoders represent the **feature engineering** step of the data science pipeline; they can either have a set of instructions (\"rule-based\") or a learned representation (trained on data).\n", "\n", "In the following notebook, we will experiment with creating a custom encoder that creates **Label Encoding**. \n", "\n", "For example, imagine we have the following set of categories:\n", "\n", "```\n", "MyColumnData = [\"apple\", \"orange\", \"orange\", \"banana\", \"apple\", \"dragonfruit\"]\n", "```\n", "\n", "There are 4 categories to consider: \"apple\", \"banana\", \"orange\", and \"dragonfruit\".\n", "\n", "**Label encoding** allows you to refer to these categories as if they were numbers. For example, consider the mapping (arranged alphabetically):\n", "\n", "1 - apple
\n", "2 - banana
\n", "3 - dragonfruit
\n", "4 - orange
\n", "\n", "Using this mapping, we can convert the above data as follows:\n", "\n", "```\n", "MyFeatureData = [1, 4, 4, 2, 1, 3]\n", "```\n", "\n", "In the following notebook, we will design a **LabelEncoder** for Lightwood for use on categorical data. We will be using the Kaggle \"Used Car\" [dataset](https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes). We've provided a link for you to automatically access this CSV. This dataset describes various details of cars on sale - with the goal of predicting how much this car may sell for.\n", "\n", "Let's get started." ] }, { "cell_type": "code", "execution_count": 1, "id": "raising-adventure", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:12.524156Z", "iopub.status.busy": "2024-05-07T17:14:12.523956Z", "iopub.status.idle": "2024-05-07T17:14:15.297400Z", "shell.execute_reply": "2024-05-07T17:14:15.296719Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2557:No torchvision detected, image helpers not supported.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2557:No torchvision/pillow detected, image encoder not supported\u001b[0m\n" ] } ], "source": [ "import pandas as pd\n", "\n", "# Lightwood modules\n", "import lightwood as lw\n", "from lightwood import ProblemDefinition, \\\n", " JsonAI, \\\n", " json_ai_from_problem, \\\n", " code_from_json_ai, \\\n", " predictor_from_code" ] }, { "cell_type": "markdown", "id": "instant-income", "metadata": {}, "source": [ "### 1) Load your data\n", "\n", "Lightwood works with `pandas.DataFrame`s; load data via pandas as follows:" ] }, { "cell_type": "code", "execution_count": 2, "id": "technical-government", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:15.300454Z", "iopub.status.busy": "2024-05-07T17:14:15.300029Z", "iopub.status.idle": "2024-05-07T17:14:15.585690Z", "shell.execute_reply": "2024-05-07T17:14:15.585023Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelyearpricetransmissionmileagefuelTypetaxmpgengineSize
0A1201712500Manual15735Petrol15055.41.4
1A6201616500Automatic36203Diesel2064.22.0
2A1201611000Manual29946Petrol3055.41.4
3A4201716800Automatic25952Diesel14567.32.0
4A3201917300Manual1998Petrol14549.61.0
\n", "
" ], "text/plain": [ " model year price transmission mileage fuelType tax mpg engineSize\n", "0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4\n", "1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0\n", "2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4\n", "3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0\n", "4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = 'https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/used_car_price/data.csv'\n", "df = pd.read_csv(filename)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "anonymous-rainbow", "metadata": {}, "source": [ "We can see a handful of columns above, such as `model, year, price, transmission, mileage, fuelType, tax, mpg, engineSize`. Some columns are numerical whereas others are categorical. We are going to specifically only focus on categorical columns.\n", "\n", "\n", "### 2) Generate JSON-AI Syntax\n", "\n", "We will make a `LabelEncoder` as follows:\n", "\n", "(1) Find all unique examples within a column
\n", "(2) Order the examples in a consistent way
\n", "(3) Label (python-index of 0 as start) each category
\n", "(4) Assign the label according to each datapoint.
\n", "\n", "First, let's generate a JSON-AI syntax so we can automatically identify each column. " ] }, { "cell_type": "code", "execution_count": 3, "id": "absent-maker", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:15.588304Z", "iopub.status.busy": "2024-05-07T17:14:15.587940Z", "iopub.status.idle": "2024-05-07T17:14:26.459842Z", "shell.execute_reply": "2024-05-07T17:14:26.459139Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Analyzing a sample of 6920\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:from a total population of 10668, this is equivalent to 64.9% of your data.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Using 3 processes to deduct types.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: year\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: price\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column year has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column price has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: transmission\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: mileage\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: model\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column mileage has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: fuelType\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column fuelType has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: tax\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column tax has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: mpg\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column mpg has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Infering type for: engineSize\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column engineSize has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column transmission has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2557:Column model has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Finished 
statistical analysis\u001b[0m\n" ] } ], "source": [ "# Create the Problem Definition\n", "pdef = ProblemDefinition.from_dict({\n", " 'target': 'price', # column you want to predict\n", " #'ignore_features': ['year', 'mileage', 'tax', 'mpg', 'engineSize']\n", "})\n", "\n", "# Generate a JSON-AI object\n", "json_ai = json_ai_from_problem(df, problem_definition=pdef)" ] }, { "cell_type": "markdown", "id": "swedish-riverside", "metadata": {}, "source": [ "Let's take a look at our JSON-AI and print to file." ] }, { "cell_type": "code", "execution_count": 4, "id": "coastal-paragraph", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:26.463088Z", "iopub.status.busy": "2024-05-07T17:14:26.462595Z", "iopub.status.idle": "2024-05-07T17:14:26.468203Z", "shell.execute_reply": "2024-05-07T17:14:26.467507Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"encoders\": {\n", " \"price\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {\n", " \"is_target\": \"True\",\n", " \"positive_domain\": \"$statistical_analysis.positive_domain\"\n", " }\n", " },\n", " \"model\": {\n", " \"module\": \"CategoricalAutoEncoder\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_encoder\"\n", " }\n", " },\n", " \"year\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"transmission\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " },\n", " \"mileage\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"fuelType\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " },\n", " \"tax\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"mpg\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"engineSize\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " }\n", " },\n", " \"dtype_dict\": {\n", " \"model\": \"categorical\",\n", " \"year\": \"integer\",\n", " \"price\": \"integer\",\n", " \"transmission\": \"categorical\",\n", " \"mileage\": \"integer\",\n", " \"fuelType\": \"categorical\",\n", " \"tax\": \"integer\",\n", " \"mpg\": \"float\",\n", " \"engineSize\": \"float\"\n", " },\n", " \"dependency_dict\": {},\n", " \"model\": {\n", " \"module\": \"BestOf\",\n", " \"args\": {\n", " \"submodels\": [\n", " {\n", " \"module\": \"Neural\",\n", " \"args\": {\n", " \"fit_on_dev\": true,\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"search_hyperparameters\": true\n", " }\n", " },\n", " {\n", " \"module\": \"XGBoostMixer\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"fit_on_dev\": true\n", " }\n", " },\n", " {\n", " \"module\": \"Regression\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\"\n", " }\n", " },\n", " {\n", " \"module\": \"RandomForest\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"fit_on_dev\": true\n", " }\n", " }\n", " ]\n", " }\n", " },\n", " \"problem_definition\": {\n", " \"target\": \"price\",\n", " \"pct_invalid\": 2,\n", " \"unbias_target\": true,\n", " \"seconds_per_mixer\": 21384.0,\n", " \"seconds_per_encoder\": 85536.0,\n", " \"expected_additional_time\": 10.828943014144897,\n", " \"time_aim\": 259200,\n", " \"target_weights\": null,\n", " \"positive_domain\": false,\n", " \"timeseries_settings\": {\n", " \"is_timeseries\": false,\n", " \"order_by\": null,\n", " \"window\": null,\n", " \"group_by\": null,\n", " 
\"use_previous_target\": true,\n", " \"horizon\": null,\n", " \"historical_columns\": null,\n", " \"target_type\": \"\",\n", " \"allow_incomplete_history\": true,\n", " \"eval_incomplete\": false,\n", " \"interval_periods\": []\n", " },\n", " \"anomaly_detection\": false,\n", " \"use_default_analysis\": true,\n", " \"embedding_only\": false,\n", " \"dtype_dict\": {},\n", " \"ignore_features\": [],\n", " \"fit_on_all\": true,\n", " \"strict_mode\": true,\n", " \"seed_nr\": 1\n", " },\n", " \"identifiers\": {},\n", " \"imputers\": [],\n", " \"accuracy_functions\": [\n", " \"r2_score\"\n", " ]\n", "}\n" ] } ], "source": [ "print(json_ai.to_json())" ] }, { "cell_type": "markdown", "id": "expired-flour", "metadata": {}, "source": [ "### 3) Create your custom encoder (`LabelEncoder`).\n", "\n", "Once our JSON-AI is filled, let's make our LabelEncoder. All Lightwood encoders inherit from the `BaseEncoder` class, found [here](https://github.com/mindsdb/lightwood/blob/staging/lightwood/encoder/base.py). \n", "\n", "![BaseEncoder](baseencoder.png)\n", "\n", "\n", "The `BaseEncoder` has 5 expected calls:\n", "\n", "- `__init__`: instantiate the encoder\n", "- `prepare`: Train or create the rules of the encoder\n", "- `encode`: Given data, convert to the featurized representation\n", "- `decode`: Given featurized representations, revert back to data\n", "- `to`: Use CPU/GPU (mostly important for learned representations)\n", "\n", "From above, we see that \"model\", \"transmission\", and \"fuelType\" are all categorical columns. These will be the ones we want to modify." ] }, { "cell_type": "markdown", "id": "verbal-northwest", "metadata": {}, "source": [ "##### `LabelEncoder`\n", "\n", "The `LabelEncoder` should satisfy a couple of rules\n", "\n", "(1) For the ``__init__`` call:
\n", " - Specify the only argument `is_target`; this asks whether the encoder aims to represent the target column.
\n", " - Set `is_prepared=False` in the initialization. All encoders are prepared using their `prepare()` call, which turns this flag on to `True` if preparation of the encoders is successful.
\n", " - Set `output_size=1`; the output size refers to how many options the represented encoder may adopt. \n", " \n", " \n", "(2) For the ``prepare`` call:\n", " - Specify the only argument `priming_data`; this provides the `pd.Series` of the data column for the encoder.\n", " - Find all unique categories in the column data\n", " - Make a dictionary representing label number to category (reserves 0 as Unknown) and the inverse dictionary\n", " - Set `is_prepared=True`\n", " \n", "(3) The `encode()` call will convert each data point's category name into the encoded label.\n", "\n", "(4) The `decode()` call will convert a previously encoded label into the original category name.\n", "\n", "Given this approach only uses simple dictionaries, there is no need for a dedicated `to()` call (although this would inherit `BaseEncoder`'s implementation).\n", "\n", "This implementation would look as follows:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e03db1b0", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:26.470619Z", "iopub.status.busy": "2024-05-07T17:14:26.470424Z", "iopub.status.idle": "2024-05-07T17:14:26.475548Z", "shell.execute_reply": "2024-05-07T17:14:26.474947Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing LabelEncoder.py\n" ] } ], "source": [ "%%writefile LabelEncoder.py\n", "\n", "\"\"\"\n", "2021.10.13\n", "\n", "Create a LabelEncoder that transforms categorical data into a label.\n", "\"\"\"\n", "import pandas as pd\n", "import torch\n", "\n", "from lightwood.encoder import BaseEncoder\n", "from typing import List, Union\n", "from lightwood.helpers.log import log\n", "\n", "\n", "class LabelEncoder(BaseEncoder):\n", " \"\"\"\n", " Create a label representation for categorical data. 
"    \"\"\"\n", "    Create a label representation for categorical data. Labels follow the order in which categories first appear in the training data; label 0 is reserved for unknown categories.\n", "\n", "    Class Attributes:\n", "    - is_target: Whether this is used to encode the target\n", "    - is_prepared: Whether the encoder rules have been set (after ``prepare`` is called)\n", "\n", "    \"\"\" # noqa\n", "\n", "    is_target: bool\n", "    is_prepared: bool\n", "\n", "    is_timeseries_encoder: bool = False\n", "    is_trainable_encoder: bool = True\n", "\n", "    def __init__(self, is_target: bool = False, stop_after: float = 10) -> None:\n", "        \"\"\"\n", "        Initialize the Label Encoder\n", "\n", "        :param is_target: Whether this encoder represents the target column\n", "        :param stop_after: Time budget in seconds (unused by this rule-based encoder)\n", "        \"\"\"\n", "        self.is_target = is_target\n", "        self.is_prepared = False\n", "\n", "        # Size of the output encoded dimension per data point\n", "        # For LabelEncoder, this is always 1 (1 label per category)\n", "        self.output_size = 1\n", "\n", "    def prepare(self, train_data: pd.Series, dev_data: pd.Series) -> None:\n", "        \"\"\"\n", "        Create a LabelEncoder for categorical data.\n", "\n", "        LabelDict creates a mapping where each index is associated to a category.\n", "\n", "        :param train_data: Training split of the categorical input column\n", "        :param dev_data: Dev split of the categorical input column (unused by this encoder)\n", "\n", "        :returns: Nothing; prepares encoder rules with `label_dict` and `ilabel_dict`\n", "        \"\"\"\n", "\n", "        # Find all unique categories in the dataset\n", "        categories = train_data.unique()\n", "\n", "        log.info(\"Categories Detected = \" + str(len(categories)))\n", "\n", "        # Create the Category labeller\n", "        self.label_dict = {\"Unknown\": 0}  # Include an unknown category\n", "        self.label_dict.update({cat: idx + 1 for idx, cat in enumerate(categories)})\n", "        self.ilabel_dict = {idx: cat for cat, idx in self.label_dict.items()}\n", "\n", "        self.is_prepared = True\n", "\n", "    def encode(self, column_data: Union[pd.Series, list]) -> torch.Tensor:\n", "        \"\"\"\n", "        Convert pre-processed data into the labeled values\n", "\n", "        :param column_data: Pandas series (or list) to convert into labels\n", "        \"\"\"\n", "        if isinstance(column_data, pd.Series):\n", "            enc = column_data.apply(lambda x: self.label_dict.get(x, 0)).tolist()\n", "        else:\n", "            enc = [self.label_dict.get(x, 0) for x in column_data]\n", "\n", "        return torch.Tensor(enc).int().unsqueeze(1)\n", "\n", "    def decode(self, encoded_data: torch.Tensor) -> List[object]:\n", "        \"\"\"\n", "        Convert torch.Tensor labels into categorical data\n", "\n", "        :param encoded_data: Encoded data in the form of a torch.Tensor\n", "        \"\"\"\n", "        return [self.ilabel_dict[i.item()] for i in encoded_data]\n" ] }, { "cell_type": "markdown", "id": "23cce8a8", "metadata": {}, "source": [ "Some additional notes:\n", "\n", "(1) The `encode()` call should accept a list of values; making it also compatible with `pd.Series` or `pd.DataFrame` is optional.\n", "\n", "(2) The output of `encode()` must be a torch tensor with dimensionality $N_{rows} \\times N_{output}$.\n",
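"\n", "As an optional sanity check, the encoder can be exercised in isolation before wiring it into Lightwood. This is only a quick sketch; it assumes the `%%writefile` cell above has written `LabelEncoder.py` into the notebook's working directory:\n", "\n", "```python\n", "import pandas as pd\n", "from LabelEncoder import LabelEncoder\n", "\n", "enc = LabelEncoder()\n", "fruit = pd.Series([\"apple\", \"orange\", \"orange\", \"banana\", \"apple\", \"dragonfruit\"])\n", "enc.prepare(fruit, fruit)  # train and dev splits; here we simply reuse the same series\n", "\n", "encoded = enc.encode(fruit)\n", "print(encoded.shape)        # torch.Size([6, 1]), i.e. N_rows x N_output\n", "print(enc.decode(encoded))  # back to the original category names\n", "```\n",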
\n", "(2) The output of `encode()` must be a torch tensor with dimensionality $N_{rows} x N_{output}$.\n", "\n", "Now that the `LabelEncoder` is complete, move this to `~/lightwood_modules` and we're ready to try this out!" ] }, { "cell_type": "code", "execution_count": 6, "id": "e30866c1", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:26.478042Z", "iopub.status.busy": "2024-05-07T17:14:26.477694Z", "iopub.status.idle": "2024-05-07T17:14:26.480818Z", "shell.execute_reply": "2024-05-07T17:14:26.480320Z" } }, "outputs": [], "source": [ "from lightwood import load_custom_module\n", "load_custom_module('LabelEncoder.py')" ] }, { "cell_type": "markdown", "id": "optical-archive", "metadata": {}, "source": [ "### 4) Edit JSON-AI\n", "\n", "Now that we have our `LabelEncoder` script, we have two ways of introducing this encoder:\n", "\n", "(1) Change all categorical columns to our encoder of choice
\n", "(2) Replace the default encoder (`Categorical.OneHotEncoder`) for categorical data to our encoder of choice
\n", "\n", "In the first scenario, we may not want to change ALL columns. By switching the encoder on a `Feature` level, Lightwood allows you to control how representations for a given feature are handled. However, suppose you want to replace an approach entirely with your own methodology - Lightwood supports overriding default methods to control how you want to treat a *data type* as well.\n", "\n", "Below, we'll show both strategies:" ] }, { "cell_type": "markdown", "id": "quiet-lodging", "metadata": {}, "source": [ "The first strategy requires just specifying which features you'd like to change. Once you have your list, you can manually set the encoder \"module\" to the class you'd like. **This is best suited for a few columns or if you only want to override a few particular columns as opposed to replacing the `Encoder` behavior for an entire data type**.\n", "#### Strategy 1: Change the encoders for the features directly\n", "```python\n", "for ft in [\"model\", \"transmission\", \"fuelType\"]: # Features you want to replace\n", " # Set each feature to the custom encoder\n", " json_ai.encoders[ft]['module'] = 'LabelEncoder.LabelEncoder'\n", "```\n", "\n", "\n", "Suppose you have many columns that are categorical- you may want to enforce your approach explicitly without naming each column. This can be done by examining the `data_dtype` of JSON-AI's features. For all features that are type `categorical` (while this is a `str`, it's ideal to import dtype and explicitly check the data type), replace the default `Encoder` with your encoder. In this case, this is `LabelEncoder.LabelEncoder`.\n", "#### Strategy 2: Programatically change *all* encoder assignments for a data type\n", "\n", "```python\n", "from lightwood.api import dtype\n", "for i in json_ai.dtype_dict:\n", " if json_ai.dtype_dict[i] == dtype.categorical:\n", " json_ai.encoders[i]['module'] = 'LabelEncoder.LabelEncoder'\n", "```\n", "\n", "We'll go with the first approach for simplicity:" ] }, { "cell_type": "code", "execution_count": 7, "id": "elementary-fusion", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:26.483350Z", "iopub.status.busy": "2024-05-07T17:14:26.482935Z", "iopub.status.idle": "2024-05-07T17:14:26.486193Z", "shell.execute_reply": "2024-05-07T17:14:26.485582Z" } }, "outputs": [], "source": [ "for ft in [\"model\", \"transmission\", \"fuelType\"]: # Features you want to replace\n", " # Set each feature to the custom encoder\n", " json_ai.encoders[ft]['module'] = 'LabelEncoder.LabelEncoder'" ] }, { "cell_type": "markdown", "id": "together-austria", "metadata": {}, "source": [ "### 5) Generate code and your predictor from JSON-AI\n", "\n", "Now, let's use this JSON-AI object to generate code and make a predictor. This can be done in 2 simple lines, below:" ] }, { "cell_type": "code", "execution_count": 8, "id": "inappropriate-james", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:26.488868Z", "iopub.status.busy": "2024-05-07T17:14:26.488503Z", "iopub.status.idle": "2024-05-07T17:14:26.843807Z", "shell.execute_reply": "2024-05-07T17:14:26.843080Z" } }, "outputs": [], "source": [ "#Generate python code that fills in your pipeline\n", "code = code_from_json_ai(json_ai)\n", "\n", "# Turn the code above into a predictor object\n", "predictor = predictor_from_code(code)" ] }, { "cell_type": "markdown", "id": "personalized-andorra", "metadata": {}, "source": [ "Now, let's run our pipeline. 
To do so, let's first:\n", "\n", "(1) Perform a statistical analysis on the data (*this is important in preparing Encoders/Mixers as it populates the* `StatisticalAnalysis` *attribute with details some encoders need*).
\n", "(2) Clean our data
\n", "(3) Prepare the encoders
\n", "(4) Featurize the data
" ] }, { "cell_type": "code", "execution_count": 9, "id": "palestinian-harvey", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:26.846949Z", "iopub.status.busy": "2024-05-07T17:14:26.846537Z", "iopub.status.idle": "2024-05-07T17:14:27.973734Z", "shell.execute_reply": "2024-05-07T17:14:27.973159Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Finished statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2557: `analyze_data` runtime: 0.42 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2557: `preprocess` runtime: 0.13 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Splitting the data into train/test\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2557: `split` runtime: 0.0 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2557:Preparing sequentially...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2557:Preparing encoder for year...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2557:Preparing encoder for mileage...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2557:Preparing encoder for tax...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2557:Preparing encoder for mpg...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2557:Preparing encoder for engineSize...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2557:Categories Detected = 1\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2557:Categories Detected = 1\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2557:Categories Detected = 1\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2557: `prepare` runtime: 0.02 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2557:Featurizing the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2557: `featurize` runtime: 0.55 seconds\u001b[0m\n" ] } ], "source": [ "# Perform Stats Analysis\n", "predictor.analyze_data(df)\n", "\n", "# Pre-process the data\n", "cleaned_data = predictor.preprocess(data=df)\n", "\n", "# Create a train/test split\n", "split_data = predictor.split(cleaned_data)\n", "\n", "# Prepare the encoders \n", "predictor.prepare(split_data)\n", "\n", "# Featurize the data\n", "ft_data = predictor.featurize(split_data)" ] }, { "cell_type": "markdown", "id": "ordered-beast", "metadata": {}, "source": [ "The splitter creates 3 data-splits, a \"train\", \"dev\", and \"test\" set. The `featurize` command from the predictor allows us to convert the cleaned data into features. 
We can access this as follows:" ] }, { "cell_type": "code", "execution_count": 10, "id": "silent-dealing", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:27.976463Z", "iopub.status.busy": "2024-05-07T17:14:27.976061Z", "iopub.status.idle": "2024-05-07T17:14:27.984613Z", "shell.execute_reply": "2024-05-07T17:14:27.984075Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fuelTypeEncData
0Diesel1.0
1Diesel1.0
2Diesel1.0
3Petrol2.0
4Diesel1.0
\n", "
" ], "text/plain": [ " fuelType EncData\n", "0 Diesel 1.0\n", "1 Diesel 1.0\n", "2 Diesel 1.0\n", "3 Petrol 2.0\n", "4 Diesel 1.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pick a categorical column name\n", "col_name = \"fuelType\"\n", "\n", "# Get the encoded feature data\n", "enc_ft = ft_data[\"train\"].get_encoded_column_data(col_name).squeeze(1) #torch tensor (N_rows x N_output_dim)\n", "\n", "# Get the original data from the dataset\n", "orig_data = ft_data[\"train\"].get_column_original_data(col_name) #pandas dataframe\n", "\n", "# Create a pandas data frame to compare encoded data and original data\n", "compare_data = pd.concat([orig_data, pd.Series(enc_ft, name=\"EncData\")], axis=1)\n", "compare_data.head()" ] }, { "cell_type": "markdown", "id": "fatty-peoples", "metadata": {}, "source": [ "We can see what the label mapping is by inspecting our encoders as follows:" ] }, { "cell_type": "code", "execution_count": 11, "id": "superior-mobility", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:27.987192Z", "iopub.status.busy": "2024-05-07T17:14:27.986803Z", "iopub.status.idle": "2024-05-07T17:14:27.990393Z", "shell.execute_reply": "2024-05-07T17:14:27.989705Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Unknown': 0, 'Diesel': 1, 'Petrol': 2, 'Hybrid': 3}\n" ] } ], "source": [ "# Label Name -> Label Number\n", "print(predictor.encoders[col_name].label_dict)" ] }, { "cell_type": "markdown", "id": "frequent-remedy", "metadata": {}, "source": [ "For each category above, the number associated in the dictionary is the label for each category. This means \"Diesel\" is always represented by a 1, etc.\n", "\n", "With that, you've created your own custom Encoder that uses a rule-based approach! Please checkout more [tutorials](https://lightwood.io/tutorials.html) for other custom approach guides." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 5 }