{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "regulated-manufacturer",
   "metadata": {},
   "source": [
    "## Using your own pre-processing methods in Lightwood\n",
    "\n",
    "#### Date: 2021.10.07\n",
    "\n",
    "For the notebook below, we'll be exploring how to make **custom pre-processing** methods for our data. Lightwood has standard cleaning protocols to handle a variety of different data types, however, we want users to feel comfortable augmenting and addressing their own changes. To do so, we'll highlight the approach we would take below:\n",
    "\n",
    "\n",
    "We will use data from [Kaggle](https://www.kaggle.com/c/commonlitreadabilityprize/data?select=train.csv). \n",
    "\n",
    "The data has several columns, but ultimately aims to use text to predict a *readability score*. There are also some columns that I do not want to use when making predictions, such as `url_legal`, `license`, among others.\n",
    "\n",
    "In this tutorial, we're going to focus on making changes to 2 columns: \n",
    "(1) **excerpt**, a text column, and ensuring we remove stop words using NLTK. <br>\n",
    "(2) **target**, the goal to predict; we will make this explicitly non-negative.\n",
    "\n",
    "Note, for this ACTUAL challenge, negative and positive are meaningful. We are using this as an example dataset to demonstrate how you can make changes to your underlying dataset and proceed to building powerful predictors.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "happy-wheat",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:24.723492Z",
     "iopub.status.busy": "2025-03-25T10:07:24.722931Z",
     "iopub.status.idle": "2025-03-25T10:07:29.000765Z",
     "shell.execute_reply": "2025-03-25T10:07:28.999973Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:lightwood-3299:No torchvision detected, image helpers not supported.\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:lightwood-3299:No torchvision/pillow detected, image encoder not supported\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import nltk\n",
    "\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Lightwood modules\n",
    "import lightwood as lw\n",
    "from lightwood import ProblemDefinition, \\\n",
    "                      JsonAI, \\\n",
    "                      json_ai_from_problem, \\\n",
    "                      code_from_json_ai, \\\n",
    "                      predictor_from_code"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "indie-chaos",
   "metadata": {},
   "source": [
    "### 1) Load your data\n",
    "\n",
    "Lightwood uses `pandas` in order to handle datasets, as this is a very standard package in datascience. We can load our dataset using pandas in the following manner (make sure your data is in the data folder!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "recognized-parish",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:29.004124Z",
     "iopub.status.busy": "2025-03-25T10:07:29.003800Z",
     "iopub.status.idle": "2025-03-25T10:07:29.806490Z",
     "shell.execute_reply": "2025-03-25T10:07:29.805773Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>url_legal</th>\n",
       "      <th>license</th>\n",
       "      <th>excerpt</th>\n",
       "      <th>target</th>\n",
       "      <th>standard_error</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>c12129c31</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>When the young people returned to the ballroom...</td>\n",
       "      <td>-0.340259</td>\n",
       "      <td>0.464009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>85aa80a4c</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>All through dinner time, Mrs. Fayre was somewh...</td>\n",
       "      <td>-0.315372</td>\n",
       "      <td>0.480805</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b69ac6792</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>As Roger had predicted, the snow departed as q...</td>\n",
       "      <td>-0.580118</td>\n",
       "      <td>0.476676</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>dd1000b26</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>And outside before the palace a great garden w...</td>\n",
       "      <td>-1.054013</td>\n",
       "      <td>0.450007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>37c1b32fb</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Once upon a time there were Three Bears who li...</td>\n",
       "      <td>0.247197</td>\n",
       "      <td>0.510845</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          id url_legal license  \\\n",
       "0  c12129c31       NaN     NaN   \n",
       "1  85aa80a4c       NaN     NaN   \n",
       "2  b69ac6792       NaN     NaN   \n",
       "3  dd1000b26       NaN     NaN   \n",
       "4  37c1b32fb       NaN     NaN   \n",
       "\n",
       "                                             excerpt    target  standard_error  \n",
       "0  When the young people returned to the ballroom... -0.340259        0.464009  \n",
       "1  All through dinner time, Mrs. Fayre was somewh... -0.315372        0.480805  \n",
       "2  As Roger had predicted, the snow departed as q... -0.580118        0.476676  \n",
       "3  And outside before the palace a great garden w... -1.054013        0.450007  \n",
       "4  Once upon a time there were Three Bears who li...  0.247197        0.510845  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the data\n",
    "data = pd.read_csv(\"https://mindsdb-example-data.s3.eu-west-2.amazonaws.com/jupyter/train.csv.zip\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "official-wright",
   "metadata": {},
   "source": [
    "We see **6 columns**, a variety which are numerical, missing numbers, text, and identifiers or \"ids\". For our predictive task, we are only interested in 2 such columns, the **excerpt** and **target** columns.\n",
    "\n",
    "### 2) Create a JSON-AI default object\n",
    "Before we create a custom cleaner object, let's first create JSON-AI syntax for our problem based on its specifications. We can do so by setting up a ``ProblemDefinition``. The ``ProblemDefinition`` allows us to specify the target, the column we intend to predict, along with other details. \n",
    "\n",
    "The end goal of JSON-AI is to provide *a set of instructions on how to compile a machine learning pipeline*.\n",
    "\n",
    "In this case, let's specify our target, the aptly named **target** column. We will also tell JSON-AI to throw away features we never intend to use, such as \"url_legal\", \"license\", and \"standard_error\". We can do so in the following lines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "chicken-truth",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:29.808650Z",
     "iopub.status.busy": "2025-03-25T10:07:29.808435Z",
     "iopub.status.idle": "2025-03-25T10:07:45.239910Z",
     "shell.execute_reply": "2025-03-25T10:07:45.239285Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:lightwood-3299:Dropping features: ['url_legal', 'license', 'standard_error']\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Analyzing a sample of 2478\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:from a total population of 2834, this is equivalent to 87.4% of your data.\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Infering type for: id\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Doing text detection for column: id\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Column id has data type categorical\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Infering type for: excerpt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Doing text detection for column: excerpt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Infering type for: target\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:type_infer-3299:Column target has data type float\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARNING:type_infer-3299:Column id is an identifier of type \"Hash-like identifier\"\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Starting statistical analysis\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Dropping features: ['id']\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Finished statistical analysis\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# Setup the problem definition\n",
    "problem_definition = {\n",
    "    'target': 'target',\n",
    "    \"ignore_features\": [\"url_legal\", \"license\", \"standard_error\"]\n",
    "}\n",
    "\n",
    "# Generate the j{ai}son syntax\n",
    "json_ai = json_ai_from_problem(data, problem_definition)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "needed-flashing",
   "metadata": {},
   "source": [
    "Lightwood, as it processes the data, will provide the user a few pieces of information.\n",
    "\n",
    "(1) It drops the features we specify in the `ignore_features` argument <br>\n",
    "(2) It takes a small sample of data from each column to *automatically infer the data type* <br>\n",
    "(3) For each column that was not ignored, it identifies the most likely data type.<br>\n",
    "(4) It notices that \"ID\" is a hash-like-identifier.<br>\n",
    "(5) It conducts a small statistical analysis on the distributions in order to generate syntax.<br>\n",
    "\n",
    "As soon as you request a JSON-AI object, Lightwood automatically creates functional syntax from your data. You can see it as follows: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "designed-condition",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.242061Z",
     "iopub.status.busy": "2025-03-25T10:07:45.241780Z",
     "iopub.status.idle": "2025-03-25T10:07:45.246480Z",
     "shell.execute_reply": "2025-03-25T10:07:45.245860Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"encoders\": {\n",
      "        \"target\": {\n",
      "            \"module\": \"NumericEncoder\",\n",
      "            \"args\": {\n",
      "                \"is_target\": \"True\",\n",
      "                \"positive_domain\": \"$statistical_analysis.positive_domain\"\n",
      "            }\n",
      "        },\n",
      "        \"excerpt\": {\n",
      "            \"module\": \"PretrainedLangEncoder\",\n",
      "            \"args\": {\n",
      "                \"output_type\": \"$dtype_dict[$target]\",\n",
      "                \"stop_after\": \"$problem_definition.seconds_per_encoder\"\n",
      "            }\n",
      "        }\n",
      "    },\n",
      "    \"dtype_dict\": {\n",
      "        \"excerpt\": \"rich_text\",\n",
      "        \"target\": \"float\"\n",
      "    },\n",
      "    \"dependency_dict\": {},\n",
      "    \"model\": {\n",
      "        \"module\": \"BestOf\",\n",
      "        \"args\": {\n",
      "            \"submodels\": [\n",
      "                {\n",
      "                    \"module\": \"Neural\",\n",
      "                    \"args\": {\n",
      "                        \"fit_on_dev\": true,\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n",
      "                        \"search_hyperparameters\": true\n",
      "                    }\n",
      "                },\n",
      "                {\n",
      "                    \"module\": \"XGBoostMixer\",\n",
      "                    \"args\": {\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n",
      "                        \"fit_on_dev\": true\n",
      "                    }\n",
      "                },\n",
      "                {\n",
      "                    \"module\": \"Regression\",\n",
      "                    \"args\": {\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\"\n",
      "                    }\n",
      "                },\n",
      "                {\n",
      "                    \"module\": \"RandomForest\",\n",
      "                    \"args\": {\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n",
      "                        \"fit_on_dev\": true\n",
      "                    }\n",
      "                }\n",
      "            ]\n",
      "        }\n",
      "    },\n",
      "    \"problem_definition\": {\n",
      "        \"target\": \"target\",\n",
      "        \"pct_invalid\": 2,\n",
      "        \"unbias_target\": true,\n",
      "        \"seconds_per_mixer\": 21384.0,\n",
      "        \"seconds_per_encoder\": 85536.0,\n",
      "        \"expected_additional_time\": 15.390582799911499,\n",
      "        \"time_aim\": 259200,\n",
      "        \"target_weights\": null,\n",
      "        \"positive_domain\": false,\n",
      "        \"timeseries_settings\": {\n",
      "            \"is_timeseries\": false,\n",
      "            \"order_by\": null,\n",
      "            \"window\": null,\n",
      "            \"group_by\": null,\n",
      "            \"use_previous_target\": true,\n",
      "            \"horizon\": null,\n",
      "            \"historical_columns\": null,\n",
      "            \"target_type\": \"\",\n",
      "            \"allow_incomplete_history\": true,\n",
      "            \"eval_incomplete\": false,\n",
      "            \"interval_periods\": []\n",
      "        },\n",
      "        \"anomaly_detection\": false,\n",
      "        \"use_default_analysis\": true,\n",
      "        \"embedding_only\": false,\n",
      "        \"dtype_dict\": {},\n",
      "        \"ignore_features\": [\n",
      "            \"url_legal\",\n",
      "            \"license\",\n",
      "            \"standard_error\"\n",
      "        ],\n",
      "        \"fit_on_all\": true,\n",
      "        \"strict_mode\": true,\n",
      "        \"seed_nr\": 1\n",
      "    },\n",
      "    \"identifiers\": {\n",
      "        \"id\": \"Hash-like identifier\"\n",
      "    },\n",
      "    \"imputers\": [],\n",
      "    \"accuracy_functions\": [\n",
      "        \"r2_score\"\n",
      "    ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "print(json_ai.to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "level-vacation",
   "metadata": {},
   "source": [
    "The above shows the minimal syntax required to create a functional JSON-AI object. For each feature you consider in the dataset, we specify the name of the feature, the type of encoder (feature-engineering method) to process the feature, and key word arguments to process the encoder. For the output, we perform a similar operation, but specify the types of mixers, or algorithms used in making a predictor that can estimate the target. Lastly, we populate the \"problem_definition\" key with the ingredients for our ML pipeline.\n",
    "\n",
    "These are the only elements required to get off the ground with JSON-AI. However, we're interested in making a *custom* approach. So, let's make this syntax a file, and introduce our own changes."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "integrated-entrepreneur",
   "metadata": {},
   "source": [
    "### 3) Build your own cleaner module\n",
    "\n",
    "Let's make a file called `MyCustomCleaner.py`. To write this file, we will use `dataprep_ml.cleaners.cleaner` as inspiration. `dataprep_ml` is a companion library that is part of the broader MindsDB ecosystem, and specializes in data cleaning, data splitting and data analysis.\n",
    "\n",
    "The goal output of the cleaner is to provide pre-processing to your dataset - the output is only a pandas DataFrame. In theory, any pre-processing can be done here. However, data can be highly irregular - our default `Cleaner` function has several main goals:\n",
    "\n",
    "(1) Strip away any identifier, etc. unwanted columns <br>\n",
    "(2) Apply a cleaning function to each column in the dataset, according to that column's data type <br>\n",
    "(3) Standardize NaN values within each column for appropriate downstream treatment <br>\n",
    "\n",
    "You can choose to omit many of these details and completely write this module from scratch, but the easiest way to introduce your custom changes is to borrow the `Cleaner` function, and add core changes in a custom block.\n",
    "\n",
    "This can be done as follows\n",
    "\n",
    "\n",
    "You can see individual cleaning functions in `dataprep_ml.cleaners`. If you want to entirely replace a cleaning technique given a particular data-type, we invite you to change `dataprep_ml.cleaners.get_cleaning_func` using the argument `custom_cleaning_functions`; in this dictionary, for a datatype (specified in `type_infer.dtype`), you can assign your own function to override our defaults."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "325d8f1b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.248619Z",
     "iopub.status.busy": "2025-03-25T10:07:45.248394Z",
     "iopub.status.idle": "2025-03-25T10:07:45.253835Z",
     "shell.execute_reply": "2025-03-25T10:07:45.253259Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing MyCustomCleaner.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile MyCustomCleaner.py\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from type_infer.dtype import dtype\n",
    "\n",
    "from lightwood.helpers import text\n",
    "from lightwood.helpers.log import log\n",
    "from lightwood.api.types import TimeseriesSettings\n",
    "\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "from typing import Dict\n",
    "\n",
    "# Borrow cleaner functions\n",
    "from dataprep_ml.cleaners import (\n",
    "    _remove_columns,\n",
    "    _get_columns_to_clean,\n",
    "    get_cleaning_func,\n",
    ")\n",
    "\n",
    "# Use for standardizing NaNs\n",
    "VALUES_FOR_NAN_AND_NONE_IN_PANDAS = [np.nan, \"nan\", \"NaN\", \"Nan\", \"None\"]\n",
    "\n",
    "\n",
    "def cleaner(\n",
    "    data: pd.DataFrame,\n",
    "    dtype_dict: Dict[str, str],\n",
    "    identifiers: Dict[str, str],\n",
    "    target: str,\n",
    "    mode: str,\n",
    "    timeseries_settings: TimeseriesSettings,\n",
    "    anomaly_detection: bool,\n",
    "    custom_cleaning_functions: Dict[str, str] = {},\n",
    ") -> pd.DataFrame:\n",
    "    \"\"\"\n",
    "    The cleaner is a function which takes in the raw data, plus additional information about it's types and about the problem. Based on this it generates a \"clean\" representation of the data, where each column has an ideal standardized type and all malformed or otherwise missing or invalid elements are turned into ``None``\n",
    "\n",
    "    :param data: The raw data\n",
    "    :param dtype_dict: Type information for each column\n",
    "    :param identifiers: A dict containing all identifier typed columns\n",
    "    :param target: The target columns\n",
    "    :param mode: Can be \"predict\" or \"train\"\n",
    "    :param timeseries_settings: Timeseries related settings, only relevant for timeseries predictors, otherwise can be the default object\n",
    "    :param anomaly_detection: Are we detecting anomalies with this predictor?\n",
    "\n",
    "    :returns: The cleaned data\n",
    "    \"\"\"  # noqa\n",
    "\n",
    "    data = _remove_columns(\n",
    "        data,\n",
    "        identifiers,\n",
    "        target,\n",
    "        mode,\n",
    "        timeseries_settings,\n",
    "        anomaly_detection,\n",
    "        dtype_dict,\n",
    "    )\n",
    "\n",
    "    for col in _get_columns_to_clean(data, dtype_dict, mode, target):\n",
    "\n",
    "        log.info(\"Cleaning column =\" + str(col))\n",
    "        # Get and apply a cleaning function for each data type\n",
    "        # If you want to customize the cleaner, it's likely you can to modify ``get_cleaning_func``\n",
    "        fn, vec = get_cleaning_func(dtype_dict[col], custom_cleaning_functions)\n",
    "        if not vec:\n",
    "            data[col] = data[col].apply(fn)\n",
    "        if vec:\n",
    "            data[col] = fn(data[col])\n",
    "\n",
    "        # ------------------------ #\n",
    "        # INTRODUCE YOUR CUSTOM BLOCK\n",
    "\n",
    "        # If column data type is a text type, remove stop-words\n",
    "        if dtype_dict[col] in (dtype.rich_text, dtype.short_text):\n",
    "            data[col] = data[col].apply(\n",
    "                lambda x: \" \".join(\n",
    "                    [word for word in x.split() if word not in stop_words]\n",
    "                )\n",
    "            )\n",
    "\n",
    "        # Enforce numerical columns as non-negative\n",
    "        if dtype_dict[col] in (dtype.integer, dtype.float):\n",
    "            log.info(\"Converted \" + str(col) + \" into strictly non-negative\")\n",
    "            data[col] = data[col].apply(lambda x: x if x > 0 else 0.0)\n",
    "\n",
    "        # ------------------------ #\n",
    "        data[col] = data[col].replace(\n",
    "            to_replace=VALUES_FOR_NAN_AND_NONE_IN_PANDAS, value=None\n",
    "        )\n",
    "\n",
    "    return data\n"
   ]
  },
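  {
   "cell_type": "markdown",
   "id": "toy-custom-block",
   "metadata": {},
   "source": [
    "To see what the custom block in `MyCustomCleaner.py` does in isolation, here is a minimal sketch that applies the same two rules to plain Python values. It assumes a tiny hand-written stop-word set (`TOY_STOP_WORDS`) in place of the NLTK corpus, and the helper names `remove_stop_words` and `clamp_non_negative` are hypothetical; they are not part of Lightwood or `dataprep_ml`.\n",
    "\n",
    "```python\n",
    "# Hypothetical stand-in for the NLTK English stop-word list\n",
    "TOY_STOP_WORDS = {'the', 'a', 'to', 'was', 'of'}\n",
    "\n",
    "\n",
    "def remove_stop_words(text, stop_words=TOY_STOP_WORDS):\n",
    "    # Same whitespace split and case-sensitive membership test as the custom block\n",
    "    return ' '.join(word for word in text.split() if word not in stop_words)\n",
    "\n",
    "\n",
    "def clamp_non_negative(value):\n",
    "    # Same rule as the custom block: positive values pass through, the rest become 0.0\n",
    "    return value if value > 0 else 0.0\n",
    "\n",
    "\n",
    "print(remove_stop_words('the snow departed to the north'))\n",
    "print(clamp_non_negative(-0.340259), clamp_non_negative(0.247197))\n",
    "```\n",
    "\n",
    "Because the membership test is case-sensitive and the NLTK English list is lowercase, a capitalized word such as `The` is kept. Lowercasing the text first is one option if that matters for your use case."
   ]
  },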
  {
   "cell_type": "markdown",
   "id": "radical-armenia",
   "metadata": {},
   "source": [
    "#### Place your custom module in `~/lightwood_modules` or `/etc/lightwood_modules`\n",
    "\n",
    "We automatically search for custom scripts in your `~/lightwood_modules` and `/etc/lightwood_modules` path. Place your file there. Later, you'll see when we autogenerate code, that you can change your import location if you choose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f030f8ca",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.255730Z",
     "iopub.status.busy": "2025-03-25T10:07:45.255513Z",
     "iopub.status.idle": "2025-03-25T10:07:45.258852Z",
     "shell.execute_reply": "2025-03-25T10:07:45.258292Z"
    }
   },
   "outputs": [],
   "source": [
    "from lightwood import load_custom_module\n",
    "\n",
    "# Lightwood automatically does this for us if we want\n",
    "load_custom_module('MyCustomCleaner.py')"
   ]
  },
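  {
   "cell_type": "markdown",
   "id": "toy-module-loading",
   "metadata": {},
   "source": [
    "Conceptually, loading a custom module means importing a Python file from a known path. The sketch below illustrates the idea with the standard library only; `load_module_from_path` and the throwaway `toy_cleaner.py` file are hypothetical, and this is not a description of how `load_custom_module` is implemented.\n",
    "\n",
    "```python\n",
    "import importlib.util\n",
    "import os\n",
    "import sys\n",
    "import tempfile\n",
    "\n",
    "\n",
    "def load_module_from_path(path):\n",
    "    # Derive the module name from the file name, e.g. toy_cleaner.py -> toy_cleaner\n",
    "    mod_name = os.path.splitext(os.path.basename(path))[0]\n",
    "    spec = importlib.util.spec_from_file_location(mod_name, path)\n",
    "    module = importlib.util.module_from_spec(spec)\n",
    "    spec.loader.exec_module(module)\n",
    "    # Register the module so a later import by name finds it\n",
    "    sys.modules[mod_name] = module\n",
    "    return module\n",
    "\n",
    "\n",
    "# Write a throwaway module to a temporary directory and load it\n",
    "tmp_dir = tempfile.mkdtemp()\n",
    "toy_path = os.path.join(tmp_dir, 'toy_cleaner.py')\n",
    "with open(toy_path, 'w') as f:\n",
    "    print('def cleaner(values):', file=f)\n",
    "    print('    return [v for v in values if v is not None]', file=f)\n",
    "\n",
    "toy = load_module_from_path(toy_path)\n",
    "print(toy.cleaner([1, None, 2]))\n",
    "```\n",
    "\n",
    "Once a module is registered this way, any code that refers to it by name, such as a `module` entry of the form `file_name.function_name`, can resolve the function."
   ]
  },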
  {
   "cell_type": "markdown",
   "id": "characteristic-promotion",
   "metadata": {},
   "source": [
    "### 4) Introduce your custom cleaner in JSON-AI\n",
    "\n",
    "Now let's introduce our custom cleaner. JSON-AI keeps a lightweight syntax but fills in many default modules (like splitting, cleaning). As you can see, it is also agnostic to the origin of the module, as long as it behaves as expected of the other modules that could be used in any given key.\n",
    "\n",
    "For the custom cleaner, we'll work by editing the \"cleaner\" key. We will change properties within it as follows:\n",
    "(1) \"module\" - place the name of the function. In our case it will be \"MyCustomCleaner.cleaner\"\n",
    "(2) \"args\" - any keyword argument specific to your cleaner's internals. \n",
    "\n",
    "This will look as follows:\n",
    "```\n",
    "    \"cleaner\": {\n",
    "        \"module\": \"MyCustomCleaner.cleaner\",\n",
    "        \"args\": {\n",
    "            \"identifiers\": \"$identifiers\",\n",
    "            \"data\": \"data\",\n",
    "            \"dtype_dict\": \"$dtype_dict\",\n",
    "            \"target\": \"$target\",\n",
    "            \"mode\": \"$mode\",\n",
    "            \"timeseries_settings\": \"$problem_definition.timeseries_settings\",\n",
    "            \"anomaly_detection\": \"$problem_definition.anomaly_detection\"\n",
    "        }\n",
    "```\n",
    "\n",
    "You may be wondering what the \"$\" variables reference. In certain cases, we'd like JSON-AI to auto-fill internal variables when automatically generating code, for example, we've already specified the \"target\" - it would be easier to simply refer in a modular sense what that term is. That is what these variables represent.\n",
    "\n",
    "As we borrowed most of the default `Cleaner`; we keep these arguments. In theory, if we were writing much of these details from scratch, we can customize these values as necessary."
   ]
  },
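  {
   "cell_type": "markdown",
   "id": "toy-placeholder-render",
   "metadata": {},
   "source": [
    "One way to picture the \"$\" variables is as placeholders that are rewritten into attribute references when code is generated. The sketch below is only an illustration under that assumption: `render_call` is a hypothetical helper, and the rule that `$name` becomes `self.name` is assumed here rather than taken from Lightwood's source.\n",
    "\n",
    "```python\n",
    "def render_call(module, args):\n",
    "    # Build a call string, rewriting each '$' placeholder into a 'self.' reference\n",
    "    rendered = []\n",
    "    for name, value in args.items():\n",
    "        rendered.append(name + '=' + str(value).replace('$', 'self.'))\n",
    "    return module + '(' + ', '.join(rendered) + ')'\n",
    "\n",
    "\n",
    "# A shortened version of the cleaner entry shown above\n",
    "cleaner_spec = {\n",
    "    'module': 'MyCustomCleaner.cleaner',\n",
    "    'args': {\n",
    "        'data': 'data',\n",
    "        'dtype_dict': '$dtype_dict',\n",
    "        'target': '$target',\n",
    "        'mode': '$mode',\n",
    "    },\n",
    "}\n",
    "\n",
    "print(render_call(cleaner_spec['module'], cleaner_spec['args']))\n",
    "```\n",
    "\n",
    "Values without a \"$\", such as `data`, pass through unchanged, which is why the cleaner entry can mix literal names and placeholders."
   ]
  },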
  {
   "cell_type": "markdown",
   "id": "respiratory-radiation",
   "metadata": {},
   "source": [
    "### 5) Generate Python code representing your ML pipeline\n",
    "\n",
    "Now we're ready to load up our custom JSON-AI and generate the predictor code!\n",
    "\n",
    "We can do this by first reading in our custom json-syntax, and then calling the function `code_from_json_ai`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "floating-patent",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.260931Z",
     "iopub.status.busy": "2025-03-25T10:07:45.260734Z",
     "iopub.status.idle": "2025-03-25T10:07:45.649972Z",
     "shell.execute_reply": "2025-03-25T10:07:45.649284Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "import lightwood\n",
      "from lightwood import __version__ as lightwood_version\n",
      "from lightwood.analysis import *\n",
      "from lightwood.api import *\n",
      "from lightwood.data import *\n",
      "from lightwood.encoder import *\n",
      "from lightwood.ensemble import *\n",
      "from lightwood.helpers.device import *\n",
      "from lightwood.helpers.general import *\n",
      "from lightwood.helpers.ts import *\n",
      "from lightwood.helpers.log import *\n",
      "from lightwood.helpers.numeric import *\n",
      "from lightwood.helpers.parallelism import *\n",
      "from lightwood.helpers.seed import *\n",
      "from lightwood.helpers.text import *\n",
      "from lightwood.helpers.torch import *\n",
      "from lightwood.mixer import *\n",
      "\n",
      "from dataprep_ml.insights import statistical_analysis\n",
      "from dataprep_ml.cleaners import cleaner\n",
      "from dataprep_ml.splitters import splitter\n",
      "from dataprep_ml.imputers import *\n",
      "\n",
      "from mindsdb_evaluator import evaluate_accuracies\n",
      "from mindsdb_evaluator.accuracy import __all__ as mdb_eval_accuracy_metrics\n",
      "\n",
      "import pandas as pd\n",
      "from typing import Dict, List, Union, Optional\n",
      "import os\n",
      "from types import ModuleType\n",
      "import importlib.machinery\n",
      "import sys\n",
      "import time\n",
      "\n",
      "\n",
      "for import_dir in [\n",
      "    os.path.join(\n",
      "        os.path.expanduser(\"~/lightwood_modules\"), lightwood_version.replace(\".\", \"_\")\n",
      "    ),\n",
      "    os.path.join(\"/etc/lightwood_modules\", lightwood_version.replace(\".\", \"_\")),\n",
      "]:\n",
      "    if os.path.exists(import_dir) and os.access(import_dir, os.R_OK):\n",
      "        for file_name in list(os.walk(import_dir))[0][2]:\n",
      "            if file_name[-3:] != \".py\":\n",
      "                continue\n",
      "            mod_name = file_name[:-3]\n",
      "            loader = importlib.machinery.SourceFileLoader(\n",
      "                mod_name, os.path.join(import_dir, file_name)\n",
      "            )\n",
      "            module = ModuleType(loader.name)\n",
      "            loader.exec_module(module)\n",
      "            sys.modules[mod_name] = module\n",
      "            exec(f\"import {mod_name}\")\n",
      "\n",
      "\n",
      "class Predictor(PredictorInterface):\n",
      "    target: str\n",
      "    mixers: List[BaseMixer]\n",
      "    encoders: Dict[str, BaseEncoder]\n",
      "    ensemble: BaseEnsemble\n",
      "    mode: str\n",
      "\n",
      "    def __init__(self):\n",
      "        seed(1)\n",
      "        self.target = \"target\"\n",
      "        self.mode = \"inactive\"\n",
      "        self.problem_definition = ProblemDefinition.from_dict(\n",
      "            {\n",
      "                \"target\": \"target\",\n",
      "                \"pct_invalid\": 2,\n",
      "                \"unbias_target\": True,\n",
      "                \"seconds_per_mixer\": 21384.0,\n",
      "                \"seconds_per_encoder\": 85536.0,\n",
      "                \"expected_additional_time\": 15.390582799911499,\n",
      "                \"time_aim\": 259200,\n",
      "                \"target_weights\": None,\n",
      "                \"positive_domain\": False,\n",
      "                \"timeseries_settings\": {\n",
      "                    \"is_timeseries\": False,\n",
      "                    \"order_by\": None,\n",
      "                    \"window\": None,\n",
      "                    \"group_by\": None,\n",
      "                    \"use_previous_target\": True,\n",
      "                    \"horizon\": None,\n",
      "                    \"historical_columns\": None,\n",
      "                    \"target_type\": \"\",\n",
      "                    \"allow_incomplete_history\": True,\n",
      "                    \"eval_incomplete\": False,\n",
      "                    \"interval_periods\": [],\n",
      "                },\n",
      "                \"anomaly_detection\": False,\n",
      "                \"use_default_analysis\": True,\n",
      "                \"embedding_only\": False,\n",
      "                \"dtype_dict\": {},\n",
      "                \"ignore_features\": [\"url_legal\", \"license\", \"standard_error\"],\n",
      "                \"fit_on_all\": True,\n",
      "                \"strict_mode\": True,\n",
      "                \"seed_nr\": 1,\n",
      "            }\n",
      "        )\n",
      "        self.accuracy_functions = [\"r2_score\"]\n",
      "        self.identifiers = {\"id\": \"Hash-like identifier\"}\n",
      "        self.dtype_dict = {\"excerpt\": \"rich_text\", \"target\": \"float\"}\n",
      "        self.lightwood_version = \"25.3.3.3\"\n",
      "        self.pred_args = PredictionArguments()\n",
      "\n",
      "        # Any feature-column dependencies\n",
      "        self.dependencies = {\"target\": [], \"excerpt\": []}\n",
      "\n",
      "        self.input_cols = [\"excerpt\"]\n",
      "\n",
      "        # Initial stats analysis\n",
      "        self.statistical_analysis = None\n",
      "        self.ts_analysis = None\n",
      "        self.runtime_log = dict()\n",
      "        self.global_insights = dict()\n",
      "\n",
      "        # Feature cache\n",
      "        self.feature_cache = dict()\n",
      "\n",
      "    @timed_predictor\n",
      "    def analyze_data(self, data: pd.DataFrame) -> None:\n",
      "        # Perform a statistical analysis on the unprocessed data\n",
      "\n",
      "        self.statistical_analysis = statistical_analysis(\n",
      "            data,\n",
      "            self.dtype_dict,\n",
      "            self.problem_definition.to_dict(),\n",
      "            {\"id\": \"Hash-like identifier\"},\n",
      "        )\n",
      "\n",
      "        # Instantiate post-training evaluation\n",
      "        self.analysis_blocks = [\n",
      "            ICP(fixed_significance=None, confidence_normalizer=False, deps=[]),\n",
      "            ConfStats(deps=[\"ICP\"]),\n",
      "            AccStats(deps=[\"ICP\"]),\n",
      "            PermutationFeatureImportance(deps=[\"AccStats\"]),\n",
      "        ]\n",
      "\n",
      "    @timed_predictor\n",
      "    def preprocess(self, data: pd.DataFrame) -> pd.DataFrame:\n",
      "        # Preprocess and clean data\n",
      "\n",
      "        log.info(\"Cleaning the data\")\n",
      "        self.imputers = {}\n",
      "        data = MyCustomCleaner.cleaner(\n",
      "            data=data,\n",
      "            identifiers=self.identifiers,\n",
      "            dtype_dict=self.dtype_dict,\n",
      "            target=self.target,\n",
      "            mode=self.mode,\n",
      "            timeseries_settings=self.problem_definition.timeseries_settings.to_dict(),\n",
      "            anomaly_detection=self.problem_definition.anomaly_detection,\n",
      "        )\n",
      "\n",
      "        # Time-series blocks\n",
      "\n",
      "        return data\n",
      "\n",
      "    @timed_predictor\n",
      "    def split(self, data: pd.DataFrame) -> Dict[str, pd.DataFrame]:\n",
      "        # Split the data into training/testing splits\n",
      "\n",
      "        log.info(\"Splitting the data into train/test\")\n",
      "        train_test_data = splitter(\n",
      "            data=data,\n",
      "            pct_train=0.8,\n",
      "            pct_dev=0.1,\n",
      "            pct_test=0.1,\n",
      "            tss=self.problem_definition.timeseries_settings.to_dict(),\n",
      "            seed=self.problem_definition.seed_nr,\n",
      "            target=self.target,\n",
      "            dtype_dict=self.dtype_dict,\n",
      "        )\n",
      "\n",
      "        return train_test_data\n",
      "\n",
      "    @timed_predictor\n",
      "    def prepare(self, data: Dict[str, pd.DataFrame]) -> None:\n",
      "        # Prepare encoders to featurize data\n",
      "\n",
      "        self.mode = \"train\"\n",
      "\n",
      "        if self.statistical_analysis is None:\n",
      "            raise Exception(\"Please run analyze_data first\")\n",
      "\n",
      "        # Column to encoder mapping\n",
      "        self.encoders = {\n",
      "            \"target\": NumericEncoder(\n",
      "                is_target=True,\n",
      "                positive_domain=self.statistical_analysis.positive_domain,\n",
      "            ),\n",
      "            \"excerpt\": PretrainedLangEncoder(\n",
      "                output_type=self.dtype_dict[self.target],\n",
      "                stop_after=self.problem_definition.seconds_per_encoder,\n",
      "            ),\n",
      "        }\n",
      "\n",
      "        # Prepare the training + dev data\n",
      "        concatenated_train_dev = pd.concat([data[\"train\"], data[\"dev\"]])\n",
      "\n",
      "        prepped_encoders = {}\n",
      "\n",
      "        # Prepare input encoders\n",
      "        parallel_encoding = parallel_encoding_check(data[\"train\"], self.encoders)\n",
      "\n",
      "        if parallel_encoding:\n",
      "            log.debug(\"Preparing in parallel...\")\n",
      "            for col_name, encoder in self.encoders.items():\n",
      "                if col_name != self.target and not encoder.is_trainable_encoder:\n",
      "                    prepped_encoders[col_name] = (\n",
      "                        encoder,\n",
      "                        concatenated_train_dev[col_name],\n",
      "                        \"prepare\",\n",
      "                    )\n",
      "            prepped_encoders = mut_method_call(prepped_encoders)\n",
      "\n",
      "        else:\n",
      "            log.debug(\"Preparing sequentially...\")\n",
      "            for col_name, encoder in self.encoders.items():\n",
      "                if col_name != self.target and not encoder.is_trainable_encoder:\n",
      "                    log.debug(f\"Preparing encoder for {col_name}...\")\n",
      "                    encoder.prepare(concatenated_train_dev[col_name])\n",
      "                    prepped_encoders[col_name] = encoder\n",
      "\n",
      "        # Store encoders\n",
      "        for col_name, encoder in prepped_encoders.items():\n",
      "            self.encoders[col_name] = encoder\n",
      "\n",
      "        # Prepare the target\n",
      "        if self.target not in prepped_encoders:\n",
      "            if self.encoders[self.target].is_trainable_encoder:\n",
      "                self.encoders[self.target].prepare(\n",
      "                    data[\"train\"][self.target], data[\"dev\"][self.target]\n",
      "                )\n",
      "            else:\n",
      "                self.encoders[self.target].prepare(\n",
      "                    pd.concat([data[\"train\"], data[\"dev\"]])[self.target]\n",
      "                )\n",
      "\n",
      "        # Prepare any non-target encoders that are learned\n",
      "        for col_name, encoder in self.encoders.items():\n",
      "            if col_name != self.target and encoder.is_trainable_encoder:\n",
      "                priming_data = pd.concat([data[\"train\"], data[\"dev\"]])\n",
      "                kwargs = {}\n",
      "                if self.dependencies[col_name]:\n",
      "                    kwargs[\"dependency_data\"] = {}\n",
      "                    for col in self.dependencies[col_name]:\n",
      "                        kwargs[\"dependency_data\"][col] = {\n",
      "                            \"original_type\": self.dtype_dict[col],\n",
      "                            \"data\": priming_data[col],\n",
      "                        }\n",
      "\n",
      "                # If an encoder representation requires the target, provide priming data\n",
      "                if hasattr(encoder, \"uses_target\"):\n",
      "                    kwargs[\"encoded_target_values\"] = self.encoders[self.target].encode(\n",
      "                        priming_data[self.target]\n",
      "                    )\n",
      "\n",
      "                encoder.prepare(\n",
      "                    data[\"train\"][col_name], data[\"dev\"][col_name], **kwargs\n",
      "                )\n",
      "\n",
      "    @timed_predictor\n",
      "    def featurize(self, split_data: Dict[str, pd.DataFrame]):\n",
      "        # Featurize data into numerical representations for models\n",
      "\n",
      "        log.info(\"Featurizing the data\")\n",
      "\n",
      "        tss = self.problem_definition.timeseries_settings\n",
      "\n",
      "        feature_data = dict()\n",
      "        for key, data in split_data.items():\n",
      "            if key != \"stratified_on\":\n",
      "\n",
      "                # compute and store two splits - full and filtered (useful for time series post-train analysis)\n",
      "                if key not in self.feature_cache:\n",
      "                    featurized_split = EncodedDs(self.encoders, data, self.target)\n",
      "                    filtered_subset = EncodedDs(\n",
      "                        self.encoders, filter_ts(data, tss), self.target\n",
      "                    )\n",
      "\n",
      "                    for k, s in zip(\n",
      "                        (key, f\"{key}_filtered\"), (featurized_split, filtered_subset)\n",
      "                    ):\n",
      "                        self.feature_cache[k] = s\n",
      "\n",
      "                for k in (key, f\"{key}_filtered\"):\n",
      "                    feature_data[k] = self.feature_cache[k]\n",
      "\n",
      "        return feature_data\n",
      "\n",
      "    @timed_predictor\n",
      "    def fit(self, enc_data: Dict[str, pd.DataFrame]) -> None:\n",
      "        # Fit predictors to estimate target\n",
      "\n",
      "        self.mode = \"train\"\n",
      "\n",
      "        # --------------- #\n",
      "        # Extract data\n",
      "        # --------------- #\n",
      "        # Extract the featurized data into train/dev/test\n",
      "        encoded_train_data = enc_data[\"train\"]\n",
      "        encoded_dev_data = enc_data[\"dev\"]\n",
      "        encoded_test_data = enc_data[\"test_filtered\"]\n",
      "\n",
      "        log.info(\"Training the mixers\")\n",
      "\n",
      "        # --------------- #\n",
      "        # Fit Models\n",
      "        # --------------- #\n",
      "        # Assign list of mixers\n",
      "        self.mixers = [\n",
      "            Neural(\n",
      "                fit_on_dev=True,\n",
      "                search_hyperparameters=True,\n",
      "                net=\"DefaultNet\",\n",
      "                stop_after=self.problem_definition.seconds_per_mixer,\n",
      "                target=self.target,\n",
      "                dtype_dict=self.dtype_dict,\n",
      "                target_encoder=self.encoders[self.target],\n",
      "            ),\n",
      "            XGBoostMixer(\n",
      "                fit_on_dev=True,\n",
      "                use_optuna=True,\n",
      "                stop_after=self.problem_definition.seconds_per_mixer,\n",
      "                target=self.target,\n",
      "                dtype_dict=self.dtype_dict,\n",
      "                input_cols=self.input_cols,\n",
      "                target_encoder=self.encoders[self.target],\n",
      "            ),\n",
      "            Regression(\n",
      "                stop_after=self.problem_definition.seconds_per_mixer,\n",
      "                target=self.target,\n",
      "                dtype_dict=self.dtype_dict,\n",
      "                target_encoder=self.encoders[self.target],\n",
      "            ),\n",
      "            RandomForest(\n",
      "                fit_on_dev=True,\n",
      "                stop_after=self.problem_definition.seconds_per_mixer,\n",
      "                target=self.target,\n",
      "                dtype_dict=self.dtype_dict,\n",
      "                target_encoder=self.encoders[self.target],\n",
      "            ),\n",
      "        ]\n",
      "\n",
      "        # Train mixers\n",
      "        trained_mixers = []\n",
      "        for mixer in self.mixers:\n",
      "            try:\n",
      "                if mixer.trains_once:\n",
      "                    self.fit_mixer(\n",
      "                        mixer,\n",
      "                        ConcatedEncodedDs([encoded_train_data, encoded_dev_data]),\n",
      "                        encoded_test_data,\n",
      "                    )\n",
      "                else:\n",
      "                    self.fit_mixer(mixer, encoded_train_data, encoded_dev_data)\n",
      "                trained_mixers.append(mixer)\n",
      "            except Exception as e:\n",
      "                log.warning(f\"Exception: {e} when training mixer: {mixer}\")\n",
      "                if True and mixer.stable:\n",
      "                    raise e\n",
      "\n",
      "        # Update mixers to trained versions\n",
      "        if not trained_mixers:\n",
      "            raise Exception(\n",
      "                \"No mixers could be trained! Please verify your problem definition or JsonAI model representation.\"\n",
      "            )\n",
      "        self.mixers = trained_mixers\n",
      "\n",
      "        # --------------- #\n",
      "        # Create Ensembles\n",
      "        # --------------- #\n",
      "        log.info(\"Ensembling the mixer\")\n",
      "        # Create an ensemble of mixers to identify best performing model\n",
      "        # Dirty hack\n",
      "        self.ensemble = BestOf(\n",
      "            data=encoded_test_data,\n",
      "            fit=True,\n",
      "            ts_analysis=None,\n",
      "            target=self.target,\n",
      "            mixers=self.mixers,\n",
      "            args=self.pred_args,\n",
      "            accuracy_functions=self.accuracy_functions,\n",
      "        )\n",
      "        self.supports_proba = self.ensemble.supports_proba\n",
      "\n",
      "    @timed_predictor\n",
      "    def fit_mixer(self, mixer, encoded_train_data, encoded_dev_data) -> None:\n",
      "        mixer.fit(encoded_train_data, encoded_dev_data)\n",
      "\n",
      "    @timed_predictor\n",
      "    def analyze_ensemble(self, enc_data: Dict[str, pd.DataFrame]) -> None:\n",
      "        # Evaluate quality of fit for the ensemble of mixers\n",
      "\n",
      "        # --------------- #\n",
      "        # Extract data\n",
      "        # --------------- #\n",
      "        # Extract the featurized data into train/dev/test\n",
      "        encoded_train_data = enc_data[\"train\"]\n",
      "        encoded_dev_data = enc_data[\"dev\"]\n",
      "        encoded_test_data = enc_data[\"test\"]\n",
      "\n",
      "        # --------------- #\n",
      "        # Analyze Ensembles\n",
      "        # --------------- #\n",
      "        log.info(\"Analyzing the ensemble of mixers\")\n",
      "        self.model_analysis, self.runtime_analyzer = model_analyzer(\n",
      "            data=encoded_test_data,\n",
      "            train_data=encoded_train_data,\n",
      "            ts_analysis=None,\n",
      "            stats_info=self.statistical_analysis,\n",
      "            pdef=self.problem_definition,\n",
      "            accuracy_functions=self.accuracy_functions,\n",
      "            predictor=self.ensemble,\n",
      "            target=self.target,\n",
      "            dtype_dict=self.dtype_dict,\n",
      "            analysis_blocks=self.analysis_blocks,\n",
      "        )\n",
      "\n",
      "    @timed_predictor\n",
      "    def learn(self, data: pd.DataFrame) -> None:\n",
      "        if self.problem_definition.ignore_features:\n",
      "            log.info(f\"Dropping features: {self.problem_definition.ignore_features}\")\n",
      "            data = data.drop(\n",
      "                columns=self.problem_definition.ignore_features, errors=\"ignore\"\n",
      "            )\n",
      "\n",
      "        self.mode = \"train\"\n",
      "        n_phases = 8 if self.problem_definition.fit_on_all else 7\n",
      "\n",
      "        # Perform stats analysis\n",
      "        log.info(f\"[Learn phase 1/{n_phases}] - Statistical analysis\")\n",
      "        self.analyze_data(data)\n",
      "\n",
      "        # Pre-process the data\n",
      "        log.info(f\"[Learn phase 2/{n_phases}] - Data preprocessing\")\n",
      "        data = self.preprocess(data)\n",
      "\n",
      "        # Create train/test (dev) split\n",
      "        log.info(f\"[Learn phase 3/{n_phases}] - Data splitting\")\n",
      "        train_dev_test = self.split(data)\n",
      "\n",
      "        # Prepare encoders\n",
      "        log.info(f\"[Learn phase 4/{n_phases}] - Preparing encoders\")\n",
      "        self.prepare(train_dev_test)\n",
      "\n",
      "        # Create feature vectors from data\n",
      "        log.info(f\"[Learn phase 5/{n_phases}] - Feature generation\")\n",
      "        enc_train_test = self.featurize(train_dev_test)\n",
      "\n",
      "        # Prepare mixers\n",
      "        log.info(f\"[Learn phase 6/{n_phases}] - Mixer training\")\n",
      "        if not self.problem_definition.embedding_only:\n",
      "            self.fit(enc_train_test)\n",
      "        else:\n",
      "            self.mixers = []\n",
      "            self.ensemble = Embedder(\n",
      "                self.target, mixers=list(), data=enc_train_test[\"train\"]\n",
      "            )\n",
      "            self.supports_proba = self.ensemble.supports_proba\n",
      "\n",
      "        # Analyze the ensemble\n",
      "        log.info(f\"[Learn phase 7/{n_phases}] - Ensemble analysis\")\n",
      "        self.analyze_ensemble(enc_train_test)\n",
      "\n",
      "        # ------------------------ #\n",
      "        # Enable model partial fit AFTER it is trained and evaluated for performance with the appropriate train/dev/test splits.\n",
      "        # This assumes the predictor could continuously evolve, hence including reserved testing data may improve predictions.\n",
      "        # SET `json_ai.problem_definition.fit_on_all=False` TO TURN THIS BLOCK OFF.\n",
      "\n",
      "        # Update the mixers with partial fit\n",
      "        if self.problem_definition.fit_on_all and all(\n",
      "            [not m.trains_once for m in self.mixers]\n",
      "        ):\n",
      "            log.info(f\"[Learn phase 8/{n_phases}] - Adjustment on validation requested\")\n",
      "            self.adjust(\n",
      "                enc_train_test[\"test\"].data_frame,\n",
      "                ConcatedEncodedDs(\n",
      "                    [enc_train_test[\"train\"], enc_train_test[\"dev\"]]\n",
      "                ).data_frame,\n",
      "                adjust_args={\"learn_call\": True},\n",
      "            )\n",
      "\n",
      "        self.feature_cache = (\n",
      "            dict()\n",
      "        )  # empty feature cache to avoid large predictor objects\n",
      "\n",
      "    @timed_predictor\n",
      "    def adjust(\n",
      "        self,\n",
      "        train_data: Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame],\n",
      "        dev_data: Optional[Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame]] = None,\n",
      "        adjust_args: Optional[dict] = None,\n",
      "    ) -> None:\n",
      "        # Update mixers with new information\n",
      "\n",
      "        self.mode = \"train\"\n",
      "\n",
      "        # --------------- #\n",
      "        # Prepare data\n",
      "        # --------------- #\n",
      "        if dev_data is None:\n",
      "            data = train_data\n",
      "            split = splitter(\n",
      "                data=data,\n",
      "                pct_train=0.8,\n",
      "                pct_dev=0.2,\n",
      "                pct_test=0,\n",
      "                tss=self.problem_definition.timeseries_settings.to_dict(),\n",
      "                seed=self.problem_definition.seed_nr,\n",
      "                target=self.target,\n",
      "                dtype_dict=self.dtype_dict,\n",
      "            )\n",
      "            train_data = split[\"train\"]\n",
      "            dev_data = split[\"dev\"]\n",
      "\n",
      "        if adjust_args is None or not adjust_args.get(\"learn_call\"):\n",
      "            train_data = self.preprocess(train_data)\n",
      "            dev_data = self.preprocess(dev_data)\n",
      "\n",
      "        dev_data = EncodedDs(self.encoders, dev_data, self.target)\n",
      "        train_data = EncodedDs(self.encoders, train_data, self.target)\n",
      "\n",
      "        # --------------- #\n",
      "        # Update/Adjust Mixers\n",
      "        # --------------- #\n",
      "        log.info(\"Updating the mixers\")\n",
      "\n",
      "        for mixer in self.mixers:\n",
      "            mixer.partial_fit(train_data, dev_data, adjust_args)\n",
      "\n",
      "    @timed_predictor\n",
      "    def predict(self, data: pd.DataFrame, args: Dict = {}) -> pd.DataFrame:\n",
      "\n",
      "        self.mode = \"predict\"\n",
      "        n_phases = 3 if self.pred_args.all_mixers else 4\n",
      "\n",
      "        if len(data) == 0:\n",
      "            raise Exception(\n",
      "                \"Empty input, aborting prediction. Please try again with some input data.\"\n",
      "            )\n",
      "\n",
      "        self.pred_args = PredictionArguments.from_dict(args)\n",
      "\n",
      "        log.info(f\"[Predict phase 1/{n_phases}] - Data preprocessing\")\n",
      "        if self.problem_definition.ignore_features:\n",
      "            log.info(f\"Dropping features: {self.problem_definition.ignore_features}\")\n",
      "            data = data.drop(\n",
      "                columns=self.problem_definition.ignore_features, errors=\"ignore\"\n",
      "            )\n",
      "        for col in self.input_cols:\n",
      "            if col not in data.columns:\n",
      "                data[col] = [None] * len(data)\n",
      "\n",
      "        # Pre-process the data\n",
      "        data = self.preprocess(data)\n",
      "\n",
      "        # Featurize the data\n",
      "        log.info(f\"[Predict phase 2/{n_phases}] - Feature generation\")\n",
      "        encoded_ds = self.featurize({\"predict_data\": data})[\"predict_data\"]\n",
      "        encoded_data = encoded_ds.get_encoded_data(include_target=False)\n",
      "\n",
      "        log.info(f\"[Predict phase 3/{n_phases}] - Calling ensemble\")\n",
      "\n",
      "        @timed\n",
      "        def _timed_call(encoded_ds):\n",
      "            if self.pred_args.return_embedding:\n",
      "                embedder = Embedder(self.target, mixers=list(), data=encoded_ds)\n",
      "                df = embedder(encoded_ds, args=self.pred_args)\n",
      "            else:\n",
      "                df = self.ensemble(encoded_ds, args=self.pred_args)\n",
      "            return df\n",
      "\n",
      "        df = _timed_call(encoded_ds)\n",
      "\n",
      "        if not (\n",
      "            any(\n",
      "                [\n",
      "                    self.pred_args.all_mixers,\n",
      "                    self.pred_args.return_embedding,\n",
      "                    self.problem_definition.embedding_only,\n",
      "                ]\n",
      "            )\n",
      "        ):\n",
      "            log.info(f\"[Predict phase 4/{n_phases}] - Analyzing output\")\n",
      "            df, global_insights = explain(\n",
      "                data=data,\n",
      "                encoded_data=encoded_data,\n",
      "                predictions=df,\n",
      "                ts_analysis=None,\n",
      "                problem_definition=self.problem_definition,\n",
      "                stat_analysis=self.statistical_analysis,\n",
      "                runtime_analysis=self.runtime_analyzer,\n",
      "                target_name=self.target,\n",
      "                target_dtype=self.dtype_dict[self.target],\n",
      "                explainer_blocks=self.analysis_blocks,\n",
      "                pred_args=self.pred_args,\n",
      "            )\n",
      "            self.global_insights = {**self.global_insights, **global_insights}\n",
      "\n",
      "        self.feature_cache = (\n",
      "            dict()\n",
      "        )  # empty feature cache to avoid large predictor objects\n",
      "\n",
      "        return df\n",
      "\n",
      "    def test(\n",
      "        self,\n",
      "        data: pd.DataFrame,\n",
      "        metrics: list,\n",
      "        args: Dict[str, object] = {},\n",
      "        strict: bool = False,\n",
      "    ) -> pd.DataFrame:\n",
      "\n",
      "        preds = self.predict(data, args)\n",
      "        preds = preds.rename(columns={\"prediction\": self.target})\n",
      "        filtered = []\n",
      "\n",
      "        # filter metrics if not supported\n",
      "        for metric in metrics:\n",
      "            # metric should be one of: an actual function, registered in the model class, or supported by the evaluator\n",
      "            if not (\n",
      "                callable(metric)\n",
      "                or metric in self.accuracy_functions\n",
      "                or metric in mdb_eval_accuracy_metrics\n",
      "            ):\n",
      "                if strict:\n",
      "                    raise Exception(f\"Invalid metric: {metric}\")\n",
      "                else:\n",
      "                    log.warning(f\"Invalid metric: {metric}. Skipping...\")\n",
      "            else:\n",
      "                filtered.append(metric)\n",
      "\n",
      "        metrics = filtered\n",
      "        try:\n",
      "            labels = self.model_analysis.histograms[self.target][\"x\"]\n",
      "        except:\n",
      "            if strict:\n",
      "                raise Exception(\"Label histogram not found\")\n",
      "            else:\n",
      "                label_map = (\n",
      "                    None  # some accuracy functions will crash without this, be mindful\n",
      "                )\n",
      "        scores = evaluate_accuracies(\n",
      "            data,\n",
      "            preds[self.target],\n",
      "            self.target,\n",
      "            metrics,\n",
      "            ts_analysis=self.ts_analysis,\n",
      "            labels=labels,\n",
      "        )\n",
      "\n",
      "        # TODO: remove once mdb_eval returns an actual list\n",
      "        scores = {k: [v] for k, v in scores.items() if not isinstance(v, list)}\n",
      "\n",
      "        return pd.DataFrame.from_records(\n",
      "            scores\n",
      "        )  # TODO: add logic to disaggregate per-mixer\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Make changes to your JSON-AI\n",
    "json_ai.cleaner = {\n",
    "        \"module\": \"MyCustomCleaner.cleaner\",\n",
    "        \"args\": {\n",
    "            \"identifiers\": \"$identifiers\",\n",
    "            \"data\": \"data\",\n",
    "            \"dtype_dict\": \"$dtype_dict\",\n",
    "            \"target\": \"$target\",\n",
    "            \"mode\": \"$mode\",\n",
    "            \"timeseries_settings\": \"$problem_definition.timeseries_settings.to_dict()\",\n",
    "            \"anomaly_detection\": \"$problem_definition.anomaly_detection\"\n",
    "        }\n",
    "}\n",
    "\n",
    "#Generate python code that fills in your pipeline\n",
    "code = code_from_json_ai(json_ai)\n",
    "\n",
    "print(code)\n",
    "\n",
    "# Save code to a file (Optional)\n",
    "with open('custom_cleaner_pipeline.py', 'w') as fp:\n",
    "    fp.write(code)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "handled-oasis",
   "metadata": {},
   "source": [
    "As you can see, an end-to-end pipeline of our entire ML procedure has been generating. There are several abstracted functions to enable transparency as to what processes your data goes through in order to build these models.\n",
    "\n",
    "The key steps of the pipeline are as follows:\n",
    "\n",
    "(1) Run a **statistical analysis** with `analyze_data` <br>\n",
    "(2) Clean your data with `preprocess` <br>\n",
    "(3) Make a training/dev/testing split with `split` <br>\n",
    "(4) Prepare your feature-engineering pipelines with `prepare` <br>\n",
    "(5) Create your features with `featurize` <br>\n",
    "(6) Fit your predictor models with `fit` <br>\n",
    "\n",
    "You can customize this further if necessary, but you have all the steps necessary to train a model!\n",
    "\n",
    "We recommend familiarizing with these steps by calling the above commands, ideally in order. Some commands (namely `prepare`, `featurize`, and `fit`) do depend on other steps.\n",
    "\n",
    "If you want to omit the individual steps, we recommend your simply call the `learn` method, which compiles all the necessary steps implemented to give your fully trained predictive models starting with unprocessed data! "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "meaning-saskatchewan",
   "metadata": {},
   "source": [
    "### 6) Call python to run your code and see your preprocessed outputs\n",
    "\n",
    "Once we have code, we can turn this into a python object by calling `predictor_from_code`. This instantiates the `PredictorInterface` object. \n",
    "\n",
    "This predictor object can be then used to run your pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "violent-guard",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.652137Z",
     "iopub.status.busy": "2025-03-25T10:07:45.651927Z",
     "iopub.status.idle": "2025-03-25T10:07:45.659363Z",
     "shell.execute_reply": "2025-03-25T10:07:45.658873Z"
    }
   },
   "outputs": [],
   "source": [
    "# Turn the code above into a predictor object\n",
    "predictor = predictor_from_code(code)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "closing-episode",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.661193Z",
     "iopub.status.busy": "2025-03-25T10:07:45.660999Z",
     "iopub.status.idle": "2025-03-25T10:07:45.794557Z",
     "shell.execute_reply": "2025-03-25T10:07:45.793986Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Starting statistical analysis\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Dropping features: ['id']\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Finished statistical analysis\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[37mDEBUG:lightwood-3299: `analyze_data` runtime: 0.05 seconds\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Cleaning the data\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:dataprep_ml-3299:Dropping features: ['id']\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:lightwood-3299:Cleaning column =excerpt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:lightwood-3299:Cleaning column =target\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32mINFO:lightwood-3299:Converted target into strictly non-negative\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[37mDEBUG:lightwood-3299: `preprocess` runtime: 0.07 seconds\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>excerpt</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>When young people returned ballroom, presented...</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>All dinner time, Mrs. Fayre somewhat silent, e...</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>As Roger predicted, snow departed quickly came...</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>And outside palace great garden walled round, ...</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Once upon time Three Bears lived together hous...</td>\n",
       "      <td>0.247197</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             excerpt    target\n",
       "0  When young people returned ballroom, presented...  0.000000\n",
       "1  All dinner time, Mrs. Fayre somewhat silent, e...  0.000000\n",
       "2  As Roger predicted, snow departed quickly came...  0.000000\n",
       "3  And outside palace great garden walled round, ...  0.000000\n",
       "4  Once upon time Three Bears lived together hous...  0.247197"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictor.mode = \"train\"\n",
    "\n",
    "# Perform stats analysis\n",
    "predictor.analyze_data(data)\n",
    "\n",
    "# Pre-process the data\n",
    "cleaned_data = predictor.preprocess(data)\n",
    "\n",
    "cleaned_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "major-stake",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-03-25T10:07:45.796741Z",
     "iopub.status.busy": "2025-03-25T10:07:45.796338Z",
     "iopub.status.idle": "2025-03-25T10:07:45.801189Z",
     "shell.execute_reply": "2025-03-25T10:07:45.800559Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1mOriginal Data\n",
      "\u001b[0m\n",
      "Excerpt:\n",
      " When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\n",
      "The floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\n",
      "At each end of the room, on the wall, hung a beautiful bear-skin rug.\n",
      "These rugs were for prizes, one for the girls and one for the boys. And this was the game.\n",
      "The girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\n",
      "This would have been an easy matter, but each traveller was obliged to wear snowshoes.\n",
      "\n",
      "Target:\n",
      " -0.340259125\n",
      "\u001b[1m\n",
      "\n",
      "Cleaned Data\n",
      "\u001b[0m\n",
      "Excerpt:\n",
      " When young people returned ballroom, presented decidedly changed appearance. Instead interior scene, winter landscape. The floor covered snow-white canvas, laid smoothly, rumpled bumps hillocks, like real snow field. The numerous palms evergreens decorated room, powdered flour strewn tufts cotton, like snow. Also diamond dust lightly sprinkled them, glittering crystal icicles hung branches. At end room, wall, hung beautiful bear-skin rug. These rugs prizes, one girls one boys. And game. The girls gathered one end room boys other, one end called North Pole, South Pole. Each player given small flag plant reaching Pole. This would easy matter, traveller obliged wear snowshoes.\n",
      "\n",
      "Target:\n",
      " 0.0\n"
     ]
    }
   ],
   "source": [
    "print(\"\\033[1m\"  + \"Original Data\\n\" + \"\\033[0m\")\n",
    "print(\"Excerpt:\\n\", data.iloc[0][\"excerpt\"])\n",
    "print(\"\\nTarget:\\n\", data.iloc[0][\"target\"])\n",
    "\n",
    "print(\"\\033[1m\"  + \"\\n\\nCleaned Data\\n\" + \"\\033[0m\")\n",
    "print(\"Excerpt:\\n\", cleaned_data.iloc[0][\"excerpt\"])\n",
    "print(\"\\nTarget:\\n\", cleaned_data.iloc[0][\"target\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "celtic-scientist",
   "metadata": {},
   "source": [
    "As you can see, the cleaning-process we introduced cut out the stop-words from the Excerpt, and enforced the target data to stay positive.\n",
    "\n",
    "We hope this tutorial was informative on how to introduce a **custom preprocessing method** to your datasets! For more customization tutorials, please check our [documentation](https://lightwood.io/tutorials.html).\n",
    "\n",
    "If you want to download the Jupyter-notebook version of this tutorial, check out the source github location found here: `lightwood/docssrc/source/tutorials/custom_cleaner`. "
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.21"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}