{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial - Implementing a custom mixer in Lightwood\n", "\n", "\n", "## Introduction\n", "\n", "Mixers are the center piece of lightwood, tasked with learning the mapping between the encoded feature and target representation\n", "\n", "\n", "## Objective\n", "\n", "In this tutorial we'll be trying to implement a sklearn random forest as a mixer that handles categorical and binary targets. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: The Mixer Interface\n", "\n", "The Mixer interface is defined by the `BaseMixer` class, a mixer needs methods for 4 tasks:\n", "* fitting (`fit`)\n", "* predicting (`__call__`)\n", "* construction (`__init__`)\n", "* partial fitting (`partial_fit`), though this one is optional" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Writing our mixer\n", "\n", "I'm going to create a file called `random_forest_mixer.py` inside `/etc/lightwood_modules`, this is where lightwood sources custom modules from.\n", "\n", "Inside of it I'm going to write the following code:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:05.251845Z", "iopub.status.busy": "2024-05-07T17:14:05.251655Z", "iopub.status.idle": "2024-05-07T17:14:05.260290Z", "shell.execute_reply": "2024-05-07T17:14:05.259701Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing random_forest_mixer.py\n" ] } ], "source": [ "%%writefile random_forest_mixer.py\n", "\n", "from lightwood.mixer import BaseMixer\n", "from lightwood.api.types import PredictionArguments\n", "from lightwood.data.encoded_ds import EncodedDs, ConcatedEncodedDs\n", "from type_infer.dtype import dtype\n", "from lightwood.encoder import BaseEncoder\n", "\n", "import torch\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "\n", "class RandomForestMixer(BaseMixer):\n", " clf: RandomForestClassifier\n", "\n", " def __init__(self, stop_after: int, dtype_dict: dict, target: str, target_encoder: BaseEncoder):\n", " super().__init__(stop_after)\n", " self.target_encoder = target_encoder\n", " # Throw in case someone tries to use this for a problem that's not classification, I'd fail anyway, but this way the error message is more intuitive\n", " if dtype_dict[target] not in (dtype.categorical, dtype.binary):\n", " raise Exception(f'This mixer can only be used for classification problems! Got target dtype {dtype_dict[target]} instead!')\n", "\n", " # We could also initialize this in `fit` if some of the parameters depend on the input data, since `fit` is called exactly once\n", " self.clf = RandomForestClassifier(max_depth=30)\n", "\n", " def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:\n", " X, Y = [], []\n", " # By default mixers get some train data and a bit of dev data on which to do early stopping or hyper parameter optimization. For this mixer, we don't need dev data, so we're going to concat the two in order to get more training data. 
 " # Then we turn them into an sklearn-friendly format.\n", " for x, y in ConcatedEncodedDs([train_data, dev_data]):\n", " X.append(x.tolist())\n", " Y.append(y.tolist())\n", " self.clf.fit(X, Y)\n", "\n", " def __call__(self, ds: EncodedDs,\n", " args: PredictionArguments = PredictionArguments()) -> pd.DataFrame:\n", " # Turn the data into an sklearn-friendly format\n", " X = []\n", " for x, _ in ds:\n", " X.append(x.tolist())\n", "\n", " Yh = self.clf.predict(X)\n", "\n", " # Lightwood encoders are meant to decode torch tensors, so we have to cast the predictions first\n", " decoded_predictions = self.target_encoder.decode(torch.Tensor(Yh))\n", "\n", " # Finally, turn the decoded predictions into a dataframe with a single column called `prediction`. This is the standard behaviour all lightwood mixers use\n", " ydf = pd.DataFrame({'prediction': decoded_predictions})\n", "\n", " return ydf\n", "\n", " \n", " # We'll skip implementing `partial_fit`, thus making this mixer unsuitable for online training tasks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Using our mixer\n", "\n", "We're going to use our mixer to diagnose heart disease using this dataset: [https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv](https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv)\n", "\n", "First, since we don't want to bother writing a Json AI for this dataset from scratch, we're going to let lightwood auto-generate one." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:05.295024Z", "iopub.status.busy": "2024-05-07T17:14:05.294626Z", "iopub.status.idle": "2024-05-07T17:14:08.346891Z", "shell.execute_reply": "2024-05-07T17:14:08.346231Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:No torchvision detected, image helpers not supported.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:No torchvision/pillow detected, image encoder not supported\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Analyzing a sample of 298\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:from a total population of 303, this is equivalent to 98.3% of your data.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: age\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column age has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: sex\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column sex has data type binary\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: cp\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column cp has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: trestbps\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column trestbps has data type integer\u001b[0m\n" ] }, { "name":
"stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: chol\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column chol has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: fbs\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column fbs has data type binary\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: restecg\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column restecg has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: thalach\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column thalach has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: exang\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column exang has data type binary\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: oldpeak\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column oldpeak has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: slope\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column slope has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: ca\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column ca has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: thal\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column thal has data type categorical\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Infering type for: target\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2524:Column target has data type binary\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Finished statistical analysis\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"encoders\": {\n", " \"target\": {\n", " \"module\": \"BinaryEncoder\",\n", " \"args\": {\n", " \"is_target\": \"True\",\n", " \"target_weights\": \"$statistical_analysis.target_weights\"\n", " }\n", " },\n", " \"age\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"sex\": {\n", " \"module\": \"BinaryEncoder\",\n", " \"args\": {}\n", " },\n", " \"cp\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " },\n", " \"trestbps\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " 
},\n", " \"chol\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"fbs\": {\n", " \"module\": \"BinaryEncoder\",\n", " \"args\": {}\n", " },\n", " \"restecg\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " },\n", " \"thalach\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"exang\": {\n", " \"module\": \"BinaryEncoder\",\n", " \"args\": {}\n", " },\n", " \"oldpeak\": {\n", " \"module\": \"NumericEncoder\",\n", " \"args\": {}\n", " },\n", " \"slope\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " },\n", " \"ca\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " },\n", " \"thal\": {\n", " \"module\": \"OneHotEncoder\",\n", " \"args\": {}\n", " }\n", " },\n", " \"dtype_dict\": {\n", " \"age\": \"integer\",\n", " \"sex\": \"binary\",\n", " \"cp\": \"categorical\",\n", " \"trestbps\": \"integer\",\n", " \"chol\": \"integer\",\n", " \"fbs\": \"binary\",\n", " \"restecg\": \"categorical\",\n", " \"thalach\": \"integer\",\n", " \"exang\": \"binary\",\n", " \"oldpeak\": \"float\",\n", " \"slope\": \"categorical\",\n", " \"ca\": \"categorical\",\n", " \"thal\": \"categorical\",\n", " \"target\": \"binary\"\n", " },\n", " \"dependency_dict\": {},\n", " \"model\": {\n", " \"module\": \"BestOf\",\n", " \"args\": {\n", " \"submodels\": [\n", " {\n", " \"module\": \"Neural\",\n", " \"args\": {\n", " \"fit_on_dev\": true,\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"search_hyperparameters\": true\n", " }\n", " },\n", " {\n", " \"module\": \"XGBoostMixer\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"fit_on_dev\": true\n", " }\n", " },\n", " {\n", " \"module\": \"Regression\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\"\n", " }\n", " },\n", " {\n", " \"module\": \"RandomForest\",\n", " \"args\": {\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"fit_on_dev\": true\n", " }\n", " }\n", " ]\n", " }\n", " },\n", " \"problem_definition\": {\n", " \"target\": \"target\",\n", " \"pct_invalid\": 2,\n", " \"unbias_target\": true,\n", " \"seconds_per_mixer\": 42768.0,\n", " \"seconds_per_encoder\": null,\n", " \"expected_additional_time\": 0.06697702407836914,\n", " \"time_aim\": 259200,\n", " \"target_weights\": null,\n", " \"positive_domain\": false,\n", " \"timeseries_settings\": {\n", " \"is_timeseries\": false,\n", " \"order_by\": null,\n", " \"window\": null,\n", " \"group_by\": null,\n", " \"use_previous_target\": true,\n", " \"horizon\": null,\n", " \"historical_columns\": null,\n", " \"target_type\": \"\",\n", " \"allow_incomplete_history\": true,\n", " \"eval_incomplete\": false,\n", " \"interval_periods\": []\n", " },\n", " \"anomaly_detection\": false,\n", " \"use_default_analysis\": true,\n", " \"embedding_only\": false,\n", " \"dtype_dict\": {},\n", " \"ignore_features\": [],\n", " \"fit_on_all\": true,\n", " \"strict_mode\": true,\n", " \"seed_nr\": 1\n", " },\n", " \"identifiers\": {},\n", " \"imputers\": [],\n", " \"accuracy_functions\": [\n", " \"balanced_accuracy_score\"\n", " ]\n", "}\n" ] } ], "source": [ "from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, load_custom_module\n", "import pandas as pd\n", "\n", "# load the code\n", "load_custom_module('random_forest_mixer.py')\n", "\n", "# read dataset\n", "df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/heart_disease/data.csv')\n", 
"\n", "# define the predictive task\n", "pdef = ProblemDefinition.from_dict({\n", " 'target': 'target', # column you want to predict\n", "})\n", "\n", "# generate the Json AI intermediate representation from the data and its corresponding settings\n", "json_ai = json_ai_from_problem(df, problem_definition=pdef)\n", "\n", "# Print it (you can also put it in a file and edit it there)\n", "print(json_ai.to_json())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have to edit the `mixers` key of this json ai to tell lightwood to use our custom mixer. We can use it together with the others, and have it ensembled with them at the end, or standalone. In this case I'm going to replace all existing mixers with this one" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:08.349819Z", "iopub.status.busy": "2024-05-07T17:14:08.349269Z", "iopub.status.idle": "2024-05-07T17:14:08.352728Z", "shell.execute_reply": "2024-05-07T17:14:08.352173Z" } }, "outputs": [], "source": [ "json_ai.model['args']['submodels'] = [{\n", " 'module': 'random_forest_mixer.RandomForestMixer',\n", " 'args': {\n", " 'stop_after': '$problem_definition.seconds_per_mixer',\n", " 'dtype_dict': '$dtype_dict',\n", " 'target': '$target',\n", " 'target_encoder': '$encoders[self.target]'\n", "\n", " }\n", "}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we'll generate some code, and finally turn that code into a predictor object and fit it on the original data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:08.355084Z", "iopub.status.busy": "2024-05-07T17:14:08.354887Z", "iopub.status.idle": "2024-05-07T17:14:08.687170Z", "shell.execute_reply": "2024-05-07T17:14:08.686543Z" } }, "outputs": [], "source": [ "from lightwood.api.high_level import code_from_json_ai, predictor_from_code\n", "\n", "code = code_from_json_ai(json_ai)\n", "predictor = predictor_from_code(code)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:08.689873Z", "iopub.status.busy": "2024-05-07T17:14:08.689664Z", "iopub.status.idle": "2024-05-07T17:14:09.308058Z", "shell.execute_reply": "2024-05-07T17:14:09.307539Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 1/8] - Statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Finished statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `analyze_data` runtime: 0.03 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 2/8] - Data preprocessing\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `preprocess` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 3/8] - Data splitting\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Splitting the 
data into train/test\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `split` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 4/8] - Preparing encoders\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing sequentially...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for age...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for sex...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for cp...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for trestbps...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for chol...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for fbs...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for restecg...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for thalach...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for exang...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for oldpeak...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for slope...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for ca...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2524:Preparing encoder for thal...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524:Encoding UNKNOWN categories as index 0\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `prepare` runtime: 0.02 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 5/8] - Feature generation\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Featurizing the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `featurize` runtime: 0.09 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\u001b[32mINFO:dataprep_ml-2524:[Learn phase 6/8] - Mixer training\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Training the mixers\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `fit_mixer` runtime: 0.12 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Ensembling the mixer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:Mixer: RandomForestMixer got accuracy: 0.798\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:Picked best mixer: RandomForestMixer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `fit` runtime: 0.13 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 7/8] - Ensemble analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Analyzing the ensemble of mixers\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block ICP is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:975: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n", "\u001b[32mINFO:lightwood-2524:The block ConfStats is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block AccStats is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block PermutationFeatureImportance is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:[PFI] Using a random sample (1000 rows out of 31).\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:[PFI] Set to consider first 10 columns out of 10: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak'].\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/sklearn/metrics/_classification.py:2399: UserWarning: y_pred contains classes not in y_true\n", " warnings.warn(\"y_pred contains classes not in y_true\")\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/sklearn/metrics/_classification.py:2399: UserWarning: y_pred contains classes not in y_true\n", " warnings.warn(\"y_pred contains classes not in y_true\")\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `analyze_ensemble` runtime: 0.27 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Learn phase 8/8] - Adjustment on validation requested\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Updating the mixers\u001b[0m\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `adjust` runtime: 0.04 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `learn` runtime: 0.62 seconds\u001b[0m\n" ] } ], "source": [ "predictor.learn(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can use the trained predictor to make some predictions, or save it to a pickle for later use" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:09.310656Z", "iopub.status.busy": "2024-05-07T17:14:09.310451Z", "iopub.status.idle": "2024-05-07T17:14:09.434569Z", "shell.execute_reply": "2024-05-07T17:14:09.434014Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Predict phase 1/4] - Data preprocessing\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `preprocess` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Predict phase 2/4] - Feature generation\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:Featurizing the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in _none_fn (vectorized)\n", " outputs = ufunc(*inputs)\n", "\u001b[37mDEBUG:lightwood-2524: `featurize` runtime: 0.02 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2524:[Predict phase 3/4] - Calling ensemble\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `_timed_call` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\u001b[32mINFO:dataprep_ml-2524:[Predict phase 4/4] - Analyzing output\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block ICP is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block ConfStats is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block AccStats is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:AccStats.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:The block PermutationFeatureImportance is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2524:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `explain` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2524: `predict` runtime: 0.05 seconds\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " original_index prediction confidence\n", "0 0 1 0.073676\n", "1 1 0 0.250612\n", "2 2 0 0.462595\n" ] } ], "source": [ "predictions = predictor.predict(pd.DataFrame({\n", " 'age': [63, 15, None],\n", " 'sex': [1, 1, 0],\n", " 'thal': [3, 1, 1]\n", "}))\n", "print(predictions)\n", "\n", "predictor.save('my_custom_heart_disease_predictor.pickle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it, all it takes to solve a predictive problem with lightwood using your own custom mixer." ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 4 }