{ "cells": [ { "cell_type": "markdown", "id": "israeli-spyware", "metadata": {}, "source": [ "## Build your own training/testing split\n", "\n", "#### Date: 2021.10.07\n", "\n", "When working with machine learning data, splitting it into \"train\", \"dev\" (or validation), and \"test\" sets is important. Models use **train** data to learn representations and update their parameters; **dev** or validation data is reserved to see how the model may perform on unseen data. While the model is not explicitly trained on it, it can be used as a stopping criterion, for hyper-parameter tuning, or as a simple sanity check. Lastly, **test** data is always reserved, hidden from the model, as a final pass to see which models perform best.\n", "\n", "Lightwood supports a variety of **encoders** (feature engineering procedures) and **mixers** (predictor algorithms that go from feature vectors to the target). Given the diversity of algorithms, it is appropriate to split data into these three categories when *preparing* encoders or *fitting* mixers.\n", "\n", "Our default approach stratifies labeled data to ensure that all classes are equally represented across your train, validation, and test sets. However, in many instances you may want a custom technique to build your own splits. We've included the `splitter` functionality (default found in `lightwood.data.splitter`) to enable you to build your own.\n", "\n", "In the following problem, we shall work with a Kaggle dataset on credit card fraud (found [here](https://www.kaggle.com/mlg-ulb/creditcardfraud)). Fraud detection is difficult because the events we are interested in catching are thankfully rare. Because of that, there is a large **imbalance of classes** (in fact, in this dataset, less than 1% of the data belong to the rare class).\n", "\n", "With a supervised technique, we want to ensure our training data includes the rare event of interest. A random shuffle could potentially miss rare events. 
We will implement **SMOTE** to increase the number of positive-class samples in our training data.\n", "\n", "Let's get started!" ] }, { "cell_type": "code", "execution_count": 1, "id": "interim-discussion", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:12:19.505399Z", "iopub.status.busy": "2024-05-07T17:12:19.505191Z", "iopub.status.idle": "2024-05-07T17:12:23.726284Z", "shell.execute_reply": "2024-05-07T17:12:23.725617Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2306:No torchvision detected, image helpers not supported.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2306:No torchvision/pillow detected, image encoder not supported\u001b[0m\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import torch\n", "import nltk\n", "import matplotlib.pyplot as plt\n", "\n", "import os\n", "import sys\n", "\n", "# Lightwood modules\n", "import lightwood as lw\n", "from lightwood import ProblemDefinition, \\\n", "    JsonAI, \\\n", "    json_ai_from_problem, \\\n", "    code_from_json_ai, \\\n", "    predictor_from_code\n", "\n", "import imblearn  # version 0.5.0 or newer is required" ] }, { "cell_type": "markdown", "id": "decimal-techno", "metadata": {}, "source": [ "### 1) Load your data\n", "\n", "Lightwood works with `pandas` DataFrames, so we can use pandas to load our data. The cell below reads a hosted copy of the dataset directly from a URL; if you prefer a local copy, download the dataset from the link above and pass its path to `pd.read_csv` instead." ] }, { "cell_type": "code", "execution_count": 2, "id": "foreign-orchestra", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:12:23.729268Z", "iopub.status.busy": "2024-05-07T17:12:23.728955Z", "iopub.status.idle": "2024-05-07T17:12:30.225282Z", "shell.execute_reply": "2024-05-07T17:12:30.224479Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Time</th><th>V1</th><th>V2</th><th>V3</th><th>V4</th><th>V5</th><th>V6</th><th>V7</th><th>V8</th><th>V9</th><th>...</th><th>V21</th><th>V22</th><th>V23</th><th>V24</th><th>V25</th><th>V26</th><th>V27</th><th>V28</th><th>Amount</th><th>Class</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0.0</td><td>-1.359807</td><td>-0.072781</td><td>2.536347</td><td>1.378155</td><td>-0.338321</td><td>0.462388</td><td>0.239599</td><td>0.098698</td><td>0.363787</td><td>...</td><td>-0.018307</td><td>0.277838</td><td>-0.110474</td><td>0.066928</td><td>0.128539</td><td>-0.189115</td><td>0.133558</td><td>-0.021053</td><td>149.62</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>0.0</td><td>1.191857</td><td>0.266151</td><td>0.166480</td><td>0.448154</td><td>0.060018</td><td>-0.082361</td><td>-0.078803</td><td>0.085102</td><td>-0.255425</td><td>...</td><td>-0.225775</td><td>-0.638672</td><td>0.101288</td><td>-0.339846</td><td>0.167170</td><td>0.125895</td><td>-0.008983</td><td>0.014724</td><td>2.69</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>1.0</td><td>-1.358354</td><td>-1.340163</td><td>1.773209</td><td>0.379780</td><td>-0.503198</td><td>1.800499</td><td>0.791461</td><td>0.247676</td><td>-1.514654</td><td>...</td><td>0.247998</td><td>0.771679</td><td>0.909412</td><td>-0.689281</td><td>-0.327642</td><td>-0.139097</td><td>-0.055353</td><td>-0.059752</td><td>378.66</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>1.0</td><td>-0.966272</td><td>-0.185226</td><td>1.792993</td><td>-0.863291</td><td>-0.010309</td><td>1.247203</td><td>0.237609</td><td>0.377436</td><td>-1.387024</td><td>...</td><td>-0.108300</td><td>0.005274</td><td>-0.190321</td><td>-1.175575</td><td>0.647376</td><td>-0.221929</td><td>0.062723</td><td>0.061458</td><td>123.50</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>2.0</td><td>-1.158233</td><td>0.877737</td><td>1.548718</td><td>0.403034</td><td>-0.407193</td><td>0.095921</td><td>0.592941</td><td>-0.270533</td><td>0.817739</td><td>...</td><td>-0.009431</td><td>0.798278</td><td>-0.137458</td><td>0.141267</td><td>-0.206010</td><td>0.502292</td><td>0.219422</td><td>0.215153</td><td>69.99</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>5 rows × 31 columns</p>\n", "</div>" ], "text/plain": [
"   Time        V1        V2        V3        V4        V5        V6        V7  \\\n",
"0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   \n",
"1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   \n",
"2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   \n",
"3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   \n",
"4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   \n",
"\n",
"         V8        V9  ...       V21       V22       V23       V24       V25  \\\n",
"0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   \n",
"1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   \n",
"2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   \n",
"3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   \n",
"4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   \n",
"\n",
"        V26       V27       V28  Amount  Class  \n",
"0 -0.189115  0.133558 -0.021053  149.62      0  \n",
"1  0.125895 -0.008983  0.014724    2.69      0  \n",
"2 -0.139097 -0.055353 -0.059752  378.66      0  \n",
"3 -0.221929  0.062723  0.061458  123.50      0  \n",
"4  0.502292  0.219422  0.215153   69.99      0  \n",
"\n", "[5 rows x 31 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the data\n", "data = pd.read_csv(\"https://mindsdb-example-data.s3.eu-west-2.amazonaws.com/jupyter/creditcard.csv.zip\")\n", "data.head()" ] }, { "cell_type": "markdown", "id": "rental-contribution", "metadata": {}, "source": [ "We see **31 columns**, most of which appear numerical. For confidentiality reasons, the Kaggle dataset description states that the columns labeled $V_i$ are principal components (PCs) obtained by applying PCA to the original data from the credit card company. Two original features remain: \"Time\" and \"Amount\". \"Time\" is the time elapsed since the first transaction in the dataset, and \"Amount\" is the monetary value of the transaction. 
\n", "\n", "You can also see a heavy imbalance in the two classes below:" ] }, { "cell_type": "code", "execution_count": 3, "id": "cathedral-mills", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:12:30.228432Z", "iopub.status.busy": "2024-05-07T17:12:30.227800Z", "iopub.status.idle": "2024-05-07T17:12:30.592309Z", "shell.execute_reply": "2024-05-07T17:12:30.591642Z" } }, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Distribution of Classes')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "f = plt.figure()\n", "ax = f.add_subplot(1,1,1)\n", "ax.hist(data['Class'], bins = [-0.1, 0.1, 0.9, 1.1], log=True)\n", "ax.set_ylabel(\"Log Counts\")\n", "ax.set_xticks([0, 1])\n", "ax.set_xticklabels([\"0\", \"1\"])\n", "ax.set_xlabel(\"Class\")\n", "ax.set_title(\"Distribution of Classes\")" ] }, { "cell_type": "markdown", "id": "exact-timeline", "metadata": {}, "source": [ "### 2) Create a JSON-AI default object\n", "We will now create JSON-AI syntax for our problem based on its specifications. We can do so by setting up a ``ProblemDefinition``. The ``ProblemDefinition`` allows us to specify the target, the column we intend to predict, along with other details. \n", "\n", "The end goal of JSON-AI is to provide **a set of instructions on how to compile a machine learning pipeline*.\n", "\n", "Our target here is called \"**Class**\", which indicates \"0\" for no fraud and \"1\" for fraud. We'll generate the JSON-AI with the minimal syntax:" ] }, { "cell_type": "code", "execution_count": 4, "id": "medieval-zambia", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:12:30.595049Z", "iopub.status.busy": "2024-05-07T17:12:30.594645Z", "iopub.status.idle": "2024-05-07T17:13:39.928759Z", "shell.execute_reply": "2024-05-07T17:13:39.927965Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Analyzing a sample of 18424\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:from a total population of 284807, this is equivalent to 6.5% of your data.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Using 3 processes to deduct types.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: Time\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\u001b[32mINFO:type_infer-2306:Infering type for: V3\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V6\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column Time has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V1\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V3 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V4\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V6 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V7\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V4 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V5\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V1 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V2\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V5 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V9\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V7 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type 
for: V8\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V9 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V10\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V2 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V12\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V8 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V15\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V10 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V11\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V12 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V13\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V11 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V18\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V15 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V16\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V18 has data type float\u001b[0m\n" ] }, { 
"name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V19\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V13 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V14\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V16 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V17\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V17 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V19 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V20\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V14 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V21\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V24\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V24 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V25\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V20 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V21 has data type float\u001b[0m\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V22\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V27\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V22 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V23\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V25 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V26\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V27 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: V28\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V23 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: Class\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V28 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Infering type for: Amount\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column V26 has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column Class has data type binary\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2306:Column Amount has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "\u001b[32mINFO:dataprep_ml-2306:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2306:Finished statistical analysis\u001b[0m\n" ] } ], "source": [ "# Setup the problem definition\n", "problem_definition = {\n", " 'target': 'Class',\n", "}\n", "\n", "# Generate the j{ai}son syntax\n", "json_ai = json_ai_from_problem(data, problem_definition)\n" ] }, { "cell_type": "markdown", "id": "deadly-rotation", "metadata": {}, "source": [ "Lightwood looks at each of the many columns and indicates they are mostly float, with exception of \"**Class**\" which is binary.\n", "\n", "You can observe the JSON-AI if you run the command `print(json_ai.to_json())`. Given there are many input features, we won't print it out." ] }, { "cell_type": "markdown", "id": "immune-clone", "metadata": {}, "source": [ "These are the only elements required to get off the ground with JSON-AI. However, we're interested in making a *custom* approach. So, let's make this syntax a file, and introduce our own changes." ] }, { "cell_type": "markdown", "id": "massive-divide", "metadata": {}, "source": [ "### 3) Build your own splitter module\n", "\n", "For Lightwood, the goal of a splitter is to intake an initial dataset (pre-processed ideally, although you can run the pre-processor on each DataFrame within the splitter) and return a dictionary with the keys \"train\", \"test\", and \"dev\" (at minimum). Subsequent steps of the pipeline expect the keys \"train\", \"test\", and \"dev\", so it's important you assign datasets to these as necessary. \n", "\n", "We're going to introduce SMOTE sampling in our splitter. SMOTE allows you to quickly learn an approximation to make extra \"samples\" that mimic the undersampled class. 
\n", "\n", "We will use the `imblearn` and `scikit-learn` packages to quickly create train/dev/test splits and apply SMOTE to our training data only.\n", "\n", "**NOTE** This is simply an example of things you can do with the splitter; whether SMOTE sampling is ideal for your problem depends on the question you're trying to answer!" ] }, { "cell_type": "code", "execution_count": 5, "id": "4411ee53", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:13:39.932371Z", "iopub.status.busy": "2024-05-07T17:13:39.932103Z", "iopub.status.idle": "2024-05-07T17:13:39.937546Z", "shell.execute_reply": "2024-05-07T17:13:39.936875Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing MyCustomSplitter.py\n" ] } ], "source": [ "%%writefile MyCustomSplitter.py\n", "\n",
"from typing import Dict\n", "\n", "import pandas as pd\n", "from imblearn.over_sampling import SMOTE\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n",
"def MySplitter(\n", "    data: pd.DataFrame,\n", "    target: str,\n", "    pct_train: float = 0.8,\n", "    pct_dev: float = 0.1,\n", "    seed: int = 1,\n", ") -> Dict[str, pd.DataFrame]:\n",
"    \"\"\"\n", "    Custom splitting function\n", "\n", "    :param data: Input data\n", "    :param target: Name of the target\n", "    :param pct_train: Fraction of the full data kept for training (the dev set is carved out of this portion)\n", "    :param pct_dev: Fraction of the training portion reserved for dev\n", "    :param seed: Random seed for reproducibility\n", "\n", "    :returns: A dictionary containing the keys train, test and dev with their respective data frames.\n", "    \"\"\"\n", "\n",
"    # Shuffle the data\n", "    data = data.sample(frac=1, random_state=seed).reset_index(drop=True)\n", "\n",
"    # Split into feature columns + target\n", "    X = data.drop(columns=[target])\n", "    y = data[target]\n", "\n",
"    # Hold out the test set first, stratifying on the target\n", "    X2, X_test, y2, y_test = train_test_split(\n", "        X, y, train_size=pct_train, random_state=seed, stratify=y\n", "    )\n", "\n",
"    # Carve the dev set out of the remaining training portion\n", "    X_train, X_dev, y_train, y_dev = train_test_split(\n", "        X2, y2, test_size=pct_dev, random_state=seed, stratify=y2\n", "    )\n", "\n",
"    # Create a SMOTE model and oversample the minority class JUST for train data\n", "    SMOTE_model = SMOTE(random_state=seed)\n", "\n",
"    # Series.ravel() is deprecated in recent pandas; to_numpy() yields the same array\n", "    Xtrain_mod, ytrain_mod = SMOTE_model.fit_resample(X_train, y_train.to_numpy())\n", "\n",
"    Xtrain_mod[target] = ytrain_mod\n", "    X_test[target] = y_test\n", "    X_dev[target] = y_dev\n", "\n",
"    return {\"train\": Xtrain_mod, \"test\": X_test, \"dev\": X_dev}\n" ] }, { "cell_type": "markdown", "id": "analyzed-radical", "metadata": {}, "source": [ "#### Place your custom module in `~/lightwood_modules`\n", "\n", "Lightwood automatically searches for custom scripts in your `~/lightwood_modules` path, so your file needs to be placed there. Later, when we autogenerate code, you'll see that you can change the import location if you choose." ] }, { "cell_type": "code", "execution_count": 6, "id": "34092d12", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:13:39.940142Z", "iopub.status.busy": "2024-05-07T17:13:39.939757Z", "iopub.status.idle": "2024-05-07T17:13:39.943266Z", "shell.execute_reply": "2024-05-07T17:13:39.942717Z" } }, "outputs": [], "source": [ "from lightwood import load_custom_module\n", "\n", "load_custom_module('MyCustomSplitter.py')" ] }, { "cell_type": "markdown", "id": "lucky-blair", "metadata": {}, "source": [ "### 4) Introduce your custom splitter in JSON-AI\n", "\n", "Now let's introduce our custom splitter. JSON-AI keeps a lightweight syntax but fills in many default modules (like splitting, cleaning).\n", "\n", "For the custom splitter, we'll work by editing the \"splitter\" key. 
We will change properties within it as follows:\n", "\n", "(1) \"module\" - the name of the function. In our case it will be \"MyCustomSplitter.MySplitter\".\n", "\n", "(2) \"args\" - any keyword arguments specific to your splitter's internals.\n", "\n", "This will look as follows:\n", "```\n", "    \"splitter\": {\n", "        \"module\": \"MyCustomSplitter.MySplitter\",\n", "        \"args\": {\n", "            \"data\": \"data\",\n", "            \"target\": \"$target\",\n", "            \"pct_train\": 0.8,\n", "            \"pct_dev\": 0.1,\n", "            \"seed\": 1\n", "        }\n", "    },\n", "```" ] }, { "cell_type": "markdown", "id": "identical-georgia", "metadata": {}, "source": [ "### 5) Generate Python code representing your ML pipeline\n", "\n", "Now we're ready to load up our custom JSON-AI and generate the predictor code!\n", "\n", "We can do this by first reading in our custom JSON-AI syntax, and then calling the function `code_from_json_ai`. " ] }, { "cell_type": "code", "execution_count": 7, "id": "alleged-concentrate", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:13:39.945933Z", "iopub.status.busy": "2024-05-07T17:13:39.945551Z", "iopub.status.idle": "2024-05-07T17:13:40.181125Z", "shell.execute_reply": "2024-05-07T17:13:40.180436Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "import lightwood\n", "from lightwood import __version__ as lightwood_version\n", "from lightwood.analysis import *\n", "from lightwood.api import *\n", "from lightwood.data import *\n", "from lightwood.encoder import *\n", "from lightwood.ensemble import *\n", "from lightwood.helpers.device import *\n", "from lightwood.helpers.general import *\n", "from lightwood.helpers.ts import *\n", "from lightwood.helpers.log import *\n", "from lightwood.helpers.numeric import *\n", "from lightwood.helpers.parallelism import *\n", "from lightwood.helpers.seed import *\n", "from lightwood.helpers.text import *\n", "from lightwood.helpers.torch import *\n", "from lightwood.mixer import *\n", "\n", "from dataprep_ml.insights 
import statistical_analysis\n", "from dataprep_ml.cleaners import cleaner\n", "from dataprep_ml.splitters import splitter\n", "from dataprep_ml.imputers import *\n", "\n", "from mindsdb_evaluator import evaluate_accuracies\n", "from mindsdb_evaluator.accuracy import __all__ as mdb_eval_accuracy_metrics\n", "\n", "import pandas as pd\n", "from typing import Dict, List, Union, Optional\n", "import os\n", "from types import ModuleType\n", "import importlib.machinery\n", "import sys\n", "import time\n", "\n", "\n", "for import_dir in [\n", " os.path.join(\n", " os.path.expanduser(\"~/lightwood_modules\"), lightwood_version.replace(\".\", \"_\")\n", " ),\n", " os.path.join(\"/etc/lightwood_modules\", lightwood_version.replace(\".\", \"_\")),\n", "]:\n", " if os.path.exists(import_dir) and os.access(import_dir, os.R_OK):\n", " for file_name in list(os.walk(import_dir))[0][2]:\n", " if file_name[-3:] != \".py\":\n", " continue\n", " mod_name = file_name[:-3]\n", " loader = importlib.machinery.SourceFileLoader(\n", " mod_name, os.path.join(import_dir, file_name)\n", " )\n", " module = ModuleType(loader.name)\n", " loader.exec_module(module)\n", " sys.modules[mod_name] = module\n", " exec(f\"import {mod_name}\")\n", "\n", "\n", "class Predictor(PredictorInterface):\n", " target: str\n", " mixers: List[BaseMixer]\n", " encoders: Dict[str, BaseEncoder]\n", " ensemble: BaseEnsemble\n", " mode: str\n", "\n", " def __init__(self):\n", " seed(1)\n", " self.target = \"Class\"\n", " self.mode = \"inactive\"\n", " self.problem_definition = ProblemDefinition.from_dict(\n", " {\n", " \"target\": \"Class\",\n", " \"pct_invalid\": 2,\n", " \"unbias_target\": True,\n", " \"seconds_per_mixer\": 42768.0,\n", " \"seconds_per_encoder\": None,\n", " \"expected_additional_time\": 69.30250954627991,\n", " \"time_aim\": 259200,\n", " \"target_weights\": None,\n", " \"positive_domain\": False,\n", " \"timeseries_settings\": {\n", " \"is_timeseries\": False,\n", " \"order_by\": None,\n", " 
\"window\": None,\n", " \"group_by\": None,\n", " \"use_previous_target\": True,\n", " \"horizon\": None,\n", " \"historical_columns\": None,\n", " \"target_type\": \"\",\n", " \"allow_incomplete_history\": True,\n", " \"eval_incomplete\": False,\n", " \"interval_periods\": [],\n", " },\n", " \"anomaly_detection\": False,\n", " \"use_default_analysis\": True,\n", " \"embedding_only\": False,\n", " \"dtype_dict\": {},\n", " \"ignore_features\": [],\n", " \"fit_on_all\": True,\n", " \"strict_mode\": True,\n", " \"seed_nr\": 1,\n", " }\n", " )\n", " self.accuracy_functions = [\"balanced_accuracy_score\"]\n", " self.identifiers = {}\n", " self.dtype_dict = {\n", " \"Time\": \"integer\",\n", " \"V1\": \"float\",\n", " \"V2\": \"float\",\n", " \"V3\": \"float\",\n", " \"V4\": \"float\",\n", " \"V5\": \"float\",\n", " \"V6\": \"float\",\n", " \"V7\": \"float\",\n", " \"V8\": \"float\",\n", " \"V9\": \"float\",\n", " \"V10\": \"float\",\n", " \"V11\": \"float\",\n", " \"V12\": \"float\",\n", " \"V13\": \"float\",\n", " \"V14\": \"float\",\n", " \"V15\": \"float\",\n", " \"V16\": \"float\",\n", " \"V17\": \"float\",\n", " \"V18\": \"float\",\n", " \"V19\": \"float\",\n", " \"V20\": \"float\",\n", " \"V21\": \"float\",\n", " \"V22\": \"float\",\n", " \"V23\": \"float\",\n", " \"V24\": \"float\",\n", " \"V25\": \"float\",\n", " \"V26\": \"float\",\n", " \"V27\": \"float\",\n", " \"V28\": \"float\",\n", " \"Amount\": \"float\",\n", " \"Class\": \"binary\",\n", " }\n", " self.lightwood_version = \"24.3.3.1\"\n", " self.pred_args = PredictionArguments()\n", "\n", " # Any feature-column dependencies\n", " self.dependencies = {\n", " \"Class\": [],\n", " \"Time\": [],\n", " \"V1\": [],\n", " \"V2\": [],\n", " \"V3\": [],\n", " \"V4\": [],\n", " \"V5\": [],\n", " \"V6\": [],\n", " \"V7\": [],\n", " \"V8\": [],\n", " \"V9\": [],\n", " \"V10\": [],\n", " \"V11\": [],\n", " \"V12\": [],\n", " \"V13\": [],\n", " \"V14\": [],\n", " \"V15\": [],\n", " \"V16\": [],\n", " \"V17\": [],\n", 
" \"V18\": [],\n", " \"V19\": [],\n", " \"V20\": [],\n", " \"V21\": [],\n", " \"V22\": [],\n", " \"V23\": [],\n", " \"V24\": [],\n", " \"V25\": [],\n", " \"V26\": [],\n", " \"V27\": [],\n", " \"V28\": [],\n", " \"Amount\": [],\n", " }\n", "\n", " self.input_cols = [\n", " \"Time\",\n", " \"V1\",\n", " \"V2\",\n", " \"V3\",\n", " \"V4\",\n", " \"V5\",\n", " \"V6\",\n", " \"V7\",\n", " \"V8\",\n", " \"V9\",\n", " \"V10\",\n", " \"V11\",\n", " \"V12\",\n", " \"V13\",\n", " \"V14\",\n", " \"V15\",\n", " \"V16\",\n", " \"V17\",\n", " \"V18\",\n", " \"V19\",\n", " \"V20\",\n", " \"V21\",\n", " \"V22\",\n", " \"V23\",\n", " \"V24\",\n", " \"V25\",\n", " \"V26\",\n", " \"V27\",\n", " \"V28\",\n", " \"Amount\",\n", " ]\n", "\n", " # Initial stats analysis\n", " self.statistical_analysis = None\n", " self.ts_analysis = None\n", " self.runtime_log = dict()\n", " self.global_insights = dict()\n", "\n", " # Feature cache\n", " self.feature_cache = dict()\n", "\n", " @timed_predictor\n", " def analyze_data(self, data: pd.DataFrame) -> None:\n", " # Perform a statistical analysis on the unprocessed data\n", "\n", " self.statistical_analysis = statistical_analysis(\n", " data, self.dtype_dict, self.problem_definition.to_dict(), {}\n", " )\n", "\n", " # Instantiate post-training evaluation\n", " self.analysis_blocks = [\n", " ICP(fixed_significance=None, confidence_normalizer=False, deps=[]),\n", " ConfStats(deps=[\"ICP\"]),\n", " AccStats(deps=[\"ICP\"]),\n", " PermutationFeatureImportance(deps=[\"AccStats\"]),\n", " ]\n", "\n", " @timed_predictor\n", " def preprocess(self, data: pd.DataFrame) -> pd.DataFrame:\n", " # Preprocess and clean data\n", "\n", " log.info(\"Cleaning the data\")\n", " self.imputers = {}\n", " data = cleaner(\n", " data=data,\n", " pct_invalid=self.problem_definition.pct_invalid,\n", " identifiers=self.identifiers,\n", " dtype_dict=self.dtype_dict,\n", " target=self.target,\n", " mode=self.mode,\n", " imputers=self.imputers,\n", " 
timeseries_settings=self.problem_definition.timeseries_settings.to_dict(),\n", " anomaly_detection=self.problem_definition.anomaly_detection,\n", " )\n", "\n", " # Time-series blocks\n", "\n", " return data\n", "\n", " @timed_predictor\n", " def split(self, data: pd.DataFrame) -> Dict[str, pd.DataFrame]:\n", " # Split the data into training/testing splits\n", "\n", " log.info(\"Splitting the data into train/test\")\n", " train_test_data = MyCustomSplitter.MySplitter(\n", " data=data, pct_train=0.8, pct_dev=0.1, seed=1, target=self.target\n", " )\n", "\n", " return train_test_data\n", "\n", " @timed_predictor\n", " def prepare(self, data: Dict[str, pd.DataFrame]) -> None:\n", " # Prepare encoders to featurize data\n", "\n", " self.mode = \"train\"\n", "\n", " if self.statistical_analysis is None:\n", " raise Exception(\"Please run analyze_data first\")\n", "\n", " # Column to encoder mapping\n", " self.encoders = {\n", " \"Class\": BinaryEncoder(\n", " is_target=True, target_weights=self.statistical_analysis.target_weights\n", " ),\n", " \"Time\": NumericEncoder(),\n", " \"V1\": NumericEncoder(),\n", " \"V2\": NumericEncoder(),\n", " \"V3\": NumericEncoder(),\n", " \"V4\": NumericEncoder(),\n", " \"V5\": NumericEncoder(),\n", " \"V6\": NumericEncoder(),\n", " \"V7\": NumericEncoder(),\n", " \"V8\": NumericEncoder(),\n", " \"V9\": NumericEncoder(),\n", " \"V10\": NumericEncoder(),\n", " \"V11\": NumericEncoder(),\n", " \"V12\": NumericEncoder(),\n", " \"V13\": NumericEncoder(),\n", " \"V14\": NumericEncoder(),\n", " \"V15\": NumericEncoder(),\n", " \"V16\": NumericEncoder(),\n", " \"V17\": NumericEncoder(),\n", " \"V18\": NumericEncoder(),\n", " \"V19\": NumericEncoder(),\n", " \"V20\": NumericEncoder(),\n", " \"V21\": NumericEncoder(),\n", " \"V22\": NumericEncoder(),\n", " \"V23\": NumericEncoder(),\n", " \"V24\": NumericEncoder(),\n", " \"V25\": NumericEncoder(),\n", " \"V26\": NumericEncoder(),\n", " \"V27\": NumericEncoder(),\n", " \"V28\": NumericEncoder(),\n", 
" \"Amount\": NumericEncoder(),\n", " }\n", "\n", " # Prepare the training + dev data\n", " concatenated_train_dev = pd.concat([data[\"train\"], data[\"dev\"]])\n", "\n", " prepped_encoders = {}\n", "\n", " # Prepare input encoders\n", " parallel_encoding = parallel_encoding_check(data[\"train\"], self.encoders)\n", "\n", " if parallel_encoding:\n", " log.debug(\"Preparing in parallel...\")\n", " for col_name, encoder in self.encoders.items():\n", " if col_name != self.target and not encoder.is_trainable_encoder:\n", " prepped_encoders[col_name] = (\n", " encoder,\n", " concatenated_train_dev[col_name],\n", " \"prepare\",\n", " )\n", " prepped_encoders = mut_method_call(prepped_encoders)\n", "\n", " else:\n", " log.debug(\"Preparing sequentially...\")\n", " for col_name, encoder in self.encoders.items():\n", " if col_name != self.target and not encoder.is_trainable_encoder:\n", " log.debug(f\"Preparing encoder for {col_name}...\")\n", " encoder.prepare(concatenated_train_dev[col_name])\n", " prepped_encoders[col_name] = encoder\n", "\n", " # Store encoders\n", " for col_name, encoder in prepped_encoders.items():\n", " self.encoders[col_name] = encoder\n", "\n", " # Prepare the target\n", " if self.target not in prepped_encoders:\n", " if self.encoders[self.target].is_trainable_encoder:\n", " self.encoders[self.target].prepare(\n", " data[\"train\"][self.target], data[\"dev\"][self.target]\n", " )\n", " else:\n", " self.encoders[self.target].prepare(\n", " pd.concat([data[\"train\"], data[\"dev\"]])[self.target]\n", " )\n", "\n", " # Prepare any non-target encoders that are learned\n", " for col_name, encoder in self.encoders.items():\n", " if col_name != self.target and encoder.is_trainable_encoder:\n", " priming_data = pd.concat([data[\"train\"], data[\"dev\"]])\n", " kwargs = {}\n", " if self.dependencies[col_name]:\n", " kwargs[\"dependency_data\"] = {}\n", " for col in self.dependencies[col_name]:\n", " kwargs[\"dependency_data\"][col] = {\n", " 
\"original_type\": self.dtype_dict[col],\n", " \"data\": priming_data[col],\n", " }\n", "\n", " # If an encoder representation requires the target, provide priming data\n", " if hasattr(encoder, \"uses_target\"):\n", " kwargs[\"encoded_target_values\"] = self.encoders[self.target].encode(\n", " priming_data[self.target]\n", " )\n", "\n", " encoder.prepare(\n", " data[\"train\"][col_name], data[\"dev\"][col_name], **kwargs\n", " )\n", "\n", " @timed_predictor\n", " def featurize(self, split_data: Dict[str, pd.DataFrame]):\n", " # Featurize data into numerical representations for models\n", "\n", " log.info(\"Featurizing the data\")\n", "\n", " tss = self.problem_definition.timeseries_settings\n", "\n", " feature_data = dict()\n", " for key, data in split_data.items():\n", " if key != \"stratified_on\":\n", "\n", " # compute and store two splits - full and filtered (useful for time series post-train analysis)\n", " if key not in self.feature_cache:\n", " featurized_split = EncodedDs(self.encoders, data, self.target)\n", " filtered_subset = EncodedDs(\n", " self.encoders, filter_ts(data, tss), self.target\n", " )\n", "\n", " for k, s in zip(\n", " (key, f\"{key}_filtered\"), (featurized_split, filtered_subset)\n", " ):\n", " self.feature_cache[k] = s\n", "\n", " for k in (key, f\"{key}_filtered\"):\n", " feature_data[k] = self.feature_cache[k]\n", "\n", " return feature_data\n", "\n", " @timed_predictor\n", " def fit(self, enc_data: Dict[str, pd.DataFrame]) -> None:\n", " # Fit predictors to estimate target\n", "\n", " self.mode = \"train\"\n", "\n", " # --------------- #\n", " # Extract data\n", " # --------------- #\n", " # Extract the featurized data into train/dev/test\n", " encoded_train_data = enc_data[\"train\"]\n", " encoded_dev_data = enc_data[\"dev\"]\n", " encoded_test_data = enc_data[\"test_filtered\"]\n", "\n", " log.info(\"Training the mixers\")\n", "\n", " # --------------- #\n", " # Fit Models\n", " # --------------- #\n", " # Assign list of mixers\n", 
" self.mixers = [\n", " Neural(\n", " fit_on_dev=True,\n", " search_hyperparameters=True,\n", " net=\"DefaultNet\",\n", " stop_after=self.problem_definition.seconds_per_mixer,\n", " target=self.target,\n", " dtype_dict=self.dtype_dict,\n", " target_encoder=self.encoders[self.target],\n", " ),\n", " XGBoostMixer(\n", " fit_on_dev=True,\n", " use_optuna=True,\n", " stop_after=self.problem_definition.seconds_per_mixer,\n", " target=self.target,\n", " dtype_dict=self.dtype_dict,\n", " input_cols=self.input_cols,\n", " target_encoder=self.encoders[self.target],\n", " ),\n", " Regression(\n", " stop_after=self.problem_definition.seconds_per_mixer,\n", " target=self.target,\n", " dtype_dict=self.dtype_dict,\n", " target_encoder=self.encoders[self.target],\n", " ),\n", " RandomForest(\n", " fit_on_dev=True,\n", " stop_after=self.problem_definition.seconds_per_mixer,\n", " target=self.target,\n", " dtype_dict=self.dtype_dict,\n", " target_encoder=self.encoders[self.target],\n", " ),\n", " ]\n", "\n", " # Train mixers\n", " trained_mixers = []\n", " for mixer in self.mixers:\n", " try:\n", " if mixer.trains_once:\n", " self.fit_mixer(\n", " mixer,\n", " ConcatedEncodedDs([encoded_train_data, encoded_dev_data]),\n", " encoded_test_data,\n", " )\n", " else:\n", " self.fit_mixer(mixer, encoded_train_data, encoded_dev_data)\n", " trained_mixers.append(mixer)\n", " except Exception as e:\n", " log.warning(f\"Exception: {e} when training mixer: {mixer}\")\n", " if True and mixer.stable:\n", " raise e\n", "\n", " # Update mixers to trained versions\n", " if not trained_mixers:\n", " raise Exception(\n", " \"No mixers could be trained! 
Please verify your problem definition or JsonAI model representation.\"\n", " )\n", " self.mixers = trained_mixers\n", "\n", " # --------------- #\n", " # Create Ensembles\n", " # --------------- #\n", " log.info(\"Ensembling the mixer\")\n", " # Create an ensemble of mixers to identify best performing model\n", " # Dirty hack\n", " self.ensemble = BestOf(\n", " data=encoded_test_data,\n", " fit=True,\n", " ts_analysis=None,\n", " target=self.target,\n", " mixers=self.mixers,\n", " args=self.pred_args,\n", " accuracy_functions=self.accuracy_functions,\n", " )\n", " self.supports_proba = self.ensemble.supports_proba\n", "\n", " @timed_predictor\n", " def fit_mixer(self, mixer, encoded_train_data, encoded_dev_data) -> None:\n", " mixer.fit(encoded_train_data, encoded_dev_data)\n", "\n", " @timed_predictor\n", " def analyze_ensemble(self, enc_data: Dict[str, pd.DataFrame]) -> None:\n", " # Evaluate quality of fit for the ensemble of mixers\n", "\n", " # --------------- #\n", " # Extract data\n", " # --------------- #\n", " # Extract the featurized data into train/dev/test\n", " encoded_train_data = enc_data[\"train\"]\n", " encoded_dev_data = enc_data[\"dev\"]\n", " encoded_test_data = enc_data[\"test\"]\n", "\n", " # --------------- #\n", " # Analyze Ensembles\n", " # --------------- #\n", " log.info(\"Analyzing the ensemble of mixers\")\n", " self.model_analysis, self.runtime_analyzer = model_analyzer(\n", " data=encoded_test_data,\n", " train_data=encoded_train_data,\n", " ts_analysis=None,\n", " stats_info=self.statistical_analysis,\n", " pdef=self.problem_definition,\n", " accuracy_functions=self.accuracy_functions,\n", " predictor=self.ensemble,\n", " target=self.target,\n", " dtype_dict=self.dtype_dict,\n", " analysis_blocks=self.analysis_blocks,\n", " )\n", "\n", " @timed_predictor\n", " def learn(self, data: pd.DataFrame) -> None:\n", " if self.problem_definition.ignore_features:\n", " log.info(f\"Dropping features: 
{self.problem_definition.ignore_features}\")\n", " data = data.drop(\n", " columns=self.problem_definition.ignore_features, errors=\"ignore\"\n", " )\n", "\n", " self.mode = \"train\"\n", " n_phases = 8 if self.problem_definition.fit_on_all else 7\n", "\n", " # Perform stats analysis\n", " log.info(f\"[Learn phase 1/{n_phases}] - Statistical analysis\")\n", " self.analyze_data(data)\n", "\n", " # Pre-process the data\n", " log.info(f\"[Learn phase 2/{n_phases}] - Data preprocessing\")\n", " data = self.preprocess(data)\n", "\n", " # Create train/test (dev) split\n", " log.info(f\"[Learn phase 3/{n_phases}] - Data splitting\")\n", " train_dev_test = self.split(data)\n", "\n", " # Prepare encoders\n", " log.info(f\"[Learn phase 4/{n_phases}] - Preparing encoders\")\n", " self.prepare(train_dev_test)\n", "\n", " # Create feature vectors from data\n", " log.info(f\"[Learn phase 5/{n_phases}] - Feature generation\")\n", " enc_train_test = self.featurize(train_dev_test)\n", "\n", " # Prepare mixers\n", " log.info(f\"[Learn phase 6/{n_phases}] - Mixer training\")\n", " if not self.problem_definition.embedding_only:\n", " self.fit(enc_train_test)\n", " else:\n", " self.mixers = []\n", " self.ensemble = Embedder(\n", " self.target, mixers=list(), data=enc_train_test[\"train\"]\n", " )\n", " self.supports_proba = self.ensemble.supports_proba\n", "\n", " # Analyze the ensemble\n", " log.info(f\"[Learn phase 7/{n_phases}] - Ensemble analysis\")\n", " self.analyze_ensemble(enc_train_test)\n", "\n", " # ------------------------ #\n", " # Enable model partial fit AFTER it is trained and evaluated for performance with the appropriate train/dev/test splits.\n", " # This assumes the predictor could continuously evolve, hence including reserved testing data may improve predictions.\n", " # SET `json_ai.problem_definition.fit_on_all=False` TO TURN THIS BLOCK OFF.\n", "\n", " # Update the mixers with partial fit\n", " if self.problem_definition.fit_on_all and all(\n", " [not 
m.trains_once for m in self.mixers]\n", " ):\n", " log.info(f\"[Learn phase 8/{n_phases}] - Adjustment on validation requested\")\n", " self.adjust(\n", " enc_train_test[\"test\"].data_frame,\n", " ConcatedEncodedDs(\n", " [enc_train_test[\"train\"], enc_train_test[\"dev\"]]\n", " ).data_frame,\n", " adjust_args={\"learn_call\": True},\n", " )\n", "\n", " self.feature_cache = (\n", " dict()\n", " ) # empty feature cache to avoid large predictor objects\n", "\n", " @timed_predictor\n", " def adjust(\n", " self,\n", " train_data: Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame],\n", " dev_data: Optional[Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame]] = None,\n", " adjust_args: Optional[dict] = None,\n", " ) -> None:\n", " # Update mixers with new information\n", "\n", " self.mode = \"train\"\n", "\n", " # --------------- #\n", " # Prepare data\n", " # --------------- #\n", " if dev_data is None:\n", " data = train_data\n", " split = splitter(\n", " data=data,\n", " pct_train=0.8,\n", " pct_dev=0.2,\n", " pct_test=0,\n", " tss=self.problem_definition.timeseries_settings.to_dict(),\n", " seed=self.problem_definition.seed_nr,\n", " target=self.target,\n", " dtype_dict=self.dtype_dict,\n", " )\n", " train_data = split[\"train\"]\n", " dev_data = split[\"dev\"]\n", "\n", " if adjust_args is None or not adjust_args.get(\"learn_call\"):\n", " train_data = self.preprocess(train_data)\n", " dev_data = self.preprocess(dev_data)\n", "\n", " dev_data = EncodedDs(self.encoders, dev_data, self.target)\n", " train_data = EncodedDs(self.encoders, train_data, self.target)\n", "\n", " # --------------- #\n", " # Update/Adjust Mixers\n", " # --------------- #\n", " log.info(\"Updating the mixers\")\n", "\n", " for mixer in self.mixers:\n", " mixer.partial_fit(train_data, dev_data, adjust_args)\n", "\n", " @timed_predictor\n", " def predict(self, data: pd.DataFrame, args: Dict = {}) -> pd.DataFrame:\n", "\n", " self.mode = \"predict\"\n", " n_phases = 3 if self.pred_args.all_mixers 
else 4\n", "\n", " if len(data) == 0:\n", " raise Exception(\n", " \"Empty input, aborting prediction. Please try again with some input data.\"\n", " )\n", "\n", " self.pred_args = PredictionArguments.from_dict(args)\n", "\n", " log.info(f\"[Predict phase 1/{n_phases}] - Data preprocessing\")\n", " if self.problem_definition.ignore_features:\n", " log.info(f\"Dropping features: {self.problem_definition.ignore_features}\")\n", " data = data.drop(\n", " columns=self.problem_definition.ignore_features, errors=\"ignore\"\n", " )\n", " for col in self.input_cols:\n", " if col not in data.columns:\n", " data[col] = [None] * len(data)\n", "\n", " # Pre-process the data\n", " data = self.preprocess(data)\n", "\n", " # Featurize the data\n", " log.info(f\"[Predict phase 2/{n_phases}] - Feature generation\")\n", " encoded_ds = self.featurize({\"predict_data\": data})[\"predict_data\"]\n", " encoded_data = encoded_ds.get_encoded_data(include_target=False)\n", "\n", " log.info(f\"[Predict phase 3/{n_phases}] - Calling ensemble\")\n", "\n", " @timed\n", " def _timed_call(encoded_ds):\n", " if self.pred_args.return_embedding:\n", " embedder = Embedder(self.target, mixers=list(), data=encoded_ds)\n", " df = embedder(encoded_ds, args=self.pred_args)\n", " else:\n", " df = self.ensemble(encoded_ds, args=self.pred_args)\n", " return df\n", "\n", " df = _timed_call(encoded_ds)\n", "\n", " if not (\n", " any(\n", " [\n", " self.pred_args.all_mixers,\n", " self.pred_args.return_embedding,\n", " self.problem_definition.embedding_only,\n", " ]\n", " )\n", " ):\n", " log.info(f\"[Predict phase 4/{n_phases}] - Analyzing output\")\n", " df, global_insights = explain(\n", " data=data,\n", " encoded_data=encoded_data,\n", " predictions=df,\n", " ts_analysis=None,\n", " problem_definition=self.problem_definition,\n", " stat_analysis=self.statistical_analysis,\n", " runtime_analysis=self.runtime_analyzer,\n", " target_name=self.target,\n", " target_dtype=self.dtype_dict[self.target],\n", " 
explainer_blocks=self.analysis_blocks,\n", " pred_args=self.pred_args,\n", " )\n", " self.global_insights = {**self.global_insights, **global_insights}\n", "\n", " self.feature_cache = (\n", " dict()\n", " ) # empty feature cache to avoid large predictor objects\n", "\n", " return df\n", "\n", " def test(\n", " self,\n", " data: pd.DataFrame,\n", " metrics: list,\n", " args: Dict[str, object] = {},\n", " strict: bool = False,\n", " ) -> pd.DataFrame:\n", "\n", " preds = self.predict(data, args)\n", " preds = preds.rename(columns={\"prediction\": self.target})\n", " filtered = []\n", "\n", " # filter metrics if not supported\n", " for metric in metrics:\n", " # metric should be one of: an actual function, registered in the model class, or supported by the evaluator\n", " if not (\n", " callable(metric)\n", " or metric in self.accuracy_functions\n", " or metric in mdb_eval_accuracy_metrics\n", " ):\n", " if strict:\n", " raise Exception(f\"Invalid metric: {metric}\")\n", " else:\n", " log.warning(f\"Invalid metric: {metric}. 
Skipping...\")\n", " else:\n", " filtered.append(metric)\n", "\n", " metrics = filtered\n", " try:\n", " labels = self.model_analysis.histograms[self.target][\"x\"]\n", " except:\n", " if strict:\n", " raise Exception(\"Label histogram not found\")\n", " else:\n", " label_map = (\n", " None # some accuracy functions will crash without this, be mindful\n", " )\n", " scores = evaluate_accuracies(\n", " data,\n", " preds[self.target],\n", " self.target,\n", " metrics,\n", " ts_analysis=self.ts_analysis,\n", " labels=labels,\n", " )\n", "\n", " # TODO: remove once mdb_eval returns an actual list\n", " scores = {k: [v] for k, v in scores.items() if not isinstance(v, list)}\n", "\n", " return pd.DataFrame.from_records(\n", " scores\n", " ) # TODO: add logic to disaggregate per-mixer\n", "\n" ] } ], "source": [ "json_ai.splitter = {\n", " \"module\": \"MyCustomSplitter.MySplitter\",\n", " \"args\": {\n", " \"data\": \"data\",\n", " \"target\": \"$target\",\n", " \"pct_train\": 0.8,\n", " \"pct_dev\": 0.1,\n", " \"seed\": 1\n", " }\n", " }\n", "\n", "# Generate Python code that fills in your pipeline\n", "code = code_from_json_ai(json_ai)\n", "\n", "print(code)\n", "\n", "# Save code to a file (Optional)\n", "with open('custom_splitter_pipeline.py', 'w') as fp:\n", " fp.write(code)" ] }, { "cell_type": "markdown", "id": "dental-beauty", "metadata": {}, "source": [ "As you can see, an end-to-end pipeline of our entire ML procedure has been generated. It is broken into several separate functions so that you can see exactly which processes your data goes through in order to build these models.\n", "\n", "The key steps of the pipeline are as follows:\n", "\n", "(1) Run a **statistical analysis** with `analyze_data`
\n", "(2) Clean your data with `preprocess`
\n", "(3) Make a training/dev/testing split with `split`
\n", "(4) Prepare your feature-engineering pipelines with `prepare`
\n", "(5) Create your features with `featurize`
\n", "(6) Fit your predictor models with `fit`
\n", "\n", "You can customize this further if necessary, but you have all the steps necessary to train a model!\n", "\n", "We recommend familiarizing yourself with these steps by calling the above commands, ideally in order. Some commands (namely `prepare`, `featurize`, and `fit`) depend on the output of earlier steps.\n", "\n", "If you want to skip the individual steps, we recommend you simply call the `learn` method, which runs all the necessary steps to give you fully trained predictive models, starting from unprocessed data!" ] }, { "cell_type": "markdown", "id": "amended-oklahoma", "metadata": {}, "source": [ "### 6) Call Python to run your code and see your preprocessed outputs\n", "\n", "Once we have code, we can turn this into a Python object by calling `predictor_from_code`. This instantiates the `PredictorInterface` object.\n", "\n", "This predictor object can then be used to run your pipeline." ] }, { "cell_type": "code", "execution_count": 8, "id": "organic-london", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:13:40.183770Z", "iopub.status.busy": "2024-05-07T17:13:40.183572Z", "iopub.status.idle": "2024-05-07T17:13:40.190941Z", "shell.execute_reply": "2024-05-07T17:13:40.190412Z" } }, "outputs": [], "source": [ "# Turn the code above into a predictor object\n", "predictor = predictor_from_code(code)" ] }, { "cell_type": "code", "execution_count": 9, "id": "fabulous-prime", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:13:40.193217Z", "iopub.status.busy": "2024-05-07T17:13:40.192988Z", "iopub.status.idle": "2024-05-07T17:14:00.395495Z", "shell.execute_reply": "2024-05-07T17:14:00.394798Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2306:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2306: `preprocess` runtime: 18.64 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", 
"text": [ "\u001b[32mINFO:dataprep_ml-2306:Splitting the data into train/test\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2306: `split` runtime: 1.55 seconds\u001b[0m\n" ] } ], "source": [ "# Pre-process the data\n", "cleaned_data = predictor.preprocess(data)\n", "train_test_data = predictor.split(cleaned_data)" ] }, { "cell_type": "code", "execution_count": 10, "id": "suspended-biography", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:14:00.398126Z", "iopub.status.busy": "2024-05-07T17:14:00.397903Z", "iopub.status.idle": "2024-05-07T17:14:01.928878Z", "shell.execute_reply": "2024-05-07T17:14:01.928158Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.rcParams['font.size']=15\n", "f = plt.figure(figsize=(18, 5))\n", "\n", "ax = f.add_subplot(1,3,1)\n", "ax.hist(train_test_data[\"train\"]['Class'], bins = [-0.1, 0.1, 0.9, 1.1], log=True)\n", "ax.set_ylabel(\"Log Counts\")\n", "ax.set_xticks([0, 1])\n", "ax.set_xticklabels([\"0\", \"1\"])\n", "ax.set_xlabel(\"Class\")\n", "ax.set_title(\"Train:\\nDistribution of Classes\")\n", "ax.set_ylim([1, 1e6])\n", "\n", "ax = f.add_subplot(1,3,2)\n", "ax.hist(train_test_data[\"dev\"]['Class'], bins = [-0.1, 0.1, 0.9, 1.1], log=True, color='k')\n", "ax.set_ylabel(\"Log Counts\")\n", "ax.set_xticks([0, 1])\n", "ax.set_xticklabels([\"0\", \"1\"])\n", "ax.set_xlabel(\"Class\")\n", "ax.set_title(\"Dev:\\nDistribution of Classes\")\n", "ax.set_ylim([1, 1e6])\n", "\n", "\n", "ax = f.add_subplot(1,3,3)\n", "ax.hist(train_test_data[\"test\"]['Class'], bins = [-0.1, 0.1, 0.9, 1.1], log=True, color='r')\n", "ax.set_ylabel(\"Log Counts\")\n", "ax.set_xticks([0, 1])\n", "ax.set_xticklabels([\"0\", \"1\"])\n", "ax.set_xlabel(\"Class\")\n", "ax.set_title(\"Test:\\nDistribution of Classes\")\n", "ax.set_ylim([1, 1e6])\n", "\n", "f.tight_layout()" ] }, { "cell_type": "markdown", "id": "operational-binary", "metadata": {}, "source": [ "As you can see, our splitter has greatly increased the representation of the minority class within the training data, but not so for the testing or dev data.\n", "\n", "We hope this tutorial was informative on how to introduce a **custom splitter method** to your datasets! For more customization tutorials, please check our [documentation](https://lightwood.io/tutorials.html).\n", "\n", "If you want to download the Jupyter-notebook version of this tutorial, check out the source github location found here: `lightwood/docssrc/source/tutorials/custom_splitter`. 
" ] } ], "metadata": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" }, "kernelspec": { "display_name": "Python 3.8.10 64-bit", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 5 }