{ "cells": [ { "cell_type": "markdown", "id": "israeli-spyware", "metadata": {}, "source": [ "## Build your own training/testing split\n", "\n", "#### Date: 2021.10.07\n", "\n", "When working with machine learning data, splitting into a \"train\", \"dev\" (or validation), and \"test\" set is important. Models use **train** data to learn representations and update their parameters; **dev** or validation data is reserved to estimate how the model may perform on unseen data. While the model is not explicitly trained on it, it can be used as a stopping criterion, for hyper-parameter tuning, or as a simple sanity check. Lastly, **test** data is always reserved, hidden from the model, as a final pass to see which models perform best.\n", "\n", "Lightwood supports a variety of **encoders** (feature engineering procedures) and **mixers** (predictor algorithms that go from feature vectors to the target). Given the diversity of algorithms, it is appropriate to split data into these three categories when *preparing* encoders or *fitting* mixers.\n", "\n", "Our default approach stratifies labeled data so that all classes are proportionally represented in your train, validation, and test sets. However, in many instances you may want a custom technique to build your own splits. We've included the `splitter` functionality (default found in `lightwood.data.splitter`) to enable you to build your own.\n", "\n", "In the following problem, we shall work with a Kaggle dataset on credit card fraud (found [here](https://www.kaggle.com/mlg-ulb/creditcardfraud)). Fraud detection is difficult because the events we are interested in catching are, thankfully, rare. As a result, there is a large **imbalance of classes** (in fact, in this dataset, less than 1% of the rows are the rare event).\n", "\n", "In a supervised setting, we want to ensure our training data sees the rare event of interest. A random shuffle could easily miss these rare events. 
We will implement **SMOTE** to increase the number of positive (fraud) examples in our training data.\n", "\n", "Let's get started!" ] }, { "cell_type": "code", "execution_count": 1, "id": "interim-discussion", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:12:19.505399Z", "iopub.status.busy": "2024-05-07T17:12:19.505191Z", "iopub.status.idle": "2024-05-07T17:12:23.726284Z", "shell.execute_reply": "2024-05-07T17:12:23.725617Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2306:No torchvision detected, image helpers not supported.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2306:No torchvision/pillow detected, image encoder not supported\u001b[0m\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import torch\n", "import nltk\n", "import matplotlib.pyplot as plt\n", "\n", "import os\n", "import sys\n", "\n", "# Lightwood modules\n", "import lightwood as lw\n", "from lightwood import ProblemDefinition, \\\n", " JsonAI, \\\n", " json_ai_from_problem, \\\n", " code_from_json_ai, \\\n", " predictor_from_code\n", "\n", "import imblearn # version 0.5.0 or later is required" ] }, { "cell_type": "markdown", "id": "decimal-techno", "metadata": {}, "source": [ "### 1) Load your data\n", "\n", "Lightwood works with `pandas` DataFrames. We can use pandas to load our data. Please download the dataset from the above link and place it in a folder called `data/` where this notebook is located." ] }, { "cell_type": "code", "execution_count": 2, "id": "foreign-orchestra", "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:12:23.729268Z", "iopub.status.busy": "2024-05-07T17:12:23.728955Z", "iopub.status.idle": "2024-05-07T17:12:30.225282Z", "shell.execute_reply": "2024-05-07T17:12:30.224479Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "|   | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |\n",
"| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |\n",
"| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |\n",
"| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |\n",
"| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |\n",
"\n", "5 rows × 31 columns
\n", "