{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "In this tutorial, we will walk through an example of updating a preexisting model. This is useful when you come across additional data that you want the model to take into account, without having to train a new model from scratch.\n", "\n", "The main abstraction that Lightwood offers for this is the `BaseMixer.partial_fit()` method. To call it, you need to pass new training data and a held-out dev subset for internal mixer usage (e.g. early stopping). If you are using an aggregate ensemble, it's likely you will want to do this for every single mixer. The convenient `PredictorInterface.adjust()` method does this automatically for you.\n", "\n", "\n", "# Initial model training\n", "\n", "First, let's train a Lightwood predictor for the `concrete strength` dataset:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:36.597254Z", "iopub.status.busy": "2024-05-15T12:38:36.597058Z", "iopub.status.idle": "2024-05-15T12:38:39.444383Z", "shell.execute_reply": "2024-05-15T12:38:39.443711Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:No torchvision detected, image helpers not supported.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:No torchvision/pillow detected, image encoder not supported\u001b[0m\n" ] } ], "source": [ "from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, predictor_from_json_ai\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:39.447713Z", "iopub.status.busy": "2024-05-15T12:38:39.447197Z", "iopub.status.idle": "2024-05-15T12:38:39.569705Z", "shell.execute_reply": "2024-05-15T12:38:39.568972Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train dataframe 
shape: (103, 10)\n", "Update dataframe shape: (721, 10)\n", "Test dataframe shape: (206, 10)\n" ] } ], "source": [ "# Load data\n", "df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/staging/tests/data/concrete_strength.csv')\n", "\n", "# Shuffle the data, then split it: 10% train, 70% update, 20% test\n", "df = df.sample(frac=1, random_state=1)\n", "train_df = df[:int(0.1*len(df))]\n", "update_df = df[int(0.1*len(df)):int(0.8*len(df))]\n", "test_df = df[int(0.8*len(df)):]\n", "\n", "print(f'Train dataframe shape: {train_df.shape}')\n", "print(f'Update dataframe shape: {update_df.shape}')\n", "print(f'Test dataframe shape: {test_df.shape}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have three different data splits.\n", "\n", "We will use the `train` split for the initial model training. As you can see, it's only 10% of the total data we have. The `update` split will be used as training data to adjust/update our model. Finally, the held-out `test` set will give us a rough idea of the impact our updating procedure has on the model's predictive capabilities." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:39.572420Z", "iopub.status.busy": "2024-05-15T12:38:39.572012Z", "iopub.status.idle": "2024-05-15T12:38:41.013683Z", "shell.execute_reply": "2024-05-15T12:38:41.013049Z" }, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Analyzing a sample of 979\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:from a total population of 1030, this is equivalent to 95.0% of your data.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Using 3 processes to deduct types.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: cement\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: slag\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column slag has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column cement has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: water\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: flyAsh\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column water has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column flyAsh has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: superPlasticizer\u001b[0m\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: coarseAggregate\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: id\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column coarseAggregate has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column superPlasticizer has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column id has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: fineAggregate\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: age\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Infering type for: concrete_strength\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column age has data type integer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column fineAggregate has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:type_infer-2649:Column concrete_strength has data type float\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Finished statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 1/8] - Statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\u001b[32mINFO:dataprep_ml-2649:Starting statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Finished statistical analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `analyze_data` runtime: 0.02 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 2/8] - Data preprocessing\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `preprocess` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 3/8] - Data splitting\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Splitting the data into train/test\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `split` runtime: 0.0 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 4/8] - Preparing encoders\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing sequentially...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for id...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for cement...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for slag...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for flyAsh...\u001b[0m\n" ] }, 
{ "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for water...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for superPlasticizer...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for coarseAggregate...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for fineAggregate...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:dataprep_ml-2649:Preparing encoder for age...\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `prepare` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 5/8] - Feature generation\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Featurizing the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `featurize` runtime: 0.06 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 6/8] - Mixer training\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Training the mixers\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. 
Disabling.\n", " warnings.warn(\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:\n", "\taddcmul_(Number value, Tensor tensor1, Tensor tensor2)\n", "Consider using one of the following signatures instead:\n", "\taddcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1578.)\n", " exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)\n", "\u001b[32mINFO:lightwood-2649:Loss of 39.99637508392334 with learning rate 0.0001\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of 21.826460361480713 with learning rate 0.0005\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of 15.12899512052536 with learning rate 0.001\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of 15.062753021717072 with learning rate 0.002\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of 26.490495562553406 with learning rate 0.003\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of 33.6572003364563 with learning rate 0.005\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of 303.60721158981323 with learning rate 0.01\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss of nan with learning rate 0.05\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Found learning rate of: 0.002\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 1: 
0.11838734149932861\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 2: 0.4641949534416199\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 3: 0.3976145386695862\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 4: 0.3706841468811035\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 5: 0.2367912232875824\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 6: 0.22560915350914001\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 7: 0.12089195847511292\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `fit_mixer` runtime: 0.53 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Ensembling the mixer\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Mixer: Neural got accuracy: 0.238\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Picked best mixer: Neural\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `fit` runtime: 0.54 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 7/8] - Ensemble analysis\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Analyzing the ensemble of mixers\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block ICP is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block ConfStats is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block AccStats is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block PermutationFeatureImportance is now running its analyze() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:[PFI] Using a random sample (1000 rows out of 10).\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:[PFI] Set to consider first 10 columns out of 9: ['id', 'cement', 'slag', 'flyAsh', 'water', 'superPlasticizer', 'coarseAggregate', 'fineAggregate', 'age'].\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `analyze_ensemble` runtime: 0.15 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Learn phase 8/8] - Adjustment on validation requested\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Updating the mixers\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. 
Disabling.\n", " warnings.warn(\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 1: 0.1678172747294108\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `adjust` runtime: 0.03 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `learn` runtime: 0.83 seconds\u001b[0m\n" ] } ], "source": [ "# Define predictive task and predictor\n", "target = 'concrete_strength'\n", "pdef = ProblemDefinition.from_dict({'target': target, 'time_aim': 200})\n", "jai = json_ai_from_problem(df, pdef)\n", "\n", "# We will keep the architecture simple: a single neural mixer, and a `BestOf` ensemble:\n", "jai.model = {\n", " \"module\": \"BestOf\",\n", " \"args\": {\n", " \"args\": \"$pred_args\",\n", " \"accuracy_functions\": \"$accuracy_functions\",\n", " \"submodels\": [{\n", " \"module\": \"Neural\",\n", " \"args\": {\n", " \"fit_on_dev\": False,\n", " \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n", " \"search_hyperparameters\": False,\n", " }\n", " }]\n", " }\n", "}\n", "\n", "# Build and train the predictor\n", "predictor = predictor_from_json_ai(jai)\n", "predictor.learn(train_df)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:41.016512Z", "iopub.status.busy": "2024-05-15T12:38:41.016092Z", "iopub.status.idle": "2024-05-15T12:38:41.160419Z", "shell.execute_reply": "2024-05-15T12:38:41.159845Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 1/4] - Data preprocessing\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `preprocess` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": 
"stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 2/4] - Feature generation\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Featurizing the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `featurize` runtime: 0.03 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 3/4] - Calling ensemble\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `_timed_call` runtime: 0.03 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 4/4] - Analyzing output\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block ICP is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block ConfStats is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block AccStats is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:AccStats.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block PermutationFeatureImportance is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"\u001b[32mINFO:lightwood-2649:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `explain` runtime: 0.05 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `predict` runtime: 0.13 seconds\u001b[0m\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>original_index</th><th>prediction</th><th>confidence</th><th>lower</th><th>upper</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>40.909630</td><td>0.9991</td><td>0.000000</td><td>87.398161</td></tr>\n", "    <tr><th>1</th><td>1</td><td>19.146822</td><td>0.9991</td><td>0.000000</td><td>65.635353</td></tr>\n", "    <tr><th>2</th><td>2</td><td>22.482294</td><td>0.9991</td><td>0.000000</td><td>68.970825</td></tr>\n", "    <tr><th>3</th><td>3</td><td>19.593765</td><td>0.9991</td><td>0.000000</td><td>66.082296</td></tr>\n", "    <tr><th>4</th><td>4</td><td>31.724537</td><td>0.9991</td><td>0.000000</td><td>78.213068</td></tr>\n", "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", "    <tr><th>201</th><td>201</td><td>50.553104</td><td>0.9991</td><td>4.064574</td><td>97.041635</td></tr>\n", "    <tr><th>202</th><td>202</td><td>48.580425</td><td>0.9991</td><td>2.091895</td><td>95.068956</td></tr>\n", "    <tr><th>203</th><td>203</td><td>30.114187</td><td>0.9991</td><td>0.000000</td><td>76.602718</td></tr>\n", "    <tr><th>204</th><td>204</td><td>25.676003</td><td>0.9991</td><td>0.000000</td><td>72.164533</td></tr>\n", "    <tr><th>205</th><td>205</td><td>41.231636</td><td>0.9991</td><td>0.000000</td><td>87.720167</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>206 rows × 5 columns</p>\n", "</div>" ], "text/plain": [ " original_index prediction confidence lower upper\n", "0 0 40.909630 0.9991 0.000000 87.398161\n", "1 1 19.146822 0.9991 0.000000 65.635353\n", "2 2 22.482294 0.9991 0.000000 68.970825\n", "3 3 19.593765 0.9991 0.000000 66.082296\n", "4 4 31.724537 0.9991 0.000000 78.213068\n", ".. ... ... ... ... ...\n", "201 201 50.553104 0.9991 4.064574 97.041635\n", "202 202 48.580425 0.9991 2.091895 95.068956\n", "203 203 30.114187 0.9991 0.000000 76.602718\n", "204 204 25.676003 0.9991 0.000000 72.164533\n", "205 205 41.231636 0.9991 0.000000 87.720167\n", "\n", "[206 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get predictions for the held-out test set\n", "predictions = predictor.predict(test_df)\n", "predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Updating the predictor\n", "\n", "For this, we have two options:\n", "\n", "### `BaseMixer.partial_fit()`\n", "\n", "Updates a single mixer. You need to pass the new data wrapped in `EncodedDs` objects.\n", "\n", "**Arguments:** \n", "* `train_data: EncodedDs`\n", "* `dev_data: EncodedDs`\n", "* `adjust_args: Optional[dict]` - This will contain any arguments needed by the mixer to adjust to new data.\n", "\n", "If the mixer does not need a `dev_data` partition, pass a dummy:\n", "\n", "```\n", "dev_data = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)\n", "```\n", "\n", "### `PredictorInterface.adjust()`\n", "\n", "Updates all mixers inside the predictor by calling their respective `partial_fit()` methods.
Any `adjust_args` will be transparently passed as well.\n", "\n", "**Arguments:**\n", "\n", "* `new_data: pd.DataFrame`\n", "* `old_data: Optional[pd.DataFrame]`\n", "* `adjust_args: Optional[dict]`\n", "\n", "Let's `adjust` our predictor:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:41.162902Z", "iopub.status.busy": "2024-05-15T12:38:41.162689Z", "iopub.status.idle": "2024-05-15T12:38:41.276226Z", "shell.execute_reply": "2024-05-15T12:38:41.275636Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `preprocess` runtime: 0.02 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `preprocess` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Updating the mixers\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. 
Disabling.\n", " warnings.warn(\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:Loss @ epoch 1: 0.10915952424208324\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `adjust` runtime: 0.11 seconds\u001b[0m\n" ] } ], "source": [ "predictor.adjust(update_df, train_df) # data to adjust and original data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:41.279047Z", "iopub.status.busy": "2024-05-15T12:38:41.278590Z", "iopub.status.idle": "2024-05-15T12:38:41.418298Z", "shell.execute_reply": "2024-05-15T12:38:41.417642Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 1/4] - Data preprocessing\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Cleaning the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `preprocess` runtime: 0.01 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 2/4] - Feature generation\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:Featurizing the data\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `featurize` runtime: 0.03 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 3/4] - Calling ensemble\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `_timed_call` runtime: 0.03 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:dataprep_ml-2649:[Predict phase 4/4] - Analyzing output\u001b[0m\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block ICP is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block ConfStats is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block AccStats is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:AccStats.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:The block PermutationFeatureImportance is now running its explain() method\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32mINFO:lightwood-2649:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `explain` runtime: 0.05 seconds\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[37mDEBUG:lightwood-2649: `predict` runtime: 0.13 seconds\u001b[0m\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>original_index</th><th>prediction</th><th>confidence</th><th>lower</th><th>upper</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0</td><td>43.645542</td><td>0.9991</td><td>0.000000</td><td>90.134073</td></tr>\n", "    <tr><th>1</th><td>1</td><td>26.964903</td><td>0.9991</td><td>0.000000</td><td>73.453434</td></tr>\n", "    <tr><th>2</th><td>2</td><td>24.151918</td><td>0.9991</td><td>0.000000</td><td>70.640449</td></tr>\n", "    <tr><th>3</th><td>3</td><td>20.815800</td><td>0.9991</td><td>0.000000</td><td>67.304330</td></tr>\n", "    <tr><th>4</th><td>4</td><td>34.987530</td><td>0.9991</td><td>0.000000</td><td>81.476060</td></tr>\n", "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", "    <tr><th>201</th><td>201</td><td>52.630058</td><td>0.9991</td><td>6.141528</td><td>99.118589</td></tr>\n", "    <tr><th>202</th><td>202</td><td>39.175228</td><td>0.9991</td><td>0.000000</td><td>85.663759</td></tr>\n", "    <tr><th>203</th><td>203</td><td>33.047440</td><td>0.9991</td><td>0.000000</td><td>79.535970</td></tr>\n", "    <tr><th>204</th><td>204</td><td>28.659138</td><td>0.9991</td><td>0.000000</td><td>75.147668</td></tr>\n", "    <tr><th>205</th><td>205</td><td>34.264580</td><td>0.9991</td><td>0.000000</td><td>80.753111</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>206 rows × 5 columns</p>\n", "</div>" ], "text/plain": [ " original_index prediction confidence lower upper\n", "0 0 43.645542 0.9991 0.000000 90.134073\n", "1 1 26.964903 0.9991 0.000000 73.453434\n", "2 2 24.151918 0.9991 0.000000 70.640449\n", "3 3 20.815800 0.9991 0.000000 67.304330\n", "4 4 34.987530 0.9991 0.000000 81.476060\n", ".. ... ... ... ... ...\n", "201 201 52.630058 0.9991 6.141528 99.118589\n", "202 202 39.175228 0.9991 0.000000 85.663759\n", "203 203 33.047440 0.9991 0.000000 79.535970\n", "204 204 28.659138 0.9991 0.000000 75.147668\n", "205 205 34.264580 0.9991 0.000000 80.753111\n", "\n", "[206 rows x 5 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_predictions = predictor.predict(test_df)\n", "new_predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice! Our predictor was updated. Let's compare the old and new accuracy on the test set, measured here with the R² score, to complete the experiment:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-05-15T12:38:41.421175Z", "iopub.status.busy": "2024-05-15T12:38:41.420697Z", "iopub.status.idle": "2024-05-15T12:38:41.426502Z", "shell.execute_reply": "2024-05-15T12:38:41.425844Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Old Accuracy: 0.233\n", "New Accuracy: 0.428\n" ] } ], "source": [ "from sklearn.metrics import r2_score\n", "import numpy as np\n", "\n", "# R² score on the held-out test set, before and after the update\n", "old_acc = r2_score(test_df['concrete_strength'], predictions['prediction'])\n", "new_acc = r2_score(test_df['concrete_strength'], new_predictions['prediction'])\n", "\n", "print(f'Old Accuracy: {round(old_acc, 3)}\\nNew Accuracy: {round(new_acc, 3)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "We have gone through a simple example of how Lightwood predictors can leverage newly acquired data to improve their predictions.
The interface for doing so is fairly simple, requiring only some new data and a single call to `adjust()`.\n", "\n", "You can further customize the logic for updating your mixers by modifying their `partial_fit()` methods." ] } ], "metadata": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 4 }