Introduction
In this tutorial, we will go through an example of updating a preexisting model. This can be useful when you come across additional data that you want to incorporate without having to train a model from scratch.
The main abstraction that Lightwood offers for this is the BaseMixer.partial_fit()
method. To call it, you need to pass new training data and a held-out dev subset for internal mixer usage (e.g. early stopping). If you are using an aggregate ensemble, it’s likely you will want to do this for every single mixer. The convenient PredictorInterface.adjust()
does this automatically for you.
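As a quick orientation, the end-to-end flow looks roughly like the sketch below. The dataframe names (train_df, new_df, test_df) and the target name are placeholders; the full, runnable version of each step appears in the cells that follow.

from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, predictor_from_json_ai

# Train an initial predictor on the data available today
json_ai = json_ai_from_problem(train_df, ProblemDefinition.from_dict({'target': 'my_target'}))
predictor = predictor_from_json_ai(json_ai)
predictor.learn(train_df)

# Later, when new rows arrive, update the existing predictor instead of retraining from scratch
predictor.adjust(new_df, train_df)
predictions = predictor.predict(test_df)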
Initial model training
First, let’s train a Lightwood predictor for the concrete strength
dataset:
[1]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, predictor_from_json_ai
import pandas as pd
INFO:lightwood-2077:No torchvision detected, image helpers not supported.
INFO:lightwood-2077:No torchvision/pillow detected, image encoder not supported
[2]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/main/tests/data/concrete_strength.csv')
df = df.sample(frac=1, random_state=1)
train_df = df[:int(0.1*len(df))]
update_df = df[int(0.1*len(df)):int(0.8*len(df))]
test_df = df[int(0.8*len(df)):]
print(f'Train dataframe shape: {train_df.shape}')
print(f'Update dataframe shape: {update_df.shape}')
print(f'Test dataframe shape: {test_df.shape}')
Train dataframe shape: (103, 10)
Update dataframe shape: (721, 10)
Test dataframe shape: (206, 10)
Note that we have three different data splits.
We will use the training
split for the initial model training. As you can see, it’s only 10% of the total data we have. The update
split will be used as training data to adjust/update our model. Finally, the held-out test
set will give us a rough idea of the impact our updating procedure has on the model’s predictive capabilities.
[3]:
# Define predictive task and predictor
target = 'concrete_strength'
pdef = ProblemDefinition.from_dict({'target': target, 'time_aim': 200})
jai = json_ai_from_problem(df, pdef)
# We will keep the architecture simple: a single neural mixer, and a `BestOf` ensemble:
jai.model = {
"module": "BestOf",
"args": {
"args": "$pred_args",
"accuracy_functions": "$accuracy_functions",
"submodels": [{
"module": "Neural",
"args": {
"fit_on_dev": False,
"stop_after": "$problem_definition.seconds_per_mixer",
"search_hyperparameters": False,
}
}]
}
}
# Build and train the predictor
predictor = predictor_from_json_ai(jai)
predictor.learn(train_df)
INFO:type_infer-2077:Analyzing a sample of 979
INFO:type_infer-2077:from a total population of 1030, this is equivalent to 95.0% of your data.
INFO:type_infer-2077:Using 3 processes to deduct types.
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.
The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.
See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.
If you really know what your doing, you can silence this warning with the warning module
or by setting POLARS_ALLOW_FORKING_THREAD=1.
self.pid = os.fork()
INFO:type_infer-2077:Infering type for: cement
INFO:type_infer-2077:Infering type for: slag
INFO:type_infer-2077:Column cement has data type float
INFO:type_infer-2077:Column slag has data type float
INFO:type_infer-2077:Infering type for: flyAsh
INFO:type_infer-2077:Infering type for: water
INFO:type_infer-2077:Column flyAsh has data type float
INFO:type_infer-2077:Column water has data type float
INFO:type_infer-2077:Infering type for: superPlasticizer
INFO:type_infer-2077:Infering type for: coarseAggregate
INFO:type_infer-2077:Column superPlasticizer has data type float
INFO:type_infer-2077:Column coarseAggregate has data type float
INFO:type_infer-2077:Infering type for: fineAggregate
INFO:type_infer-2077:Infering type for: age
INFO:type_infer-2077:Column fineAggregate has data type float
INFO:type_infer-2077:Column age has data type integer
INFO:type_infer-2077:Infering type for: concrete_strength
INFO:type_infer-2077:Column concrete_strength has data type float
INFO:type_infer-2077:Infering type for: id
INFO:type_infer-2077:Column id has data type integer
INFO:dataprep_ml-2077:Starting statistical analysis
INFO:dataprep_ml-2077:Finished statistical analysis
INFO:dataprep_ml-2077:[Learn phase 1/8] - Statistical analysis
INFO:dataprep_ml-2077:Starting statistical analysis
INFO:dataprep_ml-2077:Finished statistical analysis
DEBUG:lightwood-2077: `analyze_data` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Learn phase 2/8] - Data preprocessing
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Learn phase 3/8] - Data splitting
INFO:dataprep_ml-2077:Splitting the data into train/test
DEBUG:lightwood-2077: `split` runtime: 0.0 seconds
INFO:dataprep_ml-2077:[Learn phase 4/8] - Preparing encoders
DEBUG:dataprep_ml-2077:Preparing sequentially...
DEBUG:dataprep_ml-2077:Preparing encoder for id...
DEBUG:dataprep_ml-2077:Preparing encoder for cement...
DEBUG:dataprep_ml-2077:Preparing encoder for slag...
DEBUG:dataprep_ml-2077:Preparing encoder for flyAsh...
DEBUG:dataprep_ml-2077:Preparing encoder for water...
DEBUG:dataprep_ml-2077:Preparing encoder for superPlasticizer...
DEBUG:dataprep_ml-2077:Preparing encoder for coarseAggregate...
DEBUG:dataprep_ml-2077:Preparing encoder for fineAggregate...
DEBUG:dataprep_ml-2077:Preparing encoder for age...
DEBUG:lightwood-2077: `prepare` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Learn phase 5/8] - Feature generation
INFO:dataprep_ml-2077:Featurizing the data
DEBUG:lightwood-2077: `featurize` runtime: 0.06 seconds
INFO:dataprep_ml-2077:[Learn phase 6/8] - Mixer training
INFO:dataprep_ml-2077:Training the mixers
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:124: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = GradScaler()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:132: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
addcmul_(Tensor tensor1, Tensor tensor2, *, Number value = 1) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1642.)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
INFO:lightwood-2077:Loss of 39.99637508392334 with learning rate 0.0001
INFO:lightwood-2077:Loss of 21.826460361480713 with learning rate 0.0005
INFO:lightwood-2077:Loss of 15.12899512052536 with learning rate 0.001
INFO:lightwood-2077:Loss of 15.062753021717072 with learning rate 0.002
INFO:lightwood-2077:Loss of 26.490495562553406 with learning rate 0.003
INFO:lightwood-2077:Loss of 33.6572003364563 with learning rate 0.005
INFO:lightwood-2077:Loss of 303.60721158981323 with learning rate 0.01
INFO:lightwood-2077:Loss of nan with learning rate 0.05
INFO:lightwood-2077:Found learning rate of: 0.002
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:305: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = GradScaler()
INFO:lightwood-2077:Loss @ epoch 1: 0.11838734149932861
INFO:lightwood-2077:Loss @ epoch 2: 0.4641949534416199
INFO:lightwood-2077:Loss @ epoch 3: 0.3976145386695862
INFO:lightwood-2077:Loss @ epoch 4: 0.3706841468811035
INFO:lightwood-2077:Loss @ epoch 5: 0.2367912232875824
INFO:lightwood-2077:Loss @ epoch 6: 0.22560915350914001
INFO:lightwood-2077:Loss @ epoch 7: 0.12089195847511292
DEBUG:lightwood-2077: `fit_mixer` runtime: 0.16 seconds
INFO:dataprep_ml-2077:Ensembling the mixer
INFO:lightwood-2077:Mixer: Neural got accuracy: 0.238
INFO:lightwood-2077:Picked best mixer: Neural
DEBUG:lightwood-2077: `fit` runtime: 0.17 seconds
INFO:dataprep_ml-2077:[Learn phase 7/8] - Ensemble analysis
INFO:dataprep_ml-2077:Analyzing the ensemble of mixers
INFO:lightwood-2077:The block ICP is now running its analyze() method
INFO:lightwood-2077:The block ConfStats is now running its analyze() method
INFO:lightwood-2077:The block AccStats is now running its analyze() method
INFO:lightwood-2077:The block PermutationFeatureImportance is now running its analyze() method
INFO:lightwood-2077:[PFI] Using a random sample (1000 rows out of 10).
INFO:lightwood-2077:[PFI] Set to consider first 10 columns out of 9: ['id', 'cement', 'slag', 'flyAsh', 'water', 'superPlasticizer', 'coarseAggregate', 'fineAggregate', 'age'].
DEBUG:lightwood-2077: `analyze_ensemble` runtime: 0.15 seconds
INFO:dataprep_ml-2077:[Learn phase 8/8] - Adjustment on validation requested
INFO:dataprep_ml-2077:Updating the mixers
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:335: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = GradScaler()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:132: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
INFO:lightwood-2077:Loss @ epoch 1: 0.1678172747294108
DEBUG:lightwood-2077: `adjust` runtime: 0.03 seconds
DEBUG:lightwood-2077: `learn` runtime: 0.45 seconds
[4]:
# Get predictions for the held-out test set
predictions = predictor.predict(test_df)
predictions
INFO:dataprep_ml-2077:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2077:Featurizing the data
DEBUG:lightwood-2077: `featurize` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 3/4] - Calling ensemble
DEBUG:lightwood-2077: `_timed_call` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2077:The block ICP is now running its explain() method
INFO:lightwood-2077:The block ConfStats is now running its explain() method
INFO:lightwood-2077:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block AccStats is now running its explain() method
INFO:lightwood-2077:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block PermutationFeatureImportance is now running its explain() method
INFO:lightwood-2077:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-2077: `explain` runtime: 0.05 seconds
DEBUG:lightwood-2077: `predict` runtime: 0.13 seconds
[4]:
|     | original_index | prediction | confidence | lower    | upper     |
|-----|----------------|------------|------------|----------|-----------|
| 0   | 0              | 40.909630  | 0.9991     | 0.000000 | 87.398161 |
| 1   | 1              | 19.146822  | 0.9991     | 0.000000 | 65.635353 |
| 2   | 2              | 22.482294  | 0.9991     | 0.000000 | 68.970825 |
| 3   | 3              | 19.593765  | 0.9991     | 0.000000 | 66.082296 |
| 4   | 4              | 31.724537  | 0.9991     | 0.000000 | 78.213068 |
| ... | ...            | ...        | ...        | ...      | ...       |
| 201 | 201            | 50.553104  | 0.9991     | 4.064574 | 97.041635 |
| 202 | 202            | 48.580425  | 0.9991     | 2.091895 | 95.068956 |
| 203 | 203            | 30.114187  | 0.9991     | 0.000000 | 76.602718 |
| 204 | 204            | 25.676003  | 0.9991     | 0.000000 | 72.164533 |
| 205 | 205            | 41.231636  | 0.9991     | 0.000000 | 87.720167 |
206 rows × 5 columns
Updating the predictor
For this, we have two options:

1. BaseMixer.partial_fit()

Updates a single mixer. You need to pass the new data wrapped in EncodedDs objects.

Arguments:

* train_data: EncodedDs
* dev_data: EncodedDs
* adjust_args: Optional[dict] - any arguments the mixer needs when adjusting to the new data.

If the mixer does not need a dev_data partition, pass a dummy:

dev_data = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)

A sketch of this lower-level route is shown after this list.

2. PredictorInterface.adjust()

Updates all mixers inside the predictor by calling their respective partial_fit() methods. Any adjust_args will be transparently passed along as well.

Arguments:

* new_data: pd.DataFrame
* old_data: Optional[pd.DataFrame]
* adjust_args: Optional[dict]
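For reference, here is a rough sketch of the lower-level BaseMixer.partial_fit() route. It assumes EncodedDs is importable from lightwood.data and that the trained mixers are reachable via predictor.mixers; both are assumptions about internals that may vary between versions. Also note that adjust() additionally cleans/preprocesses the new data for you, so prefer it unless you need per-mixer control.

import pandas as pd
from lightwood.data import EncodedDs  # assumed import path

# Wrap the new rows with the predictor's already-prepared encoders
train_ds = EncodedDs(predictor.encoders, update_df, predictor.target)
# This mixer does not require a dev partition here, so pass a dummy one
dev_ds = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)

# Update every mixer individually (adjust() does this loop, and more, for us)
for mixer in predictor.mixers:
    mixer.partial_fit(train_ds, dev_ds)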
Let’s adjust
our predictor:
[5]:
predictor.adjust(update_df, train_df) # data to adjust and original data
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.02 seconds
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:Updating the mixers
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:335: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = GradScaler()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:132: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
INFO:lightwood-2077:Loss @ epoch 1: 0.10915952424208324
DEBUG:lightwood-2077: `adjust` runtime: 0.11 seconds
[6]:
new_predictions = predictor.predict(test_df)
new_predictions
INFO:dataprep_ml-2077:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2077:Featurizing the data
DEBUG:lightwood-2077: `featurize` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 3/4] - Calling ensemble
DEBUG:lightwood-2077: `_timed_call` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2077:The block ICP is now running its explain() method
INFO:lightwood-2077:The block ConfStats is now running its explain() method
INFO:lightwood-2077:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block AccStats is now running its explain() method
INFO:lightwood-2077:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block PermutationFeatureImportance is now running its explain() method
INFO:lightwood-2077:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-2077: `explain` runtime: 0.05 seconds
DEBUG:lightwood-2077: `predict` runtime: 0.13 seconds
[6]:
|     | original_index | prediction | confidence | lower    | upper     |
|-----|----------------|------------|------------|----------|-----------|
| 0   | 0              | 43.645542  | 0.9991     | 0.000000 | 90.134073 |
| 1   | 1              | 26.964903  | 0.9991     | 0.000000 | 73.453434 |
| 2   | 2              | 24.151918  | 0.9991     | 0.000000 | 70.640449 |
| 3   | 3              | 20.815800  | 0.9991     | 0.000000 | 67.304330 |
| 4   | 4              | 34.987530  | 0.9991     | 0.000000 | 81.476060 |
| ... | ...            | ...        | ...        | ...      | ...       |
| 201 | 201            | 52.630058  | 0.9991     | 6.141528 | 99.118589 |
| 202 | 202            | 39.175228  | 0.9991     | 0.000000 | 85.663759 |
| 203 | 203            | 33.047440  | 0.9991     | 0.000000 | 79.535970 |
| 204 | 204            | 28.659138  | 0.9991     | 0.000000 | 75.147668 |
| 205 | 205            | 34.264580  | 0.9991     | 0.000000 | 80.753111 |
206 rows × 5 columns
Nice! Our predictor was updated, and new predictions are looking good. Let’s compare the old and new accuracies to complete the experiment:
[7]:
from sklearn.metrics import r2_score
import numpy as np
old_acc = r2_score(test_df['concrete_strength'], predictions['prediction'])
new_acc = r2_score(test_df['concrete_strength'], new_predictions['prediction'])
print(f'Old Accuracy: {round(old_acc, 3)}\nNew Accuracy: {round(new_acc, 3)}')
Old Accuracy: 0.233
New Accuracy: 0.428
Conclusion
We have gone through a simple example of how Lightwood predictors can leverage newly acquired data to improve their predictions. The interface for doing so is fairly simple, requiring only the new data and a single call to adjust().
You can further customize the update logic by overriding the partial_fit() method of your mixers, as in the sketch below.
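For illustration, here is a minimal sketch of that idea: a custom mixer that subclasses the built-in Neural mixer and adds its own behavior around the update step. The class name, the logging line, the import paths and the exact partial_fit() signature (taken from the argument list described earlier) are assumptions and may need adapting to your Lightwood version.

from typing import Optional

from lightwood.mixer import Neural    # assumed import path for the built-in mixer
from lightwood.data import EncodedDs  # assumed import path

class MyNeural(Neural):
    """Illustrative mixer that customizes the update step."""

    def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs,
                    adjust_args: Optional[dict] = None) -> None:
        # Custom logic: inspect the incoming batch before updating
        print(f'Adjusting on {len(train_data)} new rows')
        # Defer to the default neural update logic
        super().partial_fit(train_data, dev_data)

To actually use such a mixer, you would point the "module" field of a submodel in your JsonAI at it, in the same way the Neural module is referenced above; depending on your setup, the class may also need to be importable from the generated predictor code.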