Introduction

In this tutorial, we will go through an example of updating a preexisting model. This can be useful when you come across additional data that you want the model to take into account, without having to train it from scratch.

The main abstraction that Lightwood offers for this is the BaseMixer.partial_fit() method. To call it, you need to pass new training data and a held-out dev subset for internal mixer usage (e.g. early stopping). If you are using an aggregate ensemble, you will likely want to do this for every single mixer; the convenient PredictorInterface.adjust() method does this automatically for you.

Initial model training

First, let’s train a Lightwood predictor for the concrete strength dataset:

[1]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, predictor_from_json_ai
import pandas as pd
INFO:lightwood-2077:No torchvision detected, image helpers not supported.
INFO:lightwood-2077:No torchvision/pillow detected, image encoder not supported
[2]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/main/tests/data/concrete_strength.csv')

df = df.sample(frac=1, random_state=1)
train_df = df[:int(0.1*len(df))]
update_df = df[int(0.1*len(df)):int(0.8*len(df))]
test_df = df[int(0.8*len(df)):]

print(f'Train dataframe shape: {train_df.shape}')
print(f'Update dataframe shape: {update_df.shape}')
print(f'Test dataframe shape: {test_df.shape}')
Train dataframe shape: (103, 10)
Update dataframe shape: (721, 10)
Test dataframe shape: (206, 10)

Note that we have three different data splits.

We will use the training split for the initial model training. As you can see, it amounts to only 10% of the total data we have. The update split will be used as training data to adjust/update our model. Finally, the held-out test set will give us a rough idea of the impact our updating procedure has on the model's predictive capabilities.

[3]:
# Define predictive task and predictor
target = 'concrete_strength'
pdef = ProblemDefinition.from_dict({'target': target, 'time_aim': 200})
jai = json_ai_from_problem(df, pdef)

# We will keep the architecture simple: a single neural mixer, and a `BestOf` ensemble:
jai.model = {
    "module": "BestOf",
    "args": {
        "args": "$pred_args",
        "accuracy_functions": "$accuracy_functions",
        "submodels": [{
            "module": "Neural",
            "args": {
                "fit_on_dev": False,
                "stop_after": "$problem_definition.seconds_per_mixer",
                "search_hyperparameters": False,
            }
        }]
    }
}

# Build and train the predictor
predictor = predictor_from_json_ai(jai)
predictor.learn(train_df)
INFO:type_infer-2077:Analyzing a sample of 979
INFO:type_infer-2077:from a total population of 1030, this is equivalent to 95.0% of your data.
INFO:type_infer-2077:Using 3 processes to deduct types.
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

If you really know what your doing, you can silence this warning with the warning module
or by setting POLARS_ALLOW_FORKING_THREAD=1.

  self.pid = os.fork()
INFO:type_infer-2077:Infering type for: cement
INFO:type_infer-2077:Infering type for: slag
INFO:type_infer-2077:Column cement has data type float
INFO:type_infer-2077:Column slag has data type float
INFO:type_infer-2077:Infering type for: flyAsh
INFO:type_infer-2077:Infering type for: water
INFO:type_infer-2077:Column flyAsh has data type float
INFO:type_infer-2077:Column water has data type float
INFO:type_infer-2077:Infering type for: superPlasticizer
INFO:type_infer-2077:Infering type for: coarseAggregate
INFO:type_infer-2077:Column superPlasticizer has data type float
INFO:type_infer-2077:Column coarseAggregate has data type float
INFO:type_infer-2077:Infering type for: fineAggregate
INFO:type_infer-2077:Infering type for: age
INFO:type_infer-2077:Column fineAggregate has data type float
INFO:type_infer-2077:Column age has data type integer
INFO:type_infer-2077:Infering type for: concrete_strength
INFO:type_infer-2077:Column concrete_strength has data type float
INFO:type_infer-2077:Infering type for: id
INFO:type_infer-2077:Column id has data type integer
INFO:dataprep_ml-2077:Starting statistical analysis
INFO:dataprep_ml-2077:Finished statistical analysis
INFO:dataprep_ml-2077:[Learn phase 1/8] - Statistical analysis
INFO:dataprep_ml-2077:Starting statistical analysis
INFO:dataprep_ml-2077:Finished statistical analysis
DEBUG:lightwood-2077: `analyze_data` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Learn phase 2/8] - Data preprocessing
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Learn phase 3/8] - Data splitting
INFO:dataprep_ml-2077:Splitting the data into train/test
DEBUG:lightwood-2077: `split` runtime: 0.0 seconds
INFO:dataprep_ml-2077:[Learn phase 4/8] - Preparing encoders
DEBUG:dataprep_ml-2077:Preparing sequentially...
DEBUG:dataprep_ml-2077:Preparing encoder for id...
DEBUG:dataprep_ml-2077:Preparing encoder for cement...
DEBUG:dataprep_ml-2077:Preparing encoder for slag...
DEBUG:dataprep_ml-2077:Preparing encoder for flyAsh...
DEBUG:dataprep_ml-2077:Preparing encoder for water...
DEBUG:dataprep_ml-2077:Preparing encoder for superPlasticizer...
DEBUG:dataprep_ml-2077:Preparing encoder for coarseAggregate...
DEBUG:dataprep_ml-2077:Preparing encoder for fineAggregate...
DEBUG:dataprep_ml-2077:Preparing encoder for age...
DEBUG:lightwood-2077: `prepare` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Learn phase 5/8] - Feature generation
INFO:dataprep_ml-2077:Featurizing the data
DEBUG:lightwood-2077: `featurize` runtime: 0.06 seconds
INFO:dataprep_ml-2077:[Learn phase 6/8] - Mixer training
INFO:dataprep_ml-2077:Training the mixers
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:124: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = GradScaler()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:132: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value = 1) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1642.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
INFO:lightwood-2077:Loss of 39.99637508392334 with learning rate 0.0001
INFO:lightwood-2077:Loss of 21.826460361480713 with learning rate 0.0005
INFO:lightwood-2077:Loss of 15.12899512052536 with learning rate 0.001
INFO:lightwood-2077:Loss of 15.062753021717072 with learning rate 0.002
INFO:lightwood-2077:Loss of 26.490495562553406 with learning rate 0.003
INFO:lightwood-2077:Loss of 33.6572003364563 with learning rate 0.005
INFO:lightwood-2077:Loss of 303.60721158981323 with learning rate 0.01
INFO:lightwood-2077:Loss of nan with learning rate 0.05
INFO:lightwood-2077:Found learning rate of: 0.002
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:305: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = GradScaler()
INFO:lightwood-2077:Loss @ epoch 1: 0.11838734149932861
INFO:lightwood-2077:Loss @ epoch 2: 0.4641949534416199
INFO:lightwood-2077:Loss @ epoch 3: 0.3976145386695862
INFO:lightwood-2077:Loss @ epoch 4: 0.3706841468811035
INFO:lightwood-2077:Loss @ epoch 5: 0.2367912232875824
INFO:lightwood-2077:Loss @ epoch 6: 0.22560915350914001
INFO:lightwood-2077:Loss @ epoch 7: 0.12089195847511292
DEBUG:lightwood-2077: `fit_mixer` runtime: 0.16 seconds
INFO:dataprep_ml-2077:Ensembling the mixer
INFO:lightwood-2077:Mixer: Neural got accuracy: 0.238
INFO:lightwood-2077:Picked best mixer: Neural
DEBUG:lightwood-2077: `fit` runtime: 0.17 seconds
INFO:dataprep_ml-2077:[Learn phase 7/8] - Ensemble analysis
INFO:dataprep_ml-2077:Analyzing the ensemble of mixers
INFO:lightwood-2077:The block ICP is now running its analyze() method
INFO:lightwood-2077:The block ConfStats is now running its analyze() method
INFO:lightwood-2077:The block AccStats is now running its analyze() method
INFO:lightwood-2077:The block PermutationFeatureImportance is now running its analyze() method
INFO:lightwood-2077:[PFI] Using a random sample (1000 rows out of 10).
INFO:lightwood-2077:[PFI] Set to consider first 10 columns out of 9: ['id', 'cement', 'slag', 'flyAsh', 'water', 'superPlasticizer', 'coarseAggregate', 'fineAggregate', 'age'].
DEBUG:lightwood-2077: `analyze_ensemble` runtime: 0.15 seconds
INFO:dataprep_ml-2077:[Learn phase 8/8] - Adjustment on validation requested
INFO:dataprep_ml-2077:Updating the mixers
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:335: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = GradScaler()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:132: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
INFO:lightwood-2077:Loss @ epoch 1: 0.1678172747294108
DEBUG:lightwood-2077: `adjust` runtime: 0.03 seconds
DEBUG:lightwood-2077: `learn` runtime: 0.45 seconds
[4]:
# Get predictions for the held-out test set
predictions = predictor.predict(test_df)
predictions
INFO:dataprep_ml-2077:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2077:Featurizing the data
DEBUG:lightwood-2077: `featurize` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 3/4] - Calling ensemble
DEBUG:lightwood-2077: `_timed_call` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2077:The block ICP is now running its explain() method
INFO:lightwood-2077:The block ConfStats is now running its explain() method
INFO:lightwood-2077:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block AccStats is now running its explain() method
INFO:lightwood-2077:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block PermutationFeatureImportance is now running its explain() method
INFO:lightwood-2077:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-2077: `explain` runtime: 0.05 seconds
DEBUG:lightwood-2077: `predict` runtime: 0.13 seconds
[4]:
original_index prediction confidence lower upper
0 0 40.909630 0.9991 0.000000 87.398161
1 1 19.146822 0.9991 0.000000 65.635353
2 2 22.482294 0.9991 0.000000 68.970825
3 3 19.593765 0.9991 0.000000 66.082296
4 4 31.724537 0.9991 0.000000 78.213068
... ... ... ... ... ...
201 201 50.553104 0.9991 4.064574 97.041635
202 202 48.580425 0.9991 2.091895 95.068956
203 203 30.114187 0.9991 0.000000 76.602718
204 204 25.676003 0.9991 0.000000 72.164533
205 205 41.231636 0.9991 0.000000 87.720167

206 rows × 5 columns

Updating the predictor

For this, we have two options:

BaseMixer.partial_fit()

Updates a single mixer. You need to pass the new data wrapped in EncodedDs objects.

Arguments:

  • train_data: EncodedDs

  • dev_data: EncodedDs

  • adjust_args: Optional[dict] - any arguments the mixer needs when adjusting to the new data.

If the mixer does not need a dev_data partition, pass a dummy:

dev_data = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)
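
For illustration, here is a minimal sketch of what a direct call could look like, following the argument list above. Note that predictor.mixers and passing the raw update dataframe straight to EncodedDs are simplifications made for this sketch; the full predictor pipeline cleans and preprocesses the data before encoding it.

from lightwood.data import EncodedDs

# Wrap the new data with the predictor's fitted encoders
train_data = EncodedDs(predictor.encoders, update_df, predictor.target)
# Dummy dev split, as described above
dev_data = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)

# Update a single mixer (the Neural mixer trained earlier in this tutorial)
predictor.mixers[0].partial_fit(train_data, dev_data)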

PredictorInterface.adjust()

Updates all mixers inside the predictor by calling their respective partial_fit() methods. Any adjust_args will be transparently passed as well.

Arguments:

  • new_data: pd.DataFrame

  • old_data: Optional[pd.DataFrame]

  • adjust_args: Optional[dict]

Let’s adjust our predictor:

[5]:
predictor.adjust(update_df, train_df)  # new data to adjust with, followed by the original training data
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.02 seconds
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:Updating the mixers
/home/runner/work/lightwood/lightwood/lightwood/mixer/neural.py:335: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = GradScaler()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/amp/grad_scaler.py:132: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
INFO:lightwood-2077:Loss @ epoch 1: 0.10915952424208324
DEBUG:lightwood-2077: `adjust` runtime: 0.11 seconds
[6]:
new_predictions = predictor.predict(test_df)
new_predictions
INFO:dataprep_ml-2077:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2077:Cleaning the data
DEBUG:lightwood-2077: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2077:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2077:Featurizing the data
DEBUG:lightwood-2077: `featurize` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 3/4] - Calling ensemble
DEBUG:lightwood-2077: `_timed_call` runtime: 0.03 seconds
INFO:dataprep_ml-2077:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2077:The block ICP is now running its explain() method
INFO:lightwood-2077:The block ConfStats is now running its explain() method
INFO:lightwood-2077:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block AccStats is now running its explain() method
INFO:lightwood-2077:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2077:The block PermutationFeatureImportance is now running its explain() method
INFO:lightwood-2077:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-2077: `explain` runtime: 0.05 seconds
DEBUG:lightwood-2077: `predict` runtime: 0.13 seconds
[6]:
original_index prediction confidence lower upper
0 0 43.645542 0.9991 0.000000 90.134073
1 1 26.964903 0.9991 0.000000 73.453434
2 2 24.151918 0.9991 0.000000 70.640449
3 3 20.815800 0.9991 0.000000 67.304330
4 4 34.987530 0.9991 0.000000 81.476060
... ... ... ... ... ...
201 201 52.630058 0.9991 6.141528 99.118589
202 202 39.175228 0.9991 0.000000 85.663759
203 203 33.047440 0.9991 0.000000 79.535970
204 204 28.659138 0.9991 0.000000 75.147668
205 205 34.264580 0.9991 0.000000 80.753111

206 rows × 5 columns

Nice! Our predictor was updated, and the new predictions look good. Let's compare the old and new accuracies to complete the experiment:

[7]:
from sklearn.metrics import r2_score

# Use the R2 score as the accuracy metric for this regression task
old_acc = r2_score(test_df['concrete_strength'], predictions['prediction'])
new_acc = r2_score(test_df['concrete_strength'], new_predictions['prediction'])

print(f'Old Accuracy: {round(old_acc, 3)}\nNew Accuracy: {round(new_acc, 3)}')
Old Accuracy: 0.233
New Accuracy: 0.428

Conclusion

We have gone through a simple example of how Lightwood predictors can leverage newly acquired data to improve their predictions. The interface for doing so is fairly simple, requiring only some new data and a single call to PredictorInterface.adjust().

You can further customize the update logic by overriding the partial_fit() method of your mixers, as sketched below.
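
As a rough sketch (assuming the partial_fit() signature listed earlier, and with CustomNeural as a purely hypothetical name), such a customization could look like this:

from typing import Optional
from lightwood.mixer import Neural
from lightwood.data import EncodedDs

class CustomNeural(Neural):
    """Hypothetical mixer that adds custom logic around the stock update step."""
    def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, adjust_args: Optional[dict] = None) -> None:
        # e.g. inspect adjust_args, freeze layers, or shorten the update budget here,
        # then delegate to the standard Neural update logic
        super().partial_fit(train_data, dev_data, adjust_args)

You would then reference this mixer in your JsonAI definition in place of the stock Neural module; Lightwood's support for user-defined modules makes this possible, although that is beyond the scope of this tutorial.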