Tutorial - Time series forecasting

Introduction

Time series are a ubiquitous type of data, arising in all sorts of processes. Producing forecasts for them can be highly valuable in domains like retail or industrial manufacturing, among many others.

Lightwood supports time series forecasting (both univariate and multivariate inputs), handling many of the pain points commonly associated with setting up a manual time series predictive pipeline.

In this tutorial, we will train a lightwood predictor and analyze its forecasts for the task of predicting monthly sunspot counts.

Load data

Let’s begin by loading the dataset and looking at it:

[1]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/monthly_sunspots/data.csv")
df
[1]:
Month Sunspots
0 1749-01 58.0
1 1749-02 62.6
2 1749-03 70.0
3 1749-04 55.7
4 1749-05 85.0
... ... ...
2815 1983-08 71.8
2816 1983-09 50.3
2817 1983-10 55.8
2818 1983-11 33.3
2819 1983-12 33.4

2820 rows × 2 columns

This is a very simple dataset. The ‘Month’ column specifies when each measurement was taken, and the ‘Sunspots’ column holds the quantity we are interested in forecasting. As such, we can characterize this as a univariate time series problem.
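Before defining the task, it can be worth sanity-checking that the time column parses cleanly and that the series has no gaps. Here is a quick check in plain pandas (this is independent of Lightwood, which performs its own cleaning):

import numpy as np

# Parse 'Month' as monthly periods and verify the series is ordered and gapless.
months = pd.PeriodIndex(df['Month'], freq='M')
print(months.is_monotonic_increasing)      # True -> rows already ordered by time
print((np.diff(months.asi8) == 1).all())   # True -> no missing months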

Define the predictive task

We will use Lightwood’s high-level methods to state what we want to predict. As this is a time series task (we want to leverage the notion of time when predicting), we need to specify a set of arguments that will activate Lightwood’s time series pipeline:

[2]:
from lightwood.api.high_level import ProblemDefinition
INFO:lightwood-2669:No torchvision detected, image helpers not supported.
INFO:lightwood-2669:No torchvision/pillow detected, image encoder not supported
[3]:
tss = {'horizon': 6,        # the predictor will learn to forecast what the next semester looks like (6 data points at monthly intervals -> 6 months)
       'order_by': 'Month', # the column used to order the entire dataset
       'window': 12         # how many past values to consider when emitting predictions
      }

pdef = ProblemDefinition.from_dict({'target': 'Sunspots',         # specify the column to forecast
                                    'timeseries_settings': tss    # pass along all time series specific parameters
                                   })
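To make `window` and `horizon` concrete: with these settings, each forecast is based on the 12 most recent values and spans the 6 steps that follow. Below is a rough sketch of that framing in plain pandas; it is purely illustrative, and Lightwood’s internal featurization differs:

window, horizon = 12, 6

# For each position t: inputs are the `window` previous values,
# targets are the current value plus the next `horizon - 1` values.
lags  = pd.concat({f'lag_{i}':  df['Sunspots'].shift(i)  for i in range(1, window + 1)}, axis=1)
steps = pd.concat({f'step_{j}': df['Sunspots'].shift(-j) for j in range(horizon)}, axis=1)
framed = pd.concat([lags, steps], axis=1).dropna()
print(framed.shape)  # one row per position with a full window and horizon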

Now, let’s do a very simple train-test split, leaving the last 10% of the data aside to check the forecasts that our predictor will produce:

[4]:
cutoff = int(len(df)*0.9)

train = df[:cutoff]
test = df[cutoff:]

print(train.shape, test.shape)
(2538, 2) (282, 2)
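Note that the split is chronological rather than shuffled, which is what you want for time series: the model should never see observations that come after the ones it is evaluated on. A quick assertion to that effect (plain string comparison works here because ‘YYYY-MM’ sorts lexicographically):

# The test period must come strictly after the training period.
assert train['Month'].max() < test['Month'].min()
print(train['Month'].iloc[-1], '->', test['Month'].iloc[0])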

Generate the predictor object

Now, we can generate code for a machine learning model by using our problem definition and the data:

[5]:
from lightwood.api.high_level import (
    json_ai_from_problem,
    code_from_json_ai,
    predictor_from_code
)

json_ai = json_ai_from_problem(df, problem_definition=pdef)
code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)

# uncomment this to see the generated code:
# print(code)
INFO:type_infer-2669:Analyzing a sample of 2467
INFO:type_infer-2669:from a total population of 2820, this is equivalent to 87.5% of your data.
INFO:type_infer-2669:Infering type for: Month
INFO:type_infer-2669:Column Month has data type date
INFO:type_infer-2669:Infering type for: Sunspots
INFO:type_infer-2669:Column Sunspots has data type float
INFO:dataprep_ml-2669:Starting statistical analysis
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/dataprep_ml/cleaners.py:163: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  result = pd.to_datetime(element,
INFO:dataprep_ml-2669:Finished statistical analysis
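Since `code` is just a string of Python source, you can also persist it to disk for inspection or version control (the filename below is arbitrary):

# Optional: dump the auto-generated predictor module for later review.
with open('sunspots_predictor.py', 'w') as f:
    f.write(code)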

Train

Okay, everything is now ready for our predictor to learn from the training data we provide.

Internally, Lightwood cleans and reshapes the data, featurizes measurements and timestamps, and trains a handful of different models, keeping the one that produces the best forecasts.

Let’s train the predictor. This should take a couple of minutes, at most:

[6]:
predictor.learn(train)
INFO:dataprep_ml-2669:[Learn phase 1/8] - Statistical analysis
INFO:dataprep_ml-2669:Starting statistical analysis
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/dataprep_ml/cleaners.py:163: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  result = pd.to_datetime(element,
INFO:dataprep_ml-2669:Finished statistical analysis
DEBUG:lightwood-2669: `analyze_data` runtime: 0.05 seconds
INFO:dataprep_ml-2669:[Learn phase 2/8] - Data preprocessing
INFO:dataprep_ml-2669:Cleaning the data
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/dataprep_ml/cleaners.py:163: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  result = pd.to_datetime(element,
INFO:dataprep_ml-2669:Transforming timeseries data
DEBUG:lightwood-2669: `preprocess` runtime: 0.09 seconds
INFO:dataprep_ml-2669:[Learn phase 3/8] - Data splitting
INFO:dataprep_ml-2669:Splitting the data into train/test
DEBUG:lightwood-2669: `split` runtime: 0.0 seconds
INFO:dataprep_ml-2669:[Learn phase 4/8] - Preparing encoders
DEBUG:dataprep_ml-2669:Preparing sequentially...
DEBUG:lightwood-2669: `prepare` runtime: 0.05 seconds
INFO:dataprep_ml-2669:[Learn phase 5/8] - Feature generation
INFO:dataprep_ml-2669:Featurizing the data
DEBUG:lightwood-2669: `featurize` runtime: 0.05 seconds
INFO:dataprep_ml-2669:[Learn phase 6/8] - Mixer training
INFO:dataprep_ml-2669:Training the mixers
WARNING:lightwood-2669:XGBoost running on CPU
WARNING:lightwood-2669:XGBoost running on CPU
WARNING:lightwood-2669:XGBoost running on CPU
WARNING:lightwood-2669:XGBoost running on CPU
WARNING:lightwood-2669:XGBoost running on CPU
WARNING:lightwood-2669:XGBoost running on CPU
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
[10:18:50] WARNING: ../src/learner.cc:339: No visible GPU is found, setting `gpu_id` to -1
[10:18:50] WARNING: ../src/learner.cc:339: No visible GPU is found, setting `gpu_id` to -1
[10:18:50] WARNING: ../src/learner.cc:339: No visible GPU is found, setting `gpu_id` to -1
[10:18:50] WARNING: ../src/learner.cc:339: No visible GPU is found, setting `gpu_id` to -1
[10:18:50] WARNING: ../src/learner.cc:339: No visible GPU is found, setting `gpu_id` to -1
[10:18:50] WARNING: ../src/learner.cc:339: No visible GPU is found, setting `gpu_id` to -1
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1630.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
INFO:lightwood-2669:Loss of 9.051180630922318 with learning rate 0.0001
INFO:lightwood-2669:Loss of 9.014871209859848 with learning rate 0.0005
INFO:lightwood-2669:Loss of 8.969509482383728 with learning rate 0.001
INFO:lightwood-2669:Loss of 8.879052013158798 with learning rate 0.002
INFO:lightwood-2669:Loss of 8.788950502872467 with learning rate 0.003
INFO:lightwood-2669:Loss of 8.611965209245682 with learning rate 0.005
INFO:lightwood-2669:Loss of 8.195775926113129 with learning rate 0.01
INFO:lightwood-2669:Loss of 6.255893141031265 with learning rate 0.05
INFO:lightwood-2669:Found learning rate of: 0.05
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
INFO:lightwood-2669:Loss @ epoch 1: 0.5818348675966263
INFO:lightwood-2669:Loss @ epoch 2: 0.4797109067440033
INFO:lightwood-2669:Loss @ epoch 3: 0.48386093974113464
INFO:lightwood-2669:Loss @ epoch 4: 0.49511992931365967
INFO:lightwood-2669:Loss @ epoch 5: 0.39475560188293457
INFO:lightwood-2669:Loss @ epoch 6: 0.39592696726322174
INFO:lightwood-2669:Loss @ epoch 7: 0.3622782379388809
INFO:lightwood-2669:Loss @ epoch 8: 0.38170479238033295
INFO:lightwood-2669:Loss @ epoch 9: 0.5138543993234634
INFO:lightwood-2669:Loss @ epoch 10: 0.6360723078250885
INFO:lightwood-2669:Loss @ epoch 1: 0.29868809472430835
INFO:lightwood-2669:Loss @ epoch 2: 0.30318967591632495
DEBUG:lightwood-2669: `fit_mixer` runtime: 0.9 seconds
INFO:lightwood-2669:Started fitting LGBM models for array prediction
INFO:lightwood-2669:Started fitting XGBoost model
[0]     validation_0-rmse:42.76798
INFO:lightwood-2669:A single GBM iteration takes 0.1 seconds
INFO:lightwood-2669:Training XGBoost with 57023 iterations given 7127.987701892853 seconds constraint
[0]     validation_0-rmse:42.76798
[1]     validation_0-rmse:31.72661
[2]     validation_0-rmse:24.49596
[3]     validation_0-rmse:20.38592
[4]     validation_0-rmse:18.09356
[5]     validation_0-rmse:16.88080
[6]     validation_0-rmse:16.21734
[7]     validation_0-rmse:15.95640
[8]     validation_0-rmse:15.80745
[9]     validation_0-rmse:15.76428
[10]    validation_0-rmse:15.89176
[11]    validation_0-rmse:15.89176
[12]    validation_0-rmse:15.87901
[13]    validation_0-rmse:15.87505
[14]    validation_0-rmse:16.06330
INFO:lightwood-2669:Started fitting XGBoost model
[0]     validation_0-rmse:42.95930
INFO:lightwood-2669:A single GBM iteration takes 0.1 seconds
INFO:lightwood-2669:Training XGBoost with 57023 iterations given 7127.988470077515 seconds constraint
[0]     validation_0-rmse:42.95930
[1]     validation_0-rmse:32.27936
[2]     validation_0-rmse:25.47815
[3]     validation_0-rmse:21.37610
[4]     validation_0-rmse:19.25243
[5]     validation_0-rmse:18.03199
[6]     validation_0-rmse:17.67706
[7]     validation_0-rmse:17.57516
[8]     validation_0-rmse:17.51227
[9]     validation_0-rmse:17.51216
[10]    validation_0-rmse:17.55192
[11]    validation_0-rmse:17.56609
[12]    validation_0-rmse:17.71702
[13]    validation_0-rmse:17.75939
INFO:lightwood-2669:Started fitting XGBoost model
[0]     validation_0-rmse:43.14000
INFO:lightwood-2669:A single GBM iteration takes 0.1 seconds
INFO:lightwood-2669:Training XGBoost with 57023 iterations given 7127.988441467285 seconds constraint
[0]     validation_0-rmse:43.14000
[1]     validation_0-rmse:32.50446
[2]     validation_0-rmse:25.73040
[3]     validation_0-rmse:22.16599
[4]     validation_0-rmse:20.28726
[5]     validation_0-rmse:19.46406
[6]     validation_0-rmse:19.07306
[7]     validation_0-rmse:19.00714
[8]     validation_0-rmse:19.13990
[9]     validation_0-rmse:19.12589
[10]    validation_0-rmse:19.34977
[11]    validation_0-rmse:19.43217
[12]    validation_0-rmse:19.48230
INFO:lightwood-2669:Started fitting XGBoost model
[0]     validation_0-rmse:44.19079
INFO:lightwood-2669:A single GBM iteration takes 0.1 seconds
INFO:lightwood-2669:Training XGBoost with 57023 iterations given 7127.98904633522 seconds constraint
[0]     validation_0-rmse:44.19079
[1]     validation_0-rmse:34.13289
[2]     validation_0-rmse:27.40621
[3]     validation_0-rmse:23.82532
[4]     validation_0-rmse:22.03399
[5]     validation_0-rmse:21.07010
[6]     validation_0-rmse:20.74813
[7]     validation_0-rmse:20.81255
[8]     validation_0-rmse:20.69303
[9]     validation_0-rmse:20.71044
[10]    validation_0-rmse:20.79641
[11]    validation_0-rmse:20.78759
[12]    validation_0-rmse:20.83998
[13]    validation_0-rmse:20.77980
INFO:lightwood-2669:Started fitting XGBoost model
[0]     validation_0-rmse:44.24747
INFO:lightwood-2669:A single GBM iteration takes 0.1 seconds
INFO:lightwood-2669:Training XGBoost with 57023 iterations given 7127.98738694191 seconds constraint
[0]     validation_0-rmse:44.24747
[1]     validation_0-rmse:34.37446
[2]     validation_0-rmse:27.88767
[3]     validation_0-rmse:24.63817
[4]     validation_0-rmse:22.84209
[5]     validation_0-rmse:22.35045
[6]     validation_0-rmse:22.11300
[7]     validation_0-rmse:22.16132
[8]     validation_0-rmse:22.21348
[9]     validation_0-rmse:22.10747
[10]    validation_0-rmse:22.20352
[11]    validation_0-rmse:22.25761
[12]    validation_0-rmse:22.25308
[13]    validation_0-rmse:22.31415
[14]    validation_0-rmse:22.31000
INFO:lightwood-2669:Started fitting XGBoost model
[0]     validation_0-rmse:44.48913
INFO:lightwood-2669:A single GBM iteration takes 0.1 seconds
INFO:lightwood-2669:Training XGBoost with 57023 iterations given 7127.989282369614 seconds constraint
[0]     validation_0-rmse:44.48913
[1]     validation_0-rmse:34.69001
[2]     validation_0-rmse:28.87323
[3]     validation_0-rmse:25.32567
[4]     validation_0-rmse:23.09943
[5]     validation_0-rmse:22.12203
[6]     validation_0-rmse:21.71523
[7]     validation_0-rmse:21.70934
[8]     validation_0-rmse:21.74380
[9]     validation_0-rmse:21.61157
[10]    validation_0-rmse:21.73507
[11]    validation_0-rmse:21.84587
[12]    validation_0-rmse:21.78099
[13]    validation_0-rmse:21.68890
[14]    validation_0-rmse:21.70025
DEBUG:lightwood-2669: `fit_mixer` runtime: 0.49 seconds
INFO:dataprep_ml-2669:Ensembling the mixer
INFO:lightwood-2669:Mixer: NeuralTs got accuracy: 0.875
WARNING:lightwood-2669:This model does not output probability estimates
INFO:lightwood-2669:Mixer: XGBoostArrayMixer got accuracy: 0.869
INFO:lightwood-2669:Picked best mixer: NeuralTs
DEBUG:lightwood-2669: `fit` runtime: 1.44 seconds
INFO:dataprep_ml-2669:[Learn phase 7/8] - Ensemble analysis
INFO:dataprep_ml-2669:Analyzing the ensemble of mixers
INFO:lightwood-2669:The block ICP is now running its analyze() method
INFO:lightwood-2669:The block ConfStats is now running its analyze() method
INFO:lightwood-2669:The block AccStats is now running its analyze() method
INFO:lightwood-2669:The block PermutationFeatureImportance is now running its analyze() method
WARNING:lightwood-2669:Block 'PermutationFeatureImportance' does not support time series nor text encoding, skipping...
DEBUG:lightwood-2669: `analyze_ensemble` runtime: 0.15 seconds
INFO:dataprep_ml-2669:[Learn phase 8/8] - Adjustment on validation requested
INFO:dataprep_ml-2669:Updating the mixers
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
INFO:lightwood-2669:Loss @ epoch 1: 0.29626286526521045
INFO:lightwood-2669:Loss @ epoch 2: 0.2954987535874049
INFO:lightwood-2669:Updating array of LGBM models...
INFO:lightwood-2669:XGBoost mixer does not have a `partial_fit` implementation
INFO:lightwood-2669:XGBoost mixer does not have a `partial_fit` implementation
INFO:lightwood-2669:XGBoost mixer does not have a `partial_fit` implementation
INFO:lightwood-2669:XGBoost mixer does not have a `partial_fit` implementation
INFO:lightwood-2669:XGBoost mixer does not have a `partial_fit` implementation
INFO:lightwood-2669:XGBoost mixer does not have a `partial_fit` implementation
DEBUG:lightwood-2669: `adjust` runtime: 0.09 seconds
DEBUG:lightwood-2669: `learn` runtime: 1.94 seconds
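Retraining from scratch every time is wasteful, so you may want to persist the trained predictor. A sketch using the `save` method and the `predictor_from_state` helper; both are available in recent Lightwood releases, but check your version’s API reference if this raises:

from lightwood.api.high_level import predictor_from_state

predictor.save('./sunspots_predictor.pkl')  # serialize the trained state
# ... later, restore it alongside the generated code:
predictor = predictor_from_state('./sunspots_predictor.pkl', code)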

Predict

Once the predictor has finished training, we can use it to generate a 6-month forecast for each of the test set data points:

[7]:
forecasts = predictor.predict(test)
INFO:dataprep_ml-2669:[Predict phase 1/4] - Data preprocessing
/tmp/f73a2317178f784e4c57c843f563cceab09655f97903f47417108435306748986.py:584: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[col] = [None] * len(data)
INFO:dataprep_ml-2669:Cleaning the data
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/dataprep_ml/cleaners.py:163: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  result = pd.to_datetime(element,
INFO:dataprep_ml-2669:Transforming timeseries data
DEBUG:lightwood-2669: `preprocess` runtime: 0.02 seconds
INFO:dataprep_ml-2669:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2669:Featurizing the data
DEBUG:lightwood-2669: `featurize` runtime: 0.01 seconds
INFO:dataprep_ml-2669:[Predict phase 3/4] - Calling ensemble
DEBUG:lightwood-2669: `_timed_call` runtime: 0.09 seconds
INFO:dataprep_ml-2669:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2669:The block ICP is now running its explain() method
INFO:lightwood-2669:The block ConfStats is now running its explain() method
INFO:lightwood-2669:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2669:The block AccStats is now running its explain() method
INFO:lightwood-2669:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2669:The block PermutationFeatureImportance is now running its explain() method
INFO:lightwood-2669:PermutationFeatureImportance.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-2669: `explain` runtime: 0.09 seconds
DEBUG:lightwood-2669: `predict` runtime: 0.22 seconds

Let’s check how a single row might look:

[8]:
forecasts.iloc[[10]].T   # transpose the single row for readability
[8]:
                 10
original_index   10
prediction       [50.51358374451768, 53.78975402923053, 51.0303...
order_Month      [-273628800.0, -270950400.0, -268272000.0, -26...
confidence       [0.79, 0.02, 0.9991, 0.9991, 0.9991, 0.9991]
lower            [30.14139795091352, 32.97332333016865, 0.0, 0....
upper            [70.88576953812185, 74.60618472829242, 137.289...
anomaly          False
prediction_sum   294.494088
lower_sum        0.0
upper_sum        209.075865
confidence_mean  0.801067

You’ll note that each point prediction comes with lower and upper bounds, whose width is a function of the model’s estimated confidence in its own output. Apart from this, order_Month yields the timestamp of each predicted step, and the anomaly tag will let you know whether the observed value falls outside of the predicted region.
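Since each row packs an entire 6-step forecast into lists, it can be handy to unpack one row into a per-step table. A small sketch, assuming (as the negative epoch values above suggest) that order_Month holds unix timestamps in seconds:

row = forecasts.iloc[10]
per_step = pd.DataFrame({
    'timestamp':  pd.to_datetime(row['order_Month'], unit='s'),
    'prediction': row['prediction'],
    'lower':      row['lower'],
    'upper':      row['upper'],
    'confidence': row['confidence'],
})
print(per_step)  # one line per forecasted month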

Visualizing a forecast

Okay, time series are much easier to appreciate through plots. Let’s make one:

NOTE: We will use matplotlib to generate a simple plot of these forecasts. If you want to run this notebook locally, you will need to pip install matplotlib for the following code to work.

[9]:
import matplotlib.pyplot as plt
[10]:
plt.figure(figsize=(12, 8))
# pad with Nones so the last row's 6-step forecast is drawn after the end of the test span
pad = [None for _ in range(forecasts.shape[0])]
plt.plot(pad + forecasts.iloc[-1]['prediction'], color='purple', label='point prediction')
plt.plot(pad + forecasts.iloc[-1]['lower'], color='grey', label='confidence bounds')
plt.plot(pad + forecasts.iloc[-1]['upper'], color='grey')
plt.xlabel('timestep')
plt.ylabel('# sunspots')
plt.title("Forecasted amount of sunspots for the next semester")
plt.legend()
plt.show()
[Figure: forecasted number of sunspots for the next semester, showing the point prediction (purple) and lower/upper confidence bounds (grey)]
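For additional context, you can overlay recent observed values ahead of the forecast. A sketch along the same lines as the plot above (the amount of history shown is an arbitrary choice):

context = 36  # how many months of observed history to show
history = list(test['Sunspots'].iloc[-context:])
pad = [None] * context

plt.figure(figsize=(12, 8))
plt.plot(history, color='black', label='observed')
plt.plot(pad + forecasts.iloc[-1]['prediction'], color='purple', label='point prediction')
plt.plot(pad + forecasts.iloc[-1]['lower'], color='grey', label='confidence bounds')
plt.plot(pad + forecasts.iloc[-1]['upper'], color='grey')
plt.xlabel('timestep')
plt.ylabel('# sunspots')
plt.title("Observed history plus forecast")
plt.legend()
plt.show()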

Conclusion

In this tutorial, we have shown how to train a machine learning model with Lightwood to produce forecasts for a univariate time series task.

There are additional parameters to further customize your time series settings and/or prediction insights, so be sure to check the rest of the documentation.