`Data`

These modules focus on storing, transforming, cleaning, splitting, merging, getting, and removing data.

class data.ConcatedEncodedDs(encoded_ds_arr)

Bases: EncodedDs

ConcatedEncodedDs abstracts over multiple encoded datasources (EncodedDs) as if they were a single entity.

Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.

Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

Parameters:

encoders – dictionary of Lightwood encoders used to encode each column of the data.
data_frame – original dataframe.
target – name of the target column to predict.

clear_cache()[source]: See lightwood.data.encoded_ds.EncodedDs.clear_cache().

get_column_original_data(column_name)

See lightwood.data.encoded_ds.EncodedDs.get_column_original_data().

Return type: Series

get_encoded_column_data(column_name)

See lightwood.data.encoded_ds.EncodedDs.get_encoded_column_data().

Return type: Tensor
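
To illustrate the idea of treating several datasources as a single entity, here is a minimal pure-Python sketch. It is not Lightwood's implementation: the class name `ConcatSketch` and the use of plain lists in place of `EncodedDs` objects are assumptions made only for this example. It maps a global row index to the part that holds that row.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


class ConcatSketch:
    """Hypothetical stand-in: presents several sequences as one indexable dataset."""

    def __init__(self, parts: Sequence[Sequence]):
        self.parts = list(parts)
        # Cumulative end offsets, e.g. lengths [2, 3] give [2, 5].
        self.offsets: List[int] = list(accumulate(len(p) for p in self.parts))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the first part whose end offset is greater than idx.
        part_idx = bisect_right(self.offsets, idx)
        start = self.offsets[part_idx - 1] if part_idx > 0 else 0
        return self.parts[part_idx][idx - start]


# Two toy "datasources" standing in for encoded train and dev subsets.
train = ["a", "b"]
dev = ["c", "d", "e"]
combined = ConcatSketch([train, dev])
all_rows = [combined[i] for i in range(len(combined))]
```

Callers index `combined` as if it were one dataset, and the lookup is delegated to whichever part owns the row.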

class data.EncodedDs(encoders, data_frame, target)

Bases: Dataset

Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.

Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

Parameters:

encoders (Dict[str, BaseEncoder]) – dictionary of Lightwood encoders used to encode each column of the data.
data_frame (DataFrame) – original dataframe.
target (str) – name of the target column to predict.

build_cache(): This method builds a cache for the entire dataframe provided at initialization.

clear_cache()[source]: Clears the EncodedDs cache.

get_column_original_data(column_name)

Gets the original data for any given column of the EncodedDs.

Parameters: column_name (str) – name of the column.
Return type: Series
Returns: A pd.Series with the original data stored in the column_name column.

get_encoded_column_data(column_name)

Gets the encoded data for any given column of the EncodedDs.

Parameters: column_name (str) – name of the column.
Return type: Tensor
Returns: A torch.Tensor with the encoded data of the column_name column.

get_encoded_data(include_target=True)

Gets all encoded data.

Parameters: include_target (bool) – whether to include the target column in the output or not.
Return type: Tensor
Returns: A torch.Tensor with the encoded dataframe.
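
The caching behavior described for EncodedDs can be pictured with a small pure-Python sketch. It is an assumption-laden illustration, not the library's code: `CachedEncodingSketch`, the lambda encoders, and the `encode_calls` counter are hypothetical, and plain lists of floats stand in for tensors. Only the method names `build_cache` and `clear_cache` mirror the documented ones.

```python
from typing import Callable, Dict, List, Optional


class CachedEncodingSketch:
    """Hypothetical illustration of per-row caching; not Lightwood's implementation."""

    def __init__(self, rows: List[dict], encoders: Dict[str, Callable[[object], List[float]]]):
        self.rows = rows
        self.encoders = encoders
        self.cache: List[Optional[List[float]]] = [None] * len(rows)
        self.encode_calls = 0  # counts how often a row is actually encoded

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int) -> List[float]:
        if self.cache[idx] is None:
            # Cache miss: encode every column and concatenate the results.
            self.encode_calls += 1
            encoded: List[float] = []
            for column, encode in self.encoders.items():
                encoded.extend(encode(self.rows[idx][column]))
            self.cache[idx] = encoded
        return self.cache[idx]

    def build_cache(self) -> None:
        # Touch every row once so later reads are served from the cache.
        for idx in range(len(self)):
            self[idx]

    def clear_cache(self) -> None:
        self.cache = [None] * len(self.rows)


rows = [{"age": 30, "smoker": "yes"}, {"age": 40, "smoker": "no"}]
encoders = {
    "age": lambda v: [v / 100],
    "smoker": lambda v: [1.0, 0.0] if v == "yes" else [0.0, 1.0],
}
ds = CachedEncodingSketch(rows, encoders)
first = ds[0]
again = ds[0]  # served from the cache, no second encoding pass
```

The trade-off in this sketch is memory for speed: every encoded row stays resident until `clear_cache()` is called.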

data.timeseries_analyzer(data, dtype_dict, timeseries_settings, target)

This function analyzes (pre-processed) time series data and stores a few useful insights used in the rest of Lightwood’s pipeline.

Parameters:

data (Dict[str, DataFrame]) – dictionary with the dataset split into train, val, test subsets.
dtype_dict (Dict[str, str]) – dictionary with inferred types for every column.
timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object. For more details, check lightwood.types.TimeseriesSettings.
target (str) – name of the target column.

The following things are extracted from each time series inside the dataset:

group_combinations: all observed combinations of values for the set of group_by columns. The length of this list determines how many time series are in the data.
deltas: inferred sampling interval.
ts_naive_residuals: residuals obtained from the data by a naive forecaster that repeats the last-seen value.
ts_naive_mae: mean absolute error (MAE) obtained from the data by the same naive forecaster.
target_normalizers: objects that may normalize the data within any given time series for effective learning. See lightwood.encoder.time_series.helpers.common for available choices.

Return type: Dict
Returns: Dictionary with the aforementioned insights and the TimeseriesSettings object for future references.
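
As a rough illustration of the insights listed above, the following pure-Python sketch computes group combinations, a sampling interval, naive-forecaster residuals, and their mean for a toy dataset. It does not call Lightwood. The row layout, the "most frequent gap" rule for the interval, and the use of absolute errors as residuals are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Toy rows: (group value, timestamp, target). All names here are hypothetical.
rows: List[Tuple[str, int, float]] = [
    ("store_a", 1, 10.0), ("store_a", 2, 12.0), ("store_a", 3, 11.0),
    ("store_b", 10, 5.0), ("store_b", 20, 5.0), ("store_b", 30, 8.0),
]

# Split the data into one series per group.
series: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
for group, ts, value in rows:
    series[group].append((ts, value))

insights = {}
for group, points in series.items():
    points.sort()  # order by timestamp
    stamps = [ts for ts, _ in points]
    values = [v for _, v in points]
    # Assumed rule: the interval is the most frequent gap between timestamps.
    gaps = Counter(b - a for a, b in zip(stamps, stamps[1:]))
    delta = gaps.most_common(1)[0][0]
    # Naive forecast for step t is the value at t-1; keep absolute errors.
    residuals = [abs(curr - prev) for prev, curr in zip(values, values[1:])]
    insights[group] = {"delta": delta, "residuals": residuals, "mae": mean(residuals)}

# With a single group-by column, each distinct value is one combination.
group_combinations = sorted(insights)
```

A naive error of this kind is commonly used as a baseline: a forecaster that cannot beat it has learned little beyond repeating the last value.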

data.transform_timeseries(data, dtype_dict, timeseries_settings, target, mode, pred_args=None)

Block that transforms the dataframe of a time series task into a convenient format for use in later phases like model training.

The main transformations performed by this block are:

Type casting (e.g. to numerical for order_by column).
Windowing functions for historical context based on TimeseriesSettings.window parameter.
Explicitly add target columns according to the TimeseriesSettings.horizon parameter.
Flag all rows that are “predictable” based on all TimeseriesSettings.
Plus, handle all logic for the streaming use case (where forecasts are only emitted for the last observed data point).

Parameters:

data (DataFrame) – Dataframe with data to transform.
dtype_dict (Dict[str, str]) – Dictionary with the types of each column.
timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object.
target (str) – The name of the target column to forecast.
mode (str) – Either “train” or “predict”, depending on what phase is calling this procedure.
pred_args (Optional[PredictionArguments]) – Optional prediction arguments to control the transformation process.

Return type: DataFrame
Returns: A dataframe with all the transformations applied.
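
The windowing and horizon steps listed above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the transformation Lightwood applies: the helper `window_and_horizon`, its output keys, the choice to include the current value in the history window, and the `None` padding are all hypothetical.

```python
from typing import Dict, List, Optional


def window_and_horizon(values: List[float], window: int, horizon: int) -> List[Dict[str, object]]:
    """Hypothetical helper: one output row per time step, with history and future targets."""
    rows = []
    for t in range(len(values)):
        # Historical context: up to `window` values ending at step t (inclusive).
        history: List[Optional[float]] = list(values[max(0, t - window + 1): t + 1])
        n_observed = len(history)
        # Left-pad with None when fewer than `window` values have been observed.
        history = [None] * (window - n_observed) + history
        # Future targets for steps t+1 .. t+horizon; None where the future is unknown.
        future: List[Optional[float]] = list(values[t + 1: t + 1 + horizon])
        future = future + [None] * (horizon - len(future))
        rows.append({"history": history, "future": future, "full_history": n_observed == window})
    return rows


transformed = window_and_horizon([1.0, 2.0, 3.0, 4.0, 5.0], window=3, horizon=2)
# In a streaming setting, only the last row (latest observed point) would be forecast.
latest = transformed[-1]
```

In this sketch, `full_history` plays the role of a simple "predictable" flag, and rows with `None` in `future` could only be used at prediction time, not for training.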
