Data

The focus of these modules is on storing, transforming, cleaning, splitting, merging, getting and removing data.

class data.ConcatedEncodedDs(encoded_ds_arr)[source]

Bases: EncodedDs

ConcatedEncodedDs abstracts over multiple encoded datasources (EncodedDs) as if they were a single entity.
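The idea of treating several data sources as one can be illustrated with a minimal sketch. It does not reproduce Lightwood's implementation: the class name `ConcatenatedView` is hypothetical, and plain Python lists stand in for `EncodedDs` instances. It only shows how a global index can be mapped onto the right underlying source.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


class ConcatenatedView:
    """Hypothetical stand-in: presents several sequences as a single one."""

    def __init__(self, parts: List[Sequence]):
        self.parts = parts
        # Cumulative lengths, e.g. parts of size 2 and 3 give [2, 5].
        self.offsets = list(accumulate(len(p) for p in parts))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the part that holds the global index, then the local position.
        part_no = bisect_right(self.offsets, idx)
        start = self.offsets[part_no - 1] if part_no else 0
        return self.parts[part_no][idx - start]


view = ConcatenatedView([["a", "b"], ["c", "d", "e"]])
items = [view[i] for i in range(len(view))]
```

Callers iterate over `view` by position without needing to know how many sources sit underneath.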

Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.

Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

Parameters:
  • encoders – dictionary of Lightwood encoders used to encode the data per each column.

  • data_frame – original dataframe.

  • target – name of the target column to predict.

clear_cache()[source]

See lightwood.data.encoded_ds.EncodedDs.clear_cache().

get_column_original_data(column_name)[source]

See lightwood.data.encoded_ds.EncodedDs.get_column_original_data().

Return type:

Series

get_encoded_column_data(column_name)[source]

See lightwood.data.encoded_ds.EncodedDs.get_encoded_column_data().

Return type:

Tensor

class data.EncodedDs(encoders, data_frame, target)[source]

Bases: Dataset

Create a Lightwood datasource from a data frame and some encoders. This class inherits from torch.utils.data.Dataset.

Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

Parameters:
  • encoders (Dict[str, BaseEncoder]) – dictionary of Lightwood encoders used to encode the data per each column.

  • data_frame (DataFrame) – original dataframe.

  • target (str) – name of the target column to predict.

build_cache()[source]

This method builds a cache for the entire dataframe provided at initialization.

clear_cache()[source]

Clears the EncodedDs cache.
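The caching behavior behind build_cache() and clear_cache() can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `CachedEncoder`, its `encode` callable and the `encode_calls` counter are hypothetical names, and lists of floats stand in for tensors.

```python
from typing import Callable, Dict, List


class CachedEncoder:
    """Hypothetical stand-in for a dataset that caches encoded rows."""

    def __init__(self, rows: List[str], encode: Callable[[str], List[float]]):
        self.rows = rows
        self.encode = encode
        self.cache: Dict[int, List[float]] = {}
        self.encode_calls = 0  # counts how often the encoder actually runs

    def __getitem__(self, idx: int) -> List[float]:
        # Encode only on a cache miss; otherwise reuse the stored result.
        if idx not in self.cache:
            self.encode_calls += 1
            self.cache[idx] = self.encode(self.rows[idx])
        return self.cache[idx]

    def build_cache(self) -> None:
        # Encode every row up front so later lookups need no computation.
        for idx in range(len(self.rows)):
            self[idx]

    def clear_cache(self) -> None:
        self.cache = {}


ds = CachedEncoder(["ab", "cde"], lambda s: [float(len(s))])
ds.build_cache()
first = ds[0]                          # served from the cache
calls_after_build = ds.encode_calls
ds.clear_cache()
ds[0]                                  # has to be encoded again
calls_after_clear = ds.encode_calls
```

After clearing, the next access pays the encoding cost again, which is the trade-off the note above refers to.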

get_column_original_data(column_name)[source]

Gets the original data for any given column of the EncodedDs.

Parameters:

column_name (str) – name of the column.

Return type:

Series

Returns:

A pd.Series with the original data stored in the column_name column.

get_encoded_column_data(column_name)[source]

Gets the encoded data for any given column of the EncodedDs.

Parameters:

column_name (str) – name of the column.

Return type:

Tensor

Returns:

A torch.Tensor with the encoded data of the column_name column.

get_encoded_data(include_target=True)[source]

Gets all encoded data.

Parameters:

include_target (bool) – whether to include the target column in the output or not.

Return type:

Tensor

Returns:

A torch.Tensor with the encoded dataframe.
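A rough picture of what include_target controls, assuming each column is encoded into a fixed-width vector per row and the per-column encodings are joined along the feature axis. The helper `concat_encoded` is hypothetical, nested lists stand in for tensors, and the column ordering shown here is an assumption rather than documented behavior.

```python
from typing import Dict, List


def concat_encoded(
    encoded: Dict[str, List[List[float]]],
    target: str,
    include_target: bool = True,
) -> List[List[float]]:
    """Join per-column encodings row by row, optionally skipping the target."""
    columns = [c for c in encoded if include_target or c != target]
    n_rows = len(next(iter(encoded.values()))) if encoded else 0
    return [[v for c in columns for v in encoded[c][i]] for i in range(n_rows)]


# Two rows; "color" is encoded with two values per row, the others with one.
encoded = {
    "age": [[0.1], [0.9]],
    "color": [[1.0, 0.0], [0.0, 1.0]],
    "price": [[3.0], [7.0]],
}
with_target = concat_encoded(encoded, target="price")
without_target = concat_encoded(encoded, target="price", include_target=False)
```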

data.timeseries_analyzer(data, dtype_dict, timeseries_settings, target)[source]

This function analyzes (pre-processed) time series data and stores a few useful insights used in the rest of Lightwood’s pipeline.

Parameters:
  • data (Dict[str, DataFrame]) – dictionary with the dataset split into train, val, test subsets.

  • dtype_dict (Dict[str, str]) – dictionary with inferred types for every column.

  • timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object. For more details, check lightwood.types.TimeseriesSettings.

  • target (str) – name of the target column.

The following things are extracted from each time series inside the dataset:
  • group_combinations: all observed combinations of values for the set of group_by columns. The length of this list determines how many time series are in the data.

  • deltas: inferred sampling interval.

  • ts_naive_residuals: Residuals obtained from the data by a naive forecaster that repeats the last-seen value.

  • ts_naive_mae: Mean absolute error (MAE) obtained on the data by a naive forecaster that repeats the last-seen value.

  • target_normalizers: objects that may normalize the data within any given time series for effective learning. See lightwood.encoder.time_series.helpers.common for available choices.
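Several of these insights can be illustrated with a small standard-library sketch. It is not Lightwood's implementation: the function `naive_insights` and the dictionary keys are hypothetical, residuals are taken as absolute differences (an assumption), and the sampling interval is approximated as the most common gap between consecutive timestamps.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[Tuple[str, ...], int, float]  # (group values, timestamp, target)


def naive_insights(rows: List[Row]) -> Dict[Tuple[str, ...], dict]:
    # Split the rows into one series per combination of group values.
    series = defaultdict(list)
    for group, timestamp, value in rows:
        series[group].append((timestamp, value))

    insights = {}
    for group, points in series.items():
        points.sort()  # order each series by timestamp
        times = [t for t, _ in points]
        values = [v for _, v in points]
        gaps = [b - a for a, b in zip(times, times[1:])]
        # The naive forecast for each step is the previous observed value.
        residuals = [abs(curr - prev) for prev, curr in zip(values, values[1:])]
        insights[group] = {
            "delta": Counter(gaps).most_common(1)[0][0] if gaps else None,
            "naive_residuals": residuals,
            "naive_mae": mean(residuals) if residuals else None,
        }
    return insights


rows = [
    (("US", "web"), 0, 10.0), (("US", "web"), 1, 12.0), (("US", "web"), 2, 11.0),
    (("EU", "app"), 0, 5.0), (("EU", "app"), 2, 5.0), (("EU", "app"), 4, 9.0),
]
insights = naive_insights(rows)
group_combinations = list(insights)  # one entry per time series
```

A naive MAE of this kind is commonly used as a baseline: a forecaster that cannot beat "repeat the last value" adds little.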

Return type:

Dict

Returns:

Dictionary with the aforementioned insights and the TimeseriesSettings object for future reference.

data.transform_timeseries(data, dtype_dict, timeseries_settings, target, mode, pred_args=None)[source]

Block that transforms the dataframe of a time series task into a format convenient for later phases such as model training.

The main transformations performed by this block are:
  • Type casting (e.g. to numerical for order_by column).

  • Windowing functions for historical context based on TimeseriesSettings.window parameter.

  • Explicitly add target columns according to the TimeseriesSettings.horizon parameter.

  • Flag all rows that are “predictable” based on all TimeseriesSettings.

  • Handle all logic for the streaming use case (where forecasts are only emitted for the last observed data point).
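The windowing and horizon steps can be sketched on a single series as follows. This is a simplified illustration, not the actual transformation: `make_windows` and its output keys are hypothetical, the history is assumed to include the current observation, and a row is flagged as usable only when both its history window and its future targets are complete.

```python
from typing import Dict, List, Optional


def make_windows(values: List[float], window: int, horizon: int) -> List[Dict]:
    rows = []
    for i in range(len(values)):
        # Last `window` observations up to and including position i.
        seen = values[max(0, i - window + 1): i + 1]
        history: List[Optional[float]] = [None] * (window - len(seen)) + seen
        # The next `horizon` observations become explicit target columns.
        targets = values[i + 1: i + 1 + horizon]
        rows.append({
            "history": history,
            "targets": targets,
            "usable": len(seen) == window and len(targets) == horizon,
        })
    return rows


rows = make_windows([1.0, 2.0, 3.0, 4.0, 5.0], window=2, horizon=2)
usable_flags = [r["usable"] for r in rows]
```

With a window of 2 and a horizon of 2, only the second and third rows have both a full history and a full set of future targets.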

Parameters:
  • data (DataFrame) – Dataframe with data to transform.

  • dtype_dict (Dict[str, str]) – Dictionary with the types of each column.

  • timeseries_settings (TimeseriesSettings) – A TimeseriesSettings object.

  • target (str) – The name of the target column to forecast.

  • mode (str) – Either “train” or “predict”, depending on what phase is calling this procedure.

  • pred_args (Optional[PredictionArguments]) – Optional prediction arguments to control the transformation process.

Return type:

DataFrame

Returns:

A dataframe with all the transformations applied.