Encoders

Used for encoding data into PyTorch tensors and decoding it from PyTorch tensors.

class encoder.ArrayEncoder(stop_after, window=None, is_target=False, original_type=None)[source]

Fits a normalizer for array data.

To encode, ArrayEncoder returns a normalized window of previous data. It can be used for generic arrays, as well as for handling historical target values in time series tasks.

Currently supported normalizing strategies are minmax for numerical arrays, and a simple one-hot for categorical arrays. See lightwood.encoder.helpers for more details on each approach.

Parameters:
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type (Optional[dtype]) – element-wise data type
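
A minimal round-trip sketch, assuming ArrayEncoder is importable from lightwood.encoder as the module path on this page suggests:

    from lightwood.encoder import ArrayEncoder

    train = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]  # training sequences
    dev = [[3.0, 6.0, 9.0]]                     # dev sequences

    enc = ArrayEncoder(stop_after=10, window=3)
    enc.prepare(train, dev)                     # fits the minmax normalizer

    encoded = enc.encode([[1.0, 2.0, 3.0]])     # torch.Tensor, one row per sequence
    decoded = enc.decode(encoded)               # list of sequences in the original space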

decode(data)[source]

Converts encoded data back into a list of arrays.

Parameters:

data (Tensor) – Encoded data prepared by this array encoder

Return type:

List[Iterable]

Returns:

A list of iterable sequences in the original data space

encode(column_data)[source]

Encode the properties of a sequence-of-sequence representation

Parameters:

column_data (Iterable[Iterable]) – Input column data to be encoded

Return type:

Tensor

Returns:

a torch-tensor representing the encoded sequence

prepare(train_priming_data, dev_priming_data)[source]

Prepare the array encoder for sequence data.

Parameters:
  • train_priming_data (Iterable[Iterable]) – Training data of sequences

  • dev_priming_data (Iterable[Iterable]) – Dev data of sequences

class encoder.BaseEncoder(is_target=False)[source]

Base class for all encoders.

An encoder should return encoded representations of any columnar data. The procedure for this is defined inside the encode() method.

If this encoder is expected to handle an output column, then it also needs to implement the respective decode() method that handles the inverse transformation from encoded representations to the final prediction in the original column space.

For encoders that learn representations (as opposed to rule-based), the prepare() method will handle all learning logic.

The to() method is used to move PyTorch-based encoders to and from a GPU.

Parameters:
  • is_target – Whether the data to encode is the target, as per the problem definition.

  • is_timeseries_encoder – Whether the encoder represents sequential/time-series data; Lightwood provides specific treatment for this kind of encoder

  • is_trainable_encoder – Whether the encoder returns learned representations; Lightwood checks for this flag to decide whether to pass training data via the prepare() method.

Class Attributes:
  • is_prepared: Internal flag to signal that the prepare() method has been successfully executed.

  • is_nn_encoder: Whether the encoder is neural network-based.

  • dependencies: list of additional columns that the encoder might need to encode.

  • output_size: length of each encoding tensor for a single data point.
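
To make this contract concrete, here is a minimal sketch of a hypothetical rule-based encoder built on this interface; the class and its one-bit encoding scheme are invented for illustration:

    from typing import Iterable, List
    import torch
    from lightwood.encoder import BaseEncoder

    class SignEncoder(BaseEncoder):
        """Toy encoder: represents each number by a single 'is positive' bit."""
        def __init__(self, is_target: bool = False):
            super().__init__(is_target)
            self.output_size = 1                      # encoding length per data point

        def prepare(self, priming_data: Iterable[object]) -> None:
            self.is_prepared = True                   # rule-based: nothing to learn

        def encode(self, column_data: Iterable[object]) -> torch.Tensor:
            return torch.tensor([[1.0 if x is not None and x > 0 else 0.0]
                                 for x in column_data])

        def decode(self, encoded_data: torch.Tensor) -> List[object]:
            return ['positive' if row[0] > 0.5 else 'non-positive'
                    for row in encoded_data]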

decode(encoded_data)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters:

encoded_data (Tensor) – The input representation in encoded format

Return type:

List[object]

Returns:

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters:

column_data (Iterable[object]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type:

Tensor

Returns:

The encoded representation of data, per column

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data (Iterable[object]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type:

None

class encoder.BinaryEncoder(is_target=False, target_weights=None, handle_unknown='use_encoded_value')[source]

Creates a one-hot encoding for binary class data. Assume two arbitrary categories \(A\) and \(B\); their representations will be as follows:

\[A = [1, 0], \quad B = [0, 1]\]

This encoder is a specialized case of one-hot encoding (OHE); unknown categories are explicitly handled as [0, 0]. Unknowns are only reported if the input value is NULL (python None), or if, after the encoder is prepared, new data contains examples outside the known feature map.

When data is typed with Lightwood, this class is only deployed if an input data type is explicitly recognized as binary (i.e. the column has only 2 unique values like True/False). If future data shows a new category (thus the data is no longer truly binary), this encoder will no longer be appropriate unless you are comfortable mapping ALL new classes as [0, 0].

An encoder can represent either a feature or the target column; when it represents the target, is_target is True and target_weights may be set. The target_weights parameter enables users to specify how heavily each class should be weighted within a mixer, which is useful for imbalanced classes.

By default, the dataprep_ml.StatisticalAnalysis phase provides target_weights as the relative fraction of each class in the data, which is important for imbalanced populations. For example, given an 80/20 imbalance across two classes, target_weights will be a dictionary as such:

target_weights = {“class1”: 0.8, “class2”: 0.2}

Users should note that models will be presented with the inverse of the target weights, inv_target_weights, which applies the 1/target_value_per_class operation. This means large values result in small weights for the model.
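
A short usage sketch, assuming the class is importable from lightwood.encoder; the inverse-weight arithmetic follows the description above:

    from lightwood.encoder import BinaryEncoder

    target_weights = {"class1": 0.8, "class2": 0.2}
    inv_target_weights = {k: 1.0 / v for k, v in target_weights.items()}
    # {'class1': 1.25, 'class2': 5.0} -> the rare class gets the larger weight

    enc = BinaryEncoder(is_target=True, target_weights=target_weights)
    enc.prepare(["class1", "class2", "class1"])
    enc.encode(["class1", "class2", "unseen"])  # last row encodes as [0, 0]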

decode(encoded_data)[source]

Given encoded data, return the original category labels. The input to decode makes no presumption on whether the data is already in OHE form, as some models may output a set of probabilities or weights assigned to each class. The decoded value is always the argmax of such a vector.

In the case that the vector is all 0s, the output is decoded as “UNKNOWN”

Parameters:

encoded_data (Tensor) – the output of a mixer model

Returns:

Decoded values for each data point

decode_probabilities(encoded_data)[source]

Provides decoded answers, as well as a probability assignment to each data point.

Parameters:

encoded_data (Tensor) – the output of a mixer model

Return type:

Tuple[List[str], List[List[float]], Dict[int, str]]

Returns:

Decoded values for each data point, Probability vector for each category, and the reverse map of dimension to category name
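
A hedged sketch of unpacking the three return values; the tensor stands in for raw mixer output, and the exact index-to-category order is an assumption:

    import torch
    from lightwood.encoder import BinaryEncoder

    enc = BinaryEncoder()
    enc.prepare(["yes", "no"])
    logits = torch.tensor([[2.0, 1.0], [0.1, 3.0]])  # e.g. raw mixer outputs
    labels, probs, rev_map = enc.decode_probabilities(logits)
    # labels:  decoded category per row (argmax)
    # probs:   per-category probability vector per row
    # rev_map: dimension -> category name, e.g. {0: 'yes', 1: 'no'}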

encode(column_data)[source]

Encodes categories as OHE binary. Unknown/unrecognized classes return [0, 0].

Parameters:

column_data (Iterable[str]) – Pre-processed data to encode

Return type:

Tensor

Returns:

Encoded data of form \(N_{rows} \times 2\)

prepare(priming_data)[source]

Given priming data, create a map/inverse-map corresponding category name to index (and vice versa).

Parameters:

priming_data (Iterable[str]) – Binary data to encode

class encoder.CatArrayEncoder(stop_after, window=None, is_target=False)[source]
Parameters:
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type – element-wise data type

decode(data)[source]

Converts encoded data back into a list of arrays.

Parameters:

data (Tensor) – Encoded data prepared by this array encoder

Return type:

List[Iterable]

Returns:

A list of iterable sequences in the original data space

prepare(train_priming_data, dev_priming_data)[source]

Prepare the array encoder for sequence data.

Parameters:
  • train_priming_data (Iterable[Iterable]) – Training data of sequences

  • dev_priming_data (Iterable[Iterable]) – Dev data of sequences

class encoder.CategoricalAutoEncoder(stop_after=3600, is_target=False, max_encoded_length=100, desired_error=0.01, batch_size=200, device='', input_encoder=None)[source]

Trains an autoencoder (AE) to represent categorical information with over 100 categories. This is used to ensure that feature vectors for categorical data with many categories are not excessively large.

The AE defaults to a 100-dimensional vector, but this can be adjusted to user preference. It is highly advised NOT to use this encoder to feature-engineer your target, as reconstruction accuracy will determine your AE’s ability to decode properly.

Parameters:
  • stop_after (float) – Stops training with provided time limit (sec)

  • is_target (bool) – Encoder represents target class (NOT recommended)

  • max_encoded_length (int) – Maximum length of vector represented

  • desired_error (float) – Threshold for reconstruction accuracy error

  • batch_size (int) – Minimum batch size while training

  • device (str) – Name of the device that get_device_from_name will attempt to use

  • input_encoder (Optional[str]) – one of OneHotEncoder or SimpleLabelEncoder to force usage of the underlying input encoder. Note that OHE does not scale for categorical features with high cardinality, while SLE can but is less accurate overall.
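
A usage sketch for a high-cardinality feature, assuming the class is importable from lightwood.encoder; prepare() takes pandas Series per the signature below:

    import pandas as pd
    from lightwood.encoder import CategoricalAutoEncoder

    # 500 distinct categories: too many for plain one-hot encoding
    train = pd.Series([f"cat_{i % 500}" for i in range(5000)])
    dev = pd.Series([f"cat_{i % 500}" for i in range(500)])

    enc = CategoricalAutoEncoder(stop_after=600, max_encoded_length=100)
    enc.prepare(train, dev)              # train and dev are concatenated internally
    embeddings = enc.encode(train[:10])  # one 100-dim embedding per sample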

decode(encoded_data)[source]

Decodes the original categories from the embedding space.

Warning: If your reconstruction accuracy is not 100%, the CatAE may not return the correct category.

Parameters:

encoded_data (Tensor) – A torch tensor of embeddings for category predictions

Return type:

List[str]

Returns:

A list of ‘translated’ categories for each embedding

encode(column_data)[source]

Encodes categorical information in column as the compressed vector from the CatAE.

Parameters:

column_data (Iterable[str]) – An iterable of category samples from a column

Return type:

Tensor

Returns:

An embedding for each sample in original input

prepare(train_priming_data, dev_priming_data)[source]

Creates inputs and prepares a categorical autoencoder (CatAE) on the input data. A separate dev set is not currently supported; train and dev inputs are concatenated together to train the autoencoder.

Parameters:
  • train_priming_data (Series) – Input training data

  • dev_priming_data (Series) – Input dev data (Not supported currently)

class encoder.DatetimeEncoder(is_target=False)[source]

This encoder produces an encoded representation for timestamps.

The approach consists of decomposing each timestamp into its constituent units (e.g. month, year, etc.) and describing each unit with a single value that represents its magnitude within a sensible cycle length.
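
The idea can be sketched as follows; this is an illustration of cyclical decomposition, not Lightwood's exact formula or unit set:

    import datetime

    ts = datetime.datetime(2021, 6, 15, 12, 30)
    features = [
        ts.month / 12,     # position within the yearly cycle
        ts.day / 31,       # position within the monthly cycle
        ts.weekday() / 7,  # position within the weekly cycle
        ts.hour / 24,      # position within the daily cycle
        ts.minute / 60,    # position within the hourly cycle
    ]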

decode(encoded_data, return_as_datetime=False)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters:

encoded_data (Tensor) – The input representation in encoded format

Return type:

list

Returns:

The decoded representation of data, per column, in the original data-type presented.

encode(data)[source]
Parameters:

data (Union[ndarray, Series]) – a pandas series with numerical dtype, previously cleaned with dataprep_ml

Return type:

Tensor

Returns:

encoded data, shape (len(data), self.output_size)

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.DatetimeNormalizerEncoder(is_target=False, sinusoidal=False)[source]
decode(encoded_data, return_as_datetime=False)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters:

encoded_data – The input representation in encoded format

Returns:

The decoded representation of data, per column, in the original data-type presented.

encode(data)[source]
Parameters:

data – either a list of lists or a pd.Series of lists (receiving a consistent data type here is a pending TODO)

Returns:

encoded data

encode_one(data)[source]

Encodes a list of unix timestamps, or a list of tensors with unix timestamps.

Parameters:

data – list of unix timestamps (unix timestamp resolution is seconds)

Returns:

a list of vectors

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.Img2VecEncoder(stop_after=3600, is_target=False, scale=(224, 224), mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], device='')[source]

Generates encoded representations for images using a pre-trained deep neural network. Inputs must be str-based locations of the image data.

Without user-specified details, all input images are rescaled to a standard size of 224x224, and normalized using the mean and standard deviation of the ImageNet dataset (as it was used to train the underlying NN).

This encoder currently does not support a decode() call; models with an image output will not work.

For more information about the neural network this encoder uses, refer to the lightwood.encoder.image.helpers.img_to_vec.Img2Vec.

Parameters:
  • stop_after (float) – time budget, in seconds.

  • is_target (bool) – Whether encoder represents target or not

  • scale (Tuple[int, int]) – Resize scale of image (x, y)

  • mean (List[float]) – Mean of pixel values

  • std (List[float]) – Standard deviation of pixel values

  • device (str) – Name of the device that get_device_from_name will attempt to use
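
A hedged usage sketch; the file paths are hypothetical, and passing empty priming lists assumes prepare() only needs to instantiate the pre-trained backbone (as described below):

    from lightwood.encoder import Img2VecEncoder

    enc = Img2VecEncoder(stop_after=600)
    enc.prepare([], [])  # sets up the pre-trained Img2Vec model
    vecs = enc.encode(["photos/cat.jpg", "photos/dog.jpg"])  # one row per image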

decode(encoded_values_tensor)[source]

Currently not supported

encode(images)[source]

Creates encodings for a list of images; each image is referenced by a filepath or url.

Parameters:

images (List[str]) – list of images, each image is a path to a file or a url.

Return type:

Tensor

Returns:

a torch.FloatTensor

prepare(train_priming_data, dev_priming_data)[source]

Sets an Img2Vec object (model) and sets the expected size for encoded representations.

to(device, available_devices=1)[source]

Changes device of model to support CPU/GPU

Parameters:
  • device – will move the model to this device.

  • available_devices – all available devices as reported by lightwood.

Returns:

same object but moved to the target device.

class encoder.MultiHotEncoder(is_target=False)[source]
decode(vectors)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters:

encoded_data – The input representation in encoded format

Returns:

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters:

column_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Returns:

The encoded representation of data, per column

prepare(priming_data, max_dimensions=100)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.NumArrayEncoder(stop_after, window=None, is_target=False, positive_domain=False)[source]
Parameters:
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type – element-wise data type

class encoder.NumericEncoder(data_type=None, target_weights=None, is_target=False, positive_domain=False)[source]

The numeric encoder takes numbers (float or integer) and converts them into tensors of the form:

[0 if the number is none, otherwise 1, 1 if the number is positive, otherwise 0, natural_log(abs(number)), number/absolute_mean]

When encoding target values, the first component is dropped (target values can’t be none), so the representation is:

[1 if the number is positive, otherwise 0, natural_log(abs(number)), number/absolute_mean]

The absolute_mean is computed in the prepare() method and is simply the mean of the absolute values of all non-none numbers fed to prepare().

Here, none stands for any value that is an actual python None or otherwise non-numeric (a string, nan, inf).

Parameters:
  • data_type (Optional[dtype]) – The data type of the number (integer, float, quantity)

  • target_weights (Optional[Dict[float, float]]) – a dictionary of weights to use on the examples.

  • is_target (bool) – Indicates whether the encoder refers to a target column or feature column (True==target)

  • positive_domain (bool) – Forces the encoder to always output positive values
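
A worked sketch of the feature-column representation described above; absolute_mean would normally come from prepare():

    import math

    absolute_mean = 5.0   # mean(|x|) over non-none priming data, from prepare()
    x = -10.0
    vector = [
        1.0,                # 0 if the number is none, otherwise 1
        0.0,                # 1 if the number is positive, otherwise 0
        math.log(abs(x)),   # natural_log(abs(number))
        x / absolute_mean,  # number / absolute_mean
    ]
    # -> [1.0, 0.0, 2.302..., -2.0]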

decode(encoded_values, decode_log=None)[source]
Parameters:
  • encoded_values (Tensor) – The encoded values to decode into single numbers

  • decode_log (Optional[bool]) – Whether to decode the log or linear part of the representation, since the encoded vector contains both a log and a linear part

Return type:

list

Returns:

The decoded array

encode(data)[source]
Parameters:

data (Union[ndarray, Series]) – A pandas series or numpy array containing the numbers to be encoded

Returns:

A torch tensor with the representations of each number

prepare(priming_data)[source]

NumericEncoder uses a rule-based approach, prepared on training (priming) data; statistics such as the absolute mean are taken from this distribution.

Parameters:

priming_data (Series) – an iterable data structure containing the numbers that will be used to compute the values for normalizing the encoded representations

class encoder.OneHotEncoder(is_target=False, target_weights=None, use_unknown=True)[source]

Creates a one-hot encoding (OHE) for categorical data. One-hot encoding represents categorical information as a vector where each individual dimension corresponds to a category; each category maps 1:1 to a dimension, indicated by a “1” in that position. For example, imagine 3 categories, \(A\), \(B\), and \(C\); these can be represented as follows:

\[A = [1, 0, 0], \quad B = [0, 1, 0], \quad C = [0, 0, 1]\]
The OHE encoder operates in 2 modes:
  1. “use_unknown=True”: Makes an \(N+1\) length vector for \(N\) categories, the first index always corresponds to the unknown category.

  2. “use_unknown=False”: Makes an \(N\) length vector for \(N\) categories, where an empty vector of 0s indicates an unknown/missing category.

An encoder can represent either a feature or the target column; when it represents the target, is_target is True and target_weights may be set. The target_weights parameter enables users to specify how heavily each class should be weighted within a mixer, which is useful for imbalanced classes.

By default, the dataprep_ml.StatisticalAnalysis phase provides target_weights as the relative fraction of each class in the data, which is important for imbalanced populations. For example, given an 80/5/15 imbalance across 3 different classes, target_weights will be a dictionary as such:

target_weights = {“class1”: 0.8, “class2”: 0.05, “class3”: 0.15}

Users should note that models will be presented with the inverse of the target weights, inv_target_weights, which applies the 1/target_value_per_class operation. This means large values result in small weights for the model.

Parameters:
  • is_target (bool) – True if this encoder featurizes the target column

  • target_weights (Optional[Dict[str, float]]) – Percentage of total population represented by each category (between [0, 1]).

  • use_unknown (bool) – True uses an extra dimension to account for unknown/out-of-distribution categories
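
A sketch of the two modes, assuming the class is importable from lightwood.encoder; the exact index assigned to each known category is an assumption:

    from lightwood.encoder import OneHotEncoder

    enc = OneHotEncoder(use_unknown=True)   # N+1 dimensions, index 0 = unknown
    enc.prepare(["A", "B", "C"])
    enc.encode(["A", "D"])                  # 'D' unseen -> unknown index set

    enc = OneHotEncoder(use_unknown=False)  # N dimensions
    enc.prepare(["A", "B", "C"])
    enc.encode(["A", "D"])                  # 'D' unseen -> all-zero vector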

decode(encoded_data)[source]

Decodes the OHE mapping into the original categories. Since this approach uses an argmax, decoding works on either logits or an explicit OHE vector.

Parameters:

encoded_data – the encoded data to decode

Returns:

The original category names for the encoded data.

decode_probabilities(encoded_data)[source]

Provides decoded answers, as well as a probability assignment to each data point.

Parameters:

encoded_data (Tensor) – the output of a mixer model

Return type:

Tuple[List[str], List[List[float]], Dict[int, str]]

Returns:

Decoded values for each data point, probability vector for each category, and the reverse map of dimension to category name

encode(column_data)[source]

Encodes pre-processed data into OHE. Unknown/unrecognized classes return a vector of all 0s.

Parameters:

column_data (Iterable[str]) – Pre-processed data to encode

Return type:

Tensor

Returns:

Encoded data of form \(N_{rows} x N_{categories}\)

prepare(priming_data)[source]

Prepares the OHE Encoder by creating a dictionary mapping.

Unknown categories must be explicitly handled as python None types.

class encoder.PretrainedLangEncoder(stop_after, is_target=False, batch_size=10, max_position_embeddings=None, frozen=False, epochs=1, output_type=None, embed_mode=True, device='')[source]
Parameters:
  • is_target (bool) – Whether this encoder represents the target. NOT functional for text generation yet.

  • batch_size (int) – size of batch while fine-tuning

  • max_position_embeddings (Optional[int]) – max sequence length of input text

  • frozen (bool) – If True, freezes transformer layers during training.

  • epochs (int) – number of epochs to train model with

  • output_type (Optional[str]) – Data dtype of the target; if categorical/binary, the option to return logits is possible.

  • embed_mode (bool) – If True, assumes the output of the encode() step is the CLS embedding (this can be trained or not). If False, returns the logits of the tuned task.

  • device (str) – name of the device that get_device_from_name will attempt to use.
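
A hedged usage sketch; shapes and label encoding are illustrative, and encoded_target_values would normally come from the target's own encoder:

    import pandas as pd
    import torch
    from lightwood.encoder import PretrainedLangEncoder

    train = pd.Series(["great product", "terrible service"])
    dev = pd.Series(["okay experience"])
    targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])  # toy OHE labels

    enc = PretrainedLangEncoder(stop_after=600, embed_mode=True)
    enc.prepare(train, dev, encoded_target_values=targets)
    embeddings = enc.encode(["new review text"])  # N_rows x N_embed_dim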

decode(encoded_values_tensor, max_length=100)[source]

Text generation via decoding is not supported.

encode(column_data)[source]

Converts each text example in a column into encoded state. This can be either a vector embedding of the [CLS] token (represents the full text input) OR the logits prediction of the output.

The transformer model is of form: transformer base + pre-classifier linear layer + classifier layer

The embedding returned is of the [CLS] token after the pre-classifier layer; from internal testing, we found this latent space to be the most highly separated across classes.

If the encoder instead represents classification logits, it returns a soft-maxed output of the class vector.

Parameters:

column_data (Iterable[str]) – List of text data as strings

Return type:

Tensor

Returns:

An embedded vector of shape N_rows x N_embed_dim if embed_mode is True, otherwise a logits vector of shape N_rows x N_classes.

is_trainable_encoder: bool = True

Creates a contextualized embedding to represent input text via the [CLS] token vector from DistilBERT (Sanh et al. 2019 - https://arxiv.org/abs/1910.01108).

In certain text tasks, this model can use a transformer to automatically fine-tune on a class of interest (providing there is a 2-column dataset, where the input column is text).

prepare(train_priming_data, dev_priming_data, encoded_target_values)[source]

Fine-tunes a transformer on the priming data.

Train and dev data are concatenated, and the transformer is fine-tuned with weight decay applied to its parameters. If frozen=True, the underlying transformer is frozen and only a linear layer is trained; this trains faster, although performance on internal benchmarks is often lower than full fine-tuning.

Parameters:
  • train_priming_data (Series) – Text data in the train set

  • dev_priming_data (Series) – Text data in the dev set

  • encoded_target_values (Tensor) – Encoded target labels of shape N_rows x N_output_dimension

to(device, available_devices)[source]

Moves the encoder models to the specified device (CPU/GPU).

Transformers are LARGE models; please run on a GPU for fastest execution.

class encoder.ShortTextEncoder(is_target=False, mode=None, device='')[source]
Parameters:
  • is_target

  • mode – None, “concat”, or “mean”. When None, it is set automatically based on is_target: ‘concat’ if is_target, otherwise ‘mean’.

  • device – name of the device that get_device_from_name will attempt to use.
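
A small sketch of the mode defaults, per the description above:

    from lightwood.encoder import ShortTextEncoder

    enc_feature = ShortTextEncoder(is_target=False)  # mode resolves to 'mean'
    enc_target = ShortTextEncoder(is_target=True)    # mode resolves to 'concat'
    enc_forced = ShortTextEncoder(mode="mean")       # explicit override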

decode(vectors)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters:

encoded_data – The input representation in encoded format

Returns:

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters:

column_data (List[str]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type:

Tensor

Returns:

The encoded representation of data, per column

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

class encoder.SimpleLabelEncoder(is_target=False, normalize=True)[source]

Simple encoder that assigns a unique integer to every observed label.

Allocates an unknown label by default to index 0.

Labels must be exact matches between inference and training (e.g. no .lower() on strings is performed here).
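
A round-trip sketch, assuming the class is importable from lightwood.encoder:

    from lightwood.encoder import SimpleLabelEncoder

    enc = SimpleLabelEncoder()
    enc.prepare(["cat", "dog"])
    encoded = enc.encode(["dog", "bird"])  # 'bird' was never seen -> unknown index 0
    decoded = enc.decode(encoded)          # back to labels; unknowns map to the unknown label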

decode(encoded_values, normalize=True)[source]
Parameters:

normalize – can be used to temporarily return unnormalized values

Return type:

List[object]

encode(data, normalize=True)[source]
Parameters:

normalize – can be used to temporarily return unnormalized values

Return type:

Tensor

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data (Union[list, Series]) – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Return type:

None

class encoder.TimeSeriesEncoder(stop_after, window=None, is_target=False, original_type=None)[source]

Time series encoder. This module passes along the normalized series values, together with moving averages taken from the series’ last window values.

Parameters:
  • stop_after (float) – time budget in seconds.

  • window (Optional[int]) – expected length of array data.

  • original_type (Optional[dtype]) – element-wise data type
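
A round-trip sketch; prepare() taking train and dev sequences is an assumption carried over from ArrayEncoder:

    from lightwood.encoder import TimeSeriesEncoder

    enc = TimeSeriesEncoder(stop_after=10, window=3)
    enc.prepare([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]], [[3.0, 6.0, 9.0]])
    t = enc.encode([[1.0, 2.0, 3.0]])  # normalized values + moving averages
    series = enc.decode(t)             # moving-average information is stripped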

decode(data)[source]

Converts encoded data back into a list of arrays, removing all encoded moving-average information.

Parameters:

data (Tensor) – Encoded data prepared by this array encoder

Return type:

List[Iterable]

Returns:

A list of iterable sequences in the original data space

encode(column_data)[source]

Encodes time series data.

Parameters:

column_data (Iterable[Iterable]) – Input column data to be encoded

Return type:

Tensor

Returns:

a torch tensor representing the encoded time series.

class encoder.TsArrayNumericEncoder(timesteps, is_target=False, positive_domain=False, grouped_by=None, nan=0)[source]

This encoder handles arrays of numerical time series data by wrapping the numerical encoder with behavior specific to time series tasks.

Parameters:
  • timesteps (int) – length of forecasting horizon, as defined by TimeseriesSettings.window.

  • is_target (bool) – whether this encoder corresponds to the target column.

  • positive_domain (bool) – whether the column domain is expected to be positive numbers.

  • grouped_by – what columns, if any, are considered to group the original column and yield multiple time series.
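
A hedged sketch for a grouped series; the priming-data layout and the dependency_data key format are assumptions based on the signatures below:

    from lightwood.encoder import TsArrayNumericEncoder

    enc = TsArrayNumericEncoder(timesteps=4, is_target=True, grouped_by=["store"])
    enc.prepare([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]])  # historical windows

    dep = {"store": "A"}                       # picks the right per-group normalizer
    t = enc.encode([[1.0, 2.0, 3.0, 4.0]], dependency_data=dep)
    back = enc.decode(t, dependency_data=dep)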

decode(encoded_values, dependency_data=None)[source]

Decodes a list of encoded arrays into values in their original domains.

Parameters:
  • encoded_values – encoded slices of numerical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type:

List[List]

Returns:

a list of decoded time series arrays.

decode_one(encoded_value, dependency_data={})[source]

Decodes a single window of a time series into its original domain.

Parameters:
  • encoded_value – encoded slice of a numerical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type:

List

Returns:

a list of length TimeseriesSettings.window with decoded values for the forecasted time series.

encode(data, dependency_data={})[source]

Encodes a list of time series arrays using the underlying time series numerical encoder.

Parameters:
  • data (Iterable[Iterable]) – list of numerical values to encode. Its length is determined by the tss.window parameter, and all data points belong to the same time series.

  • dependency_data (Optional[Dict[str, str]]) – dict with values of each group_by column for the time series, used to retrieve the correct normalizer.

Return type:

Tensor

Returns:

list of encoded time series arrays. The tensor is (len(data), N x K)-shaped, where N = self.data_window and K is the number of output features of the sub-encoder.

prepare(priming_data)[source]

This method prepares the underlying time series numerical encoder.

class encoder.TsCatArrayEncoder(timesteps, is_target=False, grouped_by=None)[source]

This encoder handles arrays of categorical time series data by wrapping the OHE encoder with behavior specific to time series tasks.

Parameters:
  • timesteps (int) – length of forecasting horizon, as defined by TimeseriesSettings.window.

  • is_target (bool) – whether this encoder corresponds to the target column.

  • grouped_by – what columns, if any, are considered to group the original column and yield multiple time series.

decode(encoded_values, dependency_data=None)[source]

Decodes a list of encoded arrays into values in their original domains.

Parameters:
  • encoded_values – encoded slices of categorical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type:

List[List]

Returns:

a list of decoded time series arrays.

decode_one(encoded_value)[source]

Decodes a single window of a time series into its original domain.

Parameters:
  • encoded_value – encoded slice of a categorical time series.

  • dependency_data – used to determine the correct normalizer for the input.

Return type:

List

Returns:

a list of length TimeseriesSettings.window with decoded values for the forecasted time series.

encode(data, dependency_data={})[source]

Encodes a list of time series arrays using the underlying time series categorical encoder.

Parameters:
  • data (Iterable[Iterable]) – list of categorical values to encode. Its length is determined by the tss.window parameter, and all data points belong to the same time series.

  • dependency_data (Optional[Dict[str, str]]) – dict with values of each group_by column for the time series, used to retrieve the correct normalizer.

Return type:

Tensor

Returns:

list of encoded time series arrays. The tensor is (len(data), N x K)-shaped, where N = self.data_window and K is the number of output features of the sub-encoder.

encode_one(data)[source]

Encodes a single windowed slice of any given time series.

Parameters:

data (Iterable) – windowed slice of a categorical time series.

Return type:

Tensor

Returns:

an encoded time series array, as per the underlying categorical encoder.

The output of this encoder for all time steps is concatenated, so the final shape of the tensor is (1, N x K), where N = self.data_window and K is the number of output features of the sub-encoder.

prepare(priming_data)[source]

This method prepares the underlying time series categorical encoder.

class encoder.TsNumericEncoder(is_target=False, positive_domain=False, grouped_by=None)[source]

Variant of the vanilla numerical encoder that supports dynamic mean re-scaling.

Parameters:
  • data_type – The data type of the number (integer, float, quantity)

  • target_weights – a dictionary of weights to use on the examples.

  • is_target (bool) – Indicates whether the encoder refers to a target column or feature column (True==target)

  • positive_domain (bool) – Forces the encoder to always output positive values

decode(encoded_values, decode_log=None, dependency_data=None)[source]
Parameters:
  • encoded_values (Tensor) – The encoded values to decode into single numbers

  • decode_log (Optional[bool]) – Whether to decode the log or linear part of the representation, since the encoded vector contains both a log and a linear part

Returns:

The decoded array

encode(data, dependency_data={})[source]
Parameters:
  • data (Union[ndarray, Series]) – A pandas series containing the numbers to be encoded

  • dependency_data (Dict[str, List[Series]]) – dict with grouped_by column info, to retrieve the correct normalizer for each datum

Returns:

A torch tensor with the representations of each number

class encoder.VocabularyEncoder(is_target=False)[source]
decode(encoded_values_tensor)[source]

Given an encoded representation, returns the decoded value. Decoded values may not exist for all encoders (ex: rich text, audio, etc.)

Parameters:

encoded_data – The input representation in encoded format

Returns:

The decoded representation of data, per column, in the original data-type presented.

encode(column_data)[source]

Given the approach defined in prepare(), encodes column data into a numerical representation to form part of the feature vector.

After all columns are featurized, each encoded vector is concatenated to form a feature vector per row in the dataset.

Parameters:

column_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.

Returns:

The encoded representation of data, per column

prepare(priming_data)[source]

Given ‘priming_data’ (i.e. training data), prepares the encoder either through a rule-based approach (ex: one-hot encoding) or a learned model (ex: DistilBERT for text). This operates explicitly on training data only.

Parameters:

priming_data – An iterable data structure where all the elements have type that is compatible with the encoder processing type; this may differ per encoder.