Lightwood API Types
Lightwood consists of several high-level abstractions that enable the data science/machine learning (DS/ML) pipeline to be carried out step by step.
- class api.types.Module[source]
Modules are the blocks of code that end up being called from the JSON AI, representing either object instantiations or function calls.
- Parameters:
module – Name of the module (function or class name)
args – Arguments to pass to the function or constructor
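As a rough illustration of the two fields above, a module entry can be pictured as a small mapping. The `ModuleSketch` type below is a stand-in written for this example, not Lightwood's own class, and the encoder name and its argument are hypothetical.

```python
from typing import Any, Dict, TypedDict


class ModuleSketch(TypedDict):
    """Stand-in mirroring the two documented fields (not Lightwood's class)."""
    module: str            # name of the function or class to call
    args: Dict[str, Any]   # arguments passed to that function or constructor


# Hypothetical encoder entry; both the name and the argument are made up.
encoder_entry: ModuleSketch = {
    "module": "MyNumericEncoder",
    "args": {"is_target": False},
}
```

Keeping the entry as plain data is what lets it be written into, and read back from, a JSON config.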
- class api.types.TimeseriesSettings(is_timeseries, order_by=None, window=None, group_by=None, use_previous_target=True, horizon=None, historical_columns=None, target_type='', allow_incomplete_history=True, eval_incomplete=False, interval_periods=())[source]
For time-series specific problems, more specific treatment of the data is necessary. The following attributes enable time-series tasks to be carried out properly.
- Parameters:
is_timeseries (bool) – Whether the input data should be treated as time series; if true, this flag is checked in subsequent internal steps to ensure processing is appropriate for time-series data.
order_by (Optional[str]) – Column by which the data should be ordered.
group_by (Optional[List[str]]) – Optional list of columns by which the data should be grouped. Each different combination of values for these columns will yield a different series.
window (Optional[int]) – The number of rows the model looks back over when making a prediction, after the rows are ordered by the order_by column and split into groups if applicable.
horizon (Optional[int]) – The number of points in the future that predictions should be made for, defaults to 1. Once trained, the model will be able to predict up to this many points into the future.
historical_columns (Optional[List[str]]) – The temporal dynamics of these columns will be used as additional context to train the time series predictor. Note that non-historical columns are still used to forecast, but without considering their change through time.
target_type (str) – Automatically inferred dtype of the target (e.g. dtype.integer, dtype.float).
use_previous_target (bool) – Use the previous values of the target column to generate predictions. Defaults to True.
allow_incomplete_history (bool) – Whether predictions can be made for rows with incomplete historical context (i.e. fewer than window rows have been observed for the datetime that has to be forecasted).
eval_incomplete (bool) – Whether to consider predictions with incomplete history or target information when evaluating mixer accuracy with the validation dataset.
interval_periods (tuple) – Tuple of tuples with user-provided period lengths for time intervals. Default values will be added for intervals left unspecified. For interval options, check the timeseries_analyzer.detect_period() method documentation. E.g.: (('daily', 7),).
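To make the interplay of order_by, group_by, window and horizon concrete, here is a minimal pure-Python sketch. It is an illustration of the concepts only, not Lightwood's internal transformation; the column names, the toy rows and the `make_examples` helper are all hypothetical.

```python
from collections import defaultdict

# Toy rows, deliberately out of order; "store" plays the role of group_by.
rows = [
    {"store": "A", "day": 3, "sales": 12},
    {"store": "B", "day": 1, "sales": 5},
    {"store": "A", "day": 1, "sales": 10},
    {"store": "A", "day": 2, "sales": 11},
    {"store": "B", "day": 2, "sales": 6},
    {"store": "A", "day": 4, "sales": 13},
    {"store": "A", "day": 5, "sales": 14},
]


def make_examples(rows, order_by, group_by, target, window, horizon):
    """Split rows into series, order each one, and cut (context, future) pairs."""
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        series[key].append(row)

    examples = []
    for key, group in series.items():
        group.sort(key=lambda r: r[order_by])
        values = [r[target] for r in group]
        # Only keep positions with `window` past rows and `horizon` future rows.
        for end in range(window, len(values) - horizon + 1):
            context = values[end - window:end]
            future = values[end:end + horizon]
            examples.append((key, context, future))
    return examples


examples = make_examples(rows, "day", ["store"], "sales", window=3, horizon=2)
```

In this sketch store A yields one complete example and store B yields none, because it has fewer than window rows. That is the situation allow_incomplete_history is described as addressing.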
- static from_dict(obj)[source]
Creates a TimeseriesSettings object from python dictionary specifications.
- Param:
obj: A python dictionary with the necessary representation for time-series. The only mandatory keys are order_by and window.
- Returns:
A populated TimeseriesSettings object.
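A minimal sketch of the kind of dictionary described above, with a check for the two mandatory keys. The `check_ts_dict` helper is hypothetical and is not part of Lightwood. The defaults are copied from the constructor signature shown earlier; note that the signature lists horizon=None while the parameter description mentions a default of 1, and this sketch simply follows the signature.

```python
from typing import Any, Dict

# Defaults copied from the documented constructor signature.
SIGNATURE_DEFAULTS: Dict[str, Any] = {
    "group_by": None,
    "use_previous_target": True,
    "horizon": None,
    "historical_columns": None,
    "target_type": "",
    "allow_incomplete_history": True,
    "eval_incomplete": False,
    "interval_periods": (),
}
MANDATORY_KEYS = ("order_by", "window")


def check_ts_dict(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: verify mandatory keys, then fill in defaults."""
    missing = [key for key in MANDATORY_KEYS if key not in obj]
    if missing:
        raise ValueError(f"Missing mandatory time-series keys: {missing}")
    return {**SIGNATURE_DEFAULTS, **obj}


# Column names are made up for the example.
ts_dict = {"order_by": "day", "window": 3, "group_by": ["store"], "horizon": 2}
full_ts_dict = check_ts_dict(ts_dict)
```

With Lightwood installed, a dictionary such as `ts_dict` is what would be handed to `TimeseriesSettings.from_dict`.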
- static from_json(data)[source]
Creates a TimeseriesSettings object from JSON specifications via python dictionary.
- Param:
data: JSON-config file with necessary Time-series specifications
- Returns:
A populated TimeseriesSettings object.
- class api.types.ProblemDefinition(target, pct_invalid, unbias_target, seconds_per_mixer, seconds_per_encoder, expected_additional_time, time_aim, target_weights, positive_domain, timeseries_settings, anomaly_detection, use_default_analysis, embedding_only, dtype_dict, ignore_features, fit_on_all, strict_mode, seed_nr)[source]
The ProblemDefinition object indicates details on how the models that predict the target are prepared. The only required specification from a user is the target, which indicates the column within the input data that the user is trying to predict. Within the ProblemDefinition, the user can specify aspects about how long the feature-engineering preparation may take, and nuances about training the models.
- Parameters:
target (str) – The name of the target column; this is the column that will be used as the goal of the prediction.
pct_invalid (float) – Maximum number of data points tolerated as invalid/missing/unknown. If the data cleaning process exceeds this number, no subsequent steps will be taken.
unbias_target (bool) – If true, all classes are automatically weighted inversely to how often they occur.
seconds_per_mixer (Optional[int]) – Maximum number of seconds to spend PER mixer trained in the list of possible mixers.
seconds_per_encoder (Optional[int]) – Maximum number of seconds to spend when training an encoder that requires data to learn a representation.
expected_additional_time (Optional[int]) – Time budget for non-encoder/mixer tasks (e.g. data analysis, pre-processing, model ensembling or model analysis).
time_aim (Optional[float]) – Time budget (in seconds) to train all needed components for the predictive tasks, including encoders and models.
target_weights (Optional[List[float]]) – Indicates to the accuracy functions how much to weight every target class.
positive_domain (bool) – For numerical tasks, force predictor output to be positive (integer or float).
timeseries_settings (TimeseriesSettings) – TimeseriesSettings object for time-series tasks, refer to its documentation for available settings.
anomaly_detection (bool) – Whether to conduct unsupervised anomaly detection; currently supported only for time series.
dtype_dict (Optional[dict]) – Mapping of features to types (see mindsdb.type_infer for all possible values). This will override the automated type inference results.
ignore_features (List[str]) – The names of the columns the user wishes to ignore in the ML pipeline. Any column name found in this list will be automatically removed from subsequent steps in the ML pipeline.
use_default_analysis (bool) – Whether default analysis blocks are enabled.
fit_on_all (bool) – Whether to fit the model on the held-out validation data. Validation data is strictly used to evaluate how well a model is doing and is NEVER trained on. However, in cases where users anticipate new incoming data over time, the user may train the model further using the entire dataset.
strict_mode (bool) – Crash if an unstable block (mixer, encoder, etc.) fails to run.
seed_nr (int) – Custom seed to use when generating a predictor from this problem definition.
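The unbias_target description says classes are weighted inversely to how often they occur. The exact formula Lightwood applies is not given here, so the following is only one common way to compute such weights, shown on made-up labels; the `inverse_frequency_weights` helper is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by 1 / relative frequency, normalized to sum to 1."""
    counts = Counter(labels)
    total = len(labels)
    # A class seen in 1 of 4 rows gets raw weight 4; one seen in 3 of 4 gets 4/3.
    raw = {cls: total / count for cls, count in counts.items()}
    norm = sum(raw.values())
    return {cls: weight / norm for cls, weight in raw.items()}


labels = ["cat", "cat", "cat", "dog"]
weights = inverse_frequency_weights(labels)
```

Here the rarer class ("dog") ends up with three times the weight of the common one, which is the intended effect of unbiasing.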
- static from_dict(obj)[source]
Creates a ProblemDefinition object from a python dictionary with necessary specifications.
- Parameters:
obj (Dict) – A python dictionary with the necessary features for the ProblemDefinition class. Only requires target to be specified.
- Returns:
A populated ProblemDefinition object.
- static from_json(data)[source]
Creates a ProblemDefinition Object from JSON config file.
- Parameters:
data (str) – The JSON config to load.
- Returns:
A populated ProblemDefinition object.
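The two constructors above take the same information in two forms: a python dictionary and a JSON string. A minimal sketch of both inputs, using only the standard library, follows. The column names are made up, and writing timeseries_settings as a nested dictionary is an assumption about how the settings might be expressed, not something this page confirms.

```python
import json

# Only "target" is required; every other key here is optional.
problem_definition = {
    "target": "sales",
    "time_aim": 300,                 # seconds, per the time_aim description
    "ignore_features": ["row_id"],
    # Assumption: nested time-series settings expressed as a plain dict.
    "timeseries_settings": {
        "order_by": "day",
        "window": 3,
        "group_by": ["store"],
        "horizon": 2,
    },
}

# The same specification as a JSON string, and read back again.
as_json = json.dumps(problem_definition)
restored = json.loads(as_json)

# With Lightwood installed, these would be the inputs to the documented
# ProblemDefinition.from_dict(problem_definition) and
# ProblemDefinition.from_json(as_json).
```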
- class api.types.JsonAI(encoders, dtype_dict, dependency_dict, model, problem_definition, identifiers, cleaner=None, splitter=None, analyzer=None, explainer=None, imputers=None, analysis_blocks=None, timeseries_transformer=None, timeseries_analyzer=None, accuracy_functions=None)[source]
The JsonAI class allows users to construct a flexible JSON config to specify their ML pipeline. JSON-AI follows a recipe of how to pre-process data, construct features, and train on the target column. To do so, the following specifications are required internally.
- Parameters:
encoders (Dict[str, Module]) – A dictionary of the form: column_name -> encoder module.
dtype_dict (Dict[str, dtype]) – A dictionary of the form: column_name -> data type.
dependency_dict (Dict[str, List[str]]) – A dictionary of the form: column_name -> list of columns it depends on.
model (Dict[str, Module]) – The ensemble and its submodels.
problem_definition (ProblemDefinition) – The ProblemDefinition criteria.
identifiers (Dict[str, str]) – A dictionary of column names and respective data types that are likely identifiers/IDs within the data. Through the default cleaning process, these are ignored.
cleaner (Optional[Module]) – The Cleaner object represents the pre-processing step on a dataframe. The user can specify custom subroutines, if they choose, on how to handle preprocessing. Alternatively, “None” suggests the default approach in dataprep_ml.cleaners.
splitter (Optional[Module]) – The Splitter object is the method in which the input data is split into training/validation/testing data. For more details, refer to the dataprep_ml package documentation.
analyzer (Optional[Module]) – The Analyzer object is used to evaluate how well a model performed on the predictive task.
explainer (Optional[Module]) – The Explainer object deploys explainability tools of interest on a model to indicate how well a model generalizes its predictions.
imputers (Optional[List[Module]]) – A list of objects that will impute missing data on each column. They are called inside the cleaner.
analysis_blocks (Optional[List[Module]]) – The blocks that get used in both analysis and inference inside the analyzer and explainer blocks.
timeseries_transformer (Optional[Module]) – Procedure used to transform any timeseries task dataframe into the format that lightwood expects for the rest of the pipeline.
timeseries_analyzer (Optional[Module]) – Procedure that extracts key insights from any timeseries in the data (e.g. measurement frequency, target distribution, etc).
accuracy_functions (Optional[List[Union[str, Module]]]) – A list of performance metrics used to evaluate the best mixers.
- static from_dict(obj)[source]
Creates a JSON-AI object from dictionary specifications of the JSON-config.
- to_dict(encode_json=False)[source]
Creates a python dictionary with necessary modules within the ML pipeline specified from the JSON-AI object.
- Return type:
Dict[str, Union[dict, list, str, int, float, bool, None]]
- Returns:
A python dictionary that has the necessary components of the ML pipeline for a given dataset.
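The required JsonAI fields can be pictured as a nested dictionary of column names and module entries. The skeleton below is a sketch only: every module name (`MyNumericEncoder`, `MyEnsemble`, `MyMixer`), the `submodels` argument, the column names and the dtype strings are hypothetical placeholders, not values Lightwood is known to produce. It relies on the standard `json` module to confirm that the structure stays within the JSON-compatible value types listed under `to_dict`.

```python
import json


def module(name, **args):
    """Build a module-shaped entry: a name plus the arguments to pass."""
    return {"module": name, "args": args}


json_ai_sketch = {
    # column_name -> encoder module
    "encoders": {
        "price": module("MyNumericEncoder"),
        "sales": module("MyNumericEncoder", is_target=True),
    },
    # column_name -> data type (placeholder type names)
    "dtype_dict": {"price": "float", "sales": "integer"},
    # column_name -> list of columns it depends on
    "dependency_dict": {"price": [], "sales": []},
    # the ensemble and its submodels (shape assumed for illustration)
    "model": module("MyEnsemble", submodels=[module("MyMixer")]),
    "problem_definition": {"target": "sales"},
    # no identifier columns in this toy example
    "identifiers": {},
}

# Everything above is built from dicts, lists, strings and booleans,
# so it serializes to JSON without custom encoders.
serialized = json.dumps(json_ai_sketch, indent=2)
```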
- class api.types.ModelAnalysis(accuracies, accuracy_histogram, accuracy_samples, train_sample_size, test_sample_size, column_importances, confusion_matrix, histograms, dtypes, submodel_data)[source]
The ModelAnalysis class stores useful information to describe a model and understand its predictive performance on a validation dataset. For each trained ML algorithm, we store:
- Parameters:
accuracies (Dict[str, float]) – Dictionary with obtained values for each accuracy function (specified in JsonAI).
accuracy_histogram (Dict[str, list]) – Dictionary with histograms of reported accuracy by target value.
accuracy_samples (Dict[str, list]) – Dictionary with sampled pairs of observed target values and respective predictions.
train_sample_size (int) – Size of the training set (data that parameters are updated on).
test_sample_size (int) – Size of the testing set (explicitly held out).
column_importances (Dict[str, float]) – Dictionary with the importance of each column for the model, as estimated by an approach that closely follows a leave-one-covariate-out strategy.
confusion_matrix (object) – A confusion matrix for the validation dataset.
histograms (object) – Histogram for each dataset feature.
dtypes (object) – Inferred data types for each dataset feature.
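The column_importances entry refers to a leave-one-covariate-out strategy. As a rough sketch of that general idea (not Lightwood's implementation, which is only said to follow it closely), the importance of a column can be taken as the drop in validation accuracy when the model has to do without it. The 1-nearest-neighbour model, the helper names and the toy data below are all stand-ins chosen to keep the example short.

```python
def nearest_label(train, point, columns):
    """1-nearest-neighbour prediction using only the given columns."""
    def sq_dist(row):
        return sum((row[c] - point[c]) ** 2 for c in columns)
    _, label = min(train, key=lambda pair: sq_dist(pair[0]))
    return label


def accuracy(train, valid, columns):
    """Fraction of validation rows whose nearest training row has the same label."""
    hits = sum(nearest_label(train, x, columns) == y for x, y in valid)
    return hits / len(valid)


def loco_importances(train, valid, columns):
    """Importance of a column = accuracy with all columns - accuracy without it."""
    base = accuracy(train, valid, columns)
    return {
        col: base - accuracy(train, valid, [c for c in columns if c != col])
        for col in columns
    }


# Toy data: "signal" determines the label, "noise" is unrelated to it.
train = [
    ({"signal": 0.0, "noise": 0.9}, 0),
    ({"signal": 0.1, "noise": 0.1}, 0),
    ({"signal": 1.0, "noise": 0.8}, 1),
    ({"signal": 0.9, "noise": 0.2}, 1),
]
valid = [
    ({"signal": 0.05, "noise": 0.82}, 0),
    ({"signal": 0.95, "noise": 0.12}, 1),
    ({"signal": 0.1, "noise": 0.75}, 0),
    ({"signal": 0.9, "noise": 0.05}, 1),
]

importances = loco_importances(train, valid, ["signal", "noise"])
```

On this toy data, removing "signal" destroys the accuracy while removing "noise" changes nothing, so "signal" receives the higher importance.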
- class api.types.PredictionArguments(predict_proba=True, all_mixers=False, mixer_weights=None, fixed_confidence=None, anomaly_cooldown=1, forecast_offset=0, simple_ts_bounds=False, time_format='', force_ts_infer=False, return_embedding=False)[source]
This class contains all possible arguments that can be passed to a Lightwood predictor at inference time. On each predict call, all arguments included in a parameter dictionary will update the respective fields in the PredictionArguments instance that the predictor will have.
- Parameters:
predict_proba (bool) – Triggers (where supported) predictions in raw probability output form, i.e. for classifiers, instead of returning only the predicted class, the output additionally includes the assigned probability for each class.
all_mixers (bool) – Forces an ensemble to return predictions emitted by all its internal mixers.
mixer_weights (Optional[list]) – A list with coefficients that are normalized into 0-1 bounded scores to mix the output of all mixers available to a compatible ensemble (e.g. [0.5, 0.5] for an ensemble with two mixers would yield the mean prediction). Can be used with WeightedMeanEnsemble, StackedEnsemble or TsStackedEnsemble.
fixed_confidence (Union[int, float, None]) – Used in the ICP analyzer module, specifies a fixed confidence alpha so that predictions are, on average, correct alpha percent of the time. For unsupervised anomaly detection, this also translates into the expected error rate. Bounded between 0.01 and 0.99 (respectively implies wider and tighter bounds, all other parameters being equal).
anomaly_cooldown (int) – Sets the minimum number of timesteps between consecutive firings of the anomaly detector.
simple_ts_bounds (bool) – In forecasting contexts, enabling this parameter disables the usual conformal-based bounds (with Bonferroni correction) and resorts to a simpler way of scaling bounds through the horizon based on the uncertainty estimation for the first value in the forecast (see helpers.ts.add_tn_num_conf_bounds for the implementation).
time_format (str) – For time series predictors. If set to infer, predicted order_by timestamps will be formatted back to the original dataset's order_by format. Any other string value will be used as a formatting string, unless empty (''), which disables the feature (this is the default behavior).
force_ts_infer (bool) – For time series predictors. If set to true, an additional row will be produced for each group in the input DF, corresponding to an out-of-sample forecast with respect to the input timestamps.
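Two behaviours described above lend themselves to a short sketch: a parameter dictionary updating only the fields it names, and mixer_weights being normalized before mixing. The dataclass below is a stand-in carrying the defaults from the documented signature, not Lightwood's own class. `update_args` and `mix` are hypothetical helpers, and rejecting unknown keys is a choice made for this sketch rather than documented behaviour.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class PredictionArgsSketch:
    """Stand-in with the documented defaults (not Lightwood's class)."""
    predict_proba: bool = True
    all_mixers: bool = False
    mixer_weights: Optional[list] = None
    fixed_confidence: Union[int, float, None] = None
    anomaly_cooldown: int = 1
    forecast_offset: int = 0
    simple_ts_bounds: bool = False
    time_format: str = ""
    force_ts_infer: bool = False
    return_embedding: bool = False


def update_args(current, overrides):
    """Hypothetical helper: overwrite only the fields named in `overrides`."""
    known = {f.name for f in fields(current)}
    for name, value in overrides.items():
        if name not in known:
            raise KeyError(f"Unknown prediction argument: {name}")
        setattr(current, name, value)
    return current


def mix(predictions, weights):
    """Normalize weights so they sum to 1, then take the weighted mean."""
    total = sum(weights)
    return sum(p * w / total for p, w in zip(predictions, weights))


pargs = update_args(
    PredictionArgsSketch(),
    {"all_mixers": True, "mixer_weights": [0.5, 0.5]},
)
# Two mixers predicting 10.0 and 20.0 with equal weights give their mean.
blended = mix([10.0, 20.0], pargs.mixer_weights)
```

Fields not mentioned in the dictionary, such as predict_proba, keep their previous values.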