{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial - Introduction to Lightwood's statistical analysis\n", "\n", "\n", "As you might already know, Lightwood is designed to be a flexible machine learning (ML) library that can abstract and automate the entire ML pipeline. Crucially, it is also designed to be extended or modified very easily according to your needs, essentially offering the entire spectrum between fully automated AutoML and a lightweight wrapper for customized ML pipelines.\n", "\n", "As such, we can identify several different customizable \"phases\" in the process. The relevant phase for this tutorial is the \"statistical analysis\", which is normally run in two places:\n", "\n", "* To generate a Json AI object from some dataset and a problem definition\n", "* To train a Lightwood predictor\n", "\n", "In both cases, we leverage the `StatisticalAnalyzer` object from `dataprep_ml` (another ML package in the MindsDB ecosystem) to store key facts about the data we are using, and refer to them afterwards.\n", "\n", "## Objective\n", "\n", "In this tutorial, we will take a look at the automatically generated statistical analysis for a sample dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: load the dataset and define the predictive task\n", "\n", "The first thing we need is a dataset to analyze. Let's use Human Development Index information:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-05-07T17:15:27.263683Z", "iopub.status.busy": "2024-05-07T17:15:27.263490Z", "iopub.status.idle": "2024-05-07T17:15:27.584105Z", "shell.execute_reply": "2024-05-07T17:15:27.583456Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "|   | Population | Area (sq. mi.) | Pop. Density | GDP ($ per capita) | Literacy (%) | Infant mortality | Development Index |\n", "
|---|---|---|---|---|---|---|---|\n", "
| 0 | 9944201 | 1284000 | 7.7 | 1200 | 47.5 | 93.82 | 2 |\n", "
| 1 | 5450661 | 43094 | 126.5 | 31100 | 100.0 | 4.56 | 4 |\n", "
| 2 | 26783383 | 437072 | 61.3 | 1500 | 40.4 | 50.25 | 2 |\n", "
| 3 | 9439 | 102 | 92.5 | 3400 | 97.0 | 7.35 | 4 |\n", "
| 4 | 3431932 | 176220 | 19.5 | 12800 | 98.0 | 11.95 | 3 |\n", "