Open In Colab


Problem Statement

In this notebook, we will explore the relationship between height and weight using Bayesian linear regression. Our goal is to fit a linear model of the form:

$$ y = \alpha + \beta x + \varepsilon $$

where:

  • $y$ represents the weight,
  • $x$ represents the height,
  • $\alpha$ is the intercept,
  • $\beta$ is the slope,
  • $\varepsilon$ is the error term, modeled as Gaussian white noise, i.e., $\varepsilon \sim \mathcal{N}(0, \sigma)$, where $\sigma$ is the standard deviation of the noise.

We will use Bayesian inference to estimate the posterior distributions of $\alpha$ and $\beta$ given our data and prior assumptions. Bayesian methods provide a natural way to quantify uncertainty in our parameter estimates and predictions.

Approach

To achieve our goal, we will:

  1. Load Real Data: We will use an actual dataset representing the heights and weights of individuals, sourced from Kaggle.
  2. Define the Bayesian Model: Using the probabilistic programming package PyMC, we will define our Bayesian linear regression model, specifying our priors for $\alpha$, $\beta$, and $\sigma$.
  3. Perform Inference: We will use Markov Chain Monte Carlo (MCMC) algorithms, such as the No-U-Turn Sampler (NUTS), to sample from the posterior distributions of our model parameters.
  4. Visualization and Prediction: We will visualize the results, including the regression lines sampled from the posterior, the uncertainty intervals, and make predictions on new, unobserved data points.

Reference

This notebook is inspired by examples from the PyMC documentation, specifically the Generalized Linear Regression tutorial. It also builds upon a similar implementation in Julia using Turing.jl. This PyMC recreation aims at providing a more complete illustration of the use of probabilistic programming languages.

Initial setup

Import the necessary packages.

Additionally, this notebook is supposed to be used in Google Colab. The data set (CSV) file is hosted in a private github repo. Therefore, include the github cloning to the temporary session so that the data can be accessed and used in the Colab session.

import os
import arviz as az
import pymc as pm
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['text.usetex'] = True
plt.rcParams['font.family'] = 'STIXGeneral'

Bayesian Workflow

For this exercise, I will implement the following workflow:

  • Collect data: this will be implemented by downloading the relevant data set
  • Build a Bayesian model: this will be built using PyMC
  • Infer the posterior distributions of the parameters $\alpha$ and $\beta$, as well as the model noise
  • Evaluate the fit of the model

Collecting the data

The data to be analyzed will be the height vs. weight data from https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset

# load the data and print the header
csv_path = 'data/SOCR-HeightWeight.csv'

data = pd.read_csv(csv_path)
data.head()

IndexHeight(Inches)Weight(Pounds)
0165.78331112.9925
1271.51521136.4873
2369.39874153.0269
3468.21660142.3354
4567.78781144.2971

Let’s instead work with the International System.

Convert the values to centimeters and kilograms.

# Renaming columns 2 and 3
new_column_names = {data.columns[1]: 'Height (cm)', data.columns[2]: 'Weight (kg)'}
data.rename(columns = new_column_names, inplace = True)

# convert the values to SI units
data[data.columns[1]] = data[data.columns[1]]*2.54
data[data.columns[2]] = data[data.columns[2]]*0.454


# assign the relevant data to variables for easier manipulation
height = data['Height (cm)'][:1000]
weight = data['Weight (kg)'][:1000]

data.head()

IndexHeight (cm)Weight (kg)
01167.08960751.298595
12181.64863361.965234
23176.27280069.474213
34173.27016464.620272
45172.18103765.510883

Visualize the data

# scatter plot of the data

plt.scatter(height, weight, s = 20, edgecolor = 'black', alpha = 0.5)
plt.title('Height vs. Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.grid(True, linestyle='--', linewidth=0.5, color='gray', alpha=0.5)
# plt.show()

png

Building a Bayesian model with PyMC

First, we assume that the weight is a variable dependent on the height. Thus, we can express the Bayesian model as:

$$y \sim \mathcal{N}(\alpha + \beta \mathbf{X}, \sigma^2)$$

Since we want to infer the posterior distribution of the parameters $\theta = {\alpha, \beta, \sigma }$, we need to assign priors to those variables. Remember that $\sigma$ is a measure of the uncertainty in the model.

$$ \begin{align*} \alpha &\sim \mathcal{N}(0,10) \\ \beta &\sim \mathcal{N}(0,1) \\ \sigma &\sim \mathcal{TN}(0,100; 0, \infty) \end{align*} $$ The last distribution is a truncated normal distribution bounded from 0 to $\infty$.

Note: Here, we define the input data height as a MutableData container. The reason for this is because, later, we will want to change this input data, to make predictions. This will become clear a bit later.

with pm.Model() as blr_model:

    x = pm.MutableData('height', height)

    # define the priors
    alpha = pm.Normal('alpha', 0, 10)
    beta = pm.Normal('beta', 0, 10)
    sigma = pm.TruncatedNormal('sigma', mu = 0, sigma = 100, lower = 0)

    # define the likelihood - assign the variable name "y" to the observations
    y = pm.Normal('y', mu = alpha + (beta * x), sigma = sigma, observed = weight, shape = x.shape)

    # inference - crank up the bayes!
    trace = pm.sample(1000, chains = 4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, sigma]
100.00% [8000/8000 00:37<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 53 seconds.

We can explore the trace object.

trace.to_dataframe().columns
Index([                                  'chain',
                                          'draw',
                          ('posterior', 'alpha'),
                           ('posterior', 'beta'),
                          ('posterior', 'sigma'),
           ('sample_stats', 'perf_counter_diff'),
          ('sample_stats', 'perf_counter_start'),
             ('sample_stats', 'smallest_eigval'),
               ('sample_stats', 'step_size_bar'),
         ('sample_stats', 'index_in_trajectory'),
                      ('sample_stats', 'energy'),
            ('sample_stats', 'max_energy_error'),
                ('sample_stats', 'energy_error'),
             ('sample_stats', 'acceptance_rate'),
                  ('sample_stats', 'tree_depth'),
           ('sample_stats', 'process_time_diff'),
                   ('sample_stats', 'step_size'),
                     ('sample_stats', 'n_steps'),
              ('sample_stats', 'largest_eigval'),
                   ('sample_stats', 'diverging'),
                          ('sample_stats', 'lp'),
       ('sample_stats', 'reached_max_treedepth')],
      dtype='object')

Visualize the inference diagnostics

Now that we have performed Bayesian inference using the NUTS() algorithm, we can visualize the results. Additionally, call for a summary of the statistics of the inferred posterior distributions of $\theta$.

# visualize the results
# az.style.use('arviz-darkgrid')

labeller = az.labels.MapLabeller(var_name_map = {'alpha': r'$\alpha$',
                                                'beta': r'$\beta$',
                                                'sigma': r'$\sigma$'})

az.plot_trace(trace, var_names = ['alpha', 'beta', 'sigma'], labeller = labeller, compact = False)
plt.tight_layout()
# plt.show()

png

Interpreting the MCMC Diagnostics Plots

Trace plots are crucial for diagnosing the performance of Markov Chain Monte Carlo (MCMC) algorithms. These plots typically consist of two parts for each parameter: the trace plot and the posterior density plot.

The trace plot shows the sampled values of a parameter across iterations. A well-behaved trace plot should look like a “hairy caterpillar,” indicating good mixing. This means the trace should move around the parameter space without getting stuck and should not display any apparent patterns or trends. If the trace shows a clear trend or drift, it suggests that the chain has not yet converged. For the parameters $\alpha$ (intercept), $\beta$ (slope), and $\sigma$ (standard deviation of noise), we want to see the traces for different chains mixing well and stabilizing around a constant mean.

The posterior density plot shows the distribution of the sampled values of a parameter. This plot helps visualize the posterior distribution of the parameter. A good density plot should be smooth and unimodal, indicating that the parameter has a well-defined posterior distribution. If multiple chains are used, their density plots should overlap significantly, suggesting that all chains are sampling from the same distribution. For $\alpha$, $\beta$, and $\sigma$, overlapping density plots indicate that the chains have converged to the same posterior distribution.

Next, we can visualize the posterior distributions of the inferred parameters.eters.

# visualize the posterior distributions
az.plot_posterior(trace, var_names = ['alpha', 'beta', 'sigma'], labeller = labeller)
plt.show()

png

After visualizing the inference diagnostics and the posterior distributions of the paramters, we can also obtain the summary statistics.

# get the summary statistics of the posterior distributions
pm.summary(trace, kind = "stats")

meansdhdi_3%hdi_97%
alpha-28.5574.558-36.650-19.619
beta0.5000.0260.4490.548
sigma4.6570.1004.4744.850

Visualize the results

Now that we have posterior distributions for the parameters $\theta$, we can plot the the resulting linear regression functions. The following is an excerpt from PyMC’s Generalized Linear Regression tutorial:

In GLMs, we do not only have one best fitting regression line, but many. A posterior predictive plot takes multiple samples from the posterior (intercepts and slopes) and plots a regression line for each of them. We can manually generate these regression lines using the posterior samples directly.

Below, what we will effectively be doing is:

$$ y_i = \alpha_i + \beta_i \mathbf{X} \ \ \ , \ \ \ {i = 1, \ldots , N_{samples}}$$

where $N_{samples}$ are the number of samples from the posterior. This number comes from the inference procedure, and in practical terms is the umber of samples we asked PyMC to produce.

In other words, plotting the samples from the posterior distribution involves plotting the regression lines sampled from the posterior. Each sample represents a possible realization of the regression line based on the sampled values of the parameters $\alpha$ (intercept) and $\beta$ (slope).

These sample regression lines ullustrate the uncertainty in the regression model’s parameters and how this uncertainty propagates into the predictions (of the regression line).

# use the posterior to create regression line samples
# equivalent to: y[i]  = alpha[i] + beta[i]*X
trace.posterior["y_posterior"] = trace.posterior["alpha"] + trace.posterior["beta"]*xr.DataArray(height)

# plot the regression lines
_, ax = plt.subplots(figsize=(7,7))
az.plot_lm(idata = trace, y = weight, x = height, axes=ax, y_model="y_posterior",
           y_kwargs={"color":"b", "alpha":0.2, "markeredgecolor":"k", "label":"Observed Data", "markersize":10},
           y_model_plot_kwargs={"alpha": 0.2, "zorder": 10, "color":"#00cc99"},
           y_model_mean_kwargs={"color":"red"}
          )

plt.show()

png

Using the Linear Regression Model to Make Predictions

Now that we have a fitted Bayesian linear regression model, we can use it to make predictions. This involves sampling from the posterior predictive distribution, which allows us to generate predictions for new data points while incorporating the uncertainty from the posterior distribution of the parameters.

Sample from the Posterior Predictive Distribution:

  • This step involves using the inferred trace from our Bayesian linear regression model blr_model to generate predictions. The pm.sample_posterior_predictive function in PyMC allows us to do this. It uses the posterior samples of the parameters to compute the predicted values of the outcome variable.
# now predict the outcomes using the inferred trace
with blr_model:
    # use the updated values and predict outcomes and probabilities:
    pm.sample_posterior_predictive(
        trace,
        var_names = ['y'],
        return_inferencedata=True,
        extend_inferencedata=True,
    )
Sampling: [y]
100.00% [4000/4000 00:00<00:00]

Exploring the Trace Object

The trace object stores the results of our inference. Initially, it contained the posterior samples of the model parameters (e.g., intercept and slope).

After running pm.sample_posterior_predictive, the trace object is extended to include the posterior predictive samples. These are the predicted values for the outcome variable, given the posterior distribution of the model parameters.

# explore the trace object again
trace.to_dataframe().columns
Index([                                  'chain',
                                          'draw',
                          ('posterior', 'alpha'),
                           ('posterior', 'beta'),
                          ('posterior', 'sigma'),
              ('posterior', 'y_posterior[0]', 0),
          ('posterior', 'y_posterior[100]', 100),
          ('posterior', 'y_posterior[101]', 101),
          ('posterior', 'y_posterior[102]', 102),
          ('posterior', 'y_posterior[103]', 103),
       ...
                ('sample_stats', 'energy_error'),
             ('sample_stats', 'acceptance_rate'),
                  ('sample_stats', 'tree_depth'),
           ('sample_stats', 'process_time_diff'),
                   ('sample_stats', 'step_size'),
                     ('sample_stats', 'n_steps'),
              ('sample_stats', 'largest_eigval'),
                   ('sample_stats', 'diverging'),
                          ('sample_stats', 'lp'),
       ('sample_stats', 'reached_max_treedepth')],
      dtype='object', length=2022)

We can observe how now we have another inference data container: posterior_predictive. This was generated by passing the extend_inferencedata argument to the pm.sample_posterior_predictive function above.

This data contains predictions by passing the observed heights through our linear model and making predictions. Note that these “predictions” are made on observed data. This is similar to using validating the predictions on training data in machine learning, i.e. comparing the model predictions to the actual data on an observed input.

We can use the linear regression model to make predictions. It should be noted that, again, the linear regression model is not a single regression line, but rather a set of regression lines generated from the posterior probability of $\theta$.

Visualize the Prediction Confidence Interval

After we sampled from the posterior, we might want to visualize this to understand the posterior predictive distribution.

In the code below, there are two things going on, let’s go through them.

  1. Plotting the samples from the posterior distribution

This part is exactly what we did before, which is plotting the sample posteriors of the regression line. These sample regression lines are a natural product of propagating the uncertainty from the parameters unto the prediction line.

  1. Plotting the uncertainty in the mean and the observations

Now we can add a ribbon to show the uncertainty not only in the regression line, but in the prediction points themselves. That is, that ribbon will tell us where we might expect a prediction point $i+1$, i.e.

$$ y_{i+1} = \alpha_{i+1} + \beta_{i+1} x^* $$

where $x^*$ is a test input point. In other words, and more specific to this demonstration:

what is the interval where we would expect a predicted weight $y_{i+1}$ of an individual with a height $x*$.

# use the posterior to create regression line samples
# trace.posterior["y_posterior"] = trace.posterior["alpha"] + trace.posterior["beta"]*xr.DataArray(height)  # y_posterior = alpha + beta*x
_, ax = plt.subplots(figsize=(7,7))
az.plot_lm(idata = trace, y = weight, x = height, axes=ax, y_model="y_posterior",
           y_kwargs={"color":"b", "alpha":0.2, "markeredgecolor":"k", "label":"Observed Data", "markersize":10},
           y_model_plot_kwargs={"alpha": 0.2, "zorder": 10, "color":"#00cc99"},
           y_model_mean_kwargs={"color":"red"}
          );

# plot the prediction interval
az.plot_hdi(
    height,
    trace.posterior_predictive["y"],
    hdi_prob=0.6,
    fill_kwargs={"alpha": 0.8},
)

plt.show()

png

Making Predictions on Unobserved Data Inputs

Now, how about the case when we want to make predictions on test data that we have not seen? That is, predict the weight of an individual whose height/weight we have not observed (measured)

In other words, we have some test input data, i.e. some heights for which we want to predict the weights.

Some references of where I learned how to do this:

  1. In this example and this other example it says that we can generate out-of-sample predictions by using pm.sample_posterior_predictive and it shows an example of how to use the syntax.

  2. More recently, this demo blog post clarifies how to make predictions on out-of-model samples.

Let’s do just that now. First, we will define the test inputs we want to predict for, pred_height. Then, inside the model, we replace the data (which was defined as MutableData, with the new data we want to make predictions on. This is done as follows:

# set new data inputs:
pred_height = np.array([ 'new_data' ])

with blr_model:
  pm.set_data({'height': pred_height})

What this is effectively doing is telling sample_posterior_predictive that we need to make predictions on height which now happens to be different.

# define the out-of-sample predictors
pred_height = [158.0, 185.5, 165.2, 178.0,  180.0, 170.2]

print(pred_height)

with blr_model:
    # set the new data we want to make predictions for
    pm.set_data({'height': pred_height})

    post_pred = pm.sample_posterior_predictive(
        trace,
        predictions = True
    )
Sampling: [y]


[158.0, 185.5, 165.2, 178.0, 180.0, 170.2]
100.00% [4000/4000 00:00<00:00]

What we have done above is create an inference data object called post_pred. This object contains the samples of the predictions on the new data. Specifically, it includes two containers: predictions and predictions_constant_data.

The predictions container holds the predicted samples for our new heights. The predictions_constant_data holds the new heights we passed into the model.

post_pred.to_dataframe()

chaindraw(y[0], 0)(y[1], 1)(y[2], 2)(y[3], 3)(y[4], 4)(y[5], 5)
00048.98193062.97118662.14338559.30074256.10023754.329348
10155.48119265.13287654.76187761.31225459.22012451.817360
20249.47155066.01691060.64627357.87634456.20372060.318281
30353.37373766.59365353.08579963.43794964.33662645.372830
40452.98130969.32005951.59068660.37204662.21073848.188656
...........................
3995399552.30381461.93111747.54421660.82440161.46954562.353284
3996399656.03229556.97904054.58483755.89421665.94390850.929285
3997399756.06235250.88949951.44100357.84153362.89865452.749139
3998399848.22877265.98338352.38116455.28394665.46804970.367514
3999399958.43418454.73936356.77326053.12811261.69546954.874142

4000 rows × 8 columns

We can visualize the posterior distributions of the predictions.

az.plot_posterior(post_pred, group="predictions");

png

We can obtain point estimates by taking the mean of each prediction distribution. This is done by taking the mean of the predictions over the chain and draw dimensions, as follows:

pred_weight = post_pred.predictions['y'].mean(dim = ['chain', 'draw'])
print("Predicted weights: ", pred_weight.values)
Predicted weights:  [50.37415152 64.29241929 54.02070975 60.60276731 61.36759368 56.53983895]

Finally, we can visualize where the predictions fall by adding a scatter plot with the new ${x^, y^}$ data.

# use the posterior to create regression line samples
# trace.posterior["y_posterior"] = trace.posterior["alpha"] + trace.posterior["beta"]*xr.DataArray(height)  # y_posterior = alpha + beta*x
_, ax = plt.subplots(figsize=(7,7))

az.plot_lm(idata = trace, y = weight, x = height, axes=ax, y_model="y_posterior",
           y_kwargs={"color":"b", "alpha":0.2, "markeredgecolor":"k", "label":"Observed Data", "markersize":10},
           y_model_plot_kwargs={"alpha": 0.2, "zorder": 10, "color":"#00cc99"},
           y_model_mean_kwargs={"color":"red"}
          );

# plot the prediction interval
az.plot_hdi(
    height,
    trace.posterior_predictive["y"],
    hdi_prob=0.6,
    fill_kwargs={"alpha": 0.8},
)

# add predicted weights to the plot

ax.scatter(pred_height,
           pred_weight.values,
           color = 'blue',
           label = 'Predicted Weights',
           zorder = 15
           )

ax.legend()

plt.show()

png

Thank you!

This demo focused on a relatively simple task. Here, however, we focused more on what a Bayesian approach means in the context of a linear regression. Additionally, we focused on using PyMC for developing the model, visualizing the results and, just as importantly, on making predictions using those results.

Victor