In this example, I am following the tutorials found in:
Both examples show the interaction between some predictor variables and a discrete outcome. In this case, the outcome is the number of sneezes per day (a count, and therefore discrete) reported by some study subjects, and the predictors are whether or not they take antihistamine medicine and whether or not they drink alcohol.
This example explores how these factors, and more specifically, the combination of these factors, affect the number of times a person sneezes.
using CSV, DataFrames, Turing, StatsPlots, Plots, Random
Random.seed!(42)
TaskLocalRNG()
Collect (generate) the data
In this example, we will generate the data in the same way as in the tutorials:
| | No Alcohol | Alcohol |
|---|---|---|
| No Meds | 6 | 36 |
| Meds | 1 | 3 |
Those values will be used to create the artificial data by generating Poisson-distributed random samples.
theta_noalc_nomed = 6
theta_noalc_med = 1
theta_alc_nomed = 36
theta_alc_med = 3
ns = 500 # number of samples
# create a data frame
data = DataFrame(
hcat(
vcat(
rand(Poisson(theta_noalc_med), ns),
rand(Poisson(theta_alc_med), ns),
rand(Poisson(theta_noalc_nomed), ns),
rand(Poisson(theta_alc_nomed), ns)
),
vcat(
falses(ns),
trues(ns),
falses(ns),
trues(ns)
),
vcat(
falses(ns),
falses(ns),
trues(ns),
trues(ns)
)
), :auto
)
# assign names to the columns
head_names = [:n_sneezes, :alcohol, :nomeds]
sneeze_data = rename(data, head_names)
first(sneeze_data, 10)
| Row | n_sneezes | alcohol | nomeds |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 |
| 9 | 2 | 0 | 0 |
| 10 | 2 | 0 | 0 |
Visualize the data
Now that we have “collected” some data on the number of sneezes per day from a number of people, we visualize the data.
The way we are collecting and plotting these data sub-sets is as follows:
- Call the histogram function
- Create a histogram of the dataframe “sneeze_data” we “collected” previously
- Select a subset of that dataframe:
  - all the rows where alcohol is false (i.e. 0) AND nomeds is also false (no alcohol, took meds)
  - all the rows where alcohol is true AND nomeds is false (alcohol, took meds)
  - … and so on for the remaining two combinations
# create separate histograms for each case
p1 = histogram(sneeze_data[(sneeze_data[:,:alcohol] .== 0) .& (sneeze_data[:,:nomeds] .== 0), :n_sneezes]; title = "No Alcohol + Meds", ylabel="People Count")
p2 = histogram(sneeze_data[(sneeze_data[:,:alcohol] .== 1) .& (sneeze_data[:,:nomeds] .== 0), :n_sneezes]; title = "Alcohol + Meds")
p3 = histogram(sneeze_data[(sneeze_data[:,:alcohol] .== 0) .& (sneeze_data[:,:nomeds] .== 1), :n_sneezes]; title = "No Alcohol + No Meds", xlabel = "Sneezes/Day", ylabel="People Count")
p4 = histogram(sneeze_data[(sneeze_data[:,:alcohol] .== 1) .& (sneeze_data[:,:nomeds] .== 1), :n_sneezes]; title = "Alcohol + No Meds", xlabel = "Sneezes/Day")
plot(p1, p2, p3, p4; layout=(2,2), legend = false)
Interpreting the data
The histograms show that the data from the “study” resembles a Poisson distribution (as mentioned in the PyMC tutorial, this is obvious, because that’s how the data is generated!). Furthermore, the data is telling us something:
- Looking at the plot for “no alcohol and medicine”, it is clear that most people reported very few sneezes; notice how the counts (of people) pile up at very low sneeze numbers
- On the other hand, the “alcohol and no medicine” plot tells us that many people reported somewhere around 35 sneezes per day

Again, we can start to see a pattern just by looking at the data: if you don’t drink alcohol and you take antihistamines, you are likely to sneeze far less than if you drink alcohol and don’t take any allergy meds. Makes sense, right?
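Before moving on to the model, a quick numerical sanity check confirms what the histograms suggest. This is just a sketch using the standard groupby/combine machinery from DataFrames.jl; the group means should land close to the theta values used to generate the data.

using Statistics  # provides `mean` (may already be available via Turing)

# empirical mean sneeze count for each (alcohol, nomeds) combination
combine(groupby(sneeze_data, [:alcohol, :nomeds]), :n_sneezes => mean => :mean_sneezes)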
Model
We established that the data looks like it could be modelled as a Poisson distribution. Thus, we can define our probabilistic model as follows:
$$Y_{obs} \sim \mathrm{Poisson}(\lambda)$$
$$\log(\lambda) = \theta^\top \mathbf{x} = \alpha + \beta^\top \mathbf{x}$$
What the above means is that we assume that the observed data outcomes, i.e., the number of sneezes per day, follow a Poisson distribution, which is a discrete probability distribution that models the number of events that occur in a fixed interval of time or space. The rate or intensity of the events, $\lambda$, depends on the predictor variables (the input data) $\mathbf{x}$, such as the season, the temperature, or, in our case, whether a person ingested alcohol and whether the person took antihistamines.
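For reference, the Poisson probability mass function for a count $k \in \{0, 1, 2, \ldots\}$ with rate $\lambda > 0$ is

$$P(Y = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

and both the mean and the variance of $Y$ are equal to $\lambda$.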
The linear predictor $\theta^\top \mathbf{x}$ is the function that links the predictor variables to the rate parameter, where $\theta = \{\alpha, \beta\}$ are the parameters of the model.
Looking at the structure of the linear relationship between the parameters of the model and the predictors:
$$\log(\lambda) = \alpha + \beta^\top \mathbf{x}$$
we can understand that the parameter $\alpha$ is the intercept: $\exp(\alpha)$ is the expected number of sneezes when all the predictor variables are zero. The parameter $\beta$ is a vector of coefficients, which measure the effect of each predictor variable in $\mathbf{x}$ on the number of sneezes. The log link function ensures that the rate parameter $\lambda$ is always positive and allows for multiplicative effects of the predictor variables on the response variable.
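To make the multiplicative effect concrete, here is a tiny sketch with made-up coefficient values (purely illustrative, not the fitted ones): each binary predictor that switches from 0 to 1 multiplies the rate by the exponential of its coefficient.

# illustrative (made-up) values: intercept and two coefficients for binary predictors
alpha_demo, beta_alc_demo, beta_nomeds_demo = 0.0, 1.0, 2.0
# rate implied by the log link for a given combination of predictors
rate(x_alc, x_nomeds) = exp(alpha_demo + beta_alc_demo * x_alc + beta_nomeds_demo * x_nomeds)
rate(0, 0)   # exp(alpha_demo): baseline rate
rate(1, 0)   # baseline multiplied by exp(beta_alc_demo)
rate(1, 1)   # baseline multiplied by exp(beta_alc_demo) * exp(beta_nomeds_demo)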
Define the model with Turing.jl
Now that we know how we are modeling our data, we use the package Turing.jl to define the model. Turing.jl is a tool that helps us write models in Julia and find the best parameters for them.
The model has two parts: the prior and the likelihood. The prior is what we think or guess about the parameters before we see the data. The likelihood is how likely the data is under the parameters. The parameters are the numbers that control the model, such as the rate of sneezes.
We use the Poisson distribution for the likelihood, because it is good for counting things, like sneezes. The Poisson distribution has one parameter, the rate of sneezes. The higher the rate, the more sneezes we expect.
We can use any distribution for the prior, depending on how much we know about the parameters. If we know nothing, we can use a flat prior, which does not favor any value. The prior affects the final answer, because it is our starting point.
We use Bayes’ theorem to combine the prior and the likelihood and get the final answer. The final answer is the posterior, which is what we believe about the parameters after we see the data. It is the distribution over the parameters that best reconciles our prior assumptions with the observed data.
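In symbols, with $y$ denoting the observed sneeze counts and $\alpha, \beta$ the model parameters (with independent priors, as we will use below), Bayes’ theorem reads

$$p(\alpha, \beta \mid y) = \frac{p(y \mid \alpha, \beta)\, p(\alpha)\, p(\beta)}{p(y)} \propto p(y \mid \alpha, \beta)\, p(\alpha)\, p(\beta)$$

The denominator $p(y)$ is just a normalizing constant, which is why samplers such as NUTS only need the (log of the) numerator.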
Let’s crank up the Bayes!
@model function poisson(x, y)
# define the priors
alpha ~ Normal(0,1)
alcohol ~ Normal(0, 1)
nomeds ~ Normal(0, 1)
# alc_med ~ Normal(0,1)
# define the likelihood
for i in 1:length(y)
log_lambda = alpha + alcohol * x[i, 1] + nomeds * x[i, 2]
lambda = exp(log_lambda)
y[i] ~ Poisson(lambda)
end
end
poisson (generic function with 2 methods)
# pass the data to the model function
# pass the predictor data as a Matrix for efficiency
model = poisson(Matrix(sneeze_data[!,[:alcohol, :nomeds] ]), sneeze_data[!, :n_sneezes])
# select the sampler
sampler = NUTS()
# define the number of samples per chain
samples = 1000
# set number of chains
num_chains = 8
# crank up the Bayes!
chain = sample(model, sampler, MCMCThreads(), samples, num_chains)
┌ Info: Found initial step size
└   ϵ = 0.00625
┌ Info: Found initial step size
└   ϵ = 0.0125
┌ Info: Found initial step size
└   ϵ = 0.00625
┌ Info: Found initial step size
└   ϵ = 0.0125
┌ Info: Found initial step size
└   ϵ = 0.0125
┌ Info: Found initial step size
└   ϵ = 0.00625
┌ Info: Found initial step size
└   ϵ = 0.00625
┌ Info: Found initial step size
└   ϵ = 0.0125
Sampling (8 threads): 100%|█████████████████████████████| Time: 0:00:00
Chains MCMC chain (1000×15×8 Array{Float64, 3}):
Iterations = 501:1:1500
Number of chains = 8
Samples per chain = 1000
Wall duration = 13.66 seconds
Compute duration = 100.67 seconds
parameters = alpha, alcohol, nomeds
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ⋯
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64   ⋯

       alpha   -0.5025    0.0277    0.0005   2943.5608   2841.2874    1.0030   ⋯
     alcohol    1.7333    0.0186    0.0003   3801.1996   3652.2403    1.0022   ⋯
      nomeds    2.3348    0.0236    0.0004   2901.3750   3410.6453    1.0020   ⋯
                                                                 1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

       alpha   -0.5568   -0.5212   -0.5023   -0.4839   -0.4486
     alcohol    1.6974    1.7205    1.7331    1.7458    1.7698
      nomeds    2.2891    2.3189    2.3346    2.3506    2.3824
NOTE: The above routine employs the MCMCThreads method to sample multiple chains in parallel. However, in order to use it, Julia needs to be started with more than one thread, either via the JULIA_NUM_THREADS environment variable or the --threads startup flag (a short check is sketched right after the links). These two links might shed some light on how to achieve this:
- https://docs.julialang.org/en/v1/manual/multi-threading/#man-multithreading
- https://discourse.julialang.org/t/julia-num-threads-in-vs-code-windows-10-wsl/28794
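For instance, assuming you start Julia from a terminal, launching it with the --threads flag (or setting JULIA_NUM_THREADS beforehand) and then checking the thread count inside the session is usually enough:

# start Julia with, e.g., `julia --threads 8`
# (or set the JULIA_NUM_THREADS environment variable before launching), then verify:
Threads.nthreads()   # reports how many threads this Julia session can use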
Of course, if you don’t want to bother, then just change the last two functional lines in the cell above so that they read:
# set number of chains - comment this out:
# num_chains = 8
# crank up the Bayes! - delete MCMCThreads() and num_chains
chain = sample(model, sampler, samples)
Visualize the results
We can see above that we have obtained a sample pool of the posterior distribution of the parameters. This is what we were looking for. What this means is that now we have a posterior distribution (in the form of a sample pool), which we can also summarize with summary statistics.
Let’s look at the diagnostics plots and the summary statistics.
plot(chain)
DataFrame(summarystats(chain))
| Row | parameters | mean | std | mcse | ess_bulk | ess_tail | rhat | ess_per_sec |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | alpha | -0.502519 | 0.0276553 | 0.000511069 | 2943.56 | 2841.29 | 1.00298 | 29.2397 |
| 2 | alcohol | 1.7333 | 0.0186097 | 0.000301611 | 3801.2 | 3652.24 | 1.00224 | 37.759 |
| 3 | nomeds | 2.3348 | 0.0236269 | 0.000436385 | 2901.38 | 3410.65 | 1.00197 | 28.8207 |
# taking the first chain
c1 = chain[:, :, 1]
# Calculating the exponentiated means
b0_exp = exp(mean(c1[:alpha]))
b1_exp = exp(mean(c1[:alcohol]))
b2_exp = exp(mean(c1[:nomeds]))
println("The exponent of the mean of the weights (or coefficients) are: \n")
println("b0: ", b0_exp)
println("b1: ", b1_exp)
println("b2: ", b2_exp)
The exponentiated means of the weights (or coefficients) are:
b0: 0.604415461752317
b1: 5.658573583760772
b2: 10.342642711232362
Notice how we are not recovering the original $\lambda$ values that were used to create this data set, i.e.:
theta_noalc_nomed = 6
theta_noalc_med = 1
theta_alc_nomed = 36
theta_alc_med = 3
Instead, we are recovering the parameters of the linear function, in other words, $\theta = \{\alpha, \beta\}$ in the linear relation:
$$\log(\lambda) = \alpha + \beta_1 x_{alc} + \beta_2 x_{nomeds}$$
where $x_{(\cdot)}$ is the binary indicator of whether the subject drank alcohol and whether they skipped the antihistamines.
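To connect the fitted coefficients back to the generating rates, we can combine the exponentiated means computed above for the four predictor combinations. This is just a sketch reusing b0_exp, b1_exp and b2_exp from the earlier cell; because the model has no interaction term, the implied rates will only approximately match 1, 3, 6 and 36.

# implied rate for each (alcohol, nomeds) combination; multiplying the
# exponentiated coefficients corresponds to adding them on the log scale
lambda_noalc_med   = b0_exp                      # alcohol = 0, nomeds = 0 (true theta = 1)
lambda_alc_med     = b0_exp * b1_exp             # alcohol = 1, nomeds = 0 (true theta = 3)
lambda_noalc_nomed = b0_exp * b2_exp             # alcohol = 0, nomeds = 1 (true theta = 6)
lambda_alc_nomed   = b0_exp * b1_exp * b2_exp    # alcohol = 1, nomeds = 1 (true theta = 36)

Adding the commented-out interaction term (alc_med) to the model would make it flexible enough to reproduce all four group rates more closely.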
Conclusion
This tutorial shows how to perform Bayesian inference on discrete data, e.g. a record of how many sneezes per day each member of a group of people had, classified according to their alcohol and medication consumption.
In real-world scenarios we would obviously not know the parameter values; finding them, by combining whatever prior knowledge we have about them with what we observe, is precisely the point of the inference.