Logistic regression with GLM.jl
The purpose of this tutorial is to explain how to embed a logistic regression model from GLM.jl into JuMP.
The data and example in this tutorial comes from the paper: David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, Arvind U. Raghunathan (2021) JANOS: An Integrated Predictive and Prescriptive Modeling Framework. INFORMS Journal on Computing 34(2):807-816. https://doi.org/10.1287/ijoc.2020.1023
Required packages
This tutorial uses the following packages.
julia> using JuMP
julia> import CSV
julia> import DataFrames
julia> import Downloads
julia> import GLM
julia> import Ipopt
julia> import MathOptAI
julia> import Statistics
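Downloads and Statistics are standard libraries that ship with Julia. If the other packages are not already installed, one way to add them is via Pkg:

julia> import Pkg

julia> Pkg.add(["JuMP", "CSV", "DataFrames", "GLM", "Ipopt", "MathOptAI"])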
Data
Here is a function to load the data directly from the JANOS repository:
julia> function read_df(filename)
           url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
           data = Downloads.download(url * filename)
           return CSV.read(data, DataFrames.DataFrame)
       end
read_df (generic function with 1 method)
There are two important files. The first, college_student_enroll-s1-1.csv, contains historical admissions data on anonymized students: their SAT score, their GPA, their merit scholarship, and whether they enrolled in the college.
julia> train_df = read_df("college_student_enroll-s1-1.csv")
20000×6 DataFrame
   Row │ Column1  StudentID  SAT    GPA      merit    enroll
       │ Int64    Int64      Int64  Float64  Float64  Int64
───────┼─────────────────────────────────────────────────────
     1 │       1          1   1507     3.72     1.64       0
     2 │       2          2   1532     3.93     0.52       0
     3 │       3          3   1487     3.77     1.67       0
     4 │       4          4   1259     3.05     1.21       1
     5 │       5          5   1354     3.39     1.65       1
     6 │       6          6   1334     3.22     0.0        0
     7 │       7          7   1125     2.73     1.68       1
     8 │       8          8   1180     2.82     0.0        1
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮       ⋮
 19994 │   19994      19994   1185     3.09     1.16       1
 19995 │   19995      19995   1471     3.7      1.05       0
 19996 │   19996      19996   1139     3.03     1.21       1
 19997 │   19997      19997   1371     3.39     1.26       0
 19998 │   19998      19998   1424     3.72     0.85       0
 19999 │   19999      19999   1170     3.01     0.73       1
 20000 │   20000      20000   1389     3.57     0.55       0
                                           19985 rows omitted
The second, college_applications6000.csv, contains the SAT and GPA data of students who are currently applying:
julia> evaluate_df = read_df("college_applications6000.csv")
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted
There are 6,000 prospective students:
julia> n_students = size(evaluate_df, 1)
6000
Prediction model
The first step is to train a logistic regression model to predict the Boolean enroll column based on the SAT, GPA, and merit columns.
julia> predictor = GLM.glm(
           GLM.@formula(enroll ~ 0 + SAT + GPA + merit),
           train_df,
           GLM.Bernoulli(),
       )
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Distributions.Bernoulli{Float64}, GLM.LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

enroll ~ 0 + SAT + GPA + merit

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Coef.   Std. Error      z  Pr(>|z|)    Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────
SAT     0.00243882  0.000309312   7.88    <1e-14   0.00183258   0.00304506
GPA    -1.09868     0.123796     -8.87    <1e-18  -1.34132     -0.856045
merit   0.248294    0.0191086    12.99    <1e-37   0.210842     0.285747
──────────────────────────────────────────────────────────────────────────
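For each student, the fitted model predicts the probability of enrollment as the logistic (sigmoid) function applied to a linear combination of the features. As a sanity check, here is a sketch of reproducing the prediction by hand for the first applicant (SAT 1240, GPA 3.22), assuming no merit scholarship; the σ helper and the name p1 are ours, not part of GLM.jl:

julia> σ(x) = 1 / (1 + exp(-x));

julia> β = GLM.coef(predictor);  # coefficients in formula order: SAT, GPA, merit

julia> p1 = σ(β[1] * 1240 + β[2] * 3.22 + β[3] * 0.0);  # p1 ≈ 0.37 given the coefficients above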
Decision model
Now that we have a trained logistic regression model, we want a decision model that chooses the optimal merit scholarship for each student in evaluate_df.
Here's an empty JuMP model to start:
julia> model = Model()
A JuMP Model
├ solver: none
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none
First, we add a new column to evaluate_df, with one JuMP decision variable for each row. It is important that the .merit column name in evaluate_df matches the name in train_df.
julia> evaluate_df.merit = @variable(model, 0 <= x_merit[1:n_students] <= 2.5);
julia> evaluate_df
6000×4 DataFrame
  Row │ StudentID  SAT    GPA      merit
      │ Int64      Int64  Float64  GenericV…
──────┼──────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]
    2 │         2   1206     2.87  x_merit[2]
    3 │         3   1520     3.74  x_merit[3]
    4 │         4   1238     3.11  x_merit[4]
    5 │         5   1142     2.74  x_merit[5]
    6 │         6   1086     2.77  x_merit[6]
    7 │         7   1367     3.28  x_merit[7]
    8 │         8   1034     2.41  x_merit[8]
  ⋮   │     ⋮        ⋮       ⋮          ⋮
 5994 │      5994   1121     2.96  x_merit[5994]
 5995 │      5995   1564     4.1   x_merit[5995]
 5996 │      5996   1332     3.14  x_merit[5996]
 5997 │      5997   1228     2.95  x_merit[5997]
 5998 │      5998   1165     2.81  x_merit[5998]
 5999 │      5999   1400     3.43  x_merit[5999]
 6000 │      6000   1097     2.65  x_merit[6000]
                            5985 rows omitted
Then, we use MathOptAI.add_predictor to embed predictor into the JuMP model. MathOptAI.add_predictor returns a vector of variables, one for each row in evaluate_df, corresponding to the output enroll of our logistic regression.
julia> evaluate_df.enroll, _ = MathOptAI.add_predictor(model, predictor, evaluate_df);
julia> evaluate_df
6000×5 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll
      │ Int64      Int64  Float64  GenericV…      GenericV…
──────┼───────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_Sigmoid[1]
    2 │         2   1206     2.87  x_merit[2]     moai_Sigmoid[1]
    3 │         3   1520     3.74  x_merit[3]     moai_Sigmoid[1]
    4 │         4   1238     3.11  x_merit[4]     moai_Sigmoid[1]
    5 │         5   1142     2.74  x_merit[5]     moai_Sigmoid[1]
    6 │         6   1086     2.77  x_merit[6]     moai_Sigmoid[1]
    7 │         7   1367     3.28  x_merit[7]     moai_Sigmoid[1]
    8 │         8   1034     2.41  x_merit[8]     moai_Sigmoid[1]
  ⋮   │     ⋮        ⋮       ⋮          ⋮                ⋮
 5994 │      5994   1121     2.96  x_merit[5994]  moai_Sigmoid[1]
 5995 │      5995   1564     4.1   x_merit[5995]  moai_Sigmoid[1]
 5996 │      5996   1332     3.14  x_merit[5996]  moai_Sigmoid[1]
 5997 │      5997   1228     2.95  x_merit[5997]  moai_Sigmoid[1]
 5998 │      5998   1165     2.81  x_merit[5998]  moai_Sigmoid[1]
 5999 │      5999   1400     3.43  x_merit[5999]  moai_Sigmoid[1]
 6000 │      6000   1097     2.65  x_merit[6000]  moai_Sigmoid[1]
                                          5985 rows omitted
The .enroll column name in evaluate_df is just a name. It doesn't have to match the name in train_df.
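Conceptually, the embedding adds, for each student, a nonlinear equality stating that the enroll variable equals the sigmoid of the affine score from the regression. For intuition only, here is a hand-written sketch of that algebra for a single student in a throwaway model; MathOptAI's actual formulation differs in its details and introduces the intermediate moai_Sigmoid variables shown above:

julia> let sketch = Model()
           β = GLM.coef(predictor)
           @variable(sketch, 0 <= merit <= 2.5)
           @variable(sketch, 0 <= enroll <= 1)
           # enroll == σ(β_SAT * SAT + β_GPA * GPA + β_merit * merit), for SAT = 1240, GPA = 3.22
           @constraint(sketch, enroll == 1 / (1 + exp(-(β[1] * 1240 + β[2] * 3.22 + β[3] * merit))))
       end;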
The objective of our problem is to maximize the expected number of students who enroll:
julia> @objective(model, Max, sum(evaluate_df.enroll))
moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + [[...5940 terms omitted...]] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1]
Subject to the constraint that we can spend at most 0.2 * n_students on merit scholarships:
julia> @constraint(model, sum(evaluate_df.merit) <= 0.2 * n_students)
x_merit[1] + x_merit[2] + x_merit[3] + x_merit[4] + x_merit[5] + x_merit[6] + x_merit[7] + x_merit[8] + x_merit[9] + x_merit[10] + x_merit[11] + x_merit[12] + x_merit[13] + x_merit[14] + x_merit[15] + x_merit[16] + x_merit[17] + x_merit[18] + x_merit[19] + x_merit[20] + x_merit[21] + x_merit[22] + x_merit[23] + x_merit[24] + x_merit[25] + x_merit[26] + x_merit[27] + x_merit[28] + x_merit[29] + x_merit[30] + [[...5940 terms omitted...]] + x_merit[5971] + x_merit[5972] + x_merit[5973] + x_merit[5974] + x_merit[5975] + x_merit[5976] + x_merit[5977] + x_merit[5978] + x_merit[5979] + x_merit[5980] + x_merit[5981] + x_merit[5982] + x_merit[5983] + x_merit[5984] + x_merit[5985] + x_merit[5986] + x_merit[5987] + x_merit[5988] + x_merit[5989] + x_merit[5990] + x_merit[5991] + x_merit[5992] + x_merit[5993] + x_merit[5994] + x_merit[5995] + x_merit[5996] + x_merit[5997] + x_merit[5998] + x_merit[5999] + x_merit[6000] ≤ 1200
Because logistic regression involves a Sigmoid layer, we need to use a smooth nonlinear optimizer. A common choice is Ipopt. Solve and check that the optimizer found a feasible solution:
julia> set_optimizer(model, Ipopt.Optimizer)
julia> set_silent(model)
julia> optimize!(model)
julia> @assert is_solved_and_feasible(model)
julia> solution_summary(model)
* Solver : Ipopt

* Status
  Result count       : 1
  Termination status : LOCALLY_SOLVED
  Message from the solver:
  "Solve_Succeeded"

* Candidate solution (result #1)
  Primal status        : FEASIBLE_POINT
  Dual status          : FEASIBLE_POINT
  Objective value      : 2.48820e+03
  Dual objective value : -4.91918e+02

* Work counters
  Solve time (sec)   : 6.21398e+00
  Barrier iterations : 142
Let's store the solution in evaluate_df for analysis:
julia> evaluate_df.merit_sol = value.(evaluate_df.merit);
julia> evaluate_df.enroll_sol = value.(evaluate_df.enroll);
julia> evaluate_df
6000×7 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll           merit_sol   ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…        Float64     ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_Sigmoid[1]  6.37389e-7  ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_Sigmoid[1]  1.08151
    3 │         3   1520     3.74  x_merit[3]     moai_Sigmoid[1]  1.03717e-6
    4 │         4   1238     3.11  x_merit[4]     moai_Sigmoid[1]  1.06046e-6
    5 │         5   1142     2.74  x_merit[5]     moai_Sigmoid[1]  1.13489     ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_Sigmoid[1]  1.0759e-6
    7 │         7   1367     3.28  x_merit[7]     moai_Sigmoid[1]  2.33948e-6
    8 │         8   1034     2.41  x_merit[8]     moai_Sigmoid[1]  0.735482
  ⋮   │     ⋮        ⋮       ⋮          ⋮                ⋮             ⋮       ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_Sigmoid[1]  6.26387e-7  ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_Sigmoid[1]  3.58792e-7
 5996 │      5996   1332     3.14  x_merit[5996]  moai_Sigmoid[1]  1.03863
 5997 │      5997   1228     2.95  x_merit[5997]  moai_Sigmoid[1]  1.21941
 5998 │      5998   1165     2.81  x_merit[5998]  moai_Sigmoid[1]  1.21872     ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_Sigmoid[1]  1.33971e-6
 6000 │      6000   1097     2.65  x_merit[6000]  moai_Sigmoid[1]  1.17865
                                                 1 column and 5985 rows omitted
Solution analysis
We expect that just under 2,500 students will enroll:
julia> sum(evaluate_df.enroll_sol)
2488.1951671444017
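One way to sanity-check the embedding is to compare the optimizer's enroll values against GLM.predict evaluated at the optimal merit values; the two should agree up to Ipopt's tolerance. Here is a sketch, where check_df is a temporary frame we build for the comparison:

julia> check_df = DataFrames.DataFrame(;
           SAT = evaluate_df.SAT,
           GPA = evaluate_df.GPA,
           merit = evaluate_df.merit_sol,
       );

julia> maximum(abs.(GLM.predict(predictor, check_df) .- evaluate_df.enroll_sol));  # should be near zero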
We awarded merit scholarships to roughly 1 in 5 students:
julia> count(evaluate_df.merit_sol .> 1e-5)
1152
The average merit scholarship was worth just over $1,000:
julia> 1_000 * Statistics.mean(evaluate_df.merit_sol[evaluate_df.merit_sol.>1e-5])
1041.6622990671967
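These numbers imply that the budget constraint is tight: 1,152 scholarships averaging about 1.04 in model units (thousands of dollars) totals approximately 1,200, which is exactly the budget of 0.2 * n_students. We can verify this directly:

julia> sum(evaluate_df.merit_sol);  # ≈ 1200, the full budget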
This page was generated using Literate.jl.