Logistic regression with GLM.jl

The purpose of this tutorial is to explain how to embed a logistic regression model from GLM.jl into JuMP.

The data and example in this tutorial come from the paper: David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, Arvind U. Raghunathan (2021) JANOS: An Integrated Predictive and Prescriptive Modeling Framework. INFORMS Journal on Computing 34(2):807-816. https://doi.org/10.1287/ijoc.2020.1023

Required packages

This tutorial uses the following packages.

julia> using JuMP
julia> import CSV
julia> import DataFrames
julia> import Downloads
julia> import GLM
julia> import Ipopt
julia> import MathOptAI
julia> import Statistics
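
If any of these packages are missing, they can be installed from the Julia package manager. A minimal sketch, assuming the packages are available in your registry (Downloads and Statistics ship with Julia and need no installation):

julia> import Pkg
julia> Pkg.add(["JuMP", "CSV", "DataFrames", "GLM", "Ipopt", "MathOptAI"])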

Data

Here is a function to load the data directly from the JANOS repository:

julia> function read_df(filename)
           url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
           data = Downloads.download(url * filename)
           return CSV.read(data, DataFrames.DataFrame)
       end
read_df (generic function with 1 method)

There are two important files. The first, college_student_enroll-s1-1.csv, contains historical admissions data on anonymized students: their SAT score, their GPA, their merit scholarship, and whether they enrolled in the college.

julia> train_df = read_df("college_student_enroll-s1-1.csv")
20000×6 DataFrame
   Row │ Column1  StudentID  SAT    GPA      merit    enroll
       │ Int64    Int64      Int64  Float64  Float64  Int64
───────┼─────────────────────────────────────────────────────
     1 │       1          1   1507     3.72     1.64       0
     2 │       2          2   1532     3.93     0.52       0
     3 │       3          3   1487     3.77     1.67       0
     4 │       4          4   1259     3.05     1.21       1
     5 │       5          5   1354     3.39     1.65       1
     6 │       6          6   1334     3.22     0.0        0
     7 │       7          7   1125     2.73     1.68       1
     8 │       8          8   1180     2.82     0.0        1
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮       ⋮
 19994 │   19994      19994   1185     3.09     1.16       1
 19995 │   19995      19995   1471     3.7      1.05       0
 19996 │   19996      19996   1139     3.03     1.21       1
 19997 │   19997      19997   1371     3.39     1.26       0
 19998 │   19998      19998   1424     3.72     0.85       0
 19999 │   19999      19999   1170     3.01     0.73       1
 20000 │   20000      20000   1389     3.57     0.55       0
                                           19985 rows omitted
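
As a quick sanity check on the training data, we can compute the historical enrollment rate. A small sketch (output suppressed, since the exact value depends on the downloaded data):

julia> enroll_rate = Statistics.mean(train_df.enroll);  # fraction of past students who enrolled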

The second, college_applications6000.csv, contains the SAT and GPA data of students who are currently applying:

julia> evaluate_df = read_df("college_applications6000.csv")
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted

There are 6,000 prospective students:

julia> n_students = size(evaluate_df, 1)
6000

Prediction model

The first step is to train a logistic regression model to predict the Boolean enroll column based on the SAT, GPA, and merit columns.

julia> predictor = GLM.glm(
           GLM.@formula(enroll ~ 0 + SAT + GPA + merit),
           train_df,
           GLM.Bernoulli(),
       )
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Distributions.Bernoulli{Float64}, GLM.LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

enroll ~ 0 + SAT + GPA + merit

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Coef.   Std. Error      z  Pr(>|z|)    Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────
SAT     0.00243882  0.000309312   7.88    <1e-14   0.00183258   0.00304506
GPA    -1.09868     0.123796     -8.87    <1e-18  -1.34132     -0.856045
merit   0.248294    0.0191086    12.99    <1e-37   0.210842     0.285747
──────────────────────────────────────────────────────────────────────────
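
Before embedding predictor in an optimization model, it can be worth checking its in-sample fit. A small sketch using GLM.predict (outputs suppressed, since the exact numbers depend on the fitted model):

julia> p = GLM.predict(predictor, train_df);  # predicted enrollment probabilities
julia> accuracy = Statistics.mean((p .> 0.5) .== (train_df.enroll .== 1));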

Decision model

Now that we have a trained logistic regression model, we want a decision model that chooses the optimal merit scholarship for each student in evaluate_df:

julia> evaluate_df
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted

Here's an empty JuMP model to start:

julia> model = Model()
A JuMP Model
├ solver: none
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none

First, we add a new column to evaluate_df, with one JuMP decision variable for each row. It is important that the .merit column name in evaluate_df matches the column name in train_df.

julia> evaluate_df.merit = @variable(model, 0 <= x_merit[1:n_students] <= 2.5);
julia> evaluate_df
6000×4 DataFrame
  Row │ StudentID  SAT    GPA      merit
      │ Int64      Int64  Float64  GenericV…
──────┼──────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]
    2 │         2   1206     2.87  x_merit[2]
    3 │         3   1520     3.74  x_merit[3]
    4 │         4   1238     3.11  x_merit[4]
    5 │         5   1142     2.74  x_merit[5]
    6 │         6   1086     2.77  x_merit[6]
    7 │         7   1367     3.28  x_merit[7]
    8 │         8   1034     2.41  x_merit[8]
  ⋮   │     ⋮        ⋮       ⋮          ⋮
 5994 │      5994   1121     2.96  x_merit[5994]
 5995 │      5995   1564     4.1   x_merit[5995]
 5996 │      5996   1332     3.14  x_merit[5996]
 5997 │      5997   1228     2.95  x_merit[5997]
 5998 │      5998   1165     2.81  x_merit[5998]
 5999 │      5999   1400     3.43  x_merit[5999]
 6000 │      6000   1097     2.65  x_merit[6000]
                            5985 rows omitted

Then, we use MathOptAI.add_predictor to embed predictor into the JuMP model. MathOptAI.add_predictor returns a vector of variables, one for each row in evaluate_df, corresponding to the output enroll of our logistic regression. (It also returns a formulation object, which we ignore here.)

julia> evaluate_df.enroll, _ = MathOptAI.add_predictor(model, predictor, evaluate_df);
julia> evaluate_df
6000×5 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll
      │ Int64      Int64  Float64  GenericV…      GenericV…
──────┼──────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_Sigmoid[1]
    2 │         2   1206     2.87  x_merit[2]     moai_Sigmoid[1]
    3 │         3   1520     3.74  x_merit[3]     moai_Sigmoid[1]
    4 │         4   1238     3.11  x_merit[4]     moai_Sigmoid[1]
    5 │         5   1142     2.74  x_merit[5]     moai_Sigmoid[1]
    6 │         6   1086     2.77  x_merit[6]     moai_Sigmoid[1]
    7 │         7   1367     3.28  x_merit[7]     moai_Sigmoid[1]
    8 │         8   1034     2.41  x_merit[8]     moai_Sigmoid[1]
  ⋮   │     ⋮        ⋮       ⋮          ⋮                ⋮
 5994 │      5994   1121     2.96  x_merit[5994]  moai_Sigmoid[1]
 5995 │      5995   1564     4.1   x_merit[5995]  moai_Sigmoid[1]
 5996 │      5996   1332     3.14  x_merit[5996]  moai_Sigmoid[1]
 5997 │      5997   1228     2.95  x_merit[5997]  moai_Sigmoid[1]
 5998 │      5998   1165     2.81  x_merit[5998]  moai_Sigmoid[1]
 5999 │      5999   1400     3.43  x_merit[5999]  moai_Sigmoid[1]
 6000 │      6000   1097     2.65  x_merit[6000]  moai_Sigmoid[1]
                                            5985 rows omitted

The .enroll column name in evaluate_df is just a name. It doesn't have to match the name in train_df.
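
Under the hood, each enroll variable y is tied to its inputs by the sigmoid constraint y = 1 / (1 + exp(-(β₁ * SAT + β₂ * GPA + β₃ * merit))), where β are the fitted coefficients. As a sketch, here is the equivalent manual formulation for a single hypothetical student; MathOptAI adds these constraints for us, and the names below are purely illustrative:

julia> β = GLM.coef(predictor);  # [β_SAT, β_GPA, β_merit], in formula order
julia> manual = Model();
julia> @variable(manual, 0 <= merit <= 2.5);  # decision: scholarship for this student
julia> @variable(manual, enroll);  # output: predicted enrollment probability
julia> sat, gpa = 1240, 3.22;  # data for an illustrative student
julia> @constraint(manual, enroll == 1 / (1 + exp(-(β[1] * sat + β[2] * gpa + β[3] * merit))));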

The objective of our problem is to maximize the expected number of students who enroll:

julia> @objective(model, Max, sum(evaluate_df.enroll))
moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + [[...5940 terms omitted...]] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1]

Subject to the constraint that we can spend at most 0.2 * n_students (that is, 1,200) in total on merit scholarships:

julia> @constraint(model, sum(evaluate_df.merit) <= 0.2 * n_students)
x_merit[1] + x_merit[2] + x_merit[3] + x_merit[4] + x_merit[5] + x_merit[6] + x_merit[7] + x_merit[8] + x_merit[9] + x_merit[10] + x_merit[11] + x_merit[12] + x_merit[13] + x_merit[14] + x_merit[15] + x_merit[16] + x_merit[17] + x_merit[18] + x_merit[19] + x_merit[20] + x_merit[21] + x_merit[22] + x_merit[23] + x_merit[24] + x_merit[25] + x_merit[26] + x_merit[27] + x_merit[28] + x_merit[29] + x_merit[30] + [[...5940 terms omitted...]] + x_merit[5971] + x_merit[5972] + x_merit[5973] + x_merit[5974] + x_merit[5975] + x_merit[5976] + x_merit[5977] + x_merit[5978] + x_merit[5979] + x_merit[5980] + x_merit[5981] + x_merit[5982] + x_merit[5983] + x_merit[5984] + x_merit[5985] + x_merit[5986] + x_merit[5987] + x_merit[5988] + x_merit[5989] + x_merit[5990] + x_merit[5991] + x_merit[5992] + x_merit[5993] + x_merit[5994] + x_merit[5995] + x_merit[5996] + x_merit[5997] + x_merit[5998] + x_merit[5999] + x_merit[6000] ≤ 1200

Because logistic regression involves a Sigmoid layer, we need to use a smooth nonlinear optimizer. A common choice is Ipopt. Solve and check that the optimizer found a feasible solution:

julia> set_optimizer(model, Ipopt.Optimizer)
julia> set_silent(model)
julia> optimize!(model)
julia> @assert is_solved_and_feasible(model)
julia> solution_summary(model)
* Solver : Ipopt

* Status
  Result count       : 1
  Termination status : LOCALLY_SOLVED
  Message from the solver:
  "Solve_Succeeded"

* Candidate solution (result #1)
  Primal status        : FEASIBLE_POINT
  Dual status          : FEASIBLE_POINT
  Objective value      : 2.48820e+03
  Dual objective value : -4.91918e+02

* Work counters
  Solve time (sec)   : 6.21398e+00
  Barrier iterations : 142

Let's store the solution in evaluate_df for analysis:

julia> evaluate_df.merit_sol = value.(evaluate_df.merit);
julia> evaluate_df.enroll_sol = value.(evaluate_df.enroll);
julia> evaluate_df
6000×7 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll           merit_sol   ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…        Float64     ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_Sigmoid[1]  6.37389e-7  ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_Sigmoid[1]  1.08151
    3 │         3   1520     3.74  x_merit[3]     moai_Sigmoid[1]  1.03717e-6
    4 │         4   1238     3.11  x_merit[4]     moai_Sigmoid[1]  1.06046e-6
    5 │         5   1142     2.74  x_merit[5]     moai_Sigmoid[1]  1.13489     ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_Sigmoid[1]  1.0759e-6
    7 │         7   1367     3.28  x_merit[7]     moai_Sigmoid[1]  2.33948e-6
    8 │         8   1034     2.41  x_merit[8]     moai_Sigmoid[1]  0.735482
  ⋮   │     ⋮        ⋮       ⋮          ⋮                ⋮             ⋮       ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_Sigmoid[1]  6.26387e-7  ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_Sigmoid[1]  3.58792e-7
 5996 │      5996   1332     3.14  x_merit[5996]  moai_Sigmoid[1]  1.03863
 5997 │      5997   1228     2.95  x_merit[5997]  moai_Sigmoid[1]  1.21941
 5998 │      5998   1165     2.81  x_merit[5998]  moai_Sigmoid[1]  1.21872     ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_Sigmoid[1]  1.33971e-6
 6000 │      6000   1097     2.65  x_merit[6000]  moai_Sigmoid[1]  1.17865
                                                 1 column and 5985 rows omitted

Solution analysis

We expect that just under 2,500 students will enroll:

julia> sum(evaluate_df.enroll_sol)
2488.1951671444017

We awarded merit scholarships to approximately 1 in 5 students (1,152 of the 6,000):

julia> count(evaluate_df.merit_sol .> 1e-5)
1152

The average merit scholarship was worth just over $1,000:

julia> 1_000 * Statistics.mean(evaluate_df.merit_sol[evaluate_df.merit_sol .> 1e-5])
1041.6622990671967
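
To explore which students the optimizer targets, one might compare students who received a scholarship against the full pool. A small sketch (outputs suppressed, since the exact values depend on the solve):

julia> awarded = evaluate_df[evaluate_df.merit_sol .> 1e-5, :];  # students with a scholarship
julia> mean_sat_awarded = Statistics.mean(awarded.SAT);
julia> mean_sat_all = Statistics.mean(evaluate_df.SAT);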

This page was generated using Literate.jl.