Logistic regression with GLM.jl

The purpose of this tutorial is to explain how to embed a logistic regression model from GLM.jl into JuMP.

The data and example in this tutorial come from the paper: David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, Arvind U. Raghunathan (2021) JANOS: An Integrated Predictive and Prescriptive Modeling Framework. INFORMS Journal on Computing 34(2):807-816. https://doi.org/10.1287/ijoc.2020.1023

Required packages

This tutorial uses the following packages.

julia> using JuMP
julia> import CSV
julia> import DataFrames
julia> import Downloads
julia> import GLM
julia> import Ipopt
julia> import MathOptAI
julia> import Statistics
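
If any of these packages are missing, they can be installed from the Julia package manager. A minimal sketch, assuming the packages are available in your registry (Downloads and Statistics ship with Julia and need no installation):

julia> import Pkg
julia> Pkg.add(["JuMP", "CSV", "DataFrames", "GLM", "Ipopt", "MathOptAI"])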

Data

Here is a function to load the data directly from the JANOS repository:

julia> function read_df(filename)
           url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
           data = Downloads.download(url * filename)
           return CSV.read(data, DataFrames.DataFrame)
       end
read_df (generic function with 1 method)

There are two important files. The first, college_student_enroll-s1-1.csv, contains historical admissions data on anonymized students: their SAT score, their GPA, their merit scholarship, and whether they enrolled in the college.

julia> train_df = read_df("college_student_enroll-s1-1.csv")
20000×6 DataFrame
   Row │ Column1  StudentID  SAT    GPA      merit    enroll
       │ Int64    Int64      Int64  Float64  Float64  Int64
───────┼─────────────────────────────────────────────────────
     1 │       1          1   1507     3.72     1.64       0
     2 │       2          2   1532     3.93     0.52       0
     3 │       3          3   1487     3.77     1.67       0
     4 │       4          4   1259     3.05     1.21       1
     5 │       5          5   1354     3.39     1.65       1
     6 │       6          6   1334     3.22     0.0        0
     7 │       7          7   1125     2.73     1.68       1
     8 │       8          8   1180     2.82     0.0        1
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮       ⋮
 19994 │   19994      19994   1185     3.09     1.16       1
 19995 │   19995      19995   1471     3.7      1.05       0
 19996 │   19996      19996   1139     3.03     1.21       1
 19997 │   19997      19997   1371     3.39     1.26       0
 19998 │   19998      19998   1424     3.72     0.85       0
 19999 │   19999      19999   1170     3.01     0.73       1
 20000 │   20000      20000   1389     3.57     0.55       0
                                           19985 rows omitted
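
As a quick sanity check on the training data, we can compute the historical enrollment rate. A small sketch (output suppressed, since the exact value depends on the downloaded data):

julia> enroll_rate = Statistics.mean(train_df.enroll);  # fraction of past students who enrolled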

The second, college_applications6000.csv, contains the SAT and GPA data of students who are currently applying:

julia> evaluate_df = read_df("college_applications6000.csv")
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted

There are 6,000 prospective students:

julia> n_students = size(evaluate_df, 1)
6000

Prediction model

The first step is to train a logistic regression model to predict the Boolean enroll column based on the SAT, GPA, and merit columns.

julia> predictor = GLM.glm(
           GLM.@formula(enroll ~ 0 + SAT + GPA + merit),
           train_df,
           GLM.Bernoulli(),
       )
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Distributions.Bernoulli{Float64}, GLM.LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

enroll ~ 0 + SAT + GPA + merit

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Coef.   Std. Error      z  Pr(>|z|)    Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────
SAT     0.00243882  0.000309312   7.88    <1e-14   0.00183258   0.00304506
GPA    -1.09868     0.123796     -8.87    <1e-18  -1.34132     -0.856045
merit   0.248294    0.0191086    12.99    <1e-37   0.210842     0.285747
──────────────────────────────────────────────────────────────────────────
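
Before embedding predictor in an optimization model, it can be worth checking its in-sample fit. A small sketch using GLM.predict (outputs suppressed, since the exact numbers depend on the fitted model):

julia> p = GLM.predict(predictor, train_df);  # predicted enrollment probabilities
julia> accuracy = Statistics.mean((p .> 0.5) .== (train_df.enroll .== 1));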

Decision model

Now that we have a trained logistic regression model, we want a decision model that chooses the optimal merit scholarship for each student in evaluate_df:

julia> evaluate_df
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted

Here's an empty JuMP model to start:

julia> model = Model()
A JuMP Model
├ solver: none
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none

First, we add a new column to evaluate_df, with one JuMP decision variable for each row. It is important that the .merit column name in evaluate_df matches the column name in train_df.

julia> evaluate_df.merit = @variable(model, 0 <= x_merit[1:n_students] <= 2.5);
julia> evaluate_df
6000×4 DataFrame
  Row │ StudentID  SAT    GPA      merit
      │ Int64      Int64  Float64  GenericV…
──────┼──────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]
    2 │         2   1206     2.87  x_merit[2]
    3 │         3   1520     3.74  x_merit[3]
    4 │         4   1238     3.11  x_merit[4]
    5 │         5   1142     2.74  x_merit[5]
    6 │         6   1086     2.77  x_merit[6]
    7 │         7   1367     3.28  x_merit[7]
    8 │         8   1034     2.41  x_merit[8]
  ⋮   │     ⋮        ⋮       ⋮          ⋮
 5994 │      5994   1121     2.96  x_merit[5994]
 5995 │      5995   1564     4.1   x_merit[5995]
 5996 │      5996   1332     3.14  x_merit[5996]
 5997 │      5997   1228     2.95  x_merit[5997]
 5998 │      5998   1165     2.81  x_merit[5998]
 5999 │      5999   1400     3.43  x_merit[5999]
 6000 │      6000   1097     2.65  x_merit[6000]
                            5985 rows omitted

Then, we use MathOptAI.add_predictor to embed predictor into the JuMP model. MathOptAI.add_predictor returns a vector of variables, one for each row in evaluate_df, corresponding to the output enroll of our logistic regression. (It also returns a formulation object, which we ignore here.)

julia> evaluate_df.enroll, _ = MathOptAI.add_predictor(model, predictor, evaluate_df);
julia> evaluate_df
6000×5 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll
      │ Int64      Int64  Float64  GenericV…      GenericV…
──────┼──────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_Sigmoid[1]
    2 │         2   1206     2.87  x_merit[2]     moai_Sigmoid[1]
    3 │         3   1520     3.74  x_merit[3]     moai_Sigmoid[1]
    4 │         4   1238     3.11  x_merit[4]     moai_Sigmoid[1]
    5 │         5   1142     2.74  x_merit[5]     moai_Sigmoid[1]
    6 │         6   1086     2.77  x_merit[6]     moai_Sigmoid[1]
    7 │         7   1367     3.28  x_merit[7]     moai_Sigmoid[1]
    8 │         8   1034     2.41  x_merit[8]     moai_Sigmoid[1]
  ⋮   │     ⋮        ⋮       ⋮          ⋮                ⋮
 5994 │      5994   1121     2.96  x_merit[5994]  moai_Sigmoid[1]
 5995 │      5995   1564     4.1   x_merit[5995]  moai_Sigmoid[1]
 5996 │      5996   1332     3.14  x_merit[5996]  moai_Sigmoid[1]
 5997 │      5997   1228     2.95  x_merit[5997]  moai_Sigmoid[1]
 5998 │      5998   1165     2.81  x_merit[5998]  moai_Sigmoid[1]
 5999 │      5999   1400     3.43  x_merit[5999]  moai_Sigmoid[1]
 6000 │      6000   1097     2.65  x_merit[6000]  moai_Sigmoid[1]
                                            5985 rows omitted

The .enroll column name in evaluate_df is just a name. It doesn't have to match the name in train_df.
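
Under the hood, each enroll variable y is tied to its inputs by the sigmoid constraint y = 1 / (1 + exp(-(β₁ * SAT + β₂ * GPA + β₃ * merit))), where β are the fitted coefficients. As a sketch, here is the equivalent manual formulation for a single hypothetical student; MathOptAI adds these constraints for us, and the names below are purely illustrative:

julia> β = GLM.coef(predictor);  # [β_SAT, β_GPA, β_merit], in formula order
julia> manual = Model();
julia> @variable(manual, 0 <= merit <= 2.5);  # decision: scholarship for this student
julia> @variable(manual, enroll);  # output: predicted enrollment probability
julia> sat, gpa = 1240, 3.22;  # data for an illustrative student
julia> @constraint(manual, enroll == 1 / (1 + exp(-(β[1] * sat + β[2] * gpa + β[3] * merit))));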

The objective of our problem is to maximize the expected number of students who enroll:

julia> @objective(model, Max, sum(evaluate_df.enroll))
moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + [[...5940 terms omitted...]] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1] + moai_Sigmoid[1]

Subject to the constraint that we can spend at most 0.2 * n_students (that is, 1,200) in total on merit scholarships:

julia> @constraint(model, sum(evaluate_df.merit) <= 0.2 * n_students)
x_merit[1] + x_merit[2] + x_merit[3] + x_merit[4] + x_merit[5] + x_merit[6] + x_merit[7] + x_merit[8] + x_merit[9] + x_merit[10] + x_merit[11] + x_merit[12] + x_merit[13] + x_merit[14] + x_merit[15] + x_merit[16] + x_merit[17] + x_merit[18] + x_merit[19] + x_merit[20] + x_merit[21] + x_merit[22] + x_merit[23] + x_merit[24] + x_merit[25] + x_merit[26] + x_merit[27] + x_merit[28] + x_merit[29] + x_merit[30] + [[...5940 terms omitted...]] + x_merit[5971] + x_merit[5972] + x_merit[5973] + x_merit[5974] + x_merit[5975] + x_merit[5976] + x_merit[5977] + x_merit[5978] + x_merit[5979] + x_merit[5980] + x_merit[5981] + x_merit[5982] + x_merit[5983] + x_merit[5984] + x_merit[5985] + x_merit[5986] + x_merit[5987] + x_merit[5988] + x_merit[5989] + x_merit[5990] + x_merit[5991] + x_merit[5992] + x_merit[5993] + x_merit[5994] + x_merit[5995] + x_merit[5996] + x_merit[5997] + x_merit[5998] + x_merit[5999] + x_merit[6000] ≤ 1200

Because logistic regression involves a Sigmoid layer, we need to use a smooth nonlinear optimizer. A common choice is Ipopt. Solve and check that the optimizer found a feasible solution:

julia> set_optimizer(model, Ipopt.Optimizer)
julia> set_silent(model)
julia> optimize!(model)
julia> @assert is_solved_and_feasible(model)
julia> solution_summary(model)
* Solver : Ipopt

* Status
  Result count       : 1
  Termination status : LOCALLY_SOLVED
  Message from the solver:
  "Solve_Succeeded"

* Candidate solution (result #1)
  Primal status        : FEASIBLE_POINT
  Dual status          : FEASIBLE_POINT
  Objective value      : 2.48820e+03
  Dual objective value : -4.91918e+02

* Work counters
  Solve time (sec)   : 6.21398e+00
  Barrier iterations : 142

Let's store the solution in evaluate_df for analysis:

julia> evaluate_df.merit_sol = value.(evaluate_df.merit);
julia> evaluate_df.enroll_sol = value.(evaluate_df.enroll);
julia> evaluate_df
6000×7 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll           merit_sol   ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…        Float64     ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_Sigmoid[1]  6.37389e-7  ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_Sigmoid[1]  1.08151
    3 │         3   1520     3.74  x_merit[3]     moai_Sigmoid[1]  1.03717e-6
    4 │         4   1238     3.11  x_merit[4]     moai_Sigmoid[1]  1.06046e-6
    5 │         5   1142     2.74  x_merit[5]     moai_Sigmoid[1]  1.13489     ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_Sigmoid[1]  1.0759e-6
    7 │         7   1367     3.28  x_merit[7]     moai_Sigmoid[1]  2.33948e-6
    8 │         8   1034     2.41  x_merit[8]     moai_Sigmoid[1]  0.735482
  ⋮   │     ⋮        ⋮       ⋮          ⋮                ⋮             ⋮       ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_Sigmoid[1]  6.26387e-7  ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_Sigmoid[1]  3.58792e-7
 5996 │      5996   1332     3.14  x_merit[5996]  moai_Sigmoid[1]  1.03863
 5997 │      5997   1228     2.95  x_merit[5997]  moai_Sigmoid[1]  1.21941
 5998 │      5998   1165     2.81  x_merit[5998]  moai_Sigmoid[1]  1.21872     ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_Sigmoid[1]  1.33971e-6
 6000 │      6000   1097     2.65  x_merit[6000]  moai_Sigmoid[1]  1.17865
                                                 1 column and 5985 rows omitted

Solution analysis

We expect that just under 2,500 students will enroll:

julia> sum(evaluate_df.enroll_sol)
2488.1951671444017

We awarded merit scholarships to approximately 1 in 5 students (1,152 of the 6,000):

julia> count(evaluate_df.merit_sol .> 1e-5)
1152

The average merit scholarship was worth just over $1,000:

julia> 1_000 * Statistics.mean(evaluate_df.merit_sol[evaluate_df.merit_sol .> 1e-5])
1041.6622990671967
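
To explore which students the optimizer targets, one might compare students who received a scholarship against the full pool. A small sketch (outputs suppressed, since the exact values depend on the solve):

julia> awarded = evaluate_df[evaluate_df.merit_sol .> 1e-5, :];  # students with a scholarship
julia> mean_sat_awarded = Statistics.mean(awarded.SAT);
julia> mean_sat_all = Statistics.mean(evaluate_df.SAT);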

This page was generated using Literate.jl.