Classification problems with DecisionTree.jl

The purpose of this tutorial is to explain how to embed a decision tree model from DecisionTree.jl into JuMP.

The data and example in this tutorial comes from the paper: David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, Arvind U. Raghunathan (2021) JANOS: An Integrated Predictive and Prescriptive Modeling Framework. INFORMS Journal on Computing 34(2):807-816. https://doi.org/10.1287/ijoc.2020.1023

Required packages

This tutorial uses the following packages.

julia> using JuMP
julia> import CSV
julia> import DataFrames
julia> import Downloads
julia> import DecisionTree
julia> import HiGHS
julia> import MathOptAI
julia> import Statistics

Data

Here is a function to load the data directly from the JANOS repository:

julia> function read_df(filename)
           url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
           data = Downloads.download(url * filename)
           return CSV.read(data, DataFrames.DataFrame)
       endread_df (generic function with 1 method)

There are two important files. The first, college_student_enroll-s1-1.csv, contains historical admissions data on anonymized students, their SAT score, their GPA, their merit scholarships, and whether the enrolled in the college.

julia> train_df = read_df("college_student_enroll-s1-1.csv")20000×6 DataFrame
   Row │ Column1  StudentID  SAT    GPA      merit    enroll 
       │ Int64    Int64      Int64  Float64  Float64  Int64  
───────┼─────────────────────────────────────────────────────
     1 │       1          1   1507     3.72     1.64       0
     2 │       2          2   1532     3.93     0.52       0
     3 │       3          3   1487     3.77     1.67       0
     4 │       4          4   1259     3.05     1.21       1
     5 │       5          5   1354     3.39     1.65       1
     6 │       6          6   1334     3.22     0.0        0
     7 │       7          7   1125     2.73     1.68       1
     8 │       8          8   1180     2.82     0.0        1
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮       ⋮
 19994 │   19994      19994   1185     3.09     1.16       1
 19995 │   19995      19995   1471     3.7      1.05       0
 19996 │   19996      19996   1139     3.03     1.21       1
 19997 │   19997      19997   1371     3.39     1.26       0
 19998 │   19998      19998   1424     3.72     0.85       0
 19999 │   19999      19999   1170     3.01     0.73       1
 20000 │   20000      20000   1389     3.57     0.55       0
                                           19985 rows omitted

The second, college_applications6000.csv, contains the SAT and GPA data of students who are currently applying:

julia> evaluate_df = read_df("college_applications6000.csv")6000×3 DataFrame
  Row │ StudentID  SAT    GPA     
      │ Int64      Int64  Float64 
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted

There are 6,000 prospective students:

julia> n_students = size(evaluate_df, 1)6000

Prediction model

The first step is to train a logistic regression model to predict the Boolean enroll column based on the SAT, GPA, and merit columns.

julia> train_features = Matrix(train_df[:, [:SAT, :GPA, :merit]])20000×3 Matrix{Float64}:
 1507.0  3.72  1.64
 1532.0  3.93  0.52
 1487.0  3.77  1.67
 1259.0  3.05  1.21
 1354.0  3.39  1.65
 1334.0  3.22  0.0
 1125.0  2.73  1.68
 1180.0  2.82  0.0
 1098.0  2.91  0.0
 1116.0  2.95  0.0
    ⋮
 1048.0  2.51  0.73
 1157.0  2.79  2.11
 1185.0  3.09  1.16
 1471.0  3.7   1.05
 1139.0  3.03  1.21
 1371.0  3.39  1.26
 1424.0  3.72  0.85
 1170.0  3.01  0.73
 1389.0  3.57  0.55
julia> train_labels = train_df[:, :enroll]20000-element Vector{Int64}:
 0
 0
 0
 1
 1
 0
 1
 1
 1
 1
 ⋮
 1
 1
 1
 0
 1
 0
 0
 1
 0
julia> predictor = DecisionTree.DecisionTreeClassifier(; max_depth = 3)DecisionTreeClassifier
max_depth:                3
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing
julia> DecisionTree.fit!(predictor, train_features, train_labels)DecisionTreeClassifier
max_depth:                3
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [0, 1]
root:                     Decision Tree
Leaves: 8
Depth:  3
julia> DecisionTree.print_tree(predictor)Feature 1 < 1228.0 ?
├─ Feature 2 < 3.125 ?
    ├─ Feature 1 < 1214.0 ?
        ├─ 1 : 7336/7336
        └─ 1 : 334/361
    └─ Feature 3 < 0.255 ?
        ├─ 0 : 161/220
        └─ 1 : 121/121
└─ Feature 1 < 1412.0 ?
    ├─ Feature 3 < 1.005 ?
        ├─ 0 : 3600/4046
        └─ 1 : 1603/2338
    └─ Feature 3 < 1.865 ?
        ├─ 0 : 4530/4530
        └─ 0 : 947/1048

Decision model

Now that we have a trained decision tree, we want a decision model that chooses the optimal merit scholarship for each student in:

julia> evaluate_df6000×3 DataFrame
  Row │ StudentID  SAT    GPA     
      │ Int64      Int64  Float64 
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted

Here's an empty JuMP model to start:

julia> model = Model()A JuMP Model
├ solver: none
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none

First, we add a new column to evaluate_df, with one JuMP decision variable for each row. It is important the .merit column name in evaluate_df matches the name in train_df.

julia> evaluate_df.merit = @variable(model, 0 <= x_merit[1:n_students] <= 2.5);
julia> evaluate_df6000×4 DataFrame
  Row │ StudentID  SAT    GPA      merit         
      │ Int64      Int64  Float64  GenericV…     
──────┼──────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]
    2 │         2   1206     2.87  x_merit[2]
    3 │         3   1520     3.74  x_merit[3]
    4 │         4   1238     3.11  x_merit[4]
    5 │         5   1142     2.74  x_merit[5]
    6 │         6   1086     2.77  x_merit[6]
    7 │         7   1367     3.28  x_merit[7]
    8 │         8   1034     2.41  x_merit[8]
  ⋮   │     ⋮        ⋮       ⋮           ⋮
 5994 │      5994   1121     2.96  x_merit[5994]
 5995 │      5995   1564     4.1   x_merit[5995]
 5996 │      5996   1332     3.14  x_merit[5996]
 5997 │      5997   1228     2.95  x_merit[5997]
 5998 │      5998   1165     2.81  x_merit[5998]
 5999 │      5999   1400     3.43  x_merit[5999]
 6000 │      6000   1097     2.65  x_merit[6000]
                                5985 rows omitted

Then, we use MathOptAI.add_predictor to embed model_ml into the JuMP model. MathOptAI.add_predictor returns a vector of variables, one for each row in evaluate_df, corresponding to the output enroll of our logistic regression.

julia> evaluate_features = Matrix(evaluate_df[:, [:SAT, :GPA, :merit]])6000×3 Matrix{JuMP.AffExpr}:
 1240  3.22  x_merit[1]
 1206  2.87  x_merit[2]
 1520  3.74  x_merit[3]
 1238  3.11  x_merit[4]
 1142  2.74  x_merit[5]
 1086  2.77  x_merit[6]
 1367  3.28  x_merit[7]
 1034  2.41  x_merit[8]
 1547  3.82  x_merit[9]
 1453  3.65  x_merit[10]
 ⋮
 1299  3.16  x_merit[5992]
 1400  3.59  x_merit[5993]
 1121  2.96  x_merit[5994]
 1564  4.1   x_merit[5995]
 1332  3.14  x_merit[5996]
 1228  2.95  x_merit[5997]
 1165  2.81  x_merit[5998]
 1400  3.43  x_merit[5999]
 1097  2.65  x_merit[6000]
julia> evaluate_df.enroll = mapreduce(vcat, 1:size(evaluate_features, 1)) do i
           y, _ = MathOptAI.add_predictor(model, predictor, evaluate_features[i, :])
           return y
       end6000-element Vector{JuMP.VariableRef}:
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 ⋮
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
julia> evaluate_df6000×5 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll                       ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…                    ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_BinaryDecisionTree_valu ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_BinaryDecisionTree_valu
    3 │         3   1520     3.74  x_merit[3]     moai_BinaryDecisionTree_valu
    4 │         4   1238     3.11  x_merit[4]     moai_BinaryDecisionTree_valu
    5 │         5   1142     2.74  x_merit[5]     moai_BinaryDecisionTree_valu ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_BinaryDecisionTree_valu
    7 │         7   1367     3.28  x_merit[7]     moai_BinaryDecisionTree_valu
    8 │         8   1034     2.41  x_merit[8]     moai_BinaryDecisionTree_valu
  ⋮   │     ⋮        ⋮       ⋮           ⋮                      ⋮              ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_BinaryDecisionTree_valu ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_BinaryDecisionTree_valu
 5996 │      5996   1332     3.14  x_merit[5996]  moai_BinaryDecisionTree_valu
 5997 │      5997   1228     2.95  x_merit[5997]  moai_BinaryDecisionTree_valu
 5998 │      5998   1165     2.81  x_merit[5998]  moai_BinaryDecisionTree_valu ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_BinaryDecisionTree_valu
 6000 │      6000   1097     2.65  x_merit[6000]  moai_BinaryDecisionTree_valu
                                                  1 column and 5985 rows omitted

The .enroll column name in evaluate_df is just a name. It doesn't have to match the name in train_df.

The objective of our problem is to maximize the expected number of students who enroll:

julia> @objective(model, Max, sum(evaluate_df.enroll))moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + [[...5940 terms omitted...]] + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value

Subject to the constraint that we can spend at most 0.2 * n_students on merit scholarships:

julia> @constraint(model, sum(evaluate_df.merit) <= 0.2 * n_students)x_merit[1] + x_merit[2] + x_merit[3] + x_merit[4] + x_merit[5] + x_merit[6] + x_merit[7] + x_merit[8] + x_merit[9] + x_merit[10] + x_merit[11] + x_merit[12] + x_merit[13] + x_merit[14] + x_merit[15] + x_merit[16] + x_merit[17] + x_merit[18] + x_merit[19] + x_merit[20] + x_merit[21] + x_merit[22] + x_merit[23] + x_merit[24] + x_merit[25] + x_merit[26] + x_merit[27] + x_merit[28] + x_merit[29] + x_merit[30] + [[...5940 terms omitted...]] + x_merit[5971] + x_merit[5972] + x_merit[5973] + x_merit[5974] + x_merit[5975] + x_merit[5976] + x_merit[5977] + x_merit[5978] + x_merit[5979] + x_merit[5980] + x_merit[5981] + x_merit[5982] + x_merit[5983] + x_merit[5984] + x_merit[5985] + x_merit[5986] + x_merit[5987] + x_merit[5988] + x_merit[5989] + x_merit[5990] + x_merit[5991] + x_merit[5992] + x_merit[5993] + x_merit[5994] + x_merit[5995] + x_merit[5996] + x_merit[5997] + x_merit[5998] + x_merit[5999] + x_merit[6000] ≤ 1200

Because logistic regression involves a Sigmoid layer, we need to use a smooth nonlinear optimizer. A common choice is Ipopt. Solve and check the optimizer found a feasible solution:

julia> set_optimizer(model, HiGHS.Optimizer)
julia> set_silent(model)
julia> optimize!(model)
julia> @assert is_solved_and_feasible(model)

Let's store the solution in evaluate_df for analysis:

julia> evaluate_df.merit_sol = value.(evaluate_df.merit);
julia> evaluate_df.enroll_sol = value.(evaluate_df.enroll);
julia> evaluate_df6000×7 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll                       ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…                    ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_BinaryDecisionTree_valu ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_BinaryDecisionTree_valu
    3 │         3   1520     3.74  x_merit[3]     moai_BinaryDecisionTree_valu
    4 │         4   1238     3.11  x_merit[4]     moai_BinaryDecisionTree_valu
    5 │         5   1142     2.74  x_merit[5]     moai_BinaryDecisionTree_valu ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_BinaryDecisionTree_valu
    7 │         7   1367     3.28  x_merit[7]     moai_BinaryDecisionTree_valu
    8 │         8   1034     2.41  x_merit[8]     moai_BinaryDecisionTree_valu
  ⋮   │     ⋮        ⋮       ⋮           ⋮                      ⋮              ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_BinaryDecisionTree_valu ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_BinaryDecisionTree_valu
 5996 │      5996   1332     3.14  x_merit[5996]  moai_BinaryDecisionTree_valu
 5997 │      5997   1228     2.95  x_merit[5997]  moai_BinaryDecisionTree_valu
 5998 │      5998   1165     2.81  x_merit[5998]  moai_BinaryDecisionTree_valu ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_BinaryDecisionTree_valu
 6000 │      6000   1097     2.65  x_merit[6000]  moai_BinaryDecisionTree_valu
                                                 3 columns and 5985 rows omitted

Solution analysis

We expect that just under 2,500 students will enroll:

julia> sum(evaluate_df.enroll_sol)3530.0

We awarded merit scholarships to approximately 1 in 6 students:

julia> count(evaluate_df.merit_sol .> 1e-5)1278

The average merit scholarship was worth just under $1,000:

julia> 1_000 * Statistics.mean(evaluate_df.merit_sol[evaluate_df.merit_sol.>1e-5])938.6854460093895

This page was generated using Literate.jl.