Classification problems with DecisionTree.jl
The purpose of this tutorial is to explain how to embed a decision tree model from DecisionTree.jl into JuMP.
The data and example in this tutorial come from the paper: David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, Arvind U. Raghunathan (2021) JANOS: An Integrated Predictive and Prescriptive Modeling Framework. INFORMS Journal on Computing 34(2):807-816. https://doi.org/10.1287/ijoc.2020.1023
Required packages
This tutorial uses the following packages.
julia> using JuMP
julia> import CSV
julia> import DataFrames
julia> import Downloads
julia> import DecisionTree
julia> import HiGHS
julia> import MathOptAI
julia> import Statistics
Data
Here is a function to load the data directly from the JANOS repository:
julia> function read_df(filename)
           url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
           data = Downloads.download(url * filename)
           return CSV.read(data, DataFrames.DataFrame)
       end
read_df (generic function with 1 method)
There are two important files. The first, college_student_enroll-s1-1.csv, contains historical admissions data on anonymized students: their SAT score, their GPA, their merit scholarship, and whether they enrolled in the college.
julia> train_df = read_df("college_student_enroll-s1-1.csv")
20000×6 DataFrame
   Row │ Column1  StudentID  SAT    GPA      merit    enroll
       │ Int64    Int64      Int64  Float64  Float64  Int64
───────┼─────────────────────────────────────────────────────
     1 │       1          1   1507     3.72     1.64       0
     2 │       2          2   1532     3.93     0.52       0
     3 │       3          3   1487     3.77     1.67       0
     4 │       4          4   1259     3.05     1.21       1
     5 │       5          5   1354     3.39     1.65       1
     6 │       6          6   1334     3.22     0.0        0
     7 │       7          7   1125     2.73     1.68       1
     8 │       8          8   1180     2.82     0.0        1
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮       ⋮
 19994 │   19994      19994   1185     3.09     1.16       1
 19995 │   19995      19995   1471     3.7      1.05       0
 19996 │   19996      19996   1139     3.03     1.21       1
 19997 │   19997      19997   1371     3.39     1.26       0
 19998 │   19998      19998   1424     3.72     0.85       0
 19999 │   19999      19999   1170     3.01     0.73       1
 20000 │   20000      20000   1389     3.57     0.55       0
                                           19985 rows omitted
The second, college_applications6000.csv, contains the SAT and GPA data of students who are currently applying:
julia> evaluate_df = read_df("college_applications6000.csv")
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted
There are 6,000 prospective students:
julia> n_students = size(evaluate_df, 1)
6000
Prediction model
The first step is to train a decision tree classifier to predict the Boolean enroll column based on the SAT, GPA, and merit columns.
julia> train_features = Matrix(train_df[:, [:SAT, :GPA, :merit]])
20000×3 Matrix{Float64}:
 1507.0  3.72  1.64
 1532.0  3.93  0.52
 1487.0  3.77  1.67
 1259.0  3.05  1.21
 1354.0  3.39  1.65
 1334.0  3.22  0.0
 1125.0  2.73  1.68
 1180.0  2.82  0.0
 1098.0  2.91  0.0
 1116.0  2.95  0.0
    ⋮
 1048.0  2.51  0.73
 1157.0  2.79  2.11
 1185.0  3.09  1.16
 1471.0  3.7   1.05
 1139.0  3.03  1.21
 1371.0  3.39  1.26
 1424.0  3.72  0.85
 1170.0  3.01  0.73
 1389.0  3.57  0.55
julia> train_labels = train_df[:, :enroll]
20000-element Vector{Int64}:
 0
 0
 0
 1
 1
 0
 1
 1
 1
 1
 ⋮
 1
 1
 1
 0
 1
 0
 0
 1
 0
julia> predictor = DecisionTree.DecisionTreeClassifier(; max_depth = 3)
DecisionTreeClassifier
max_depth:                3
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing
julia> DecisionTree.fit!(predictor, train_features, train_labels)
DecisionTreeClassifier
max_depth:                3
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [0, 1]
root:                     Decision Tree
Leaves: 8
Depth:  3
julia> DecisionTree.print_tree(predictor)
Feature 1 < 1228.0 ?
├─ Feature 2 < 3.125 ?
    ├─ Feature 1 < 1214.0 ?
        ├─ 1 : 7336/7336
        └─ 1 : 334/361
    └─ Feature 3 < 0.255 ?
        ├─ 0 : 161/220
        └─ 1 : 121/121
└─ Feature 1 < 1412.0 ?
    ├─ Feature 3 < 1.005 ?
        ├─ 0 : 3600/4046
        └─ 1 : 1603/2338
    └─ Feature 3 < 1.865 ?
        ├─ 0 : 4530/4530
        └─ 0 : 947/1048
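Before embedding the tree in an optimization model, it can be useful to check how well it fits the training data. Here is a minimal sketch using DecisionTree's predict function; in-sample accuracy is a rough check only, and the output is omitted:

julia> train_accuracy = Statistics.mean(
           # Fraction of training rows where the tree's prediction
           # matches the observed enroll label.
           DecisionTree.predict(predictor, train_features) .== train_labels,
       );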
Decision model
Now that we have a trained decision tree, we want a decision model that chooses the optimal merit scholarship for each student in:
julia> evaluate_df
6000×3 DataFrame
  Row │ StudentID  SAT    GPA
      │ Int64      Int64  Float64
──────┼───────────────────────────
    1 │         1   1240     3.22
    2 │         2   1206     2.87
    3 │         3   1520     3.74
    4 │         4   1238     3.11
    5 │         5   1142     2.74
    6 │         6   1086     2.77
    7 │         7   1367     3.28
    8 │         8   1034     2.41
  ⋮   │     ⋮        ⋮       ⋮
 5994 │      5994   1121     2.96
 5995 │      5995   1564     4.1
 5996 │      5996   1332     3.14
 5997 │      5997   1228     2.95
 5998 │      5998   1165     2.81
 5999 │      5999   1400     3.43
 6000 │      6000   1097     2.65
                 5985 rows omitted
Here's an empty JuMP model to start:
julia> model = Model()
A JuMP Model
├ solver: none
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none
First, we add a new column to evaluate_df, with one JuMP decision variable for each row. It is important that the .merit column name in evaluate_df matches the name in train_df.
julia> evaluate_df.merit = @variable(model, 0 <= x_merit[1:n_students] <= 2.5);
julia> evaluate_df
6000×4 DataFrame
  Row │ StudentID  SAT    GPA      merit
      │ Int64      Int64  Float64  GenericV…
──────┼──────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]
    2 │         2   1206     2.87  x_merit[2]
    3 │         3   1520     3.74  x_merit[3]
    4 │         4   1238     3.11  x_merit[4]
    5 │         5   1142     2.74  x_merit[5]
    6 │         6   1086     2.77  x_merit[6]
    7 │         7   1367     3.28  x_merit[7]
    8 │         8   1034     2.41  x_merit[8]
  ⋮   │     ⋮        ⋮       ⋮          ⋮
 5994 │      5994   1121     2.96  x_merit[5994]
 5995 │      5995   1564     4.1   x_merit[5995]
 5996 │      5996   1332     3.14  x_merit[5996]
 5997 │      5997   1228     2.95  x_merit[5997]
 5998 │      5998   1165     2.81  x_merit[5998]
 5999 │      5999   1400     3.43  x_merit[5999]
 6000 │      6000   1097     2.65  x_merit[6000]
                            5985 rows omitted
Then, we use MathOptAI.add_predictor to embed predictor into the JuMP model. MathOptAI.add_predictor returns a vector of variables, one for each row in evaluate_df, corresponding to the output enroll of our decision tree.
julia> evaluate_features = Matrix(evaluate_df[:, [:SAT, :GPA, :merit]])
6000×3 Matrix{JuMP.AffExpr}:
 1240  3.22  x_merit[1]
 1206  2.87  x_merit[2]
 1520  3.74  x_merit[3]
 1238  3.11  x_merit[4]
 1142  2.74  x_merit[5]
 1086  2.77  x_merit[6]
 1367  3.28  x_merit[7]
 1034  2.41  x_merit[8]
 1547  3.82  x_merit[9]
 1453  3.65  x_merit[10]
    ⋮
 1299  3.16  x_merit[5992]
 1400  3.59  x_merit[5993]
 1121  2.96  x_merit[5994]
 1564  4.1   x_merit[5995]
 1332  3.14  x_merit[5996]
 1228  2.95  x_merit[5997]
 1165  2.81  x_merit[5998]
 1400  3.43  x_merit[5999]
 1097  2.65  x_merit[6000]
julia> evaluate_df.enroll = mapreduce(vcat, 1:size(evaluate_features, 1)) do i
           y, _ = MathOptAI.add_predictor(model, predictor, evaluate_features[i, :])
           return y
       end
6000-element Vector{JuMP.VariableRef}:
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 ⋮
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
 moai_BinaryDecisionTree_value
julia> evaluate_df
6000×5 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll                       ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…                    ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_BinaryDecisionTree_valu ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_BinaryDecisionTree_valu
    3 │         3   1520     3.74  x_merit[3]     moai_BinaryDecisionTree_valu
    4 │         4   1238     3.11  x_merit[4]     moai_BinaryDecisionTree_valu
    5 │         5   1142     2.74  x_merit[5]     moai_BinaryDecisionTree_valu ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_BinaryDecisionTree_valu
    7 │         7   1367     3.28  x_merit[7]     moai_BinaryDecisionTree_valu
    8 │         8   1034     2.41  x_merit[8]     moai_BinaryDecisionTree_valu
  ⋮   │     ⋮        ⋮       ⋮           ⋮                     ⋮               ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_BinaryDecisionTree_valu ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_BinaryDecisionTree_valu
 5996 │      5996   1332     3.14  x_merit[5996]  moai_BinaryDecisionTree_valu
 5997 │      5997   1228     2.95  x_merit[5997]  moai_BinaryDecisionTree_valu
 5998 │      5998   1165     2.81  x_merit[5998]  moai_BinaryDecisionTree_valu ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_BinaryDecisionTree_valu
 6000 │      6000   1097     2.65  x_merit[6000]  moai_BinaryDecisionTree_valu
                                                  1 column and 5985 rows omitted
The .enroll column name in evaluate_df is just a name. It doesn't have to match the name in train_df.
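As an aside, each call to MathOptAI.add_predictor also returns a formulation object describing the variables and constraints it added; in the loop above we discarded it with _. If you want to inspect how the tree is encoded (the moai_BinaryDecisionTree names above suggest a binary-variable encoding of the leaves, although the exact formulation is an implementation detail of MathOptAI), a minimal sketch using a throwaway model so that model itself is left untouched:

julia> scratch = Model();  # a scratch model, so `model` is untouched

julia> @variable(scratch, x[1:3]);  # three inputs: SAT, GPA, merit

julia> _, tree_formulation = MathOptAI.add_predictor(scratch, predictor, x);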
The objective of our problem is to maximize the number of students that the model predicts will enroll:
julia> @objective(model, Max, sum(evaluate_df.enroll))
moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + [[...5940 terms omitted...]] + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value + moai_BinaryDecisionTree_value
Subject to the constraint that we can spend at most 0.2 * n_students on merit scholarships:
julia> @constraint(model, sum(evaluate_df.merit) <= 0.2 * n_students)
x_merit[1] + x_merit[2] + x_merit[3] + x_merit[4] + x_merit[5] + x_merit[6] + x_merit[7] + x_merit[8] + x_merit[9] + x_merit[10] + x_merit[11] + x_merit[12] + x_merit[13] + x_merit[14] + x_merit[15] + x_merit[16] + x_merit[17] + x_merit[18] + x_merit[19] + x_merit[20] + x_merit[21] + x_merit[22] + x_merit[23] + x_merit[24] + x_merit[25] + x_merit[26] + x_merit[27] + x_merit[28] + x_merit[29] + x_merit[30] + [[...5940 terms omitted...]] + x_merit[5971] + x_merit[5972] + x_merit[5973] + x_merit[5974] + x_merit[5975] + x_merit[5976] + x_merit[5977] + x_merit[5978] + x_merit[5979] + x_merit[5980] + x_merit[5981] + x_merit[5982] + x_merit[5983] + x_merit[5984] + x_merit[5985] + x_merit[5986] + x_merit[5987] + x_merit[5988] + x_merit[5989] + x_merit[5990] + x_merit[5991] + x_merit[5992] + x_merit[5993] + x_merit[5994] + x_merit[5995] + x_merit[5996] + x_merit[5997] + x_merit[5998] + x_merit[5999] + x_merit[6000] ≤ 1200
Because the decision tree is formulated using binary variables, the problem is a mixed-integer linear program, so we need a mixed-integer solver. A common choice is HiGHS. Solve and check that the optimizer found a feasible solution:
julia> set_optimizer(model, HiGHS.Optimizer)
julia> set_silent(model)
julia> optimize!(model)
julia> @assert is_solved_and_feasible(model)
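Because the objective is the sum of the enroll outputs, the optimal objective value is the number of students the model predicts will enroll. We can query it directly; it matches the sum computed in the next section:

julia> predicted_enrollment = objective_value(model);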
Let's store the solution in evaluate_df for analysis:
julia> evaluate_df.merit_sol = value.(evaluate_df.merit);
julia> evaluate_df.enroll_sol = value.(evaluate_df.enroll);
julia> evaluate_df
6000×7 DataFrame
  Row │ StudentID  SAT    GPA      merit          enroll                       ⋯
      │ Int64      Int64  Float64  GenericV…      GenericV…                    ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │         1   1240     3.22  x_merit[1]     moai_BinaryDecisionTree_valu ⋯
    2 │         2   1206     2.87  x_merit[2]     moai_BinaryDecisionTree_valu
    3 │         3   1520     3.74  x_merit[3]     moai_BinaryDecisionTree_valu
    4 │         4   1238     3.11  x_merit[4]     moai_BinaryDecisionTree_valu
    5 │         5   1142     2.74  x_merit[5]     moai_BinaryDecisionTree_valu ⋯
    6 │         6   1086     2.77  x_merit[6]     moai_BinaryDecisionTree_valu
    7 │         7   1367     3.28  x_merit[7]     moai_BinaryDecisionTree_valu
    8 │         8   1034     2.41  x_merit[8]     moai_BinaryDecisionTree_valu
  ⋮   │     ⋮        ⋮       ⋮           ⋮                     ⋮               ⋱
 5994 │      5994   1121     2.96  x_merit[5994]  moai_BinaryDecisionTree_valu ⋯
 5995 │      5995   1564     4.1   x_merit[5995]  moai_BinaryDecisionTree_valu
 5996 │      5996   1332     3.14  x_merit[5996]  moai_BinaryDecisionTree_valu
 5997 │      5997   1228     2.95  x_merit[5997]  moai_BinaryDecisionTree_valu
 5998 │      5998   1165     2.81  x_merit[5998]  moai_BinaryDecisionTree_valu ⋯
 5999 │      5999   1400     3.43  x_merit[5999]  moai_BinaryDecisionTree_valu
 6000 │      6000   1097     2.65  x_merit[6000]  moai_BinaryDecisionTree_valu
                                                 3 columns and 5985 rows omitted
Solution analysis
The model predicts that just over 3,500 students will enroll:
julia> sum(evaluate_df.enroll_sol)
3530.0
We awarded merit scholarships to approximately 1 in 5 students:
julia> count(evaluate_df.merit_sol .> 1e-5)
1278
The average merit scholarship was worth just under $1,000:
julia> 1_000 * Statistics.mean(evaluate_df.merit_sol[evaluate_df.merit_sol.>1e-5])
938.6854460093895
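To dig a bit deeper, we could compare the predicted enrollment rate of students who received a scholarship against those who did not. A minimal sketch using only the solution columns stored above (outputs omitted):

julia> awarded = evaluate_df.merit_sol .> 1e-5;

julia> rate_with_merit = Statistics.mean(evaluate_df.enroll_sol[awarded]);

julia> rate_without_merit = Statistics.mean(evaluate_df.enroll_sol[.!awarded]);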
This page was generated using Literate.jl.