Triage
An open source machine learning toolkit to help data scientists, machine learning developers, and analysts quickly prototype, build, and evaluate end-to-end predictive risk modeling systems for public policy and social good problems.
Triage lets you focus on the problem you’re solving and guides you through design choices you need to make at each step of the machine learning pipeline.
Why we created Triage
We created Triage in response to commonly occurring challenges in the development of machine learning systems for public policy and social good problems. While many tools (sklearn, keras, pytorch, etc.) exist to build ML models, an end-to-end project requires a lot more than just building models.
Building systems with predictive models that will be used in production requires making many design decisions that must match how the system will be used. These choices then get turned into modeling choices and code.
We need to answer questions such as:
- What should be included in the data (cohort selection)?
- What should our rows consist of (unit of analysis)?
- What is our label? What outcome are we predicting or estimating and over what period of time?
- How should we generate and incorporate time and space varying features (spatiotemporal explanatory variables)?
- How do we deal with time in our training, selection, and validation process?
- Which models/methods and associated hyper-parameters should we try?
- What evaluation metrics do we use to compare and select models?
- How do we compare and evaluate models over time?
- How should we interpret and explain the models and their predictions to the user in the loop?
- How do we ensure fairness and equity in our system?
- How do we understand and communicate intervention lists generated by these models?
These questions are critical but complicated and hard to answer a priori. Even once these design choices are made, we still have to turn them into code throughout the course of a project. Triage is designed around these questions and generates a set of data matrices, models, predictions, evaluations, and analyses that makes it easier for data scientists to select the best models to use.
Triage aims to help solve these problems by:
- Guiding users (data scientists, analysts, researchers) through these design choices by highlighting operational use questions that are important.
- Providing interfaces to the different phases of a project, such as an Experiment. Each phase is defined by a configuration (corresponding to a design choice) specific to the needs of the project, and an arrangement of core data science components that work together to produce the output of that phase.
Each of these components requires careful design choices to be made. Triage facilitates this decision-making process for programmers and developers, with a special focus on tackling data science, AI, and ML problems in public policy and social impact.
What components are inside Triage?
Timechop
Generate temporal validation time windows for matrix creation
When is it helpful?
In predictive analytics, temporal validation can be complicated. Timechop takes in high-level time configuration (e.g. lists of train label spans, test data frequencies) and returns all matrix time definitions.
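The kind of output this produces can be illustrated with a minimal sketch under a simplified configuration (a single model-update frequency and a single label span; `chop_splits` is a hypothetical helper for illustration, not Timechop's actual API):

```python
from datetime import date, timedelta

def chop_splits(first_date, last_date, update_days, label_days):
    """Enumerate (train_end, test_start, test_end) boundaries: models are
    retrained every `update_days`, and each test window must leave
    `label_days` of room before `last_date` to observe the label."""
    splits = []
    train_end = first_date + timedelta(days=update_days)
    while train_end + timedelta(days=label_days) <= last_date:
        splits.append((train_end, train_end, train_end + timedelta(days=label_days)))
        train_end += timedelta(days=update_days)
    return splits

# Quarterly retraining with a 90-day label window over one year of data
splits = chop_splits(date(2020, 1, 1), date(2021, 1, 1), update_days=90, label_days=90)
```

Timechop's real configuration is richer (lists of training label timespans, test frequencies, feature start times, and so on), but each resulting matrix is still anchored to boundaries like these.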
Collate
Aggregation SQL Query Builder
When is it helpful?
Collate allows you to vary both the spatial and temporal windows of a query to easily specify and execute statements like “find the number of restaurants in a given zip code that have had food safety violations within the past year.”
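As a rough sketch of what such a query builder does (plain string templating; `aggregation_query` is a hypothetical helper, not Collate's actual interface), the quoted example could be compiled like this:

```python
def aggregation_query(table, entity_col, date_col, agg, col, interval, as_of_date):
    """Build a grouped aggregation restricted to a trailing time window.
    Varying `interval` changes the temporal window; varying `entity_col`
    changes the spatial unit of aggregation."""
    return (
        f"SELECT {entity_col}, {agg}({col}) AS {agg}_{col}_{interval.replace(' ', '_')} "
        f"FROM {table} "
        f"WHERE {date_col} >= DATE '{as_of_date}' - INTERVAL '{interval}' "
        f"AND {date_col} < DATE '{as_of_date}' "
        f"GROUP BY {entity_col}"
    )

# "number of restaurants in a given zip code with violations in the past year"
sql = aggregation_query(
    table="violations", entity_col="zip_code", date_col="violation_date",
    agg="count", col="restaurant_id", interval="1 year", as_of_date="2024-01-01",
)
```

The table and column names above are invented for illustration; Collate generates comparable SQL from declarative aggregate definitions rather than raw strings.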
Architect
Plan, design and build train and test matrices
When is this useful?
The Architect helps you properly organize data into design matrices in order to run classification algorithms.
Catwalk
Training, testing, and evaluating machine learning classifier models
When is it helpful?
Catwalk lets you train classifiers on a large set of design matrices, test and temporally cross-validate them, and generate evaluation metrics for each.
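The train/test/evaluate loop over temporal splits can be sketched as follows (illustrative only: this uses scikit-learn directly on synthetic data, whereas Catwalk operates on the stored design matrices and records results to the database):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)

def make_matrix(n):
    """Stand-in for loading a stored design matrix: 3 features, binary label."""
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

evaluations = []
for split in range(3):  # one model per temporal split
    X_train, y_train = make_matrix(200)  # earlier window
    X_test, y_test = make_matrix(100)    # later window
    model = LogisticRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    evaluations.append({"split": split, "precision": precision_score(y_test, preds)})
```

In Triage the grid of model types, hyperparameters, and metrics is declared in the experiment configuration rather than written as a loop by hand.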
Results Schema
Generate a database schema suitable for storing the results of modeling runs
When is it helpful?
The results schema stores the results of modeling runs in a relational database, with different options for table creation and for passing database credentials.
Audition
Filter and select the best-performing model groups over time
When is it helpful?
Audition introduces a structured, semi-automated way of filtering models based on what you consider important. It formalizes this idea through selection rules that take in the data up to a given point in time, apply some rule to choose a model group, and then evaluate the performance (regret) of the chosen model group in the subsequent time window.
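A selection rule and its regret can be sketched in a few lines (a toy illustration, not Audition's API). Here `scores[g]` holds hypothetical per-time metric values, e.g. precision at top k, for each model group:

```python
# Hypothetical metric values for three model groups over four time windows
scores = {
    "rf":  [0.60, 0.62, 0.58, 0.65],
    "lr":  [0.55, 0.61, 0.63, 0.60],
    "gbm": [0.58, 0.59, 0.64, 0.66],
}

def best_average_so_far(scores, t):
    """Selection rule: pick the group with the best mean metric up to time t."""
    return max(scores, key=lambda g: sum(scores[g][: t + 1]) / (t + 1))

def regret(scores, t):
    """Regret: gap between the best group at time t+1 and the group the
    rule chose using only data up to time t."""
    chosen = best_average_so_far(scores, t)
    next_scores = {g: s[t + 1] for g, s in scores.items()}
    return max(next_scores.values()) - next_scores[chosen]

r = regret(scores, t=2)  # how much did choosing at t=2 cost us at t=3?
```

Audition ships a family of such rules (best current value, best average, lowest variability, and so on) and lets you compare their regret across all time windows before committing to one.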
Bias and Fairness Audit
Audit models for fairness and bias to explore possible tradeoffs with efficiency and/or effectiveness
When is it helpful?
Incorporates Aequitas, an open source bias audit toolkit for machine learning developers, analysts, and policymakers to audit machine learning models for discrimination and bias, and make informed and equitable decisions around developing and deploying predictive risk-assessment tools.
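The core of such an audit can be sketched without the library (a toy example, not Aequitas' API): compute a confusion-matrix metric per protected group and compare it to a reference group.

```python
def false_positive_rate(records):
    """Share of true negatives that the model incorrectly flagged."""
    negatives = [r for r in records if r["label"] == 0]
    fp = sum(1 for r in negatives if r["pred"] == 1)
    return fp / len(negatives)

# Invented predictions over two groups of a protected attribute
data = [
    {"group": "A", "label": 0, "pred": 1}, {"group": "A", "label": 0, "pred": 0},
    {"group": "A", "label": 1, "pred": 1}, {"group": "B", "label": 0, "pred": 0},
    {"group": "B", "label": 0, "pred": 0}, {"group": "B", "label": 0, "pred": 1},
]
fpr = {g: false_positive_rate([r for r in data if r["group"] == g]) for g in ("A", "B")}
disparity = fpr["A"] / fpr["B"]  # ratio relative to reference group B
```

Aequitas computes this kind of disparity for many metrics and group definitions at once, and reports which disparities fall outside a chosen fairness threshold.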
Post-Modeling
Allows the user to dig deeper and explore the models selected by Audition
When is it helpful?
Gives the user a better understanding of and comparison across the models selected by Audition. Produces comparisons using feature importances and the predictions of the models.
Case Studies
Reducing Early School Dropout Rates in El Salvador
Improving outcomes for people living with HIV by increasing retention in care
Improving Workplace Safety through Proactive Inspections in Chile
The Team
Triage was initially created by the Center for Data Science and Public Policy at the University of Chicago and has since moved to Carnegie Mellon University. Our goal is to further the use of AI, ML, and data science in policy research and practice. Our work includes educating current and future policymakers, doing data science projects with government, nonprofit, academic, and foundation partners, and developing new methods and open-source tools that support and extend the use of data science for public policy and social impact in a measurable, fair, and equitable manner.
To contact the team, please email us at rayid @ cmu dot edu