Why You Should Give Us Direct Access to Your System or a Database Dump
We work with data from project partners in two ways:
1. Some partners prefer we do our work on their systems. From the partner’s perspective, this may have significant benefits: The partner retains control of the data, and it is easier to deploy our work at the end of the project.
Partners who choose this approach need to provide us with the computational resources necessary to handle our machine-learning pipeline. For most projects, we can do well with 2-4 cores, 16-32 GB of RAM, and 500 GB of disk space. The more computational resources we get, the faster we can build good models.
We use only free and open-source software, including the following:
Linux command-line tools
Python (numpy, pandas, scipy, scikit-learn at a minimum)
Postgres. We can use other database systems, but it will slow our work.
2. Most partners give us an extract or copy of their internal data. We have strict protocols and security procedures for protecting the privacy and confidentiality of the data given to us.
Many of our partners have worked with universities and have standard procedures for extracting and cleaning data for academic use. While those procedures might work well for one-off research projects, they don’t work well for the types of projects we work on or for our goal of giving you a working system back. Most DSaPP projects aim to build software that works on our partners’ systems, even after we stop working together. For that to happen, we typically need direct access to the partner’s computer system (where we log into the system as any employee would and do our work there) or a database dump.
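For partners who host our work on their own system (the first option above), the stated requirements can be checked before we start. The following is a minimal sketch, not an official DSaPP tool: the function name `check_environment` and the thresholds are assumptions, with the thresholds taken from the low end of the ranges quoted above. It uses only the Python standard library, so it does not check RAM, which has no portable standard-library call.

```python
import importlib.util
import os
import shutil

# Assumed minimums, taken from the low end of the ranges quoted above.
MIN_CORES = 2
MIN_DISK_GB = 500
# Import names of the Python packages listed above
# (scikit-learn is imported as "sklearn").
REQUIRED_PACKAGES = ["numpy", "pandas", "scipy", "sklearn"]


def check_environment(path="."):
    """Report cores, free disk space at `path`, and missing Python packages."""
    cores = os.cpu_count() or 0
    free_gb = shutil.disk_usage(path).free / 1e9
    # find_spec returns None when a top-level package is not installed.
    missing = [
        name for name in REQUIRED_PACKAGES
        if importlib.util.find_spec(name) is None
    ]
    return {
        "cores": cores,
        "cores_ok": cores >= MIN_CORES,
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= MIN_DISK_GB,
        "missing_packages": missing,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Running the script on the machine you plan to give us prints one line per check; an empty `missing_packages` list means all four Python packages are importable.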
Why direct access to your system is good
- Less work for your IT staff: To give us access to your system, your IT staff simply needs to create accounts and access controls as they would for any new employee. Dumping the database or extracting flat files typically requires more manual labor from your side. You can find our technical requirements for working on your system here.
- Easiest, cheapest deployment on your system: When we write code to run on your system, you can be assured that it will work on your system. If you export your data to another format, our code will be written to use that format. You’d either need to write code to export the data repeatedly (which is inefficient because it creates multiple copies of your data) or the code would need to be rewritten or modified to work with the original data source.
- Control over security: When we do the work on our system, we take several steps to ensure the security of your data. If you prefer to keep the data on your system and to control our access to, and monitor our use of, the data, you can give us access to your system, as you do with any of your employees.
Several partners who have given us access to their systems hire us as interns. You can find a sample legal agreement for that here.
Why a database dump is also good
- Less work for your IT staff: Rather than trying to export a database table by table, standard databases offer relatively straightforward ways to dump entire databases at one time. We posted directions for several popular databases here.
- Easier, cheaper deployment on your system: A database dump is as close as we can get to using your system without actually using it. The code we write for your database dump will likely work on your system with minimal changes.
- Use all the information: Database dumps contain useful information that Excel and flat files don’t, such as indexes and constraints. Indexes give us access paths that are already optimized and tell us what information in the database is used most often. Constraints tell us what types of information we can find in each column (e.g., text up to 32 characters long) and how tables relate. For example, DBeaver created the graph below for our police early intervention system using nothing but database constraints. It would take a lot longer to figure out how the tables relate if we got a bunch of flat files.
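To illustrate how constraints reveal table relationships, here is a small sketch that recovers foreign-key links from database metadata alone. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; on Postgres the same information would come from the system catalogs instead. The table and column names (`officers`, `incidents`, and so on) are hypothetical and are not taken from any partner's schema.

```python
import sqlite3

# Hypothetical two-table schema, built in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE officers (
        officer_id INTEGER PRIMARY KEY,
        badge VARCHAR(32) NOT NULL
    );
    CREATE TABLE incidents (
        incident_id INTEGER PRIMARY KEY,
        officer_id INTEGER NOT NULL REFERENCES officers(officer_id),
        occurred_on DATE
    );
""")


def table_relationships(conn):
    """Return (child_table, child_column, parent_table, parent_column) tuples."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    links = []
    for table in tables:
        # Each row is: id, seq, table, from, to, on_update, on_delete, match
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
            links.append((table, fk[3], fk[2], fk[4]))
    return links


relationships = table_relationships(conn)
print(relationships)
# [('incidents', 'officer_id', 'officers', 'officer_id')]
```

No data rows were needed to find the link between the two tables. With flat files, the same relationship would have to be guessed from column names and values.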
Why other forms of data sharing are worse
- More work for your IT staff: Giving us Excel files or flat files may require your IT staff to write separate scripts for each one.
- Lower data quality: In addition to losing the database’s constraints and indexes, file formatting can get messy. For example, our partners have struggled with exporting CSVs that have quotes around the text fields, which can throw the entire file format off.
- Pre-cleaned data are bad: Some partners want to clean the data before giving it to us. That’s bad for a couple of reasons:
- We can’t know what effect cleaning has on the outcome. We prefer to get the rawest data possible so we can test what effect various cleaning strategies have on the results.
- We’re trying to build tools our partners can use. Unless the partner scripts the cleaning, they will have to manually clean the data to use what we built.
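The CSV quoting problem mentioned above is easy to reproduce. The sketch below uses made-up sample data and compares a hand-rolled export, which joins fields with commas, against Python's standard `csv` module, which quotes fields when needed and doubles embedded quotes. It is only an illustration of the failure mode, not a description of any partner's export process.

```python
import csv
import io

# Made-up sample rows; the text field contains both quotes and a comma.
rows = [
    ["id", "note"],
    ["1", 'Officer said "stop", then left'],
]

# Hand-rolled export: joins fields with commas and ignores quoting rules.
naive_line = ",".join(rows[1])
naive_fields = naive_line.split(",")

# csv.writer quotes the field and doubles the embedded quotes,
# so csv.reader can recover the original values.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))

print(len(naive_fields))  # 3: the two-column row now looks like three columns
print(parsed == rows)     # True: the round trip preserves every field
```

A single comma inside a text field is enough to shift every later column in the naive export, which is why a database dump, or at least a properly quoted export, is safer.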