By the Texas Criminal Justice Coalition and January Advisors | September 2019
Increased awareness of mass incarceration in the United States has pushed the issue into the national spotlight. However, many of the drivers of incarceration operate at the local level, including an over-reliance on arrests and prosecutions and increasingly lengthy sentences. As locally elected District Attorneys and judges discuss hot-button issues like bail reform or drug diversion, it is more important than ever that officials, advocates, and citizens offer informed policy proposals and make reform decisions grounded in local data. Unfortunately, data availability and quality vary from jurisdiction to jurisdiction. This document outlines some of the key requirements for effective data analysis, from the data fields to prioritize to tactical approaches for cleaning and coding the data.
With funding from Microsoft Cities Team, the Texas Criminal Justice Coalition and January Advisors expanded a project to collect, standardize, and visualize criminal court data across Texas. The project’s goals have ultimately extended beyond creating an accessible data visualization tool for the public, and now aim to make criminal justice systems fairer and more equitable, reduce racial disparities and advance racial equity, and create safe and thriving communities throughout Texas.
Building our first data dashboard showed us that the process is iterative, and the benefits emanating from the project are ongoing. The preparation and launch of the dashboard showed us that it can bring together stakeholders around shared reform goals, while ongoing accountability and improvement efforts have strengthened the local reform ecosystem. Community outreach and data gathering processes have allowed us to gain valuable information and insights not contained in the data itself, and we have harnessed those insights to expand the capabilities and impact of the dashboard.
This document is a summary of the lessons we have learned through this project. We hope that others can use it as a guide to launching similar projects that harness public data aimed at holding policymakers and justice system practitioners accountable.
The Criminal Justice Data Dashboard Project in Texas
The Texas Criminal Justice Coalition (TCJC) is a nonprofit, non-partisan research and advocacy organization that seeks to reduce mass incarceration in Texas through both state-level and county-based efforts. TCJC is located in Austin and Houston (Harris County), and next plans to formally expand its staff to Dallas. Harris County is the state’s largest driver of people into jail and prison, and it is currently TCJC’s primary location of county-based advocacy; its local criminal justice system has faced federal litigation over its unconstitutional bail system, as well as complaints regarding over-policing and over-prosecution.
In efforts to ensure that TCJC’s reform recommendations are data-driven, the organization continually collects and analyzes available datasets. Over the past few years, TCJC has begun considering how to visually represent large datasets in interactive, user-friendly ways that empower TCJC and community advocates to identify reform needs and effect change.
TCJC’s “dashboard project” began in earnest in 2015 with a compilation of daily jail booking reports from Harris County. TCJC ultimately discovered a treasure trove of Harris County court data available through the county and sought to establish new ways of displaying this data to community members and advocates. TCJC partnered with January Advisors, a data science consulting firm, to create a beta version of the Harris County Dashboard, which launched in 2017 and contained information from over 800,000 criminal case files dating back to 2010. It let users search criminal cases by a number of criteria – Filing Date, Offense Category, Race, and Arresting Agency – and then plotted defendants’ addresses (with specific identifying information removed) so that larger trends and outcome disparities were easy to see. For instance, users could see case outcomes by race, or average bail amounts by offense.
Following a funding award by Microsoft Cities Team in 2018, TCJC and January Advisors set out to improve the Harris County Dashboard, as well as prepare to replicate the dashboard in Dallas County. Harris County Commissioner Rodney Ellis invited TCJC and January Advisors to roll out the second iteration of the Harris County Dashboard in September 2018 – an event well-attended by community members, academics, and elected officials alike. The beta version of the dashboard was enhanced with new capabilities, giving users the ability to see detailed breakdowns of bail amounts, sentencing, charging decisions, and individual judges’ performance. It also lets users layer race- and income-based Census data over the map of defendants, showing the degree to which arrests are skewed toward low-income neighborhoods and communities of color.
Following the Harris County Dashboard rollout, January Advisors and TCJC co-hosted a webinar that walked viewers through the dashboard’s functionality and reflected on the impact of Harris County District Attorney Kim Ogg’s marijuana diversion policy.
As the Harris County Dashboard was undergoing its second iteration, TCJC and January Advisors were simultaneously laying the groundwork for the Dallas County Dashboard. The acquisition of data proved more difficult in Dallas than in Harris County. Despite multiple in-person visits to the Dallas court clerk’s office, the combination of persistence and charm that led to the successful acquisition of Harris County data initially failed to yield any of the data necessary to build a dashboard for Dallas County.
However, TCJC was eventually able to acquire the misdemeanor case data from a local community partner; over the previous year, TCJC had been building a presence in Dallas with the goal of ultimately establishing staff there, teaming up with local partners that are similarly interested in creating data transparency. For felony cases, TCJC and January Advisors went back to the drawing board, devising a multi-step “scraping” process that allowed for usable data dating back to 2017. After combining and cleaning both Dallas County datasets, the dashboard came together.
Equipped with new features to track changes in bail during the pendency of a criminal case, the Dallas County Dashboard launched in April 2019 with a presentation to investigative reporters at The Dallas Morning News; TCJC also authored a blog post demonstrating how the dashboard can inform discussions surrounding new policies announced by Dallas County District Attorney John Creuzot, while also giving reporters the ability to track implementation of policy changes going forward.
The Stakeholder and User Communities
The collection and analysis of criminal court data is critical to understanding the justice system and how policies affect real-world outcomes. It is also necessary for holding elected officials and justice system practitioners accountable.
For instance, after launching the Harris County Dashboard, January Advisors used the data to track the impact of the Harris County District Attorney’s marijuana diversion program. The data revealed a sharp decline in cases filed, but also that there is more work to be done. That finding served as the jumping-off point for an editorial about the collateral consequences of a marijuana conviction, and how legalization would open the door for over 120,000 potential expunctions.
This is just one example of the many ways a dashboard can be used to inform policy reform discussions. Community activists and policy advocates who are seeking data to undergird their criminal justice reform recommendations – such as decriminalizing certain misdemeanor offenses, reducing disparities in sentence length and bail by race, and improving case outcomes across type of attorney (appointed, hired, etc.) – will benefit from the dashboard’s capabilities. But the dashboard’s utility is not limited to such actors; other user communities for the dashboards include:
- Voters who want to make informed decisions before elections.
- Academic researchers who are asking specific questions about criminal court outcomes and want to use the data in their own research.
- Policymakers who are looking at criminal court outcomes to understand the effects of current policies, and to inform the adjustments necessary to see different outcomes.
- Transparency advocates who want to use data as a necessary check on the police, prosecutors, appointed defense attorneys, and judges.
- Individuals impacted by the criminal justice system, such as current or formerly incarcerated individuals and their loved ones.
- Elected officials and budget staff who want to use the data to understand how resources are allocated.
- Students learning about the criminal justice system, whose professors can use these dashboards as real-world examples.
- Defense attorneys who can use the dashboard to document systemic inequities and judicial behavior.
- Members of the media who want to corroborate or refute statements made about the court and justice systems.
Data- and Tech-Related Lessons Learned
Launching the dashboards was an intensive process, and we learned valuable lessons over the course of this project; we have gained additional insights from our preliminary examination of data in Bexar County (San Antonio) and Fort Bend County (Sugar Land). We hope the following information will help others who are seeking to develop a local criminal court dashboard for their jurisdiction, or who are generally interested in making data more available and useful in their community.
- Data is managed at the county level, so it is not standardized across the state. The different types of data availability include:
  - Available as a flat file via FTP (Harris County)
  - Available through web scraping (Dallas County)
  - Available as multiple flat files via web (Bexar County – a potential location for a subsequent dashboard)
  - Available through a Public Information Act request (Fort Bend County – a potential location for a subsequent dashboard)
- Data fidelity is different from county to county. Most notably, some counties include the full defendant address (Harris, Bexar) while others abbreviate the defendant address (Fort Bend). At a minimum, the following fields are needed for a dashboard:
  - Case number
  - Case filing date
  - Case disposition date
  - Defendant name
  - Defendant zip code
  - Defendant race
  - Defendant date of birth
  - Defendant sex
  - Charge at filing
  - Charge at disposition (if different)
  - Bail at filing
  - Bail at disposition (if different)
  - Sentence length
  - Sentence location
  - Court number and/or judge name
  - Attorney name and/or status
  - Case status
- Charge information is non-standard, so charge groupings are customized for each county. Each dashboard groups similar charges into charge categories (e.g., all low-level marijuana charges can be accessed through one category selection, rather than by selecting individual charges from a long list). However, charges are stored as unstructured text fields that need extensive cleaning and a “crosswalk” to group them by charge categories. To build this crosswalk, we developed a process to identify similar charges with either (a) identical text descriptions, or (b) identical charge codes. We also worked with the Center for Science and Law at Baylor College of Medicine to use their charge groupings as a baseline. With the crosswalk in place, we are able to aggregate similar charges (e.g., “POSS MARIJ 0-2 OZ” and “POSS MARIJ 0-2 OZ (HSC)”) to provide a complete picture of activity.
- The data require geocoding. These data do not come with coordinates, and there are hundreds of thousands of records per county. This requires a modest geocoding budget in order to display the results on a map or do any geospatial analysis, such as clusters by Census tract. We used coordinates provided by the U.S. Census Bureau served through the commercial geocoding service Geocodio.
- Data may contain duplicates. Oftentimes, when a defendant has his or her probation revoked, it will appear as a new record using the same case number. In order to represent the data accurately, we scrutinized duplicate case numbers and defendant names to assemble a narrative based on each charge.
- Data collection can require public information requests and associated fees. Even when the data is available in digital format, most counties charge a nominal access fee that ranges from $15 to $100, again requiring a budget.
- Misdemeanor and felony data are often handled by different county agencies. In Fort Bend County, for example, misdemeanor data is handled by the County Clerk, while felony data is managed by the District Clerk. This requires separate requests and fees paid to separate agencies, along with a process for merging the data for analysis and visualization.
- Spelling mistakes are common. Any unstructured text field is likely to contain misspellings. The fields most often affected include charge name, defendant name, street address, arresting agency name, and officer name. This can be problematic when subsetting the data for analysis, and it requires a comprehensive data cleaning and standardization process.
- Projects like this require a strategic data infrastructure. The original Harris and Dallas County dashboards had an ETL process that required manual supervision, and the dashboard would load each county’s data into memory. Although this was sufficient for performance, with the addition of new counties we needed to standardize the process. We migrated the entire data pipeline – from data collecting and scraping, to transformation and loading – to Microsoft Azure. We also moved the public-facing components (dashboard, website) to Azure. Having a central place for the entire application allows us to standardize processes for expansion, troubleshooting, and performance optimization.
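Two of the cleaning steps described above, the charge crosswalk and the handling of repeated case numbers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the dashboards: the crosswalk entries, the charge code `560001`, the field names `case_number` and `disposition_date`, and the rule of dropping parenthetical tags such as “(HSC)” are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical crosswalk tables; real ones would be built separately for each county.
TEXT_CROSSWALK = {"POSS MARIJ 0-2 OZ": "Marijuana possession"}
CODE_CROSSWALK = {"560001": "Marijuana possession"}  # made-up charge code


def normalize_charge(raw):
    """Uppercase, drop parenthetical tags such as '(HSC)', and collapse whitespace."""
    text = re.sub(r"\([^)]*\)", " ", raw.upper())
    return re.sub(r"\s+", " ", text).strip()


def categorize(description, charge_code=None):
    """Match on the cleaned text description first, then fall back to the charge code."""
    key = normalize_charge(description)
    if key in TEXT_CROSSWALK:
        return TEXT_CROSSWALK[key]
    return CODE_CROSSWALK.get(charge_code, "Uncategorized")


def group_case_events(records):
    """Collect rows that share a case number into one date-ordered list, so a
    later row (for example a probation revocation) is read as a new event on
    the same case rather than as a separate case."""
    cases = defaultdict(list)
    for row in records:
        cases[row["case_number"]].append(row)
    for events in cases.values():
        # Assumes ISO-formatted dates (YYYY-MM-DD), which sort correctly as text.
        events.sort(key=lambda r: r["disposition_date"])
    return dict(cases)


print(categorize("POSS MARIJ 0-2 OZ (HSC)"))  # Marijuana possession
```

Charges that come back as “Uncategorized” are the ones that would still need manual review before being added to the crosswalk.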
Building on the launch of the Harris and Dallas County dashboards this year, TCJC and January Advisors will continue our dashboard work, with a Bexar County Dashboard and a statewide dashboard (allowing for comparisons across counties) planned for launch in late 2019 or early 2020. Having seen the value of integrating data availability and analyses into larger reform and accountability movements, we hope these experiences inform and inspire other organizations to develop similar projects.
Any questions or comments about the dashboard can be sent to firstname.lastname@example.org. Your feedback will help us improve as we build the next dashboard!
Government- and Transparency-Related Lessons Learned
To date, TCJC and January Advisors have looked at criminal court disposition data in four Texas counties (Harris, Dallas, Bexar, and Fort Bend). In three of four cases, obtaining the data was not free, and therefore the data should not be considered open. The lessons detailed in this section can be used by officials at every level of government, whether in or outside Texas, to improve the quality of, and access to, criminal court disposition data in their jurisdiction.
- Harris County provides court disposition data in a flat file, updated monthly, accessible through FTP download. The county requires a paid subscription to their FTP site for access. Additionally, the data available here contains major gaps, errors, and omissions.
- Dallas County provides individual lookups of case data, but not datasets. We built a web scraper to obtain the data, but it is also potentially available through at least two public information requests that require payment. Two requests are needed because felony data is managed by the District Clerk, and misdemeanor data is managed by the County Clerk.
- Bexar County provides court disposition data in a flat file, accessible through a public website. This is the only county in our analysis that can meet the minimum criteria of open data. However, the data are arranged into flat files alphabetically sorted by defendant name and updated daily. We do not consider this a user-friendly format, since it requires technical sophistication to detect changes in the dataset.
- Fort Bend County provides court disposition data in a flat file, accessible through a Public Information Act request.
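To illustrate the kind of work needed to detect changes in a flat file that is republished daily without a “last updated” field, the sketch below compares two snapshots. It is a minimal example under assumptions: the column names and sample rows are invented, and it treats the case number as unique per row, which would not hold for files containing repeated case numbers.

```python
import csv
import io


def load_snapshot(csv_text, key="case_number"):
    """Index one day's flat file by case number (assumed unique per row here)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_snapshots(old, new):
    """Return the case numbers added, removed, or changed between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


# Invented sample data standing in for two consecutive daily downloads.
yesterday = """case_number,defendant_name,case_status
1001,DOE JANE,PENDING
1002,ROE RICHARD,PENDING
"""
today = """case_number,defendant_name,case_status
1001,DOE JANE,DISPOSED
1003,POE EDGAR,PENDING
"""

added, removed, changed = diff_snapshots(load_snapshot(yesterday), load_snapshot(today))
print(added, removed, changed)  # ['1003'] ['1002'] ['1001']
```

A “last updated” date on each record would make this comparison unnecessary, since new and changed rows could be selected directly.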
Specific county recommendations:
- Harris County
  - Audit and resolve errors in the data.
  - Remove fees for accessing the data.
  - Include judge names, not just court numbers.
  - Include ethnicity, not just race.
  - Standardize sentencing information to days (currently text field).
- Dallas County
  - Make the dataset accessible without a public request or web scraper.
- Bexar County
  - Make the dataset available based on the last updated date.
- Fort Bend County
  - Publish the full address for each defendant, not just the zip code. Alternatively, provide jittered coordinates.
- Establish a standard method for publishing the data that is accessible and free.
- Establish a standard set of fields for criminal court disposition data and ensure all counties collect and report these fields. These fields can be used for future Public Information Act requests for other counties in Texas.
- Provide a key for standard data fields (what they mean, how they are populated).
- Standardize charge descriptions, both within counties and across the state.
- Group charges into standardized charge categories (e.g., alcohol-related).
- Include a “last updated” date on all records.
- Geocode and jitter coordinates for defendants’ home addresses.
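As an illustration of what jittering coordinates can mean in practice, the sketch below moves each geocoded point a random distance in a random direction, so that maps show neighborhood-level patterns without pinpointing a home. The 150-meter default and the flat-earth conversion from meters to degrees are assumptions chosen for the example, not values drawn from this project, and an appropriate radius would depend on local housing density.

```python
import math
import random

METERS_PER_DEGREE_LAT = 111_320  # approximate length of one degree of latitude


def jitter(lat, lon, max_meters=150.0, rng=random):
    """Return (lat, lon) moved up to max_meters away in a random direction."""
    # The square root spreads points evenly over the disk instead of
    # clustering them near the original location.
    distance = max_meters * math.sqrt(rng.random())
    bearing = rng.uniform(0, 2 * math.pi)
    dlat = distance * math.cos(bearing) / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with latitude, hence the cosine factor.
    dlon = distance * math.sin(bearing) / (
        METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lat + dlat, lon + dlon


# Example: jitter a point in downtown Houston with a seeded generator
# so the result is reproducible.
print(jitter(29.76, -95.37, rng=random.Random(42)))
```

Publishing only the jittered coordinates, rather than the street address, would still allow aggregation by zip code or Census tract for most records.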