Mining Criminal Records based on HTML Data

Carpe Data

About the Project

The goal of this project is to build an automated tool to predict whether a given personal record includes prior arrest information. There are two objectives. The first is to build a binary classifier that predicts whether a webpage contains a criminal record. The second is to provide information about the arrest: the team aims to extract the arrest date and the arrest code.
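To make the two objectives concrete, here is a minimal sketch using only the Python standard library. It is not the team's method: the keyword rule stands in for a trained binary classifier, and the keyword list, the MM/DD/YYYY date format, the 60-character search window, and the sample page are all assumptions invented for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML page to a single string of visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical keyword list; a trained classifier would replace this rule.
ARREST_KEYWORDS = re.compile(r"\b(arrest(ed)?|booking|booked)\b", re.IGNORECASE)
# Matches dates written as MM/DD/YYYY only (an assumed format).
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def has_arrest_mention(text):
    """Objective 1 stand-in: does the page text mention an arrest?"""
    return ARREST_KEYWORDS.search(text) is not None


def extract_arrest_dates(text, window=60):
    """Objective 2 stand-in: dates within `window` characters after a keyword."""
    dates = []
    for match in ARREST_KEYWORDS.finditer(text):
        nearby = text[match.end(): match.end() + window]
        for d in DATE_PATTERN.finditer(nearby):
            if d.group(0) not in dates:
                dates.append(d.group(0))
    return dates


# Invented sample page, for illustration only.
sample_html = """
<html><body>
  <h1>John Doe</h1>
  <p>Date of birth: 01/02/1980</p>
  <p>Arrested on 03/15/2019 for code 459.</p>
</body></html>
"""
text = html_to_text(sample_html)
print(has_arrest_mention(text))
print(extract_arrest_dates(text))
```

Restricting the date search to text near an arrest keyword keeps unrelated dates, such as a date of birth, out of the result. A rule like this would be at most a baseline to compare a learned classifier against.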

Similar projects are already in production, where text mining algorithms are generally supplemented with manual validation to ensure data quality. Fairness and ethics are a point of concern: the team will need to determine whether any bias occurs (e.g., inappropriate use of demographic information to predict a criminal record).


  • Crystal Shuijing Zhang, Sponsor
  • Joshua Bang, TA
  • Michael Ludkovski, Faculty

About Carpe Data

Providing insurance companies with next-generation data solutions, Carpe Data gathers and refines a range of emerging and alternative data sources that span social media, online content, and everything in between. The result? Insurers gain deeper insight into risks, enhancing all facets of the insurance lifecycle.