Data Commons

Enabling working with data at scale

The Data Commons Project

The Data Commons community is a group of passionate data engineers and data scientists at ThoughtWorks. Our goal is to provide a collection of rich, high-performance libraries that automate data processing tasks at scale. We are currently building these tools to work with the Apache Spark platform.

Active Projects

We are building a set of scalable, high-performance libraries that address a range of data processing concerns, such as data quality assurance, data preparation for machine learning, data anonymization, and data security. The currently active projects are:

prep-buddy - A Scala / Java / Python library for cleansing, transforming and preparing large datasets for ML operations on Apache Spark.
protectr - A Scala / Java / Python library for anonymizing, encrypting and redacting large datasets on Apache Spark.
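
To give a feel for the kind of work these projects target, here is a minimal sketch in plain Python of a cleansing step followed by a masking step. It does not use or reproduce the prep-buddy or protectr APIs; the function names `clean_record` and `anonymize_record`, the sample rows, and the salt value are all hypothetical and chosen only for illustration. It also runs on a small in-memory list rather than on Apache Spark.

```python
import hashlib


def clean_record(record):
    """Trim whitespace from string values and turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned


def anonymize_record(record, sensitive_fields, salt="example-salt"):
    """Replace each sensitive value with a salted SHA-256 hex digest.

    Illustrative only: salted hashing is pseudonymization, and a real
    deployment would need a carefully managed secret and threat model.
    """
    masked = dict(record)
    for field in sensitive_fields:
        value = masked.get(field)
        if value is not None:
            payload = (salt + str(value)).encode("utf-8")
            masked[field] = hashlib.sha256(payload).hexdigest()
    return masked


# Hypothetical sample data.
rows = [
    {"name": "  Alice ", "email": "alice@example.com", "city": ""},
    {"name": "Bob", "email": "bob@example.com", "city": " Pune "},
]

# Cleanse first, then mask the sensitive column.
prepared = [anonymize_record(clean_record(row), ["email"]) for row in rows]
```

On a real cluster the same per-record logic would be applied to a distributed dataset instead of a Python list, which is the scale these libraries are built for.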

Support or Contact

Reach us through our Google Group: data-commons-toolchain@googlegroups.com