The Data Commons Project
The Data Commons community is a group of passionate data engineers and data scientists at ThoughtWorks. Our goal is to provide a collection of rich, high-performance libraries that automate data processing tasks at scale. We are currently building these tools to work with the Apache Spark platform.
Active Projects
We are building a set of scalable, high-performance libraries that address data processing concerns such as data quality assurance, data preparation for machine learning, data anonymization, and data security. The currently active projects are:
- prep-buddy - A Scala / Java / Python library for cleansing, transforming and preparing large datasets for ML operations on Apache Spark.
- protectr - A Scala / Java / Python library for anonymization, encryption and redaction operations for large datasets on Apache Spark.
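To make the anonymization and redaction operations mentioned above more concrete, here is a minimal sketch in plain Python of what such operations can look like on a single record. It does not use protectr or Apache Spark and does not reflect either library's API. The function names (`pseudonymize`, `redact_emails`, `anonymize_record`), the record fields, and the email pattern are all hypothetical and chosen only for illustration.

```python
import hashlib
import re

# Deliberately simple email pattern (an assumption for this sketch,
# not a full RFC 5322 matcher).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with the hex digest of a salted SHA-256 hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def redact_emails(text: str, mask: str = "[REDACTED]") -> str:
    """Replace every substring matching EMAIL_PATTERN with a fixed mask."""
    return EMAIL_PATTERN.sub(mask, text)


def anonymize_record(record: dict, salt: str) -> dict:
    """Pseudonymize the hypothetical 'user_id' field and redact emails in 'comment'."""
    return {
        "user_id": pseudonymize(record["user_id"], salt),
        "comment": redact_emails(record["comment"]),
    }


if __name__ == "__main__":
    sample = {"user_id": "u-1001", "comment": "Reach me at jane.doe@example.com please"}
    print(anonymize_record(sample, salt="demo-salt"))
```

On a large dataset, a function like `anonymize_record` would typically be applied to each row in parallel by the processing engine. How protectr actually exposes these operations is not described here.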
Support or Contact
Reach us through our Google Group: data-commons-toolchain@googlegroups.com