Prep-Buddy

A library to clean, transform and prepare data at scale using Apache Spark

A Scala / Java / Python library for cleaning, transforming and executing other preparation tasks for large datasets on Apache Spark.

It is currently maintained by a team of developers from ThoughtWorks.

Our aim is to provide a set of algorithms for cleaning and transforming very large data sets, inspired by predecessors such as Open Refine, Pandas and Scikit-learn packages.