Getting Started

With Prep Buddy

Download the prep-buddy JAR and add it to your project's libraries.

To start writing a Spark program with Prep Buddy, wrap your initial dataset in a TransformableRDD:

val customRDD: TransformableRDD = new TransformableRDD(initialDataset)

Functionalities:

TransformableRDD exposes all the operations that can be performed on an RDD. Its constructor takes an RDD object as a parameter. The following functionality is available on TransformableRDD.

Imputation

It is the process of replacing missing data with substituted values. The following imputation algorithms have been implemented:

Approx Mean Based Substitution
Mode Based Substitution
Naive Bayes Classifier Based Substitution
Univariate Linear Regression Based Substitution
Mean Based Substitution
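
As a rough illustration of mean based substitution, here is a minimal sketch using plain Scala collections rather than the prep-buddy API. The object name, the imputeWithMean helper, and the convention that a missing value is an empty string are all assumptions made for this example.

```scala
object MeanImputationSketch {
  // Hypothetical helper, not part of prep-buddy: replaces empty values in
  // one column with the mean of the values that are present in that column.
  def imputeWithMean(rows: Seq[Vector[String]], column: Int): Seq[Vector[String]] = {
    val present = rows.map(_(column).trim).filter(_.nonEmpty).map(_.toDouble)
    if (present.isEmpty) rows // nothing to average, leave the data untouched
    else {
      val mean = present.sum / present.size
      rows.map { row =>
        if (row(column).trim.isEmpty) row.updated(column, mean.toString) else row
      }
    }
  }
}
```

For example, a column holding 10, an empty value, and 20 would have the empty value replaced with 15.0.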

Faceting

Faceting counts the number of occurrences of each value of a field.
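
The idea can be sketched with plain Scala collections. This is an illustration only: the facet helper below is hypothetical and does not show the prep-buddy API.

```scala
object FacetSketch {
  // Hypothetical helper: maps each distinct value of a column to the
  // number of rows in which it occurs.
  def facet(rows: Seq[Vector[String]], column: Int): Map[String, Int] =
    rows.groupBy(_(column)).map { case (value, group) => value -> group.size }
}
```

Applied to a "direction" column containing Missed, Missed, Outgoing, this would give Missed -> 2 and Outgoing -> 1.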

Clustering

Clustering is the process of finding groups of different values that could be alternative representations of the same item. The following clustering algorithms have been implemented:

Simple Fingerprint algorithm
N-Gram Fingerprint algorithm
Levenshtein distance algorithm
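
To give a feel for key-collision clustering, here is a simplified fingerprint sketch in plain Scala. It assumes the commonly described fingerprint steps (trim, lowercase, strip punctuation, sort and deduplicate tokens) and omits details such as accent folding. The names are hypothetical and the prep-buddy implementation may differ.

```scala
object FingerprintSketch {
  // Simplified fingerprint key: values that differ only in case,
  // punctuation, spacing, or token order produce the same key.
  def fingerprint(value: String): String =
    value.trim.toLowerCase
      .replaceAll("[^\\p{Alnum}\\s]", "") // drop punctuation (ASCII only)
      .split("\\s+")
      .filter(_.nonEmpty)
      .distinct
      .sorted
      .mkString(" ")

  // Groups values by key and keeps only groups with more than one
  // distinct spelling, since those are the candidates for merging.
  def cluster(values: Seq[String]): Iterable[Seq[String]] =
    values.groupBy(fingerprint).values.filter(_.distinct.size > 1)
}
```

With this sketch, "Tom Cruise" and "Cruise, Tom" share the key "cruise tom" and end up in the same cluster.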

Deduplicate

Deduplication is the process of removing duplicate records from a dataset.

Duplicate

Returns only the records that have duplicate entries in the given dataset.
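
The difference between Deduplicate and Duplicate can be sketched with plain Scala collections. The helpers below are hypothetical, and returning a single copy of each duplicated record is an assumption of this sketch rather than documented prep-buddy behaviour.

```scala
object DuplicateSketch {
  // Deduplicate: keep the first occurrence of every record.
  def deduplicate(rows: Seq[String]): Seq[String] = rows.distinct

  // Duplicate: keep one copy of each record that occurs more than once
  // (assumption: one copy per duplicated record).
  def duplicates(rows: Seq[String]): Seq[String] = {
    val counts = rows.groupBy(identity).map { case (row, group) => row -> group.size }
    rows.distinct.filter(row => counts(row) > 1)
  }
}
```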

Split Column

Splits a field into multiple fields, in one of two ways:

By delimiter
By length
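
Both ways of splitting can be sketched on a single row with plain Scala. The helper names and signatures are hypothetical, and placing the new fields where the original field was is an assumption of this example.

```scala
import java.util.regex.Pattern

object SplitColumnSketch {
  // Splits one field on a literal delimiter and puts the parts in its place.
  def splitByDelimiter(row: Vector[String], column: Int, delimiter: String): Vector[String] = {
    val parts = row(column).split(Pattern.quote(delimiter), -1).toVector
    row.patch(column, parts, 1)
  }

  // Splits one field into consecutive pieces of the given lengths.
  // Characters beyond the sum of the lengths are dropped in this sketch.
  def splitByLengths(row: Vector[String], column: Int, lengths: Seq[Int]): Vector[String] = {
    val value = row(column)
    val starts = lengths.scanLeft(0)(_ + _) // start offset of each piece
    val parts = starts.zip(lengths).map { case (start, len) =>
      value.slice(start, start + len)
    }
    row.patch(column, parts, 1)
  }
}
```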

Join Column

Joining is the process of merging two or more fields into one field.
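
A minimal sketch of the idea in plain Scala follows. The joinColumns helper is hypothetical, and appending the merged field at the end of the row is an assumption of this example, not a statement about prep-buddy.

```scala
object JoinColumnSketch {
  // Merges the given columns (in the order listed) into one field joined by
  // `separator`, removes the originals, and appends the merged field.
  def joinColumns(row: Vector[String], columns: Seq[Int], separator: String): Vector[String] = {
    val merged = columns.map(row).mkString(separator)
    val remaining = row.zipWithIndex.collect {
      case (value, index) if !columns.contains(index) => value
    }
    remaining :+ merged
  }
}
```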

Flag

Marks rows that satisfy a given condition with a symbol.

Map By Flag

Applies a map function only to the rows marked with a given flag.
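
Flagging and mapping by flag can be sketched together in plain Scala. The helpers are hypothetical. Storing the flag in a new last column, with an empty string for unmarked rows, is an assumption of this sketch.

```scala
object FlagSketch {
  // Appends `symbol` as a new last column when the condition holds,
  // and an empty string otherwise.
  def flag(rows: Seq[Vector[String]], symbol: String)(
      condition: Vector[String] => Boolean): Seq[Vector[String]] =
    rows.map(row => row :+ (if (condition(row)) symbol else ""))

  // Applies `f` only to rows whose flag column holds `symbol`.
  def mapByFlag(rows: Seq[Vector[String]], symbol: String, flagColumn: Int)(
      f: Vector[String] => Vector[String]): Seq[Vector[String]] =
    rows.map(row => if (row(flagColumn) == symbol) f(row) else row)
}
```

For instance, rows whose direction is "Missed" could be flagged with "*", and a later mapByFlag call could set the duration to 0 on just those rows.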

Remove Rows

Removes rows that satisfy a given condition from the dataset.
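
In plain Scala the same idea is a filter with the condition inverted. The removeRows helper below is hypothetical and only illustrates the semantics.

```scala
object RemoveRowsSketch {
  // Drops every row for which the condition is true and keeps the rest.
  def removeRows(rows: Seq[Vector[String]])(
      condition: Vector[String] => Boolean): Seq[Vector[String]] =
    rows.filterNot(condition)
}
```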

Normalization

Min Max Normalizer
Z Score Normalizer
Decimal Scaling Normalizer
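
The three normalizers follow standard formulas, sketched here on a non-empty sequence of numbers in plain Scala. The function names are hypothetical. The use of the population standard deviation for the z-score and the mapping of constant columns to 0.0 are assumptions of this sketch.

```scala
object NormalizationSketch {
  // Min-max: rescales values linearly into the range [0, 1].
  def minMax(values: Seq[Double]): Seq[Double] = {
    val (min, max) = (values.min, values.max)
    if (max == min) values.map(_ => 0.0)
    else values.map(v => (v - min) / (max - min))
  }

  // Z-score: (value - mean) / standard deviation (population form).
  def zScore(values: Seq[Double]): Seq[Double] = {
    val mean = values.sum / values.size
    val variance = values.map(v => (v - mean) * (v - mean)).sum / values.size
    val std = math.sqrt(variance)
    if (std == 0.0) values.map(_ => 0.0) else values.map(v => (v - mean) / std)
  }

  // Decimal scaling: divides by the smallest power of ten that brings
  // every absolute value below 1.
  def decimalScaling(values: Seq[Double]): Seq[Double] = {
    val maxAbs = values.map(v => math.abs(v)).max
    val digits = if (maxAbs < 1) 0 else math.floor(math.log10(maxAbs)).toInt + 1
    val factor = math.pow(10, digits)
    values.map(_ / factor)
  }
}
```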

Smoothing

Simple Moving Average
Weighted Moving Average
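
Both moving averages can be sketched with a sliding window in plain Scala. The function names are hypothetical. This sketch assumes that only full windows produce output and that the weights are listed oldest value first and sum to 1.

```scala
object SmoothingSketch {
  // Simple moving average: the mean of each full window of `window` values.
  def simpleMovingAverage(values: Seq[Double], window: Int): Seq[Double] =
    values.sliding(window)
      .filter(_.size == window) // skip a short trailing window
      .map(w => w.sum / window)
      .toList

  // Weighted moving average: each window position has its own weight.
  def weightedMovingAverage(values: Seq[Double], weights: Seq[Double]): Seq[Double] =
    values.sliding(weights.size)
      .filter(_.size == weights.size)
      .map(w => w.zip(weights).map { case (v, weight) => v * weight }.sum)
      .toList
}
```

With a window of 3, the values 3, 6, 9, 12 smooth to 6 and 9 under the simple moving average.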

Examples

Let's take a sample CSV dataset, calls.csv, with the following attributes:

User,other,direction,duration,timestamp
07681546436,07289049655,Missed,11,Sat Sep 18 01:54:03 +0100 2010
07681546436,07289049655,Missed,11,Sat Sep 18 01:54:03 +0100 2010
07122915122,07220374233,Missed,0,Sun Oct 24 08:13:45 +0100 2010
07166594208,07577423566,Outgoing,24,Thu Jan 27 14:23:39 +0000 2011
07166594208,07577423566,Outgoing,24,Thu Jan 27 14:23:39 +0000 2011
07102745960,07720520621,Incoming,22,Tue Oct 12 14:16:16 +0100 2010
07456622368,07331532487,Missed,24,Sat Sep 18 13:34:09 +0100 2010