Getting Started
With Prep Buddy
Download the prep-buddy jar and add it to your project library.
You can start writing a Spark program with Prep Buddy using the following code:
val customRDD: TransformableRDD = new TransformableRDD(initialDataset)
Functionalities:
The TransformableRDD holds all the functionality that can be applied to an RDD. It takes an RDD object as its constructor parameter. The following functionalities are available on TransformableRDD:
Imputation
It is the process of replacing missing data with substituted values. The following imputation algorithms have been implemented:
- Approx Mean Based Substitution
- Mode Based Substitution
- Naive Bayes Classifier Based Substitution
- Univariate Linear Regression Based Substitution
- Mean Based Substitution
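To make the simplest of the substitutions listed above concrete, here is a minimal sketch of mean-based and mode-based imputation in plain Scala. It does not use Prep Buddy's API: the object name ImputationSketch, the method names, and the use of Option to represent a missing value are all assumptions made for illustration.

```scala
// Hypothetical helper, not part of Prep Buddy.
object ImputationSketch {
  // Mean-based substitution: replace each missing entry (None)
  // with the mean of the observed values.
  def imputeMean(column: Seq[Option[Double]]): Seq[Double] = {
    val observed = column.flatten
    require(observed.nonEmpty, "at least one observed value is needed")
    val mean = observed.sum / observed.size
    column.map(_.getOrElse(mean))
  }

  // Mode-based substitution: replace each missing entry (None)
  // with the most frequent observed value.
  def imputeMode(column: Seq[Option[String]]): Seq[String] = {
    val observed = column.flatten
    require(observed.nonEmpty, "at least one observed value is needed")
    val mode = observed.groupBy(identity).maxBy(_._2.size)._1
    column.map(_.getOrElse(mode))
  }
}
```

For example, ImputationSketch.imputeMean(Seq(Some(10.0), None, Some(20.0))) fills the gap with 15.0.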
Counting the number of occurrences of each value of a field is also supported.
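Counting the occurrences of each value of a field can be sketched in a few lines of plain Scala. The object and method names below are hypothetical and are not taken from Prep Buddy.

```scala
// Hypothetical helper, not part of Prep Buddy.
object ValueCountSketch {
  // Count how many times each distinct value occurs in one field.
  def countValues(field: Seq[String]): Map[String, Int] =
    field.groupBy(identity).map { case (value, hits) => value -> hits.size }
}
```

Applied to the direction field of a call log, this would give one count per direction value such as Missed, Outgoing, or Incoming.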
Clustering
It is the process of finding groups of different values that could be alternative representations of the same item. The following clustering algorithms have been implemented:
- Simple Fingerprint algorithm
- N-Gram Fingerprint algorithm
- Levenshtein distance algorithm
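The three techniques listed above can be illustrated with a short, self-contained Scala sketch. The exact normalization steps Prep Buddy applies are not described here, so the cleaning rules below (lowercasing, keeping only ASCII letters and digits) are assumptions, and all names are hypothetical.

```scala
// Hypothetical helper, not part of Prep Buddy.
object ClusteringSketch {
  // Simple fingerprint: lowercase, strip punctuation, then sort and
  // de-duplicate the whitespace-separated tokens.
  def fingerprint(value: String): String =
    value.trim.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .distinct
      .sorted
      .mkString(" ")

  // N-gram fingerprint: keep only letters and digits, then sort and
  // de-duplicate the character n-grams and concatenate them.
  def ngramFingerprint(value: String, n: Int): String = {
    val cleaned = value.toLowerCase.replaceAll("[^a-z0-9]", "")
    cleaned.sliding(n).toSeq.distinct.sorted.mkString
  }

  // Levenshtein distance: minimum number of single-character insertions,
  // deletions, or substitutions needed to turn `a` into `b`.
  def levenshtein(a: String, b: String): Int = {
    var previous = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val current = new Array[Int](b.length + 1)
      current(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        val insertOrDelete = math.min(current(j - 1) + 1, previous(j) + 1)
        current(j) = math.min(insertOrDelete, previous(j - 1) + cost)
      }
      previous = current
    }
    previous(b.length)
  }

  // Group values that share the same key and keep only groups with
  // more than one member, i.e. candidate clusters.
  def clusterBy(values: Seq[String])(key: String => String): Iterable[Seq[String]] =
    values.groupBy(key).values.filter(_.size > 1)
}
```

With this sketch, "Tom Cruise" and "Cruise, Tom" share the fingerprint "cruise tom" and therefore fall into the same cluster. A distance-based approach would instead group values whose Levenshtein distance is below a chosen threshold.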
Deduplication
It is the process of removing duplicates from a dataset.
Duplicate
It provides only the records that have duplicate entries in the given dataset.
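The difference between removing duplicates and listing only the duplicated records can be seen in this plain Scala sketch. It treats each record as a single string and is not Prep Buddy's implementation. The text does not say whether the library returns every copy of a duplicated record or one representative, so returning one copy is an assumption, as are the names used.

```scala
// Hypothetical helper, not part of Prep Buddy.
object DuplicateSketch {
  // Deduplication: keep the first occurrence of every record.
  def deduplicate(records: Seq[String]): Seq[String] = records.distinct

  // Duplicates: keep one copy of each record that occurs more than once
  // (assumption: one representative per duplicated record).
  def duplicates(records: Seq[String]): Seq[String] = {
    val counts = records.groupBy(identity).map { case (r, hits) => r -> hits.size }
    records.distinct.filter(r => counts(r) > 1)
  }
}
```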
Split Column
It is the process of splitting a field into multiple fields:
- By delimiter
- By length
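Both ways of splitting a field can be sketched in plain Scala. The method names and the choice to pass the lengths as a sequence are assumptions for illustration, not Prep Buddy's signatures.

```scala
// Hypothetical helper, not part of Prep Buddy.
object SplitSketch {
  // Split by delimiter. The delimiter is quoted so it is treated literally,
  // and the limit of -1 keeps trailing empty fields.
  def splitByDelimiter(field: String, delimiter: String): Seq[String] =
    field.split(java.util.regex.Pattern.quote(delimiter), -1).toSeq

  // Split by length: lengths = Seq(5, 6) takes the first 5 characters,
  // then the next 6. Characters beyond the total length are dropped.
  def splitByLengths(field: String, lengths: Seq[Int]): Seq[String] = {
    val starts = lengths.scanLeft(0)(_ + _)
    starts.zip(lengths).map { case (start, len) =>
      field.slice(start, start + len)
    }
  }
}
```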
Merge Column
It is the process of merging two or more fields into one field.
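Merging fields can be sketched as follows in plain Scala. Where the merged field is placed and whether the original fields are kept are not specified in the text. This sketch drops the originals and appends the merged value at the end, which is an assumption, and the names are hypothetical.

```scala
// Hypothetical helper, not part of Prep Buddy.
object MergeSketch {
  // Join the fields at `indices` with `separator`, remove them from the row,
  // and append the merged field as the last column (assumed layout).
  def mergeColumns(row: Seq[String], indices: Seq[Int], separator: String): Seq[String] = {
    val merged = indices.map(i => row(i)).mkString(separator)
    val remaining = row.zipWithIndex.collect {
      case (value, i) if !indices.contains(i) => value
    }
    remaining :+ merged
  }
}
```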
Flag
Marks rows with a symbol for a given condition.
Map By Flag
Applies a map operation to the marked (flagged) rows.
Remove Rows
Removes rows from the dataset for a given condition.
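The three row-level operations above fit together naturally, as this plain Scala sketch suggests. Storing the flag as an extra last column and using an empty string for unflagged rows are assumptions, and none of the names come from Prep Buddy.

```scala
// Hypothetical helper, not part of Prep Buddy.
object FlagSketch {
  type Row = Seq[String]

  // Flag: append `symbol` as a last column when the condition holds,
  // otherwise append an empty marker (assumed representation).
  def flag(rows: Seq[Row], symbol: String)(condition: Row => Boolean): Seq[Row] =
    rows.map(row => row :+ (if (condition(row)) symbol else ""))

  // Map By Flag: apply `f` only to rows whose last column equals `symbol`.
  def mapByFlag(rows: Seq[Row], symbol: String)(f: Row => Row): Seq[Row] =
    rows.map(row => if (row.lastOption.contains(symbol)) f(row) else row)

  // Remove Rows: drop every row for which the condition holds.
  def removeRows(rows: Seq[Row])(condition: Row => Boolean): Seq[Row] =
    rows.filterNot(condition)
}
```

A typical use would be to flag the rows whose direction is Missed, transform only those rows with mapByFlag, and leave the rest untouched.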
Normalization
- Min Max Normalizer
- Z Score Normalizer
- Decimal Scaling Normalizer
Smoothing
- Simple Moving Average
- Weighted Moving Average
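The scaling and moving-average techniques listed above follow standard textbook formulas, sketched here in plain Scala on a single numeric column. This is not Prep Buddy's code: the names are hypothetical, min-max scaling targets the range [0, 1], the z-score uses the population standard deviation, and the weights of the weighted average run from oldest to newest value. All of these choices are assumptions.

```scala
// Hypothetical helper, not part of Prep Buddy.
object NumericSketch {
  // Min-max: rescale to [0, 1] using (x - min) / (max - min).
  def minMax(xs: Seq[Double]): Seq[Double] = {
    val (lo, hi) = (xs.min, xs.max)
    require(hi > lo, "needs at least two distinct values")
    xs.map(x => (x - lo) / (hi - lo))
  }

  // Z-score: (x - mean) / population standard deviation.
  def zScore(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
    require(std > 0, "needs at least two distinct values")
    xs.map(x => (x - mean) / std)
  }

  // Decimal scaling: divide by 10^j, where j is the smallest integer
  // that brings every absolute value below 1.
  def decimalScaling(xs: Seq[Double]): Seq[Double] = {
    val maxAbs = xs.map(x => math.abs(x)).max
    var j = 0
    while (maxAbs / math.pow(10, j) >= 1) j += 1
    xs.map(x => x / math.pow(10, j))
  }

  // Simple moving average: plain mean of each window of `window` values.
  def simpleMovingAverage(xs: Seq[Double], window: Int): Seq[Double] = {
    require(window > 0 && window <= xs.size, "window must fit the data")
    xs.sliding(window).map(w => w.sum / window).toList
  }

  // Weighted moving average: weighted mean of each window, with one weight
  // per position in the window (oldest value first).
  def weightedMovingAverage(xs: Seq[Double], weights: Seq[Double]): Seq[Double] = {
    require(weights.nonEmpty && weights.size <= xs.size, "weights must fit the data")
    val total = weights.sum
    xs.sliding(weights.size).map { w =>
      w.zip(weights).map { case (x, wt) => x * wt }.sum / total
    }.toList
  }
}
```

For instance, NumericSketch.simpleMovingAverage(Seq(11.0, 11.0, 0.0, 24.0), 2) averages each pair of neighbouring values, which smooths out the single zero in the series.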
Examples
Let's take a sample CSV dataset with filename
User,other,direction,duration,timestamp
07681546436,07289049655,Missed,11,Sat Sep 18 01:54:03 +0100 2010
07681546436,07289049655,Missed,11,Sat Sep 18 01:54:03 +0100 2010
07122915122,07220374233,Missed,0,Sun Oct 24 08:13:45 +0100 2010
07166594208,07577423566,Outgoing,24,Thu Jan 27 14:23:39 +0000 2011
07166594208,07577423566,Outgoing,24,Thu Jan 27 14:23:39 +0000 2011
07102745960,07720520621,Incoming,22,Tue Oct 12 14:16:16 +0100 2010
07456622368,07331532487,Missed,24,Sat Sep 18 13:34:09 +0100 2010