Top 6 Free ML Data Prep Tools of 2022

Stephen
3 min read · Jan 29, 2022

Data Prep…we all gotta do it

Are you looking for a data cleaning tool to make your next data science project easier? Below is a list of my top 6 picks for data cleaning tools of 2022 that will help you prep your data for that ML model faster and with less of a headache.

DataPrep

With a few lines of code you can explore your data, clean your data and even connect to databases and APIs.

Features:

  1. Claims to be up to 10x faster than Pandas Profiling, and the report looks great
  2. 140+ cleaning and validating data functions
  3. Built on top of Dask to get the performance bump
  4. Automatically detects and highlights insights from your data (like outliers)
  5. Report generated to summarize changes made to the data (great for audits)

Miller

Command-line tool for querying, shaping, and reformatting CSV, TSV and JSON. Need something done quickly? This CLI makes quick work of simple jobs.

Features:

  1. Reduce large datasets with one command
  2. Uses a streaming interface, so it holds only one record in memory for most operations
  3. Basic stats on columns
  4. Quick data querying without Python or notebooks
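
To make the streaming idea in item 2 concrete, here is a minimal plain-Python sketch. It is not Miller and not how Miller is implemented; it only shows what "one record in memory at a time" looks like. The function name `streaming_mean` and the sample columns are made up for illustration.

```python
import csv
import io


def streaming_mean(lines, column):
    """Average a numeric column, reading one CSV record at a time."""
    total = 0.0
    count = 0
    # csv.DictReader yields rows lazily, so only the current record is held.
    for record in csv.DictReader(lines):
        value = (record[column] or "").strip()
        if value:  # skip empty cells
            total += float(value)
            count += 1
    return total / count if count else None


# Hypothetical sample data; a real run would pass an open file object instead.
sample = io.StringIO("city,temp\nOslo,4\nLima,22\nPune,\n")
print(streaming_mean(sample, "temp"))
```

With Miller itself, the comparable one-liner would be along the lines of `mlr --icsv --opprint stats1 -a mean -f temp data.csv`, where `data.csv` is a placeholder file name.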

Data Cleaner

Data Cleaner is a great starter tool if you just want a library to help out with your data cleaning. It is very easy to use, but you might outgrow it and end up moving to a more advanced tool.

Features:

  1. Can drop any row with a missing value
  2. Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis
  3. Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents
  4. Available through CLI or as a Python library
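
Items 2 and 3 are easier to picture with a small example. The sketch below is plain Python using only the standard library; it is not the library's code, and `impute_and_encode` and the sample columns are hypothetical names. It assumes missing values are represented as `None`.

```python
from statistics import median, mode


def impute_and_encode(rows, numeric_cols, categorical_cols):
    """Fill missing values (None) per column, then encode categories as integers."""
    cleaned = [dict(row) for row in rows]  # copy so the input stays untouched

    for col in numeric_cols:
        # Continuous columns: fill gaps with the column median.
        fill = median(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill

    for col in categorical_cols:
        # Categorical columns: fill gaps with the most common value.
        fill = mode(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
        # Map each distinct label to an integer, in sorted label order.
        codes = {label: i for i, label in enumerate(sorted({r[col] for r in cleaned}))}
        for r in cleaned:
            r[col] = codes[r[col]]

    return cleaned


# Hypothetical sample data.
rows = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 50, "color": None},
    {"age": 40, "color": "red"},
]
for row in impute_and_encode(rows, ["age"], ["color"]):
    print(row)
```

As far as I know, the library wraps all of this in a single `autoclean` function that takes a pandas DataFrame; check the project README for current usage.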

BitRook

BitRook is a desktop app that works like a Swiss Army knife for data science. It uses ML to analyze and help clean your data, and it even generates a Python script to automate your cleaning. Instead of leaving you to search for issues, it brings them to your attention and can tell you whether a dataset is predictive. It's worth checking out: the free version is robust, and there is a free trial for the pro version.

Features:

  1. Generates a Python data cleaning script for you
  2. Predictive data detection (Correlation Matrix & Predictive Power Score)
  3. Easily handles large data sets
  4. Column type detection & type standardization
  5. Common data cleaning functions built-in
  6. Unique values, missing values
  7. Outlier handling
  8. PII data detection
  9. Splitting data with a click
  10. Data profiling script generation
  11. A lot more…

Great Expectations

Great Expectations brings data profiling and data validation to your pipeline. Think of it as unit testing for your data.

Features:

  1. Easy to write assertions for your data validation
  2. Integrates with most major tools (Spark, Snowflake, Postgres, AWS, etc.)
  3. A huge library of built-in “expectation” assertion functions
  4. Generates data docs automatically
  5. Great documentation!
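
If "unit testing for your data" sounds abstract, here is a rough plain-Python sketch of the idea. It does not use the Great Expectations API; the helpers `expect_not_null` and `expect_between`, their return shape, and the sample batch are all hypothetical and much simpler than the real thing.

```python
def expect_not_null(rows, column):
    """Pass only if no row has a missing (None) value in `column`."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not bad, "failed_rows": bad}


def expect_between(rows, column, low, high):
    """Pass only if every non-missing value lies in [low, high]."""
    bad = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"success": not bad, "failed_rows": bad}


# Hypothetical batch of records flowing through a pipeline step.
batch = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 212},    # age out of range
    {"user_id": None, "age": 51},  # missing id
]

results = {
    "user_id not null": expect_not_null(batch, "user_id"),
    "age in 0..120": expect_between(batch, "age", 0, 120),
}
for name, result in results.items():
    status = "PASS" if result["success"] else f"FAIL rows {result['failed_rows']}"
    print(name, "->", status)
```

The point of the pattern is that each check is declarative and reports which rows failed, so a pipeline can stop or alert before bad data reaches the model.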

KLib

Klib helps with importing, cleaning, analyzing and preprocessing data, and a lot more, with just a couple of lines of code.

Features:

  1. Missing value plot with 3 lines of code
  2. Helps with data cleaning and data aggregation
  3. Creates amazing correlation plots
  4. Simple numerical data distribution plot
  5. Categorical data plot
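
For a feel of what a missing value plot summarizes, here is a small plain-Python sketch that computes the share of missing values per column and prints a text bar for each. It is not klib code; `missing_summary` and the sample rows are hypothetical, and missing values are assumed to be `None`.

```python
def missing_summary(rows):
    """Return {column: fraction of rows where the value is missing (None)}."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical sample data.
rows = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 3, "b": 2},
    {"a": 4, "b": 5},
]
for col, frac in missing_summary(rows).items():
    # One '#' per 5% of missing values.
    print(f"{col}: {'#' * round(frac * 20)} {frac:.0%}")
```

In klib itself, the plot from item 1 comes from calling `klib.missingval_plot(df)` on a pandas DataFrame.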

I hope you found these helpful and that they save you time in your data prep. Let me know if I missed any that should be shared! There are so many new tools in data science every day, but these are some to watch for 2022.
