Data Prep…we all gotta do it
Are you looking for a data cleaning tool to make your next data science project easier? Below is a list of my top 6 picks for data cleaning tools of 2022 that will help you prep your data for that ML model faster and with less of a headache.
DataPrep
With a few lines of code you can explore your data, clean your data and even connect to databases and APIs.
Features:
- Up to 10x faster than Pandas Profiling, and the report looks great
- 140+ cleaning and validating data functions
- Built on top of Dask for a performance boost
- Automatically detects and highlights insights from your data (like outliers)
- Generates a report summarizing the changes made to the data (great for audits)
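To give a feel for what "highlighting outliers" can mean in practice, here is a minimal sketch of the common interquartile range (IQR) rule in plain Python. This is not DataPrep's API or its actual detection method, just an illustration; the function name `iqr_outliers` and the sample data are made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample column: one age is clearly suspicious.
ages = [23, 25, 27, 29, 31, 33, 35, 120]
print(iqr_outliers(ages))
```

A profiling tool runs checks like this across every column for you, which is where the time savings come from.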
Miller
Command-line tool for querying, shaping, and reformatting CSV, TSV and JSON. Need something done quickly? This CLI makes quick work of simple jobs.
Features:
- Reduce large datasets with one command
- Uses a streaming interface, so most operations hold only one record in memory at a time
- Basic stats on columns
- Quick data querying without Python or notebooks
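The streaming idea is worth a quick illustration. The sketch below is plain Python, not Miller itself, and only mimics the approach: it reads one CSV record at a time and keeps just running totals, so memory use does not grow with the file size. The function name and the `shape`/`price` fields are hypothetical.

```python
import csv
import io
from collections import defaultdict

def streaming_mean(lines, value_field, group_field):
    """Compute per-group means while holding only one record at a time."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    # DictReader yields records lazily, so the whole file is never loaded.
    for record in csv.DictReader(lines):
        key = record[group_field]
        counts[key] += 1
        sums[key] += float(record[value_field])
    return {key: sums[key] / counts[key] for key in counts}

# Hypothetical in-memory CSV standing in for a large file on disk.
sample = io.StringIO("shape,price\ncircle,2.0\nsquare,4.0\ncircle,6.0\n")
print(streaming_mean(sample, "price", "shape"))
```

With a real file you would pass an open file handle instead of the `StringIO` object.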
Data Cleaner
Data Cleaner is a great beginner tool if you just want a library to start helping with your data cleaning. It is very easy to use, but you may eventually outgrow it and move to a more advanced tool.
Features:
- Can drop any row with a missing value
- Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis
- Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents
- Available through CLI or as a Python library
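If you are curious what the imputation and encoding steps above boil down to, here is a rough sketch in plain Python. It does not use Data Cleaner's own functions; `impute_column` and `encode_column` are hypothetical helpers, and missing values are assumed to be represented as `None`.

```python
import statistics

def impute_column(values):
    """Fill None entries with the median (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.median(present)
    else:
        fill = statistics.mode(present)
    return [fill if v is None else v for v in values]

def encode_column(values):
    """Map each distinct label to an integer code, in sorted label order."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

# Hypothetical columns with one missing value each.
ages = impute_column([30, None, 40, 50])
colors = impute_column(["red", "blue", None, "red"])
print(ages)
print(colors)
print(encode_column(colors))
```

Each column is handled on its own, which matches the column-by-column behavior described in the feature list.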
BitRook
BitRook is a unique desktop app that is more like a data science Swiss Army knife. It uses ML to analyze and help clean your data, and it even generates a Python script to automate your cleaning. Instead of leaving you to search for issues, it raises them to your attention and can tell you whether a dataset is predictive. It is worth checking out: the free version is robust, and there is a free trial for the pro version.
Features:
- Generates a Python data cleaning script for you
- Predictive data detection (Correlation Matrix & Predictive Power Score)
- Easily handles large data sets
- Column type detection & type standardization
- Common data cleaning functions built-in
- Unique value and missing value reporting
- Outlier handling
- PII data detection
- Splitting data with a click
- Data profiling script generation
- A lot more…
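For context on the correlation matrix feature, here is a minimal plain-Python sketch of a Pearson correlation matrix. This is not BitRook's code or generated output, and it does not cover the Predictive Power Score; the column names and values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset: price rises with size and falls with age.
data = {"size": [1, 2, 3, 4], "price": [2, 4, 6, 8], "age": [4, 3, 2, 1]}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near 1 or -1 suggest a column carries a strong linear signal about another, which is one way to judge whether a dataset looks predictive.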
Great Expectations
Great Expectations brings data profiling and data validation to your pipeline. Think unit testing for your data.
Features:
- Easy to write assertions for your data validation
- Integrates with most major tools (Spark, Snowflake, Postgres, AWS, etc.)
- An incredible number of “expectation” assertion functions
- Generates data docs automatically
- Great documentation!
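To make the "unit testing for your data" idea concrete, here is a small sketch of expectation-style checks in plain Python. It deliberately does not use the Great Expectations API; the helper names, the result dictionary layout, and the sample rows are all assumptions made for this example.

```python
def expect_not_null(rows, column):
    """Check that no row has a missing (None) value in the column."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not failed, "failed_rows": failed}

def expect_between(rows, column, low, high):
    """Check that every non-missing value lies within [low, high]."""
    failed = [
        i for i, row in enumerate(rows)
        # Missing values are skipped here; the not-null check covers them.
        if row.get(column) is not None and not low <= row[column] <= high
    ]
    return {"check": f"{column} between {low} and {high}", "success": not failed, "failed_rows": failed}

# Hypothetical batch of records arriving in a pipeline.
rows = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 212}]
results = [
    expect_not_null(rows, "id"),
    expect_not_null(rows, "age"),
    expect_between(rows, "age", 0, 120),
]
for result in results:
    print(result)
```

In a real pipeline you would fail the run or raise an alert whenever a check comes back with `success` set to `False`.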
Klib
Klib helps with importing, cleaning, analyzing, and preprocessing data, and a lot more, in just a couple of lines of code.
Features:
- Missing value plot in 3 lines of code
- Helps with data cleaning and data aggregation
- Creates amazing correlation plots
- Simple numerical data distribution plot
- Categorical data plot
I hope you found these helpful and that they save you time in your data prep. Let me know if I missed any that should be shared! There are so many new tools in data science every day, but these are some to watch for 2022.