I used to think I was decent at data science. Then I started using Pandas-Profiling, BitRook, Mito, and SweetViz. Now I'm 10x faster at my job.
Data science tools can help you quickly crunch through large datasets, allowing you to answer analytical questions and make data-driven decisions. Without these tools, such tasks would be difficult, if not impossible, to accomplish.
In this article, we take a look at 10 tools that you can use to speed up your data science workflow. Here are some of the best data science tools you can use right now, all of which are either open source or have a free version.
1. Pandas Profiling
Pandas Profiling creates a full-featured exploratory data analysis report and handles even large datasets. Basically, it's an extensive EDA report in three lines of code, and a no-brainer way to get a sense of a dataset in seconds.
Features
- Column Type Detection
- Unique values, missing values
- Quantile statistics
- Descriptive statistics
- Most frequent values
- Correlations
- Missing values matrix and counts
- Duplicate rows
- Text analysis to learn about categories
Installation
pip install pandas-profiling
Usage
With three lines of code, you can turn a DataFrame into an interactive HTML report or even a notebook widget on your data.
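Here's a minimal sketch of that workflow; the file name and report title below are my own placeholders:
# Build an interactive HTML EDA report from a DataFrame
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('data.csv')  # placeholder file
profile = ProfileReport(df, title='EDA Report')
profile.to_file('report.html')  # or profile.to_widgets() inside a notebook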
2. BitRook
BitRook is a unique desktop app that is more like a data science Swiss Army knife. It uses ML to analyze and help clean your data, and it even generates a Python script to automate your cleaning. I used to spend a lot of time copying and pasting code from other data cleaning projects, and this has completely removed that issue. On top of that, it helps you analyze your data: instead of you searching for issues, it raises them to your attention and can tell you in seconds whether a dataset is predictive. It handles large datasets and loads data 10x faster than Excel. Definitely worth checking out; even the free version is amazing.
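To give a sense of what automated cleaning code looks like, here's a hypothetical pandas snippet of the kind such a generated script might contain; the column names are made up and this is not BitRook's actual output:
# Hypothetical auto-generated cleaning steps (illustrative only, not BitRook output)
import pandas as pd

df = pd.read_csv('customers.csv')  # made-up input file
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')  # standardize column types
df['age'] = df['age'].fillna(df['age'].median())  # fill missing values
df = df.drop_duplicates()  # remove duplicate rows
df.to_csv('customers_clean.csv', index=False)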
Features
- Generates a Python script for you
- Predictive Data Detection (Correlation Matrix & Predictive Power Score)
- Handles large data
- Column Type Detection & Type Standardization
- Common Data Cleaning Functions Built-In
- Unique values, missing values
- Quantile statistics
- Descriptive statistics
- Most frequent values (category, letter frequency & word frequency)
- Outlier handling
- PII Data Detection
- Data validation script generation
- Binning (including WOE binning)
- Viewing CSVs faster than Excel, with EDA built in
- Splitting data
- Data profiling script generation
Installation
It’s a simple downloadable desktop app from bitrook.com
Usage
Great video tutorials and support
3. Mito
I like to think of Mito as Excel in your Jupyter notebook. It gives you a lot of the capabilities and ease of use of Excel, but it generates the Python code for the changes you make. I can easily see this being a great way for people to learn pandas, too.
Features
- Pivot tables
- Generates the code for each edit
- Exploratory graphs
- Dataframe merging
- Excel-like formulas
- Exploratory data analysis
- Data filtering
Installation
python -m pip install mitoinstaller
python -m mitoinstaller install
Usage
import mitosheet
mitosheet.sheet()
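If you already have DataFrames loaded, you can pass them straight into the sheet; a small sketch, assuming a placeholder CSV:
# Open a Mito sheet on an existing DataFrame; each edit generates the equivalent pandas code
import pandas as pd
import mitosheet

df = pd.read_csv('data.csv')  # placeholder file
mitosheet.sheet(df)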
4. SweetViz
Sweetviz is a Python library that generates EDA visualizations in a fully self-contained HTML application. It covers the common data points like missing values, distinct values, and duplicates, but Sweetviz can also compare training data vs. test data and show how a target value relates to other features. It's really simple to use: in most cases it's just two lines of code, which is why the docs are a little light.
Features
- Overview of data
- Descriptive statistics
- Automatically detects data types
- Missing values charts
- Correlations
- Visual Comparisons (training vs test data)
- Target analysis
- Comparing 2 datasets together (training vs test data)
Installation
pip install sweetviz
Usage
import sweetviz
report = sweetviz.analyze(df)
report.show_html('report.html')
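Since the training-vs-test comparison is one of its standout features, here's a sketch of that call, assuming train_df, test_df, and a 'target' column already exist:
# Compare two datasets (e.g. train vs. test) against a target feature
import sweetviz

compare_report = sweetviz.compare([train_df, 'Train'], [test_df, 'Test'], 'target')
compare_report.show_html('compare_report.html')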