Dealing with class imbalances in NLP classification problems through Text Augmentation techniques using the amazing NLPAug

Photo by Brett Jordan on Unsplash

What is Data Augmentation and why should we care about it?

Data Augmentation is the practice of synthesizing new data from data at hand. This could be applied to any form of data from numbers to images. Usually, the augmented data is similar to the data that is already available. In all Machine Learning problems the dataset determines how well the…

Understanding various regular expressions and applying them to frequently encountered situations in Natural Language Processing

Photo by Nathaniel Shuman on Unsplash

Why are regular expressions essential for NLP?

Whenever we deal with text data, it is almost never in the form we want it to be. The text may have words we want to remove, punctuation that is not needed, hyperlinks or HTML that can be done away with and dates or numerical entities that can be…
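To make the clean-up steps mentioned above concrete, here is a minimal sketch using Python's built-in `re` module. The function name `clean_text` and the exact patterns are illustrative assumptions, not the article's own code, and the patterns are deliberately simple rather than exhaustive.

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, hyperlinks and punctuation, then normalise whitespace.

    Hypothetical helper for illustration only.
    """
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(clean_text("<p>Visit https://example.com now!!!</p>"))  # Visit now
```

HTML tags are removed before hyperlinks so that a closing tag sitting right after a URL is not swallowed by the `\S+` in the link pattern.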

A hands-on tutorial on how to speed up document retrieval by reducing the search space through Locality Sensitive Hashing (LSH)

Photo by Markus Winkler on Unsplash

The problem at hand

During one of my recent projects, I needed to analyze whether medical problems similar to a patient’s complaints had been diagnosed and documented earlier, and what ailments these complaints indicated. In order to compare the current patient’s symptoms and troubles with other records in the past, I…

Understanding mathematically how Ridge Regression helps in cases where the number of features exceeds data points

Photo by Franki Chamaki on Unsplash

The problem of more features than data

When we talk about regularisation, we almost always talk about it in the context of overfitting, but a lesser-known fact is that it can help solve the aforementioned problem.

Since data forms the very core of machine learning, you would never expect to run into such a problem. Having said…
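A tiny numerical sketch of why this situation is a problem, and how a ridge penalty changes it: with more features than data points, the matrix X^T X that ordinary least squares needs to invert is singular, while X^T X + λI is invertible for any λ > 0. The toy data, the helper names `gram` and `det2`, and the choice λ = 1 are assumptions made purely for illustration.

```python
def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    n_features = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n_features)]
            for i in range(n_features)]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

X = [[1.0, 2.0]]   # one data point, two features
G = gram(X)        # [[1, 2], [2, 4]]

lam = 1.0          # ridge penalty (illustrative value)
G_ridge = [[G[i][j] + (lam if i == j else 0.0) for j in range(2)]
           for i in range(2)]

print(det2(G))        # 0.0 -> singular, no unique least-squares solution
print(det2(G_ridge))  # 6.0 -> invertible once lam is added to the diagonal
```

The same holds in general: for any non-zero vector v, v^T (X^T X + λI) v = ||Xv||² + λ||v||² > 0, so the penalised matrix is always positive definite.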

Find out how well BERT and RoBERTa deal with questions from grammar quizzes

Photo by Ben Mullins on Unsplash


  1. A little bit about BERT and its training on the Cloze task.
  2. RoBERTa and how it differs from BERT
  3. Hugging Face and the fill mask pipeline
  4. Comparing BERT and RoBERTa’s ability to predict prepositions, articles, question tags, opposites and popular proverbs using questions from English grammar quizzes.


Tips and Tricks

A simple utility I use to address categorical features with many unique values

Photo by George Pagan III on Unsplash

What is high cardinality?

Almost all datasets now have categorical variables. Each categorical variable consists of unique values. A categorical feature is said to possess high cardinality when there are too many of these unique values. One-Hot Encoding becomes a big problem in such a case since we have a separate column for each…
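One common way to tame a high-cardinality feature before One-Hot Encoding is to fold rare values into a single bucket. The sketch below shows that idea in plain Python. It is not necessarily the utility the article describes, and the function name `group_rare`, the `min_count` threshold and the sample data are all assumptions.

```python
from collections import Counter

def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

cities = ["Pune", "Pune", "Delhi", "Goa", "Delhi", "Agra"]
grouped = group_rare(cities)

print(grouped)                              # ['Pune', 'Pune', 'Delhi', 'Other', 'Delhi', 'Other']
print(len(set(cities)), len(set(grouped)))  # 4 3
```

Fewer unique values means fewer one-hot columns, at the cost of losing the distinction between the rare categories.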

Learn how to create interactive scatter maps to represent multiple features in your data with very little code

Photo by GeoJango Maps on Unsplash

What Plotly and Mapbox bring to the table

Plotly is a powerful visualisation library that produces interactive, dynamic, easy-to-use and highly detailed plots. Plotly Express is a built-in part of the Plotly library that provides high-level APIs which require very little code to plot a variety of figures. We will be using Plotly…

Everything you can do using WhatsApp chats from analyzing texting habits to building generator models

Photo by Diego PH on Unsplash

A personalised touch

You’ll find tons of datasets from Kaggle for all sorts of analysis and model-prototyping. However, none of them are quite as fun as analysing your own data, making sense of your chatting habits, building models that can text like you or even predicting emojis you would use in a particular…

Using a single animated Bubble Chart to analyse data and observe trends

Photo by Wengang Zhai on Unsplash

Why use Plotly?

Plotly is a visualisation library which is fast becoming popular for its amazing capabilities. What sets Plotly apart from the other visualisation libraries is its ability to provide interactive, dynamic, easy to use and highly detailed plots. Using Plotly we can interact with our plots through button clicks, drop-downs, sliders…

Using PyCaret to generate a comprehensive EDA report in 1 line of code.

What is EDA and why is it essential?

Before we jump to model building, understanding the data at hand is essential. Analysing the data alone can give us valuable insights to solve our problems. Moreover, understanding data is very useful to determine which features would help our model, which features can be done away with, how can…

Raj Sangani

Machine Learning Engineer at NextWealth, LinkedIn @
