Data And Beyond

Selected stories around Data Science, Machine Learning, Artificial Intelligence, Programming, and Technology topics. Writing guide: https://medium.com/data-and-beyond/how-to-write-for-data-and-beyond-b83ff0f3813e

A Step-by-Step Guide to Feature Engineering on Textual Data (NLP)

Harshmeet Singh Chandhok
Published in Data And Beyond
6 min read · Jan 9, 2023

Photo by Amador Loureiro on Unsplash

“Good features are not born, they are engineered.”

— Kaggle Grandmaster and Data Scientist, Dr. Ben Hamner

⚙️ Feature engineering is the process of selecting and constructing the features that a machine learning model takes as input. It is a crucial step in the machine learning process that can significantly affect the model’s performance, complexity, and ability to generalize to new data. Carefully chosen features can improve the model’s accuracy and reduce the risk of overfitting.

Meme made by me
Image by Author

Tweets are one of the major sources of text data. They are a rich source of information for building machine learning models for tasks such as sentiment analysis and topic classification. To train a model on tweet data, we first need to extract features from the tweets. In this blog post, we’ll look at the different types of features that can be extracted from tweets and how to extract them in Python.

1. Text

The most obvious feature to extract from a tweet is the text itself. It can be used as-is or pre-processed in various ways depending on the needs of the model. For example, we can stem or lemmatize the text, or remove stop words. Here’s an example of pre-processing the text using the Natural Language Toolkit (nltk):

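import nltk

# One-time downloads of the tokenizer and stop-word data
# (newer NLTK releases may need 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')

def preprocess(text):
    # Tokenize the lowercased text
    tokens = nltk.word_tokenize(text.lower())
    # Remove stop words before stemming: the stop-word list holds
    # unstemmed words, so stemmed tokens such as 'thi' would slip through
    stop_words = set(nltk.corpus.stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Stem the remaining tokens
    stemmer = nltk.PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    # Rejoin the tokens into a single string
    processed_text = ' '.join(stemmed_tokens)
    return processed_text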
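
Tweets often carry URLs and @mentions that rarely help as words, so a light cleaning pass before a function like preprocess may be worthwhile. Here is a minimal sketch using only the standard library. The name clean_tweet and both patterns are illustrative assumptions, not part of the original pipeline, and real tweets may need stricter rules:

```python
import re

# Illustrative patterns (assumptions): anything starting with http(s)://
# up to the next whitespace, and @ followed by word characters.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def clean_tweet(text):
    """Strip URLs and @mentions, lowercase, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = text.lower()
    # split() with no arguments drops repeated and trailing whitespace
    return " ".join(text.split())

print(clean_tweet("Loving the new #NLP course by @somebody https://example.com/abc"))
```

Hashtags are deliberately left in place here, since they are treated as a feature of their own.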

2. Hashtags

Hashtags are a useful feature to extract from tweets, as they can provide insight into the topics being discussed. We can extract hashtags using a…
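
As one possible illustration of hashtag extraction, here is a minimal sketch that assumes a regular expression is good enough for the job. The function name extract_hashtags and the pattern are hypothetical choices made for this example and may differ from the approach the article goes on to describe:

```python
import re

# Assumed pattern: '#' followed by one or more word characters
# (letters, digits, underscore). The capture group drops the '#'.
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Return the hashtags in a tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

print(extract_hashtags("Learning #NLP and #MachineLearning with #python3"))
```

Lowercasing makes #NLP and #nlp count as the same topic, which is usually what a downstream model wants. Drop that step if case carries meaning in your data.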
