A Step-by-Step Guide to Feature Engineering on Textual Data (NLP)
“Good features are not born, they are engineered.”
— Kaggle Grandmaster and Data Scientist, Dr. Ben Hamner
⚙️ Feature engineering is the process of selecting and creating the most relevant and useful features to feed into a machine learning model. It is a crucial step that can significantly affect the model’s performance, complexity, and ability to generalize to new data. Carefully chosen and constructed features can improve the model’s accuracy and reduce the risk of overfitting.
Twitter is one of the major sources of text data. Tweets are a rich source of information for building machine learning models for tasks such as sentiment analysis and topic classification. To train a model on tweet data, we first need to extract features from the tweets. In this blog post, we’ll look at different types of features that can be extracted from tweets and how to extract them in Python.
1. Text
The most prominent feature to extract from a tweet is the text itself. This can be used as-is or pre-processed in various ways depending on the needs of the model. For example, we can stem or lemmatize the text, or remove stop words. Here’s an example of pre-processing the text using the Natural Language Toolkit (nltk):
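import nltk

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')

def preprocess(text):
    # Lowercase and tokenize the text (the stop word list is lowercase)
    tokens = nltk.word_tokenize(text.lower())
    # Remove stop words before stemming, because stemming can change them
    # (for example, "this" becomes "thi" and no longer matches the list)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Stem the remaining tokens
    stemmer = nltk.PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    # Rejoin the tokens into a single string
    processed_text = ' '.join(stemmed_tokens)
    return processed_text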
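To see what each step does without downloading any NLTK data, here is a rough, dependency-free sketch of the same three steps (tokenize, remove stop words, stem). The names `STOP_WORDS`, `SUFFIXES`, `crude_stem`, and `simple_preprocess` are made up for this illustration, the stop word list is deliberately tiny, and the suffix stripping is a crude stand-in for a real stemmer such as Porter, so treat it as a teaching aid rather than a replacement for nltk.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# nltk.corpus.stopwords.words('english') is far more complete.
STOP_WORDS = {"a", "an", "the", "is", "are", "this", "to", "of", "and", "in"}

# Hypothetical suffix list; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def simple_preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(crude_stem(t) for t in kept)


print(simple_preprocess("This is the tweet everyone is sharing"))
# prints: tweet everyone shar
```

The stop words are filtered out before stemming here so that the stop word list is compared against unmodified tokens.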
2. Hashtags
Hashtags are a useful feature to extract from tweets, as they can provide insight into the topics being discussed. We can extract hashtags using a…
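One common way to pull hashtags out of a tweet is a regular expression. The sketch below is an assumption for illustration and not necessarily the approach this post uses; the pattern and the `extract_hashtags` helper are hypothetical names, and lowercasing the tags is a design choice so that `#NLP` and `#nlp` count as the same topic.

```python
import re

# Assumed pattern: '#' followed by one or more word characters
# (letters, digits, or underscore).
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(tweet):
    """Return the hashtags in a tweet, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)]


print(extract_hashtags("Loving the new #NLP course! #MachineLearning #nlp"))
# prints: ['nlp', 'machinelearning', 'nlp']
```

The resulting list can be used directly, for example as a count of hashtags per tweet, or joined back into a string and treated as a separate text feature.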