An Active Learning experiment with an NLP classification problem

Matteo Capitani
4 min read · Sep 19, 2023
Photo by Meg Jerrard on Unsplash

Introduction

In this article, I explore active learning for an NLP classification problem. Specifically, I use the Spooky Author Identification competition dataset and pretend to start from a completely unlabelled dataset that I then have to label.

The primary references for active learning are the MIT Introduction to Data-Centric AI course (in particular, the third class) and Human-in-the-Loop Machine Learning.

I face the problem of manually annotating a dataset in many projects, and the goal of this article is to become more systematic about it. I assume annotators make no errors when labelling the data; it seems reasonable to assume that extracting text from books and annotating it with the author is not error-prone.

For this first article, I focus only on the first three steps of the annotation process: data preparation/ingestion, a first annotation informed by domain knowledge, and the training of a first baseline model.

Preparing the Dataset

As a first step, I read the data and create an oracle that returns the right label when queried; this simulates manual annotators reviewing the text.

import pandas as pd
from IPython.display import…
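
To make the idea of an oracle concrete, here is a minimal sketch of one way it could be written. It is not necessarily the implementation used in this article: the `LabelOracle` class, its `query` method and the toy ids are hypothetical names chosen for illustration. The oracle keeps the true labels hidden and only reveals the ones that are explicitly requested, which mimics sending a batch of texts to human annotators.

```python
class LabelOracle:
    """Simulated annotator (hypothetical helper): reveals true labels only on request."""

    def __init__(self, true_labels):
        # Ground truth stays private; the learner sees only what it queries.
        self._true_labels = dict(true_labels)
        # id -> label for every example annotated so far.
        self.annotated = {}

    def query(self, ids):
        """Return {id: label} for the requested ids and record them as annotated."""
        answers = {i: self._true_labels[i] for i in ids}
        self.annotated.update(answers)
        return answers

    @property
    def n_queries(self):
        """Number of distinct examples labelled so far (the annotation budget spent)."""
        return len(self.annotated)


# Toy data standing in for the dataset's id and author values (assumed for illustration).
oracle = LabelOracle({"id001": "EAP", "id002": "HPL", "id003": "MWS"})
first_batch = oracle.query(["id001", "id003"])
```

With the real data loaded into a pandas DataFrame `df`, the mapping could be built as `dict(zip(df["id"], df["author"]))`, assuming the competition file's `id` and `author` columns. Tracking `n_queries` makes it easy to report how many labels each active learning round has consumed.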


Matteo Capitani

Data Scientist, Mathematician and Physicist by Education