An Active Learning experiment with an NLP classification problem
Introduction
In this article, I want to explore active learning for an NLP classification problem. Specifically, using the Spooky Author Identification competition dataset, I want to label the data, pretending to start from a completely unlabelled dataset.
The primary references for active learning are the MIT Introduction to Data-Centric AI course (in particular, the third class) and Human in the Loop Machine Learning.
I face the problem of manually annotating a dataset in many projects, and becoming more systematic about it is the goal of this article. For this article, I assume annotators make no errors in the data labelling; it seems reasonable to assume that extracting text from books and annotating it with the author is not error prone.
In this first article, I focus only on the first three steps of the annotation process: data preparation/ingestion, the first annotation informed by domain knowledge, and the training of the first, baseline model.
Preparing the Dataset
As a first step, I read the data and create an oracle that returns the right label when queried; this simulates manual annotators reviewing the text.
import pandas as pd
from IPython.display import…
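
Since the code above is cut short here, what follows is a minimal sketch of how such an oracle could be set up, assuming the standard columns of the Spooky Author Identification training file (id, text, author); the Oracle class name and its query method are illustrative choices, not the article's original code.

import pandas as pd

# Read the Spooky Author Identification training data (columns: id, text, author).
df = pd.read_csv("train.csv")

class Oracle:
    """Simulates a human annotator: given an example id, returns the true label."""

    def __init__(self, labelled_df: pd.DataFrame):
        # Keep the ground-truth labels indexed by the example id.
        self._labels = labelled_df.set_index("id")["author"]

    def query(self, example_id: str) -> str:
        # Return the "annotator's" label for a single example.
        return self._labels.loc[example_id]

# Example usage: ask the oracle for the label of the first example.
oracle = Oracle(df)
first_id = df.loc[0, "id"]
print(first_id, oracle.query(first_id))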