Classification

Document and sentence classification is a fundamental language processing task that involves automatically assigning predefined labels to text fragments.

Core Process

Embedding models

Undersampling: Draw \(N\) subsets of the data, sampling so that every label has the same number of samples within each subset.
Document Encoding: Convert the samples in each subset into vectors.
Classifier Training & Evaluation: For each of the \(N\) subsets, train a logistic regression model with a small number of iterations, then evaluate the model on the test subset.
Average Metrics: Average the metrics over all subset runs.
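
The undersampling and averaging steps can be sketched in plain Python. This is a minimal illustration, not the benchmark's actual implementation: the function names `undersample` and `run_subsets`, the `samples_per_label` parameter, and the `fit_and_score` callback (standing in for the encoding, training, and evaluation steps) are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def undersample(rows, samples_per_label, rng):
    """Draw the same number of rows for every label, without replacement."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    # Cap at the rarest label's size so every label stays equally represented.
    k = min(samples_per_label, min(len(group) for group in by_label.values()))
    subset = []
    for label in sorted(by_label):
        subset.extend(rng.sample(by_label[label], k))
    rng.shuffle(subset)
    return subset


def run_subsets(train_rows, n_runs, samples_per_label, fit_and_score, seed=0):
    """Repeat undersample -> fit_and_score n_runs times and average each metric.

    fit_and_score is a hypothetical callback: it receives one balanced subset,
    encodes it, trains a classifier, and returns a dict of metric values.
    """
    rng = random.Random(seed)
    runs = [
        fit_and_score(undersample(train_rows, samples_per_label, rng))
        for _ in range(n_runs)
    ]
    return {name: mean(run[name] for run in runs) for name in runs[0]}
```

In this sketch a fixed seed makes the \(N\) draws reproducible, and capping the per-label count at the size of the rarest label keeps each subset balanced even when the requested count is too large.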

Data Schema Specifications

Column	Type	Description	Required
`query`	string	Text fragment	✓
`label`	string	Label or labels assigned to the fragment	✓
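
As an illustration of the schema, the following sketch checks that each row carries both required columns as strings. The helper `validate_rows` and the sample rows are hypothetical and only show the expected shape of the data. How multiple labels are encoded within the string is not specified here, so the example uses a single label per row.

```python
REQUIRED_COLUMNS = {"query": str, "label": str}


def validate_rows(rows):
    """Return a list of human-readable schema problems (empty if none)."""
    errors = []
    for index, row in enumerate(rows):
        for column, expected_type in REQUIRED_COLUMNS.items():
            if column not in row:
                errors.append(f"row {index}: missing required column '{column}'")
            elif not isinstance(row[column], expected_type):
                errors.append(f"row {index}: column '{column}' must be a string")
    return errors


# Made-up example rows that satisfy the schema.
example_rows = [
    {"query": "The battery lasts all day.", "label": "positive"},
    {"query": "The screen cracked within a week.", "label": "negative"},
]
```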

Relevant Metrics

Metric	Description
Accuracy	Accuracy averaged over subset runs
Precision	Precision averaged over subset runs
Recall	Recall averaged over subset runs
F1	F1 score averaged over subset runs

The precision, recall, and F1 scores are each returned in macro-, micro-, and weighted-average forms.
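
The three averaging forms differ in how per-label scores are combined. The sketch below follows the conventional definitions for single-label classification: macro is the unweighted mean over labels, micro pools the counts over all labels first, and weighted is the mean over labels weighted by each label's number of true samples. The function name `averaged_scores` and the output key names are assumptions for the example, and the function expects non-empty inputs.

```python
from collections import Counter


def averaged_scores(y_true, y_pred):
    """Compute precision, recall, and F1 in macro, micro, and weighted forms."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    support = Counter(y_true)
    total = len(y_true)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = {l: ratio(tp[l], tp[l] + fp[l]) for l in labels}
    recall = {l: ratio(tp[l], tp[l] + fn[l]) for l in labels}
    f1 = {
        l: ratio(2 * precision[l] * recall[l], precision[l] + recall[l])
        for l in labels
    }

    result = {}
    for name, per_label in (("precision", precision), ("recall", recall), ("f1", f1)):
        result[f"{name}_macro"] = sum(per_label.values()) / len(labels)
        result[f"{name}_weighted"] = sum(support[l] * per_label[l] for l in labels) / total

    # Micro averages pool the counts over all labels before dividing.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = ratio(tp_all, tp_all + fp_all)
    micro_r = ratio(tp_all, tp_all + fn_all)
    result["precision_micro"] = micro_p
    result["recall_micro"] = micro_r
    result["f1_micro"] = ratio(2 * micro_p * micro_r, micro_p + micro_r)
    return result
```

For example, with true labels `["a", "a", "a", "b"]` and predictions `["a", "a", "b", "b"]`, per-label precision is 1.0 for `a` and 0.5 for `b`, so macro precision is 0.75 while weighted precision is 0.875, because `a` has three of the four true samples.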

Supported Models

Only Hugging Face embedding models that are compatible with SentenceTransformer (e.g., SentenceTransformer, BERT, RoBERTa) are supported.