Classification

Document and sentence classification is a fundamental language processing task that involves automatically assigning predefined labels to text fragments.

Core Process

Embedding models

  • Undersampling: Draw \(N\) subsets of the data, sampling so that each label has the same number of samples in every subset.
  • Document Encoding: Convert the samples in each subset into embedding vectors.
  • Classifier Training & Evaluation: For each of the \(N\) subsets, train a logistic regression model with a small number of iterations, then evaluate it on the test subset.
  • Average Metrics: Average each metric over all \(N\) subset runs.
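
The sampling-and-averaging loop described above can be sketched as follows. This is an illustrative outline, not the tool's actual implementation: `undersample`, `evaluate_over_subsets`, and all parameter names are hypothetical, and the encoder and classifier are passed in as callables because the source does not show them. The sketch also assumes a fixed test set that is encoded once, and it reports accuracy only.

```python
import random
from collections import defaultdict
from statistics import mean


def undersample(rows, samples_per_label, rng):
    """Return a balanced subset with `samples_per_label` rows per label."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < samples_per_label:
            raise ValueError(f"label {label!r} has only {len(items)} rows")
        # Sample without replacement so every label contributes equally.
        subset.extend(rng.sample(items, samples_per_label))
    rng.shuffle(subset)
    return subset


def evaluate_over_subsets(train_rows, test_rows, encode, fit_predict,
                          n_runs=10, samples_per_label=8, seed=0):
    """Average accuracy over `n_runs` balanced training subsets.

    encode(texts) -> list of vectors (assumed: an embedding model).
    fit_predict(train_x, train_y, test_x) -> predicted labels
    (assumed: a logistic regression fit with few iterations).
    """
    rng = random.Random(seed)
    test_x = encode([text for text, _ in test_rows])
    test_y = [label for _, label in test_rows]
    accuracies = []
    for _ in range(n_runs):
        subset = undersample(train_rows, samples_per_label, rng)
        train_x = encode([text for text, _ in subset])
        train_y = [label for _, label in subset]
        predicted = fit_predict(train_x, train_y, test_x)
        correct = sum(p == y for p, y in zip(predicted, test_y))
        accuracies.append(correct / len(test_y))
    return mean(accuracies)
```

In practice, `encode` could wrap the `encode` method of a SentenceTransformer model, and `fit_predict` could fit scikit-learn's `LogisticRegression` with a low `max_iter` and return its `predict` output. Both pairings are assumptions about how the steps map to libraries.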

Data Schema Specifications

Column   Type     Description                    Required
query    string   Text fragment
label    string   Label or labels for fragment
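
As an illustration of the schema, the sketch below reads rows with `query` and `label` columns and checks that both are present. The CSV format, the `load_rows` helper, and the sample sentences are assumptions made for this example. The source does not state the file format, which columns are required, or how multiple labels are encoded, so the sketch treats both columns as required and each label as a single string.

```python
import csv
import io

# Column names taken from the schema above; treating both as required
# is an assumption of this sketch.
EXPECTED_COLUMNS = ("query", "label")


def load_rows(handle):
    """Read (query, label) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for index, row in enumerate(reader, start=1):
        query, label = row["query"], row["label"]
        if not query or not label:
            raise ValueError(f"data row {index}: empty query or label")
        rows.append((query, label))
    return rows


# Hypothetical sample data for illustration only.
sample = io.StringIO(
    "query,label\n"
    "The battery lasts all day,positive\n"
    "Screen cracked after a week,negative\n"
)
rows = load_rows(sample)
```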

Relevant Metrics

Metric      Description
Accuracy    Accuracy averaged over subset runs
Precision   Precision averaged over subset runs
Recall      Recall averaged over subset runs
F1          F1 score averaged over subset runs

The precision, recall, and F1 scores are each returned in macro-, micro-, and weighted-average forms.
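
The three averaging forms differ in how per-label scores are combined. The sketch below shows this for F1 in the single-label case, using the standard definitions: macro is the unweighted mean of per-label F1, weighted is the mean weighted by each label's number of true samples, and micro pools the counts across labels first. `averaged_f1` is a hypothetical helper written for this example, not part of the tool.

```python
from collections import Counter


def averaged_f1(y_true, y_pred):
    """Return macro-, micro-, and weighted-average F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    support = Counter(y_true)  # number of true samples per label
    return {
        "macro": sum(per_label.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_label[lab] * support[lab] for lab in labels)
        / len(y_true),
    }


scores = averaged_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

With imbalanced labels the three values diverge, which is why reporting all of them is useful. For single-label data, micro-averaged F1 equals accuracy.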

Supported Models

Only Hugging Face embedding models compatible with SentenceTransformer (e.g., SentenceTransformer, BERT, or RoBERTa checkpoints) are supported.