Classification
Document and sentence classification is a fundamental language processing task that involves automatically assigning predefined labels to text fragments.
Core Process
Embedding models
- Undersampling: Draw \(N\) subsets of the data, each containing the same number of samples per label.
- Document Encoding: Convert each subset's samples into embedding vectors.
- Classifier Training & Evaluation: For each of the \(N\) subsets, train a logistic regression model with a small number of iterations, then evaluate it on the test subset.
- Average Metrics: Average the metrics over all \(N\) subset runs.
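
The undersampling and averaging steps above can be sketched in plain Python. This is a minimal illustration, not the actual implementation: the names `undersample`, `average_over_runs`, `samples_per_label`, and `fit_and_score` are hypothetical. The `fit_and_score` callable stands in for the encoding, logistic regression training, and evaluation steps, which depend on the embedding model in use.

```python
import random
from collections import defaultdict
from statistics import mean


def undersample(texts, labels, samples_per_label, rng):
    """Return a balanced subset: the same number of samples for every label."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Assumption: cap at the size of the rarest label so that every label
    # contributes exactly the same number of samples.
    smallest = min((len(items) for items in by_label.values()), default=0)
    k = min(samples_per_label, smallest)

    subset_texts, subset_labels = [], []
    for label, items in by_label.items():
        for text in rng.sample(items, k):
            subset_texts.append(text)
            subset_labels.append(label)
    return subset_texts, subset_labels


def average_over_runs(texts, labels, n_runs, samples_per_label,
                      fit_and_score, seed=0):
    """Draw `n_runs` balanced subsets and average one score across them.

    `fit_and_score(subset_texts, subset_labels)` is a hypothetical callable
    that encodes the subset, trains a classifier, and returns a single metric.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        sub_texts, sub_labels = undersample(texts, labels, samples_per_label, rng)
        scores.append(fit_and_score(sub_texts, sub_labels))
    return mean(scores)
```

Seeding a single `random.Random` instance keeps the runs reproducible while still drawing a different subset on each iteration.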
Data Schema Specifications
| Column | Type | Description | Required |
|---|---|---|---|
| query | string | Text fragment | ✓ |
| label | string | Label or labels for fragment | ✓ |
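
As an illustration of the schema, the sketch below checks that a dataset provides both required columns with non-empty values. CSV storage is an assumption made for the example (the schema does not prescribe a file format), and `load_rows` and `REQUIRED_COLUMNS` are hypothetical names.

```python
import csv
import io

# Both columns are marked as required in the schema.
REQUIRED_COLUMNS = ("query", "label")


def load_rows(csv_text):
    """Parse CSV text and return rows holding the required string columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not row[column]:
                raise ValueError(f"row {row_no}: empty {column!r}")
        rows.append({c: row[c] for c in REQUIRED_COLUMNS})
    return rows


rows = load_rows("query,label\nGreat phone,positive\nBroke in a week,negative\n")
```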
Relevant Metrics
| Metric | Description |
|---|---|
| Accuracy | Accuracy averaged over subset runs |
| Precision | Precision averaged over subset runs |
| Recall | Recall averaged over subset runs |
| F1 | F1 score averaged over subset runs |
Precision, recall, and F1 are each reported as macro-, micro-, and weighted-averaged scores.
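
To show how the three averaging forms differ, here is a minimal pure-Python sketch for precision in the single-label case. The function name `precision_averages` is hypothetical and this is not necessarily how the scores are computed internally. Libraries such as scikit-learn expose the same choice through the `average` parameter of `precision_score`.

```python
from collections import Counter


def precision_averages(y_true, y_pred):
    """Return macro-, micro-, and weighted-averaged precision."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # how often each label was predicted
    support = Counter(y_true)    # how often each label truly occurs

    # Per-label precision; a label that is never predicted scores 0.
    per_label = {
        label: (true_pos[label] / predicted[label] if predicted[label] else 0.0)
        for label in labels
    }

    # Macro: unweighted mean over labels.
    macro = sum(per_label.values()) / len(labels)
    # Micro: pool all predictions before dividing.
    micro = sum(true_pos.values()) / len(y_pred)
    # Weighted: mean over labels, weighted by true label frequency.
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}


scores = precision_averages(["a", "a", "a", "b"], ["a", "a", "a", "a"])
# scores == {"macro": 0.375, "micro": 0.75, "weighted": 0.5625}
```

In this example the classifier never predicts `b`, so the macro average is pulled down the most, because it gives the rare label the same weight as the frequent one.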
Supported Models
Only Hugging Face embedding models compatible with SentenceTransformer (e.g., SentenceTransformer, BERT, and RoBERTa models) are supported.