← All projects

tagger

Fine-tunes BERT for token-level sequence tagging on natural-language utterances.

A building block from the same text-to-SQL research line: tagging the meaningful spans inside a user's question (which tokens refer to entities, which to attributes, which are filler) so downstream parsers have a cleaner signal to work with. Built in 2020 when fine-tuning BERT for token classification was still hand-rolled rather than a one-liner.

The repo fine-tunes bert-base-uncased for 3-class token tagging using PyTorch and HuggingFace transformers, with a small CLI for train, predict, and threshold_suggest. The training loop uses differential learning rates (5e-6 for the BERT encoder, 1e-3 for the classifier head) and checkpoints on best validation loss - the right shape for the job, not framework boilerplate. A research utility rather than a polished tool, but the bones are clean.