
text2sql-utils

Shared preprocessing and log-parsing utilities for the text-to-SQL research line (Spider schema format).

Shared glue across the 2019–2020 text-to-SQL research line (BERTRAND-DR, IRNet, spider-schema-gnn experiments). The Spider dataset format had become the de facto interchange for cross-domain text-to-SQL, so most of the friction was on either side of it: getting messy CSV datasets into Spider JSON, and getting structured query data back out of training logs.

Two small Python packages cover the two sides. preprocessing/ converts ad-hoc CSVs into Spider JSON, splits the data into train/dev/test sets, and vendors Spider's own process_sql.py tokenizer for compatibility. log_parser/ extracts SQL queries from noisy training-run log lines using moz_sql_parser; when a parse fails, it trims the line and retries, a small pragmatic trick for recovering structured data when the logs aren't quite clean. MIT-licensed; best read alongside the BERTRAND-DR paper rather than as a standalone artifact.
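The trim-and-retry idea in log_parser/ can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: the function name parse_with_trim, the SELECT keyword anchor, and the one-token-at-a-time trimming strategy are all assumptions made for the example. The parser is passed in as a callable so the sketch does not depend on any particular library.

```python
import re
from typing import Any, Callable, Optional


def parse_with_trim(
    line: str,
    parse: Callable[[str], Any],
    keyword: str = "SELECT",
    min_tokens: int = 2,
) -> Optional[Any]:
    """Try to parse the SQL embedded in a noisy log line.

    Hypothetical helper: drops everything before the first `keyword`,
    then retries the parse with one trailing whitespace-separated token
    removed per failed attempt.
    """
    # Locate the start of the query, ignoring case (assumed anchor).
    match = re.search(r"\b" + re.escape(keyword) + r"\b", line, flags=re.IGNORECASE)
    if match is None:
        return None  # no query on this line

    tokens = line[match.start():].split()
    while len(tokens) >= min_tokens:
        candidate = " ".join(tokens)
        try:
            return parse(candidate)
        except Exception:
            # Parse failed: assume trailing log noise and trim one token.
            tokens.pop()
    return None  # nothing parseable was left after trimming
```

With moz_sql_parser installed, the call would look like parse_with_trim(line, moz_sql_parser.parse). Catching every exception is a simplification for the sketch. One caveat of this approach is that a trimmed prefix can parse successfully while no longer being the full original query, so recovered queries are worth spot-checking.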