hierarchical-predictive-coding-transformer
Research codebase exploring hierarchical predictive coding as a training principle for transformers (HERT and variants).
Can transformer layers be trained as a stack of biologically inspired predictors instead of a single end-to-end backprop graph? HERT treats each layer as a predictor of the next layer's representation and propagates only the residual prediction error, a structural analogue of hierarchical predictive coding in cortex. The question is whether this changes what the network learns: does layer-local prediction induce representations, calibration, or compute profiles that differ from those of standard ViT/BERT training?
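The layer-local idea can be illustrated with a minimal PyTorch sketch. It is not the HERT implementation: the class name LocalPredictiveStack, the linear predictors, the mean-squared-error objective, and the placement of detach() are all assumptions made for illustration. Each stage computes its output, a predictor guesses that output from the stage input, and only the residual error is handed to the next stage.

```python
import torch
import torch.nn as nn


class LocalPredictiveStack(nn.Module):
    """Hypothetical stack in which each stage predicts the next representation."""

    def __init__(self, dim: int = 32, depth: int = 3, heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                dim, heads, dim_feedforward=2 * dim, dropout=0.0, batch_first=True
            )
            for _ in range(depth)
        )
        # Assumption: one linear predictor per stage, mapping the stage input
        # to a guess of the stage output.
        self.predictors = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x: torch.Tensor):
        local_losses = []
        signal = x
        for layer, predictor in zip(self.layers, self.predictors):
            stage_input = signal.detach()  # cut the autograd graph between stages
            target = layer(stage_input)    # representation produced by this stage
            predicted = predictor(stage_input)
            error = target - predicted     # residual prediction error
            local_losses.append(error.pow(2).mean())
            signal = error                 # only the residual moves upward
        return signal, local_losses


model = LocalPredictiveStack()
tokens = torch.randn(2, 5, 32)             # (batch, sequence, dim)
top_error, local_losses = model(tokens)
sum(local_losses).backward()               # each loss reaches only its own stage
```

Because every stage input is detached, the loss of one stage produces no gradient for the stages below it. A purely local squared-error objective like this one can be minimized trivially, so a practical setup would need an additional task or regularization term; which one HERT uses is not stated here.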
Built in PyTorch. Five model variants explore complementary mechanisms: confidence-gated early exit (CAE), difference target propagation (DTP), Hebbian updates, and locality constraints. All are compared against ViT and BERT baselines. CIFAR-10, ImageNet, and GLUE pipelines are in place, with per-variant trainers, YAML configs, and 12+ analysis scripts covering attention, CKA, calibration, t-SNE, scaling laws, and residual-amplitude decay across stages. Initial staged HERT+CAE training runs on CPU. The central open question is whether layer-local prediction can match standard ViT accuracy on CIFAR-10 at scale.
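As a rough illustration of confidence-gated early exit, the sketch below stops at the first stage whose classifier head is confident enough for every sample in the batch. The function name early_exit_forward, the use of maximum softmax probability as the confidence score, the mean pooling over tokens, and the threshold value are assumptions for this example and are not taken from the codebase.

```python
import torch
import torch.nn as nn


def early_exit_forward(layers, heads, x, threshold=0.9):
    """Return (logits, exit_depth), exiting once the whole batch is confident."""
    logits = None
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        x = layer(x)
        logits = head(x.mean(dim=1))  # mean-pool tokens, then classify
        # Assumption: confidence is the largest softmax probability per sample.
        confidence = logits.softmax(dim=-1).max(dim=-1).values
        if bool((confidence >= threshold).all()):
            return logits, depth      # skip the remaining stages
    return logits, len(layers)        # no stage was confident enough


# Stand-in stages: plain linear layers instead of transformer blocks.
layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(3))
heads = nn.ModuleList(nn.Linear(16, 10) for _ in range(3))
tokens = torch.randn(4, 7, 16)        # (batch, sequence, dim)
with torch.no_grad():
    logits, exit_depth = early_exit_forward(layers, heads, tokens, threshold=0.9)
```

Gating on the whole batch keeps the control flow simple; a per-sample gate would save more compute but needs bookkeeping for samples that exit at different depths.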