- Across 178 algorithmic problems and five LLMs, nearly 1.5× higher Lean executable correctness (pass@5) vs. direct baselines.
- Up to 2× better sample efficiency at inference time (fewer samples per verified solution at comparable generation lengths).
- Specification-driven prompting improves Python pass rates by up to 17.5%.
- Supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5× higher Lean pass success than code-only SFT.