| 8:00 - 9:30 | Breakfast and networking |  | 
| 9:30 - 10:00 | Opening Remarks, Workshop Overview | 60 seconds each from anyone who's hiring | 
| 10:00 - 10:45 | Keynote: How can proof scaling make AI safer (Max Tegmark) | AbstractI discuss how ML-turbocharged formal verification can improve AI safety, by enabling untrusted AI to create increasingly powerful formally verified tools. I also present a multilingual community-based open-source library of formally verified code, and discuss opportunities for ML and crowdsourcing to turbocharge adoption of formal verification. | 
| 10:45 - 11:15 | Break |  | 
| 11:15 - 11:45 | Proofs in the Wild (Mike Dodds) | AbstractProofs and other formal methods technologies are great, but the vast majority of systems deployed today are built without them. We’ve recently seen a few green shoots, for example in hardware, crypto, and big cloud providers. In my role as a PI at Galois, I’ve led a lot of commercial and government formal methods projects. This talk will look at three questions: how are proof technologies actually used in the wild today? What might be close to viability, for example with AI assistance? And what seems far off, even with AI? | 
| 11:45 - 12:15 | Compliance and AGI (Evan Miyazono) | Abstracthttps://atlascomputing.org/ | 
| 12:15 - 1:30 | Lunch and networking |  | 
| 1:30 - 2:00 | FVAPPS (Ronak Mehta and Quinn Dougherty) | AbstractWe introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4712 samples for writing programs and proving their correctness, the largest formal verification benchmark. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean's ``sorry'' keyword). On the 406 theorems of 100 randomly selected samples, Sonnet accomplished 30% and Gemini accomplished 18%. We challenge the machine learning and program synthesis communities to solve both each general purpose programming problem and it's associated correctness specifications. It is available for play at https://huggingface.co/datasets/quinn-dougherty/fvapps | 
| 2:00 - 2:30 | FrontierMath (Elliot Glazer) | Abstracthttps://arxiv.org/abs/2411.04872 | 
| 2:30 - 3:00 | Break and Networking |  | 
| 3:00 - 3:30 | Pantograph (Leni Aniva) | Abstracthttps://arxiv.org/abs/2410.16429 | 
| 3:30 - 4:00 | LeanDojo Deep Dive (Aidan Swope) | Abstracthttps://leandojo.org/ | 
| 4:00 - 4:30 | Break and networking |  | 
| 4:30 - 5:00 | Compact Proofs of Model Performance via Mechanistic Interpretability (Jason Gross) | AbstractWe propose using mechanistic interpretability – techniques for reverse engineering model weights into human-interpretable algorithms – to derive and compactly prove formal guarantees on model performance. We've prototyped this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We created 102 different computer-assisted proof strategies and assessed their length and tightness of bound on each of our models. Using quantitative metrics, we found that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we found that more faithful mechanistic understanding leads to tighter performance bounds. We confirmed these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance. arXiv paper | 
| 5:00 - 5:30 | Closing Keynote: A Roadmap for Formal Mathematical Reasoning: a New Frontier in AI and Building Provably-Safe & Secure Systems (Dawn Song) | Abstract | 
| 5:30 - 6:00 | Day one wrap-up, day two orientation | Feedback form | 
| 6:00 - 9:00 | Dinner/drinks/social |  |