Skip to main content

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu


Ilya Sutskever



We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans – the generation of original mathematical terms – might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep learning based system has contributed proofs that were adopted by a formal mathematics community.

1        Introduction

Artificial neural networks have enjoyed a spectacularly successful decade, having made considerable advances in computer vision [1, 2], translation [3, 4, 5], speech recognition [6, 7], image generation [8, 9, 10, 11, 12], game playing [13, 14, 15], and robotics [16, 17]. Especially notable is the recent rapid progress in language understanding and generation capabilities [18, 19, 20, 21, 22].

With the possible exception of AlphaGo [13] and AlphaZero [23], reasoning tasks are conspicuously absent from the list above. In this work we take a step towards addressing this absence by applying a transformer language model to automated theorem proving.

Automated theorem proving [24] is an appealing domain for exploring reasoning in general and the reasoning capabilities of language models in particular for several reasons:

  • Reasoning-complete: Proving theorems very likely require general and flexible reasoning; thus an advance in theorem proving is also an advance in reasoning more broadly.
  • Search: Automated theorem proving systems can quickly check the correctness of proofs, making it a productive environment for the use and development of search methods.
  • Automated data generation: The ability to verify proofs makes it possible to automatically generate new problems that could then be used as training data. This is especially important, since collecting high quality data for reasoning tasks can be difficult.

Learning to prove theorems is somewhat analogous to learning to play Go: both offer an automated way of determining success (the game of Go is a miniature formal system), and both offer an automated way for generating new data via self play-type approaches. This similarity, together with the clear success of AlphaZero, suggests that automated theorem proving might prove to be a fruitful domain for the study of reasoning in neural networks where significant progress may be possible.

1.1       Contribution

Our contributions are the following:

  • We verify that generative pre-training substantially improves performance and that pre- training on mathematical data (such as arXiv) leads to better performance compared to pre-training on generic text from the web.
  • We find that model size is positively correlated with performance, even though the size of the Metamath dataset is relatively small.
  • We demonstrate that iteratively training a value function on statements generated by our language model leads to improved prover performance, which immediately suggests a strategy for continuous self improvement: keep training on proofs generated by the prover.
  • We also achieve a new state of the art for the Metamath environment with our best model capable of closing 56.22% of proofs from a held-out test set (vs 21.16% for the current state of the art, MetaGen-IL [25]), demonstrating that the Transformer architecture may be suitable to formal reasoning.

2        Related Work

Deep learning applied to premise selection and proof guidance Research on automated theorem proving dates back to the 50s [24], but mainstream proof assistants still suffer from combinatorial explosion of their search space as they are scaled to large corpuses, motivating the use of deep learning. Early applications of deep learning to formal mathematics focused primarily on premise selection and proof guidance. DeepMath [26] explored the use of CNNs and RNNs to predict whether a premise is useful to demonstrate a given conjecture, their results were later improved with FormulaNet [27] by the use of graph neural networks, reminiscent of NeuroSAT [28]. Proof guidance consists in selecting the next clause to process inside an automated theorem prover. Loos et al. [29] investigated the use of models similar to DeepMath’s for proof guidance and demonstrated a significant uplift on the Mizar library.

Deep learning applied to automated theorem-proving HOList [30] proposes a formal environ- ment based on HOL Light. They achieve their best performance [31] with a GNN model designed for premise selection and the use of exploration. More recently, the same team studied the use of the BERT objective with Transformers on formal statements [32], demonstrating the potential of leveraging Transformers for formal reasoning. Their study focuses on preliminary tasks that are related but not directly consisting of proving formal theorems (such as typing and conjecturing). GamePad [33] and CoqGymn/ASTactic [34] introduce environments based on the Coq theorem prover. ASTactic generates tactics as programs by sequentially expanding a partial abstract syntax tree. Holophrasm [35] and MetaGen-IL [25] propose RNN-based models to generate proofs for Metamath (the formal system we focus on). They rely on three different models, one to value goals, one to select premises and one to generate substitutions. MetaGen-IL also demonstrates an uplift in performance by generating synthetic data by forward proving.

Use of Transformers for symbolic tasks Several lines of work have been exploring language modeling using Transformers [18]. Language modeling improvements have been demonstrated from better pre-training tasks, using various objectives such as auto-regressive generation [19, 20, 21], token masking [22] or sequence masking [36], but resulting language models have so far felt short when applied to reasoning oriented tasks such as algebraic word problems [37, 38]. Recently, Lample and Charton [39] successfully applied Transformers to anti-derivative calculus and solving differential equations, hinting that Transformers are capable of generating the exogenous terms involved in the substitutions required for successful symbolic integration. The Universal Transformer [40], a Transformer with tied weights, was also shown to be successful at more algorithmic tasks. Also, Saxton et al. [41] evaluated the Transformer architecture on a variety of mathematical problems.

3        Formal Environment

We chose Metamath [42] as our formal environment. Metamath is powered by a simple meta logic system based on a single substitution rule [43].

The main Metamath library is called, which is a collection of ∼ 38k proofs based on ZFC set theory (while other formalisms can also be used on top of Metamath’s meta logic, they are not used in

Metamath has several advantages that make it convenient to use with neural networks:

  • Verification is fast and can be implemented in several hundreds lines of code.
  • Proof steps are context-free: a goal or subgoal that we wish our system to prove, together with a list of the statements of the theorems proven so far, completely define the state of the Metamath system at any stage of a proof. Other formal systems are generally wrapped in high-level programming languages that make them easier to use for humans (by including convenient features like module imports or custom user-defined tactics) but are harder to integrate with a neural network. While proofs in such systems are generally shorter and more human-readable, they are impacted by long-distance interactions which makes the complete description of the intermediary states of proofs longer, and therefore less suitable for neural language models.
  • Access to clean and compact subgoal representations makes searching the proof tree relatively straightforward. It is not the case for systems where the proving objective resembles program synthesis more than an explicit proof tree search.
  • is one of the largest libraries available and its foundations are accepted as compatible with modern mathematics.

But it also has a number of weaknesses:

  • Metamath does not have high-level tactics, which means that all of its proof steps are very low-level. As an example, the de-bruijn factor [44]–the quotient of the size of a formalization of a mathematical text and the size of its informal original version– of a Metamath proof is ∼ 10 − 20 while it is around ∼ 1 − 3 in Coq, HOL Light or Lean. Lower level proof steps mean longer proofs with greater chance of compounding errors during search.
  • The current state of the tooling around Metamath makes it a very “DIY” system, one that is not yet ready for broad adoption by the mathematics community.

While our approach would be applicable to other formal systems (such as Lean, Coq, or HOL Light), Metamath’s features allow faster prototyping and reduced iteration time in the near term, which is why we chose it for this project.

The library contains the background theorems required to demonstrate most Olympiad or undergraduate Mathematics type of problems. For example, assisted by the GPT-f proof assistant described in this work in section 6.2, we formalized IMO 1972 problem B21.

3.1       Proving in Metamath

Proving in Metamath consists of applying a previously demonstrated theorem or axiom by providing a substitution of the variables appearing in the hypotheses and conclusion of the theorem being applied, such that the substituted conclusion unifies to (which means that it “matches”) the current goal which we wish to prove. The substituted hypotheses, if any, become the new subgoals left to prove.

This mechanism, a proof step, can be used in a forward manner (where we start with the axioms and reach the desired statement, one proof step at a time) and a backward manner (where we start with the statement we wish to prove and, after applying enough proof steps, end up at axioms or previously demonstrated theorems with whose hypothesis we already determined to be true). As it is more naturally amenable to proof search, we will be operating backward.

As an example, assume we want to prove f– (3 + 2) = 5 using the definition of 4 and 5 as respective successors of 3 and 4. As a first step, we should use an equality transitivity theorem such as:

|- A = B # first hypothesis
|- C = B # second hypothesis
|- A = C # conclusion

To apply the transitivity theorem, we need to provide a substitutions that substitutes A with (3 + 2) and B with 5 such that the conclusion of the theorem unifies to the current goal. We are left with providing a substitution for B which can hardly be discovered mechanically (hence the appeal to use generative language modeling). We can substitute B with (4 + 1) as is the case in the actual proof2 in Metamath’s library.

Putting it all together, the goal here is:

|- ( 3 + 2 ) = 5
The proof step we apply:
|- A = B # first hypothesis
|- C = B # second hypothesis
|- A = C # conclusion
{{ A : ( 3 + 2 ) }} # substitution of A
{{ B : ( 4 + 1 ) }} # substitution of B
{{ C : 5 }} # substitution of C
And finally the new subgoals are:
|- ( 3 + 2 ) = ( 4 + 1 )
|- ( 4 + 1 ) = 5

Applying the following proof step with no hypothesis (the definition of 53) to the second subgoal allows us to prove it.

[[ ]] |- ( 4 + 1 ) = 5

Note that this proof step has no hypothesis and no substitution involved. It therefore closes that branch of the proof tree. From there the proof can be continued with the first subgoal, proving backward, until no subgoal is left. Also note that a proof for a given theorem of the library can only use theorems proven before the appearance of the theorem to prove; we enforce that constraint when benchmarking our models despite them being trained on the library as a whole.

In most formal systems, a proof step, consists of a goal and a mechanism that, given a goal produces new subgoals, generally referred to as a tactic. In Metamath, there is only one type of tactic based on substitution as illustrated above. Additionally since the substituted theorem must unify to the current goal, the current goal can be deduced from the tactic itself (theorem and substitution pair), which is not generally the case in other systems. As such, we’ll use tactic and proof step interchangeably in the rest of the paper.

This informal presentation of Metamath is sufficient to understand the objectives we use to train our models. A more formal definition of Metamath’s meta-logic can be found in the Metamath Book [42].

3.2       Dataset

Metamath’s uses a binary compressed format to represent proofs of statements. We process the library and extract a dataset of proof steps, stored as JSON blobs using the representation presented

above. For each proof step we store a GOAL, a PROOFSTEP and a reference to the parent goal if any, encoding the tree structure of the proofs:


"proof_label":  "unidmrn",

"goal": "[[ ]] |- U. U. ‘’ A = ( dom A u. ran A )", "proof_step": "[[ |- A = B |- C = B ]] |- A = C \\

{{ A : U. U. ‘’ A }} \\

{{ B : ( ran ‘’ A u. dom ‘’ A ) }} \\

{{ C : ( dom A u. ran A ) }}", "proof_step_hash": "37yZVNorgF8=", "parent_hash": ["n4Kl7judEN4="]


The dataset contains ∼ 3m of such proof steps for ∼ 38k theorems (different proof labels). We split that dataset between a train set and two valid and test sets each containing ∼ 1k proofs sampled randomly (∼ 90k proof steps each).

3.3       Glossary

termA string that comply to the Metamath grammar.
statement or propositionA potentially empty set of hypotheses (terms) and a conclusion (term)
entailed by the hypotheses.
theoremA proven statement.
axiomAn assumed statement.
goalA statement in the context of a proof search.
substitutionsA list of pairs of variables and terms (to substitute the variables within a
theorem or an axiom).
tacticA theorem and substitutions that unify to a goal.
subgoalsGoals generated by a tactic (the substituted hypotheses of the tactic’s
proof stepA goal and a tactic, potentially generating new subgoals.
proofA tree of goals and tactics whose root is the demonstrated theorem; leaves
of the tree are tactics with no subgoals or goals that are hypotheses of the
root theorem.

4        Model

4.1       Architecture

We use decoder-only Transformers similar to GPT-2 [20] and GPT-3 [21]. The largest model we study has 36 layers and 774m trainable parameters.

4.2       Training Objective

The proofstep objective we use for training is a conditional language modeling objective that is asked to generate the PROOFSTEP given a GOAL, which is directly applicable to proof searches. To do so, we format our data in the following way:


There is one such objective for each JSON line in our dataset. We train with only one sentence per context (no-chunking), masking the rest of the context by assigning a loss weight wloss = 0. As we train we track the valid loss and sequence accuracy while masking the query part of the objective:


We regularize the training by early-stopping at the point of minimum valid loss and applying a weight decay wd = 0.1.

Here is a randomly sampled context as presented to our models for training:

GOAL [[ ]] |- ( ( ( J e. Nrm /\ f e. ( J Homeo K ) ) /\ ( x e. K /\ y e. (
( Clsd ‘ K ) i^i ~P x ) ) ) -> ( ‘’ f " x ) e. J ) PROOFSTEP [[ |- ( ph ->
ps ) |- ( ph -> ch ) |- ( ( ps /\ ch ) -> th ) ]] |- ( ph -> th ) {{ ch :
x e. K }} {{ ph : ( ( J e. Nrm /\ f e. ( J Homeo K ) ) /\ ( x e. K /\ y e.
( ( Clsd ‘ K ) i^i ~P x ) ) ) }} {{ ps : f e. ( J Cn K ) }} {{ th : ( ‘’ f
" x ) e. J }} <|endoftext|>

4.3       Proof Search

4.3.1       Goal Expansion

We find proofs by running proof searches. A proof search maintains a proof tree and a queue of open goals sorted by their cumulative logprob, initialized with the root goal that we wish to demonstrate (see figure 1). The cumulative logprob of a goal is defined by the sum of the logprobs of the tactics that were used to reach that goal from the root goal. Intuitively we expand goals for which the generative model is the most confident globally. This has a tendency to explore breadth first as deeper goals have more parent tactics and therefore typically a higher cumulative logprob.

Each time we expand an open goal we sample e = 32 tactics (the proofstep objective described above) from the model at temperature t = 1.0, deduplicate them, and apply the valid tactics (of which there are at most e) to the goal being expanded. Each successful tactic application generates new subgoals that are added to the proof tree and the proof search queue. The expanded goal is then removed from the queue. Note that the subgoals associated with a successfully applied tactic all share the same cumulative logprob and will eventually be expanded together (as subgoals generated from their own expansion will mechanically have a higher cumulative logprob, and will therefore be inserted behind in the queue). We denote the process of selecting the minimal cumulative logprob goal and expanding it as a proof search expansion.

Each proof search involves d = 128 goal expansions, so proofs generated have at most d proof steps. When evaluating our models, we attempt a proof search for each statement in the valid set a = 4 times, starting from an empty proof tree each time. In the above, a, e, and d are hyperparameters of the search process that we can vary to achieve better performance (at the cost of more compute), but keep constant as we compare models.

4.3.2       Formal Verifier

Performing such proof searches requires to tightly couple a Metamath verifier with our models. We implemented a Metamath kernel in Python to avoid the performance cost and brittleness of interacting with an external kernel over its REPL through standard I/O. It also provides us with a flexible environment to experiment with new ideas in ways that were not anticipated by existing verifiers. The kernel consists of a modified LR(0) parser to check that terms generated by our models comply with the Metamath grammar, along with Goal and Tactic objects that implement the Metamath substitution and represent proof trees. Our implementation is capable of exporting our in-memory representations to both our JSON marshalled format and the official proof format. The latter allows us to verify the proofs we generate with an external Metamath kernel implementation such as or metamath-exe.

Collectively, this proof search procedure and the formal verifier tied with it are what we refer to as the GPT-f automated prover.

4.4       Evaluation

We report the performance Perfvalid (θ) of a model θ, as the percentage of proofs found by this procedure within the valid or test set. We evaluate our models on the valid set of ∼ 1k theorems and once at the end of this paper on the held-out test set.


Description automatically generated

Figure 1: Proof search consists in maintaining a proof tree where multiple tactics are explored for each goal, starting from the root goal. Goals are expanded by cumulative (tactic) logprob priority.

The hyperparameters we chose to evaluate our models attempt to minimize the variance in the evaluation process while using the least amount of compute. Decreasing the number of expansions per goal e increases the variance as we less systematically explore the action space when expanding each goal. The e value can be taken quite high as each auto-regressive proof step generation uses the same query for a given goal and can therefore be batched together. Increasing e too much may also hurt performance given the breadth first nature of the cumulative logprob ordering. Increasing the number of attempts per proposition a decreases variance and consistently improves performance up to reasonably high values (We use a = 32 attempts for our final benchmarking). We found that a = 4 limited the variance in our evaluations while remaining tractable given the amount of compute we had available. Finally the proof search depth d has little impact on variance but naturally improves performance (we take d = 128 to evaluate our models and d = 256 for our final benchmarking).

The number of expansions we use per proof search may appear as relatively low, but it’s important to realize that it already requires a substantial amount of compute as each expansion consists in the auto-regressive generation of e = 32 tactics (generally hundreds of tokens and therefore forward passes each). Empirically, the hyperparameters we chose, require on average around ∼ 1k GPU.hours (with V100s) to evaluate our 700m parameters model (which leverages GPT-3’s sparse attention as well as key-value caching).

4.5       Pre-training

We study the effect of pre-training on the performance of our models. We pre-train our models on both GPT-3’s post-processed version of CommonCrawl as well as a more reasoning-focused mix of Github, arXiv and Math StackExchange.

Github is downloaded using BigQuery4 and filtered to only include deduplicated files from selected programming languages (excluding markdown, stylesheets, HTML). arXiv is downloaded using Bulk Data Access5 and filtered to only include articles labeled as Mathematics and whose LaTeX source is available. Math StackExchange is downloaded from their snapshot on the Internet Archive6 and post-processed to remove HTML tags and correlate questions and answers. We demote the mix reported in the table below as WebMath:

Table 1: Mix and source of data involved in the WebMath dataset.

Github23 GB33%
arXiv Math10 GB33%
Math StackExchange2 GB33%

4.6       Synthetic Datasets

Despite being among the largest formal mathematics libraries, the Metamath library remains scarce in the context of deep learning, especially in light of the advantages demonstrated on various NLP tasks by pre-training on large corpora. Also mostly focuses on well-known high-level theorems and does not include a large number of technical lemmas resembling the type of mathematics exercises used as curriculum for humans. Finally, Metamath lacking high level tactics such as HOL Light’s ARITH_RULE7, or Lean’s ring8, it is critical to ensure that our models are capable of proving at least basic technical theorems generally handled by high-level tactics in other systems (in domains such as arithmetic or ring equalities and inequalities)

To achieve this goal we designed synthetic datasets allowing us to generate proofs for each of these domains at will while controlling precisely by how many proofs we augment our training set.

We describe below the synthetic datasets we designed and report in section 5 the sample complexity associated with these synthetic tasks.

4.6.1       n-digit Arithmetic

We synthetically generate proofs for arithmetic formulas such as 11 ∗ 22 = 242 by following the basic algorithm for addition and multiplication, repeatedly applying theorems such as decadd9 or decaddc10. Divisions and subtractions are translated to their equivalent additions and multiplications

theorems in one proof step. We also support generating modulos and exponentiations.

We accept one hyperparameter for these synthetic proof generators, ndigits which controls the number of digits involved in these arithmetic tasks. When generating a new proof we sample uniformly in [−10ndigits , 10ndigits ] each of the numbers involved. To illustrate the level at which Metamath operates, Table 2 shows the average number of proof steps generated as a function of ndigits for each generator. These statements are generally proved with one tactic application in other higher-level

systems, which is a good example of one of Metamath’s drawbacks we identified earlier.

Table 2: Average number of proofsteps produced by our synthetic generators for ndigits = 3, 9, 18.

Addition (in Z)194894

Our goal is to leverage these synthetic generators to ensure our models are confident when faced with such subgoals in order to mitigate the large number of proof steps they require.

4.6.2       Ring Algebra

Our ring equalities generator is largely inspired by the INT inequality generator [45] . They propose an inequality generator that starts from simple formulas (such as A = A) and iteratively transforms them into more complex equalities or inequalities using a predefined list of axioms (such as commutativity of addition or distributivity of addition-multiplication). At each transformation, the axiom to be applied is chosen uniformly.

Our generator operates similarly within the Metamath formalism based on theorems equivalent to the axioms they propose. We accept two hyperparameters, the number of variables nbvar involved in the seed formulas (of the form A = A) as well as the number of theorems applied to transform the expression, denoted as depth. In addition, we use hand-crafted weights as we sample theorems in order to obtain formulas that we judged qualitatively better.

Here is a list of the theorems we use and their associated sampling weights.

Table 3: Metamath theorems use by our Ring Algebra synthetic generators. Theorems are available in the Matmath Proof Explorer.

Examples of equalities produced by the generator:

ABBA(AB)2 + (C + A) = A + (ABBA)2 + C

(AA)2 = A2AA

((BA + CA)2)2 = (BA + CA)2(BAAB + ACCA + BAAC + ABCA)

((A + B)2)2(A + A) = ((A + B)2(AB + AB + AA + BB) + (A + B)2(AB + AB + AA + BB))A

4.6.3       Default augmented Dataset

By default in all of our experiments we add synthetically generated proofs to the dataset extracted from as shown in Table 4. We’ll denote this dataset as our augmented dataset. The synthetically generated proofs account for approximately 1% of our training data which empirically appeared as big enough to achieve decent performance on the tasks we cared about and small enough not to hurt performance on the valid set, especially for small models. We attempted scaling the portion of synthetic proofs to 5% of the dataset and found out that it hurt performance for the model sizes we studied. It is nonetheless possible that including more synthetic data may turn out to be beneficial for larger models than the ones studied in this paper.

4.7       Learned Value Function

To achieve better performance, we also iteratively train a value function to guide the proof search, in place of the cumulative logprob priority described above.

We implement the value function by means of an outcome objective as follows. Any time we attempt to prove a statement, we will generate a significant number of intermediate goals. Some of these

Table 4: Number of proofs and proofsteps adjunct to constitute our augmented dataset.

GeneratorNumber of ProofsNumber of Proofsteps
9-digit Addition (in Z)1004541
9-digit Division10010047
9-digit Modulo504438
9-digit Exponentiation50910
Ring Equalities (depth = 6, nbvar = 2)501373
Ring Equalities (depth = 6, nbvar = 3)501499

goals will lead to the proof, other goals will be proved without being part of the final proof, while others will not be resolved at all. To obtain a value function, we simply train our model to predict whether a goal produced during proof search ended up being resolved by generating a new dataset of the following form:


Where a goal ends with a “P” if was resolved, and “N” otherwise.

The binary nature of the OUTCOME allows the definition of a provability function fP as the conditional probability of token P given a GOAL without having to introduce a separate value head. Given a goal g, for a model parametrized by θ:

We then define our value function V on goals with:

Not having to introduce a separate value head greatly simplifies the overall architecture. Training only involves augmenting the dataset with outcome objectives (as additional masked sentences) and sampling the “provability” function simply consists in reading the probability distribution for the token following the OUTCOME keyword (which can be done in one forward pass).

4.7.1       Iterative Data Generation and Training

Having access to a formal verifier enables us to generate the training data for fP in a fully synthetic manner by first training a model on the proofstep objective, then sampling proofs (using cumulative logprob priority) for statements from the training set, and finally, annotating goals visited by the proof searches as positives if they were proved and as negatives otherwise.

These annotations are used to train fP and the entire process can be run iteratively, similarly to Expert Iteration [46], sampling proofs using the newly trained V (instead of cumulative logprob) to guide proof search for subsequent iterations.

At each iteration we entirely re-train the model on both objectives at the same time on the dataset constructed as follows:

  • We extract the full proofs that were found for statements from the training set at each previous iteration and merge them with the original training set. We deduplicate proof steps at the proof level. This dataset becomes our new train set for the proofstep objective.
  • We extract the annotated goals visited by the proof searches for statements from the train set as well as the goals from the original train set (annotated positively) and deduplicate the goals giving priority to positive outcomes annotations. This dataset becomes our new train set for the outcome objective.

This iterative training allows controlling for overfitting on both objectives by processing in the same way the data generated by proof searches on statements from the valid set and using the resulting datasets to track their associated valid loss.

Training a value function gives an opportunity to the model to learn from its errors on data it generates. It also shifts proof searches from breadth first exploration to one that is more focused, adaptively based on the level of confidence modeled by V .

5        Experiments

We fine-tune all of our models with 1024 examples per global batch and a context size of 2048 tokens, for at most 32B tokens (our augmented dataset contains ∼ 1B tokens), early stopping at min valid loss when applicable. We anneal the learning-rate to zero (over 32B tokens). We found that restarting the training with an annealing to zero that matches the early-stopping for a given model only provides a marginal improvement, and avoided doing so.

The models are trained with the BPE encoding reported in [21], the same tokenization being used for text, code or formalized statements. We leave as future work a thorough ablation of the encoding as preliminary experimental results demonstrate possible gains with specialized tokenization techniques.

5.1       Baselines

We report three baselines: (i) the state of the art for Metamath’s as reported in MetaGen- IL[25] (their methodology for benchmarking their solution is close to ours so the numbers are directly comparable); (ii) a 160m parameters trained from scratch on our raw dataset using the proofstep objective; and (iii) a 160m parameters trained from scratch on our augmented dataset (same objective).

Table 5: Baseline performance from MetaGen-IL as well as a 160m parameters model trained on the

raw and augmented datasets.

ModelPerformance# Tokens
160m raw dataset (ours)29.22%18B
160m augmented dataset (ours)28.96%18B

We explain the improvement over MetaGen-IL (despite not relying on forward proving data genera- tion techniques) by our use of a simpler architecture (one unique Transformer vs 3 separate GRU networks); a more straightforward objective (direct auto-regressive generation of the full tactic as text vs separate premise selection and generation of the substitutions); more learnable parameters (160m vs 300k (3 2-layers bi-directional GRUs with 128 hiddens)); and more compute at training as well as test time.

Note that the dataset augmentation may have a marginal negative effect on performance on the valid set with our 160m model (but we’re within typical variance). We report in section 5.5 a more reliably positive effect with a pre-trained 700m model.

5.2       Model Size

Table 6: Performance for various model sizes trained on the augmented datasets.

ModelPerformancePerplexity# Tokens
160m augmented28.96%1.04118B
400m augmented30.23%1.04218B
700m augmented31.58%1.04018B

These results demonstrate that model size positively impacts performance in our formal setup, despite the training dataset being limited in size (we train for ∼ 18 epochs). Note that the bigger the model the more compute we use at training time as well as benchmarking.

5.3       Pre-training

Models are pre-trained on CommonCrawl using GPT-3’s[21] methodology for 260B tokens. When studying the effect of pre-training on WebMath we start from a CommonCrawl pre-trained model and

continue pre-training on WebMath for 16B additional tokens. We also report results after pre-training on GitHub only instead of WebMath for the same number of tokens.

Table 7: Performance for various model sizes and pre-training datasets.

ModelPerformancePerplexity# Tokens
160m from scratch28.96%1.04118B
160m CommonCrawl32.34%1.03016B
160m Github33.61%1.03016B
160m WebMath34.79%1.02916B
700m from scratch31.58%1.04018B
700m CommonCrawl39.61%1.02615B
700m Github41.55%1.02515B
700m WebMath42.56%1.02415B

We hypothesize that the positive pre-training effect is primarily driven by the emergence and transfer of features that are relevant to formal reasoning. It is possible to argue that most of these features are probably shallow and mostly relevant at the syntactical level but the lower performance achieved with Github only in comparison to WebMath suggests that some features may be more elaborate. We leave as future work a broader investigation of this question, which could be achieved by studying the performance of linear probes on the features of the different pre-trained models with respect to a formal objective, such as the truthiness of a set of statements provided in the Metamath (or any other formal) language.

Table 8: Performance for model sizes ranging from 160m to 1.5b parameters, pre-trained on WebMath.

ModelPerformancePerplexity# Tokens
160m (WebMath)34.79%1.02916B
400m (WebMath)39.94%1.02615B
700m (WebMath)42.56%1.02415B
1p5b (WebMath)42.39%1.02413B

It is unclear why we do not observe a smooth improvement in performance between the 700m and the 1.5b models in table 8. The lack of guarantee that the valid set has a smooth difficulty pattern may play a role here. Another effect may originate from the limited size of the training set, leading the training dynamics to saturate as we grow the number of parameters. We leave as future work a closer study of this effect which could be accomplished by training on various fractions of the training dataset and checking for similar saturation plateaux.

5.4       Learned Value Function

We report the performance of our models as we iteratively train on data generated by sampling proofs against the verifier.

Table 9: Performance of the 160m and 700m parameters models as we iterate through the learned value function data generation and re-training process. policy only consists in adding new positive proofs found to the policy training (without training a value function) while policy+value consists in the full iterative data-generation and training described in section 4.7.

ModelIteration 0Iteration 1Iteration 2
160m (WebMath) policy only34.79%38.17%38.34%
160m (WebMath) policy+value 39.27%40.70%
700m (WebMath) policy only 700m (WebMath) policy+value42.56%42.23% 44.59%43.15% 47.21%

While overfitting on the train set does not generally appear to negatively impact performance on the valid set (and can even often help noticeably if not too catastrophic), we discovered that it dramatically hurts our iterative training process. We hypothesize that overfitting collapses the data generation in a mode where exploration is weakened, the model being overly optimistic about its predictions on

the train set. We therefore carefully avoid overfitting by tracking the loss on the associated valid set, early stopping as we reach a minimum.

There is probably additional performance to be extracted by running more iterations given how continuous this iterative process appears to be. We leave as future work the design of an iterative data generation process that is less compute intensive. Indeed, we believe that a lot of computation is spent on subgoals that are not necessarily providing a lot of signal for the value function, and each iteration is quite compute intensive as it requires sampling proofs for the entire training set (which takes ∼ 20k GPU.hours on V100s in our current setup).

5.5       Sample Complexity

Ablation of our synthetic dataset augmentation demonstrates that synthetically generated proofs generalize to some extent and provide a noticeable uplift in performance on the valid set for larger models.

Table 10: Ablation of the augmented dataset for 160m and 700m parameters models.

ModelPerformancePerplexity# Tokens
160m (WebMath) raw dataset34.12%1.02916B
160m (WebMath) augmented dataset34.79%1.02916B
700m (WebMath) raw dataset40.28%1.02415B
700m (WebMath) augmented dataset42.56%1.02415B

Our main motivation for including synthetic proofs in our training, beyond the relative uplift achieved, is the study of the effect of model size and training a value function on the sample complexity of our models, as we control exactly how many examples from the synthetic domain we use for training. Table 11 reports the performance on 100 synthetically generated statements (different from the train set) as well as the number of synthetic proofs present in the training set for each model (in parenthesis).

Table 11: Performance of our models on 100 test statements from our synthetic generators (run with the same parameters used to augment the training set (see table 4).

Model9-digit addition9-digit divisionRing equalities
160m raw13% (0)4% (0)6% (0)
160m augmented78% (100)27% (100)77% (100)
160m policy+value (iteration 1)87% (100)24% (100)71% (100)
160m policy+value (iteration 2)90% (100)28% (100)79% (100)
700m raw12% (0)5% (0)7% (0)
700m augmented76% (100)32% (100)82% (100)
700m policy+value (iteration 1)90% (100)40% (100)78% (100)
700m policy+value (iteration 2)92% (100)47% (100)88% (100)

This demonstrates the close (yet not perfectly correlated) relationship between sample complexity and performance in our formal reasoning setup, suggesting that sample complexity is an important driver of improved performance with formal mathematics.

More importantly it demonstrates that our models are capable of acquiring new non-trivial capabilities with a number of training examples that is compatible with manual formalization. We plan in the future to study similar learning dynamics for more challenging tasks for which we don’t have a synthetic generator.

5.6       Results

We attempted to push the performance of our models by increasing both the number of expansions per proof search from d = 128 to d = 256, and the number of attempts per proofs from a = 4 to a = 32. We report the achieved performance as a function of the number of attempts per statements on the valid set in Table 12.

Finally, we performed a final evaluation with d = 256 and a = 32 of our 700m model policy+value (iteration 2) on the held-out test set:

Table 12: Performance of our 700m model policy+value (iteration 2) as we double the number of attempts a per proposition (with d = 256).

a = 242.90% 
a = 447.29%+4.39%
a = 851.26%+3.97%
a = 1654.05%+2.99%
a = 3256.50%+2.45%

6        Output

We describe in this section two projects we executed, aimed at sharing with the Metamath community results and tools based on our work.

6.1       Proof Shortening

We contributed 23 shortened proofs1112 of theorems to the Metamath library. These proofs were generated by the GPT-f automated prover. To discover shorter proofs, we sampled proofs for statements from the library, comparing the length of the solutions found by our models to their ground truth versions, also verifying that the shorter proofs didn’t rely on additional axioms.

The reception13 from the Metamath community was positive, proof length being a metric the commu- nity care about:

“I had a look at the proofs—very impressive results! Especially because we had a global minimization recently, and your method found much shorter proofs nevertheless.”

“Any ML-based system is impressive if it can find many shorter proofs than the ones we already have. Nice work.”

“The shorter proof is easier to translate. It’s more symmetric in that it treats A and B identically. It’s philosophically more concise in that it doesn’t rely on the existence of a universal class of all sets.”

To our knowledge, these shortened proofs are the first effective contribution of a deep learning system to a formal mathematics library14

6.2       GPT-f Proof Assistant

We created an online proof assistant15 to allow interactive proof constructions with the assistance of our models.

We used it to formalize more than 200 theorems and exercises. We found our models to be particularly useful to automatically generate a variety of technical low level proofsteps required in most Metamath proofs, search the library by adapting existing theorems to the format needed by the user (e.g.,


Description automatically generated

Figure 2: Screenshot of the GPT-f Proof Assistant

deduction form16) and suggest theorems to use. Even when mistaken, our models generally go for the right theorems, whose erroneous substitutions are often easy to fix by humans.

We shared the proof assistant with the Metamath community with the objective for it to be mutually beneficial, helping the community to be more productive and reciprocally helping us improve our models’ accuracy by automatically gathering human feedback. We also plan to extend GPT-f to other formal systems.

7        Conclusion

In this paper, we present the GPT-f automated prover and proof assistant and show that the Trans- former is suitable to formal reasoning, achieving a new state of the art result on the Metamath library. In particular we demonstrate the importance of pre-training as well as iterative training of a value function. Our results suggest that tightly coupling a deep learning system with a formal system opens up interesting opportunities for further research, with the goal of better leveraging the generative power of the former and the verification capabilities of the latter.


Szymon Sidor, Jakub Pachocki, Harri Edwards, Yura Burda and Vedant Misra inspired many of the ideas presented in this work, offering their guidance throughout the process of building GPT-f. Auguste Poiroux implemented the synthetic dataset generators presented in this paper, and formalized a large number of theorems using the proof assistant, providing invaluable feedback in the process. Szymon Sidor, Pranav Shyam, John Schulman, Jared Kaplan, Ryan Lowe and Jack Clark slogged through drafts of this paper, identifying errors and sources of confusion as well as providing helpful suggestions. Finally, the authors would like to thank the whole Metamath community for their support, feedback, and encouragement, in particular, David A. Wheeler for his motivating enthusiasm and Mario Carneiro for his precious help on a wide variety of technical questions.


  • [1]   Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [2]    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [3]    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • [4]    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [5]    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • [6]    Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, pages 1764–1772, 2014.
  • [7]    Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182, 2016.
  • [8]    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [9]    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [10]    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [11]    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [12]    Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [13]    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • [14]    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław De˛biak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • [15]    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • [16]    Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • [17]    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [18]    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [19]    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018.
  • [20]    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • [21]    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • [22]    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [23]    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • [24]    John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In

Computational Logic, volume 9, pages 135–214, 2014.

  • [25]    Mingzhe Wang and Jia Deng. Learning to prove theorems by learning to generate theorems.

arXiv preprint arXiv:2002.07019, 2020.

  • [26]    Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. Deepmath-deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pages 2235–2243, 2016.
  • [27]    Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding. In Advances in Neural Information Processing Systems, pages 2786–2796, 2017.
  • [28]    Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
  • [29]    Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017.
  • [30]    Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In International Conference on Machine Learning, pages 454–463, 2019.
  • [31]    Kshitij Bansal, Sarah M Loos, Markus N Rabe, and Christian Szegedy. Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501, 2019.
  • [32]    Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757, 2020.
  • [33]    Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning environment for theorem proving. arXiv preprint arXiv:1806.00608, 2018.
  • [34]    Kaiyu Yang and Jia Deng. Learning to prove theorems via interacting with proof assistants.

arXiv preprint arXiv:1905.09381, 2019.

  • [35]    Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644, 2016.
  • [36]   Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • [37]    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by ratio- nale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
  • [38]    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019.
  • [39]    Guillaume Lample and François Charton. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412, 2019.
  • [40]    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers. arXiv preprint arXiv:1807.03819, 2018.
  • [41]    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
  • [42]    Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Pure Mathe- matics, 2019.
  • [43]    Norman          Megill.                              How          Metamath          Proofs          Work,            2006.

  • [44]    Freek Wiedijk. The “de Bruijn factor”, 2014. freek/factor/.
  • [45]    Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Grosse. Int: An inequality benchmark for evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020.
  • [46]    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370, 2017.

A         Key Results

Table 13: Key results described in this paper (on the valid set) with a summary of the source of performance gains.

ModelPerformanceGainMain ablation
MetaGen-IL [25]21.16% Baseline and state of the art.
160m (ours)28.96%+7.8%Use of Transformers.
700m (ours)31.58%+2.5%Increase in parameters count.
700m WebMath (ours)42.56%+10.9%Pre-training.
700m policy+value (ours)47.21%+4.6%Iterated learned value function.
700m policy+value a = 32 (ours)56.50%+9.2%Increased test-time compute.

B         Example Proofs Generated

In this appendix, we display a selection of proofs generated by GPT-f (from our valid set). The right column contains the current goal. The left column displays the name of the theorem applied by to the goal. Proofs are read bottom-up and the statement being demonstrated is the last goal of the table. The subgoals generated by a proof step can be retrieved by looking at the theorem names that are indented by one additional space. The statement of the theorems can be retrieved with the Metamath Proof Explorer. Substitutions are omitted for clarity, but can be inferred by looking at the statement of the theorem being applied and comparing it with the current goal and associated subgoals.

B.1       Proof of nn0onn0ex

Such generation of exogenous terms, here to demonstrate an existence proof, is exactly what motivated
our work. It’s therefore encouraging to witness it effectively happening in practice.

B.2       Proof of uznn0sub

This proof demonstrates that n m ∈ Z =⇒ (n m) ∈ N. It exhibits another form of term generation. Here, sylibr19 states that assuming P =⇒ Q, R Q then P =⇒ R. Again, P is mechanically unified to n m ∈ Z, and R with (n m) ∈ N. The model is left to generate freely a substitution for Q: (n m) ∈ Z ∧ 0 ≤ (n m). The equivalence R Q to demonstrate becomes (n m) ∈ N ⇔ (n m) ∈ Z ∧ 0 ≤ (n m) which is exactly the statement of a theorem available in the Metamath library, elnn0z20. The statement of elnn0z is memoized by the model, and the generation of the substitution term for Q is driven by this memoization.

B.3       Proof of pm4.78

This proof displays the model capabilities to demonstrate non-trivial propositional logic statements, a task of interest because of its relationship to SAT solving.

Foot Note

1Metamath Proof Explorer – imo72b2

2Metamath Proof Explorer – 3p2e5 3Metamath Proof Explorer – df-5

4 5



8    9Metamath Proof Explorer – decadd

10Metamath Proof Explorer – decaddc

11 12


14To determine whether other deep learning-based provers have made contributions to their respective libraries, we looked for such contributions in the following systems: Holist family in HOL Light, CoqGym+ASTatic in Coq, TacticToe in HOL4. In addition, we interviewed 6 experts in formal mathematics and/or deep learning applied to formal mathematics.


16Deduction Form and Natural Deduction

17Metamath Proof Explorer – syl2anc 18Metamath Proof Explorer – rspcev

19Metamath Proof Explorer – sylibr

20Metamath Proof Explorer – elnn0z