Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs

University of Illinois at Urbana-Champaign *Equal contribution
ConvAI Lab

Abstract

Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify individual steps and to trace errors arising from dependencies between steps that may lie far apart in the chain. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the chain. In this paper, we present a framework that identifies the premises of each step to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise-Augmented Reasoning Chains (PARC) by introducing premise links, yielding a directed acyclic graph whose nodes are the steps and whose edges are the premise links. Through experiments on PERL (Premises and ERrors identification in LLMs), a PARC-based dataset that we built, we demonstrate that LLMs can reliably identify premises within complex reasoning chains; even open-source LLMs achieve 90% recall in premise identification. We also show that PARC enables more reliable error identification: accuracy improves by 6% to 16% absolute when step-by-step verification is carried out under the identified premises. Our findings highlight the utility of premise-centric representations for complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluation.

[Figure: Overview of the PARC framework]

Method

Our framework, Premise-Augmented Reasoning Chains (PARC), is structured into three primary stages: Premise Extraction, Error Identification, and Accumulation Error Detection. First, PARC augments a conventional Linear Reasoning Chain (LRC) by explicitly identifying the premises necessary for each reasoning step, converting the LRC into a directed acyclic graph and improving the traceability and evaluation of the reasoning process.
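As a rough illustration (the class and field names below are our own, not the paper's), the resulting structure can be represented as a small DAG over step indices:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One node of a Premise-Augmented Reasoning Chain (PARC)."""
    index: int                # position in the original linear chain
    text: str                 # natural-language content of the step
    premises: list[int] = field(default_factory=list)  # premise-link edges

@dataclass
class PARC:
    """A reasoning chain as a DAG: nodes are steps, edges are premise links."""
    question: str
    steps: list[Step]

    def premises_of(self, i: int) -> list[Step]:
        """Return the premise steps that step i directly depends on."""
        return [self.steps[j] for j in self.steps[i].premises]
```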

[Figure: Baseline linear reasoning chain vs. PARC example]

Premise Extraction

The Premise Extraction component converts a Linear Reasoning Chain (LRC), comprising sequentially generated reasoning steps, into a Premise-Augmented Reasoning Chain (PARC). Each step in PARC explicitly links back to its necessary premises. We explore two distinct approaches for premise identification: Aggregative Premise Mapping and Dyadic Premise Mapping.

In the Aggregative approach, we query an LLM with the entire reasoning context up to the current step and ask it to identify all necessary premises in a single pass. Conversely, the Dyadic approach performs a pairwise assessment: an LLM judges, for each earlier step individually, whether it is a necessary premise of the current step, and the steps judged necessary are aggregated into the premise set. This systematic identification of premises narrows the context to what is relevant, reduces distractors, and facilitates more accurate verification.
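A minimal sketch of the two strategies, assuming a generic `query_llm(prompt) -> str` chat-completion callable; the prompt wording is illustrative, not the paper's exact prompt:

```python
def aggregative_premises(question: str, steps: list[str], i: int,
                         query_llm) -> list[int]:
    """One LLM call over the full prior context to list all premises of step i."""
    context = "\n".join(f"[{j}] {s}" for j, s in enumerate(steps[:i]))
    prompt = (
        f"Question: {question}\n"
        f"Previous steps:\n{context}\n"
        f"Current step: {steps[i]}\n"
        "Return the indices of the previous steps this step directly relies on."
    )
    reply = query_llm(prompt)
    # Parse any integers in the reply as premise indices.
    return [int(t) for t in reply.replace(",", " ").split() if t.isdigit()]

def dyadic_premises(question: str, steps: list[str], i: int,
                    query_llm) -> list[int]:
    """One yes/no LLM call per (earlier step, current step) pair."""
    premises = []
    for j in range(i):
        prompt = (
            f"Question: {question}\n"
            f"Candidate premise [{j}]: {steps[j]}\n"
            f"Current step: {steps[i]}\n"
            "Is the candidate a necessary premise of the current step? "
            "Answer yes or no."
        )
        if query_llm(prompt).strip().lower().startswith("yes"):
            premises.append(j)
    return premises
```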

Error Identification

After premises are identified, we systematically evaluate each step for errors. The error identification process involves three primary error categories: Mathematical Errors, Logical Inconsistencies, and Accumulation Errors. Mathematical Errors refer to incorrect calculations or misapplication of formulas, while Logical Inconsistencies occur when a step does not logically follow from its premises. Our method prompts an LLM to evaluate mathematical correctness and logical consistency independently, constrained to the minimal context provided by identified premises.
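The sketch below illustrates premise-constrained verification under the same assumed `query_llm` callable; the label set and prompt wording are our own simplification:

```python
def verify_step(question: str, steps: list[str], i: int,
                premises: list[int], query_llm) -> str:
    """Classify step i given only its premises, not the full chain.

    Returns one of: 'correct', 'math_error', 'logical_inconsistency'.
    """
    premise_text = "\n".join(f"[{j}] {steps[j]}" for j in premises)
    prompt = (
        f"Question: {question}\n"
        f"Premises:\n{premise_text}\n"
        f"Step to verify: {steps[i]}\n"
        "Check (a) the calculations and formula use, and (b) whether the step "
        "follows logically from the premises alone. Reply with exactly one "
        "label: correct | math_error | logical_inconsistency."
    )
    return query_llm(prompt).strip().lower()
```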

Accumulation Error Detection

Accumulation Errors are identified through a dependency graph traversal. Steps that are locally correct but depend on erroneous prior steps (premises) are flagged as accumulation errors. We employ Depth-First Search (DFS) on the PARC structure to systematically identify these dependencies, clearly distinguishing inherent step-level inaccuracies from errors propagated due to faulty premises.
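A minimal sketch of this propagation rule as a memoized DFS over premise links (function and label names are ours):

```python
def label_accumulation_errors(premises: dict[int, list[int]],
                              local_labels: dict[int, str]) -> dict[int, str]:
    """Relabel locally correct steps whose ancestors contain an error.

    premises:     step index -> indices of its premise steps
    local_labels: step index -> 'correct', 'math_error', or
                  'logical_inconsistency' from per-step verification
    """
    memo: dict[int, bool] = {}

    def has_bad_ancestor(i: int) -> bool:
        # DFS over premise links; PARC is a DAG, so recursion terminates.
        if i not in memo:
            memo[i] = any(
                local_labels[p] != "correct" or has_bad_ancestor(p)
                for p in premises.get(i, [])
            )
        return memo[i]

    return {
        i: "accumulation_error"
        if label == "correct" and has_bad_ancestor(i) else label
        for i, label in local_labels.items()
    }
```

For example, with premises {0: [], 1: [0], 2: [1]} and local labels {0: 'math_error', 1: 'correct', 2: 'correct'}, steps 1 and 2 are relabeled as accumulation errors even though each is locally valid.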

Algorithm 1: Constructing and Evaluating PARC

Dataset and Evaluation Setup

We developed PERL (Premises and ERrors identification in LLMs), a dataset for assessing the capabilities of our approach. PERL incorporates reasoning chains from established mathematical reasoning datasets (GSM8K, MATH, Orca-Math, and MetaMathQA), annotated with both premises and error types using OpenAI's GPT-4o. The dataset includes naturally occurring errors as well as systematically introduced synthetic errors, ensuring comprehensive coverage of potential error scenarios.

We evaluate PARC's effectiveness using precision, recall, and F1-score for premise extraction, and accuracy for error identification. Our experimental setup utilizes both open-source (e.g., Llama, Qwen) and proprietary (e.g., GPT-4o) large language models, comparing our premise-centric verification approach against a traditional full-context baseline.
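Since the premises of each step form a set of indices, the premise-extraction metrics reduce to set overlap between predicted and gold premises; a minimal sketch:

```python
def premise_prf(gold: set[int], predicted: set[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted premise set against gold."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: gold premises {0, 2}, predicted {0, 2, 3}
# -> precision ~0.67, recall 1.0, F1 0.8
```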

Results

Premise Identification

Datasets: The evaluation was conducted across four mathematical reasoning datasets: GSM8K, MATH, Orca-Math, and MetaMathQA. These datasets contain diverse mathematical problems ranging from elementary to competition-level questions.

Baselines: Premise identification was assessed under two methods: Aggregative Premise Mapping (identifying premises collectively) and Dyadic Premise Mapping (pairwise identification). Models evaluated include Llama 3.1 (8B and 70B), Qwen 2.5 (7B and 32B), and GPT-4o variants (GPT-4o-mini and GPT-4o).

Table 1: Results for Premise Identification under Aggregative Premise Mapping

Results for premise identification under Aggregative Premise Mapping show strong performance from larger models: Llama 3.1 70B, Qwen 2.5 32B, and GPT-4o all achieve over 90% recall consistently across datasets. Notably, Aggregative mapping significantly outperformed Dyadic mapping in both precision and recall, suggesting that identifying premises collectively is more effective and efficient for LLMs.

Error Identification

Datasets: Error identification was tested using the PERL dataset (Premises and ERrors identification in LLMs), comprising correct solutions (positives), incorrect solutions (negatives), and synthetically generated negative examples to simulate realistic errors.

Baselines: Error detection was compared across two primary contexts: Full Context (standard LLM verification over the entire reasoning chain) and Model Premises (verification using the identified premises). Models included are Llama 3.1 (8B and 70B), Qwen 2.5 (7B, 32B, and 72B), GPT-4o-mini, and GPT-4o.

Table 2: Comparative results between Aggregative and Dyadic Premise Mapping

The comparison between Aggregative and Dyadic Premise Mapping confirms the consistently superior performance of the Aggregative approach.

Table 3: Results of Error Identification across different models

Results illustrate that the use of Model Premises substantially improves accuracy in error identification across all models, with larger models such as GPT-4o and Llama 3.1 70B benefiting the most. In particular, GPT-4o improved error identification accuracy from 68.52% in the Full Context baseline to 79.82% under Model Premises on the GSM8K dataset.

Table 4: Comparison of oracle versus model-generated premises

Comparison of oracle versus model-generated premises demonstrates that error identification accuracy remains robust when LLM-generated premises are used, reflecting that current models achieve high-quality premise identification.

Detailed Error Type Analysis

Table 5: Analysis of error types and detection accuracy

Further analysis highlights that accumulation errors are harder to detect than native errors (mathematical and logical). Under Model Premises, identification accuracy increased significantly for both types, though accumulation errors remained the hardest to detect, improving from 12% under the Full Context baseline to 57.54% under Model Premises.

Summary

Our evaluations indicate that converting Linear Reasoning Chains to PARCs and verifying steps under identified premises significantly enhances the accuracy and reliability of error detection in LLMs. The results underscore the importance of premise-aware verification, particularly for detecting subtle accumulation errors that propagate through reasoning chains.

Acknowledgments

This research benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, which provided access to leading foundation models hosted on Microsoft Azure, along with Azure credits, to conduct the research.

BibTeX

@misc{mukherjee2025premiseaugmentedreasoningchainsimprove,
      title={Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs}, 
      author={Sagnik Mukherjee and Abhinav Chinta and Takyoung Kim and Tarun Anoop Sharma and Dilek Hakkani-Tür},
      year={2025},
      eprint={2502.02362},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02362}, 
}