Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs

University of Illinois at Urbana-Champaign *Equal contribution
ConvAI Lab

Abstract

Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify individual steps and to trace errors arising from dependencies between steps that may lie far apart in the chain. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the chain. In this paper, we present a framework that identifies the premises of each step to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise-Augmented Reasoning Chains (PARC) by introducing premise links, yielding a directed acyclic graph whose nodes are the steps and whose edges are the premise links. Through experiments on PERL (Premises and ERrors identification in LLMs), a PARC-based dataset that we built, we demonstrate that LLMs can reliably identify premises within complex reasoning chains; even open-source LLMs achieve 90% recall in premise identification. We also show that PARC enables more reliable error identification: accuracy improves by 6% to 16% absolute when step-by-step verification is carried out under the identified premises. Our findings highlight the utility of premise-centric representations for complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluation.

[Figure: Overview of the PARC framework]

Method

Our framework, Premise-Augmented Reasoning Chains (PARC), is structured into three primary stages: Premise Extraction, Error Identification, and Accumulation Error Detection. First, PARC augments a conventional Linear Reasoning Chain (LRC) by explicitly identifying the premises necessary for each reasoning step, converting the LRC into a directed acyclic graph and improving the traceability and evaluation of the reasoning process.
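As a rough illustration (the class and field names below are our own, not the paper's), the resulting structure can be represented as a small DAG over step indices:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One node of a Premise-Augmented Reasoning Chain (PARC)."""
    index: int                # position in the original linear chain
    text: str                 # natural-language content of the step
    premises: list[int] = field(default_factory=list)  # premise-link edges

@dataclass
class PARC:
    """A reasoning chain as a DAG: nodes are steps, edges are premise links."""
    question: str
    steps: list[Step]

    def premises_of(self, i: int) -> list[Step]:
        """Return the premise steps that step i directly depends on."""
        return [self.steps[j] for j in self.steps[i].premises]
```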

[Figure: Baseline linear reasoning chain vs. PARC example]

Premise Extraction

The Premise Extraction component converts a Linear Reasoning Chain (LRC), comprising sequentially generated reasoning steps, into a Premise-Augmented Reasoning Chain (PARC). Each step in PARC explicitly links back to its necessary premises. We explore two distinct approaches for premise identification: Aggregative Premise Mapping and Dyadic Premise Mapping.

In the Aggregative approach, we query an LLM with the entire reasoning context up to the current step and ask it to identify all necessary premises in a single pass. Conversely, the Dyadic approach performs a pairwise assessment: an LLM judges, for each earlier step individually, whether it is a necessary premise of the current step, and the steps judged necessary are aggregated into the premise set. This systematic identification of premises narrows the context to what is relevant, reduces distractors, and facilitates more accurate verification.
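A minimal sketch of the two strategies, assuming a generic `query_llm(prompt) -> str` chat-completion callable; the prompt wording is illustrative, not the paper's exact prompt:

```python
def aggregative_premises(question: str, steps: list[str], i: int,
                         query_llm) -> list[int]:
    """One LLM call over the full prior context to list all premises of step i."""
    context = "\n".join(f"[{j}] {s}" for j, s in enumerate(steps[:i]))
    prompt = (
        f"Question: {question}\n"
        f"Previous steps:\n{context}\n"
        f"Current step: {steps[i]}\n"
        "Return the indices of the previous steps this step directly relies on."
    )
    reply = query_llm(prompt)
    # Parse any integers in the reply as premise indices.
    return [int(t) for t in reply.replace(",", " ").split() if t.isdigit()]

def dyadic_premises(question: str, steps: list[str], i: int,
                    query_llm) -> list[int]:
    """One yes/no LLM call per (earlier step, current step) pair."""
    premises = []
    for j in range(i):
        prompt = (
            f"Question: {question}\n"
            f"Candidate premise [{j}]: {steps[j]}\n"
            f"Current step: {steps[i]}\n"
            "Is the candidate a necessary premise of the current step? "
            "Answer yes or no."
        )
        if query_llm(prompt).strip().lower().startswith("yes"):
            premises.append(j)
    return premises
```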

Error Identification

After premises are identified, we systematically evaluate each step for errors. The error identification process involves three primary error categories: Mathematical Errors, Logical Inconsistencies, and Accumulation Errors. Mathematical Errors refer to incorrect calculations or misapplication of formulas, while Logical Inconsistencies occur when a step does not logically follow from its premises. Our method prompts an LLM to evaluate mathematical correctness and logical consistency independently, constrained to the minimal context provided by identified premises.
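The sketch below illustrates premise-constrained verification under the same assumed `query_llm` callable; the label set and prompt wording are our own simplification:

```python
def verify_step(question: str, steps: list[str], i: int,
                premises: list[int], query_llm) -> str:
    """Classify step i given only its premises, not the full chain.

    Returns one of: 'correct', 'math_error', 'logical_inconsistency'.
    """
    premise_text = "\n".join(f"[{j}] {steps[j]}" for j in premises)
    prompt = (
        f"Question: {question}\n"
        f"Premises:\n{premise_text}\n"
        f"Step to verify: {steps[i]}\n"
        "Check (a) the calculations and formula use, and (b) whether the step "
        "follows logically from the premises alone. Reply with exactly one "
        "label: correct | math_error | logical_inconsistency."
    )
    return query_llm(prompt).strip().lower()
```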

Accumulation Error Detection

Accumulation Errors are identified through a dependency graph traversal. Steps that are locally correct but depend on erroneous prior steps (premises) are flagged as accumulation errors. We employ Depth-First Search (DFS) on the PARC structure to systematically identify these dependencies, clearly distinguishing inherent step-level inaccuracies from errors propagated due to faulty premises.
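A minimal sketch of this propagation rule as a memoized DFS over premise links (function and label names are ours):

```python
def label_accumulation_errors(premises: dict[int, list[int]],
                              local_labels: dict[int, str]) -> dict[int, str]:
    """Relabel locally correct steps whose ancestors contain an error.

    premises:     step index -> indices of its premise steps
    local_labels: step index -> 'correct', 'math_error', or
                  'logical_inconsistency' from per-step verification
    """
    memo: dict[int, bool] = {}

    def has_bad_ancestor(i: int) -> bool:
        # DFS over premise links; PARC is a DAG, so recursion terminates.
        if i not in memo:
            memo[i] = any(
                local_labels[p] != "correct" or has_bad_ancestor(p)
                for p in premises.get(i, [])
            )
        return memo[i]

    return {
        i: "accumulation_error"
        if label == "correct" and has_bad_ancestor(i) else label
        for i, label in local_labels.items()
    }
```

For example, with premises {0: [], 1: [0], 2: [1]} and local labels {0: 'math_error', 1: 'correct', 2: 'correct'}, steps 1 and 2 are relabeled as accumulation errors even though each is locally valid.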

Algorithm 1: Constructing and Evaluating PARC

Dataset and Evaluation Setup

We developed PERL (Premises and ERrors identification in LLMs), a dataset for assessing the capabilities of our approach. PERL incorporates reasoning chains from established mathematical reasoning datasets (GSM8K, MATH, Orca-Math, and MetaMathQA), annotated with both premises and error types using OpenAI's GPT-4o. The dataset includes naturally occurring errors as well as systematically introduced synthetic errors, ensuring comprehensive coverage of potential error scenarios.

We evaluate PARC's effectiveness using precision, recall, and F1-score for premise extraction, and accuracy for error identification. Our experimental setup utilizes both open-source (e.g., Llama, Qwen) and proprietary (e.g., GPT-4o) large language models, comparing our premise-centric verification approach against a traditional full-context baseline.
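Since the premises of each step form a set of indices, the premise-extraction metrics reduce to set overlap between predicted and gold premises; a minimal sketch:

```python
def premise_prf(gold: set[int], predicted: set[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted premise set against gold."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: gold premises {0, 2}, predicted {0, 2, 3}
# -> precision ~0.67, recall 1.0, F1 0.8
```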

Results

Premise Identification

Datasets: The evaluation was conducted across four mathematical reasoning datasets: GSM8K, MATH, Orca-Math, and MetaMathQA. These datasets contain diverse mathematical problems ranging from elementary to competition-level questions.

Baselines: Premise identification was assessed under two methods: Aggregative Premise Mapping (identifying premises collectively) and Dyadic Premise Mapping (pairwise identification). Models evaluated include Llama 3.1 (8B and 70B), Qwen 2.5 (7B and 32B), and GPT-4o variants (GPT-4o-mini and GPT-4o).

Table 1: Results for Premise Identification under Aggregative Premise Mapping

Results for premise identification under Aggregative Premise Mapping show strong performance from larger models: Llama 3.1 70B, Qwen 2.5 32B, and GPT-4o all achieve over 90% recall consistently across datasets. Notably, Aggregative mapping significantly outperformed Dyadic mapping in both precision and recall, suggesting that identifying premises collectively is more effective and efficient for LLMs.

Error Identification

Datasets: Error identification was tested using the PERL dataset (Premises and ERrors identification in LLMs), comprising correct solutions (positives), incorrect solutions (negatives), and synthetically generated negative examples to simulate realistic errors.

Baselines: Error detection was compared across two primary contexts: Full Context (standard LLM verification over the entire reasoning chain) and Model Premises (verification using the identified premises). Models included are Llama 3.1 (8B and 70B), Qwen 2.5 (7B, 32B, and 72B), GPT-4o-mini, and GPT-4o.

Table 2: Comparative results between Aggregative and Dyadic Premise Mapping

The comparison between Aggregative and Dyadic Premise Mapping confirms the consistently superior performance of the Aggregative approach.

Table 3: Results of Error Identification across different models

Results illustrate that the use of Model Premises substantially improves accuracy in error identification across all models, with larger models such as GPT-4o and Llama 3.1 70B benefiting the most. In particular, GPT-4o improved error identification accuracy from 68.52% in the Full Context baseline to 79.82% under Model Premises on the GSM8K dataset.

Table 4: Comparison of oracle versus model-generated premises

Comparison of oracle versus model-generated premises demonstrates that error identification accuracy remains robust when LLM-generated premises are used, reflecting that current models achieve high-quality premise identification.

Detailed Error Type Analysis

Table 5: Analysis of error types and detection accuracy

Further analysis highlights that accumulation errors are harder to detect than native errors (mathematical and logical). Under Model Premises, identification accuracy increased significantly for both types, though accumulation errors remained the hardest to detect, improving from 12% under the Full Context baseline to 57.54% under Model Premises.

Summary

Our evaluations indicate that converting Linear Reasoning Chains to PARCs and verifying steps under identified premises significantly enhances the accuracy and reliability of error detection in LLMs. The results underscore the importance of premise-aware verification, particularly for detecting subtle accumulation errors that propagate through reasoning chains.

Acknowledgments

This research benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, which provided access to leading foundation models hosted on Microsoft Azure, along with Azure credits, to conduct the research.

BibTeX

@misc{mukherjee2025premiseaugmentedreasoningchainsimprove,
      title={Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs}, 
      author={Sagnik Mukherjee and Abhinav Chinta and Takyoung Kim and Tarun Anoop Sharma and Dilek Hakkani-Tür},
      year={2025},
      eprint={2502.02362},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02362}, 
}