A Survey of Optimization Modeling Meets LLMs: Progress and Future Directions

Abstract

By virtue of its great utility in solving real-world problems, optimization modeling has been widely employed for optimal decision-making across various sectors, but it requires substantial expertise from operations research professionals. With the advent of large language models (LLMs), new opportunities have emerged to automate the procedure of mathematical modeling. This survey presents a comprehensive and timely review of recent advancements. First, we categorize the research based on the technical stack, including data synthesis and fine-tuning for the base model, more advanced inference frameworks, benchmarks, and evaluation metrics. Second, we conduct an in-depth analysis of the quality of the benchmark datasets and offer new insights into the results reported on them. Third, we build an online portal that aggregates existing datasets and algorithms. Finally, we identify limitations in current methodologies and outline future research opportunities.

Categorization

Categorization of the existing LLM4OR research based on the technical stack. Left: taxonomy of LLMs for operations research modeling. Right: representative works for each category, sorted by their publication dates.

Benchmarks

Below are the OR benchmarks used for evaluating LLMs.

Xiao et al: "Chain-of-Experts: When LLMs Meet Complex Operations Research Problems", ICLR(2024).

  • abstract modeling
  • contains 37 instances collected from both industrial and academic scenarios

AhmadiTeshnizi et al: "OptiMUS: Scalable Optimization Modeling with (MI)LP Solvers and Large Language Models", arXiv preprint arXiv:2407.19633(2024).

  • abstract modeling
  • extends the number of instances to 269

Wang et al: "OptiBench: Benchmarking Large Language Models in Optimization Modeling with Equivalence-Detection Evaluation", ICLR(2025).

  • abstract modeling
  • offers a collection of 816 instances

Huang et al: "ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling", arXiv preprint arXiv:2405.17743(2024).

  • covers a variety of problem types, including MIP and NIP
  • features descriptions with or without tabular data
  • suffers from quality control issues, which result in a high error rate.

Huang et al: "Mamo: a Mathematical Modeling Benchmark with Solvers", arXiv preprint arXiv:2405.13144(2024).

  • includes optimal variable information, offering additional perspectives for evaluating model correctness (a checking sketch follows this entry)
  • categorizes problems into three classes: EasyLP, ComplexLP, and ODE.
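
As an illustration of how such reference information can be used, the snippet below is a minimal sketch that compares a model-produced objective value (and, optionally, variable assignments) against benchmark-provided optima. The field names and tolerances are our own illustrative assumptions, not the evaluation protocol of any particular benchmark.

```python
# Minimal sketch of correctness checking against reference optima.
# Field names and tolerances are illustrative assumptions, not the
# evaluation protocol of any specific benchmark.

def objectives_match(pred_obj: float, ref_obj: float,
                     rel_tol: float = 1e-4, abs_tol: float = 1e-6) -> bool:
    """Treat two objective values as equal up to a numerical tolerance."""
    return abs(pred_obj - ref_obj) <= max(abs_tol, rel_tol * abs(ref_obj))

def variables_match(pred_vars: dict, ref_vars: dict, tol: float = 1e-6) -> bool:
    """Compare variable assignments when the benchmark provides them."""
    return all(
        abs(pred_vars.get(name, float("inf")) - value) <= tol
        for name, value in ref_vars.items()
    )

# Example usage on a single hypothetical instance:
correct = objectives_match(42.0, 42.000003) and variables_match(
    {"x1": 3.0, "x2": 1.5}, {"x1": 3.0, "x2": 1.5}
)
print(correct)  # True
```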

Ramamonjison et al: "NL4Opt Competition: Formulating Optimization Problems Based on Their Natural Language Descriptions", NeurIPS(2022).

  • primarily focuses on simple optimization modeling problems.
  • the first optimization modeling benchmark, proposed in a competition
  • features a test set of 289 instances.

Yang et al: "OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling", ICML(2024).

  • introduces a comprehensive framework that applies multiple filters to remove erroneous cases
  • expands the test set to 605 instances.

Parashar et al: "WIQOR: A dataset for what-if analysis of Operations Research problems", ICLR(2025).

  • employs what-if analysis to assess performance

In this work, we identify the erroneous samples in these benchmarks and tag them with an "error" label. Download links for the labeled benchmarks are provided here (https://github.com/LLM4OR/LLM4OR/tree/master/static/clean_benchmarks).
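
For readers who want to work with the cleaned benchmarks, the snippet below is a minimal sketch of loading one labeled file and discarding samples tagged as erroneous. The file name and the "error" field layout are assumptions; please consult the repository above for the exact schema.

```python
import json

# Minimal sketch: load one of the labeled benchmark files and drop samples
# tagged as erroneous. The file name and the "error" field are assumptions;
# consult the repository for the exact schema.
with open("labeled_benchmark.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

clean = [s for s in samples if not s.get("error", False)]
print(f"kept {len(clean)} of {len(samples)} samples after removing error-labeled ones")
```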

Analysis on Benchmark Datasets

(a) Statistics related to the OR modeling Benchmarks

(b) Statistical Analysis of Problem Complexity across all Benchmarks


Leaderboard

Performance Comparison of Existing Methods over all Benchmarks

Poster

BibTeX

BibTeX code here