AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits

¹Shanghai Jiao Tong University  ²University of California, Los Angeles  ³Tsinghua University  ⁴Eastern Institute of Technology, Ningbo
*Indicates equal contribution

**Indicates corresponding authors
AMSbench Teaser


Abstract

Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry, yet automating their design has remained a longstanding challenge due to its complexity. Recent advances in Multimodal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design. However, current research typically evaluates MLLMs on isolated tasks within the domain and lacks a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges.

To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8,000 test questions spanning multiple difficulty levels and assesses eight prominent models, encompassing both open-source and proprietary solutions such as Qwen 2.5-VL and Gemini 2.5 Pro.

Our evaluation highlights significant limitations in current MLLMs, particularly in complex multi-modal reasoning and sophisticated circuit design tasks. These results underscore the necessity of advancing MLLMs’ understanding and effective application of circuit-specific knowledge, thereby narrowing the existing performance gap relative to human expertise and moving toward fully automated AMS circuit design workflows. Our data is released at this URL.

AMSbench Benchmark

Benchmark Introduction

The design of Analog/Mixed-Signal (AMS) circuits is highly dependent on human expertise, and its automation has been a long-standing challenge. Although Multimodal Large Language Models (MLLMs) have achieved breakthroughs in many fields, their application in the AMS domain remains limited, and a comprehensive evaluation framework to systematically measure their capabilities is lacking. To fill this gap, we propose AMSbench, the first comprehensive benchmark designed to rigorously evaluate the three core capabilities of MLLMs in the AMS domain: Perception, Analysis, and Design.

Data Collection & Curation

To build a comprehensive benchmark, we collected data from various sources, including academic textbooks, research papers, and industrial datasheets. We used tools such as MinerU to convert PDFs into structured, machine-readable content and AMSnet to generate netlists from circuit schematics. We then combined expert annotations with MLLM outputs to create high-quality "circuit-caption" data pairs.
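As a rough illustration of this flow, the sketch below wires the stages together in Python. The helper functions (schematic_to_netlist, draft_caption_with_mllm, expert_review) are hypothetical placeholders standing in for the MinerU/AMSnet tooling and the human pass, not their actual APIs.

from dataclasses import dataclass

@dataclass
class CircuitCaptionPair:
    schematic_path: str  # path to a schematic image extracted from a PDF
    netlist: str         # SPICE-style netlist reconstructed from the schematic
    caption: str         # expert-verified natural-language description

def schematic_to_netlist(schematic_path: str) -> str:
    # Placeholder for AMSnet-style netlist generation from a schematic image.
    return f"* netlist placeholder for {schematic_path}"

def draft_caption_with_mllm(schematic_path: str, netlist: str) -> str:
    # Placeholder for an MLLM call that drafts a description of the circuit.
    return f"Draft caption for {schematic_path} ({len(netlist.splitlines())} netlist lines)"

def expert_review(draft: str) -> str:
    # Placeholder for the human annotation pass that corrects the draft.
    return draft

def curate(schematic_path: str) -> CircuitCaptionPair:
    netlist = schematic_to_netlist(schematic_path)
    caption = expert_review(draft_caption_with_mllm(schematic_path, netlist))
    return CircuitCaptionPair(schematic_path, netlist, caption)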

Data Collection Pipeline

Question Generation & Task Design

AMSbench covers both Visual and Textual Question Answering (VQA/TQA) in multiple question formats. Questions are tiered into three difficulty levels (Easy, Medium, Hard) that simulate knowledge requirements ranging from the undergraduate to the professional-engineer level, ensuring a thorough evaluation of model capabilities.
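A minimal sketch of what one benchmark item might look like as a record is shown below; the field names and example contents are illustrative assumptions, not the released schema.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AMSbenchQuestion:
    task: str                 # e.g. "perception", "analysis", or "design"
    modality: str             # "VQA" (schematic image) or "TQA" (text/netlist only)
    difficulty: str           # "easy", "medium", or "hard"
    question: str             # prompt shown to the model
    image_path: Optional[str] = None                   # schematic image for VQA items
    choices: List[str] = field(default_factory=list)   # empty for open-ended formats
    answer: str = ""          # ground-truth answer or reference solution

# Example item (contents made up for illustration):
example = AMSbenchQuestion(
    task="analysis",
    modality="VQA",
    difficulty="medium",
    question="What is the primary function of this circuit?",
    image_path="schematics/two_stage_ota.png",
    choices=["Comparator", "Two-stage op-amp", "Bandgap reference", "LDO"],
    answer="Two-stage op-amp",
)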

Question Generation Examples

Evaluation & Findings

We evaluated eight leading models, including GPT-4o and Gemini 2.5 Pro. Our findings reveal significant limitations in current state-of-the-art models, especially in complex reasoning and design (a sketch of the evaluation harness follows the list below).

  • Perception: Models struggle to extract complete, accurate netlists.
  • Analysis: They show potential but fail to grasp key performance trade-offs.
  • Design: Performance is poor on complex circuits, and models cannot generate valid testbenches.
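Concretely, an evaluation of this kind reduces to a simple harness over the benchmark items. The sketch below assumes a hypothetical ask_model callable standing in for an API call to any of the evaluated models, and a dict-based item schema; the paper's actual prompting and scoring may differ.

def evaluate(ask_model, questions):
    """Score a model on multiple-choice items.

    ask_model(question_text, image_path) is assumed to return the model's
    answer as a string; each item in `questions` is assumed to be a dict
    with "question", "image_path", and "answer" keys (hypothetical schema).
    """
    correct, total = 0, 0
    for q in questions:
        prediction = ask_model(q["question"], q.get("image_path"))
        correct += int(prediction.strip().lower() == q["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)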
Model Performance Radar Chart
Perception Task Results
Interconnect and Analysis Results

For Perception Tasks, models show promise in recognizing local connectivity, but their effectiveness deteriorates when performing full netlist extraction. For instance, Gemini 2.5 Pro achieves the best overall result in component classification (94% accuracy), yet all models are challenged by the diversity of component types. Even the best-generated netlists require substantial modifications to match the ground truth.
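The paper's exact matching metric is not reproduced here; as a rough illustration of what "modifications to match the ground truth" can mean, one simple proxy is to compare the two netlists device by device and count insertions and deletions, as in the sketch below (with deliberately simplified SPICE parsing).

def parse_devices(netlist: str) -> set:
    """Very simplified SPICE parsing: one device per line, comments and control cards skipped."""
    devices = set()
    for line in netlist.splitlines():
        line = line.strip()
        if not line or line.startswith(("*", ".")):
            continue
        tokens = line.split()
        # Device type letter plus the remaining tokens (nodes, model, values).
        # Real scoring would canonicalize node names and ignore parameter values.
        devices.add((tokens[0][0].upper(), tuple(tokens[1:])))
    return devices

def netlist_edit_count(predicted: str, golden: str) -> int:
    p, g = parse_devices(predicted), parse_devices(golden)
    # Devices to delete from the prediction plus devices still missing from it.
    return len(p - g) + len(g - p)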

Design Task Results

For Analysis and Design Tasks, some models can interpret circuit functionalities but often arrive at correct answers through flawed reasoning. A critical weakness is their poor understanding of performance trade-offs, a key skill for engineers: even the top-performing model, GPT-4o, scored only 58%. In design, models such as Grok-3 and Claude Sonnet perform best on simple circuits but fail on complex systems such as SAR ADCs. Crucially, no model could consistently generate syntactically correct testbenches, likely due to a lack of relevant training data.
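The paper does not commit to a specific checker here; assuming an open-source simulator such as ngspice is installed on the PATH, one crude way to screen a model-generated testbench for syntactic validity is a batch-mode dry run, sketched below.

import pathlib
import subprocess
import tempfile

def testbench_runs(testbench_text: str, timeout_s: int = 30) -> bool:
    """Crude validity screen: write the testbench to disk and attempt a batch ngspice run."""
    with tempfile.TemporaryDirectory() as tmp:
        cir = pathlib.Path(tmp) / "tb.cir"
        cir.write_text(testbench_text)
        try:
            proc = subprocess.run(
                ["ngspice", "-b", str(cir)],  # -b: batch (non-interactive) mode
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False
        # Heuristic only: require a clean exit and no reported errors in the output.
        output = proc.stdout + proc.stderr
        return proc.returncode == 0 and "Error" not in output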

Data Statistics

The benchmark is carefully balanced across tasks and difficulty levels to provide a robust evaluation framework.

Data Statistics Pie Charts

AMSbench is composed of approximately 8,000 test questions: 6,000 for AMS-Perception, 2,000 for AMS-Analysis, and 68 for AMS-Design. The Perception tasks (left pie chart) are categorized by difficulty according to component count: simple (fewer than 9 components), medium (9-16), and hard (more than 16). They cover sub-tasks such as Total Counting, Type-wise Counting, Element Classification, and Topology Generation. The Analysis tasks (right pie chart) are divided by the academic level required: Undergraduate (532 questions), Graduate (625 questions), and Engineer-level (100 questions). These tasks cover areas such as Function Recognition, Partitioning, Captioning, and Reasoning to test both visual understanding and deep domain knowledge.
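For concreteness, the component-count thresholds above map to perception tiers as in the trivial sketch below; the tier names mirror the text, not necessarily the released metadata.

def perception_tier(num_components: int) -> str:
    # Thresholds from the perception split: fewer than 9 simple, 9-16 medium, more than 16 hard.
    if num_components < 9:
        return "simple"
    if num_components <= 16:
        return "medium"
    return "hard"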


Examples of various tasks


Paper

BibTeX

@misc{shi2025amsbenchcomprehensivebenchmarkevaluating,
  title={AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits},
  author={Yichen Shi and Ze Zhang and Hongyang Wang and Zhuofu Tao and Zhongyi Li and Bingyu Chen and Yaxin Wang and Zhiping Yu and Ting-Jung Lin and Lei He},
  year={2025},
  eprint={2505.24138},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.24138},
}