Continual Benchmarking of LLM-Based Systems on Networking Operations

Selected as a finalist for the ACM SIGCOMM Student Research Competition (SRC)

The inherent complexity of operating modern network infrastructures has led to growing interest in using Large Language Models (LLMs) to support network operators, particularly in the area of Incident Management (IM). Yet, the absence of standardized benchmarks for evaluating such systems poses challenges in tracking progress, comparing approaches, and uncovering their limitations. As LLM-based tools become widespread, there is a clear need for a comprehensive benchmarking suite that reflects the diversity and complexity of operational tasks encountered in real-world networks.

This poster outlines our vision for such a modular benchmarking suite. We describe an approach for generating operational tasks of varying complexity and discuss how to evaluate LLMs on these tasks and assess system-level performance. As a preliminary evaluation, we benchmark three LLMs (GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet) across more than 100 test cases and two pipeline variants.
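To make the evaluation setup concrete, the sketch below shows one possible shape for such a harness: each model is run through each pipeline variant over the task set, and per-combination scores are aggregated. The task schema, the pipeline variant names, and the `query_model` and `score_response` helpers are illustrative assumptions, not the actual suite described in the poster.

```python
"""Minimal benchmark-harness sketch (task schema and helpers are hypothetical)."""
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    prompt: str    # operational scenario presented to the model
    expected: str  # reference answer used for scoring

MODELS = ["gpt-4.1", "gemini-2.5-pro", "claude-3.7-sonnet"]
PIPELINES = ["direct", "tool-augmented"]  # hypothetical pipeline variant names

def query_model(model: str, pipeline: str, prompt: str) -> str:
    """Placeholder: invoke the model through the given pipeline variant."""
    raise NotImplementedError

def score_response(response: str, expected: str) -> float:
    """Placeholder: task-specific grading, e.g. exact match or a rubric."""
    return float(response.strip() == expected.strip())

def run_benchmark(tasks: list[Task]) -> dict[tuple[str, str], float]:
    """Return the mean score for every (model, pipeline) combination."""
    results: dict[tuple[str, str], float] = {}
    for model in MODELS:
        for pipeline in PIPELINES:
            scores = [
                score_response(query_model(model, pipeline, t.prompt), t.expected)
                for t in tasks
            ]
            results[(model, pipeline)] = sum(scores) / len(scores)
    return results
```

In practice, the scoring step would differ per task type (e.g. configuration diffs, root-cause labels, or free-form incident summaries), which is why the suite is framed as modular.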

Research Area: Network Analysis and Reasoning
