AI agents are a cutting-edge research direction with significant potential applications. However, a recent analysis by researchers at Princeton University has revealed critical shortcomings in current agent benchmarks and evaluation practices that hinder their real-world usefulness.
One major issue highlighted by the researchers is the lack of cost control in agent evaluations. Unlike single model calls, AI agents can be much more expensive to run: they rely on stochastic language models that can produce different results for the same query, so many agent designs sample repeatedly and keep the best answer, for example by taking a majority vote. Drawing hundreds or thousands of samples to squeeze out a bit more accuracy comes at a significant computational cost.
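To make the cost issue concrete, here is a minimal sketch, assuming hypothetical per-token prices and token counts, of how the bill grows when an agent repeatedly samples the same query instead of making a single call:

```python
# Hypothetical per-token prices (dollars per 1K tokens); not the rates of any real provider.
PRICE_IN_PER_1K = 0.005
PRICE_OUT_PER_1K = 0.015

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call under the assumed prices."""
    return input_tokens / 1000 * PRICE_IN_PER_1K + output_tokens / 1000 * PRICE_OUT_PER_1K

def repeated_sampling_cost(input_tokens: int, output_tokens: int, num_samples: int) -> float:
    """Cost of sampling the same prompt num_samples times (e.g., for a majority vote)."""
    return num_samples * call_cost(input_tokens, output_tokens)

single = call_cost(input_tokens=2_000, output_tokens=500)
voted = repeated_sampling_cost(input_tokens=2_000, output_tokens=500, num_samples=1_000)
print(f"single call: ${single:.4f} per query, 1,000 samples: ${voted:.2f} per query")
```

The cost scales linearly with the number of samples, while the accuracy gained from each additional sample typically shrinks.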
In research settings, where the goal is to maximize accuracy, inference costs may not be a problem. However, in practical applications, there is a limit to the budget available for each query, making cost control crucial. Failing to do so may lead researchers to develop excessively costly agents simply to top the leaderboard.
The researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost, and using techniques that jointly optimize the agent for these two metrics. Optimizing for both at once yields agents that cost much less while maintaining accuracy.
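As a rough illustration of what such a curve captures, here is a minimal sketch, using made-up evaluation results, that extracts the Pareto frontier from a set of agents described by average cost per query and accuracy:

```python
# Hypothetical evaluation results: (agent name, avg cost per query in $, accuracy).
results = [
    ("single call",        0.01, 0.62),
    ("majority vote x10",  0.10, 0.66),
    ("majority vote x100", 1.00, 0.67),
    ("optimized agent",    0.05, 0.66),
]

def pareto_frontier(points):
    """Keep only agents that are not dominated (no other agent is cheaper AND more accurate)."""
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            other_cost <= cost and other_acc >= acc and (other_cost, other_acc) != (cost, acc)
            for _, other_cost, other_acc in points
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda p: p[1])

for name, cost, acc in pareto_frontier(results):
    print(f"{name}: ${cost:.2f}/query, {acc:.0%} accuracy")
```

Agents below the frontier are dominated: another design is at least as accurate for less money, so there is little reason to deploy them.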
The researchers evaluated accuracy-cost tradeoffs of different prompting techniques and agentic patterns introduced in various papers. They found that for substantially similar accuracy, the cost can differ by almost two orders of magnitude. However, the cost of running these agents is not a top-line metric reported in most papers.
Joint optimization of accuracy and inference cost can produce agents that strike a better balance between the two. It also lets researchers and developers trade fixed costs for variable ones: spending more up front on optimizing the agent's design (a one-time fixed cost) in exchange for lower inference costs on every query (a variable cost).
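One simple way to operationalize joint optimization, sketched below with hypothetical candidate designs and a made-up cost penalty, is to score each design by accuracy minus a term proportional to its per-query cost and keep the best:

```python
# Hypothetical candidate designs from a one-time design search (a fixed, upfront cost).
candidates = [
    {"name": "baseline prompt",  "accuracy": 0.60, "cost_per_query": 0.020},
    {"name": "few-shot + retry", "accuracy": 0.66, "cost_per_query": 0.060},
    {"name": "trimmed few-shot", "accuracy": 0.65, "cost_per_query": 0.025},
]

# Penalty weight: how much accuracy we are willing to give up per dollar of per-query cost.
LAMBDA = 2.0

def joint_score(agent: dict) -> float:
    """Joint objective: reward accuracy, penalize variable (per-query) inference cost."""
    return agent["accuracy"] - LAMBDA * agent["cost_per_query"]

best = max(candidates, key=joint_score)
print(f"selected: {best['name']} "
      f"(accuracy={best['accuracy']:.0%}, cost=${best['cost_per_query']:.3f}/query)")
```

Here the search itself is paid for once, while the cheaper design it selects reduces the variable cost of every subsequent query.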
The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark, and found that it provides an effective way to balance accuracy against inference cost. Controlling for cost in agent evaluations remains crucial even when the primary goal is to identify innovative agent designs rather than to minimize spending.
A significant difference exists between evaluating models for research purposes and developing downstream applications. While accuracy is often the primary focus in research, inference costs play a crucial role in deciding which model and technique to use in real-world applications.
Evaluating inference costs for AI agents is complicated by the fact that different model providers charge different rates and API prices change over time. To address this, the researchers created a website that adjusts model comparisons based on current token pricing.
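The underlying idea is straightforward to sketch: log the raw token usage of each evaluation run and recompute dollar costs whenever prices change. The price table and token counts below are hypothetical placeholders:

```python
# Hypothetical current prices in dollars per million tokens; edit to match a provider's rates.
prices = {
    "model-a": {"input": 5.00, "output": 15.00},
    "model-b": {"input": 0.50, "output": 1.50},
}

# Token usage logged during evaluation, kept so that costs can be recomputed later.
usage = [
    {"model": "model-a", "input_tokens": 1_200_000, "output_tokens": 300_000, "accuracy": 0.71},
    {"model": "model-b", "input_tokens": 1_500_000, "output_tokens": 350_000, "accuracy": 0.68},
]

def recompute_cost(run: dict) -> float:
    """Translate logged token counts into dollars under the current price table."""
    p = prices[run["model"]]
    return (run["input_tokens"] / 1_000_000) * p["input"] + \
           (run["output_tokens"] / 1_000_000) * p["output"]

for run in usage:
    print(f"{run['model']}: accuracy={run['accuracy']:.0%}, cost=${recompute_cost(run):.2f}")
```

Because the token counts are fixed artifacts of the evaluation, only the price table needs updating when a provider changes its rates.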
A case study on NovelQA, a benchmark for question answering over very long texts, showed that benchmarks designed for model evaluation can be misleading when used to guide downstream decisions. In the original study, retrieval-augmented generation (RAG) appeared much worse than long-context models, but the long-context approach is roughly 20 times more expensive to run, which changes the calculus for real-world applications.
Machine learning models often find shortcuts that let them score well on benchmarks; a prominent type of shortcut is overfitting. The issue is more severe for agent benchmarks than for foundation models, because knowledge of the test samples can be programmed directly into the agent rather than merely leaking into the training data.
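A toy sketch makes the point: if test questions leak into an agent's code or knowledge base, a simple lookup table inflates the score without any general capability (the questions below are invented for illustration):

```python
# Toy "overfitted" agent: benchmark answers hard-coded into a lookup table.
LEAKED_ANSWERS = {
    "In which year was the Eiffel Tower completed?": "1889",
    "What is the capital of Australia?": "Canberra",
}

def shortcut_agent(question: str) -> str:
    if question in LEAKED_ANSWERS:       # memorized test sample
        return LEAKED_ANSWERS[question]
    return "I don't know"                # fails on anything genuinely new

print(shortcut_agent("In which year was the Eiffel Tower completed?"))  # scores on the benchmark
print(shortcut_agent("What is the capital of Canada?"))                 # reveals the shortcut
```

An agent like this would top a leaderboard whose test set is public while being useless in deployment.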
To address overfitting, benchmark developers should create and maintain holdout test sets composed of examples that cannot be memorized during training. Many agent benchmarks lack such holdout sets, which allows agents to take shortcuts, even unintentionally, and undermines the reliability of evaluation results.
Keeping shortcuts out should be the responsibility of benchmark developers rather than agent developers: benchmarks should be designed so that shortcuts are impossible in the first place, which makes the integrity of evaluation results far easier to maintain.
In short, evaluating AI agents for real-world applications requires controlling for cost in agent evaluations, weighing accuracy against inference cost, accounting for inference costs in downstream applications, and preventing shortcuts and overfitting in agent benchmarks. Establishing best practices for AI agent benchmarking is essential to distinguish genuine advances from hype in this rapidly evolving field.