Autonomous Agents
Enterprise Cost Observability
Category - GSM8K Benchmark: Grade-school math word problems for LLMs (about 8.5k problems in total: ~7.5k train, ~1.3k test). Evaluates multi-step reasoning via exact match on the final numeric answer; includes chain-of-thought prompting, self-consistency, and tool-use baselines.
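
A minimal sketch of how exact-match scoring and self-consistency voting could look for this benchmark. It assumes reference solutions end with a "#### <number>" line, as in the public GSM8K release. Taking the last number in a model completion as its prediction is a simplifying heuristic, not a prescribed method, and all function names here are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

# Matches integers and decimals, optionally negative, with thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(number: str) -> str:
    """Drop commas and a trailing fractional zero part so '1,000.0' equals '1000'."""
    text = number.replace(",", "")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


def extract_reference(solution: str) -> Optional[str]:
    """Return the number after the final '####' marker (assumed answer delimiter)."""
    _, sep, tail = solution.rpartition("####")
    if not sep:
        return None
    match = _NUMBER.search(tail)
    return normalize(match.group()) if match else None


def extract_prediction(completion: str) -> Optional[str]:
    """Heuristic (an assumption): treat the last number in the output as the answer."""
    matches = _NUMBER.findall(completion)
    return normalize(matches[-1]) if matches else None


def exact_match(completion: str, solution: str) -> bool:
    """True when the predicted final answer equals the reference final answer."""
    reference = extract_reference(solution)
    return reference is not None and extract_prediction(completion) == reference


def self_consistent_answer(completions: List[str]) -> Optional[str]:
    """Majority vote over the final answers of several sampled completions."""
    answers = [a for a in map(extract_prediction, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```

With self-consistency, the voted answer from self_consistent_answer would be compared against extract_reference in place of a single completion's prediction. A stricter harness might require the model to emit its own answer marker instead of relying on the last-number heuristic.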