Agent Efficiency

CES and Groq "Acqui-hire" Reflection: Nvidia's Plan to Build Real-Time Agents?

In this blog post, we discuss Nvidia's recent announcements at CES and its strategic partnership with Groq, focusing on how the two could enhance LLM agent inference. We explore three main aspects: the importance of KV cache hits, the role of SRAM in speeding up decoding, and a proposed hardware-software architecture that could push agent inference to real time. Prerequisite: LLM inference basics. Suggested reading: LLM Inference; KV Cache Offloading with LMCache; LLM Agent with KV Cache ...

February 16, 2026 · Hanchen Li and Collaborators
Agent Efficiency

Why Are Agents Efficiency Nightmares, and How Can We Fix Them?

Agents are EXPENSIVE. Claude Code costs about 1 USD to handle a single issue in a mid-sized public repository when used through the API, yet a subscription license costs just 20 USD per month. Why do we still get to use them so cheaply? Maybe the price war, maybe the heavy debts some upstream companies are carrying, maybe the implicit data labeling you are performing for the providers. But whatever the reason behind the pricing, we want to use these powerful agents more. Current models and applications are already capable of handling many complex tasks with minimal human intervention. However, the efficiency of these agents has only recently come to our attention. In this blog post, we discuss why agents are efficiency nightmares, how we can make them (somewhat) better with KV cache management tools like LMCache, and where we could go next in improving agents. ...

January 7, 2025 · Hanchen Li, Xiaokun Chen, Jingzhuo Hu, and Collaborators