
A Practical Guide to Reducing Latency and Costs in Agentic AI Applications

Scaling companies that are actively integrating Large Language Models (LLMs) into their agentic AI products are likely to face two significant challenges related to latency and cost:

  • increasing traffic and long prompts lead to slower response times (latency) from LLMs, which can negatively impact user experience and sales, and

  • application costs rise sharply as API (Application Programming Interface) usage grows.

Why is it important for organizations to address cost and latency now?

In May 2025, Magne Vange, an AI Executive Advisor at Cisco, outlined a potential timeline for agentic AI at the Alberta Machine Intelligence Institute’s Upper Bound Conference. He suggested that as companies move toward building standalone AI agents and multi-agent systems, latency and cost implications will progressively increase. Vange argued that companies will increasingly need to address latency and cost as they mature in agentic AI implementations.

Magne Vange Agentic AI Roadmap

Image provided by Magne Vange, Cisco

According to Georgian and NewtonX research, 91% of surveyed R&D leaders say they are already using or planning to implement agentic AI, making the issues of latency and cost reduction particularly timely.

Furthermore, according to the June 2025 “AI, Applied Global Benchmarking Report” from Georgian and NewtonX, 48% of surveyed R&D (Engineering, Product and IT) executives say that their most sophisticated AI models involve single or multi-step API calls to off-the-shelf LLMs.

Chart: 48% of surveyed R&D leaders say they use API calls

In this blog post, we aim to provide actionable strategies devised through the Georgian AI Lab’s research into cost and latency reduction, including:

  • model selection

  • prompt optimization

  • caching techniques, and

  • efficient API usage.

These techniques may allow organizations that are scaling their AI initiatives to achieve up to 80% latency reduction, over 50% cost savings, and improved user satisfaction for AI-driven applications.

Which cost and latency optimization strategies are most effective?

Georgian’s AI Lab partners with portfolio companies to implement AI solutions that help them scale. To help companies reduce cost and latency in their applications, the AI Lab evaluated 17 cost and latency optimization techniques and ranked them from highest to lowest efficacy. The following strategies had the highest likelihood of reducing cost and latency.

1. Model Selection & Complexity Reduction

Instead of using large, expensive AI models for every use case, this approach focuses on using more tailored models and smarter prompts to save time, cost and computing power.

Potential approaches include:

  • Using smaller, optimized models for simpler tasks (see the sketch after this list)

  • Leveraging prompt engineering (e.g., few-shot prompting, DSPy) to improve performance without extra API calls

  • Fine-tuning or distilling models for task-specific improvements (assuming we still meet our accuracy threshold)
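
To illustrate the first bullet above, here is a minimal sketch of routing requests between a smaller and a larger model based on a simple complexity heuristic. The model names, the classify_complexity heuristic and its threshold are illustrative assumptions, not recommendations from the AI Lab’s research.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) is installed

client = OpenAI()

# Illustrative model choices; substitute whichever small/large models you use.
SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "gpt-4o"

def classify_complexity(task: str) -> str:
    """Hypothetical heuristic: treat short, single-question tasks as 'simple'.
    In practice this could be a rules engine or a cheap classifier."""
    return "simple" if len(task) < 500 and task.count("?") <= 1 else "complex"

def answer(task: str) -> str:
    # Route lightweight tasks to the smaller, cheaper, faster model.
    model = SMALL_MODEL if classify_complexity(task) == "simple" else LARGE_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return response.choices[0].message.content

print(answer("Summarize this sentence in five words: The cat sat on the mat."))
```

Keeping the routing logic in the application layer makes it easier to verify that the accuracy threshold mentioned above is still met for each route.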

2. Input Length Optimization

Input length optimization involves trimming down what is sent to the AI by removing unnecessary context and reordering parts smartly, which makes responses faster and cheaper while maintaining a desired accuracy threshold.

Possible approaches include:

  • Leveraging Key and Value (KV) Caching (see details below) by putting dynamic portions (e.g. RAG results) at the end of the prompt to maximize prompt cache hits; a code sketch follows this list

  • Monitoring caching metrics to ensure caching is optimized as planned

  • Using semantic caching to return cached responses to similar user queries

  • Reducing prompt size by stripping unnecessary context, HTML, and metadata*

  • Pruning irrelevant RAG (retrieval augmented generation) results to minimize API token usage

*Note: According to OpenAI, shortening a prompt may only reduce latency by up to 5%. Shortening prompts may also cause KV caching to be disabled.
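
As a minimal sketch of the ordering and pruning ideas above, the helper below assembles a prompt with the static portions first (cache-friendly) and the dynamic RAG results and user query last. The relevance threshold, result cap and field names are illustrative assumptions.

```python
def build_prompt(system_instructions: str,
                 few_shot_examples: str,
                 rag_results: list[dict],
                 user_query: str,
                 min_score: float = 0.5,
                 max_results: int = 5) -> str:
    """Assemble a prompt with static portions first so provider-side prompt
    caching can reuse them, and dynamic portions (RAG results, user query)
    last. Irrelevant RAG results are pruned to cut token usage."""
    # Prune: keep only results above an illustrative relevance threshold.
    kept = [r for r in rag_results if r.get("score", 0.0) >= min_score][:max_results]
    context = "\n\n".join(r["text"] for r in kept)

    # Static prefix first (cacheable), dynamic suffix last.
    return (
        f"{system_instructions}\n\n"
        f"{few_shot_examples}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )
```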

3. Key and Value (KV) Caching

The strategy of KV Caching aims to save money and speed up LLM response time by storing and reusing parts of prompts that don’t change much, especially during frequent or large batch jobs. KV caching can be especially effective when you choose the appropriate provider for the task and time requests so that the cache stays warm.

Potential approaches include:

  • Assuming similar model performance, using OpenAI for input-heavy workloads (cheaper token pricing than Anthropic).

  • Taking advantage of KV-caching from Claude for static, long prompts & continuous requests (steep cache discounts).

  • Keeping KV cache alive with frequent requests (~5 min intervals)

  • Submitting batch, off-peak workloads to the Batch API to compound API discounts (an additional 50% savings) and extend cache lifespan.

Because prompt caching is not enabled by default on Anthropic and Google Gemini, it requires explicit implementation even when using LLM frameworks. We have prepared the following prompt caching specification sheet outlining KV caching parameters for different model providers, compiled from OpenAI, Anthropic and Google Gemini documentation; a code sketch follows the spec sheet.

KV Caching Spec Sheet

Georgian KV Caching Spec Sheet. Spec sheet references: 1. OpenAI Platform: “Prompt Caching”, 2. Anthropic Docs: “Prompt Caching”, 3. Google Gemini API Docs: “Context Caching”, 4. Google Gemini API Docs: “Gemini Developer API Pricing”, 5. OpenAI Platform: “Pricing”, 6. Anthropic: “Prompt caching with Claude”
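
As an example of that explicit opt-in, here is a minimal sketch that marks a long, static system prompt as cacheable using Anthropic’s documented cache_control parameter. The model name and prompt contents are placeholders, and exact parameters may differ across SDK versions.

```python
import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()

LONG_STATIC_INSTRUCTIONS = "..."  # e.g. a long policy document or tool schema

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_INSTRUCTIONS,
            # Marks this block as a cache breakpoint so subsequent requests
            # sharing the same prefix can reuse the cached KV state.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key policies."}],
)
print(response.content[0].text)
```

Anthropic’s responses also report cache usage (for example, cache_creation_input_tokens and cache_read_input_tokens), which is one way to implement the cache monitoring recommended earlier.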

4. Output Length Optimization

Output length optimization involves guiding the AI to give shorter, clearer answers by using smart prompts, formatting tricks, and response controls to reduce latency and cost, especially in high-volume or interactive use cases.

Potential approaches include:

  • Keeping responses concise using prompting, structured outputs, concise field names and token limits

  • Applying stop sequences and tuning temperature to reduce verbosity (see the sketch after this list).

  • OpenAI users: Using Predicted Outputs (speculative decoding) where minor response modifications are expected (for Copilot-like use cases)
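
A minimal sketch combining several of these controls in a single chat completion call; the token limit, stop sequence and temperature are illustrative assumptions rather than tuned values, and the model name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # Prompt-level control: explicitly ask for brevity and a clear end marker.
        {"role": "system", "content": "Answer in at most 3 short bullet points, then write DONE."},
        {"role": "user", "content": "Why is prompt caching useful?"},
    ],
    max_tokens=150,    # hard cap on output tokens
    temperature=0.2,   # lower temperature tends to reduce rambling
    stop=["DONE"],     # stop sequence so the model does not keep elaborating
)
print(response.choices[0].message.content)
```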

5. Efficient API Calling Patterns

The strategy of efficient API calling patterns focuses on structuring how applications talk to AI systems to be faster and more scalable. It is about batching and speeding up requests, streaming responses for quicker feedback and sometimes skipping AI entirely when simpler solutions work.

Potential approaches include:

  • Batching requests where possible to minimize API round-trips, lower costs and simplify the application. Batching can also improve throughput if your application is request rate-limited, but not yet token rate-limited (link).

  • Parallelizing sub-requests to optimize throughput. This technique consumes more input tokens, but the cost can be mitigated through KV caching (see the sketch after this list).

  • Using async programming and speculative execution (example).

  • Streaming responses to reduce a user’s perceived latency, for example by displaying text as it is generated, and pairing it with other UI components (loading indicators and skeleton screens).

  • Not every use case requires an LLM or Deep Learning. In some cases (like FAQ use cases), canned inputs/outputs may suffice.
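
The parallelization and streaming bullets can be sketched with an async client as follows; the sub-request prompts and model name are illustrative placeholders.

```python
import asyncio
from openai import AsyncOpenAI  # assumes the OpenAI Python SDK (v1+)

client = AsyncOpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

async def complete(prompt: str) -> str:
    response = await client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main() -> None:
    # Parallelize independent sub-requests to reduce end-to-end latency.
    summaries = await asyncio.gather(
        complete("Summarize document A: ..."),
        complete("Summarize document B: ..."),
    )

    # Stream the final answer so the user sees tokens as they are generated.
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Combine these summaries: " + " ".join(summaries)}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())
```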

6. Network & Infrastructure Optimization

At a high level, network & infrastructure optimization focuses on setting up servers close to the AI providers, keeping systems ready to respond quickly and using premium plans to get faster service, especially for applications that need real-time speed.

Potential approaches include:

  • Co-locating servers near API inference endpoints (OpenAI: US-based; Anthropic: Claude models hosted by your cloud provider).

  • Warming up KV caches to avoid cold starts for latency-sensitive applications (see the sketch after this list).

  • Leveraging organizational pricing tiers (e.g. OpenAI’s Scale Tier for priority compute and latency SLAs).
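
A minimal sketch of a cache keep-alive loop, assuming a hypothetical send_request_with_static_prefix helper that issues a small request reusing the same static prompt prefix as production traffic; the roughly five-minute interval follows the guidance above.

```python
import time

WARM_INTERVAL_SECONDS = 5 * 60  # keep-alive cadence suggested earlier (~5 min)

def send_request_with_static_prefix() -> None:
    """Hypothetical helper: issue a minimal request that reuses the same
    static prompt prefix as production traffic, so the provider-side KV
    cache for that prefix stays warm."""
    ...

def keep_cache_warm() -> None:
    # Run in a background worker or scheduler for latency-sensitive apps.
    while True:
        send_request_with_static_prefix()
        time.sleep(WARM_INTERVAL_SECONDS)
```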

Takeaways

  • API-focused optimization: latency and cost wins are achievable through thoughtful application design, tailored API calling strategies, and provider-specific optimizations.

  • Prompt caching is a high-leverage, but underused optimization that, in Georgian AI Lab research, improved latency (up to 80%) and cost (up to 90%) without affecting output quality.

  • By combining prompt caching with batching for calls with large prompts, we observed opportunities for ~95% cost reduction on latency-tolerant jobs.

  • Short prompts ≠ Faster responses. Shortening prompts yields only marginal (≤5%) latency gains and may disable prompt caching, leading to slower responses.

  • Controlling output tokens by explicitly prompting the model to keep its responses concise leads to meaningful reductions in response time.

  • A robust LLM observability tool is needed to verify whether caching, batching, and other optimizations are being correctly triggered in production.

  • Model choice matters. Smaller models are better suited to lightweight tasks since they are cheaper and respond more quickly.

  • Don’t default to using an LLM. Classical techniques are still relevant in a world of LLMs. Not all use cases require unconstrained inputs and outputs. Offer a constrained set of inputs where applicable to your application, or potentially hard-code responses.

By implementing some of these strategies, Georgian’s AI Lab observed:

  • Up to 80% latency reduction (through caching & optimized API calls) and

  • ~50% potential cost savings (through batching, caching, and workload scheduling), leading to

  • better user experience & scalability for AI-driven applications.
