Cached tokens drop when tool has interleaved thinking

It appears caching is not working properly with interleaved thinking.

To reproduce: just ask the model to read the same file 5 times and think briefly between the calls. After the turn ends, send a simple follow-up message and check `cached_tokens`.

The cached_tokens metric depends on whether the current request’s context exactly matches a prefix of a previous request. When interleaving thinking with tool calls, slight variations in the reasoning content or tool parameters can break the cache alignment.
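
As an illustration, this prefix-matching behavior can be sketched with a toy model (not the actual server implementation; the block labels are made up):

```python
# Toy sketch of prefix caching: only the longest exactly-matching leading
# run of the context counts as cached tokens.

def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count how many leading blocks match exactly between two requests."""
    n = 0
    for prev_blk, cur_blk in zip(previous, current):
        if prev_blk != cur_blk:
            break
        n += 1
    return n

turn_1 = ["<system>", "<user:read file>", "<think:A>", "<tool:read>", "<result>"]
# Follow-up request where the client re-serialized the thinking differently:
turn_2 = ["<system>", "<user:read file>", "<think:B>", "<tool:read>", "<result>", "<user:hi>"]

hit = cached_prefix_len(turn_1, turn_2)  # cache breaks at the divergent thinking block
```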

Here are two key factors to consider for optimization:

1. Tool Call Consistency
Ensure that the tool definitions (names, descriptions, schemas) and input parameters are byte-for-byte identical across requests. Even minor changes in JSON formatting, whitespace, or parameter ordering will invalidate the cache for subsequent turns.
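
A minimal Python sketch of why serialization discipline matters (the `read_file` tool definition here is hypothetical; the point is the serialization, not the fields):

```python
import json

# Hypothetical tool definition.
tool = {"name": "read_file", "parameters": {"path": "string"}, "description": "Read a file"}

# Two logically identical serializations that are NOT byte-identical:
loose = json.dumps(tool, indent=2)                            # whitespace differs
canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))

assert loose != canonical  # same data, different bytes -> different cache prefix

# Re-serializing the same way on every request yields identical bytes:
again = json.dumps(tool, sort_keys=True, separators=(",", ":"))
assert canonical == again
```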

2. Preserving reasoning_content
In the K2.5 architecture, prefix caching requires byte-for-byte alignment of the full context, including hidden reasoning_content. When interleaved thinking generates new reasoning between tool calls, this creates a divergent prefix—cached_tokens will only extend up to the start of the current message, as the new reasoning block breaks the contiguous matching sequence. To maximize cache hits across turns, you must preserve all historical reasoning_content intact in the context window (see K2 Thinking Model FAQ).
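
A sketch of history assembly that keeps `reasoning_content` intact, assuming an OpenAI-style message list where the assistant message carries a `reasoning_content` field alongside tool calls (field name per the K2 Thinking docs; the helper function and sample values are ours):

```python
history = [{"role": "user", "content": "Read config.yaml five times."}]

def append_assistant_step(history, message):
    entry = {"role": "assistant", "content": message.get("content", "")}
    # Keep the hidden reasoning verbatim -- dropping or rewriting it breaks
    # the byte-for-byte prefix match on the next request.
    if "reasoning_content" in message:
        entry["reasoning_content"] = message["reasoning_content"]
    if "tool_calls" in message:
        entry["tool_calls"] = message["tool_calls"]
    history.append(entry)
    return history

step = {"content": "", "reasoning_content": "I should call the read tool.",
        "tool_calls": [{"id": "call_1", "function": {"name": "read_file"}}]}
append_assistant_step(history, step)
```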

Hope it helps.

OK, what I don’t understand is that caching increases when multiple tool calls happen in the same turn (by “turn” I mean a user message plus multiple thinking blocks and tool calls), but it drops when I send a follow-up message.

BTW, this was also reproduced on the kimi-cli.

Thanks.

I discussed this with my colleagues, and I completely understand the gap now. You are absolutely correct about the behavior you are seeing. The drop in caching between turns isn’t a bug, but a specific consequence of how the context history is constructed versus how the underlying model processes “thinking.”

Here is the breakdown of why this happens:

  • Within a Single Turn (Cache Hits):
    When the model is performing multiple steps (thinking → tool call → thinking), the context grows linearly. The system appends the latest thinking and tool results to the existing context. Because each new request’s prefix is identical to the context of the step just before it, the KV cache is successfully hit and extended.
  • Starting a New Turn (Cache Drop):
    When you send a follow-up message (Turn N+1), the client (like the kimi-cli) re-assembles the history. To save tokens or present a cleaner context, the historical thinking blocks from previous turns are usually stripped out.
    Since our architecture relies on Longest Prefix Matching, the system successfully hits the cache for all the common history up to the end of Turn N’s user message. However, because the ‘thinking’ block that originally followed is removed in the new request, the prefix diverges right after that point. Consequently, the model must re-process the stream starting from the ‘Assistant Answer’ of Turn N, which explains the drop in cached tokens.
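
The two cases above can be simulated with a toy prefix-match model (block labels are made up; this is an illustration, not the server logic):

```python
def common_prefix(a, b):
    """Length of the longest exactly-matching leading run of blocks."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Within Turn N: each step appends to the previous context -> full prefix hit.
step_1 = ["sys", "user_N", "think_1", "tool_1"]
step_2 = step_1 + ["think_2", "tool_2"]
assert common_prefix(step_1, step_2) == len(step_1)  # cache extends

# Turn N+1: the client strips historical thinking when re-assembling history.
turn_n_full = step_2 + ["answer_N"]
turn_n1 = ["sys", "user_N", "answer_N", "user_N+1"]  # thinking removed
assert common_prefix(turn_n_full, turn_n1) == 2      # hit ends after user_N
```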

Regarding the Infrastructure and Optimization:

On a deeper level, our backend consists of multiple clusters maintaining massive cache blocks. Without a specific hint, load balancing might route your follow-up request to a different cluster that doesn’t hold your specific KV cache, leading to “random” cache misses even if your prefix is theoretically correct.

To address this, we support a field called prompt_cache_key (a string) in our OpenAI-compatible API.

  • Mechanism: This acts as a scheduling hint. When multiple requests share the same prompt_cache_key, the gateway prioritizes routing them to the same underlying cluster. This significantly increases the probability of hitting an existing prefix cache pool. Once a block is successfully hit, the cost is automatically discounted.
  • Best Practice: Similar to how kimi-cli handles this, if you are integrating via API, you should generate a unique random string (a Session ID) when a conversation starts. Then, pass this ID in the body of every request in that session:
{
  "messages": [...],
  "prompt_cache_key": "unique_session_id_xyz"
}

This ensures “sticky” routing, maximizing your cache hit rate across the entire session.
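
A sketch of that best practice in Python (the model name and payload shape are placeholders; if you use the official OpenAI Python SDK, a non-standard field like `prompt_cache_key` can be passed via its `extra_body` parameter):

```python
import uuid

# Generate one session ID when the conversation starts, then attach it to
# every request body for that session.
session_id = f"sess-{uuid.uuid4()}"

def build_request(messages, session_id):
    return {
        "model": "kimi-k2-thinking",     # placeholder model name
        "messages": messages,
        "prompt_cache_key": session_id,  # same value for the whole session
    }

req_1 = build_request([{"role": "user", "content": "hello"}], session_id)
req_2 = build_request(req_1["messages"] + [{"role": "user", "content": "again"}],
                      session_id)

# Both requests carry the same key, so the gateway can route them to the
# same cluster and reuse the existing KV cache.
assert req_1["prompt_cache_key"] == req_2["prompt_cache_key"]
```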