It appears caching is not working properly with interleaved thinking.
Just ask it to read the same file 5 times and think briefly between the calls. After the turns end, send a simple message and check the `cached_tokens`.
The `cached_tokens` metric depends on whether the current request’s context exactly matches a prefix of a previous request. When interleaving thinking with tool calls, slight variations in the reasoning content or tool parameters can break the cache alignment.
Here are two key factors to consider for optimization:
1. Tool Call Consistency
Ensure that the tool definitions (names, descriptions, schemas) and input parameters are byte-for-byte identical across requests. Even minor changes in JSON formatting, whitespace, or parameter ordering will invalidate the cache for subsequent turns.
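To illustrate the byte-for-byte requirement, here is a minimal sketch of keeping tool JSON stable on the client side (the `canonical_tools` helper and the `read_file` schema are hypothetical; whether the server itself canonicalizes tool definitions is implementation-dependent):

```python
import json

def canonical_tools(tools):
    """Serialize tool definitions deterministically: sorted keys, no
    extra whitespace. This keeps the serialized prompt prefix identical
    across requests even if the dicts were built in a different order."""
    return json.dumps(tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

# Two logically identical definitions with different key order...
a = [{"name": "read_file",
      "parameters": {"type": "object",
                     "properties": {"path": {"type": "string"}}}}]
b = [{"parameters": {"properties": {"path": {"type": "string"}},
                     "type": "object"},
      "name": "read_file"}]

# ...serialize to the same bytes, so the cached prefix can still match.
assert canonical_tools(a) == canonical_tools(b)
```

If your framework re-serializes tools on every request, pinning the serialization like this (or reusing the exact same string) avoids silent cache invalidation from key reordering.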
2. Preserving reasoning_content
In the K2.5 architecture, prefix caching requires byte-for-byte alignment of the full context, including hidden reasoning_content. When interleaved thinking generates new reasoning between tool calls, this creates a divergent prefix: `cached_tokens` will only extend up to the start of the current message, as the new reasoning block breaks the contiguous matching sequence. To maximize cache hits across turns, you must preserve all historical reasoning_content intact in the context window (see K2 Thinking Model FAQ).
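In practice that means replaying each assistant turn verbatim, reasoning included, when you build the next request. A minimal sketch, assuming an OpenAI-compatible message format (the `extend_history` helper and the exact message shapes are illustrative, not an official API):

```python
def extend_history(history, assistant_msg, tool_msgs):
    """Append an assistant turn and its tool results to the history
    verbatim. Dropping or re-summarizing reasoning_content changes the
    bytes of the prefix and breaks cache matching on the next request."""
    turn = {
        "role": "assistant",
        "content": assistant_msg.get("content"),
        "tool_calls": assistant_msg.get("tool_calls"),
    }
    # Preserve the reasoning block exactly as the API returned it.
    if "reasoning_content" in assistant_msg:
        turn["reasoning_content"] = assistant_msg["reasoning_content"]
    return history + [turn, *tool_msgs]

history = [{"role": "user", "content": "Read config.json five times."}]
assistant = {
    "content": "",
    "reasoning_content": "I should call read_file now.",
    "tool_calls": [{"id": "call_1", "type": "function",
                    "function": {"name": "read_file",
                                 "arguments": '{"path":"config.json"}'}}],
}
tool_result = {"role": "tool", "tool_call_id": "call_1", "content": "{...}"}

history = extend_history(history, assistant, [tool_result])
assert history[1]["reasoning_content"] == "I should call read_file now."
```

The key point is that nothing in the replayed turn is rewritten, trimmed, or re-serialized between requests.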
Hope it helps.
OK, what I don’t understand is that caching increases when multiple tool calls happen in the same turn (by turn I mean a user message plus multiple thinking blocks and tool calls), but it drops when I send a follow-up message.
BTW, it was also reproduced on kimi-cli.
Thanks.