Lesson 06 · memory

When the context fills up, learn what to cut

"An agent that can forget strategically can keep working indefinitely." Strategic forgetting is an engineering capability.

⏱ ~12 min · 📝 3 interactive widgets · 🧑‍💻 Based on shareAI-lab · s06_context_compact.py

Why compact?

When the agent runs for a long time, messages[] keeps growing: each read_file returns thousands of tokens, each bash returns hundreds, and every round of dialogue also carries the model's thinking text. After 50 rounds the context can swell past 100K tokens. Two consequences:

  • Hitting the model's limit: the call fails once the context window is exceeded, and before that, cost grows linearly with every API call.
  • Attention dilution: the task at hand drowns in irrelevant tool_results from 30 rounds ago, and the model starts to lose focus.

The idea behind s06: let the agent actively forget unimportant content while retaining key state. A three-layer mechanism, from light to heavy.

Layer 1 · micro_compact (runs silently every round)

The cheapest tier. It runs before every LLM call and replaces all but the 3 most recent tool_results with placeholders:

# From turn 10 onward, most old tool_results become:
{
  "type": "tool_result",
  "tool_use_id": "toolu_01A",
  "content": "[Previous: used bash]"  # shrunk from thousands of characters to a few dozen
}

One special case: the result of read_file is never compressed. Why? Because read output is reference material; if you squash it, the model has to read the file all over again, which costs more than keeping it.

PRESERVE_RESULT_TOOLS = {"read_file"}  # never compressed
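Putting the two rules together, micro_compact can be sketched as below. This is a hedged reconstruction, not the source code: in the real Anthropic API a tool_result block only carries a tool_use_id, so this sketch assumes each block also stores a tool_name field (in an actual harness you would map the id back to its tool_use block).

```python
KEEP_RECENT = 3                        # last N tool_results stay intact
PRESERVE_RESULT_TOOLS = {"read_file"}  # reference material: never compressed

def micro_compact(messages):
    """Replace old tool_result contents with short placeholders, in place."""
    # Collect every tool_result block, oldest first.
    results = []
    for msg in messages:
        if msg.get("role") != "user" or not isinstance(msg.get("content"), list):
            continue
        for block in msg["content"]:
            if block.get("type") == "tool_result":
                results.append(block)
    # Everything except the most recent KEEP_RECENT gets squashed.
    for block in results[:-KEEP_RECENT]:
        tool = block.get("tool_name", "tool")   # assumption: name stored on block
        if tool in PRESERVE_RESULT_TOOLS:
            continue                            # keep read_file output verbatim
        content = block.get("content", "")
        if isinstance(content, str) and content.startswith("[Previous:"):
            continue                            # already compacted earlier
        block["content"] = f"[Previous: used {tool}]"
    return messages
```

Because already-compacted placeholders are skipped, the pass is idempotent: running it before every LLM call costs almost nothing.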

Watch micro_compact age out old results turn by turn

The widget below simulates 10 rounds of interaction, running micro_compact once before each round. Watch old tool_results in messages[] turn into [Previous: ...] while the last 3 stay intact.

Layer 2 · auto_compact (triggered when threshold is exceeded)

Even with micro running every round, the context still eventually grows past any fixed size. s06 sets a threshold (default 50,000 tokens):

  1. Estimate the token count as len(str(messages)) // 4 (rough but sufficient).
  2. If it exceeds the threshold, write the complete transcript to .transcripts/transcript_TIMESTAMP.jsonl as a backup.
  3. Ask the LLM to write a summary of the entire conversation.
  4. Replace messages with a single "[compressed] SUMMARY..." message.
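The four steps above can be sketched as follows. The threshold value, the transcript naming, and the injected summarize callable are assumptions; summarize stands in for the LLM call in step 3 so the sketch stays self-contained.

```python
import json
import time
from pathlib import Path

COMPACT_THRESHOLD = 50_000  # tokens; assumed default from the lesson


def estimate_tokens(messages):
    # Step 1: rough heuristic, ~4 characters per token.
    return len(str(messages)) // 4


def auto_compact(messages, summarize):
    """Archive and summarize the history once it crosses the threshold.

    `summarize(messages) -> str` stands in for the LLM summary call.
    """
    if estimate_tokens(messages) <= COMPACT_THRESHOLD:
        return messages  # under budget: nothing to do
    # Step 2: write the complete transcript to disk as a backup.
    path = Path(".transcripts") / f"transcript_{int(time.time())}.jsonl"
    path.parent.mkdir(exist_ok=True)
    with path.open("w") as f:
        for msg in messages:
            f.write(json.dumps(msg) + "\n")
    # Steps 3-4: collapse everything into one summary message.
    return [{"role": "user", "content": f"[compressed] {summarize(messages)}"}]
```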

The cost is obvious: specific tool outputs and conversational detail are gone, only an outline remains. But the agent can keep working, and that is the core benefit.

Layer 3 · A compact tool the model calls itself

auto_compact is triggered by the harness; the model never knows it happened. Layer 3 inverts this: give the model a compact tool and let it actively request compression, for example when it decides the earlier exploration is no longer useful and a new phase should begin.

Model call:

tool_use("compact", focus="keep the API design decisions")

The mechanics are the same as auto, but it accepts a focus parameter that tells the summarizer what to emphasize. This is very useful in practice: the model knows which sub-tasks are finished, which is more accurate than the harness's heuristic.
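A minimal sketch of exposing such a tool, assuming the Anthropic-style input_schema tool format; the handler mirrors auto_compact's final step, with the model-supplied focus folded into the summary prompt. The names and wording here are illustrative, not the source's.

```python
# Hypothetical tool definition in Anthropic's tools format.
COMPACT_TOOL = {
    "name": "compact",
    "description": ("Compress the conversation history. Use when earlier "
                    "exploration is no longer needed for the current task."),
    "input_schema": {
        "type": "object",
        "properties": {
            "focus": {
                "type": "string",
                "description": "What the summary should preserve, e.g. "
                               "'keep the API design decisions'",
            }
        },
    },
}


def handle_compact(messages, focus, summarize):
    """Like auto_compact's final step, but the model chose the timing,
    and `focus` steers what the summary keeps.

    `summarize(messages, prompt) -> str` stands in for the LLM call.
    """
    prompt = "Summarize the conversation so far, keeping key state."
    if focus:
        prompt += f" Focus on: {focus}"
    return [{"role": "user",
             "content": f"[compressed] {summarize(messages, prompt)}"}]
```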

Which layer is appropriate? Scenario questions

For each of the scenarios below, decide whether micro / auto / manual is the more reasonable trigger.


Widget 1 · Micro Compact · watch tool_results age out turn by turn

Step through turn by turn and observe how old tool_results are replaced with [Previous: used X] while the last 3 remain intact. read_file is never compressed (highlighted in green).


Widget 2 · Threshold Simulator · which layer triggers as tokens climb?

Drag the slider to change the token count and see which of the three layers is active.


Widget 3 · Which layer fits · 6 scenario questions

Choose micro/auto/manual for each scenario; the answers discuss when each trigger is appropriate.
