Overcoming Context Limits: The Need for Adaptive Memory

The quest for truly intelligent AI agents often hits a wall: the notorious “context window” limitation of large language models (LLMs). While LLMs excel at processing information within a fixed frame, their ability to reason over extended periods, manage complex workflows, or tackle multi-step problems — what we call long-horizon reasoning — remains a significant hurdle. This constraint means that as tasks grow in complexity and require more intermediate steps or past knowledge, critical information can “scroll out” of the model’s active memory, leading to fragmented reasoning and incomplete solutions.
But what if an LLM could intelligently manage its own memory, compressing past experiences into concise summaries, much like a human mind? That is precisely the idea behind the Context-Folding LLM Agent. In this tutorial, we build one that solves long, complex tasks by managing its limited context efficiently. We design the agent to break a large task into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into a concise summary. This preserves essential knowledge while keeping the active memory small.
Traditional LLM applications often struggle with tasks that demand extensive memory or iterative problem-solving. Imagine trying to solve a multi-stage puzzle where you can only remember the last few moves. Crucial context from earlier stages would be lost, making it impossible to connect disparate pieces of information or build on previous results effectively. This challenge is magnified in real-world scenarios requiring detailed project planning, multi-step code generation, or complex data analysis.
The Context-Folding LLM Agent addresses this fundamental limitation by introducing a dynamic memory management strategy. Instead of simply discarding information, it proactively distills past interactions and outcomes into compact, meaningful summaries. This process ensures that the agent’s “active” context remains focused on the immediate task, while a rich, summarized history is always accessible, facilitating robust long-horizon reasoning without overwhelming the model.
Setting Up the Foundation: Environment and Tools
To begin constructing our intelligent agent, we first establish the necessary environment and load a suitable language model. For local and efficient execution, particularly in environments like Google Colab, selecting a lightweight model is crucial. This approach allows us to avoid external API dependencies, ensuring a self-contained and reproducible setup.
import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple
try:
    import transformers
except ImportError:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                    "transformers", "accelerate", "sentencepiece"], check=True)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")
def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
    out = llm(prompt, max_new_tokens=max_new_tokens,
              do_sample=temperature > 0.0, temperature=temperature)[0]["generated_text"]
    return out.strip()
We begin by setting up our environment and loading a lightweight Hugging Face model. We use this model to generate and process text locally, ensuring the agent runs smoothly on Google Colab without any API dependencies.
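As a quick sanity check, you can call llm_gen directly before building anything on top of it. The prompts below are purely illustrative (they are not part of the agent), and the small flan-t5 checkpoint may return terse or imperfect text.

# Optional smoke test for the local pipeline (prompts are illustrative).
print(llm_gen("Summarize in one sentence: context folding keeps old notes short."))
# Passing temperature > 0.0 turns on sampling inside llm_gen.
print(llm_gen("List two colors.", max_new_tokens=16, temperature=0.7))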
Beyond the core LLM, a truly capable agent needs tools to interact with the world, or at least to perform operations beyond pure text generation. A simple calculator tool, for instance, dramatically improves the agent’s accuracy on quantitative tasks, while a sophisticated memory system becomes the bedrock of its long-horizon capabilities.
import ast, operator as op

# Map AST operator nodes to their Python implementations (safe subset only).
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv,
       ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}

def _eval_node(n):
    # Numeric literal (ast.Constant on Python 3.8+).
    if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
        return n.value
    if isinstance(n, ast.UnaryOp) and type(n.op) in OPS:
        return OPS[type(n.op)](_eval_node(n.operand))
    if isinstance(n, ast.BinOp) and type(n.op) in OPS:
        return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
    raise ValueError("Unsafe expression")

def calc(expr: str):
    node = ast.parse(expr, mode='eval').body
    return _eval_node(node)
class FoldingMemory:
    """Keeps a small active context and folds overflow into short summaries."""
    def __init__(self, max_chars: int = 800):
        self.active = []
        self.folds = []
        self.max_chars = max_chars

    def add(self, text: str):
        self.active.append(text.strip())
        # When the active context exceeds the budget, fold the oldest entries.
        while len(self.active_text()) > self.max_chars and len(self.active) > 1:
            popped = self.active.pop(0)
            fold = f"- Folded: {popped[:120]}..."
            self.folds.append(fold)

    def fold_in(self, summary: str):
        self.folds.append(summary.strip())

    def active_text(self) -> str:
        return "\n".join(self.active)

    def folded_text(self) -> str:
        return "\n".join(self.folds)

    def snapshot(self) -> Dict:
        return {"active_chars": len(self.active_text()), "n_folds": len(self.folds)}
We define a simple calculator tool for basic arithmetic and create a memory system that dynamically folds past context into concise summaries. This helps us maintain a manageable active memory while retaining essential information.
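To get a feel for how these pieces behave, here is a small, illustrative check. The expressions, note strings, and the tiny max_chars budget are chosen only to make folding visible; they are not part of the agent itself.

# Quick, illustrative check of both utilities (inputs are arbitrary examples).
print(calc("799.99 + 149.5 + 23.75"))    # 973.24
print(calc("120 * 1.08"))                 # 129.6 (floating point)

mem = FoldingMemory(max_chars=80)          # tiny budget to force folding
mem.add("TASK: plan a study schedule")
mem.add("SUBTASK: outline day 1 with three 90-minute blocks")
mem.add("SUBTASK: outline day 2 with lighter review sessions")
print(mem.active_text())    # only the newest note(s) stay active
print(mem.folded_text())    # older notes become '- Folded: ...' stubs
print(mem.snapshot())       # counts of active chars and folds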
Orchestrating Intelligence: Task Decomposition and Prompt Engineering
The true genius of an LLM agent lies not just in its ability to generate text, but in its strategic approach to problem-solving. This involves breaking down complex tasks into manageable subtasks, executing each subtask with precision, and then intelligently summarizing the outcomes. Structured prompt engineering is the backbone of this orchestrated intelligence, guiding the LLM through each stage of the reasoning process.
By using carefully designed prompt templates, we provide the agent with a clear framework for its operations. These templates ensure that the model understands its role at each step, whether it’s planning, solving, or synthesizing. This structured communication is vital for maintaining coherence and efficiency across long-horizon reasoning tasks.
SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
SUBTASK_SOLVER_PROMPT="""You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final answer>'.
Think briefly; avoid chit-chat.
Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}
Now respond with either CALC(...) or ANSWER: ..."""
SUBTASK_SUMMARY_PROMPT="""Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""
FINAL_SYNTH_PROMPT="""You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.
Task: {task}
Folded summaries:
{folds}
Final answer:"""
def parse_bullets(text: str) -> List[str]:
    return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]
We design prompt templates that guide the agent in decomposing tasks, solving subtasks, and summarizing outcomes. These structured prompts enable clear communication between reasoning steps and the model’s responses.
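To see how a template and parse_bullets fit together, here is a brief walk-through. Note that sample_task and sample_reply are made-up stand-ins rather than real model output; in a real run, plan_prompt would be passed to llm_gen and whatever the model returns would be parsed.

# Illustrative use of the planning template and the bullet parser.
sample_task = "Compute a small project budget and summarize it"
plan_prompt = SUBTASK_DECOMP_PROMPT.format(task=sample_task)
# plan = llm_gen(plan_prompt, max_new_tokens=96)   # real call in the agent

sample_reply = "- Sum the item costs\n- Apply 8% tax and a 5% buffer\n- Write a one-paragraph recommendation"
print(parse_bullets(sample_reply))
# ['Sum the item costs', 'Apply 8% tax and a 5% buffer', 'Write a one-paragraph recommendation']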
The Context-Folding Agent in Action: Iterative Reasoning and Memory Compression
With the prompts and memory system in place, we can now assemble the core logic of our Context-Folding LLM Agent. This involves defining how subtasks are executed, how external tools are invoked, and most critically, how the memory compression mechanism transforms detailed process traces into concise, folded summaries. This iterative process is key to maintaining a coherent understanding across complex, multi-step problems.
Each time a subtask is completed, its outcome isn’t just stored; it is actively processed and summarized. The resulting summary is then “folded” into the agent’s long-term memory, ensuring that valuable insights are retained without expanding the active context window. This elegant solution mimics how humans distill learning from experience, making our LLM agent far more capable of long-horizon reasoning.
def run_subtask(task: str, subtask: str, memory: FoldingMemory, max_tool_iters: int = 3) -> Tuple[str, str, List[str]]:
    notes = memory.folded_text() or "(none)"
    trace = []
    final = ""
    for _ in range(max_tool_iters):
        prompt = SUBTASK_SOLVER_PROMPT.format(task=task, subtask=subtask, notes=notes)
        out = llm_gen(prompt, max_new_tokens=96)
        trace.append(out)
        # If the model requests a calculation, run the calculator tool.
        m = re.search(r"CALC\((.+?)\)", out)
        if m:
            try:
                val = calc(m.group(1))
                trace.append(f"TOOL:CALC -> {val}")
                out2 = llm_gen(prompt + f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.", max_new_tokens=64)
                trace.append(out2)
                if out2.strip().startswith("ANSWER:"):
                    final = out2.split("ANSWER:", 1)[1].strip()
                    break
            except Exception as e:
                trace.append(f"TOOL:CALC ERROR -> {e}")
        if out.strip().startswith("ANSWER:"):
            final = out.split("ANSWER:", 1)[1].strip()
            break
    if not final:
        final = "No definitive answer; partial reasoning:\n" + "\n".join(trace[-2:])
    # Compress the full trace into a few bullets for folding.
    summ = llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask, trace="\n".join(trace), final=final), max_new_tokens=80)
    summary_bullets = "\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
    return final, summary_bullets, trace
class ContextFoldingAgent:
    def __init__(self, max_active_chars: int = 800):
        self.memory = FoldingMemory(max_chars=max_active_chars)
        self.metrics = {"subtasks": 0, "tool_calls": 0, "chars_saved_est": 0}

    def decompose(self, task: str) -> List[str]:
        plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task), max_new_tokens=96)
        subs = parse_bullets(plan)
        return subs[:4] if subs else ["Main solution"]

    def run(self, task: str) -> Dict:
        t0 = time.time()
        self.memory.add(f"TASK: {task}")
        subtasks = self.decompose(task)
        self.metrics["subtasks"] = len(subtasks)
        folded = []
        for st in subtasks:
            self.memory.add(f"SUBTASK: {st}")
            final, fold_summary, trace = run_subtask(task, st, self.memory)
            # Fold the compressed subtask summary into long-term memory.
            self.memory.fold_in(fold_summary)
            folded.append(f"- {st}: {final}")
            self.memory.add(f"SUBTASK_DONE: {st}")
        # Synthesize the final answer from the task plus folded summaries only.
        final = llm_gen(FINAL_SYNTH_PROMPT.format(task=task, folds=self.memory.folded_text()), max_new_tokens=200)
        t1 = time.time()
        return {"task": task,
                "final": final.strip(),
                "folded_summaries": self.memory.folded_text(),
                "active_context_chars": len(self.memory.active_text()),
                "subtask_finals": folded,
                "runtime_sec": round(t1 - t0, 2)}
We implement the agent’s core logic, in which each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding enables the agent to reason iteratively without losing track of prior reasoning.
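If you want to sanity-check the loop before running the full demo, you can exercise a single subtask in isolation. The task and subtask strings below are placeholders, and answer quality depends on the small model.

# Optional: run one subtask by itself and fold its summary into memory.
mem = FoldingMemory(max_chars=700)
final, summary, trace = run_subtask(
    task="Compute a small project budget",
    subtask="Add 799.99, 149.5 and 23.75, then apply 8% tax",
    memory=mem,
)
mem.fold_in(summary)
print("FINAL:", final)
print("SUMMARY:\n" + summary)
print("STEPS TAKEN:", len(trace))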
To truly grasp the capabilities of this system, it’s essential to see it in action. By running the `ContextFoldingAgent` on a variety of sample tasks, we can observe its planning, execution, and synthesis process unfold. These demonstrations highlight how the agent maintains focus on immediate subtasks while leveraging its compressed memory for a holistic understanding of the overarching goal, resulting in coherent and insightful final outputs.
DEMO_TASKS = [
    "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
    "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation.",
]
def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)
if __name__ == "__main__":
    agent = ContextFoldingAgent(max_active_chars=700)
    for i, task in enumerate(DEMO_TASKS, 1):
        print("=" * 70)
        print(f"DEMO #{i}: {task}")
        res = agent.run(task)
        print("\n--- Folded Summaries ---\n" + (res["folded_summaries"] or "(none)"))
        print("\n--- Final Answer ---\n" + res["final"])
        print("\n--- Diagnostics ---")
        diag = {k: res[k] for k in ["active_context_chars", "runtime_sec"]}
        diag["n_subtasks"] = len(agent.decompose(task))
        print(pretty(diag))
We run the agent on sample tasks to observe how it plans, executes, and synthesizes final results. Through these examples, we see the complete context-folding process in action, producing concise and coherent outputs.
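As an optional extra, you can also inspect the memory snapshot after a run to see how compact the active context stayed. This diagnostic is an illustration added here, not part of the demo above.

# Optional diagnostic: how small did the active context stay, and how many folds?
agent = ContextFoldingAgent(max_active_chars=700)
res = agent.run(DEMO_TASKS[1])
print(agent.memory.snapshot())          # active chars plus fold count
print(res["active_context_chars"], "active chars |", res["runtime_sec"], "sec")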
Conclusion
Building an LLM agent capable of long-horizon reasoning with true memory compression is a significant leap forward in AI. The Context-Folding LLM Agent demonstrates a powerful paradigm for managing the inherent limitations of LLM context windows, turning a potential weakness into a strength. By intelligently breaking down tasks, leveraging external tools, and compressing past experiences, this agent offers a robust solution for tackling increasingly complex problems.
This approach isn’t just about technical finesse; it’s about enabling LLMs to mimic a more human-like problem-solving process, where accumulated knowledge informs future actions and decisions. As we continue to push the boundaries of AI, agents that can learn, adapt, and reason over extended periods will become indispensable. The Context-Folding LLM Agent stands as a testament to the power of combining decomposition, tool use, and sophisticated memory management to create efficient and highly capable AI systems.
In conclusion, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.




