Building a Context Pruning Pipeline for Long-Running Agents
Learn how to implement a context pruning pipeline that uses semantic similarity to help AI agents manage conversational memory efficiently.
In this article, you will learn how to implement a context pruning pipeline for long-running AI agents, enabling them to manage conversational memory efficiently through semantic similarity.
Topics we will cover include:
- Why unbounded conversation history is a problem for agents built on top of large language models, and what a context pruning strategy looks like.
- How to use sentence transformer embedding models to compute semantic similarity between a current prompt and archived conversation turns.
- How to assemble a pruned context window from the most recent turn, the top-K semantically relevant past turns, and the current prompt.

Introduction
Modern AI agents built on top of large language models (LLMs) are often designed to run continuously, so their conversation history keeps growing. Passing that entire history into the LLM’s context window on every call is a recipe for prohibitive token costs, higher latency, and eventual degradation in reasoning.
Building a context pruning pipeline can address this issue by dynamically selecting which parts of the conversational memory are sent to the model. This article outlines the basic principles for implementing a context pruning pipeline for long-running agents.
We use a free, locally runnable solution based on open-source embedding models rather than paid APIs. You can swap in a hosted embedding API if that suits your deployment better.
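To make the cost argument concrete, here is a minimal back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens and that the agent resends its context on each call; the figures (200 turns, 150 tokens per turn, a window of 4 turns) are hypothetical and chosen only for illustration.

```python
def cumulative_prompt_tokens(num_turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent across all calls when every call resends the full history."""
    # Call i (1-based) carries i turns, so the total is an arithmetic series.
    return sum(i * tokens_per_turn for i in range(1, num_turns + 1))


def pruned_prompt_tokens(num_turns: int, tokens_per_turn: int, window_turns: int) -> int:
    """Total tokens sent when each call carries at most `window_turns` turns."""
    return sum(min(i, window_turns) * tokens_per_turn for i in range(1, num_turns + 1))


if __name__ == "__main__":
    # Hypothetical workload: 200 turns of roughly 150 tokens each.
    full = cumulative_prompt_tokens(num_turns=200, tokens_per_turn=150)
    pruned = pruned_prompt_tokens(num_turns=200, tokens_per_turn=150, window_turns=4)
    print(f"full history: {full:,} tokens; pruned: {pruned:,} tokens")
```

Under these assumptions, the total for the full history grows quadratically with the number of turns, while the pruned total grows linearly.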
Proposed Memory Strategy
Classical memory strategies in agents rely on a sliding window that drops older messages once they fall outside the window, including potentially critical details. Moving beyond that approach, it is possible to build a smarter, selective pipeline that gives the LLM the parts of the history most relevant to the current request.
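The weakness of a plain sliding window can be shown with a small sketch. The helper name `sliding_window` and the four-message history are illustrative assumptions, not part of any library.

```python
from collections import deque


def sliding_window(history, max_messages):
    """Keep only the last `max_messages` messages, dropping everything older."""
    return list(deque(history, maxlen=max_messages))


history = [
    {"role": "user", "content": "My name is Alice and I work in logistics."},
    {"role": "agent", "content": "Nice to meet you, Alice."},
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "agent", "content": "It's sunny and 75 degrees."},
]

window = sliding_window(history, max_messages=2)

# The user's name was stated early on and is no longer in the window.
name_survives = any("Alice" in message["content"] for message in window)
print(name_survives)  # False
```

Once the window has moved on, nothing can bring the early message back, no matter how relevant it is to a later question.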
In essence, the context can be pruned down to the following basic elements:
- The current prompt, containing the user’s request or question.
- The most recent turn, i.e., the immediately preceding user-agent exchange, which is key to maintaining conversational continuity.
- The top-K semantically relevant matches, calculated based on a similarity score. These are past turns closely related to the current prompt, retrieved through vector embeddings.
Everything in the conversation history that falls outside these three elements is left out of the active prompt’s context, reducing the number of tokens sent on each call.
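The similarity score used to rank past turns is, in this article's implementation, cosine similarity between embedding vectors. A minimal pure-Python sketch of that score follows. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embeddings.
prompt_vec = [1.0, 0.0, 1.0]
candidates = {
    "same direction": [2.0, 0.0, 2.0],
    "partly related": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0],
}

# Rank candidates from most to least similar to the prompt.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(prompt_vec, candidates[name]),
    reverse=True,
)
print(ranked)
```

Because the score depends only on direction, a vector and a scaled copy of it are treated as equally similar to the prompt.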
Simulation-Based Implementation
Our example implementation simulates this strategy, building a pruned context window step by step. A sentence transformer model computes the embeddings, and a mocked conversation history stands in for a long-running agent session.
We start by making the necessary imports:
import numpy as np
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine
Next, we load a pre-trained embedding model, specifically all-MiniLM-L6-v2, through the sentence_transformers library. This model has been trained to transform raw text into embedding vectors that capture semantic characteristics. We also create a simple, simulated agent history containing user-agent interactions (in a real setting, this would be fetched from a database):
# Initialize a lightweight open-source embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# 1. Simulated Agent History (Usually fetched from a database)
chat_history = [
    {"role": "user", "content": "My name is Alice and I work in logistics."},
    {"role": "agent", "content": "Nice to meet you, Alice. How can I help with logistics?"},
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "agent", "content": "It's sunny and 75 degrees."},
    {"role": "user", "content": "I need help calculating route efficiency for my fleet."},
    {"role": "agent", "content": "Route efficiency involves analyzing distance, traffic, and load weight."},
    {"role": "user", "content": "Thanks, that makes sense."},
    {"role": "agent", "content": "You're welcome! Let me know if you need anything else."},
]
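The history above is a flat list of messages, whereas the strategy speaks of turns (a user message plus the agent's reply). If you prefer to retrieve whole exchanges rather than single messages, one option is to pair the messages first. The sketch below uses a hypothetical helper, `group_into_turns`, and assumes strictly alternating user and agent messages.

```python
def group_into_turns(history):
    """Pair each user message with the agent reply that follows it.

    Assumes strictly alternating user/agent messages. A trailing unpaired
    message, if any, is ignored.
    """
    turns = []
    for i in range(0, len(history) - 1, 2):
        user_msg, agent_msg = history[i], history[i + 1]
        if user_msg["role"] != "user" or agent_msg["role"] != "agent":
            raise ValueError(f"expected a user/agent pair at index {i}")
        turns.append((user_msg, agent_msg))
    return turns


sample_history = [
    {"role": "user", "content": "My name is Alice and I work in logistics."},
    {"role": "agent", "content": "Nice to meet you, Alice."},
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "agent", "content": "It's sunny and 75 degrees."},
]

turns = group_into_turns(sample_history)
print(len(turns))  # 2
```

Each pair could then be embedded as a single string, for example by joining the two contents, so that a question and its answer are kept or discarded together.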
The core logic of the context pruning pipeline comes next. It is encapsulated in a prune_context() function that receives the current prompt, the full interaction history, and the number of semantically relevant past messages to retrieve, top_k:
def prune_context(current_prompt, history, top_k=2):
    # If the conversation history is too short, we simply return it
    if len(history) <= 2:
        return history + [{"role": "user", "content": current_prompt}]
    # Extracting the most recent turn (last user/agent pair)
    recent_turn = history[-2:]
    # The rest of the history will be eligible for semantic pruning
    archived_turns = history[:-2]
    # 2. Embedding the current prompt
    prompt_emb = model.encode(current_prompt)
    # 3. Embedding archived turns and computing similarities
    scored_turns = []
    for turn in archived_turns:
        turn_emb = model.encode(turn["content"])
        # We want similarity, so we subtract cosine distance from 1
        similarity = 1 - cosine(prompt_emb, turn_emb)
        scored_turns.append((similarity, turn))
    # 4. Sorting by highest similarity and slicing the Top-K turns
    scored_turns.sort(key=lambda x: x[0], reverse=True)
    top_turns = [turn for _, turn in scored_turns[:top_k]]
    # 5. Assembling the pruned context window
    pruned_context = top_turns + recent_turn + [{"role": "user", "content": current_prompt}]
    return pruned_context
The function first handles the edge case where the history is too short to prune, returning it with the current prompt appended. For longer histories, it separates the most recent turn from the archive, encodes the current prompt and each archived message into vector embeddings, computes cosine similarity scores, and selects the top-K most relevant past messages. These are then combined with the most recent turn and the current prompt to form a compact, semantically informed context window ready to be passed to the LLM.
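One detail worth noting is that prune_context() returns the retrieved messages in similarity order, not in the order they were said. The sketch below is a variant, not the article's function: it restores chronological order and takes the embedder as a parameter. The names `prune_context_ordered`, `toy_embed`, and `VOCAB` are hypothetical, and the word-count embedder is only a stand-in so the example runs without downloading a model. In practice, something like `model.encode` would presumably be passed instead.

```python
import math
import re

# Hypothetical toy vocabulary, used only so the sketch runs without a model.
VOCAB = ["route", "fleet", "weather", "logistics", "name"]


def toy_embed(text):
    """Stand-in embedder: counts of VOCAB words. Not a real semantic embedding."""
    words = re.findall(r"[a-z']+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_context_ordered(current_prompt, history, embed, top_k=2):
    """Like prune_context(), but keeps retrieved messages in chronological order."""
    new_message = {"role": "user", "content": current_prompt}
    if len(history) <= 2:
        return history + [new_message]
    recent, archived = history[-2:], history[:-2]
    prompt_vec = embed(current_prompt)
    scored = [
        (cosine_similarity(prompt_vec, embed(msg["content"])), idx)
        for idx, msg in enumerate(archived)
    ]
    # Pick the top_k positions by similarity, then sort those positions by index.
    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    keep = sorted(idx for _, idx in best)
    return [archived[i] for i in keep] + recent + [new_message]


demo_history = [
    {"role": "user", "content": "My name is Alice and I work in logistics."},
    {"role": "agent", "content": "Nice to meet you, Alice."},
    {"role": "user", "content": "I need help calculating route efficiency for my fleet."},
    {"role": "agent", "content": "Route efficiency involves analyzing distance, traffic, and load weight."},
    {"role": "user", "content": "Thanks, that makes sense."},
    {"role": "agent", "content": "You're welcome!"},
]

pruned = prune_context_ordered("Which route is best?", demo_history, toy_embed, top_k=2)
for message in pruned:
    print(f'{message["role"]}: {message["content"]}')
```

With the toy embedder, the agent's route explanation scores highest, yet the user's earlier route question still appears before it in the output, which keeps each question ahead of its answer.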
This approach lets the agent retain conversational continuity through the most recent exchange while also surfacing historically relevant context. Because at most top_k archived messages are added, the number of messages in the prompt stays bounded however long the history grows, although the token count still depends on how long those messages are. By replacing the embedding model with a higher-capacity alternative or integrating a vector database for archived turn retrieval, the same pipeline can be adapted toward production-grade long-running agents.