Datasets and data flow
When an agent produces a large result — thousands of rows from a query, a computed table, a time series for a chart — it does not paste those rows into the conversation. Instead it registers a dataset: the full data stays as a file in your S3 workspace, and the chat renders a compact preview, chart, or grid from a reference to it. This page explains the dataset-registration model, how chat reads from a dataset without loading all of it, the short-lived dataset store, and how data moves between workflow nodes that do not share a workspace.
Why datasets exist
A model's context window is finite, and dumping a large table into it is both wasteful and unreliable. The dataset model solves this: the agent writes the data to a file in its workspace, then registers that file as a dataset. The registration records lightweight metadata (a name, the data's shape as row and column counts plus column names, a small preview, and the file's location in S3), while the full data never enters the conversation.
The chat then renders from that reference. A table becomes a scrollable grid, a series becomes a chart, and the UI pulls only the rows or columns it needs to display, on demand, directly from S3.
How registration works
After generating a data file in its workspace, the agent registers it as a dataset. The registration captures:
- Name — a human-readable label.
- Type and format — for example a table or a series, stored as CSV, JSON, or similar.
- Shape — row count, column count, and column names.
- A preview — a small number of rows for an at-a-glance view.
- Location — the dataset's key in the S3 workspaces bucket, which is what every later reference resolves against.
The full dataset stays in S3 under the session's workspace prefix. What flows into chat is the reference plus the preview — not the rows.
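To make the fields above concrete, here is a minimal sketch of what a registration record could look like. The class name, field names, helper function, and example S3 key are all hypothetical; the actual registration schema is not documented on this page. The sketch only illustrates that the record holds metadata and a preview, never the full rows.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DatasetRegistration:
    """Hypothetical shape of a registration record (field names are assumptions)."""
    name: str
    kind: str              # e.g. "table" or "series"
    fmt: str               # e.g. "csv" or "json"
    row_count: int
    column_count: int
    columns: list
    preview: list          # a handful of rows only
    s3_key: str            # location that later references resolve against


def build_registration(name, csv_text, s3_key, preview_rows=5):
    """Scan CSV text once, keeping counts and the first few rows as a preview."""
    reader = csv.DictReader(io.StringIO(csv_text))
    preview = []
    row_count = 0
    for row in reader:
        if row_count < preview_rows:
            preview.append(row)
        row_count += 1
    columns = list(reader.fieldnames or [])
    return DatasetRegistration(
        name=name,
        kind="table",
        fmt="csv",
        row_count=row_count,
        column_count=len(columns),
        columns=columns,
        preview=preview,
        s3_key=s3_key,
    )
```

Whatever the real schema looks like, the size of the record depends on the preview length and the number of columns, not on the number of rows in the file.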
How chat renders without loading everything
Because the reference points at the file in S3 and the registration knows the data's shape, the chat can fetch exactly what a view needs:
- Grids request a page of rows at a time (offset and limit), so a million-row table scrolls without ever being fully loaded.
- Charts request only the specific columns being plotted.
The agent's textual answer stays concise — it talks about the data and points at the dataset — while the heavy data is served separately and only as needed. This is what lets an agent "return ten thousand rows" in a conversation that still reads cleanly.
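The two access patterns described above, paging by offset and limit and fetching selected columns, can be sketched as follows. This is an illustration only: the function names are hypothetical, and an in-memory CSV string stands in for the file that the real service reads from S3.

```python
import csv
import io
from itertools import islice


def read_page(csv_text, offset, limit):
    """Return rows [offset, offset + limit) without materializing the rest."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, offset, offset + limit))


def read_columns(csv_text, columns):
    """Return only the named columns, as a mapping of column name to values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    selected = {name: [] for name in columns}
    for row in reader:
        for name in columns:
            selected[name].append(row[name])
    return selected


# A grid asks for one page; a chart asks for the columns it plots.
sample = "day,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
page = read_page(sample, offset=10, limit=5)
series = read_columns(sample, ["value"])
```

In both cases the caller receives a slice sized to the view, which is why the total row count of the dataset does not affect how much data the chat has to hold.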
The session-datasets store and ~6-hour retention
Dataset records live in a dedicated DynamoDB table in your account (the session-datasets store), keyed by the session that produced them. Each record carries a time-to-live of about six hours, after which the metadata record expires automatically.
Two things are worth separating here:
- The dataset record (metadata and reference) is short-lived — roughly six hours.
- The underlying file in the S3 workspace follows the workspace's own lifetime (see How agents run code); it is not deleted when the record expires.
The retention window keeps the research views responsive and bounded to the active working period of a conversation, rather than accumulating indefinitely.
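As a rough illustration of the retention window, the sketch below stamps a record with an expiry time and checks it on read. It assumes the store relies on DynamoDB's built-in TTL feature, which expects a numeric attribute holding a Unix epoch timestamp in seconds. The attribute name `expires_at` and the exact six-hour constant are assumptions, since this page only says "about six hours".

```python
import time

# Assumed value: the page states "about six hours", not an exact figure.
RETENTION_SECONDS = 6 * 60 * 60


def with_ttl(record, now=None):
    """Return a copy of the record with a hypothetical `expires_at` epoch-seconds field."""
    now = int(time.time()) if now is None else int(now)
    return {**record, "expires_at": now + RETENTION_SECONDS}


def is_expired(record, now=None):
    """Treat a record as expired once the current time reaches `expires_at`."""
    now = int(time.time()) if now is None else int(now)
    return now >= record["expires_at"]
```

A read-side check like `is_expired` is a common companion to DynamoDB TTL, because TTL deletion happens in the background and is not instantaneous. Whether the product performs such a check is not stated here.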
Moving data between workflow nodes
In a workflow, each node runs in its own sandbox session with its own workspace; nodes do not share a working directory. When one node produces data that the next node needs, the data has to be moved explicitly.
The mechanism is load_dataset. When an upstream node registers a dataset, its output includes the dataset's S3 location. A downstream node calls load_dataset with that location, which downloads the file from S3 into the downstream node's live working directory. The next piece of code in that node can then open the file directly — for example reading it into a dataframe — as if it had produced the file itself.
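The hand-off can be pictured with a local stand-in. The function below is not the real load_dataset, whose signature is not documented on this page. It is a hypothetical `load_dataset_sketch` that copies a file from a directory playing the role of the S3 bucket into a directory playing the role of the downstream node's working directory.

```python
import csv
import shutil
from pathlib import Path


def load_dataset_sketch(bucket_root, s3_key, workdir):
    """Copy the object at `s3_key` into `workdir` and return the local path.

    `bucket_root` is a local directory standing in for the S3 bucket;
    the real mechanism downloads from S3 instead.
    """
    source = Path(bucket_root) / s3_key
    target = Path(workdir) / Path(s3_key).name
    shutil.copyfile(source, target)
    return target


def read_rows(local_path):
    """Downstream code opens the copied file as if it had produced it itself."""
    with Path(local_path).open(newline="") as handle:
        return list(csv.DictReader(handle))
```

The point of the sketch is the direction of the copy: the upstream workspace is never mounted or shared, and the downstream node works on its own copy of the file.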
This keeps node workspaces isolated by default — which is what preserves per-node and per-user separation — while giving you a clear, explicit way to hand data forward through a pipeline. Because the transfer is a copy from your own S3 into another in-account session, the data never leaves your account.
Where to go next
- How agents run code — the workspace these datasets are written to.
- Connector data in the sandbox — where the data the agent processes comes from.
- Code interpreter and datasets — the end-user view of datasets, charts, and grids.
- Workflows — building multi-node pipelines that pass data with load_dataset.