chunk_text
Split text into smaller chunks for processing with AI models or vector databases.
Overview
This step divides large text documents into manageable chunks, which is essential when working with language models that have token limits or when creating embeddings for semantic search. Chunking is semantic-aware: it respects word boundaries and attempts to break at natural points. You can configure the chunk size (in tokens), the overlap between chunks (for context preservation), and the source/target field paths. Tokenization uses a simple whitespace counter for speed. Overlapping chunks help maintain context across boundaries, which improves retrieval quality in RAG systems.
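As a rough sketch of the behavior described above (whitespace tokenization, fixed-size windows, fractional overlap), the chunking could look like the following. This is an illustrative stand-alone function, not the step's actual implementation, which may break at different points:

```python
def chunk_text(text, chunk_size=400, overlap=0.0):
    """Split text into chunks of at most chunk_size whitespace tokens.

    overlap is a fraction of chunk_size: 0.25 means 25% of each chunk's
    tokens are repeated at the start of the next chunk.
    """
    tokens = text.split()  # simple whitespace tokenizer
    if not tokens:
        return []
    # Each new chunk starts chunk_size minus the overlapping tokens later.
    step = max(1, chunk_size - int(chunk_size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

With 1000 tokens, chunk_size=400, and overlap=0.25, this yields three chunks starting at token 0, 300, and 600, so every boundary is covered twice.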
Quick Start
steps:
- type: chunk_text
  input_from: <string>
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_from | string | Yes | Dot-delimited path to the string value that should be chunked. |
| source_path | string | No | DEPRECATED: use input_from instead. Dot-delimited path to the string value that should be chunked. |
| output_to | string | No | Dot-delimited path where the resulting list of chunk strings will be written. Default: "chunks" |
| target_path | string | No | DEPRECATED: use output_to instead. Dot-delimited path where the chunks will be written. |
| chunk_size | integer | No | Maximum number of whitespace tokens per chunk using the default token counter. Default: 400 |
| overlap | number | No | Chunk overlap as a fraction of chunk size (e.g. 0.25 for 25% overlap). Zero disables overlap. Default: 0 |
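The path parameters are dot-delimited, so output_to: document.chunks writes the list under a nested key. A minimal sketch of how such a path might be resolved (set_path is a hypothetical helper, not part of the tool's API):

```python
def set_path(obj, dotted, value):
    """Write value at a dot-delimited path, creating nested dicts as needed."""
    *parents, leaf = dotted.split(".")
    for key in parents:
        obj = obj.setdefault(key, {})
    obj[leaf] = value

doc = {"document": {"content": "some long text"}}
set_path(doc, "document.chunks", ["chunk one", "chunk two"])
# doc["document"] now holds both "content" and "chunks"
```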
Examples
Basic document chunking
Split a document into 400-token chunks with no overlap
type: chunk_text
input_from: document.content
output_to: document.chunks
chunk_size: 400
Chunking with overlap for context
Use 25% overlap to preserve context between chunks
type: chunk_text
input_from: article.text
output_to: article.chunks
chunk_size: 500
overlap: 0.25
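To make the overlap concrete: with chunk_size 500 and overlap 0.25, each chunk shares 125 tokens with its neighbor, so consecutive chunks start 375 tokens apart. Assuming that stride, the chunk count for a given article length works out as follows (illustrative arithmetic; the implementation's exact count may differ by one at the tail):

```python
import math

chunk_size = 500
overlap = 0.25
overlap_tokens = int(chunk_size * overlap)  # 125 tokens shared between neighbors
stride = chunk_size - overlap_tokens        # each new chunk starts 375 tokens later

# A 1500-token article: first chunk covers tokens 0-499, then one more
# chunk per stride until the end is reached.
article_tokens = 1500
n_chunks = math.ceil((article_tokens - chunk_size) / stride) + 1  # 4 chunks
```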
Small chunks for embedding models
Create 200-token chunks ideal for text-embedding-ada-002
type: chunk_text
input_from: content
output_to: embedding_chunks
chunk_size: 200
overlap: 0.1
Large chunks for GPT-4
Create larger chunks that fit within GPT-4's context window
type: chunk_text
input_from: document.full_text
output_to: document.sections
chunk_size: 2000
overlap: 0.15
No overlap for distinct sections
Split text into completely separate chunks
type: chunk_text
input_from: blog_post
output_to: paragraphs
chunk_size: 300
overlap: 0.0
Advanced Options
These options are available on all steps for error handling and retry logic:
| Parameter | Type | Default | Description |
|---|---|---|---|
| retries | integer | 0 | Number of retry attempts (0-10). |
| backoff_seconds | number | 0 | Backoff in seconds applied between retry attempts. |
| retry_propagate | boolean | false | If true, re-raise the last exception after exhausting retries; otherwise swallow it. |
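A minimal sketch of how these three options might interact, assuming a fixed (non-exponential) backoff; run_with_retries is a hypothetical illustration, not the tool's actual retry machinery:

```python
import time

def run_with_retries(step, retries=0, backoff_seconds=0, retry_propagate=False):
    """Run a step callable, retrying on failure per the options above."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return step()
        except Exception as exc:
            last_exc = exc
            if attempt < retries and backoff_seconds:
                time.sleep(backoff_seconds)  # wait before the next attempt
    if retry_propagate and last_exc is not None:
        raise last_exc  # surface the final failure to the caller
    return None  # failures swallowed when retry_propagate is false
```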