chunk_text
Split text into smaller chunks for processing with AI models or vector databases.
Overview
This step divides large text documents into manageable chunks, which is essential when working with language models that have token limits or when creating embeddings for semantic search. Chunking is semantic-aware: it respects word boundaries and attempts to break at natural points. You can configure the chunk size (in tokens), the overlap between chunks (for context preservation), and the source/target field paths. Tokenization uses a simple whitespace counter for speed. Overlapping chunks help maintain context across boundaries, which improves retrieval quality in RAG systems.
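As a rough sketch of the behavior described above (whitespace tokenization, fixed-size windows, fractional overlap), the chunking could look like the following. This is an illustrative stand-alone function, not the step's actual implementation, which may break at different points:

```python
def chunk_text(text, chunk_size=400, overlap=0.0):
    """Split text into chunks of at most chunk_size whitespace tokens.

    overlap is a fraction of chunk_size: 0.25 means 25% of each chunk's
    tokens are repeated at the start of the next chunk.
    """
    tokens = text.split()  # simple whitespace tokenizer
    if not tokens:
        return []
    # Each new chunk starts chunk_size minus the overlapping tokens later.
    step = max(1, chunk_size - int(chunk_size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

With 1000 tokens, chunk_size=400, and overlap=0.25, this yields three chunks starting at token 0, 300, and 600, so every boundary is covered twice.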
Quick Start
steps:
- type: chunk_text
  input_from: <string>
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_from | string | Yes | Dot-delimited path to the string value that should be chunked. |
| source_path | string | No | DEPRECATED: use input_from instead. Dot-delimited path to the string value that should be chunked. |
| output_to | string | No | Dot-delimited path where the resulting list of chunk strings will be written. Default: "chunks" |
| target_path | string | No | DEPRECATED: use output_to instead. Dot-delimited path where the chunks will be written. |
| chunk_size | integer | No | Maximum number of whitespace tokens per chunk using the default token counter. Default: 400 |
| overlap | number | No | Chunk overlap as a fraction of chunk size (e.g. 0.25 for 25% overlap). Zero disables overlap. Default: 0 |
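The path parameters are dot-delimited, so output_to: document.chunks writes the list under a nested key. A minimal sketch of how such a path might be resolved (set_path is a hypothetical helper, not part of the tool's API):

```python
def set_path(obj, dotted, value):
    """Write value at a dot-delimited path, creating nested dicts as needed."""
    *parents, leaf = dotted.split(".")
    for key in parents:
        obj = obj.setdefault(key, {})
    obj[leaf] = value

doc = {"document": {"content": "some long text"}}
set_path(doc, "document.chunks", ["chunk one", "chunk two"])
# doc["document"] now holds both "content" and "chunks"
```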
Examples
Basic document chunking
Split a document into 400-token chunks with no overlap
type: chunk_text
input_from: document.content
output_to: document.chunks
chunk_size: 400
Chunking with overlap for context
Use 25% overlap to preserve context between chunks
type: chunk_text
input_from: article.text
output_to: article.chunks
chunk_size: 500
overlap: 0.25
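To make the overlap concrete: with chunk_size 500 and overlap 0.25, each chunk shares 125 tokens with its neighbor, so consecutive chunks start 375 tokens apart. Assuming that stride, the chunk count for a given article length works out as follows (illustrative arithmetic; the implementation's exact count may differ by one at the tail):

```python
import math

chunk_size = 500
overlap = 0.25
overlap_tokens = int(chunk_size * overlap)  # 125 tokens shared between neighbors
stride = chunk_size - overlap_tokens        # each new chunk starts 375 tokens later

# A 1500-token article: first chunk covers tokens 0-499, then one more
# chunk per stride until the end is reached.
article_tokens = 1500
n_chunks = math.ceil((article_tokens - chunk_size) / stride) + 1  # 4 chunks
```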
Small chunks for embedding models
Create 200-token chunks ideal for text-embedding-ada-002
type: chunk_text
input_from: content
output_to: embedding_chunks
chunk_size: 200
overlap: 0.1
Large chunks for GPT-4
Create larger chunks that fit within GPT-4's context window
type: chunk_text
input_from: document.full_text
output_to: document.sections
chunk_size: 2000
overlap: 0.15
No overlap for distinct sections
Split text into completely separate chunks
type: chunk_text
input_from: blog_post
output_to: paragraphs
chunk_size: 300
overlap: 0.0
Advanced Options
These options are available on all steps for error handling and retry logic:
| Parameter | Type | Default | Description |
|---|---|---|---|
| retries | integer | 0 | Number of retry attempts (0-10). |
| backoff_seconds | number | 0 | Backoff in seconds applied between retry attempts. |
| retry_propagate | boolean | false | If true, re-raise the last exception after exhausting retries; otherwise swallow it. |
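A minimal sketch of how these three options might interact, assuming a fixed (non-exponential) backoff; run_with_retries is a hypothetical illustration, not the tool's actual retry machinery:

```python
import time

def run_with_retries(step, retries=0, backoff_seconds=0, retry_propagate=False):
    """Run a step callable, retrying on failure per the options above."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return step()
        except Exception as exc:
            last_exc = exc
            if attempt < retries and backoff_seconds:
                time.sleep(backoff_seconds)  # wait before the next attempt
    if retry_propagate and last_exc is not None:
        raise last_exc  # surface the final failure to the caller
    return None  # failures swallowed when retry_propagate is false
```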