step

chunk_text

Split text into smaller chunks for processing with AI models or vector databases.

Overview

This step divides large text documents into manageable chunks, which is essential when working with language models that have token limits or when creating embeddings for semantic search. Chunking is semantic-aware: it respects word boundaries and attempts to break at natural points. You can configure the chunk size (in tokens), the overlap between chunks (for context preservation), and the source and target field paths. Tokenization uses a simple whitespace counter for speed. Overlapping chunks help maintain context across chunk boundaries, which improves retrieval quality in RAG (retrieval-augmented generation) systems.
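
The behavior described above can be sketched in a few lines of Python. This is a hedged illustration, not the step's actual implementation: it assumes tokens are counted by whitespace splitting and that overlap is a fraction of chunk_size (per the Configuration section), and it omits the semantic-aware break-point logic.

```python
def chunk_text(text, chunk_size=400, overlap=0.0):
    """Split text into chunks of at most chunk_size whitespace tokens."""
    tokens = text.split()  # simple whitespace tokenizer, as noted above
    # Each chunk starts chunk_size * (1 - overlap) tokens after the previous one.
    step = max(1, chunk_size - int(chunk_size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With chunk_size 4 and overlap 0.25, for example, each chunk advances by 3 tokens, so neighboring chunks share one token of context.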

Quick Start

steps:
- type: chunk_text
  input_from: <string>

Configuration

Parameter Type Required Description
input_from string Yes Dot-delimited path to the string value that should be chunked.
source_path string No DEPRECATED: use input_from instead. Dot-delimited path to the string value that should be chunked.
output_to string No Dot-delimited path where the resulting list of chunk strings will be written. Default: "chunks"
target_path string No DEPRECATED: use output_to instead. Dot-delimited path where the chunks will be written.
chunk_size integer No Maximum number of whitespace tokens per chunk, using the default token counter. Default: 400
overlap number No Chunk overlap as a fraction of chunk_size (e.g. 0.25 for 25% overlap). Zero disables overlap. Default: 0

Examples

Basic document chunking

Split a document into 400-token chunks with no overlap

type: chunk_text
input_from: document.content
output_to: document.chunks
chunk_size: 400

Chunking with overlap for context

Use 25% overlap to preserve context between chunks

type: chunk_text
input_from: article.text
output_to: article.chunks
chunk_size: 500
overlap: 0.25
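
To see what 25% overlap means in token terms (assuming overlap is computed as a fraction of chunk_size, per the Configuration table): with a chunk size of 500, neighboring chunks share about 125 tokens, and each new chunk advances by about 375.

```python
chunk_size = 500
overlap = 0.25
overlap_tokens = int(chunk_size * overlap)  # tokens shared between neighbors
stride = chunk_size - overlap_tokens        # tokens each new chunk advances by
print(overlap_tokens, stride)               # 125 375
```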

Small chunks for embedding models

Create 200-token chunks ideal for text-embedding-ada-002

type: chunk_text
input_from: content
output_to: embedding_chunks
chunk_size: 200
overlap: 0.1

Large chunks for GPT-4

Create larger chunks that fit within GPT-4's context window

type: chunk_text
input_from: document.full_text
output_to: document.sections
chunk_size: 2000
overlap: 0.15

No overlap for distinct sections

Split text into completely separate chunks

type: chunk_text
input_from: blog_post
output_to: paragraphs
chunk_size: 300
overlap: 0.0

Advanced Options

These options are available on all steps for error handling and retry logic:

Parameter Type Default Description
retries integer 0 Number of retry attempts (0-10)
backoff_seconds number 0 Backoff (seconds) applied between retry attempts
retry_propagate boolean false If true, re-raise the last exception after exhausting retries; otherwise swallow it.
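
A minimal sketch of these retry semantics in Python (the function and argument names are illustrative, not the tool's internals): retry up to retries additional times, wait backoff_seconds between attempts, and either re-raise or swallow the final exception depending on retry_propagate.

```python
import time

def run_with_retries(step, retries=0, backoff_seconds=0, retry_propagate=False):
    """Run a zero-argument callable with the retry options described above."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return step()
        except Exception as exc:
            last_exc = exc
            if attempt < retries and backoff_seconds:
                time.sleep(backoff_seconds)
    if retry_propagate and last_exc is not None:
        raise last_exc  # propagate the last failure
    return None  # otherwise swallow it
```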