Chunking - Scale AI

Dex provides flexible chunking for parsed documents. You can apply chunking in two ways: Dex rechunking (parse once, then rechunk with different strategies) or engine-specific chunking during the initial parse (Reducto only).

IRIS v2 Chunk Structure

IRIS v2 always returns a chunks array in the parse result. chunk_mode: disabled is a Reducto-only concept (soon to be deprecated) that does not exist in IRIS v2. Chunk count and size are driven by how IRIS segments each page, not by a separate chunking enum.

Control	Values	Effect on chunks
`layout`	`"rt_detr_bce"` (default)	One chunk per detected layout region (text, table, image, etc.)
`layout`	`"whole_page"`	One region per page → one chunk per page
`e2e_ocr` + `e2e_response_parser`	e.g. DeepSeek OCR2	Full-page e2e OCR; parser defines region/box structure
Layout filters	`confidence_threshold`, `containment_threshold`, `strict_containment_filter`, table/image thresholds	Fewer or more regions → fewer or more chunks
`img_method`	`"description"`, `"base64"`, `"skip"`	Whether image regions become chunks (`skip` omits image regions)

These options live on IrisParseEngineOptions inside IrisParseJobParams. They are not the same as Dex’s four rechunk strategies below—they control IRIS’s native output only.
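
To make the effect of the layout filters concrete, here is a minimal sketch of how a confidence threshold and an image-skipping option change the number of regions, and therefore chunks. The `Region` class and `filter_regions` helper are hypothetical stand-ins written for this illustration; they are not part of the Dex SDK, and the actual filtering inside IRIS may work differently.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical detected layout region (not an SDK type)."""
    kind: str          # e.g. "text", "table", "image"
    confidence: float  # detector score in [0, 1]


def filter_regions(regions, confidence_threshold, skip_images=False):
    """Keep regions at or above the threshold; optionally drop image regions."""
    kept = []
    for region in regions:
        if region.confidence < confidence_threshold:
            continue
        if skip_images and region.kind == "image":
            continue
        kept.append(region)
    return kept


regions = [
    Region("text", 0.95),
    Region("table", 0.80),
    Region("image", 0.60),
    Region("text", 0.40),
]

# Raising the threshold leaves fewer regions, hence fewer chunks.
print(len(filter_regions(regions, 0.3)))                    # 4
print(len(filter_regions(regions, 0.7)))                    # 2
print(len(filter_regions(regions, 0.3, skip_images=True)))  # 3
```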

Parse with IRIS v2

from dex_sdk.types import (
    ParseEngine,
    IrisParseJobParams,
    IrisParseEngineOptions,
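)

# Assumes `dex_file` is an already-uploaded Dex file handle and that this
# snippet runs inside an async function (it uses `await`).
parse_result = await dex_file.parse(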
    IrisParseJobParams(
        engine=ParseEngine.IRIS,
        options=IrisParseEngineOptions(
            layout="rt_detr_bce",   # default: region-level chunks
            # layout="whole_page",  # alternative: one chunk per page
            text_ocr="openai/gpt-5.4",
        ),
    )
)

Dex Chunking Strategies

Dex offers four chunking strategies that work on any parse result, including IRIS v2 via rechunking. Use these when you need consistent, configurable chunk boundaries across documents or for embeddings/RAG.

Strategy	Description	Best For	Embedding Suitability
`token_size`	Splits by token count using a tokenizer (e.g., tiktoken)	LLM APIs with token limits, cost optimization	Excellent
`recursive`	Recursively splits using separators (paragraphs → sentences → words)	Articles, documentation, RAG systems	Excellent
`by_page`	Splits by page boundaries, grouping complete pages	Legal documents, forms, reports	May be large
`by_section`	Splits by section headers (e.g., markdown #, ##, ###)	Technical manuals, wikis, academic papers	Good
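
The `recursive` row describes a coarse-to-fine fallback: split on paragraphs first, and only fall back to sentences or words when a piece is still too large. The sketch below illustrates that idea in plain Python with character-based sizes. The `recursive_split` function and its separator list are assumptions made for this example, not Dex's implementation, and this simplified version drops the separator at each split point.

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split text into pieces no longer than chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph has two sentences. Here is the other one."
)
for chunk in recursive_split(text, chunk_size=40):
    print(repr(chunk))
```

The first paragraph fits within the limit and stays whole, while the second is too long and is split at the sentence boundary.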

Parse Once, Rechunk as Needed

With IRIS v2, you parse without Dex rechunking (omit chunking_options), then apply Dex strategies to the parse result. This is the recommended pattern: one parse, many chunking experiments.

from dex_sdk.types import IrisParseJobParams, IrisParseEngineOptions, ParseEngine
from dex_core.models.chunking import (
    TokenSizeChunkingOptions,
    RecursiveChunkingOptions,
    PageChunkingOptions,
    SectionChunkingOptions,
)

# Step 1: Parse with IRIS v2 (native region- or page-level chunks)
parse_result = await dex_file.parse(
    IrisParseJobParams(
        engine=ParseEngine.IRIS,
        options=IrisParseEngineOptions(
            layout="whole_page",  # optional: coarser native chunks
        ),
    )
)

# Step 2: Rechunk with Dex strategies
# Token-based (for LLM APIs, embedding models)
rechunked = await parse_result.rechunk(
    TokenSizeChunkingOptions(
        chunk_size=512,
        chunk_overlap=50,
        encoding_name="cl100k_base",
    )
)

# Recursive (preserves paragraphs and sentences)
rechunked = await parse_result.rechunk(
    RecursiveChunkingOptions(
        chunk_size=1000,
        chunk_overlap=200,
    )
)

# Page-based (legal docs, forms)
rechunked = await parse_result.rechunk(
    PageChunkingOptions(pages_per_chunk=1)
)

# Section-based (structured documents with headers)
rechunked = await parse_result.rechunk(
    SectionChunkingOptions(
        section_headers=None,  # Auto-detect from block types
        include_header_in_chunk=True,
    )
)
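
To show how `chunk_size` and `chunk_overlap` interact in the token-based call above, the following sketch builds overlapping windows over a list of token ids. It uses plain integers as stand-in tokens rather than a real tokenizer, and `token_windows` is a hypothetical helper for illustration only; Dex's own token chunker may handle boundaries differently.

```python
def token_windows(tokens, chunk_size, chunk_overlap):
    """Return overlapping windows of tokens. Each window starts
    chunk_size - chunk_overlap tokens after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


tokens = list(range(1200))  # stand-in for 1200 token ids
windows = token_windows(tokens, chunk_size=512, chunk_overlap=50)

print(len(windows))               # 3
print([len(w) for w in windows])  # [512, 512, 276]
print(windows[1][0])              # 462 (stride is 512 - 50)
```

With these settings the last 50 tokens of each window are repeated at the start of the next one, which gives neighbouring chunks shared context at the cost of some duplicated tokens.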

Async Rechunking

For large documents, where rechunking can take a long time, start the rechunk job and poll for completion:

# Start rechunk job (returns immediately)
job = await parse_result.start_rechunk_job(
    TokenSizeChunkingOptions(chunk_size=512, chunk_overlap=50)
)

# Wait for completion and get rechunked result
rechunked = await job.get_rechunked_result()
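
If you need an explicit polling loop with a timeout around a long-running job, the general pattern can be sketched as below. `FakeRechunkJob` and its `is_done()` method are hypothetical stand-ins so the example runs on its own; they are not the Dex SDK job class, whose status API is not described here.

```python
import asyncio


class FakeRechunkJob:
    """Stand-in for a long-running job; NOT the Dex SDK job class."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    async def is_done(self):  # hypothetical status check
        self._remaining -= 1
        return self._remaining <= 0


async def wait_for(job, interval=0.01, timeout=1.0):
    """Poll job.is_done() every `interval` seconds until it returns True.
    Raises TimeoutError once `timeout` seconds have elapsed."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    polls = 0
    while True:
        polls += 1
        if await job.is_done():
            return polls
        if loop.time() >= deadline:
            raise TimeoutError("job did not finish in time")
        await asyncio.sleep(interval)


polls = asyncio.run(wait_for(FakeRechunkJob(polls_until_done=3)))
print(polls)  # 3
```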

Chunking Decision Tree

IRIS v2 native (IrisParseEngineOptions):

Many small, layout-aware chunks → layout="rt_detr_bce" (default)
One chunk per page → layout="whole_page"
Full-page e2e model → e2e_ocr + e2e_response_parser
Fewer regions (fewer chunks) → raise confidence thresholds or tighten containment filters

Dex rechunk (after IRIS v2 parse):

Use token_size when:

Working with LLM APIs that have token limits
Embedding models with specific token limits
Cost optimization

Use recursive when:

General document chunking for RAG
Preserving paragraphs and sentences
Articles, blog posts, documentation

Use by_page when:

Legal documents, forms, reports
Page references matter
Page structure should be preserved

Use by_section when:

Documents have clear section headers
Technical manuals, wikis, academic papers
Semantic coherence within topics
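
As an illustration of why by_section suits documents with clear headers, this sketch starts a new chunk at every markdown header. `split_by_section` is a hypothetical helper written for this example; its `include_header_in_chunk` flag mirrors the option name shown earlier, but the behavior here is an assumption and Dex's header detection (which works from block types) may differ.

```python
import re

# Matches markdown headers of level 1 to 3, e.g. "# Title" or "### Detail".
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_section(markdown, include_header_in_chunk=True):
    """Start a new chunk at each markdown header (#, ##, ###)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        is_header = HEADER_RE.match(line) is not None
        if is_header and current:
            # A new section begins: close the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        if is_header and not include_header_in_chunk:
            continue
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Intro\nWelcome.\n\n## Setup\nInstall it.\n\n## Usage\nRun it."

print(split_by_section(doc))
# ['# Intro\nWelcome.', '## Setup\nInstall it.', '## Usage\nRun it.']
print(split_by_section(doc, include_header_in_chunk=False))
# ['Welcome.', 'Install it.', 'Run it.']
```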

Reducto Parse-Time Chunking (Legacy)

When using the Reducto parse engine (soon to be deprecated), you can chunk during the initial parse instead of rechunking. Reducto’s methods are layout-aware and use document structure.

Method	Chunk Size	Best For	Embedding	Location Tracking
VARIABLE	Auto (optimal)	General use, embeddings	Excellent	Good
BLOCK	Small (~100-500 chars)	Precise locations, UI overlays	Too small	Excellent
SECTION	Medium (~1000-3000 chars)	Structured documents	Good	Good
PAGE	Large (full page)	Page-oriented docs	May be large	Excellent
PAGE_SECTIONS	Medium-Large	Hybrid needs	Good	Good
DISABLED	Very large (entire doc)	Special cases	Too large	Excellent

Recommendation: Use VARIABLE for most cases, especially with embeddings.

# (Legacy) Reducto parse-time chunking
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

# Reducto variable chunking (parse-time)
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
                chunk_size=None,
            )
        ),
    )
)

# Reducto block chunking for precise location tracking
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.BLOCK)
        ),
    )
)

Pattern: Retry with Different Chunking

# (Legacy) Reducto retry pattern
# Try 1: Variable chunking
result1 = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

# If not satisfactory, try block chunking
result2 = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.BLOCK,
            )
        ),
    )
)

Reducto-only decision tree (Legacy)

Use Reducto VARIABLE when:

Still on Reducto and want layout-aware chunking at parse time
General document processing with parser-optimal chunk sizes

Use Reducto BLOCK when:

Need precise bounding box information
Building UI overlays on documents
Not using for embeddings

Use Reducto DISABLED when:

You plan to rechunk with Dex strategies (same pattern as IRIS v2)

What's your primary goal? (Reducto only)
├─ Embeddings/Semantic Search → VARIABLE
├─ Precise bounding boxes/UI → BLOCK
├─ Page-by-page processing → PAGE
└─ Structured document navigation → SECTION

Next Steps

Vector Stores: Add chunks to vector stores for semantic search
Extract: Extract structured data from parse results
Parse: Parse engine options and configuration (IRIS v2)
