This reference documents the Python SDK methods for Scale’s Dex document understanding capability.

DexClient

The main client for interacting with the Dex service.

Project Management

  • create_project(name, configuration) - Create a new project with optional configuration
  • list_projects() - List all accessible projects
  • get_project(project_id) - Retrieve a specific project
  • update_project(project_id, updates) - Update project name, configuration, or status
Example:
import os
from datetime import timedelta
from dex_sdk import DexClient
from dex_sdk.types import ProjectConfiguration, RetentionPolicy

# Initialize client with SGP credentials
dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# Create project (credentials are passed via client initialization)
project = await dex_client.create_project(
    name="My Project",
)

# Create project with data retention policy
project = await dex_client.create_project(
    name="My Compliant Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),
            result_artifacts=timedelta(days=7),
        )
    )
)

# Update project configuration
await dex_client.update_project(
    project_id=project.id,
    updates={
        "configuration": ProjectConfiguration(
            retention=RetentionPolicy(
                files=timedelta(days=90),
                result_artifacts=timedelta(days=30),
            )
        )
    }
)
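
The remaining project management methods follow the same pattern; a minimal sketch reusing the client from the example above:
# List all accessible projects
projects = await dex_client.list_projects()

# Retrieve a specific project by ID
project = await dex_client.get_project(project.id)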

Project

Represents a Dex project with isolated data and credentials.

File Operations

  • upload_file(file_path) - Upload a document to the project
  • list_files() - List all uploaded files
  • get_file(file_id) - Get file metadata
  • download_file(file_id) - Download file content
Example:
# Upload a file
dex_file = await project.upload_file("path/to/document.pdf")

# List all files
files = await project.list_files()
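
get_file() and download_file() round out the file operations; a brief sketch (this reference does not specify the return type of download_file, so treating it as raw bytes is an assumption):
# Get file metadata
file_info = await project.get_file(dex_file.id)

# Download file content (assumed here to be raw bytes)
content = await project.download_file(dex_file.id)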

Vector Store Operations

  • create_vector_store(name, engine, embedding_model) - Create a vector store with the SGP Knowledge Base engine
  • list_vector_stores() - List all vector stores
  • get_vector_store(vector_store_id) - Get vector store details
  • delete_vector_store(vector_store_id) - Delete a vector store
Example:
from dex_sdk.types import VectorStoreEngines

vector_store = await project.create_vector_store(
    name="My Vector Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)
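
The remaining vector store methods operate on the store's ID; a minimal sketch:
# List, retrieve, and delete vector stores
stores = await project.list_vector_stores()
store = await project.get_vector_store(vector_store.id)
await project.delete_vector_store(vector_store.id)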

DexFile

Represents an uploaded file in Dex.

Parsing

  • parse(params) - Parse document to structured format
Example:
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

Working with Parse Results

After parsing, you can access the structured content, including chunks and blocks. Example:
# Parse the file
parse_result = await dex_file.parse(parse_params)

# Access metadata
metadata = parse_result.parse_metadata
print(f"Source: {metadata.filename} ({metadata.pages_processed} pages, engine: {parse_result.engine})")

# Access content chunks
for i, chunk in enumerate(parse_result.content.chunks):
    print(f"\nChunk {i}: {chunk.content[:100]}... ({len(chunk.blocks)} blocks)")

    for block in chunk.blocks:
        print(f"  [{block.type}] Page {block.page_number}, "
              f"confidence: {block.confidence:.2f}, "
              f"pos: ({block.bbox.left:.2f}, {block.bbox.top:.2f})")

ParseResult

Represents the result of a document parsing operation.

Extraction

  • extract(extraction_schema, user_prompt, model, generate_citations, generate_confidence) - Extract structured data from the parsed document
Parameters:
  • extraction_schema (BaseModel): Pydantic model class for extraction (pass the class directly, not model_json_schema())
  • user_prompt (str): Natural language instructions for extraction
  • model (str): LLM model to use (e.g., "openai/gpt-4o")
  • generate_citations (bool): Include source citations in results
  • generate_confidence (bool): Include confidence scores in results
Example:
from pydantic import BaseModel, Field

class InvoiceData(BaseModel):
    invoice_number: str = Field(description="The invoice number")
    total_amount: float = Field(description="Total amount in dollars")
    date: str = Field(description="Invoice date")

extract_result = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details from this document.",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

Working with Extraction Results

After extraction, you can access the structured data, citations, and confidence scores. Example:
# Extract data
extract_result = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details from this document.",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

# Access the extraction result
result = extract_result.result

# Access structured data
for field_name, field in result.data.items():
    print(f"{field_name}: {field.value} (confidence: {field.confidence:.2f})")

    if field.citations:
        for cite in field.citations:
            loc = f", pos: ({cite.bbox.left:.2f}, {cite.bbox.top:.2f})" if cite.bbox else ""
            print(f"  → Page {cite.page}: {cite.content[:50]}...{loc}")

# Access usage information
if result.usage_info:
    usage = result.usage_info
    print(f"\nTokens: {usage.total_tokens} total ({usage.prompt_tokens} prompt + {usage.completion_tokens} completion)")

VectorStore

Represents a vector store for semantic search and RAG-enhanced extraction.

Indexing

  • add_parse_results(parse_result_ids) - Add parsed documents to the vector store by parse result ID
  • remove_files(file_ids) - Remove files from the index
Example:
# Add parsed documents to vector store
await vector_store.add_parse_results([parse_result.id])

# Remove files from vector store
await vector_store.remove_files([file_id])

Search

  • search(query, top_k, filters) - Semantic search across all documents in the vector store
  • search_in_file(file_id, query, top_k, filters) - Search within a specific file with optional filters
Example:
# Search across all documents
results = await vector_store.search(
    query="What is the total revenue?",
    top_k=5,
)

# Search within a specific file
file_results = await vector_store.search_in_file(
    file_id=dex_file.id,
    query="What is the total revenue?",
    top_k=5,
    filters=None,
)
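
Each hit is a VectorStoreSearchResult (see Common Types). This reference does not enumerate its fields, so the names used below (score, content) are illustrative assumptions only:
# Iterate over search hits (field names are assumed, not confirmed)
for hit in results:
    print(f"{hit.score:.2f}: {hit.content[:80]}...")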

Extraction

  • extract(extraction_schema, user_prompt, model, generate_citations, generate_confidence) - Extract structured data from the entire vector store using RAG context
Example:
# Extract from vector store with RAG context
extract_result = await vector_store.extract(
    extraction_schema=FinancialData,
    user_prompt="Extract financial summary from all documents.",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

Parse Job Parameters

When parsing documents, you can specify different engines and options to customize the parsing behavior.

Reducto Parse Parameters

ReductoParseJobParams - Parameters for the Reducto OCR engine. Best for: English and Latin-script documents (Spanish, French, German, Italian, Portuguese, etc.) with tables, figures, and complex layouts.

Fields:
  • engine (ParseEngine): Set to ParseEngine.REDUCTO
  • options (ReductoParseEngineOptions): Parsing options
  • advanced_options (dict): Advanced options for fine-tuning
  • experimental_options (dict): Experimental features
  • priority (bool): Whether to prioritize this job (default: False)
ReductoParseEngineOptions:
  • chunking (ReductoChunkingOptions | None): Chunking configuration
ReductoChunkingOptions:
  • chunk_mode (ReductoChunkingMethod): Chunking method (default: VARIABLE)
    • DISABLED: No chunking
    • BLOCK: Block-level chunks
    • PAGE: Page-level chunks
    • PAGE_SECTIONS: Page sections
    • SECTION: Section-level chunks
    • VARIABLE: Variable-size chunks based on content
  • chunk_size (int | None): Custom chunk size
Example:
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

parse_params = ReductoParseJobParams(
    engine=ParseEngine.REDUCTO,
    options=ReductoParseEngineOptions(
        chunking=ReductoChunkingOptions(
            chunk_mode=ReductoChunkingMethod.VARIABLE,
            chunk_size=None,
        )
    ),
    priority=False,
)

parse_result = await dex_file.parse(parse_params)

Iris Parse Parameters

IrisParseJobParams - Parameters for the Iris OCR engine. Best for: Non-English, non-Latin scripts including Arabic, Hebrew, Chinese (CJK), Japanese, Korean, Thai, Hindi, and other Indic languages.

Fields:
  • engine (ParseEngine): Set to ParseEngine.IRIS
  • options (IrisParseEngineOptions): Parsing options
IrisParseEngineOptions:
  • layout (str | None): Layout detection model to use
  • text_ocr (str | None): Text OCR model to use
  • table_ocr (str | None): Table OCR model to use
  • text_prompt (str | None): Custom prompt for text extraction (VLMs only)
  • table_prompt (str | None): Custom prompt for table extraction (VLMs only)
  • left_to_right (bool | None): Sort regions left-to-right instead of right-to-left (default: False)
  • confidence_threshold (float | None): Minimum confidence threshold for layout detection
  • containment_threshold (float | None): Containment threshold for filtering overlapping boxes
Example:
from dex_sdk.types import (
    ParseEngine,
    IrisParseJobParams,
    IrisParseEngineOptions,
)

parse_params = IrisParseJobParams(
    engine=ParseEngine.IRIS,
    options=IrisParseEngineOptions(
        layout="layout_model_v1",
        text_ocr="text_ocr_v1",
        confidence_threshold=0.5,
    )
)

parse_result = await dex_file.parse(parse_params)

Common Types

This section documents the core data models and types used throughout the Dex SDK.

Type Categories

Importable Types - Types you can import from dex_sdk.types to configure your requests:
  • Configuration types (ProjectConfiguration, RetentionPolicy)
  • Parse parameter types (ReductoParseJobParams, IrisParseJobParams, etc.)
  • Enum types (ParseEngine, ReductoChunkingMethod, VectorStoreEngines)
Response Types - Types returned by the SDK, accessible via the .data attribute on wrapper objects:
  • When you call SDK methods, you get wrapper objects (DexProject, DexFile, DexParseResult, etc.)
  • Access the underlying data via .data: project.data.id, file.data.filename
  • These entities are automatically validated but don’t need to be imported

Configuration Types

ProjectConfiguration

Configuration options for a Dex project.

Import: from dex_sdk.types import ProjectConfiguration

Fields:
  • retention (RetentionPolicy | None): Data retention policy for the project
Example:
from datetime import timedelta
from dex_sdk.types import ProjectConfiguration, RetentionPolicy

config = ProjectConfiguration(
    retention=RetentionPolicy(
        files=timedelta(days=30),
        result_artifacts=timedelta(days=7),
    )
)

RetentionPolicy

Defines data retention periods for automatic cleanup of files and processing artifacts.

Import: from dex_sdk.types import RetentionPolicy

Fields:
  • files (timedelta | None): Retention period for uploaded files. Files older than this period are automatically deleted. If None, files are retained indefinitely.
  • result_artifacts (timedelta | None): Retention period for parse results, extraction results, and job artifacts. If None, artifacts are retained indefinitely.
Example:
from datetime import timedelta
from dex_sdk.types import RetentionPolicy

# 30-day file retention, 7-day artifact retention
policy = RetentionPolicy(
    files=timedelta(days=30),
    result_artifacts=timedelta(days=7),
)

# Keep files indefinitely, but clean up artifacts after 14 days
policy = RetentionPolicy(
    files=None,
    result_artifacts=timedelta(days=14),
)
Use Cases:
  • Compliance: Meet regulatory requirements (GDPR, HIPAA, etc.)
  • Cost Management: Automatically clean up old data to reduce storage costs
  • Security: Limit exposure of sensitive documents by enforcing retention limits
Note: The retention period is calculated from the creation time of the file or artifact. Retention policies can be updated at any time using update_project().

ExtractionParameters

Parameters for extraction operations.

Import: from dex_sdk.types import ExtractionParameters

Fields:
  • model (str): LLM model to use (e.g., "openai/gpt-4o")
  • model_kwargs (dict | None): Additional kwargs for the LLM model
  • extraction_schema (dict): JSON schema defining the desired output structure
  • system_prompt (str | None): High-level instructions for the extraction model
  • user_prompt (str | None): Specific hints about the current document
  • generate_citations (bool): Whether to return bounding boxes for extracted values (default: True)
  • generate_confidence (bool): Whether to return confidence scores (default: True)
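Example (a minimal sketch reusing the InvoiceData model defined earlier; note that this type expects a dict schema, unlike extract(), which takes the model class directly):
from dex_sdk.types import ExtractionParameters

params = ExtractionParameters(
    model="openai/gpt-4o",
    extraction_schema=InvoiceData.model_json_schema(),  # JSON schema as a dict
    user_prompt="Extract invoice details from this document.",
    generate_citations=True,
    generate_confidence=True,
)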

Parse Configuration Types

ParseEngine

Enum of available OCR engines.

Import: from dex_sdk.types import ParseEngine

Values:
  • REDUCTO = "reducto"
  • IRIS = "iris"
  • CUSTOM = "custom"

ReductoParseJobParams

Parameters for the Reducto OCR engine.

Import: from dex_sdk.types import ReductoParseJobParams

See the Parse Job Parameters section for detailed usage.

IrisParseJobParams

Parameters for the Iris OCR engine.

Import: from dex_sdk.types import IrisParseJobParams

See the Parse Job Parameters section for detailed usage.

ReductoChunkingMethod

Enum of chunking methods for the Reducto parser.

Import: from dex_sdk.types import ReductoChunkingMethod

Values:
  • DISABLED = "disabled"
  • BLOCK = "block"
  • PAGE = "page"
  • PAGE_SECTIONS = "page_sections"
  • SECTION = "section"
  • VARIABLE = "variable"

ReductoChunkingOptions

Chunking configuration for the Reducto parser.

Import: from dex_sdk.types import ReductoChunkingOptions

Fields:
  • chunk_mode (ReductoChunkingMethod): Chunking method
  • chunk_size (int | None): Custom chunk size

ReductoParseEngineOptions

Options for the Reducto parser.

Import: from dex_sdk.types import ReductoParseEngineOptions

Fields:
  • chunking (ReductoChunkingOptions | None): Chunking configuration

IrisParseEngineOptions

Options for the Iris parser.

Import: from dex_sdk.types import IrisParseEngineOptions

Fields:
  • layout (str | None): Layout detection model
  • text_ocr (str | None): Text OCR model
  • table_ocr (str | None): Table OCR model
  • text_prompt (str | None): Custom prompt for text extraction
  • table_prompt (str | None): Custom prompt for table extraction
  • left_to_right (bool | None): Sort regions left-to-right
  • confidence_threshold (float | None): Minimum confidence threshold
  • containment_threshold (float | None): Containment threshold for filtering

Vector Store Types

VectorStoreEngines

Enum of available vector store engines.

Import: from dex_sdk.types import VectorStoreEngines

Values:
  • SGP_KNOWLEDGE_BASE = "sgp_knowledge_base"

VectorStoreSearchResult

Result from vector store search operations.

Import: from dex_sdk.types import VectorStoreSearchResult

Response Entity Types

These types are returned by SDK methods and accessed via the .data attribute on wrapper objects. You typically don’t need to import these directly.

Working with Response Data

When you call SDK methods, you receive wrapper objects with a .data attribute:
# Create a project
project = await dex_client.create_project(name="My Project")
print(f"Project: {project.data.name} ({project.data.id}) created at {project.data.created_at}")

# Upload a file
dex_file = await project.upload_file("document.pdf")
print(f"File: {dex_file.data.filename} ({dex_file.data.size_bytes} bytes) → {dex_file.data.id}")

# Parse a file
parse_result = await dex_file.parse(parse_params)
metadata = parse_result.data.parse_metadata
print(f"Parsed: {metadata.pages_processed} pages with {parse_result.data.engine}{parse_result.data.id}")

Common Response Entity Fields

ProjectEntity (accessed via project.data):
  • id (str): Project ID with proj_ prefix
  • name (str): Human-readable project name
  • status (str): Project status ("active" or "archived")
  • configuration (ProjectConfiguration | None): Project configuration
  • created_at (datetime): When the project was created
  • archived_at (datetime | None): When the project was archived
FileEntity (accessed via dex_file.data):
  • id (str): File ID with file_ prefix
  • project_id (str): Project ID that the file belongs to
  • filename (str): Original filename
  • size_bytes (int): File size in bytes
  • mime_type (str): MIME type of the file
  • status (str): Current file status
  • created_at (datetime): When the file was uploaded
ParseResultEntity (accessed via parse_result.data):
  • id (str): Parse result ID with pres_ prefix
  • project_id (str): Project ID
  • source_document_id (str): Source document ID that was parsed
  • engine (str): Engine used for parsing
  • parse_metadata (object): Metadata including filename, pages_processed
  • content (object): Parsed content with chunks
  • created_at (datetime): When the parse result was created
ExtractionEntity (accessed via extract_result or in extraction results):
  • id (str): Extraction result ID
  • source_id (str): Source ID that was extracted from
  • result (object): The extraction result with data and usage_info
  • parameters (ExtractionParameters): Parameters used for extraction
  • created_at (datetime): When the extraction was completed
  • processing_time_ms (int | None): Processing time in milliseconds
VectorStoreEntity (accessed via vector_store.data):
  • id (str): Vector store ID with vs_ prefix
  • project_id (str): Project ID
  • name (str): Name of the vector store
  • engine (str): Engine used for vector store
  • created_at (datetime): When the vector store was created
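Example (a short sketch reading entity fields via .data, consistent with the pattern above):
# Inspect vector store metadata
data = vector_store.data
print(f"Vector store: {data.name} (engine: {data.engine}, id: {data.id})")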

Deprecated Types

The following types are deprecated as of version 0.3.2 and should no longer be used:
  • ProjectCredentials - No longer used; credentials are passed to DexClient constructor
  • SGPCredentials - No longer used; credentials are passed to DexClient constructor
See the Changelog for migration instructions.

Error Handling

The SDK raises exceptions for various error conditions. For detailed troubleshooting guidance, see the Troubleshooting Guide.
from dex_sdk.exceptions import DexException

try:
    parse_result = await dex_file.parse(...)
except DexException as e:
    print(f"Error: {e}")

Async/Await Pattern

The Dex SDK is fully async. Use await with all SDK methods:
import asyncio
import os

from dex_sdk import DexClient

async def main():
    # Initialize client with credentials
    dex_client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )

    project = await dex_client.create_project(name="My Project")
    dex_file = await project.upload_file("document.pdf")
    parse_result = await dex_file.parse(...)
    extract_result = await parse_result.extract(...)

# In Jupyter/IPython, where an event loop is already running, await directly:
# await main()

# In a regular Python script:
asyncio.run(main())

See Also