This guide helps you get started with Dex, Scale’s document understanding service, which extracts accurate, structured information from unstructured documents.

Overview

Dex is Scale’s document understanding capability that provides composable primitives for:
  • File Management - Secure file upload, storage, and retrieval with fine-grained access control
  • Document Parsing - Convert documents (PDF, DOCX, images, etc.) into structured JSON, with a choice of OCR engines
  • Vector Stores - Index and search parsed documents with semantic embeddings
  • Data Extraction - Extract specific information using custom schemas, prompts, and RAG-enhanced context
  • Project Management - Organize and isolate data with proper credential management and authorization

Prerequisites

Before using Dex, ensure you have:
  • ✅ A valid Scale account with SGP (Scale GenAI Platform) access
  • ✅ Your SGP account ID and API key set as environment variables:
export SGP_ACCOUNT_ID="your_account_id"
export SGP_API_KEY="your_api_key"
  • ✅ Python 3.8+ installed
  • ✅ Dex SDK installed (see Installation section)
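The client in this guide reads both credentials with os.getenv, which returns None when a variable is unset. The following is a minimal sketch of a fail-fast check, assuming only the two variable names listed above. The helper require_env is hypothetical and is not part of the Dex SDK.

```python
import os

# Variable names taken from the prerequisites above.
REQUIRED_VARS = ("SGP_ACCOUNT_ID", "SGP_API_KEY")


def require_env(names=REQUIRED_VARS, environ=None):
    """Return {name: value} for each variable, or raise if any is unset or empty.

    Hypothetical helper for illustration; not part of the Dex SDK.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

Calling require_env() before constructing the client surfaces a missing credential as a clear error rather than as a failed request later on.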

Installation

Install Dex SDK from CodeArtifact

With access to Scale CodeArtifact, install the Dex SDK (version 0.3.2 or higher):
pip install sdk/dex_core-xxx.whl sdk/dex_sdk-xxx.whl
This will install all required dependencies including Pydantic.
Note: Version 0.3.2 introduces a new authentication method. Ensure you update to this version or higher. See the Changelog for details.
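Since version 0.3.2 is the minimum for the new authentication method, it can help to confirm what is installed. The sketch below uses only the Python standard library. The distribution name "dex_sdk" is an assumption inferred from the wheel file name, and the simple parser treats a pre-release such as "0.3.2rc1" as 0.3.2.

```python
import re
from importlib import metadata

MIN_VERSION = (0, 3, 2)  # minimum version stated in this guide


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of three ints, ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(text, minimum=MIN_VERSION):
    """True if the version string is at least the minimum tuple."""
    return parse_version(text) >= minimum


def installed_version(dist_name="dex_sdk"):
    """Return the installed version string, or None if the package is absent.

    The default distribution name is an assumption, not confirmed by the docs.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A typical check would call installed_version() and pass a non-None result to meets_minimum().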

Quick Start

1. Initialize Dex Client

import os
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

# Initialize the Dex client with SGP credentials
dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

2. Create a Project

Projects isolate your data and credentials for tracing, billing, and SGP model calls. Every operation is tied to a project.
# Create a project (await must run inside an async function; see Complete Example)
project = await dex_client.create_project(name="My Dex Project")
print(f"Created project: {project.data.id}")
Tip: Keep one project per use case or group of related files for clean traceability.

3. Upload a Document

Upload your document to the project. Dex supports PDFs, images, spreadsheets, and more.
# Upload a file to the project
dex_file = await project.upload_file("path/to/your/document.pdf")
print(f"Uploaded: {dex_file.data.filename} (id: {dex_file.data.id})")
Supported file types: PDF, PNG, JPEG, DOCX, XLSX, CSV, and many more. See Advanced Features for the complete list.
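A quick local check before uploading can catch obvious mistakes such as a wrong path or an unexpected file type. The sketch below covers only the types named in this guide, and it assumes ".jpg" as an alias for JPEG. Because Dex supports more types than are listed here, a False result means "not in this short list", not "unsupported". The helper is_listed_type is hypothetical.

```python
from pathlib import Path

# Extensions for the file types named in this guide.
# ".jpg" is an assumed alias for JPEG; the full list is in Advanced Features.
LISTED_EXTENSIONS = {".pdf", ".png", ".jpeg", ".jpg", ".docx", ".xlsx", ".csv"}


def is_listed_type(path):
    """True if the file extension is one of the types named in this guide."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS
```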

4. Parse the Document

Parsing converts your document into a structured format with text, tables, and layout information.
# Parse the document with Reducto OCR
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)
print(f"Parsed: {parse_result.data.parse_metadata.pages_processed} pages")
Note: Parsing is asynchronous. The SDK automatically polls for completion.
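To illustrate what "polls for completion" means in general, here is a generic polling loop written with asyncio. It is not the SDK's implementation. The status strings, the poll_until_done helper, and the fake job are all assumptions made for this sketch.

```python
import asyncio
import time


async def poll_until_done(check_status, interval=1.0, timeout=60.0):
    """Await check_status() until it reports 'completed' or 'failed'.

    Generic illustration; the status names are assumed, not taken from Dex.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = await check_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        await asyncio.sleep(interval)


def make_fake_job(pending_polls):
    """Stand-in for a remote job: 'pending' N times, then 'completed'."""
    remaining = {"n": pending_polls}

    async def check_status():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return "pending"
        return "completed"

    return check_status
```

For example, asyncio.run(poll_until_done(make_fake_job(2), interval=0.01)) polls three times before returning. With the SDK, none of this is needed, because awaiting parse already waits for the job.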

5. Extract Structured Data

Define a schema and extract specific information from your document.
# Define your extraction schema using Pydantic
class InvoiceData(BaseModel):
    """Schema for invoice extraction"""
    invoice_number: str = Field(description="The invoice number")
    total_amount: float = Field(description="Total amount in dollars")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    vendor_name: str = Field(description="Name of the vendor")

# Extract data with a clear prompt
extract_result = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details from this document.",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

# Access the extracted data
result = extract_result.result
for field_name, field in result.data.items():
    print(f"{field_name}: {field.value} (confidence: {field.confidence:.2f})")
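Per-field confidence scores are most useful when they drive a decision, such as routing uncertain fields to manual review. The sketch below assumes only the .value and .confidence attributes used in the loop above. FieldResult is a stand-in class, the 0.8 threshold is arbitrary, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FieldResult:
    """Stand-in for one extracted field; mirrors .value and .confidence above."""
    value: Any
    confidence: float


def low_confidence_fields(fields: Dict[str, FieldResult], threshold: float = 0.8) -> List[str]:
    """Return the sorted names of fields whose confidence is below the threshold."""
    return sorted(name for name, field in fields.items() if field.confidence < threshold)


# Invented sample data, not real extraction output.
sample = {
    "invoice_number": FieldResult("INV-001", 0.97),
    "total_amount": FieldResult(120.5, 0.62),
    "vendor_name": FieldResult("Acme", 0.79),
}
needs_review = low_confidence_fields(sample)
```

Here needs_review holds the names of the two fields scoring below 0.8. In real use, the dictionary would come from result.data.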

Complete Example

Here’s a complete working example you can copy and run:
import os
import asyncio
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

class InvoiceData(BaseModel):
    invoice_number: str = Field(description="The invoice number")
    total_amount: float = Field(description="Total amount in dollars")

async def main():
    # 1. Initialize client
    dex_client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )

    # 2. Create project
    project = await dex_client.create_project(name="Invoice Processing")

    # 3. Upload document
    dex_file = await project.upload_file("invoice.pdf")

    # 4. Parse document
    parse_result = await dex_file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )

    # 5. Extract data
    extract_result = await parse_result.extract(
        extraction_schema=InvoiceData,
        user_prompt="Extract invoice number and total amount.",
        model="openai/gpt-4o",
        generate_citations=True,
    )

    print(extract_result.result.data)

# Run the example
asyncio.run(main())

Next Steps

Now that you’ve completed the basics, explore these topics:

  • Learn Advanced Features
  • Deep Dive into the API
  • Additional Resources


Common Questions

Q: How do I process multiple documents?
A: Upload multiple files to the same project, parse each one, then optionally use vector stores for cross-document search. See Advanced Features.

Q: Can I use a synchronous client?
A: Yes! Use DexSyncClient from dex_sdk for synchronous operations. See Advanced Features.

Q: How do I configure data retention policies?
A: Set retention policies when creating a project. See Advanced Features.

Q: What OCR engines are available?
A: Reducto (for English and Latin scripts) and Iris (for non-English, non-Latin scripts like Arabic, Hebrew, CJK, Indic languages). See API Reference for details.
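
For the multiple-documents case, one common asyncio pattern is to process files concurrently while capping how many run at once. The sketch below is generic. process_all and fake_process are hypothetical names, and fake_process stands in for the upload and parse calls shown in the Quick Start. The limit of 4 is arbitrary; whether Dex enforces its own rate limits is not covered here.

```python
import asyncio


async def process_all(paths, process_one, max_concurrent=4):
    """Run process_one(path) for every path, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path):
        async with semaphore:
            return await process_one(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(guarded(path) for path in paths))


async def fake_process(path):
    """Placeholder for 'upload then parse'; swap in the real SDK calls."""
    await asyncio.sleep(0.01)
    return path.upper()
```

For example, asyncio.run(process_all(["a.pdf", "b.pdf"], fake_process)) returns one result per input path, in input order.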