Overview

This workflow enables you to automatically extract text content from documents (PDFs, images, etc.) stored in Snowflake stages using Snowflake’s AI_PARSE_DOCUMENT capability with OCR mode and Elementum’s Automation System. The Snowflake AI OCR workflow consists of nine main steps:
  1. Create a Snowflake stage for document files
  2. Create an AI OCR stored procedure in Snowflake
  3. Create a Snowflake view for stage files
  4. Import the stored procedure into Elementum via CloudLink
  5. Import the view as an Elementum table
  6. Build a Data Mine to monitor for new or changed documents
  7. Create an automation triggered by the Data Mine
  8. Process documents using the Run Function action to call your OCR procedure
  9. Add additional actions to work with the extracted text
This workflow leverages Snowflake AI capabilities to extract text from documents without moving your files outside of your data environment. Elementum orchestrates the processing, while the OCR itself runs entirely within Snowflake, keeping your data secure and centralized.

Prerequisites

Before starting this workflow, ensure you have:
  • Snowflake access with permissions to create stages, views, and stored procedures
  • Elementum CloudLink configured and connected to your Snowflake instance
  • Documents uploaded to a Snowflake stage (e.g., PDFs, images)
  • Directory Table enabled on your Snowflake stage for file listing and metadata access
  • Snowflake AI features enabled in your account for AI_PARSE_DOCUMENT functionality
  • Understanding of Elementum Tables, Data Mining, and Automation System

Step 1: Create Snowflake Stage

First, create a Snowflake stage for your documents with encryption enabled and directory table enabled. Execute this SQL in your Snowflake environment:
USE DATABASE YOUR_DATABASE;
USE SCHEMA YOUR_SCHEMA;

-- Create internal stage with directory table and encryption enabled
CREATE OR REPLACE STAGE DOCUMENT_STAGE 
  DIRECTORY = (ENABLE = TRUE) 
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
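To confirm the stage is ready, you can refresh its directory table and query it directly. It will return no rows until files are uploaded:

```sql
-- Refresh the directory table metadata, then list stage contents
ALTER STAGE DOCUMENT_STAGE REFRESH;
SELECT * FROM DIRECTORY(@DOCUMENT_STAGE);
```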

Step 2: Create AI OCR Stored Procedure

Create a stored procedure that uses Snowflake’s AI_PARSE_DOCUMENT function to extract text from documents.
CREATE OR REPLACE PROCEDURE AI_OCR_FROM_STAGE_SP(FILE_PATH STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS OWNER
AS
$$
  var sql = `
    SELECT TO_VARCHAR(
      AI_PARSE_DOCUMENT(
        TO_FILE('@YOUR_DATABASE.YOUR_SCHEMA.DOCUMENT_STAGE', ?),
        OBJECT_CONSTRUCT('mode', 'OCR')
      )
    ) AS response
  `;
  var stmt = snowflake.createStatement({
    sqlText: sql,
    binds: [FILE_PATH]
  });
  var rs = stmt.execute();
  if (rs.next()) {
    return rs.getColumnValue(1); // response
  } else {
    return null;
  }
$$;
  • FILE_PATH: Takes the relative path of the file within the stage
  • TO_FILE(): References the file in the Snowflake stage
  • AI_PARSE_DOCUMENT(): Snowflake’s AI function that processes the document
  • mode: 'OCR': Specifies OCR mode for text extraction
  • Returns: JSON string with extracted content and metadata
The response structure looks like this:
{
  "content": "Extracted text content from the document...",
  "metadata": {
    "pageCount": 1
  }
}
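Downstream steps only need the `content` and `metadata.pageCount` fields. As a minimal sketch (outside any Elementum context, using a hard-coded sample response), extracting them from the returned JSON string looks like this:

```javascript
// Sample of the JSON string the stored procedure returns
const response = '{"content": "Extracted text content from the document...", "metadata": {"pageCount": 1}}';

// Parse the string and pull out the fields used downstream
const parsed = JSON.parse(response);
const text = parsed.content;              // the extracted document text
const pages = parsed.metadata.pageCount;  // number of pages processed

console.log(`Extracted ${pages} page(s): ${text.slice(0, 30)}...`);
```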

Step 3: Create Snowflake View from Stage

Create a Snowflake view that provides access to your stage files with metadata. Execute this SQL in your Snowflake environment:
CREATE OR REPLACE VIEW DOCUMENT_STAGE_VIEW AS 
SELECT RELATIVE_PATH,
       SIZE,
       LAST_MODIFIED,
       MD5
FROM DIRECTORY(@DOCUMENT_STAGE);
  • RELATIVE_PATH: File path within the stage (used to identify files for OCR processing)
  • SIZE: File size in bytes
  • LAST_MODIFIED: Timestamp of last file modification
  • MD5: File hash for integrity checking
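For example, to preview which files the workflow would pick up, you can query the view directly (the `.pdf` filter here is illustrative):

```sql
-- List PDF files in the stage, newest first
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED
FROM DOCUMENT_STAGE_VIEW
WHERE RELATIVE_PATH ILIKE '%.pdf'
ORDER BY LAST_MODIFIED DESC;
```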
Step 4: Import Stored Procedure via CloudLink

Before building your automation, import the stored procedure into Elementum through CloudLink to make it available for use.
  1. Navigate to your CloudLink connection settings
  2. Click on “Functions”
  3. Select the database and schema where your stored procedure is located
  4. Find your AI_OCR_FROM_STAGE_SP stored procedure in the list
  5. Optionally rename it for easier identification in automations
  6. Click “Save” to make it available for use in automations
Once saved, the stored procedure will appear in the Run Function action dropdown when building automations.

Step 5: Import View as Elementum Table

Once your Snowflake view is created, import it into Elementum as a table.
  1. Navigate to Tables → Explore Data → CloudLink
  2. Select your Snowflake connection and choose the view you created
  3. Click “Create Table” and fill out the details

Step 6: Build Data Mine for Document Monitoring

Create a Data Mine to automatically detect when new documents arrive or existing documents change.
  1. In your table, go to Data Mining → Create Data Mine → Logic-Based Rules Mining
  2. Identifying Columns: Select RELATIVE_PATH, LAST_MODIFIED, and MD5
These columns work together to track individual files across Data Mine runs, detect when files are modified or replaced, and ensure accurate state management (ON/OFF transitions).
  3. Matching Criteria: Set filters for file types or conditions (optional; e.g., only .pdf files)
  4. Name and Schedule: Give the Data Mine a name and set its check frequency

Step 7: Create Automation with Data Mine Trigger

Build an automation that processes documents when the Data Mine detects them.
Your automation will follow this logical flow: Data Mine Trigger → Run OCR Function → Process Extracted Text (e.g., store content in a record, trigger AI analysis)
  1. Navigate to Automations → Create Automation
  2. Add Data Mine Trigger and select your Data Mine
  3. Set trigger option to “Trigger when data meets requirement”

Step 8: Process Documents Using Run Function Action

Add a Run Function action to your automation to run OCR on documents using the stored procedure.

Run Function action details:
  • Function: Select your AI_OCR_FROM_STAGE_SP stored procedure from CloudLink
  • Parameters:
    • FILE_PATH: $RELATIVE_PATH (from the Data Mine trigger)
Variable Reference: The $RELATIVE_PATH variable comes from the Data Mine trigger, providing access to all fields from the matching stage file record.
The Run Function action will return a JSON response containing:
  • content: The extracted text content from the document
  • metadata.pageCount: Number of pages processed

Step 9: Work with the OCR Results

After the Run Function action completes, subsequent actions in your automation will have access to the OCR results.
  • Update Record / Create Record: Store the extracted text in an Elementum record for future reference and searchability
  • AI Action: Analyze, summarize, or categorize the extracted text content using your configured AI provider
  • Conditional Logic: Route documents based on extracted content (e.g., if certain keywords are detected, assign to specific team members)
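Conditional Logic is configured in the Elementum UI rather than in code, but the routing rule it expresses amounts to a keyword check. A sketch, with hypothetical keywords and team names:

```javascript
// Illustrative only: route a document to a team based on keywords in its OCR text
function routeDocument(extractedText) {
  const rules = [
    { keyword: 'invoice', assignTo: 'Accounts Payable' },
    { keyword: 'contract', assignTo: 'Legal' },
  ];
  const lower = extractedText.toLowerCase();
  const match = rules.find((r) => lower.includes(r.keyword));
  return match ? match.assignTo : 'General Queue'; // fallback when no rule matches
}
```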

Summary

This workflow provides a powerful way to automatically extract text from documents stored in Snowflake stages:
  1. Snowflake Stage stores your document files with encryption and directory tracking
  2. AI OCR Stored Procedure leverages Snowflake’s AI_PARSE_DOCUMENT for text extraction
  3. Snowflake View makes stage files accessible with metadata
  4. CloudLink Functions imports the stored procedure for use in automations
  5. Elementum Table brings stage file information into your workspace
  6. Data Mine automatically detects new or changed documents
  7. Automation orchestrates the OCR processing workflow
  8. Run Function Action executes the OCR procedure on each document
  9. Additional Actions enable text analysis, storage, and intelligent workflow automation
By following this guide, you can build a robust, automated document processing system that turns your Snowflake stage into an intelligent OCR pipeline, extracting and processing text from documents as they arrive.

Appendix: Complete Quick Setup

Use the following SQL to create a complete OCR processing setup in Snowflake. Replace the ALL_CAPS placeholders with your actual values.

Complete Setup Script

USE DATABASE DATABASE_NAME;
USE SCHEMA SCHEMA_NAME;

-- Create internal stage with directory table and encryption enabled
CREATE OR REPLACE STAGE DOCUMENT_STAGE 
  DIRECTORY = (ENABLE = TRUE) 
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Create AI OCR stored procedure
CREATE OR REPLACE PROCEDURE AI_OCR_FROM_STAGE_SP(FILE_PATH STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS OWNER
AS
$$
  var sql = `
    SELECT TO_VARCHAR(
      AI_PARSE_DOCUMENT(
        TO_FILE('@DATABASE_NAME.SCHEMA_NAME.DOCUMENT_STAGE', ?),
        OBJECT_CONSTRUCT('mode', 'OCR')
      )
    ) AS response
  `;
  var stmt = snowflake.createStatement({
    sqlText: sql,
    binds: [FILE_PATH]
  });
  var rs = stmt.execute();
  if (rs.next()) {
    return rs.getColumnValue(1);
  } else {
    return null;
  }
$$;

-- Create view for stage files
CREATE OR REPLACE VIEW DOCUMENT_STAGE_VIEW AS 
SELECT RELATIVE_PATH,
       SIZE,
       LAST_MODIFIED,
       MD5
FROM DIRECTORY(@DOCUMENT_STAGE);
Upload a test document to verify the stage and OCR processing are working correctly:
-- Using SnowSQL CLI
PUT file://path/to/test-document.pdf @DATABASE_NAME.SCHEMA_NAME.DOCUMENT_STAGE 
    OVERWRITE=TRUE 
    AUTO_COMPRESS=FALSE;
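Directory tables on internal stages are not refreshed automatically after a PUT, so refresh the directory before expecting the new file to appear in DOCUMENT_STAGE_VIEW:

```sql
-- Make the uploaded file visible to DIRECTORY(@DOCUMENT_STAGE) and the view
ALTER STAGE DOCUMENT_STAGE REFRESH;
```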
You can also upload files through the Snowflake web interface by navigating to your stage and using the “Upload Files” option.
Test your stored procedure directly in Snowflake:
CALL AI_OCR_FROM_STAGE_SP('test-document.pdf');
You should receive a JSON response with the extracted text content and metadata.

Additional Resources