Data Ingestion

Overview

Flex Ingest builds on best-in-class ingestion engines such as Meltano, dlt (Data Load Tool), and SlingData to deliver a highly configurable, YAML-driven data ingestion experience: one configuration file can move data from any source to any destination. Flex Ingest also provides custom source and destination connectors for data ingestion and reverse ETL, plus custom adapters that combine high-performance native connectors with a data lake architecture that can be up to 20x faster than standard adapters.

Prior to setting up a Data Ingestion task, you'll need to create a Data Transformation and Orchestration job.


YAML Config Reference

Minimal Config

pipeline_name: my_pipeline
dataset: my_dataset
source: sql_database
destination: snowflake
tables:
  - users
  - orders

Complete Config

pipeline_name: my_pipeline
dataset: MY_DATASET
source: salesforce
destination: snowflake

# Memory management - process tables in batches
batch_size: 2                  # Tables per batch (frees memory between batches)

# Default settings for all tables
defaults:
  primary_key: Id
  incremental_column: SystemModstamp

# Source-specific tuning
tuning:
  bulk_threshold: 1000000      # Auto-bulk for tables > 1M rows
  parallelize: false           # Sequential extraction (safer for memory)
  max_text_length: 4000        # VARCHAR max for formula fields
  text_length_multiplier: 4    # UTF-8 byte safety (4x multiplier)
  preserve_case: true          # Keep camelCase field names
  include_deleted: true        # Capture soft deletes

# dlt options (maps to DLT__* env vars)
options:
  extract:
    workers: 1
  normalize:
    workers: 2
    naming: sql_cs_v1          # Quoted identifiers for case preservation
  load:
    workers: 2

tables:
  - Account
  - Contact
  - name: et4ae5__IndividualEmailResult__c
    bulk: true                 # Force Bulk API for this table

Config Sections

batch_size

Process tables in groups to prevent out-of-memory (OOM) errors on large pipelines.

batch_size: 2 # Process 2 tables, load to destination, free memory, repeat

When to use:

  • Container has limited memory (< 8GB)
  • Loading many large tables
  • Seeing OOM errors

How it works:

  1. Pipeline runs with first N tables
  2. Data loaded to destination
  3. Memory freed
  4. Next batch starts
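
The batching loop above can be sketched in a few lines of Python (an illustrative helper, not the actual Flex Ingest implementation):

```python
# Illustrative sketch of batch_size behavior: tables are processed in
# fixed-size groups so memory can be freed between pipeline passes.
def batches(tables, batch_size):
    """Yield tables in groups of batch_size."""
    for i in range(0, len(tables), batch_size):
        yield tables[i:i + batch_size]

# With batch_size: 2, five tables run as three pipeline passes.
# Each pass extracts its tables, loads them, then frees memory.
runs = list(batches(["Account", "Contact", "Opportunity", "Lead", "Case"], 2))
# runs == [["Account", "Contact"], ["Opportunity", "Lead"], ["Case"]]
```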

defaults

Settings applied to all tables (can be overridden per-table).

defaults:
  primary_key: Id
  incremental_column: SystemModstamp
  write_disposition: merge

When primary_key or incremental_column is set, write_disposition automatically defaults to merge.

tuning

Source-specific extraction settings.

Oracle Tuning

Key         | Description                 | Default | When to Change
parallelize | Extract tables in parallel  | true    | Set false if hitting connection limits
arraysize   | Rows per Oracle fetch batch | 100000  | Lower if OOM, higher if network is slow

Salesforce Tuning

Key                    | Description                           | Default | When to Change
bulk                   | Use Bulk API 2.0 for ALL tables       | false   | Set true for large datasets
bulk_threshold         | Auto-use Bulk API when table > N rows | None    | Set to 1000000 for mixed workloads
chunk_days             | Split bulk queries by date range      | None    | Set to 30-365 for very large tables (10M+)
parallelize            | Extract tables in parallel            | true    | Set false to reduce memory
max_text_length        | VARCHAR max for formula fields        | None    | Set to 4000 for Informatica compatibility
text_length_multiplier | Multiply VARCHAR lengths              | 1       | Set to 4 for UTF-8 byte safety
preserve_case          | Keep original field names             | false   | Set true with naming: sql_cs_v1
include_deleted        | Include soft-deleted records          | true    | Set false to exclude deleted rows

options

Maps directly to dlt configuration. Any nested key becomes DLT__<SECTION>__<KEY>.

options:
  extract:
    workers: 1                 # DLT__EXTRACT__WORKERS=1
    max_parallel_items: 5      # DLT__EXTRACT__MAX_PARALLEL_ITEMS=5
  normalize:
    workers: 2                 # DLT__NORMALIZE__WORKERS=2
    naming: sql_cs_v1          # DLT__NORMALIZE__NAMING=sql_cs_v1
  load:
    workers: 2                 # DLT__LOAD__WORKERS=2
    loader_file_format: jsonl
  data_writer:
    buffer_max_items: 5000     # In-memory rows before flushing to disk
    file_max_items: 100000     # Rows per intermediary file
    file_max_bytes: 10000000
  runtime:
    log_level: CRITICAL        # Silent - only pipeline-level output
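
The key-to-env-var mapping can be sketched as a small Python helper (hypothetical code, shown only to illustrate how nested sections join with double underscores):

```python
# Sketch of how nested options keys become DLT__* environment variables:
# each nesting level is uppercased and joined with a double underscore.
def to_env_vars(options, prefix="DLT"):
    """Flatten a nested options dict into DLT__SECTION__KEY env var names."""
    env = {}
    for key, value in options.items():
        name = f"{prefix}__{key.upper()}"
        if isinstance(value, dict):
            env.update(to_env_vars(value, name))   # recurse into sections
        else:
            env[name] = str(value)                 # env vars are strings
    return env

opts = {"normalize": {"workers": 2, "naming": "sql_cs_v1"}}
env = to_env_vars(opts)
# env == {"DLT__NORMALIZE__WORKERS": "2", "DLT__NORMALIZE__NAMING": "sql_cs_v1"}
```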

Key options reference:

Key                          | Default    | Description
extract.workers              | 5          | Threads for parallel extraction
extract.max_parallel_items   | 20         | Max queued async items during extraction
normalize.workers            | 1          | Processes for normalization (>1 uses multiprocessing)
normalize.naming             | snake_case | Identifier naming convention
load.workers                 | 20         | Threads for parallel loading
load.loader_file_format      | auto       | File format for load jobs (jsonl, parquet)
data_writer.buffer_max_items | 5000       | In-memory rows before flush to disk
runtime.log_level            | WARNING    | dlt log verbosity

Log levels:

Level    | Behavior
DEBUG    | Full dlt internals: schema diffs, file writes, normalization details
INFO     | Standard progress: extract counts, load status, step durations
WARNING  | Warnings only: data issues, retries, non-fatal errors
ERROR    | Errors only
CRITICAL | Silent: only pipeline-level output (table counts, batch timing)

Naming conventions (normalize.naming):

Convention | Case                    | Use When
snake_case | Insensitive (lowercase) | Default: most destinations
sql_ci_v1  | Insensitive (lowercase) | SQL-safe alternative to snake_case
sql_cs_v1  | Sensitive (quoted)      | Preserve Salesforce camelCase field names
direct     | Sensitive               | Pass-through with no transformation

Recommended workers by container memory:

Memory | extract.workers | normalize.workers | load.workers
4GB    | 1               | 1                 | 1
8GB    | 1               | 2                 | 2
16GB+  | 2               | 4                 | 4

Tables Configuration

Flat List

tables:
  - users
  - orders
  - products

Grouped by Schema

For sources with tables in multiple schemas:

tables:
  SCHEMA_A:
    - users
    - orders
  SCHEMA_B:
    - transactions

Per-Table Config

tables:
  - Account                              # Simple replace
  - name: Contact
    primary_key: Id
    incremental_column: SystemModstamp   # Auto-merge
  - name: LargeTable__c
    bulk: true                           # Force Bulk API
    chunk_days: 30                       # Date chunking for 10M+ rows

Column Hints

Override column types, precision, or constraints using dlt's native columns syntax:

tables:
  - name: ps_assignment_type
    columns:
      descrshort:
        data_type: text
        precision: 200         # VARCHAR(200)
      amount:
        data_type: decimal
        precision: 10
        scale: 2               # DECIMAL(10,2)
      event_time:
        data_type: timestamp
        precision: 3           # Milliseconds
        timezone: false        # TIMESTAMP_NTZ
      is_active:
        nullable: false        # NOT NULL constraint

Supported hints:

Hint      | Description                                                                  | Example
data_type | Column type: text, bigint, double, decimal, bool, date, timestamp            | text
precision | Length for text, total digits for decimal, fractional seconds for timestamp  | 200
scale     | Decimal places (for decimal type)                                            | 2
timezone  | Timezone awareness for timestamps                                            | false
nullable  | Allow NULL values                                                            | false
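
As an illustration of how these hints translate into destination column types, here is a hypothetical Python mapper (the real DDL is generated by dlt; this sketch covers only the Snowflake-style types shown above):

```python
# Illustrative mapping from column hints to Snowflake-style column types.
# Hypothetical helper, not part of Flex Ingest or dlt.
def hint_to_type(hint):
    """Render a hint dict as a destination column type string."""
    dt = hint.get("data_type", "text")
    if dt == "text" and "precision" in hint:
        return f"VARCHAR({hint['precision']})"
    if dt == "decimal":
        # precision = total digits, scale = decimal places
        return f"DECIMAL({hint.get('precision', 38)},{hint.get('scale', 0)})"
    if dt == "timestamp":
        # timezone: false produces a TIMESTAMP_NTZ-style type
        tz = "TZ" if hint.get("timezone", True) else "NTZ"
        return f"TIMESTAMP_{tz}({hint.get('precision', 6)})"
    return dt.upper()

hint_to_type({"data_type": "text", "precision": 200})           # VARCHAR(200)
hint_to_type({"data_type": "decimal", "precision": 10, "scale": 2})  # DECIMAL(10,2)
```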

Write Dispositions

Mode    | Behavior              | Use When
replace | Truncate and reload   | Full refresh, small tables
append  | Insert new rows       | Event logs, immutable data
merge   | Upsert by primary key | Incremental sync, mutable data

Auto-detection: When primary_key or incremental_column is set, write_disposition defaults to merge.
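
The auto-detection rule can be sketched as follows (a hypothetical helper; the fallback to append assumes dlt's default write disposition, which the text above does not spell out):

```python
# Sketch of the write_disposition auto-detection rule described above.
def resolve_disposition(cfg):
    """Resolve the effective write disposition for a table config dict."""
    if "write_disposition" in cfg:
        return cfg["write_disposition"]          # explicit setting wins
    if cfg.get("primary_key") or cfg.get("incremental_column"):
        return "merge"                           # auto-merge per the rule above
    return "append"                              # assumed fallback (dlt default)

resolve_disposition({"primary_key": "Id"})       # 'merge'
resolve_disposition({"write_disposition": "replace"})  # 'replace'
```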


Native Oracle Extraction

Features

  • Auto-detected - Uses native path when drivername contains "oracle"
  • Per-table schema - Grouped YAML format supports different schemas
  • Retry logic - Retries on connection failures (not mid-stream, to avoid duplicates)
  • Graceful skip - Missing tables (ORA-00942) logged and skipped
  • Arrow-native - Direct to columnar format, merge-compatible

Configuration

source: sql_database
source_credentials: connections.oracle_prod
tuning:
  arraysize: 100000            # Rows per fetch (tune for memory vs speed)
  parallelize: true            # Parallel table extraction
tables:
  HR:
    - employees
    - departments
  FINANCE:
    - transactions

Native Salesforce Extraction

Features

  • Any Salesforce object - Standard and custom objects (e.g., Academic_Advising__c)
  • Bulk API 2.0 - 10-40x faster for large tables
  • Auto-bulk threshold - Automatically switches API based on row count
  • Arrow-native - Direct to columnar, merge-compatible
  • Date chunking - Splits huge tables into date ranges to reduce memory
  • Timing diagnostics - See exactly where time is spent (API wait vs processing)

API Selection Guide

Table Size     | Recommended     | Config
< 100K rows    | REST            | (default)
100K - 1M rows | Either          | bulk_threshold: 100000
> 1M rows      | Bulk API        | bulk: true or bulk_threshold: 1000000
> 10M rows     | Bulk + chunking | bulk: true + chunk_days: 30
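
The selection logic in this table can be sketched as (illustrative only; the connector's actual decision code is not shown here):

```python
# Sketch of the REST-vs-Bulk decision: bulk: true forces Bulk API 2.0,
# bulk_threshold switches automatically once a table exceeds N rows.
def choose_api(row_count, bulk=False, bulk_threshold=None):
    """Pick the extraction API for a table of row_count rows."""
    if bulk:
        return "bulk"            # bulk: true forces Bulk API for this table
    if bulk_threshold is not None and row_count > bulk_threshold:
        return "bulk"            # auto-bulk past the configured threshold
    return "rest"                # REST API is the default

choose_api(50_000)                                # 'rest'
choose_api(2_000_000, bulk_threshold=1_000_000)   # 'bulk'
```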

Basic Usage

pipeline_name: salesforce
dataset: SALESFORCE_DATA
source: salesforce
destination: snowflake
tables:
  - Account
  - Contact
  - Custom_Object__c           # Custom objects work!

Incremental Sync (Recommended for Production)

pipeline_name: salesforce
dataset: SALESFORCE_DATA
source: salesforce
destination: snowflake
defaults:
  primary_key: Id
  incremental_column: SystemModstamp
tuning:
  bulk_threshold: 1000000      # Auto-bulk for large tables
  include_deleted: true        # Track soft deletes
tables:
  - Account
  - Contact
  - Opportunity
  - name: et4ae5__IndividualEmailResult__c
    bulk: true                 # Force bulk for known large table

How incremental works:

  1. First run: Full extract of all records
  2. Subsequent runs: Only fetches WHERE SystemModstamp > last_loaded_value
  3. Merge: Upserts by Id, updating existing and inserting new records
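
For illustration, a subsequent incremental run issues a SOQL query of roughly this shape (a sketch; the connector builds the real query, including proper datetime literals):

```python
# Illustrative shape of an incremental SOQL query: fetch only rows whose
# cursor field (e.g. SystemModstamp) moved past the last loaded value.
def incremental_query(obj, fields, cursor_field, last_value):
    """Build a SOQL string for a full or incremental extract."""
    soql = f"SELECT {', '.join(fields)} FROM {obj}"
    if last_value is not None:
        # Subsequent runs fetch only rows changed since the last load.
        soql += f" WHERE {cursor_field} > {last_value}"
    return soql

incremental_query("Account", ["Id", "Name"], "SystemModstamp",
                  "2024-01-01T00:00:00Z")
# SELECT Id, Name FROM Account WHERE SystemModstamp > 2024-01-01T00:00:00Z
```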

Key fields:

  • primary_key: Id - Salesforce record ID (always use this)
  • incremental_column: SystemModstamp - updated on any change (preferred over LastModifiedDate)
  • include_deleted: true - captures soft deletes so the IsDeleted flag is synced

Bulk API with Date Chunking

For very large tables (10M+ rows), split into date ranges to reduce memory:

tuning:
  bulk: true
  chunk_days: 30               # Query 30 days at a time
tables:
  - name: ActivityHistory__c
    chunk_field: CreatedDate   # Optional: override chunk field
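
The chunking behavior can be sketched as (hypothetical helper; the dates here are illustrative, and the connector manages the actual chunk boundaries):

```python
# Sketch of chunk_days splitting: one large extract becomes a series of
# date-bounded queries so each chunk fits in memory.
from datetime import date, timedelta

def date_chunks(start, end, chunk_days):
    """Yield (lower, upper) date bounds covering [start, end) in chunk_days steps."""
    while start < end:
        upper = min(start + timedelta(days=chunk_days), end)
        yield (start, upper)     # one bounded query per chunk
        start = upper

chunks = list(date_chunks(date(2024, 1, 1), date(2024, 3, 1), 30))
# Two chunks: 2024-01-01 to 2024-01-31, then 2024-01-31 to 2024-03-01
```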

Timing Diagnostics

The Bulk API path logs detailed timing:

Contact: 117 chunks, wait_sf=1341.6s, parse=39.9s, yield=62.2s

  • wait_sf - time waiting for Salesforce to deliver data (network/API bottleneck)
  • parse - time parsing CSV to Arrow
  • yield - time for dlt to process batches

If wait_sf dominates, the bottleneck is Salesforce. If yield dominates, increase batch_size or reduce normalize.workers.

Case-Preserving DDL (Informatica Compatibility)

tuning:
  max_text_length: 4000
  text_length_multiplier: 4
  preserve_case: true
options:
  normalize:
    naming: sql_cs_v1

Produces:

CREATE TABLE "Account" (
  "Id" VARCHAR(72),
  "Name" VARCHAR(1020),
  "_dlt_load_id" VARCHAR,
  "_dlt_id" VARCHAR
);

Memory Management

Symptoms of Memory Issues

  • Container killed (OOMKilled)
  • "Cannot allocate memory" errors
  • Pipeline hangs during normalize phase

Solutions

1. Reduce batch size:

batch_size: 1 # Process one table at a time

2. Reduce workers:

options:
  extract:
    workers: 1
  normalize:
    workers: 1
  load:
    workers: 1

3. Disable parallelization:

tuning:
  parallelize: false

4. Use date chunking for huge tables:

tuning:
  chunk_days: 30               # Frees memory between chunks

5. Lower Oracle fetch size:

tuning:
  arraysize: 10000             # Down from default 100000

Memory Budget Guide

Container Memory | Recommended Config
4GB              | batch_size: 1, all workers: 1, parallelize: false
8GB              | batch_size: 2, workers: 2, parallelize: false
16GB             | batch_size: 4, workers: 4, parallelize: true

Troubleshooting

"No data loaded" / Empty tables

  • Check source credentials and permissions
  • Verify table/object names (case-sensitive for custom objects)
  • Check for IsDeleted = true filter if include_deleted: false

Slow Salesforce extraction

  • Check timing logs: if wait_sf is high, Salesforce API is the bottleneck
  • Reduce number of fields if possible
  • Use Bulk API for tables > 100K rows
  • Use date chunking for tables > 10M rows

Merge not working / No _dlt_id column

  • Ensure primary_key and incremental_column are set
  • The pipeline auto-enables add_dlt_id for Arrow tables
  • Check that write_disposition resolved to merge (logged at startup)

OOM / Memory errors

See Memory Management section.

Oracle connection timeouts

The pipeline sets reasonable defaults (tcp_connect_timeout: 30 seconds; call_timeout: 600000 ms, i.e. 10 minutes). Override them in tuning if needed:

tuning:
  tcp_connect_timeout: 60
  call_timeout: 1200000

Best Practices

For First-Time Users

  1. Start small - Test with one small table first
  2. Use incremental - Set primary_key and incremental_column for production
  3. Monitor memory - Start with conservative settings, increase gradually
  4. Check timing logs - Understand where time is spent before optimizing

For Production

  1. Always use incremental for mutable data (not append-only logs)
  2. Set include_deleted: true for accurate state sync
  3. Use bulk_threshold instead of bulk: true for mixed table sizes
  4. Set batch_size based on container memory
  5. Use named connections for environment separation

Performance Optimization

  1. Salesforce: Use Bulk API for anything > 100K rows
  2. Oracle: Increase arraysize if memory allows
  3. General: Increase workers if CPU is underutilized
  4. Snowflake: Use a larger warehouse for faster COPY (scale up during load)