Stream Ingestion

Data Migration Basics

Moving data between systems (e.g., SQL Server → Snowflake) is rarely seamless
Schema differences almost always exist
Best practice:
Test with sample data first
Validate schema compatibility before full migration
Bulk transfer is preferred over row-by-row transfer
Use object storage (S3, GCS) as an intermediate layer
Biggest challenge is often pipeline migration, not just data movement
Use automated tools whenever possible

Example Migrating a user table:

Source: VARCHAR(255)
Target: TEXT
Edge case: encoding differences → corrupted strings if not tested

Streaming Ingestion: Core Challenges

Schema Evolution

Event schemas change over time:
New fields added
Fields removed
Data types changed
Risks:
Downstream pipeline failures
Silent data corruption

Mitigation

Use schema registry with versioning
Maintain communication with upstream teams
Use dead-letter queues for failures

Example IoT event:

{ "temp": 22 }

After firmware update:

{ "temp": 22, "humidity": 60 }

Older consumers may break if not handled

Late-Arriving Data

Event time ≠ ingestion time
Causes:
Network delays
Device offline periods
Problem:
Incorrect analytics if using ingestion time

Solution

Define cutoff window (e.g., accept events within 10 minutes delay)

Example

Sensor sends reading at 10:00
Arrives at 10:05
Without correction → wrong time-series analysis

Ordering and Multiple Delivery

Distributed systems →
Messages may arrive out of order
Messages may be duplicated
Delivery semantics:
At-least-once (duplicates possible)

Example

Payment event processed twice → double charge if not idempotent

Replay

Ability to reprocess historical data
Supported by systems like Kafka, Kinesis

Use case

Bug fix → reprocess last 24 hours of events

Time to Live (TTL)

Defines how long events are retained

Trade-offs:

Too short → data loss
Too long → backlog and latency

Examples

Pub/Sub: up to 7 days
Kinesis: up to 365 days
Kafka: configurable (even indefinite)

Message Size Constraints

Systems limit event size

Examples:

Kinesis: 1 MB
Kafka: default 1 MB (configurable higher)

Design implication

Large payloads → store in object storage, send reference

Error Handling and Dead-Letter Queues

Failed events must not block pipeline
Solution: Dead-letter queue (DLQ)
Store problematic events separately
“Good” events → consumer
“Bad” events → DLQ Src: blog.algomaster.io

Example

Invalid JSON → routed to DLQ
Later fixed and reprocessed

Push vs Pull Consumption

Pull:
Consumer fetches data (Kafka, Kinesis)
Push:
System sends data to consumer (Pub/Sub, RabbitMQ)

Trade-off

Pull = more control
Push = simpler integration

Location and Latency

Ingest close to data source → lower latency

Trade-off:

Cross-region data transfer cost

Example

IoT data processed in same region vs sent globally

Ways to Ingest Data

Direct Database Connections (JDBC/ODBC)

Src: networkencyclopedia.com

Key ideas:

JDBC → JVM-based portability
ODBC → OS-dependent drivers

Limitations:

Poor support for nested data
Row-based transfer inefficient

Optimization

Parallel queries (but increases DB load)

Change Data Capture (CDC)

Batch CDC

Uses timestamps (e.g., updated_at)

Limitation:

Only captures latest state

Example

Bank account updated 5 times → only final balance captured

Continuous CDC

Captures every change as an event
Uses database logs

Example

PostgreSQL WAL → streamed to Kafka

CDC vs Replication

Synchronous replication:
Strong consistency
Used for read replicas
Asynchronous CDC:
Flexible
Supports analytics pipelines

APIs as Data Sources

No standardization → high engineering effort

Trends:

API client libraries
Managed connectors
Data sharing platforms

Advice

Avoid building connectors from scratch unless necessary

Message Queues vs Streams

Message queue:
Transient
Event disappears after consumption
Stream:
Persistent log
Supports replay
As seen in diagram on page 12:
Multiple producers → consumers → recombined → new stream

Managed Data Connectors

Prebuilt connectors (SaaS or OSS)

Capabilities:

CDC
Scheduling
Monitoring

Key idea

Avoid reinventing ingestion plumbing

Object Storage as Ingestion Layer

Highly scalable
Secure
Supports any file type

Use cases

Intermediate storage
Cross-team sharing

File Formats

CSV (Problematic)

No schema
Ambiguous delimiters
No support for nested data

Modern Formats

Parquet, Avro, ORC, Arrow

Advantages:

Schema included
Efficient
Supports nested data

Example

JSON stored as nested structure in Parquet

Legacy and Practical Methods

EDI (Email, Flash Drives)

Still used in real-world systems

Improvement

Automate ingestion from email → object storage

Shell and CLI

Used for scripting ingestion workflows

SSH, SCP, SFTP

Secure data transfer methods
Common in enterprise pipelines

Webhooks (Reverse APIs)

Src: bannerbear.com

Data provider sends events
Consumer must expose endpoint
Webhook → serverless → stream → storage

Challenges:

Hard to maintain
Requires robust architecture

Web Scraping

Extract data from websites

Challenges:

Legal risks
Fragile HTML structure
Maintenance overhead

Large-Scale Data Transfer

For 100TB+:
Use physical devices (Snowball, Snowmobile)

Insight

“Truck > Internet” at massive scale

Access data without copying
Read-only datasets from providers

Trade-off:

No ownership
Access can be revoked

Key Takeaways

Ingestion is not just data movement—it impacts:
Schema
Storage
Processing
Trade-offs exist everywhere:
Latency vs cost
Flexibility vs complexity
Prefer:
Managed solutions
Object storage
Streaming systems with replay
Always design for:
Failures
Schema changes
Scalability

References

Chapter 7 Fundamentals of Data Engineering

Stream Ingestion

Data Migration Basics

Streaming Ingestion: Core Challenges

Schema Evolution

Late-Arriving Data

Ordering and Multiple Delivery

Replay

Time to Live (TTL)

Message Size Constraints

Error Handling and Dead-Letter Queues

Push vs Pull Consumption

Location and Latency

Ways to Ingest Data

Direct Database Connections (JDBC/ODBC)

Change Data Capture (CDC)

Batch CDC

Continuous CDC

CDC vs Replication

APIs as Data Sources

Message Queues vs Streams

Managed Data Connectors

Object Storage as Ingestion Layer

File Formats

CSV (Problematic)

Modern Formats

Legacy and Practical Methods

EDI (Email, Flash Drives)

Shell and CLI

SSH, SCP, SFTP

Webhooks (Reverse APIs)

Web Scraping

Large-Scale Data Transfer

Data Sharing

Key Takeaways

References