Skip to content

Stream Ingestion

Data Migration Basics

  • Moving data between systems (e.g., SQL Server → Snowflake) is rarely seamless

  • Schema differences almost always exist

  • Best practice:

  • Test with sample data first

  • Validate schema compatibility before full migration

  • Bulk transfer is preferred over row-by-row transfer

  • Use object storage (S3, GCS) as an intermediate layer

  • Biggest challenge is often pipeline migration, not just data movement

  • Use automated tools whenever possible

Example Migrating a user table:

  • Source: VARCHAR(255)
  • Target: TEXT
  • Edge case: encoding differences → corrupted strings if not tested

Streaming Ingestion: Core Challenges

Schema Evolution

  • Event schemas change over time:

  • New fields added

  • Fields removed
  • Data types changed

  • Risks:

  • Downstream pipeline failures

  • Silent data corruption

Mitigation

  • Use schema registry with versioning
  • Maintain communication with upstream teams
  • Use dead-letter queues for failures

Example IoT event:

{ "temp": 22 }

After firmware update:

{ "temp": 22, "humidity": 60 }

Older consumers may break if not handled


Late-Arriving Data

  • Event time ≠ ingestion time

  • Causes:

  • Network delays

  • Device offline periods

  • Problem:

  • Incorrect analytics if using ingestion time

Solution

  • Define cutoff window (e.g., accept events within 10 minutes delay)

Example

  • Sensor sends reading at 10:00
  • Arrives at 10:05
  • Without correction → wrong time-series analysis

Ordering and Multiple Delivery

  • Distributed systems →

  • Messages may arrive out of order

  • Messages may be duplicated

  • Delivery semantics:

  • At-least-once (duplicates possible)

Example

  • Payment event processed twice → double charge if not idempotent

Replay

  • Ability to reprocess historical data

  • Supported by systems like Kafka, Kinesis

Use case

  • Bug fix → reprocess last 24 hours of events

Time to Live (TTL)

  • Defines how long events are retained

Trade-offs:

  • Too short → data loss
  • Too long → backlog and latency

Examples

  • Pub/Sub: up to 7 days
  • Kinesis: up to 365 days
  • Kafka: configurable (even indefinite)

Message Size Constraints

  • Systems limit event size

Examples:

  • Kinesis: 1 MB
  • Kafka: default 1 MB (configurable higher)

Design implication

  • Large payloads → store in object storage, send reference

Error Handling and Dead-Letter Queues

  • Failed events must not block pipeline

  • Solution: Dead-letter queue (DLQ)

  • Store problematic events separately

  • “Good” events → consumer

  • “Bad” events → DLQ Src: blog.algomaster.io

Example

  • Invalid JSON → routed to DLQ
  • Later fixed and reprocessed

Push vs Pull Consumption

  • Pull:

  • Consumer fetches data (Kafka, Kinesis)

  • Push:

  • System sends data to consumer (Pub/Sub, RabbitMQ)

Trade-off

  • Pull = more control
  • Push = simpler integration

Location and Latency

  • Ingest close to data source → lower latency

Trade-off:

  • Cross-region data transfer cost

Example

  • IoT data processed in same region vs sent globally

Ways to Ingest Data

Direct Database Connections (JDBC/ODBC)

Src: networkencyclopedia.com

Key ideas:

  • JDBC → JVM-based portability
  • ODBC → OS-dependent drivers

Limitations:

  • Poor support for nested data
  • Row-based transfer inefficient

Optimization

  • Parallel queries (but increases DB load)

Change Data Capture (CDC)


Batch CDC

  • Uses timestamps (e.g., updated_at)

Limitation:

  • Only captures latest state

Example

  • Bank account updated 5 times → only final balance captured

Continuous CDC

  • Captures every change as an event

  • Uses database logs

Example

  • PostgreSQL WAL → streamed to Kafka

CDC vs Replication

  • Synchronous replication:

  • Strong consistency

  • Used for read replicas

  • Asynchronous CDC:

  • Flexible

  • Supports analytics pipelines

APIs as Data Sources

  • No standardization → high engineering effort

Trends:

  • API client libraries
  • Managed connectors
  • Data sharing platforms

Advice

  • Avoid building connectors from scratch unless necessary

Message Queues vs Streams

  • Message queue:

  • Transient

  • Event disappears after consumption

  • Stream:

  • Persistent log

  • Supports replay

  • As seen in diagram on page 12:

  • Multiple producers → consumers → recombined → new stream


Managed Data Connectors

  • Prebuilt connectors (SaaS or OSS)

Capabilities:

  • CDC
  • Scheduling
  • Monitoring

Key idea

  • Avoid reinventing ingestion plumbing

Object Storage as Ingestion Layer

  • Highly scalable
  • Secure
  • Supports any file type

Use cases

  • Intermediate storage
  • Cross-team sharing

File Formats


CSV (Problematic)

  • No schema
  • Ambiguous delimiters
  • No support for nested data

Modern Formats

  • Parquet, Avro, ORC, Arrow

Advantages:

  • Schema included
  • Efficient
  • Supports nested data

Example

  • JSON stored as nested structure in Parquet

Legacy and Practical Methods


EDI (Email, Flash Drives)

  • Still used in real-world systems

Improvement

  • Automate ingestion from email → object storage

Shell and CLI

  • Used for scripting ingestion workflows

SSH, SCP, SFTP

  • Secure data transfer methods

  • Common in enterprise pipelines


Webhooks (Reverse APIs)


Src: bannerbear.com

  • Data provider sends events

  • Consumer must expose endpoint

  • Webhook → serverless → stream → storage

Challenges:

  • Hard to maintain
  • Requires robust architecture

Web Scraping

  • Extract data from websites

Challenges:

  • Legal risks
  • Fragile HTML structure
  • Maintenance overhead

Large-Scale Data Transfer

  • For 100TB+:

  • Use physical devices (Snowball, Snowmobile)

Insight

  • “Truck > Internet” at massive scale

Data Sharing

  • Access data without copying

  • Read-only datasets from providers

Trade-off:

  • No ownership
  • Access can be revoked

Key Takeaways

  • Ingestion is not just data movement—it impacts:

  • Schema

  • Storage
  • Processing

  • Trade-offs exist everywhere:

  • Latency vs cost

  • Flexibility vs complexity

  • Prefer:

  • Managed solutions

  • Object storage
  • Streaming systems with replay

  • Always design for:

  • Failures

  • Schema changes
  • Scalability

References

  1. Chapter 7 Fundamentals of Data Engineering