Schema Evolution
Introduction
Schema evolution is a core feature of sparksneeze that allows automatic adaptation of target schemas as source data structure changes over time. This enables robust data pipelines that can handle evolving data sources without manual intervention.
All strategies that support schema evolution provide two key parameters:
auto_expand(default: True) - Automatically add new columns from source to targetauto_shrink(default: False) - Automatically remove columns from target that don’t exist in source
How Schema Evolution Works
Schema evolution occurs during strategy execution in the following order:
Source Analysis: Compare source schema with existing target schema
Type Compatibility: Verify column types are compatible or can be safely cast
Schema Expansion: Add missing columns to target if
auto_expand=TrueSchema Shrinkage: Remove extra columns from target if
auto_shrink=TrueData Alignment: Align source data with evolved target schema
Metadata Application: Apply metadata with appropriate hash calculations
Auto-Expand Behavior
When auto_expand=True (default behavior):
# Day 1: Target has columns [id, name, age]
# Day 2: Source has columns [id, name, age, email, phone]
# Result: Target schema expands to [id, name, age, email, phone]
# Existing data gets NULL values for new columns
# New data populates all columns normally
Key Points: - New columns are added to target schema - Existing records get NULL values for new columns - Existing metadata (including hashes) remains unchanged - New records get metadata calculated on full expanded schema
Auto-Shrink Behavior
When auto_shrink=True:
# Day 1: Target has columns [id, name, age, email, phone] with 100 existing records
# Day 2: Source has columns [id, name, age] (missing email, phone)
# Result: Target schema shrinks to [id, name, age]
# IMPORTANT: email and phone columns are physically DROPPED
# Data in dropped columns is permanently lost
# Existing records preserve their original hash values
# New records get hash calculated on shrunk schema
Critical Behavior:
Warning
When auto_shrink=True, data in removed columns is permanently lost.
This operation cannot be undone.
Hash Calculation During Schema Evolution
One of the most important aspects of schema evolution is maintaining data integrity through proper hash calculations:
Existing Data Hash Preservation
When schema evolution occurs (expand or shrink), existing target data maintains its original hash values:
# Example: Target evolution from [id, name, age, salary] to [id, name, age]
# Existing records keep original hash: hash(id, name, age, salary)
# Even though 'salary' column is dropped, hash remains unchanged
# This preserves data lineage and audit trails
New Data Hash Calculation
New source data gets hash values calculated based on the final evolved schema:
# After schema shrink from [id, name, age, salary] to [id, name, age]
# New records get hash: hash(id, name, age)
# Hash calculation excludes dropped columns
Mixed Hash Scenarios
This creates scenarios where a single target contains records with different hash calculation bases:
# Target after auto_shrink evolution:
# - 100 existing records: hash(id, name, age, salary)
# - 50 new records: hash(id, name, age)
# This is normal and expected behavior
Strategy-Specific Evolution Behavior
Append Strategy
Existing target data: Completely preserved, including original metadata and hashes
Schema changes: Applied to target structure
New source data: Aligned to evolved target schema
Data loss: Only occurs with
auto_shrink=True(columns dropped permanently)
from sparksneeze.strategy import Append
# Conservative: Only expand schema, never shrink
strategy = Append(auto_expand=True, auto_shrink=False)
# Aggressive: Allow both expansion and shrinkage
strategy = Append(auto_expand=True, auto_shrink=True)
Truncate Strategy
Existing target data: Completely cleared before processing
Schema changes: Applied to target structure
New source data: All data processed uniformly with evolved schema
Hash consistency: All records have same hash calculation base
from sparksneeze.strategy import Truncate
# All data processed with consistent schema evolution
strategy = Truncate(auto_expand=True, auto_shrink=True)
Upsert and Historize Strategies
Key preservation: Key columns are never dropped during
auto_shrinkExisting data: Hash calculations follow same mixed-hash patterns as Append
Schema evolution: Applied before key-based operations
Data Loss Scenarios
When Data is Lost:
auto_shrink=True: Columns not in source are permanently dropped
Type incompatibility: Incompatible data types may cause data loss during casting
Delta Lake: Column drops are irreversible (even with time travel)
Prevention Strategies:
# 1. Use conservative settings
strategy = Append(auto_expand=True, auto_shrink=False)
# 2. Backup before schema evolution
target.backup("backup_table_name")
# 3. Test schema evolution on sample data first
test_result = sparksneeze(sample_df, test_target, strategy).run()
Best Practices
Start Conservative: Use
auto_expand=True, auto_shrink=FalseinitiallyMonitor Schema Changes: Log and review schema evolution operations
Backup Critical Data: Always backup before enabling
auto_shrinkTest Evolution: Validate schema evolution on development data first
Document Changes: Track schema evolution decisions and impacts
Example: Complete Schema Evolution Workflow
from sparksneeze import sparksneeze
from sparksneeze.strategy import Append
from sparksneeze.metadata import MetadataConfig
# Setup with schema evolution enabled
metadata_config = MetadataConfig()
strategy = Append(
auto_expand=True, # Allow new columns
auto_shrink=False, # Prevent data loss
metadata_config=metadata_config
)
# Execute with automatic schema adaptation
result = sparksneeze(
source_df,
"target_table",
strategy
).run()
if result.success:
print(f"Schema evolution completed: {result.message}")
# Check for any schema changes in logs
else:
print(f"Schema evolution failed: {result.message}")
Troubleshooting Schema Evolution
Common Issues:
Type Mismatch: Source and target have incompatible column types
Solution: Ensure compatible types or implement custom type casting
Metadata Conflicts: Existing metadata conflicts with new schema
Solution: Use consistent MetadataConfig across operations
Hash Validation Failures: Mixed hash calculations cause validation issues
Solution: Understand that mixed hashes are expected behavior after evolution
Data Loss Surprise: Unexpected data loss from auto_shrink
Solution: Always backup before enabling auto_shrink operations
Debug Schema Evolution:
import logging
# Enable detailed schema evolution logging
logging.getLogger("sparksneeze.schema_evolution").setLevel(logging.DEBUG)
# Run operation with verbose output
result = sparksneeze(source_df, target, strategy).run()