Strategies

Introduction

A strategy is given as a parameter to the sparksneeze class. For example:

sparksneeze(source_entity, target_entity, strategy=Strategy(args, kwargs)).run()

Create a sparksneeze instance with a specific strategy, and then run it.

Parameters:
  • strategy (SnarpSneezeStrategy) – A strategy class instance

  • args – Positional arguments

  • kwargs – Key-word arguments

Returns:

Strategy result

Return type:

SparkSneezeResult

Automatic Metadata Enrichment

All strategies automatically add standardized metadata fields to every record. This ensures consistent tracking and auditing across all data operations.

Metadata Fields Added:

  • _META_valid_from (TimestampType) - Record validity start timestamp

  • _META_valid_to (TimestampType) - Record validity end timestamp (2999-12-31 for active records)

  • _META_active (BooleanType) - Active record indicator (True for all records except in Historize operations)

  • _META_row_hash (StringType) - Hash of data columns (excludes metadata and key columns)

  • _META_system_info (StringType) - JSON containing system metadata (strategy, version, timestamp, etc.)

Customizing Metadata:

from sparksneeze.metadata import MetadataConfig
from sparksneeze.strategy import DropCreate

# Custom metadata configuration
config = MetadataConfig(
    prefix="_AUDIT",  # Custom prefix instead of _META
    hash_columns=["id", "name"]  # Only hash specific columns
)

# Use with any strategy
strategy = DropCreate(metadata_config=config)
result = sparksneeze(df, "target_table", strategy).run()

Strategies

DropCreate

class DropCreate(metadata_config=None)

Remove the target entity and create it anew based on the schema of the source entity. There are no parameters, no data or schema will be kept of the old target.

All records will have metadata fields added with _META_active=True.

Parameters:

metadata_config (MetadataConfig) – Optional metadata configuration. Uses default if None.

Returns:

SparkSneezeResult

Return type:

dict

Example

Randy takes over as the new park supervisor and needs to replace the old resident roster with a fresh start.

Source (New Resident Data):

name

age

occupation

location

Ricky

31

Convenience Store

Trailer Park

Julian

34

Bar Owner

Trailer Park

Bubbles

33

Cat Caretaker

Shed

Existing Target (Old Registry):

name

age

job_title

Randy

36

Supervisor

Mr Lahey

54

Manager

Result (Completely New Registry):

name

age

occupation

location

_META_valid_from

_META_valid_to

_META_active

_META_row_hash

_META_system_info

Ricky

31

Convenience Store

Trailer Park

2024-03-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”DropCreate”}

Julian

34

Bar Owner

Trailer Park

2024-03-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”DropCreate”}

Bubbles

33

Cat Caretaker

Shed

2024-03-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”DropCreate”}

The old registry was completely wiped and replaced with the new resident data.

Truncate

class Truncate(auto_expand=True, auto_shrink=False, metadata_config=None)

Clear the target entity and load the data from the source entity. By default it automatically expands the schema when new columns are found. Columns that are removed from the source entity will remain in the target entity. By enabling auto_shrink it will automatically drop columns from the target entity as well.

When auto_expand and auto_shrink are turned on simultaneously it will mimic the DropCreate strategy. This is useful for database logging or when there is no DROP and CREATE permissions granted. In the case of Delta tables history is preserved, whereas a DropCreate would replace the Delta table entirely.

All records will have metadata fields added with _META_active=True.

Parameters:
  • auto_expand (bool) – Automatically add new columns to the target_entity

  • auto_shrink (bool) – Automatically remove nonexistent columns from the target_entity

  • metadata_config (MetadataConfig) – Optional metadata configuration. Uses default if None.

Returns:

SparkSneezeResult

Return type:

dict

Example

Julian gets a new liquor shipment and needs to clear out the old inventory and load the fresh stock.

Source (New Liquor Inventory):

name

age

job

Ricky

31

Sales Agent

Julian

34

Manager

Bubbles

33

Analyst

Existing Target (Old Inventory):

name

age

job

Randy

36

Supervisor

Mr Lahey

54

Manager

Result (Fresh Inventory Only):

name

age

job

_META_valid_from

_META_valid_to

_META…

_META…

_META_system_info

Ricky

31

Sales Agent

2024-05-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Truncate”}

Julian

34

Manager

2024-05-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Truncate”}

Bubbles

33

Analyst

2024-05-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Truncate”}

All old inventory was cleared and replaced with the new shipment data.

Append

class Append(auto_expand=True, auto_shrink=False, metadata_config=None)

Load the data from the source entity into the target entity. By default it automatically expands the schema when new columns are found. Columns that are removed from the source entity will remain in the target entity. By enabling auto_shrink it will automatically drop columns from the target entity as well.

No rows will be removed from the target entity. All records will have metadata fields added with _META_active=True.

Parameters:
  • auto_expand (bool) – Automatically add new columns to the target_entity

  • auto_shrink (bool) – Automatically remove nonexistent columns from the target_entity

  • metadata_config (MetadataConfig) – Optional metadata configuration. Uses default if None.

Returns:

SparkSneezeResult

Return type:

dict

Example

New people are moving into Sunnyvale Trailer Park and need to be added to the existing resident registry.

Source (New Residents):

name

age

business

income

Ricky

31

Get Rich Quick

2500

Julian

34

Bar Business

5000

Bubbles

33

Cart Business

1200

Existing Target (Current Residents):

name

age

business

income

Randy

36

Security

3000

Mr Lahey

54

Supervisor

4500

Result (All Residents Combined):

name

age

business

income

_META_valid_from

_META_valid_to

_META…

_META…

_META_system_info

Randy

36

Security

3000

2024-07-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Append”}

Lahey

54

Supervisor

4500

2024-07-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Append”}

Ricky

31

Get Rich Quick

2500

2024-07-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Append”}

Julian

34

Bar Business

5000

2024-07-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Append”}

Bubbles

33

Cart Business

1200

2024-07-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Append”}

Existing residents were preserved and new residents were added to the registry.

Upsert

class Upsert(key=[col1, col2], auto_expand=True, auto_shrink=False, metadata_config=None)

Load data from the source entity into the target entity by using one or more keys. After the key comparison, the following happens:

  • New keys have their records inserted into the target entity

  • Existing keys will have their records updated in the target entity

  • (Optional) Nonexistent keys will have their records removed from the target entity

All records will have metadata fields added with _META_active=True. Key columns are automatically excluded from row hash calculation.

Parameters:
  • key (list or str) – The key(s) that will be used to upsert

  • auto_expand (bool) – Automatically add new columns to the target_entity

  • auto_shrink (bool) – Automatically remove nonexistent columns from the target_entity

  • metadata_config (MetadataConfig) – Optional metadata configuration. Uses default if None.

Returns:

SparkSneezeResult

Return type:

dict

Example

Characters are getting out of jail and updating their status in the park’s character tracking system.

Source (Status Updates):

character_id

name

age

status

money

1

Ricky

32

Out of Jail

150

2

Julian

35

Running Bar

5500

4

Bubbles

34

Cart Business

1800

5

Corey

20

Working Store

800

Existing Target (Current Status):

character_id

name

age

status

money

1

Ricky

31

In Jail

50

2

Julian

34

Planning

3000

3

Randy

36

Supervisor

2500

Result (Updated Status):

id

name

age

status

money

_META_valid_from

_META_valid_to

_META…

_META…

_META_system_info

1

Ricky

32

Out of Jail

150

2024-09-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Upsert”}

2

Julian

35

Running Bar

5500

2024-09-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Upsert”}

3

Randy

36

Supervisor

2500

2024-09-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Upsert”}

4

Bubbles

34

Cart Business

1800

2024-09-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Upsert”}

5

Corey

20

Working Store

800

2024-09-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Upsert”}

Ricky and Julian got updated with their new status and money, while Bubbles and Corey were inserted as new entries.

Historize

class Historize(key=[col1, col2], auto_expand=True, auto_shrink=False, valid_from=datetime.now(), valid_to=datetime(2999, 12, 31), prefix='_META', metadata_config=None)

Load data from the source entity into the target entity by using one or more keys and add validity time tracking attributes. The metadata columns to store a valid from date, a valid to date and active attribute will be added to the target entity, regardless of the auto_expand parameter.

After the key comparison, the following happens:

  • New keys have their records inserted into the target entity, valid_from and valid_to will be set

  • Existing keys will have their records updated in the target entity, setting the valid_to and active values

  • Existing keys will have a new record inserted in the target entity, setting the valid_from and valid_to values

  • Nonexistent keys will have their records in the target_entity updated setting the valid_to and active values

Key columns are automatically excluded from row hash calculation. Uses custom valid_from/valid_to timestamps for metadata fields.

Parameters:
  • key (list or str) – The key(s) that will be used to SCD2

  • auto_expand (bool) – Automatically add new columns to the target_entity

  • auto_shrink (bool) – Automatically remove nonexistent columns from the target_entity

  • valid_from (datetime) – The datetime value to set for the start of record validity, defaults to datetime.now()

  • valid_to (datetime) – The datetime value to set for the end of the record validity, defaults to datetime(2999, 12, 31)

  • prefix (string) – The prefix to use for the metadata columns, defaults to ‘_META’. E.g. _META_valid_from and _META_valid_to.

  • metadata_config (MetadataConfig) – Optional metadata configuration. Uses default if None.

Returns:

SparkSneezeResult

Return type:

dict

Example

Tracking residents’ rent payment status changes over time for park management records. In October 2024, residents had their initial status recorded. By December 2024, some changes occurred that needed to be tracked historically.

Source (Updated Rent Status):

resident_id

name

age

trailer_number

rent_status

1

Ricky

32

1

Behind

2

Julian

35

2

Paid

3

Bubbles

34

0

Shed Owner

4

Randy

37

5

Free Housing

Existing Target (Historical Records):

resident_id

name

age

trailer_number

rent_status

_META_valid_from

_META_valid_to

_META_active

1

Ricky

31

1

Paid

2024-10-01

2999-12-31

true

2

Julian

34

2

Paid

2024-10-01

2999-12-31

true

5

Mr Lahey

54

3

Supervisor

2024-10-01

2999-12-31

true

Result (Historized Records):

id

name

age

trail#

rent_status

_META_valid_from

_META_valid_to

_META…

_META…

_META_system_info

1

Ricky

31

1

Paid

2024-10-01 00:00:00

2024-12-01 00:00:00

false

[hash]

{“strategy”:”Historize”}

1

Ricky

32

1

Behind

2024-12-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Historize”}

2

Julian

34

2

Paid

2024-10-01 00:00:00

2024-12-01 00:00:00

false

[hash]

{“strategy”:”Historize”}

2

Julian

35

2

Paid

2024-12-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Historize”}

3

Bubbles

34

0

Shed Owner

2024-12-01 00:00:00

2999-12-31 23:59:59

true

[hash]

{“strategy”:”Historize”}

Changes were tracked over time: Ricky’s status changed from “Paid” (Oct 2024) to “Behind” (Dec 2024), Julian’s age updated, while preserving full history of all changes. Mr Lahey’s record was ended since he’s no longer in the source data.