Transfer · Do Tasks · Do

The task with type Do, is the task that handles data migration and manipulation in your data lake. Well known use cases are data aggregation (for reporting purposes), data backups or storage loads. Depending on how you configure the particular root properties, the task behaves differently. The exact type of task is broadcasted in the log. You can use the Do task for the following tasks in your data lake:

Do: Table to storage

Materializes a table in a storage location in your data lake. Well known use-case is to backup a table.

Properties

root propertydescription
extractConfigure a table identifier for BigQuery or for Snowflake. Don't use query or template. See Query to storage for this.
deduplicateConfigure a storage deduplications
loadConfigure as a storage load.

Example usage

task:
    type: do

extract:
    source: database
    dataset_id: osb_capture
    # Use the $ sign to export a partition (only in BigQuery)
    #   This generates a partion of the previous hour in UTC timezone
    table_id: hits${{ task.trigger_date.subtract(hours=1).in_timezone('UTC').strftime('%Y%m%d%H') }}

deduplicate:
    # Purge all files in storage destination
    type: delete_all

load:
    destination: storage
    object_name_or_glob: |
                            {{ task.trigger_date.subtract(hours=1).in_timezone('UTC').strftime('%Y') }}/{{ task.trigger_date.subtract(hours=1).in_timezone('UTC').strftime('%m') }}/{{ task.trigger_date.subtract(hours=1).in_timezone('UTC').strftime('%d') }}/{{ task.trigger_date.subtract(hours=1).in_timezone('UTC').strftime('%H') }}/hits_*.json.gz
    bucket: osb-capture-backup
    compression: gzip
    export_format: json

Do: Query to storage

Materializes a query in a storage location in your data lake.

Properties

root propertydescription
extractConfigure as a query extract for BigQuery or query extract for Snowflake.
deduplicateConfigure a storage deduplications
loadConfigure as a storage load.

Example usage

task:
    type: do

#Extract
extract:
    query: |
            SELECT date, id, index
            FROM `factbase-augment.transfer_testdata.some_table`

#Delete duplicate rows
deduplicate:
    type: delete_all

load:
    destination: storage
    bucket: some_bucket
    object_name_or_glob: test_folder/data_*.csv.gz
    compression: gzip
    export_format: csv

Do: Query to table

Materializes a query in a table. Well known use-case is data aggregation for reporting.

Properties

root propertydescription
extractConfigure as a query extract for BigQuery or query extract for Snowflake.
deduplicateConfigure a table deduplications
loadConfigure as a database load for BigQuery or Snowflake.

Example usage

task:
    type: do

#Extract
extract:
    query: |
            SELECT date, id, index
            FROM `factbase-augment.transfer_testdata.some_table`

#Delete duplicate rows
deduplicate:
    type: keys
    keys:
        - id
        - date
        - index

load:
    destination: database
    dataset_id: transfer_testdata
    table_id: storage_to_table

Do: Query

Executes a query. Well known use-case is DML statements on tables

Properties

root propertydescription
extractConfigure as a query extract for BigQuery or query extract for Snowflake.

Example usage

task:
    type: do
    
    #Extract
    extract:
        query: |
                UPDATE `factbase-augment.transfer_testdata.storage_to_table`
                SET id=TO_BASE64(MD5(id))
                WHERE 1=1

Do: Storage to table

Loads blobs from storage in a table.

Properties

root propertydescription
extractConfigure as a storage location.
fileConfigure a file format
schemaConfigure a schema if you use Snowflake or want to force some field names and types in BigQuery.
deduplicateConfigure a table deduplications
loadConfigure as a database load for BigQuery or Snowflake.

Example usage

task:
    type: do
    start_date: today -2 days
    
#Extract
extract:
    source: storage
    bucket: my-bucket
    prefix: test_folder/storage_to_table.csv.gz

file:
    type: txt
    delimiter: ','
    encoding: UTF-8
    text_enclosure: '"'
    new_line: '\n'
    escape_char: '\'
    has_header_row: yes

    #Delete duplicate rows
deduplicate:
    type: date_range
    field: delivery_date

load:
    destination: database
    dataset_id: transfer_testdata
    table_id: storage_to_table

Do: Validate

Validates a query result.

Properties

root propertydescription
validateConfigure a validation like in a from task.

Example usage

# A test with nothing but a validate query in it
task:
    type: do

validate:
    - query: |
                SELECT COUNT(1) AS lines,
                    COUNT(DISTINCT index) AS indexes,
                    COUNT(DISTINCT id) AS ids,
                    COUNT(DISTINCT date) AS dates
                FROM `factbase-augment.transfer_testdata.storage_to_table`
        field: lines
        type: number
        operator: '=='
        value: 6000
    - field: ids
        type: number
        operator: '=='
        value: 3000
    - field: indexes
        type: number
        operator: '=='
        value: 3000
    - field: dates
        type: number
        operator: '=='
        value: 2