Transfer · From Tasks

From tasks handle all data imports to your data lake. You can configure the items below. Click on the name for more info.

Extract

The `extract` part of the configuration covers the extract step of a task. The vocabulary of the datasource is leading here: where Transfer normally talks about fields and records, it will use columns and rows if the datasource uses those.

All from tasks come with pre-formatted schemas where possible; examples are from_facebook and from_xandr. For tasks where a pre-formatted schema is not possible, you can create one yourself. Examples of such tasks are from_imap (which imports mail attachments) and from_aws_s3 (which imports files from Amazon S3).

Transfer gets new tasks every month. Currently Transfer supports the following from tasks:

  1. from_appfigures
  2. from_aws_s3
  3. from_bigquery
  4. from_bing
  5. from_dcm
  6. from_dcbm
  7. from_facebook
  8. from_google_ads
  9. from_google_analytics_v3
  10. from_google_analytics_management
  11. from_google_drive
  12. from_google_search_console
  13. from_imap
  14. from_imap_message_counter
  15. from_looker
  16. from_url
  17. from_salesforce
  18. from_snowflake
  19. from_ftp
  20. from_xandr

File

The file part of the configuration contains the definition of the data file being downloaded. Use it for from tasks where the format is not pre-defined, e.g. with from_aws_s3 to download a file.

Usage

file:
    type: txt
    transforms:
        - type: uncompress
    delimiter: ;
    encoding: UTF-8
    text_enclosure: '"'
    new_line: '\n'
    escape_char: '\'
    has_header_row: yes
    skip_rows: 9
    skip_bottom_rows: 8

Properties

property | type | optional | description
type | enumerator (txt, json, xml, parquet, avro) | yes | File type. Use txt for delimited files, json for newline delimited JSON, xml for XML, parquet for Parquet files and avro for AVRO files.
node | string | no | Only when type=xml. Node to take as the root node.
delimiter | char | no | Default is comma (,). Use where type=txt. Contains the char that delimits the fields.
encoding | enumerator (valid encoding type) | no | Default is UTF-8. Use where type=txt. Use when you need a different encoding type.
has_header_row | yesno (boolean) | yes | Default is no. Use where type=txt. Whether the file contains a header row.
text_enclosure | char | yes | Default is ". Use where type=txt. Quoting character to enclose fields.
new_line | char | yes | Default is \n. Use where type=txt. Newline character. Use \n for Unix-based systems, or \r\n for Windows-based systems.
escape_char | char | yes | Default is \. Use where type=txt. Escape character used in text files.
quoting | enumerator (minimal, all, nonnumeric, none) | yes | Default is minimal. How verbose is the quoting in the file? all = all fields, nonnumeric = only non-numeric fields, minimal = only when needed (default), none = never use quotes.
skip_rows | int (>= 0) | yes | Default is 0. Number of rows to skip at the top of the file. Happens before has_header_row is applied.
skip_bottom_rows | int (>= 0) | yes | Default is 0. Number of rows to skip at the bottom of the file.

transforms

The `transforms` part of the configuration enables you to manipulate data before you load it into your data lake. You have to add all transforms as an array.

Example usage

transforms:
    # Field transforms
    - type: fields_to_snake_case
    - type: fields_to_lower_case
    - type: fields_regexp_and_replace
      search: '_'
      replace: ''

Transform types

Below are the available transform types, each with an explanation of how to use it exactly. You have to add them as an array.

transform: add_current_date_field

Adds the current date (trigger_date) as a field.

property | type | optional | description
type | string | no | Set to add_current_date_field
field | string | no | Field name of the new column, e.g. trigger_date
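
Example usage

A minimal sketch of this transform; it adds the trigger date under the field name trigger_date.

transforms:
    - type: add_current_date_field
      field: trigger_date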

transform: convert_type

Converts the type of a field. Only useful with BigQuery when no schema is set.

property | type | optional | description
type | string | no | Set to convert_type
fields | array of strings | no | Array of field names
to | enumerator (supported field type of BigQuery) | no | Type the fields should be transformed to
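
Example usage

A minimal sketch; the field names clicks and impressions are illustrative, and INTEGER is one of BigQuery's supported types.

transforms:
    - type: convert_type
      fields:
          - clicks
          - impressions
      to: INTEGER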

transform: rename_field

Rename a field

property | type | optional | description
type | string | no | Set to rename_field
from | string | no | Current name of the field
to | string | no | Name the field should be renamed to
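
Example usage

A minimal sketch; the field names campaignName and campaign_name are illustrative.

transforms:
    - type: rename_field
      from: campaignName
      to: campaign_name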

transform: copy_field

Duplicate a field.

property | type | optional | description
type | string | no | Set to copy_field
from | string | no | Current name of the field
to | string | no | Name the field should be duplicated to
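
Example usage

A minimal sketch; the field names order_id and order_id_raw are illustrative.

transforms:
    - type: copy_field
      from: order_id
      to: order_id_raw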

transform: drop_fields

Drops fields.

property | type | optional | description
type | string | no | Set to drop_fields
fields | array of strings | no | Array of field names that should be dropped.
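
Example usage

A minimal sketch; the field names internal_notes and debug_flag are illustrative.

transforms:
    - type: drop_fields
      fields:
          - internal_notes
          - debug_flag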

transform: encrypt

Encrypts field content. Supports several hash algorithms.

property | type | optional | description
type | string | no | Set to encrypt
encryption | enumerator (md5, sha1, sha224, sha256, sha384, sha512) | no | Hash algorithm that should be used to hash the content of a field.
fields | array of strings | no | Array of field names that should be hashed.
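
Example usage

A minimal sketch; the field name email is illustrative.

transforms:
    - type: encrypt
      encryption: sha256
      fields:
          - email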

transform: fields_search_and_replace

Search and replace a field name.

property | type | optional | description
type | string | no | Set to fields_search_and_replace
search | string | no | String to search for.
replace | string | no | String to replace the search string with.
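
Example usage

A minimal sketch that replaces spaces in field names with underscores; the values are illustrative.

transforms:
    - type: fields_search_and_replace
      search: ' '
      replace: '_'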

transform: fields_to_snake_case

Transform field names to snake_case. E.g. OneSecondBefore to one_second_before. Also replaces the field names in a schema.

property | type | optional | description
type | string | no | Set to fields_to_snake_case

transform: fields_to_lower_case

Transform field names to lower case. E.g. ONESECONDBEFORE to onesecondbefore. Also replaces the field names in a schema.

property | type | optional | description
type | string | no | Set to fields_to_lower_case

transform: fields_regexp_and_replace

Regular expression replacement of field names.

property | type | optional | description
type | string | no | Set to fields_regexp_and_replace
search | string | no | Regular expression to search for.
replace | string | no | Regular expression to replace the found string with.

transform: filter_records

Filter records from incoming dataset.

property | type | optional | description
type | string | no | Set to filter_records
fields | array of strings | no | Array of field names on which the filter must be applied.
value | string | no | If a field contains this value, the whole record will be filtered.
filter | enumerator (include, exclude) | no | Use include if you want all records to be included when the field(s) contain the value. Use exclude if you want all records to be excluded when the field(s) contain the value.
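
Example usage

A minimal sketch that excludes all records where the field country contains NL; the field name and value are illustrative.

transforms:
    - type: filter_records
      fields:
          - country
      value: NL
      filter: exclude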

transform: find_all

Find all occurrences of a regular expression and return an array of matches. Based on Python's re.findall

property | type | optional | description
type | string | no | Set to find_all
fields | array of strings | no | Array of field names on which the transform must be applied.
search | string | no | Regular expression to search for in the field(s).
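
Example usage

A minimal sketch that returns all numbers found in the field message; the field name and pattern are illustrative.

transforms:
    - type: find_all
      fields:
          - message
      search: '[0-9]+'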

transform: find_first

Find the first occurrence of a regular expression and return the match. Based on Python's re.findall.

property | type | optional | description
type | string | no | Set to find_first
fields | array of strings | no | Array of field names on which the transform must be applied.
search | string | no | Regular expression to search for in the field(s).
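
Example usage

A minimal sketch that returns the first utm_campaign value found in the field url; the field name and pattern are illustrative.

transforms:
    - type: find_first
      fields:
          - url
      search: 'utm_campaign=([^&]+)'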

transform: regexp_and_replace

Regular expression replacement of field data.

property | type | optional | description
type | string | no | Set to regexp_and_replace
fields | array of strings | no | Array of field names on which the transform must be applied.
search | string | no | Regular expression to search for.
replace | string | no | Regular expression to replace the found string with.
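
Example usage

A minimal sketch that strips all non-numeric characters from the field phone_number; the field name and pattern are illustrative.

transforms:
    - type: regexp_and_replace
      fields:
          - phone_number
      search: '[^0-9]'
      replace: ''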

transform: replace_with_null

Conditionally replace field data with null value.

property | type | optional | description
type | string | no | Set to replace_with_null
fields | array of strings | no | Array of field names on which the transform must be applied.
values | array of strings | no | If the field value matches any of the strings in this array, the field will be replaced with a null value.
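
Example usage

A minimal sketch that replaces N/A and unknown values in the field revenue with null; the field name and values are illustrative.

transforms:
    - type: replace_with_null
      fields:
          - revenue
      values:
          - 'N/A'
          - 'unknown'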

transform: search_and_replace

Search and replace a substring in field data.

property | type | optional | description
type | string | no | Set to search_and_replace
search | string | no | String to search for in the field data.
replace | string | no | String to replace the search string with.
fields | array of strings | no | Array of field names on which the search and replace must be applied.
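
Example usage

A minimal sketch that removes the prefix "EUR " from the field product_price; the field name and values are illustrative.

transforms:
    - type: search_and_replace
      fields:
          - product_price
      search: 'EUR '
      replace: ''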

transform: strptime

Parse a field value as a time string. Uses Python's datetime.strptime.

property | type | optional | description
type | string | no | Set to strptime
format | string | no | Format of the string that should be parsed as a date & time. Use Python's strptime format codes.
fields | array of strings | no | Array of field names on which the transform must be applied.
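
Example usage

A minimal sketch that parses the field booking_date using a day-month-year format; the field name and format are illustrative.

transforms:
    - type: strptime
      format: '%d-%m-%Y'
      fields:
          - booking_date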

Schema

Don't use a schema for the APIs that use pre-formatted schemas; the extract section of the task type you chose contains this information. The `schema` part of the configuration is used to create the destination table. Required for Snowflake, but optional for BigQuery.

Example usage

schema:
    time_partitioning:
        field: date_field
        type: day
        require_partition_filter: yes
    fields:
        - name: date_field
          type: DATE
          mode: REQUIRED
          description: Description for date_field will be visible as field comment. Field will be used for partitioning in BigQuery and clustering in Snowflake.
        - name: some_string_field
          type: STRING
          mode: NULLABLE
          description: Description for some_string_field will be visible as field comment.
        - name: some_int_field
          type: INTEGER
          mode: NULLABLE
          description: Description for some_int_field will be visible as field comment.
        - name: some_repeated_field
          type: OBJECT
          mode: REPEATED
          description: Description for some_repeated_field will be visible as field comment.
          fields:
            - name: date_field
              type: DATE
              mode: REQUIRED
              description: Description for date_field will be visible as field comment.
            - name: some_string_field
              type: STRING
              mode: NULLABLE
              description: Description for some_string_field will be visible as field comment.

Properties of time_partitioning

Optional. Time partitioning configures either clustering (for Snowflake) or time partitioning (for BigQuery).

property | type | optional | description
field | string | yes | Field name. Must be present in the field definitions.
type | enumerator (hour, day, month, year) | yes | Timeframe of the partition. Default is day. Per hour, day, month or year.
require_partition_filter | yesno (boolean) | yes | Default is no. BigQuery only. Whether the partition filter should be required in your SQL.

Properties of fields

Required. Array of table fields.

property | type | optional | description
name | string | no | Field name. Use valid names for BigQuery or Snowflake.
type | enumerator (valid data type) | no | SQL data type of the field. Use only valid types of BigQuery or Snowflake.
mode | enumerator (nullable, required, repeated) | yes | BigQuery only. Defines the mode of this field. nullable = may be NULL, required = may not be NULL, repeated = array.
description | string | yes | Description of the table field. Added in Snowflake as a column comment during table creation. Updated with every load in BigQuery.
fields | array of fields | yes | Contains nested fields. Use mode=repeated if this should be an array of fields. Use the same format for fields as described here.

Deduplicate

Optional. Deduplicate the destination table to make sure you don't import the same data twice. Configure once, deduplicates automatically. If no deduplication is added, all imported data will be added to the destination table.

You can deduplicate a storage or table destination. Click on a deduplicate type for an explanation.

deduplicate: storage

Below are the deduplicate types for a storage destination, with an explanation per type.

Example usage

deduplicate:
    type: delete_all

deduplicate storage: none

Default setting. No deduplication. Adds all files to the destination storage location. Overwrites if the source files have the same name as the destination files.

property | type | optional | description
type | string | no | Set to none

deduplicate storage: delete_all

Deletes all files in the destination storage location.

property | type | optional | description
type | string | no | Set to delete_all

deduplicate: table

Below are the deduplicate types for a table destination, with an explanation per type.

Example usage

deduplicate:
    type: date_range
    field: booking_date

deduplicate: append

Default setting. No deduplication. Appends all data to the destination table.

property | type | optional | description
type | string | no | Set to append

deduplicate: drop

Preferred over type=truncate. Drops destination table and recreates the table (with latest fields and descriptions) before importing new data.

property | type | optional | description
type | string | no | Set to drop
with_loop | enumerator (all, once) | no | Only needed when task.loop_by is set. Use all to drop and recreate the table with every loop. Use once to only drop and recreate the table with the first loop.
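
Example usage

A minimal sketch of a drop deduplication; with_loop: once is only relevant when task.loop_by is set.

deduplicate:
    type: drop
    with_loop: once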

deduplicate: truncate

Prefer type=drop over truncate. Truncates (or creates if not exists) destination table before importing new data.

property | type | optional | description
type | string | no | Set to truncate
with_loop | enumerator (all, once) | no | Only needed when task.loop_by is set. Use all to truncate the table with every loop. Use once to only truncate the table with the first loop.

deduplicate: date_range

Deletes the date range from deduplicate.start_date (or task.start_date) to deduplicate.end_date (or task.end_date) (inclusive) before importing new data. Currently only supports deletion by day. Contact support if you want a different timeframe, e.g. hour.

property | type | optional | description
type | string | no | Set to date_range
field | string | no | Name of the date field.

deduplicate: keys

Basically a merge of the imported dataset into the destination table. First deletes all records in the destination table if the key(s) value(s) are present in the imported dataset. Then appends the imported dataset.

property | type | optional | description
type | string | no | Set to keys
field | string | no | Name of the key field.

Load

Optional. Configures where in the data lake the imported data should be stored.

Load types

Below are the types of load. Click on a load type for an explanation.

load: database (BigQuery)

Load imported data into a BigQuery table.

Example usage

Example of loading the data files to table `my-data-lake.osb_raw.weather-data`

load:
    destination: database
    conn_id: google_cloud_default
    project_id: my-data-lake
    dataset_id: osb_raw
    table_id: weather-data

property | type | optional | description
destination | enumerator (database, storage) | yes | Default is database
conn_id | string | yes | Name of the connection. If not declared, client_cloud.db_conn_id is used.
project_id | string | yes | Project ID of the destination table. If not declared, client_cloud.project_id is used.
dataset_id | string | yes | Dataset ID of the destination table. If not declared, client_cloud.dataset_id is used.
table_id | string | yes | Table ID of the destination table. If not declared, task.id is used.
table_expiration | (relative) date or datetime | yes | Sets the expiration date of the table. If not declared, client_cloud.table_expiration is used. Set it to None if you don't want a default table_expiration, e.g. for your production tables. Recommended for sandbox files to make sure your sandbox tables are deleted automatically. Besides absolute dates (like 2021-09-27), you can use relative dates as well.
max_bad_records | integer (>= 0) | yes | Maximum number of bad records before the load operation fails. Must be an integer of 0 or greater.
autodetect_schema | yesno (boolean) | yes | BigQuery can detect the schema of a file being loaded. Automatically set to yes if no schema is available.

load: database (Snowflake)

Load imported data into a Snowflake table.

Example usage

Example of loading the data files to table PRODUCTION.OSB_RAW.WEATHER_DATA

load:
    destination: database
    conn_id: snowflake
    database: PRODUCTION
    schema: OSB_RAW
    table: WEATHER_DATA

property | type | optional | description
destination | enumerator (database, storage) | yes | Default is database
conn_id | string | yes | Name of the connection. If not declared, client_cloud.db_conn_id is used.
database | string | yes | Database of the destination table. If not declared, client_cloud.database is used.
schema | string | yes | Schema of the destination table. If not declared, client_cloud.schema is used.
table | string | yes | Table of the destination table. If not declared, task.id.upper() is used.
max_bad_records | integer (>= 0) | yes | Maximum number of bad records before the load operation fails. Must be an integer of 0 or greater.

load: storage

Load imported data to Amazon S3, Google Cloud Storage or Azure Blob Storage (in beta).

Example usage

Example of loading the data files to s3://my-test-bucket/some/folder

load:
    destination: storage
    conn_id: s3
    bucket: my-test-bucket
    folder: some/folder

property | type | optional | description
destination | enumerator (database, storage) | no | Set to storage
conn_id | string | yes | Name of the connection. If not declared, client_cloud.storage_conn_id is used.
bucket | string | yes | Name of the bucket. If not declared, client_cloud.bucket is used.
folder | string | yes | Name of the folder. If not declared, client_cloud.folder is used.
project_id | string | yes | BigQuery only. Project ID of the bucket. If not declared, client_cloud.project_id is used.

validate

The `validate` part of the configuration enables you to check the data you just imported. If the imported data does not pass the validation, an error will be generated.

Validation types

Below are the validation types, each with an explanation of how to use it exactly. You have to add the validations as an array.

Special property: query

Set this property to the query you want to execute. Make sure the resultset is flattened (only rows and columns, not nested or repeated columns). The query property is required for the first validation, but optional for all subsequent validations. If you choose not to add query to the subsequent validations, Transfer will use the same resultset for the validation until it hits another query property. If you add a query property to any of the subsequent validations, Transfer will execute the query and use the resultset in that validation and the subsequent validations until it hits another query property again. You are allowed to use SQL Templating.

Validate: number

Validates a number field.

Example usage

The example validates whether the value of the column lines is greater than (>) 1000 for all rows in the resultset. The example uses SQL Templating, hence the {{ and }}.

validate:
    - query: |
              SELECT my_segment, COUNT(1) AS lines
              FROM `{{ load.project_id }}.{{ load.dataset_id }}.{{ load.table_id }}`
              GROUP BY 1
      type: number
      operator: '>'
      value: 1000
      field: lines

property | type | optional | description
query | string | yes (required for the first validation) | Query to execute. See special property: query for more info.
type | string | no | Set to number
field | string | no | Field name that will be validated
operator | enumerator (<=, <, ==, >, >=) | no | Operator to use in the validation
value | integer or float | no | Value to match the field against.

Validate: total_rows

Validates the number of rows of a resultset.

Example usage

The example validates whether the number of rows equals (==) eight. The example uses SQL Templating, hence the {{ and }}.

validate:
    - query: |
              SELECT my_segment, COUNT(1) AS transactions
              FROM `{{ load.project_id }}.{{ load.dataset_id }}.{{ load.table_id }}`
              GROUP BY 1
      type: total_rows
      operator: '=='
      value: 8

property | type | optional | description
query | string | yes (required for the first validation) | Query to execute. See special property: query for more info.
type | string | no | Set to total_rows
operator | enumerator (<=, <, ==, >, >=) | no | Operator to use in the validation
value | integer or float | no | Value to match the field against.

Validate: yesno

Validates if the value of a field is yes (True) in all rows.

Example usage

The example validates whether the value of the field has_enough_transactions is yes (True). The example uses SQL Templating, hence the {{ and }}.

validate:
    - query: |
              SELECT COUNT(1) > 1000 AS has_enough_transactions
              FROM `{{ load.project_id }}.{{ load.dataset_id }}.{{ load.table_id }}`
      type: yesno
      field: has_enough_transactions

property | type | optional | description
query | string | yes (required for the first validation) | Query to execute. See special property: query for more info.
type | string | no | Set to yesno
field | string | no | Field name of which the value must be yes (True)