Transfer · Job Orchestration

A Transfer job can contain one or more interdependent tasks. This section describes how the folders and files in a job should be structured, as well as the structure and settings of the job file. Each is described in more detail below.

job

Contains all properties related to the job.

Usage

job:
    # Some comment
    email:
        - admin@onesecondbefore.com
    owner: osb
    email_on_failure: no
    email_on_retry: no
    retries: 0
    retry_delay: 5
    schedule_interval: '@daily'  # quoted: a plain YAML scalar cannot start with @

Properties

| property | type | optional | description |
| --- | --- | --- | --- |
| client | string | no | Your client code. |
| id | string | no | Read-only. Unique name of the job. Automatically set by Transfer to the name of the job folder. |
| max_active_runs | integer | yes | Default is 1. Number of job runs that may be active at the same time. Use 1 if only a single instance of the job may run. |
| backfill | yesno (boolean) | yes | Default is yes. Use `yes` if you want the job to backfill from the job start date. |
| concurrency | int | no | Default value is 255. The number of task instances allowed to run concurrently. |
| depends_on_past | yesno (boolean) | no | Default value is yes. Causes a task instance to depend on the success of its previous task instance. |
| email | string | yes | Email address that will be used for notifications. Use a comma (,) to add more email addresses. |
| email_on_failure | yesno (boolean) | yes | Default value is no. Sends an email to the email address(es) in the email property when a task fails. |
| email_on_retry | yesno (boolean) | yes | Default value is no. Sends an email to the email address(es) in the email property when a task is retried. |
| description | string | yes | Description of the job, which will show up in the console. |
| retries | int | no | Default value is 0. Number of times the job is retried if it fails. |
| retry_delay | int | no | Default value is 5 (minutes). Number of minutes to wait before retrying the job. |
| schedule_interval | string | yes | Default value is None. Leave empty if the job should not be scheduled automatically. Use a cron expression. |
| start_date | string (with a relative date), date or date & time | yes | Contains the start date of the job. This is important if you have set the backfill option to `yes`: the backfill will then start at the start date. |
| end_date | string (with a relative date), date or date & time | yes | Contains the end date of the job. The job will not be scheduled after this moment. |
| environment | enumerator (production, development) | yes | Default is production. If set to development, the sandbox files will overwrite the job and task files. |
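
The scheduling properties above work together: `schedule_interval` sets the cadence, `start_date` sets where a backfill begins, and `max_active_runs` limits how many runs may overlap. The fragment below is a minimal sketch of that combination. All values are illustrative, and the plain `YYYY-MM-DD` notation for `start_date` is an assumption, since this page does not spell out the accepted date formats.

```yaml
job:
    # Illustrative values only; adjust to your own job.
    description: Daily import with backfill
    owner: osb
    email:
        - admin@onesecondbefore.com
    email_on_failure: yes
    retries: 2
    retry_delay: 10                  # minutes between retries
    schedule_interval: '0 3 * * *'   # cron expression: every day at 03:00
    start_date: 2024-01-01           # assumed date notation; backfill starts here
    backfill: yes
    max_active_runs: 1               # only one run of the job at a time
```

With `backfill: yes`, a job like this would be expected to catch up on every scheduled run since the start date, one run at a time because `max_active_runs` is 1.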

client_cloud

Contains all properties related to the cloud environment of the client.

Usage

BigQuery

client_cloud:
    type: google
    db_engine: bigquery
    db_conn_id: google_cloud_default
    storage: gs
    storage_conn_id: google_cloud_default
    project_id: your_project_id
    dataset_id: your_dataset_id
    dataset_id_tmp: your_dataset_id_tmp
    bucket: your_bucket
    folder: ''
    bucket_tmp: your_bucket_tmp
    folder_tmp: ''

Snowflake on Amazon Web Services

client_cloud:
    type: amazon
    db_engine: snowflake
    db_conn_id: snowflake
    database: PRODUCTION
    schema: YOUR_SCHEMA
    schema_tmp: YOUR_SCHEMA_TMP
    storage: s3
    storage_conn_id: amazon_s3
    bucket: your-bucket-id
    folder: production
    bucket_tmp: your-bucket-id-tmp
    folder_tmp: development

Parameters

| property name | type | optional | description |
| --- | --- | --- | --- |
| type | enumerator (google, amazon or azure) | no | Cloud supplier. Either google (for Google), amazon (for Amazon Web Services) or azure (for Microsoft Azure, in beta). |
| db_engine | enumerator (bigquery or snowflake) | no | Database engine. Currently Onesecondbefore supports Google BigQuery and Snowflake. |
| storage | enumerator (s3 or gs) | no | Storage type. Supports Google Cloud Storage (gs), S3 of Amazon Web Services (s3) or Azure Blob Storage of Microsoft (as, in beta). |
| db_conn_id | string | no | Connection ID for database access. Contains the name of the connection. Ask Onesecondbefore for your connections. |
| storage_conn_id | string | no | Connection ID for storage access. Contains the name of the connection. Ask Onesecondbefore for your connections. |
| project_id | string | no | BigQuery only. Contains the project ID. |
| dataset_id | string | no | BigQuery only. Contains the dataset ID. |
| dataset_id_tmp | string | no | BigQuery only. Will be deprecated. Contains the dataset ID where the temporary tables will be stored. |
| table_expiration | string (with relative date), date, date & time | no | BigQuery only. Sets the expiration of a table. |
| database | string | no | Snowflake only. Contains the database name. |
| schema | string | no | Snowflake only. Contains the schema name. |
| storage_integration | string | yes | Snowflake only. Contains the storage integration. Don't use this unless you have to. |
| bucket | string | no | Bucket name. |
| folder | string | no | Folder in which the files will be stored. |
| bucket_tmp | string | no | Name of the temporary bucket, used to store temporary files. Used in combination with `folder_tmp`. |
| folder_tmp | string | no | Folder in which the temporary files will be stored. Used in combination with `bucket_tmp`. |
| timezone | string | no | Contains the timezone of the client. Choose it carefully; it is recommended to keep it the same across all jobs. |
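
The usage samples above do not show the `timezone` property, although the table lists it as required. The sketch below adds it to the BigQuery configuration. The IANA-style name `Europe/Amsterdam` is an assumption about the accepted format, not something this page confirms, so check the expected notation with Onesecondbefore.

```yaml
client_cloud:
    type: google
    db_engine: bigquery
    db_conn_id: google_cloud_default
    storage: gs
    storage_conn_id: google_cloud_default
    project_id: your_project_id
    dataset_id: your_dataset_id
    dataset_id_tmp: your_dataset_id_tmp
    bucket: your_bucket
    folder: ''
    bucket_tmp: your_bucket_tmp
    folder_tmp: ''
    # Assumption: an IANA timezone name; confirm the accepted format.
    timezone: Europe/Amsterdam
```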

job.yml

Description of the job file that has to be present in all jobs.

Usage

config:
    job:
        schedule_interval: '05 2 * * *'
        description: Unit test tasks for Onesecondbefore Transfer
    task:
        slack_channel: my-alert-channel
        slack_on_failure: yes

tasks:
    - from_doubleclick
    - from_google_ads
    - from_facebook
    - from_bing
    - aggregate_data

relationships:
    - from_doubleclick: aggregate_data
    - from_google_ads: aggregate_data
    - from_facebook: aggregate_data
    - from_bing: aggregate_data

Parameters

| property name | type | optional | description |
| --- | --- | --- | --- |
| config | dict | no | Contains the configuration settings that overwrite the default settings from the global.yml file. |
| tasks | array | no | Contains the tasks in this job. The name of each task must exactly match the name of its YAML file without the extension. E.g. if a task file is called from_doubleclick.yml, put from_doubleclick in the list. |
| relationships | array of dicts | no | Contains the relationships between the tasks in this job. Every item is a pair of the form `task from: task to`. The relationships are directional, which means that `task to` is executed when `task from` is done. All tasks that do not appear in a relationship will be executed regardless of the outcome of the other tasks. |
| status | enumerator (0 or 1) | no | 1 if the job is active, 0 if the job is not active. |
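
Because relationships are directional pairs, a task can presumably be the target of one relationship and the source of another, which would chain steps together. The fragment below is a sketch under that assumption. The task names `export_report` and `cleanup_tmp` are hypothetical and would each need a matching YAML task file.

```yaml
tasks:
    - from_doubleclick
    - from_google_ads
    - aggregate_data
    - export_report      # hypothetical task
    - cleanup_tmp        # hypothetical task

relationships:
    # aggregate_data is the "task to" of both import tasks.
    - from_doubleclick: aggregate_data
    - from_google_ads: aggregate_data
    # Chained step: export_report is executed when aggregate_data is done.
    - aggregate_data: export_report

# cleanup_tmp appears in no relationship, so it is executed regardless
# of the outcome of the other tasks.
```

Tasks that must never run before another task has finished should appear in a relationship; leaving a task out, as with `cleanup_tmp` here, makes it independent of the rest of the job.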