Workflows · Job Orchestration

A Workflows job contains one or more tasks that can depend on each other. This section describes how the folders and files in a job should be structured, as well as the structure and settings of the job file. It covers the following:

job
client_cloud
job.yml

They are described in more detail below.

job

Contains all properties related to the job.

Usage

job:
    # Some comment
    email:
        - admin@onesecondbefore.com
    owner: osb
    email_on_failure: no
    email_on_retry: no
    retries: 0
    retry_delay: 5
    schedule: '@daily'

Properties

property	type	optional	description
`client`	string	no	Your client code
`id`	string	no	Read-only. Unique name of the job. Automatically set by Workflows with the name of the job folder.
`max_active_runs`	integer	yes	Default is 1. Number of job runs that may be active at the same time. Use 1 if only a single instance of the job may run.
`backfill`	yesno (boolean)	yes	Default is yes. Use `yes` if you want the job to backfill from the job start date.
`concurrency`	integer	no	Default value is 255. The number of task instances allowed to run concurrently.
`depends_on_past`	yesno (boolean)	no	Default value is yes. Causes a task instance to depend on the success of its previous task instance.
`email`	string	yes	Email address that will be used for notification. Use a comma(,) to add more email addresses.
`email_on_failure`	yesno (boolean)	yes	Default value is no. Sends an email to the email address(es) in the email property when a task fails.
`email_on_retry`	yesno (boolean)	yes	Default value is no. Sends an email to the email address(es) in the email property when a task is retried.
`description`	string	yes	Description for the job, which will show up in the console.
`retries`	integer	no	Default value is 0. Number of times the job is retried if it fails.
`retry_delay`	integer	no	Default value is 5 (minutes). Number of minutes to wait before retrying the job.
`schedule`	string	yes	Default value is None. Leave empty if the job should not be scheduled automatically. Use a cron expression.
`start_date`	string (with a relative date), date or date & time	yes	Contains the start date of the job. This is important if you have set the backfill option to `yes`. Backfill will then start at the start date.
`end_date`	string (with a relative date), date or date & time	yes	Contains the end date of the job. The job will not be scheduled after this moment.
`environment`	enumerator (production, development)	yes	Default is production. If set to development, the sandbox files will overwrite the job and task files.
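
The interaction between `backfill` and `start_date` can be illustrated with a small Python sketch. It rests on assumptions that are not stated in the table: a daily schedule, one run per calendar day, and a backfill range that includes the start date and excludes the current day. The function name `dates_to_backfill` is hypothetical and is not part of Workflows.

```python
from datetime import date, timedelta


def dates_to_backfill(start_date, today, backfill=True):
    """List the past days a daily job would cover when backfill is enabled.

    Assumption: one run per day, from start_date up to (not including) today.
    """
    if not backfill:
        # With backfill set to no, this sketch assumes no past days are run.
        return []
    days = (today - start_date).days
    # range() is empty when start_date lies in the future.
    return [start_date + timedelta(days=n) for n in range(days)]


# Hypothetical dates, chosen only for illustration.
for day in dates_to_backfill(date(2024, 1, 1), date(2024, 1, 4)):
    print(day.isoformat())
```

Under these assumptions, an early `start_date` combined with `backfill: yes` can create many runs at once, which is where `max_active_runs` limits how many of them are active at the same time.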

client_cloud

Contains all properties related to the cloud environment of the client.

Usage

BigQuery

client_cloud:
    type: google
    db_engine: bigquery
    db_conn_id: google_cloud
    db_location: EU
    storage: gs
    storage_conn_id: google_cloud
    project_id: your_project_id
    dataset_id: your_dataset_id
    dataset_id_tmp: your_dataset_id_tmp
    bucket: your_bucket
    folder: ''
    bucket_tmp: your_bucket_tmp
    folder_tmp: ''

Snowflake on Amazon Web Services

client_cloud:
    type: amazon
    db_engine: snowflake
    db_conn_id: snowflake
    database: PRODUCTION
    schema: YOUR_SCHEMA
    schema_tmp: YOUR_SCHEMA_TMP
    storage: s3
    storage_conn_id: amazon_s3
    bucket: your-bucket-id
    folder: production
    bucket_tmp: your-bucket-id-tmp
    folder_tmp: development

Parameters

property name	type	optional	description
`type`	enumerator(google, amazon or azure)	no	Cloud supplier. Either google (for Google), amazon (for Amazon Web Services) or azure (for Microsoft Azure, in beta)
`db_conn_id`	string	no	Connection ID for database access. Contains the name of the connection. Ask Onesecondbefore for your connections.
`db_engine`	enumerator(bigquery or snowflake)	no	Database engine. Currently Onesecondbefore supports Google BigQuery or Snowflake.
`db_location`	enumerator(bigquery query locations)	no	Query location. Default is EU. Currently only supported in Google BigQuery.
`storage_conn_id`	string	no	Connection ID for storage access. Contains the name of the connection. Ask Onesecondbefore for your connections.
`storage`	enumerator(s3, gs or as)	no	Storage type. Supports Google Cloud Storage (gs), S3 of Amazon Web Services (s3) or Azure Blob Storage of Microsoft (as, in beta).
`project_id`	string	no	BigQuery only. Contains the project ID.
`dataset_id`	string	no	BigQuery only. Contains the dataset ID.
`dataset_id_tmp`	string	no	BigQuery only. Will be deprecated. Contains the dataset ID where the temporary tables will be stored.
`table_expiration`	string (with relative date), date, date & time	no	BigQuery only. Sets the table expiration of a table.
`database`	string	no	Snowflake only. Contains the database name.
`schema`	string	no	Snowflake only. Contains the schema name.
`storage_integration`	string	yes	Snowflake only. Contains the storage integration. Don't use this unless you have to.
`bucket`	string	no	Bucket name
`folder`	string	no	Folder in which the files will be stored.
`bucket_tmp`	string	no	Name of the temporary bucket, which is used to store temporary files. Used in combination with `folder_tmp`.
`folder_tmp`	string	no	Folder in which the temporary files will be stored. Used in combination with `bucket_tmp`.
`timezone`	string	no	Contains the timezone of the client. Choose it carefully; it is recommended to keep it the same across all jobs.
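
Because several parameters apply to one database engine only, it can help to check a `client_cloud` mapping before deploying it. The following Python sketch is an illustration, not part of Workflows: the function `check_client_cloud` is hypothetical, and it checks only the enumerated values and the engine-specific keys listed in the table above (`project_id` and `dataset_id` for BigQuery, `database` and `schema` for Snowflake).

```python
# Allowed values, taken from the type, db_engine and storage rows of the table.
ALLOWED = {
    "type": {"google", "amazon", "azure"},
    "db_engine": {"bigquery", "snowflake"},
    "storage": {"gs", "s3", "as"},
}

# Engine-specific keys marked "BigQuery only" or "Snowflake only" in the table.
ENGINE_KEYS = {
    "bigquery": {"project_id", "dataset_id"},
    "snowflake": {"database", "schema"},
}


def check_client_cloud(cfg):
    """Return a list of problems found in a client_cloud mapping (hypothetical helper)."""
    problems = []
    for key, allowed in ALLOWED.items():
        value = cfg.get(key)
        if value not in allowed:
            problems.append(f"{key}: expected one of {sorted(allowed)}, got {value!r}")
    # Only check engine-specific keys when the engine itself is recognised.
    for key in sorted(ENGINE_KEYS.get(cfg.get("db_engine"), set())):
        if key not in cfg:
            problems.append(f"{key}: required for {cfg['db_engine']}")
    return problems


# A Snowflake configuration that is missing its schema.
incomplete = {
    "type": "amazon",
    "db_engine": "snowflake",
    "storage": "s3",
    "database": "PRODUCTION",
}
problems = check_client_cloud(incomplete)
print(problems)
```

The mapping passed in would normally be the `client_cloud` block after it has been loaded from YAML; the loading step is left out to keep the sketch free of third-party dependencies.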

job.yml

Describes the job file, which must be present in every job.

Usage

config:
    job:
        schedule: '05 2 * * *'
        description: Unit test tasks for Onesecondbefore Workflows
    task:
        slack_channel: my-alert-channel
        slack_on_failure: yes

tasks:
    - from_doubleclick
    - from_google_ads
    - from_facebook
    - from_bing
    - aggregate_data

relationships:
    - from_doubleclick: aggregate_data
    - from_google_ads: aggregate_data
    - from_facebook: aggregate_data
    - from_bing: aggregate_data

Parameters

property name	type	optional	description
`config`	dict	no	Contains the configuration settings that overwrite the default settings from the global.yml file.
`tasks`	array	no	Contains the tasks in this job. The name of the task must be the exact same name as the YAML file without the extension. E.g. if a task file is called from_doubleclick.yml, put from_doubleclick in the list.
`relationships`	array of dicts	no	Contains the relationships between the tasks in this job. Every item is a pair of the form `task from: task to`. Relationships are directional: `task to` is executed once `task from` is done. Tasks that do not appear in any relationship are executed regardless of the outcome of the other tasks.
`status`	enumerator (0 or 1)	no	1 if job is active, 0 if job is not active
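
To see how `tasks` and `relationships` together determine an execution order, the following Python sketch derives one valid order from the usage example above. It is an illustration under assumptions, not the scheduler Workflows uses: the helper `execution_order` is hypothetical, and it relies on the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Task names and relationships copied from the job.yml usage example.
tasks = [
    "from_doubleclick",
    "from_google_ads",
    "from_facebook",
    "from_bing",
    "aggregate_data",
]
relationships = [
    {"from_doubleclick": "aggregate_data"},
    {"from_google_ads": "aggregate_data"},
    {"from_facebook": "aggregate_data"},
    {"from_bing": "aggregate_data"},
]


def execution_order(tasks, relationships):
    """Return one valid execution order for the tasks in a job (hypothetical helper)."""
    # Map every task to the set of tasks that must finish before it starts.
    predecessors = {task: set() for task in tasks}
    for pair in relationships:
        for task_from, task_to in pair.items():
            for name in (task_from, task_to):
                if name not in predecessors:
                    raise ValueError(f"unknown task in relationships: {name}")
            predecessors[task_to].add(task_from)
    # static_order() raises graphlib.CycleError if the relationships are circular.
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(tasks, relationships)
print(order)
```

In this example the four `from_*` tasks have no predecessors and can run in any order, while `aggregate_data` always comes last. The same check also catches a task name in `relationships` that is missing from `tasks`.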