A Workflows job can contain one or more interdependent tasks. This section describes how the folders and files in a job should be structured, and the structure and settings of the job file. Both are described in more detail below.
The `job` section contains all properties related to the job.
```yaml
job:
  # Some comment
  email:
    - admin@onesecondbefore.com
  owner: osb
  email_on_failure: no
  email_on_retry: no
  retries: 0
  retry_delay: 5
  schedule: '@daily'
```
property | type | optional | description |
---|---|---|---|
client | string | no | Your client code |
id | string | no | Read-only. Unique name of the job. Automatically set by Workflows with the name of the job folder. |
max_active_runs | integer | yes | Default is 1. Number of job runs that may be active at the same time. Use 1 if only a single instance of the job may run at once. |
backfill | yesno (boolean) | yes | Default is yes. Use `yes` if you want the job to backfill from the job's start date. |
concurrency | int | no | Default value is 255. The number of task instances allowed to run concurrently |
depends_on_past | yesno (boolean) | no | Default value is `yes`. Causes a task instance to depend on the success of its previous task instance. |
email | string | yes | Email address used for notifications. Use a comma (,) to separate multiple email addresses. |
email_on_failure | yesno (boolean) | yes | Default value is no. Sends an email to the email address(es) in the email property when a task fails. |
email_on_retry | yesno (boolean) | yes | Default value is no. Sends an email to the email address(es) in the email property when a task will retry. |
description | string | yes | Description for the job, which will show up in the console. |
retries | int | no | Default value is 0. Amount of retries of the job, if it failed. |
retry_delay | int | no | Default value is 5 (minutes). Amount of minutes to wait before trying the job again. |
schedule | string | yes | Default value is None. Use a cron expression; leave empty to disable automatic scheduling. |
start_date | string (with a relative date), date or date & time | yes | Contains the start date of the job. This is important if you have set the backfill option to `yes`. Backfill will then start at the start date. |
end_date | string (with a relative date), date or date & time | yes | Contains the end date of the job. The job will not be scheduled after this moment. |
environment | enumerator (production, development) | yes | Default is production. If set to development, the sandbox files will overwrite the job and task files. |
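As a sketch of how these properties combine (all values here are hypothetical and should be replaced with your own), a job that backfills from a fixed start date and notifies on failure might look like:

```yaml
# Hypothetical example; adjust the values for your own job.
job:
  client: acme                  # your client code
  description: Daily import of example data
  schedule: '30 3 * * *'        # cron expression: every day at 03:30
  start_date: 2023-01-01        # backfill starts here when backfill is yes
  backfill: yes
  max_active_runs: 1            # only one run active at a time
  email: admin@example.com,ops@example.com
  email_on_failure: yes
  retries: 2
  retry_delay: 10               # minutes to wait before retrying
```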
The `client_cloud` section contains all properties related to the cloud environment of the client. Two examples follow: one for Google Cloud (BigQuery) and one for Amazon Web Services (Snowflake).
```yaml
client_cloud:
  type: google
  db_engine: bigquery
  db_conn_id: google_cloud
  db_location: EU
  storage: gs
  storage_conn_id: google_cloud
  project_id: your_project_id
  dataset_id: your_dataset_id
  dataset_id_tmp: your_dataset_id_tmp
  bucket: your_bucket
  folder: ''
  bucket_tmp: your_bucket_tmp
  folder_tmp: ''
```
```yaml
client_cloud:
  type: amazon
  db_engine: snowflake
  db_conn_id: snowflake
  database: PRODUCTION
  schema: YOUR_SCHEMA
  schema_tmp: YOUR_SCHEMA_TMP
  storage: s3
  storage_conn_id: amazon_s3
  bucket: your-bucket-id
  folder: production
  bucket_tmp: your-bucket-id-tmp
  folder_tmp: development
```
property name | type | optional | description |
---|---|---|---|
type | enumerator(google, amazon or azure) | no | Cloud supplier: google (Google Cloud), amazon (Amazon Web Services), or azure (Microsoft Azure, in beta). |
db_conn_id | string | no | Connection ID for database access. Contains the name of the connection. Ask Onesecondbefore for your connections. |
db_engine | enumerator(bigquery or snowflake) | no | Database engine. Currently Onesecondbefore supports Google BigQuery or Snowflake. |
db_location | enumerator(bigquery query locations) | no | Query location. Default is EU. Currently only supported in Google BigQuery. |
storage_conn_id | string | no | Connection ID for storage access. Contains the name of the connection. Ask Onesecondbefore for your connections. |
storage | enumerator(s3 or gs) | no | Storage type. Supports Google Cloud Storage (gs), Amazon Web Services S3 (s3), or Microsoft Azure Blob Storage (as, in beta). |
project_id | string | no | BigQuery only. Contains the project ID. |
dataset_id | string | no | BigQuery only. Contains the dataset ID. |
dataset_id_tmp | string | no | BigQuery only. Will be deprecated. Contains the dataset ID where the temporary tables will be stored. |
table_expiration | string (with relative date), date, date & time | no | BigQuery only. Sets the table expiration of a table. |
database | string | no | Snowflake only. Contains the database name. |
schema | string | no | Snowflake only. Contains the schema name. |
storage_integration | string | yes | Snowflake only. Contains the storage integration. Don't use this unless you have to. |
bucket | string | no | Bucket name |
folder | string | no | Folder in which the files will be stored. |
bucket_tmp | string | no | Bucket name of the temporary bucket, used to store temporary files. Used in combination with `folder_tmp`. |
folder_tmp | string | no | Folder in which the temporary files will be stored. Is used in combination with `bucket_tmp`. |
timezone | string | no | Contains the timezone of the client. Choose it carefully; it is recommended to keep it the same across all jobs. |
Below is a description of the job file that must be present in every job.
```yaml
config:
  job:
    schedule: '05 2 * * *'
    description: Unit test tasks for Onesecondbefore Workflows
  task:
    slack_channel: my-alert-channel
    slack_on_failure: yes
tasks:
  - from_doubleclick
  - from_google_ads
  - from_facebook
  - from_bing
  - aggregate_data
relationships:
  - from_doubleclick: aggregate_data
  - from_google_ads: aggregate_data
  - from_facebook: aggregate_data
  - from_bing: aggregate_data
```
property name | type | optional | description |
---|---|---|---|
config | dict | no | Contains the configuration settings that overwrite the default settings from the global.yml file. |
tasks | array | no | Contains the tasks in this job. The name of the task must be the exact same name as the YAML file without the extension. E.g. if a task file is called from_doubleclick.yml, put from_doubleclick in the list. |
relationships | array of dicts | no | Contains the relationships between the tasks in this job. Every item is a pair of the form `task_from: task_to`. Relationships are directional: `task_to` is executed when `task_from` is done. Tasks that do not appear in any relationship are executed regardless of the outcome of the other tasks. |
status | enumerator (0 or 1) | no | 1 if job is active, 0 if job is not active |
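The example above fans four source tasks into one aggregation task, but relationship pairs can also be chained to run tasks sequentially. A hypothetical sketch (task names are illustrative, not part of the product):

```yaml
tasks:
  - extract_data
  - transform_data
  - load_data
relationships:
  - extract_data: transform_data    # transform_data runs after extract_data
  - transform_data: load_data       # load_data runs after transform_data
```

Because each pair is directional, this chain runs the three tasks one after the other, while the fan-in form in the example above lets the four source tasks run in parallel before `aggregate_data`.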