Workflows · Do Tasks · do_sessionize

Purpose

The do_sessionize task is used to sessionize the Analytics data in the analytical database.

Example usage

task:
    type: do_sessionize
    slack_channel: osb-platform-alerts
    slack_on_failure: yes
    start_date: today -3 days
    # end_date: today +1 days

extract:
    custom_where: | 
                    AND user_agent.is_crawler <> TRUE 

load:
    table_id: hits3_1
    table_id_longterm: hits3_longterm
    table_id_trash: trash3
    table_id_user_history: user_history
    table_id_profiles: profiles
    dataset_id: osb_analytics

    # Include / exclude hits from sessionization (but leave them in the hits table)
    # where:
    #   - field: user_agent.is_crawler
    #     operator: '!='
    #     value: yes

    # Which ids are stitching keys?
    stitching:
      - key: email
        regexp: ^[0-9a-f]{64}$

    # Which ids are whitelisted in profiles table?
    ids:
      - ga
      - xandr
      - email

Root properties

The properties below can be used at root level.

task
extract
transforms
deduplicate
load

Properties: task

property	type	required	description
`type`	enumerator	yes	Set to `do_sessionize`
`start_date`	relative or absolute date or date & time	yes	Start date of the period that will be selected in the datasource. Can be filled with an absolute or relative date. Read more about relative date and time here. It is used by this task to scan the hits table for unsessionized hits. The start_date is then corrected to timestamp of the earliest non-sessionized hit. This is done to reduce data scanning to a minimum.

Properties: extract

property	type	required	description
`custom_where`	string	no	String will be added to the query that selects the hits that will be sessionized. Use this to customize the selection, e.g. for ignoring hits from a certain IP address or particular user agent.

transforms

The `transforms` part of the configuration enables you to manipulate data before you load it in your data lake. You have to add all transform as an array.

Example usage

transforms:
    # IP address filter
    - type: filter
      exclude: yes
      field: ip_address.raw
      values:
        - ::ffff:127.0.0.1 # localhost IPv6
        - 127.0.0.1 # Locahost IPv4
    
    # Test browsers (e.g. HeadlessChrome)
    - type: filter_regexp
      exclude: yes
      field: user_agent.raw
      values:
        - .*HeadlessChrome.*
    
    # Development URL's
    - type: filter_regexp
      field: page.url.raw
      exclude: yes
      values:
        - file:///.*

Transform types

Below are the types of transforms. You have to add them as an array. Below is an explanation per transform on how to use it exactly.

filter
filter_regexp

transform: filter

Filter out the hits that match the string filter and move them to the `trash` table.

property	type	required	description
`type`	string	no	Set to `filter`
`field`	string	no	Name of the table field. Only works on fields at root level.
`values`	array of strings	no	Filter values. If filter matches positive on any of the values, the complete hit is moved to the trash table. The value of the system.status field will be 1 and the run_id will be put in the system.run_id.

transform: filter_regexp

Filter out the hits that match the regular expression filter and move them to the `trash` table.

property	type	required	description
`type`	string	yes	Set to `filter`
`field`	string	yes	Name of the table field. Only works on fields at root level.
`values`	array of regular expressions	yes	Filter regular expression values. If filter matches positive on any of the values, the complete hit is moved to the trash table. The value of the system.status field will be 2 and the run_id will be put in the system.run_id.

Properties: Deduplicate

property	type	required	description
`type`	enumerator(redo,reset)	no	`redo` = Use if you want to redo the sessionization partially. In combination with a start_date in task it will first undo the sessionization of the previous runs and the re-sessionize the hits as usual. `reset` = Use if you want to completely redo the sessionization. It will drop the profiles, sessions, user_profiles and trash tables before undoing all sessionization.
`start_date`	relative or absolute date or date & time	yes	Start date of the period that will be selected in the datasource. Can be filled with an absolute or relative date. Read more about relative date and time here. It is used to determine the run_id that sessionized the hit at the start_date. All hits from that run_id and the run_ids after will be re-sessionized. Handle with care. It might lead to new numbers in some cases, especially with changed filters.

Properties: load

property	type	required	description
`project_id`	string	no	BigQuery only. Contains the ID of the project that holds all tables.
`dataset_id`	string	no	BigQuery only. Contains the ID of the dataset that has all the tables.
`table_id`	string	yes	BigQuery only. Contains the ID of the table that holds the (short-term) hits. This table is time partitioned by the hour. It contains hits including and later than 160 days ago.
`table_id_longterm`	string	yes	BigQuery only. Contains the ID of the table that holds the longterm hits. It is time partitioned by day. It contains hit up till (and does not include) 160 days ago.
`table_id_trash`	string	yes	BigQuery only. Contains the ID of the trash table. Holds all the hits that matched the filters.
`table_id_user_history`	string	yes	BigQuery only. Contains the ID of the user history table. Contains the last session number per cookie of the last X runs.
`table_id_profiles`	string	yes	BigQuery only. Contains the ID of the profiles table.
`site_id`	string	no	Only sessionizes hits that have this site_id. Leaves all other hits in the hits table. It does not move hits to the trash. For that purpose, you should use a filter.
`stitching`	array of objects	no	Use `key`, `include` and `regexp`. Stitch users on these keys. Only consider key values that match the `regexp`.
`ids`	array of strings	no	Ids to add to the profiles table.