Connectors#

A number of open-source Trel connectors are available. See trel_contrib for a complete list.

Let us review some important ones.

bigquery_table#

This sensor monitors a given dataset in BigQuery, looking for tables with an optional prefix and a date/time component. The time component is used to catalog the table appropriately.

E.g., table abc_20210201 will be cataloged as <dataset_class>,,20210201,<label>,<repository>

bq_table_prefix has to contain the full prefix, e.g. abc_ for abc_20210201. Just abc or ab is not enough.
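To make the matching rule concrete, here is a minimal Python sketch of it, assuming the default instance_ts_format of "%Y%m%d". The parse_table_name helper is hypothetical and not part of the sensor code.

from datetime import datetime

def parse_table_name(name, prefix, instance_ts_format="%Y%m%d"):
    # Return the instance timestamp for a matching table name, else None.
    if not name.startswith(prefix):
        return None
    try:
        return datetime.strptime(name[len(prefix):], instance_ts_format)
    except ValueError:
        return None

parse_table_name("abc_20210201", "abc_")  # -> datetime(2021, 2, 1, 0, 0)
parse_table_name("abc_20210201", "abc")   # -> None: "_20210201" does not parse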

Credentials: gcp.service_json

TODO: Add an option to query using table metadata rather than the API. This would scale better.

Given below is the template configuration file:

sensor_id: # specify

sensor.source_code.main:
  class: github
  branch: main
  path: https://github.com/cumulativedata/trel_contrib.git
sensor.main_executable: _code/sensors/bigquery_table/sensor_bigquery_table.py

manager_name: main
credentials.requested_name: default

dataset_class: # specify 
instance_ts_precision: D # change based on frequency of drop
label: master
repository: # specify destination repository

# The portion of the table name after the prefix will be parsed
# using this format into a timestamp
# instance_ts_format: # default "%Y%m%d" 

sleep_duration: 10 # The sensor sleeps for this long between checks.

bq_project: # specify
bq_dataset: # specify
bq_table_prefix: # specify

# Don't look for or insert the table into the catalog if
# its date or time is more than this many seconds old.
max_instance_age_seconds: 864000 

debug: [ ] # to debug the sensor add strings "writes" and "reads" into this list

odbc_table_load#

The code is available in trel_contrib at _code/sensors/odbc_table_load/odbc_table_load.py.

This sensor loads from ODBC data sources. See the config.yml file below for possible options.

sensor_id: # specify

sensor.source_code.main:
  class: github
  branch: main
  path: https://github.com/cumulativedata/trel_contrib.git
sensor.main_executable: _code/sensors/odbc_table_load/odbc_table_load.py

manager_name: main
credentials.requested_name: default

dataset_class: # specify 
instance_prefix: null # it is recommended that you specify this key explicitly, even when it is null.
instance_ts_precision: D # change based on frequency of drop
label: master
repository: # specify destination repository

# See schema management for details. For now, this is only used as an attribute
# of the destination dataset and does not influence the schema of the destination.
# That is decided by querying the source table only.
schema_version: null

sleep_duration: 10 # Sleep for this long after each check

# It will generate a dataset for each timestamp in this set, subject to the precision.
cron_constraint: "0 0 * * *"

# Connection / credentials information
driver: MariaDB # Pick a suitable one
server:  # of the ODBC data source
port: 
username: # for accessing the ODBC data source
password: # for accessing the ODBC data source

# Instead of placing the credentials as shown above, these 5 entries
# can be stored in a key in credentials management and referenced here.
# This takes precedence over the above section.
credentials.source_key: # e.g. odbc.<server_name>

database: # Where the table is located
table: # The table you want to load.

# How many seconds after the previous period ends is the next period considered ready?
# E.g., normally, with an hourly cadence, instance 20220101_04 will be ready to load
# as of 2022-01-01 05:00:00. With this parameter set to 30, it will be ready only
# as of 2022-01-01 05:00:30.
delay_seconds: 30

# Specify a string here that you want run instead.
# The string will be formatted using the following variables:
# 1. table (with value given above)
# 2. instance_ts (will correspond to what this query result will be registered as)
# 3. instance_ts_precision (with value given above)
custom_sql: "select col1, col2 from {table} where col1 >= '{instance_ts}'"

# Don't insert entries into the catalog whose
# instance_ts is older than this many seconds.
max_instance_age_seconds: 864000 

debug: [ ] # to debug the sensor add strings "writes" and "reads" into this list

The behavior is as follows:

  1. The sensor queries which relevant datasets already exist in the catalog. It uses the following parameters for this:

  • dataset_class

  • instance_ts_precision

  • label

  • repository

  • instance_prefix

  • max_instance_age_seconds

  2. The sensor looks at the clock and identifies which datasets are relevant to load. The relevant parameters are:

  • cron_constraint

  • delay_seconds

  • max_instance_age_seconds

  • backfill_newest_first (true / false)

Here, treat the cron_constraint as a set of timestamps rather than a timer. These timestamps define the periods: each period starts at a timestamp in the cron_constraint and ends just before the next one.

This sensor will try to create a dataset for each period. The resulting dataset’s instance_ts will match the period start.

A period is considered ready to load delay_seconds after it ends. The sensor will not try to load a period whose start is more than max_instance_age_seconds in the past. A sketch of this period logic is given below.
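The following Python sketch illustrates the period logic, assuming the croniter library. The ready_periods helper is hypothetical, not the sensor's actual implementation; it yields periods newest first, which in the real sensor is governed by the backfill_newest_first option.

from datetime import datetime, timedelta
from croniter import croniter

def ready_periods(cron_constraint, delay_seconds, max_instance_age_seconds, now=None):
    # Yield (period_start, period_end) pairs that are ready to load, newest first.
    now = now or datetime.utcnow()
    oldest = now - timedelta(seconds=max_instance_age_seconds)
    end = croniter(cron_constraint, now).get_prev(datetime)  # end of latest finished period
    while True:
        start = croniter(cron_constraint, end).get_prev(datetime)  # tick before `end`
        if start < oldest:
            break
        # Ready delay_seconds after the period ends; the resulting
        # dataset's instance_ts will match `start`.
        if now >= end + timedelta(seconds=delay_seconds):
            yield start, end
        end = start

# Hourly cadence, 30-second delay, 10-day maximum age:
for start, end in ready_periods("0 * * * *", 30, 864000):
    print(start, "->", end)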

  3. Once a dataset that should be created is identified, a connection is established to the ODBC data source using:

  • driver

  • server

  • port

  • username

  • password

  • credentials.requested_name

  • credentials.source_key -> odbc.<server_name>

  4. Then the relevant data is downloaded in pieces and uploaded to the destination (see the sketch after this list):

  • database

  • table

  • custom_sql

  • repository

  • credentials.requested_name
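Here is a minimal Python sketch of steps 3 and 4, assuming the pyodbc library. The connection values are hypothetical stand-ins for the config keys above, and the upload step is elided.

import pyodbc

# Hypothetical values; in the sensor these come from the config keys
# driver, server, port, username, password (or from credentials.source_key).
conn = pyodbc.connect(
    "DRIVER={MariaDB};SERVER=db.example.com;PORT=3306;"
    "UID=loader;PWD=secret;DATABASE=sales"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM events")  # or the formatted custom_sql

while True:
    rows = cursor.fetchmany(10000)  # download in pieces
    if not rows:
        break
    # ... upload this chunk to the destination repository (s3 / bigquery) ...

conn.close()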

Warning

When custom_sql is used, make sure that for tabular destinations such as BigQuery, the result has the same schema as the table.

This restriction does not apply to object store destinations such as S3 and Google Storage.

Custom SQL#

The provided custom_sql will be formatted using Python's .format rather than ODBC's SQL parameterization. This is largely for convenience and readability. The rationale for not worrying about SQL injection is that the SQL is provided by the developer and parameterized by Trel using strings that are not selectable by anyone.

Only four parameters are possible:

  1. table: This is the table selected by the developer in this config.

  2. instance_ts: This corresponds to the dataset and is a datetime object.

  3. instance_ts_precision: This takes one of 4 values: ‘D’, ‘H’, ‘M’ and ‘S’.

  4. period_end: This is the end of this period. The period starts at instance_ts.
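For example, a custom_sql string might be formatted as in the sketch below. The table name and timestamps are hypothetical; note that keys unused by the string are simply ignored by .format.

from datetime import datetime

custom_sql = ("select col1, col2 from {table} "
              "where col1 >= '{instance_ts}' and col1 < '{period_end}'")
query = custom_sql.format(
    table="events",                    # hypothetical table name
    instance_ts=datetime(2022, 1, 1),  # period start, a datetime object
    instance_ts_precision="D",         # unused here; ignored by .format
    period_end=datetime(2022, 1, 2),   # period end
)
# query: select col1, col2 from events
#        where col1 >= '2022-01-01 00:00:00' and col1 < '2022-01-02 00:00:00'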

Please take a look at Pre-installed ODBC Drivers for a list of pre-installed ODBC drivers in the Trel instance for the driver parameter.

Supported repository classes as destination:

  1. s3

  2. bigquery

Pre-installed ODBC Drivers#

  1. MySQL

  2. MariaDB

  3. PostgreSQL