# Connectors
A number of open-source Trel connectors are available. See trel_contrib for a complete list. Let us review some important ones.
## bigquery_table
This sensor monitors a given dataset in BigQuery, looking for tables with an optional prefix and a date/time component. The time component is used to catalog the table appropriately. E.g., table `abc_20210201` will be cataloged as `<dataset_class>,,20210201,<label>,<repository>`.
`bq_table_prefix` must contain the full prefix, e.g., `abc_` for `abc_20210201`. Just `abc` or `ab` is not enough.
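As an illustration, here is a minimal sketch of how a table name could be split into the prefix and a timestamp. The helper name is hypothetical; the actual logic lives in `_code/sensors/bigquery_table/sensor_bigquery_table.py` in trel_contrib.

```python
from datetime import datetime

# Hypothetical illustration of deriving an instance timestamp
# from a BigQuery table name; not the sensor's actual code.
def parse_table_name(table_id, prefix, instance_ts_format="%Y%m%d"):
    """Return the instance_ts encoded in a table name, or None."""
    if not table_id.startswith(prefix):
        return None  # the full prefix must match, e.g. "abc_"
    suffix = table_id[len(prefix):]
    try:
        return datetime.strptime(suffix, instance_ts_format)
    except ValueError:
        return None  # suffix is not a timestamp in the expected format

ts = parse_table_name("abc_20210201", "abc_")
print(ts)  # 2021-02-01 00:00:00 -> cataloged as <dataset_class>,,20210201,<label>,<repository>
```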
Credentials: `gcp.service_json`

TODO: Add an option to query using table metadata rather than the API. This would scale better.
Given below is the template configuration file:
```yaml
sensor_id: # specify
sensor.source_code.main:
  class: github
  branch: main
  path: https://github.com/cumulativedata/trel_contrib.git
sensor.main_executable: _code/sensors/bigquery_table/sensor_bigquery_table.py
manager_name: main
credentials.requested_name: default
dataset_class: # specify
instance_ts_precision: D # change based on frequency of drop
label: master
repository: # specify destination repository

# The portion of the table name after the prefix will be parsed
# using this format into a timestamp
# instance_ts_format: # default "%Y%m%d"

sleep_duration: 10 # The sensor sleeps for this long between checks.
bq_project: # specify
bq_dataset: # specify
bq_table_prefix: # specify

# Don't look for or insert the table into the catalog if
# its date or time is older than this many seconds.
max_instance_age_seconds: 864000 # 10 days

debug: [] # to debug the sensor, add the strings "writes" and "reads" to this list
```
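Building on the parsing helper sketched above, the sensor's polling loop conceptually looks like the following. This is a rough sketch, not the sensor's actual code: `register_in_catalog` and the `cfg` dict (holding the YAML values above) are hypothetical.

```python
import time
from datetime import datetime, timedelta
from google.cloud import bigquery

def check_once(client, cfg):
    """One polling pass: find prefixed tables young enough to catalog."""
    cutoff = datetime.utcnow() - timedelta(seconds=cfg["max_instance_age_seconds"])
    for item in client.list_tables(f"{cfg['bq_project']}.{cfg['bq_dataset']}"):
        ts = parse_table_name(item.table_id, cfg["bq_table_prefix"])
        if ts is not None and ts >= cutoff:
            # hypothetical call: files the table under its instance_ts
            register_in_catalog(cfg["dataset_class"], ts, cfg["label"], cfg["repository"])

while True:
    check_once(bigquery.Client(), cfg)
    time.sleep(cfg["sleep_duration"])  # sleep between checks
```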
## odbc_table_load
Find the code in trel_contrib under `_code/sensors/odbc_table_load/odbc_table_load.py`.
This sensor loads from ODBC data sources. See the config.yml file below for possible options.
```yaml
sensor_id: # specify
sensor.source_code.main:
  class: github
  branch: main
  path: https://github.com/cumulativedata/trel_contrib.git
sensor.main_executable: _code/sensors/odbc_table_load/odbc_table_load.py
manager_name: main
credentials.requested_name: default
dataset_class: # specify
instance_prefix: null # it is recommended you specify a value here, even when null.
instance_ts_precision: D # change based on frequency of drop
label: master
repository: # specify destination repository

# See schema management for details. For now, this is only used as the attribute
# for the destination dataset and does not influence the schema of the destination.
# That is decided by querying the source table only.
schema_version: null

sleep_duration: 10 # Sleep for this long after each check

# The sensor will generate a dataset for each timestamp in this set, subject to the precision.
cron_constraint: "0 0 * * *"

# Connection / credentials information
driver: MariaDB # pick a suitable one
server: # of the ODBC data source
port:
username: # for accessing the ODBC data source
password: # for accessing the ODBC data source

# Instead of placing the credentials as shown above, these 5 entries
# can be stored in a key in credentials management and referenced here.
# This takes precedence over the section above.
credentials.source_key: # e.g. odbc.<server_name>

database: # where the table is located
table: # the table you want to load

# How many seconds after the end of a period is it considered ready to load?
# E.g., normally, with an hourly cadence, instance 20220101_04 will be ready to load
# as of 2022-01-01 05:00:00. With this parameter set to 30, it will be ready only
# as of 2022-01-01 05:00:30.
delay_seconds: 30

# Specify a string here that you want run instead.
# The string will be formatted using the following variables:
#   1. table (with the value given above)
#   2. instance_ts (will correspond to what this query result will be registered as)
#   3. instance_ts_precision (with the value given above)
custom_sql: "select col1, col2 from {table} where col1 >= '{instance_ts}'"

# Don't insert into the catalog entries with an
# instance_ts older than this many seconds.
max_instance_age_seconds: 864000 # 10 days

debug: [] # to debug the sensor, add the strings "writes" and "reads" to this list
```
The behavior is as follows:
The sensor queries the catalog for relevant datasets that already exist. It uses the following parameters for this:

- `dataset_class`
- `instance_ts_precision`
- `label`
- `repository`
- `instance_prefix`
- `max_instance_age_seconds`
Next, the sensor looks at the clock and identifies which datasets are relevant to load. The parameters of relevance are:

- `cron_constraint`
- `delay_seconds`
- `max_instance_age_seconds`
- `backfill_newest_first` (true / false)
Here, treat the `cron_constraint` as a set rather than a timer: it consists of a set of timestamps. The `cron_constraint` defines the periods; each period starts at a timestamp in the `cron_constraint` and ends just before the next one.

This sensor will try to create a dataset for each period. The resulting dataset's `instance_ts` will match the period start. A period is considered ready to load `delay_seconds` after the end of the period. The sensor will not try to load a period whose start is older than `max_instance_age_seconds` from now.
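As a concrete illustration, the sketch below enumerates the periods that would currently be considered ready, assuming the croniter library. The function is illustrative, not the sensor's actual code.

```python
from datetime import datetime, timedelta
from croniter import croniter

def ready_periods(cron_constraint, delay_seconds, max_instance_age_seconds, now=None):
    """Yield (period_start, period_end) pairs that are ready to load, newest first."""
    now = now or datetime.utcnow()
    oldest = now - timedelta(seconds=max_instance_age_seconds)
    itr = croniter(cron_constraint, now)
    period_end = itr.get_prev(datetime)  # most recent cron timestamp before now
    while True:
        period_start = itr.get_prev(datetime)
        if period_start < oldest:
            break  # period start exceeds max_instance_age_seconds; stop
        if now >= period_end + timedelta(seconds=delay_seconds):
            yield (period_start, period_end)  # instance_ts = period_start
        period_end = period_start

# With "0 0 * * *", this yields one daily period at a time.
for start, end in ready_periods("0 0 * * *", 30, 864000):
    print(start, "->", end)
```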
Once a dataset that should be created is identified, a connection is established to the ODBC data source using the following parameters:

- `driver`
- `server`
- `port`
- `username`
- `password`
- `credentials.requested_name`
- `credentials.source_key` -> `odbc.<server_name>`
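For instance, with pyodbc (one common Python ODBC binding; an assumption here, since the sensor could use any ODBC layer), the connection would be assembled roughly like this, with all values illustrative:

```python
import pyodbc

# Illustrative only: build an ODBC connection string from the config values.
conn = pyodbc.connect(
    "DRIVER={{{driver}}};SERVER={server};PORT={port};"
    "UID={username};PWD={password};DATABASE={database}".format(
        driver="MariaDB", server="db.example.com", port=3306,
        username="loader", password="secret",  # or resolved via credentials.source_key
        database="sales",
    )
)
```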
Then the relevant data is downloaded in pieces and uploaded to the destination. The parameters of relevance are:

- `database`
- `table`
- `custom_sql`
- `repository`
- `credentials.requested_name`
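Conceptually, the download-in-pieces step looks like the following sketch, again assuming pyodbc; `upload_piece` is a hypothetical stand-in for the repository upload:

```python
# Stream the source table in chunks rather than loading it all at once.
cursor = conn.cursor()
cursor.execute("select * from orders")  # or the formatted custom_sql
while True:
    rows = cursor.fetchmany(10000)  # one "piece"
    if not rows:
        break
    upload_piece(rows)  # hypothetical: writes to the destination repository (e.g. s3, bigquery)
```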
**Warning:** When `custom_sql` is used, make sure that for tabular destinations such as BigQuery, the result has the same schema as the table. This restriction does not apply to object store destinations such as S3 and Google Storage.
### Custom SQL
The `custom_sql` that is provided will be formatted using Python's `.format` rather than ODBC's SQL parameterization. This is largely for convenience and readability. The rationale for not worrying about SQL injection is that the SQL is provided by the developer and parameterized by Trel using strings not selectable by anyone.

Only four parameters are possible:

- `table`: This is the table selected by the developer in this config.
- `instance_ts`: This corresponds to the dataset and is a datetime object.
- `instance_ts_precision`: This takes one of 4 values: 'D', 'H', 'M' and 'S'.
- `period_end`: This is the end of this period. The period starts at `instance_ts`.
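A minimal sketch of this substitution, using plausible values (the exact call site inside Trel is an assumption):

```python
from datetime import datetime, timedelta

custom_sql = ("select col1, col2 from {table} "
              "where col1 >= '{instance_ts}' and col1 < '{period_end}'")
instance_ts = datetime(2022, 1, 1)
sql = custom_sql.format(
    table="orders",
    instance_ts=instance_ts,                     # datetime object for this dataset
    instance_ts_precision="D",                   # one of 'D', 'H', 'M', 'S'
    period_end=instance_ts + timedelta(days=1),  # period start + one cron period
)
print(sql)
# select col1, col2 from orders where col1 >= '2022-01-01 00:00:00' and col1 < '2022-01-02 00:00:00'
```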
Please take a look at Pre-installed ODBC Drivers below for the list of ODBC drivers pre-installed in the Trel instance for the `driver` parameter.
Supported repository classes as destination:

- `s3`
- `bigquery`
## Pre-installed ODBC Drivers

- MySQL
- MariaDB
- PostgreSQL