Pipeline configuration

How to configure Baker using TOML files

A Baker pipeline is declared in a configuration file in TOML format. We use this file to:

  • define the topology (i.e the list of components) of the pipeline we want to run
  • configure each component
  • setup general elements such as metrics

Configuration file

Baker is configured using a TOML file, which content is processed by the NewConfigFromToml function.

The file has several sections, described below:

Section Required Content
[general] false General configuration
[metrics] false Metrics service configuration
[fields] false Array of record fields names
[validation] false Input record field validation
[[user]] false Array of user-defined configurations
[input] true Input component configuration
[filterchain] false Filter chain global configuration
[[filter]] false Array of filters configuration
[output] true Output component configuration
[upload] false Upload component configuration

General configuration

The [general] section is used to configure the general behaviour of Baker.

Key Type Effect
dont_validate_fields bool Reports whether records validation is skipped (by not calling Components.Validate)

Fields configuration

The name configuration in the [fields] section provides a declarative way to define the structure of the records processed by Baker, without asking the user to define the FieldByName and FieldName functions.

names is a list of strings declaring the names of the fields and their position in the record (that is inherited by the position of the name in the list).

So, to make an example:

[fields]
names = ["foo", "bar"]

defines a structure of the records with two fields: foo as first element and bar as second.

Validation configuration

The [validation] section is an optional configuration that contains one or more field names each of which is associated with a regular expression. If the validation section is specified Baker automatically generates a validation function, which checks that each input record satisfies the provided regular expression. The record is discarded at the first field that doesn’t match its associated regular expression. The user could choose to not provide record validation at all or to implement a more sophisticated validation function using a Go function specified in the Components struct. However, the validation could not be present both in the TOML and in the Components.

To make an example:

[validation]
foo = "^\w+$"
bar = "[0-9]+"

defines that foo field must be a not empty word and bar field must contain a number. In this case, a valid record could be:

foo bar
hello_world hello23

The regular expression reference could be found at golang.org/s/re2syntax

Components configuration

Components sections are [input], [[filter]], [output] and [upload] and contain a name = "<component name>" line and an optional config subsection (like [input.config]) to set specific configuration values to the selected component.

Components’ specific configuration can be marked as required (within the component code). If a required config is missing, Baker won’t start.

This is a minimalist Baker configuration TOML, reading records from files (List), applying the TimestampRange filter and writing the output to DynamoDB, with some specific options:

[input]
name="List"

    [input.config]
    files=["records.csv.gz"]

[[filter]]
name="TimestampRange"

    [filter.config]
    StartDatetime = "2020-10-30 15:00:00"
    EndDatetime = "2020-11-01 00:00:00"
    Field = "timestamp"

[output]
name="DynamoDB"
fields=["source","timestamp","user"]

    [output.config]
    regions=["us-west-2","us-east-1"]
    table="MyTable"
    columns=["s:Source", "n:Timestamp", "s:User"]

[input] selects the input component, or where to read the records from.
In this case, the List component is selected, which is a component that fetches files from a list of local or remote paths/URLs. [input.config] is where component-specific configuration can be specified, and in this case we simply provide the files option to List.
Notice that List would accept http:// or even s3:// URLs there in addition to local paths,
and some more (run ./Baker-bin -help List in the help example for more details).

[filterchain] defines the configuration for the whole filter chain. Filter-specific configurations are provided by [[filter]] (see below). The only accepted configuration in [filterchain] is procs = <int> that defines the number of concurrent filter chains. The default value is 16.

[[filter]] In TOML syntax, the double brackets indicates an array of sections.
This is where you declare the list of filters (i.e filter chain) to sequentially apply to your records. As other components, each filter may be followed by a [filter.config] section.

[output] selects the output component; the output is where records that made it to the end of the filter chain without being discarded end up. In this case, the DynamoDB output is selected, and its configuration is specified in [output.config].

In the example topology above we don’t specify an [upload] section since the output doesn’t create files on the local filesystem, it makes queries to DynamoDB.

The fields option in the [output] section selects which fields of the record are sent to the output.
In fact, most pipelines don’t want to send the full records to the output, but they select a few important fields out of the many available fields.
Notice that this is just a selection: it is up to the output component to decide how to physically serialize those fields. For instance, the DynamoDB component requires the user to specify an option called columns that specifies the name and the type of the column where the fields are written.

Metrics configuration

The [metrics] section allows to configure the monitoring solution to use. Currently, only datadog is supported.

See the dedicated page to learn how to configure DataDog metrics with Baker.

User defined configurations

The baker.NewConfigFromToml function, used by Baker to parse the TOML configuration file, can be also used to add custom configurations to the TOML file (useful as Baker can be used as library in a more complex project).

This is an example of a TOML file defining also some of those user defined configurations (along with the input and output configurations):

[input]
name="random"

[output]
name="recorder"

[[user]]
name="MyConfiG"

	[user.config]
	field1 = 1
	field2 = "hello!"

Using NewConfigFromToml is then possible to retrieve those configurations:

cfg := strings.NewReader(toml) // toml is the content of the toml file

// myConfig contains the user-defined configurations we expect from the toml file
type myConfig struct {
    Field1 int
    Field2 string
}
mycfg := myConfig{}

// comp is the baker components configuration.
// Here we use Inputs and Outputs in addition to User because
// they are required configurations
comp := baker.Components{
    Inputs:  []baker.InputDesc{inputtest.RandomDesc},
    Outputs: []baker.OutputDesc{outputtest.RecorderDesc},
    User:    []baker.UserDesc{{Name: "myconfig", Config: &mycfg}},
}

// Use baker to parse and ingest the configuration file
baker.NewConfigFromToml(cfg, comp)

// Now mycfg has been populated with the user defined configurations:
// myConfig{Field1: 1, Field2: "hello!"}
// and can be used anywhere in the program

More examples can be found in the dedicated test file.

Environment variables replacement

Baker supports environment variables replacement in the configuration file.

Use ${ENV_VAR_NAME} or $ENV_VAR_NAME and the value in the file is replaced at runtime.
Note that if the variable doesn’t exist, then an empty string is used for replacement.

Last modified October 29, 2020