Basic: build a simple pipeline
In this tutorial you’ll learn how to create a Baker-based program to process a given dataset (in CSV format), filter records based on your needs and save the result to S3.
The dataset we’re going to use is an open dataset containing ratings of many ramens, the famous Japanese noodle soup!
Our goal is to discard all ramens that have never appeared in a top-ten ranking, split the results into multiple folders named after the ramens’ source countries, and upload the resulting lists to S3.
The dataset
The dataset file has 7 columns:
- review_num: the number of the review (higher numbers mean more recent reviews)
- brand: the brand that produces the ramen
- variety: the name of the recipe
- style: the type of the ramen (cup, pack, bowl, etc.)
- country: self-explanatory
- stars: the star rating (from 0 to 5)
- top_ten: whether the ramen has been included in a top-ten ranking
Warning
The original CSV file can’t be immediately used with Baker because:
- it includes a header row
- some field values contain commas and are thus enclosed in double quotes, which Baker doesn’t support
- the file is uncompressed
For the purposes of this tutorial we’ve already prepared the final file for you; it is available for download here.
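If you want to prepare a similar file yourself, here is one possible approach as a sketch (not part of the tutorial; the input path, output path and the semicolon replacement are only assumptions): it drops the header row, rewrites quoted fields so that no commas remain inside values, and gzips the result.

package main

import (
	"compress/gzip"
	"encoding/csv"
	"log"
	"os"
	"strings"
)

func main() {
	// Hypothetical source file name; adjust to wherever you saved the raw CSV.
	in, err := os.Open("ramen-ratings.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("/tmp/db.csv.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	gz := gzip.NewWriter(out)
	defer gz.Close()

	// encoding/csv handles the double-quoted fields of the original file.
	records, err := csv.NewReader(in).ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	if len(records) == 0 {
		log.Fatal("empty input file")
	}

	// Skip the header row and write plain, unquoted, comma-separated lines.
	for _, rec := range records[1:] {
		for i, f := range rec {
			// Replace embedded commas so the line can be written without quoting.
			rec[i] = strings.ReplaceAll(f, ",", ";")
		}
		if _, err := gz.Write([]byte(strings.Join(rec, ",") + "\n")); err != nil {
			log.Fatal(err)
		}
	}
}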
The required components
List
: reads the input file from disk
NotNull
: discards all ramens without a top-ten entry
FileWriter
: saves the resulting file to disk
S3
: uploads the file to S3
Baker configuration
The first step is to create a configuration file for Baker, in TOML format, that selects the aforementioned components:
[fields]
names = ["review_num", "brand", "variety", "style", "country", "stars", "top_ten"]
[input]
name = "List"
[input.config]
Files = ["/tmp/db.csv.gz"] # put the file wherever you like
[[filter]]
name = "NotNull"
[filter.config]
Fields = ["top_ten"] # discard all records with an empty top_ten field
[output]
name = "FileWriter"
procs = 1 # With our PathString, FileWriter doesn't support concurrency
fields = ["country"]
[output.config]
PathString = "/tmp/out/{{.Field0}}/ramens.csv.gz"
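# {{.Field0}} expands to the value of the first field listed in 'fields' above (country)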
[upload]
name="S3"
[upload.config]
Region = "us-east-1"
Bucket = "myBucket"
Prefix = "ramens/"
StagingPath = "/tmp/staging/"
SourceBasePath = "/tmp/out/"
Interval = "60s"
ExitOnError = true
Create the program
Baker is a Go library. To use it, you need to create a Go main() function, define a baker.Components object, and pass it to baker.MainCLI():
package main

import (
	"log"

	"github.com/AdRoll/baker"
)

func main() {
	components := baker.Components{/* define components */}
	if err := baker.MainCLI(components); err != nil {
		log.Fatal(err)
	}
}
Define baker.Components
The only required fields in baker.Components are the components we need to use (the complete guide to baker.Components is here).
The simplest and most generic way to add components to Baker is to add all of them:
components := baker.Components{
	Inputs:  input.All,
	Filters: filter.All,
	Outputs: output.All,
	Uploads: upload.All,
}
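If you’d rather compile in only the components the topology actually uses, you can list them individually instead. The following is a sketch that assumes the same imports as the complete program below and that each built-in component exports a *Desc variable with these names (ListDesc, NotNullDesc, FileWriterDesc, S3Desc); check the package documentation for the exact identifiers.

// Narrower registration: only the components this topology needs.
// The *Desc names below are assumptions; verify them in the package docs.
components := baker.Components{
	Inputs:  []baker.InputDesc{input.ListDesc},
	Filters: []baker.FilterDesc{filter.NotNullDesc},
	Outputs: []baker.OutputDesc{output.FileWriterDesc},
	Uploads: []baker.UploadDesc{upload.S3Desc},
}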
The complete program (available in the tutorials/ folder of the Baker repository) is the following:
package main

import (
	"log"

	"github.com/AdRoll/baker"
	"github.com/AdRoll/baker/filter"
	"github.com/AdRoll/baker/input"
	"github.com/AdRoll/baker/output"
	"github.com/AdRoll/baker/upload"
)

func main() {
	if err := baker.MainCLI(baker.Components{
		Inputs:  input.All,
		Filters: filter.All,
		Outputs: output.All,
		Uploads: upload.All,
	}); err != nil {
		log.Fatal(err)
	}
}
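As a side note, the configuration doesn’t have to come from the command line: Baker also offers a lower-level entry point that parses a TOML reader and runs the resulting topology. The sketch below assumes NewConfigFromToml and Main are exposed with roughly these signatures; check the API documentation before relying on them.

package main

import (
	"log"
	"strings"

	"github.com/AdRoll/baker"
	"github.com/AdRoll/baker/filter"
	"github.com/AdRoll/baker/input"
	"github.com/AdRoll/baker/output"
	"github.com/AdRoll/baker/upload"
)

// The same TOML shown above, embedded as a string instead of read from a file.
const topology = `...`

func main() {
	components := baker.Components{
		Inputs:  input.All,
		Filters: filter.All,
		Outputs: output.All,
		Uploads: upload.All,
	}
	// NewConfigFromToml and Main are assumed from Baker's public API.
	cfg, err := baker.NewConfigFromToml(strings.NewReader(topology), components)
	if err != nil {
		log.Fatal(err)
	}
	if err := baker.Main(cfg); err != nil {
		log.Fatal(err)
	}
}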
Run the program
Once the code and the configuration file are ready, we can run the topology:
$ go build -o myProgram ./main.go
# Test it works as expected
$ ./myProgram -help
# Run the topology
$ ./myProgram topology.toml
Among the messages that Baker prints on stdout, the stats messages are particularly interesting:
Stats: 1s[w:0 r:0] total[w:41 r:2584 u:11] speed[w:20 r:1292] errors[p:0 i:0 f:2543 o:0 u:0]
Take a look at the dedicated page to learn how to read the values.
Verify the result
The resulting files are split into multiple folders, one for each country, and then uploaded.
The S3 upload removes the files from the local disk once they have been uploaded, so you’ll only find empty directories in the output destination folder:
~ ls -l /tmp/out/
drwxrwxr-x - username 16 Nov 11:43 China
drwxrwxr-x - username 16 Nov 11:43 Hong Kong
drwxrwxr-x - username 16 Nov 11:43 Indonesia
drwxrwxr-x - username 16 Nov 11:43 Japan
drwxrwxr-x - username 16 Nov 11:43 Malaysia
drwxrwxr-x - username 16 Nov 11:43 Myanmar
drwxrwxr-x - username 16 Nov 11:43 Singapore
drwxrwxr-x - username 16 Nov 11:43 South Korea
drwxrwxr-x - username 16 Nov 11:43 Taiwan
drwxrwxr-x - username 16 Nov 11:43 Thailand
drwxrwxr-x - username 16 Nov 11:43 USA
The files have been uploaded to S3:
~ aws s3 ls --recursive s3://myBucket/ramens/
2020-11-16 11:43:59 115 ramens/China/ramens.csv.gz
2020-11-16 11:43:59 83 ramens/Hong Kong/ramens.csv.gz
2020-11-16 11:43:59 223 ramens/Indonesia/ramens.csv.gz
2020-11-16 11:43:59 236 ramens/Japan/ramens.csv.gz
2020-11-16 11:43:59 240 ramens/Malaysia/ramens.csv.gz
2020-11-16 11:43:59 99 ramens/Myanmar/ramens.csv.gz
2020-11-16 11:43:59 219 ramens/Singapore/ramens.csv.gz
2020-11-16 11:43:59 265 ramens/South Korea/ramens.csv.gz
2020-11-16 11:43:59 159 ramens/Taiwan/ramens.csv.gz
2020-11-16 11:43:59 181 ramens/Thailand/ramens.csv.gz
2020-11-16 11:43:59 94 ramens/USA/ramens.csv.gz
Conclusion
That’s it for this basic tutorial. You have learned:
- how to create a simple Baker program to process a CSV dataset with minimal filtering and upload the results to S3
- how to create the Baker TOML configuration file
- how to execute the program and verify the result
You can now improve your Baker knowledge by taking a look at the other tutorials and learning more advanced topics.