Basic: build a simple pipeline
In this tutorial you’ll learn how to create a Baker-based program to process a given dataset (in CSV format), filter records based on your needs and save the result to S3.
The dataset we’re going to use is an open dataset containing ratings of many ramens, the famous Japanese noodle soup!
Our goal is to discard all ramens that have never appeared in a top-ten ranking, split the results into multiple folders named after the ramens’ source countries, and upload the resulting lists to S3.
The dataset
The dataset file has 7 columns:
- review_num: the number of the review (higher numbers mean more recent reviews)
- brand: the brand that produces the ramen
- variety: the name of the recipe
- style: the type of the ramen (cup, pack, bowl, etc.)
- country: self-explanatory
- stars: the star rating (from 0 to 5)
- top_ten: whether the ramen has been included in a top-ten ranking
Warning
The original CSV file can’t be immediately used with Baker because:
- it includes a header row
- some field values contain commas and are thus enclosed in double quotes, which Baker doesn’t support
- the file is uncompressed
For the purposes of this tutorial we’ve already prepared the final file for you; it is available for download here.
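If you want to prepare a similar file yourself, here is one possible approach as a sketch (not part of the tutorial; the input path, output path and the semicolon replacement are only assumptions): it drops the header row, rewrites quoted fields so that no commas remain inside values, and gzips the result.

package main

import (
	"compress/gzip"
	"encoding/csv"
	"log"
	"os"
	"strings"
)

func main() {
	// Hypothetical source file name; adjust to wherever you saved the raw CSV.
	in, err := os.Open("ramen-ratings.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("/tmp/db.csv.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	gz := gzip.NewWriter(out)
	defer gz.Close()

	// encoding/csv handles the double-quoted fields of the original file.
	records, err := csv.NewReader(in).ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	if len(records) == 0 {
		log.Fatal("empty input file")
	}

	// Skip the header row and write plain, unquoted, comma-separated lines.
	for _, rec := range records[1:] {
		for i, f := range rec {
			// Replace embedded commas so the line can be written without quoting.
			rec[i] = strings.ReplaceAll(f, ",", ";")
		}
		if _, err := gz.Write([]byte(strings.Join(rec, ",") + "\n")); err != nil {
			log.Fatal(err)
		}
	}
}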
The required components
List
: reads the input file from disk
NotNull
: discards all ramens without a top-ten entry
FileWriter
: saves the resulting file to disk
S3
: uploads the file to S3
Baker configuration
The first step is to create a configuration file for Baker, in TOML format, that selects the aforementioned components:
[fields]
names = ["review_num", "brand", "variety", "style", "country", "stars", "top_ten"]
[input]
name = "List"
[input.config]
Files = ["/tmp/db.csv.gz"] # put the file wherever you like
[[filter]]
name = "NotNull"
[filter.config]
Fields = ["top_ten"] # discard all records with an empty top_ten field
[output]
name = "FileWriter"
procs = 1 # With our PathString, FileWriter doesn't support concurrency
fields = ["country"]
[output.config]
PathString = "/tmp/out/{{.Field0}}/ramens.csv.gz"
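# {{.Field0}} expands to the value of the first field listed in 'fields' above (country)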
[upload]
name="S3"
[upload.config]
Region = "us-east-1"
Bucket = "myBucket"
Prefix = "ramens/"
StagingPath = "/tmp/staging/"
SourceBasePath = "/tmp/out/"
Interval = "60s"
ExitOnError = true
Create the program
Baker is a Go library. To use it, you need to create a Go main() function, define a baker.Components object, and pass it to baker.MainCLI():
package main

import (
	"log"

	"github.com/AdRoll/baker"
)

func main() {
	components := baker.Components{/* define components */}
	if err := baker.MainCLI(components); err != nil {
		log.Fatal(err)
	}
}
Define baker.Components
The only required fields in baker.Components are the components we need to use (the complete guide to baker.Components is here).
The simplest and most generic way to add components to Baker is to add all of them:
components := baker.Components{
	Inputs:  input.All,
	Filters: filter.All,
	Outputs: output.All,
	Uploads: upload.All,
}
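If you’d rather compile in only the components the topology actually uses, you can list them individually instead. The following is a sketch that assumes the same imports as the complete program below and that each built-in component exports a *Desc variable with these names (ListDesc, NotNullDesc, FileWriterDesc, S3Desc); check the package documentation for the exact identifiers.

// Narrower registration: only the components this topology needs.
// The *Desc names below are assumptions; verify them in the package docs.
components := baker.Components{
	Inputs:  []baker.InputDesc{input.ListDesc},
	Filters: []baker.FilterDesc{filter.NotNullDesc},
	Outputs: []baker.OutputDesc{output.FileWriterDesc},
	Uploads: []baker.UploadDesc{upload.S3Desc},
}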
The complete program (available in the tutorials/ folder of the Baker repository) is the following:
package main

import (
	"log"

	"github.com/AdRoll/baker"
	"github.com/AdRoll/baker/filter"
	"github.com/AdRoll/baker/input"
	"github.com/AdRoll/baker/output"
	"github.com/AdRoll/baker/upload"
)

func main() {
	if err := baker.MainCLI(baker.Components{
		Inputs:  input.All,
		Filters: filter.All,
		Outputs: output.All,
		Uploads: upload.All,
	}); err != nil {
		log.Fatal(err)
	}
}
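As a side note, the configuration doesn’t have to come from the command line: Baker also offers a lower-level entry point that parses a TOML reader and runs the resulting topology. The sketch below assumes NewConfigFromToml and Main are exposed with roughly these signatures; check the API documentation before relying on them.

package main

import (
	"log"
	"strings"

	"github.com/AdRoll/baker"
	"github.com/AdRoll/baker/filter"
	"github.com/AdRoll/baker/input"
	"github.com/AdRoll/baker/output"
	"github.com/AdRoll/baker/upload"
)

// The same TOML shown above, embedded as a string instead of read from a file.
const topology = `...`

func main() {
	components := baker.Components{
		Inputs:  input.All,
		Filters: filter.All,
		Outputs: output.All,
		Uploads: upload.All,
	}
	// NewConfigFromToml and Main are assumed from Baker's public API.
	cfg, err := baker.NewConfigFromToml(strings.NewReader(topology), components)
	if err != nil {
		log.Fatal(err)
	}
	if err := baker.Main(cfg); err != nil {
		log.Fatal(err)
	}
}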
Run the program
Once the code and the configuration file are ready, we can run the topology:
$ go build -o myProgram ./main.go
# Test it works as expected
$ ./myProgram -help
# Run the topology
$ ./myProgram topology.toml
Among the messages that Baker prints on stdout, the stats messages are particularly interesting:
Stats: 1s[w:0 r:0] total[w:41 r:2584 u:11] speed[w:20 r:1292] errors[p:0 i:0 f:2543 o:0 u:0]
Take a look at the dedicated page to learn how to read the values.
Verify the result
The resulting files are split into multiple folders, one for each country, and then uploaded.
The S3 upload removes the files from the local disk once they have been uploaded, so you’ll only find empty directories in the output destination folder:
~ ls -l /tmp/out/
drwxrwxr-x - username 16 Nov 11:43 China
drwxrwxr-x - username 16 Nov 11:43 Hong Kong
drwxrwxr-x - username 16 Nov 11:43 Indonesia
drwxrwxr-x - username 16 Nov 11:43 Japan
drwxrwxr-x - username 16 Nov 11:43 Malaysia
drwxrwxr-x - username 16 Nov 11:43 Myanmar
drwxrwxr-x - username 16 Nov 11:43 Singapore
drwxrwxr-x - username 16 Nov 11:43 South Korea
drwxrwxr-x - username 16 Nov 11:43 Taiwan
drwxrwxr-x - username 16 Nov 11:43 Thailand
drwxrwxr-x - username 16 Nov 11:43 USA
The files have been uploaded to S3:
~ aws s3 ls --recursive s3://myBucket/ramens/
2020-11-16 11:43:59 115 ramens/China/ramens.csv.gz
2020-11-16 11:43:59 83 ramens/Hong Kong/ramens.csv.gz
2020-11-16 11:43:59 223 ramens/Indonesia/ramens.csv.gz
2020-11-16 11:43:59 236 ramens/Japan/ramens.csv.gz
2020-11-16 11:43:59 240 ramens/Malaysia/ramens.csv.gz
2020-11-16 11:43:59 99 ramens/Myanmar/ramens.csv.gz
2020-11-16 11:43:59 219 ramens/Singapore/ramens.csv.gz
2020-11-16 11:43:59 265 ramens/South Korea/ramens.csv.gz
2020-11-16 11:43:59 159 ramens/Taiwan/ramens.csv.gz
2020-11-16 11:43:59 181 ramens/Thailand/ramens.csv.gz
2020-11-16 11:43:59 94 ramens/USA/ramens.csv.gz
Conclusion
That’s it for this basic tutorial. You have learned:
- how to create a simple Baker program to process a CSV dataset with minimal filtering and upload the results to S3
- how to create the Baker TOML configuration file
- how to execute the program and verify the result
You can now improve your Baker knowledge by taking a look at the other tutorials and learning more advanced topics.