The Open Source Data Platform
Part 2 — Extract and Load: Airbyte
For those who haven’t read Part 1: I highly recommend you read it and follow along. If you do, by the end of the series you should have a fairly comprehensive understanding of how to build a Data Platform. It is a bit of a tongue-in-cheek label, but a full-stack data developer can do data engineering, science, analytics, architecture, design, and deployments, and my hope is that by the end of the series you will be a full-stack data developer as well. I will link Part 1 here so y’all can check it out at your discretion.
In Part 2, we are going to build on top of what is, in my opinion, the best orchestration tool on the market, Prefect, while using Airbyte to perform lifts and shifts of data between our databases to ensure data parity and consistency across our Data Platform.
Airbyte — Automate the boring Extract and Load steps of Data Engineering
What is Airbyte? It is an Extract and Load tool, designed with the ELT (Extract, Load, Transform) paradigm in mind. With hundreds of connectors, and the ability to design your own custom connectors, Airbyte can handle your lift-and-shift loads with ease. But now I hear you say: ELT? Lift and shift? Why is he using those acronyms and buzzwords? They are shorthand for larger concepts, reduced so that these topics are easier to talk about. EL tools like Airbyte are becoming an essential part of every company's data engineering stack.
This article will not go into the differences between the ELT and ETL models. I believe a healthy data platform uses both, each where appropriate. For example, you might have a legacy database so large, and transformations so foreign, that a pure ELT approach would fail you: the source models require so many joins that, loaded as they are, they would be unusable in the data warehouse, and the data needs to be reworked first with a Dask or Spark cluster. Conversely, some datasets are so small that spinning up a Dask/Spark cluster to munge data from Postgres to Snowflake would be overkill and expensive. As a data engineer, one of your responsibilities to the business is to tell them when they are making a bad decision with their data strategy.
Extract, Load, and Transform: Basic Data Engineering
I was once asked by the founder of a company what data engineering is. There is a lot to data engineering beyond ELT; however, knowing your audience is half the battle when it comes to explaining who you are and how you add value to their company. This is what I would consider the most basic and fundamental form of data engineering, and these three words make up a lot of what data engineers do on a day-to-day basis. If I were to explain it like I was five: data engineers do ELT. Summarized here is what each letter in ELT does:
Extract — The extract step pulls data out of a source system. Data engineers spend most of their time harmonizing datasets; whether it is a legacy database, website scraping, or stream ingestion, the data engineer is responsible for extracting data from these systems. You must be able to extract the data from any system, and fortunately, there is a wide variety of ways to do this.
Load — The load step takes the data we extracted and loads it into a warehouse. This loading step used to be a combination of CREATE statements, JDBC connection strings, and other more complicated methods or custom-written code. This is why Airbyte is so powerful, both for traditional data warehouses and for loading copies of datasets into experimental databases to build data science feature stores: you just set up the Destination and it loads the data into the database. No code at all!
Transform — The transform step is where business logic is applied to generate insights from your data. Some engineers swear by different theories and organizational patterns of data warehousing and database design (Inmon versus Kimball, Snowflake/Star Schema, Data Vault 2.0, ODS, OLAP, OLTP, dimensional modeling, etc.). However, it is the application of the right pattern given the context of the dataset that will set you apart from your peers if you intend to work in data.
Enough Talk, Where’s the Code?
I couldn’t agree more! I like to do a little foundation building for those who have zero understanding of what is going on. Airbyte is very easy to install: simply run the commands below, which will bring up several containers running locally on your machine.
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up
Assuming you have Docker installed, this is all you need to do. You can then type localhost:8000 into your browser and you should be greeted by this page.
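If you would rather confirm things from the terminal, you can also poke the API's health check. This is just a quick sketch, assuming the webapp on port 8000 proxies the Configuration API under /api/v1 (adjust the host and port if your setup differs):

import requests

# Ask Airbyte's Configuration API whether it is up and reachable.
# The /api/v1/health path is an assumption about how the webapp proxies the API.
resp = requests.get("http://localhost:8000/api/v1/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # expect something along the lines of {"available": true}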
Airbyte first asks a bunch of questions, like your email and whether you want certain features turned on or off. Turn on and off whichever features you desire! Next, you’ll be greeted by the main UI hub with several tabs. If I were a first-time user, I would recommend using the onboarding steps to help set up your first connection. However, we will be using Prefect to schedule our Airbyte tasks, so once we set up the connections, sources, and destinations, we will be registering the Airbyte Operator with Prefect to run each of our commands. This will be demonstrated later in the article.
Now, what data source should we consume? Well, I am in need of some modified monsters for a D&D session I have planned this weekend, so I figured I would modify some Pokémon and use them as monsters. The PokéAPI is an API that keeps a complete list of all Pokémon and their stats; it is a real-life Pokédex from which I can import statistics, facts, and attacks to base my homebrew D&D monsters on! I am now going to set up a couple of connections for different Pokémon I am planning on modifying heavily.
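If you are curious what the payload Airbyte will be pulling looks like, here is a quick sketch that hits the public PokéAPI directly; the Pokémon name and the fields printed are just illustrative:

import requests

# Peek at the raw record the PokeAPI source will sync. "charizard" is just an example.
pokemon = requests.get("https://pokeapi.co/api/v2/pokemon/charizard", timeout=30).json()

print(pokemon["name"])
# Base stats come back as a list of {"base_stat": ..., "stat": {"name": ...}} entries.
for entry in pokemon["stats"]:
    print(entry["stat"]["name"], entry["base_stat"])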
Connections
Here is an example of the Connections tab after you build out a connection. Airbyte stores your Source and Destination here. This is what defines how data will flow. You can set up multiple sources going to one destination and vice versa.
Sources
Next, I would click on the Sources tab to bring up the sources UI. Then you can click the ‘New Sources’ button in the top right corner. On the next screen, you can select ‘PokeAPI’ from the list.
We are consuming directly from an API, so we do not need to feed in any authentication parameters, worry about being in the same VPC, or tag the resource to make it easier to track costs, since we are running from our laptop! Once your source is defined, click the ‘Set up source’ button at the bottom of your screen and Airbyte will test the source to ensure that it can be reached. Once that is done, you can head on over to ‘Destinations’.
Destinations
Finally, we create a Destination. You should select MongoDB if you have been following along at home and have set up a MongoDB instance in Docker. You will then be presented with a screen that looks like this:
You are able to give the Destination a name and set the host (for those following along at home, your MongoDB instance is localhost and the port is the default 27017). You won’t need to mess with authentication, as we have not set up a password or username for the MongoDB instance. Once that is done, scroll to the bottom and click ‘Test Destination’. Airbyte will then fire off a test to Mongo and ensure that the destination is available and accepting data.
Once that is done, go back to Connections and create a new connection. You should see your Source under Source and your Destination under Destination! On the final page, you can set a schedule for the run, schedule your job, and test it. Woohoo, you did it! How easy was that? Airbyte really makes it easy to move data from Source to Destination in a snap. Depending on the Source and the Destination, full CDC and data logging are available. One caveat here: not all sources and destinations have every feature enabled, as certain connectors are only in Alpha or Beta and not yet generally available.
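After a sync runs, you can also peek into Mongo to confirm documents actually landed. The database and collection names in this sketch are assumptions (Airbyte typically writes raw records to a collection named after the stream, often with an _airbyte_raw_ prefix), so list the collections first and adjust:

from pymongo import MongoClient

# Sanity check the loaded data. "pokedex" and "_airbyte_raw_pokemon" are assumptions;
# use whatever database you configured in the Destination and whatever the collection list shows.
db = MongoClient("mongodb://localhost:27017")["pokedex"]
print(db.list_collection_names())
print(db["_airbyte_raw_pokemon"].count_documents({}))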
Orchestrate Airbyte with Prefect
Now, the internal scheduler for Airbyte is very cool, but linking jobs together can be tedious. As a data engineer, I would like to be able to stitch jobs together rather than rely on the internals of each tool within our stack. Having said that, Prefect provides an easy way to schedule your data pipelines by using the Airbyte Operator and registering those tasks with your flow. Here is the code I will run to register the Prefect flow that will orchestrate Airbyte.
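As a rough sketch (the task names and placeholder UUIDs are illustrative, and yours will differ), the flow looks something like this: one AirbyteConnectionTask per connection, pointed at the local Airbyte server, chained so they run one after the other, and registered to the pokedex project.

from prefect import Flow
from prefect.tasks.airbyte.airbyte import AirbyteConnectionTask

# One task per Airbyte connection. The UUIDs are placeholders; paste in the
# connection ids you copied from the Airbyte UI.
sync_pokemon_one = AirbyteConnectionTask(
    airbyte_server_host="localhost",
    airbyte_server_port=8000,
    airbyte_api_version="v1",
    connection_id="11111111-1111-1111-1111-111111111111",
)
sync_pokemon_two = AirbyteConnectionTask(
    airbyte_server_host="localhost",
    airbyte_server_port=8000,
    airbyte_api_version="v1",
    connection_id="22222222-2222-2222-2222-222222222222",
)
sync_pokemon_three = AirbyteConnectionTask(
    airbyte_server_host="localhost",
    airbyte_server_port=8000,
    airbyte_api_version="v1",
    connection_id="33333333-3333-3333-3333-333333333333",
)

with Flow("pokedex-airbyte") as flow:
    # Chain the syncs so they fire one after the other rather than all at once.
    first = sync_pokemon_one()
    second = sync_pokemon_two(upstream_tasks=[first])
    third = sync_pokemon_three(upstream_tasks=[second])

# Register the flow with the pokedex project on the local Prefect server.
flow.register(project_name="pokedex")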
This code is similar to the code we worked on last week; however, instead of defining tasks with the @task decorator, we use the AirbyteConnectionTask to connect to Airbyte and trigger the Connections to run. The main variables all make sense: one for the host, one for the port, one for the version of the API, and a UUID for the connection_id. Now, I may not be aware of a better place to get the connection_id, but I had to parse the last UUID out of the connection’s URL. Here is an example of what I mean:
This is probably the most unintuitive thing I had to do with Airbyte: I had to copy and paste the bolded id from my URL. This seemed counterintuitive to the experience with Airbyte thus far. If anyone from Airbyte has the inside scoop on how to get that ID, I would be dying to know. Grabbing them for three jobs was kind of tedious; I couldn’t imagine having to do that for hundreds of jobs. What would be really cool is if Airbyte could put that information in the connection tab under the settings, with an easy copy-to-clipboard button. This would make the experience more enjoyable than failing to copy the id a couple of times as I did. What would be even cooler would be the ability to tag the ids and store them in a key vault so that you could call these ids at run time.
You can see I am importing three Pokémon JSON packets and loading them into MongoDB. I could have made a method to re-use the same ingestion code; if someone wants to see that, I’ll update this article with that code in gist form. Next, you are going to want to register the project as we did in the previous article. To spare you from clicking over there, run prefect create project pokedex in your terminal to create your project. Then head over to localhost:8080 if your Prefect server is still up; if not, you will need to start your Prefect server so you can see the UI. In the terminal, run the Python script above and you will register your code with the agent. Next, click the little blue rocket ship and you should see your tasks fire, one after the other!
Now you can handle your EL demands with Airbyte and schedule multiple Airbyte tasks through Prefect. How cool is that? You could schedule many different EL jobs in preparation for joining the data together via some transformation. You can even create more Destinations and copy the same Source to create multiple experimental databases, where you could convert the JSON API data into a relational model for analytics, or into a separately curated, clean-room-style feature store for data science!
The reception for my article last week was sensational! Thank you to everyone who read Part 1 and now Part 2. Please follow me to read each part to learn more about how to build an Open Source Data Platform right on your laptop. Spoiler Alert, I am intending to eventually turn this into a Terraform deployment so that anyone can easily deploy the Open Source Data Platform, locally and in the cloud. More importantly, I hope this inspires you to use these technologies to build something awesome! Let me know if you do, I would love to know what you are working on and can offer help/advice when I can. Build something amazing!
UPDATE: It looks like I can use Octavia, Airbyte’s CLI tool, to get the connection ids I need and generate these programmatically, but I will need a way to parse the output of the CLI command to grab them. I will link the download here as well. I may attempt to hack something together, but it might be cool to build a key vault pusher for this so that the connection ids can be read in at run time rather than hard coded. This may seem like an insignificant change, but I could see it being used to write less code: pass in the keys to generate a list of ids to be used programmatically. That way they could be stitched together by Prefect using even less code than seen here, giving you better control over the flow of your data. https://github.com/airbytehq/airbyte/tree/master/octavia-cli#secret-management
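In the meantime, another option besides parsing Octavia’s output would be to call the Airbyte Configuration API directly. This is only a sketch, assuming the API is reachable through the webapp proxy at localhost:8000 and that the workspaces/list and connections/list endpoints exist in your Airbyte version:

import requests

# Look up connection ids programmatically instead of copying UUIDs out of the browser URL.
# The endpoint paths are assumptions about Airbyte's Configuration API; adjust for your version.
API = "http://localhost:8000/api/v1"

workspaces = requests.post(f"{API}/workspaces/list", json={}, timeout=30).json()["workspaces"]
workspace_id = workspaces[0]["workspaceId"]

connections = requests.post(
    f"{API}/connections/list", json={"workspaceId": workspace_id}, timeout=30
).json()["connections"]

# Map connection names to ids so a Prefect flow (or a secret store) can resolve them at run time.
connection_ids = {c["name"]: c["connectionId"] for c in connections}
print(connection_ids)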