BYODP — Standardized Notebooks

Build Your Own Data Platform — enable your data scientists by turning them into Data Science Kunai, helping your Analytics department slice and dice your data like a true data shinobi.


We hold this data truth to be self-evident: data is visible to some, truly understood by a few, and constantly changing. That is frustrating for those of us who seek the unequivocal truth. Questions come up all the time: How many customers accessed a certain table? What is the average cost of this procedure? How many clients are also seeing this error? You and a colleague investigate the issue and come to two distinct conclusions. Well, who is right?

The next hours are usually spent arguing: “Well, it works on my machine. Maybe you didn’t install [insert package name here].” While that is an easy fix with a requirements.txt or a Pipfile for most devs, it can be tricky to manage versions of Python across your stack, especially if you have a data science team that needs that one package that only works with a specific version of Python.
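When two people get different answers, a first step is often to compare what each machine is actually running. The sketch below is only an illustration of that idea, not part of the platform itself: `environment_report` is a hypothetical helper name, and the package names in the usage line are examples you would swap for your own.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Return the Python version plus the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


if __name__ == "__main__":
    # Example package names only; list whatever your analysis imports.
    print(environment_report(["pandas", "numpy"]))
```

If you and your colleague both run this and the two reports differ, the disagreement may be about the environment rather than the data.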

This is where a pre-built, Dockerized Jupyter image can be useful, with some slight alterations to allow quick and easy access to data in your AWS cloud environment. This includes data stored in S3, Athena, and even AWS RDS or a cloud data warehouse, where you would find your data marts and data vaults. This is the beginning of a multi-part series called Build Your Own Data Platform.

When setting up your data platform, there are many things to consider. From how data is stored all the way up to how it is presented to the end user, it is safe to say that data has a life cycle of its own. Often this life cycle goes like so:

  1. Inception: An idea is formed where product and data work hand in hand to build an initial concept for the customer. This is often done at a high level between company leaders and technical leaders. It sets a shared baseline and brings out diverse viewpoints that some might not otherwise consider.
  2. Proof of Concept (POC): This is often a dangerous area, because sacrifices have to be made on both the product side and the technical side. Product needs something to show the customer to keep them interested. While this brings in new business, the POC cannot be sold to clients as a product, for many reasons. For starters, engineers cannot be expected to maintain something that was “duct-taped” together for the sake of speed. In essence, you are asking an engineer to betray their better nature and build something that cannot be supported in the future. Engineering, in turn, has to give up on crafting the perfect solution the first time.
  3. Reinforce: Now that a POC has been established, some reinforcement needs to occur. This means building Dev, Test, and Prod environments for the POC to encourage repeatable processes. This is where the robust data pipeline is built. Allowing time to build something that can be deployed will increase customer happiness and retention. This is where engineering shines in the sales pipeline. It is easy to sell a product once; it is much harder to delight customers repeatedly, day in and day out.
  4. Repeat: Move on to the next value-add that has been identified by Product and Engineering. Time is money; however, you need a solid Reinforce step to move on to the next thing confidently and safely. It is often nice to set an EOL agreement between the customer and the engineering team, establishing a timeline for the life of the product you are selling.

It’s important to understand why I am going through this life cycle: I want to establish the need for surfacing data quickly and efficiently in order to build data and data science pipelines. From this environment, you could submit jobs to a Dask cluster and run data engineering pipelines and data science models on demand. This makes it faster to start building POCs.

Code Time: Setting up the environment

First, let’s containerize our notebooks. We can use any image we want, really. I prefer the Jupyter data science notebook image, jupyter/datascience-notebook.

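version: "3"
services:
  # The service name is not shown in the original listing; this one is an assumption.
  cloud-data-lab:
    image: jupyter/datascience-notebook
    user: root
    env_file:
      - .env
    volumes:
      # Each variable expands to a host:container path pair defined in .env
      - $dir
      - $ssh
      - $aws
    ports:
      - 8888:8888
    environment:
      - w=/home/jovyan
    container_name: cloud-data-lab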

I am using a couple of things here to keep my local details out of the compose file and to give the user more functionality. By setting a .env file, I can store the paths of the volumes on my local machine. This serves two purposes:

  1. It mounts the code from the repository on my computer into the Docker container, so it can be accessed by the Jupyter environment running inside the container.
  2. It mounts my ssh and AWS folders, passing through the credentials I have set up via the AWS CLI so I can access AWS resources from the Docker container. I can spin up Dask clusters and interact with RDS and S3. This is a powerful paradigm, as it allows for faster prototyping of data pipelines and data science models.
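To check from inside a notebook that the AWS folder really came through, something like the following could be used. It is a minimal sketch that relies only on the standard library and on the INI layout the AWS CLI uses for its credentials file. `list_aws_profiles` is a hypothetical helper, and the path in the usage line assumes the folder is mounted at `.aws` under the home directory of the user running the notebook. It reads profile names only and never prints keys.

```python
import configparser
from pathlib import Path


def list_aws_profiles(credentials_path):
    """Return the profile names found in an AWS CLI credentials file."""
    path = Path(credentials_path)
    if not path.is_file():
        return []  # nothing mounted at this location
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser.sections()


if __name__ == "__main__":
    # Assumed mount point: .aws in the current user's home directory.
    print(list_aws_profiles(Path.home() / ".aws" / "credentials"))
```

An empty list suggests the volume mapping for the AWS folder points somewhere other than where the notebook user expects it.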

There are a couple of other things to note about the setup. The environment key is used to configure the Jupyter install as if you had installed it on your local laptop. This is critical, as sometimes you want a standard notebook environment as opposed to JupyterLab.

Here is an example of the .env file for those who are curious. Please note that these are not my actual local directories; I made a ‘default’ template that you can copy and adjust as you see fit.
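As a rough illustration of what such a template could contain, here is a sketch that holds a hypothetical .env as a string and checks that the three variables used in the compose file (`dir`, `ssh`, `aws`) are set. Every path in it is a placeholder I made up for the example, and the helper names `parse_env` and `missing_keys` are assumptions, not part of Docker or Jupyter.

```python
# Hypothetical template: every path below is a placeholder.
# Each value is a host_path:container_path pair for a volume mapping.
ENV_TEMPLATE = """\
dir=/path/to/your/repo:/home/jovyan/work
ssh=/path/to/your/.ssh:/home/jovyan/.ssh
aws=/path/to/your/.aws:/home/jovyan/.aws
"""


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=("dir", "ssh", "aws")):
    """Return the required variable names that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    print(missing_keys(parse_env(ENV_TEMPLATE)))
```

Running a check like this before starting the container can catch a variable that was left out when the template was copied.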


That’s it! Then all you need to do is:

docker-compose up
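Once the container is running, a quick way to confirm that Jupyter is listening on the published port is a plain TCP check. This is a small sketch using only the standard library. `port_is_open` is a hypothetical helper, and it assumes you are checking from the host machine against port 8888 as mapped in the compose file.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or host unreachable


if __name__ == "__main__":
    # 8888 is the host port published by the container.
    print("Jupyter reachable:", port_is_open("localhost", 8888))
```

A successful connection only shows that something is listening on the port; you still need the token from the container logs to sign in.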

Next time, I will show you how to use this environment to build, test, and run data pipelines in Dask. Until then, stay frosty!

Data Engineer. I love blogging about new technologies and sharing simple tutorials to explain the tech.