3 reasons why you should drop Databricks and move to Saturn Cloud
Everyone has different needs, wants and desires when it comes to programming. From Data Engineer to Data Scientist to Business Analyst and back, everyone just wants things to work. I have seen many Big Data Projects fail because Data Scientists do not have effective tools or data sets to do what they do best. Often, they are forced to use Spark where it doesn’t make sense. While I don’t think it’s necessarily time for everyone to give up on Spark, I think it is time for Data Scientists to move on from Databricks.
Things I hear often when Data Scientists use Spark are:
“I don’t care how the data gets there as long as it’s clean.”
“I don’t want to learn Spark, it doesn’t make sense.”
“Why can’t I just do this all in Python?”
Spark is not a data science tool. Apache Spark started as a university project that grew into a business; it was not designed with the data scientist in mind. It was designed to process data at massive scale, not to perform analysis. Databricks, by extension, masks the deficiencies with a nice GUI, some bolted-on Machine Learning features, and their new file format Delta. While these improvements are nice, Databricks has only recently started to truly care about machine learning and data science with the Spark 3.0.0 pre-release just this past month. I think we all know this is too little, too late.
Saturn Cloud, a native data science tool, is designed by people who just like things to work. A Data Scientist is no longer forced to waste time outside of Python, debugging long Scala error messages that make zero sense. Saturn Cloud uses Dask, a flexible Python library that allows a developer to process massive datasets in parallel. This is how data science should be. Here are 3 things you might not have known about Saturn Cloud:
1. Natively Pythonic
The first language I ever learned was Scheme, a language related to Lisp. I hated Scheme; it was dynamically typed and had a heavy emphasis on parentheses. My biggest problem with Scheme was that I could not do anything useful with it. I was trapped on my local machine, unable to call out to the rest of the world. Maybe this was because of how junior I was at the time, but I aspired to build something useful. I ventured out into the data space back in 2014 and looked for a language that I could build with. I learned Python and I never looked back.
Fast forward to today and Python is one of the most used languages in the world. It connects to just about anything, and there are thousands of useful packages that make producing business-grade cloud solutions a snap. Databricks tried to make itself friendly to Python by creating what essentially amounts to an API layer between Python and Scala. This adds a layer of abstraction that makes code harder to debug and slower than running natively on Scala. While Databricks has added an optimizer, it is not enough when running typical analytics workloads.
Enter Saturn Cloud: instead of using Spark, they use the purely Pythonic Dask to organize, schedule and distribute your Python code natively across a cluster. Saturn Cloud treats your Python code as a first-class citizen and allows it to be scaled up and down as you see fit. This is thanks to the task graphs that Dask uses to massively parallelize your Python and process your data faster. These graphs are DAGs, or Directed Acyclic Graphs, and they are best likened to a recipe. You act as the chef, telling Python how you want to organize your data. I find this works best with an example.
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression

def load(filename):
    # Lazily read one CSV file into a Dask DataFrame
    df = dd.read_csv(filename)
    return df

def clean(df):
    # Keep the per-year maximum of every column
    sf = df.groupby('Year').max()
    return sf

def analyze(sequence_of_data):
    lr = LinearRegression()
    # fit some data for a regression
    # return the fitted model's score

def store(result):
    # The output path is elided in the original
    with open(..., 'w') as f:
        f.write(result)

# Each key names a task; each value is (function, arguments...)
dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel
An example of a typical ML workflow, built on Dask/Saturn Cloud
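If the recipe analogy still feels abstract, here is a toy sketch of how a graph written in that dictionary style can be resolved. To be clear, this is not Dask's scheduler: run_graph, the tiny in-memory "files" and the load, total and report functions are all hypothetical names I made up for illustration, and the executor only uses Python's standard library. It just shows the idea that each key is a step, each tuple is "function plus ingredients", and independent steps can run side by side.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def run_graph(graph, target, max_workers=4):
    """Toy executor for a dict-of-tuples task graph (illustration only).

    Assumes the graph is acyclic. Strings that match a key are treated
    as dependencies; everything else is passed through as a literal.
    """
    futures = {}

    def submit(pool, key):
        # Submit each task once, always after the tasks it depends on.
        if key not in futures:
            func, *args = graph[key]
            prepared = [prepare(pool, a) for a in args]
            futures[key] = pool.submit(call, func, prepared)
        return futures[key]

    def prepare(pool, arg):
        # Swap dependency names for the futures that will produce them.
        if isinstance(arg, list):
            return [prepare(pool, a) for a in arg]
        if isinstance(arg, str) and arg in graph:
            return submit(pool, arg)
        return arg

    def materialize(arg):
        # Wait for dependency results right before the function runs.
        if isinstance(arg, list):
            return [materialize(a) for a in arg]
        if isinstance(arg, Future):
            return arg.result()
        return arg

    def call(func, prepared):
        return func(*[materialize(a) for a in prepared])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return submit(pool, target).result()


def load(name):
    # Stand-in for reading a file: return a small in-memory dataset.
    data = {"a": [1, 2, 3], "b": [10, 20], "c": [100]}
    return data[name]


def total(values):
    return sum(values)


def report(totals):
    return "grand total: %d" % sum(totals)


graph = {
    "load-a": (load, "a"),
    "load-b": (load, "b"),
    "load-c": (load, "c"),
    "total-a": (total, "load-a"),
    "total-b": (total, "load-b"),
    "total-c": (total, "load-c"),
    "report": (report, ["total-a", "total-b", "total-c"]),
}

result = run_graph(graph, "report")
print(result)
```

The three load/total branches do not depend on each other, so the thread pool is free to run them at the same time; only the final report step has to wait for all of them. A real scheduler such as Dask's does far more (spreading work across processes or machines, managing memory), so treat this purely as a mental model.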
2. Built with the Data Scientist in mind
Saturn Cloud has done something really special here. They have decided not to reinvent the wheel when it comes to popular data science packages. Dask, and by extension Saturn Cloud, allows you to use your favorite libraries (sklearn, xgboost, tensorflow, etc.). By contrast, Spark and Databricks push you toward their spark-ml package for distributed work. Any Data Scientist knows why this is horrible, but I will explain quickly: the pre-built Spark-ML functions do not come with as many customizable parameters and are rather limited in scope. Trying to match the functionality of existing Python packages with Spark UDFs is not only nearly impossible, it can also take a huge toll on your program’s performance. With this “Bring Your Own Package” approach, you can use the data science packages you know and love without sacrificing performance.
3. Allows for fast visualization using Python packages
Saturn Cloud allows you to use beautiful Python visualization packages such as bokeh and seaborn as if you were developing on your laptop. To make these packages work seamlessly inside of Databricks, you need at least intermediate SysAdmin experience to install the right packages and libraries. What person, let alone a Data Scientist, wants to perform SysAdmin duties on their cluster just to be able to provide business insights? With a native Python stack on Saturn Cloud, there is less to manage and therefore more time to code meaningful solutions that deliver real business value.
There are many reasons to switch to Saturn Cloud that I have not outlined (cost, cloud native, nice integration with CI/CD), but I think these are enough to make any Data Scientist’s ears burn with excitement at the possibility of becoming a Python-Stack Data Scientist.
As people have been asking me for the link, here is the link to Saturn Cloud’s website. Code on!