Cached Takes: 80% of Companies do not need Snowflake or Databricks

The cost for something that can be replicated with free and open-source tools is absurd. The Fortune 100 have a use case for these platforms; the rest are overpaying.

Kieran Healey
5 min read · Jul 14, 2023
A recession is coming. Are you ready to explain your expensive bills to your CEO? AI-generated image from craiyon.com

I have been working in Data Engineering for about 8 years now. When I started, I caught the tail end of the Hadoop era. I got my start on a technology known then as MapR at a boutique energy firm in Houston, Texas. This was a period of personal growth and learning: Python, Spark, Scala, Docker, Airflow, and how to check code into GitLab. It was an exciting time. As a junior developer, you are more fascinated by what you can do with a language than by what you can solve with it. This is the way.

shamelessly stolen from tenor.com

As I progressed in my career, became somewhat wiser (debatable), more business-oriented (less debatable), and more apathetic about the technology I was using, I realized that most companies overspend and under-deliver on their data projects. But why? Tons of companies are buying into the marketing hype surrounding the two major Data Platform as a Service offerings: Databricks and Snowflake. My Cached Take for the Week: 80% of companies do not need what these vendors are offering at the price they are paying. The global economy is headed for a recession; that's not my opinion, that's the Federal Reserve's. So in a time when most companies should be cutting back, how and where should you cut?

Both of these platforms are charging their customers for Ferraris when most companies would be happy with a Toyota Camry and some aftermarket parts. Don't get me wrong, both platforms are doing impressive things: Snowflake's Data Cloud is really coming to life with their NVIDIA partnership, and Databricks is putting runs on the board with the Spark Human API. But these are not reasons to use their platforms. They are marketing gimmicks designed to catch decision-makers' attention and loosen the company's purse strings. This points to a fundamental mistake in the way businesses operate: paying to make a problem disappear. We act as if these platforms solve problems when fundamentally they do not, and they often exacerbate large inefficiencies that already exist within an organization. If you want to make a big impact, be Anti-Hype.

Alright, so how do you become Anti-Hype, you ask? Trust in people. People solve problems; thus, we should encourage businesses to get to the core of their data problems before they start spending $$$. Here is a list of things you can do as a cost-conscious, Anti-Hype data person.

1. Sit down with your customer and be their Data Advocate

This is probably the hardest thing to do because, ultimately, your business wants "The Art of the Possible". What a way to pass any and all responsibility to the engineer while taking none of the responsibility for defining the product. It's a great reminder that despite all of the hype around data, people do not know what they want. This is a huge red flag in any project I commit to, because it shows immaturity and a lack of understanding of how the customer wishes to translate data into business value. The most important part of being Anti-Hype is truly understanding what the customer wants and translating that into an appropriate, measured response. Questions I would ask to engage the customer: How do you see yourself using the data? Are you going to passively look at it daily, or do you need functionality to send data to another vendor? What is the frequency of the data? What is the volume? Any engineer can code; a great Anti-Hype engineer asks questions until they hear "I do not know." Once that is said, you can draw on your experience and expertise to help design and architect a solution that meets their needs.

2. Do the boring stuff: Make a Data Model

This art has been lost amidst the Data Hype Train. Most people are focused on going as fast as possible without realizing the architectural hole they might be burying their careers under. The Modern Data Stack is great at producing velocity, but being fast at doing things wrong can spell doom when other systems rely on it to finish analysis over very wide, very long tables. Think about your architecture: pick a methodology for your Data Warehouse (Kimball, Inmon, etc.) and a schema for the layer your visualizations read from (star, snowflake, or galaxy schema). If you are building a data lake, define the columns you want to partition on and colocate the columns you filter on; decide when to compact files and when to Z-order. If data needs to move bidirectionally, decide how updates are applied on load in each direction. Talk with your stakeholders. Ask how they want to see the data, colocate the fields they want to see together, and push the fields they don't need into other tables. If you can sniff out the inefficiencies in your data early and design an architecture that fits your specific data, you often find you do not need the power that Snowflake or Databricks has to offer. Do the real work. Work with people. The code will write itself. I promise.
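To make the data lake advice concrete, here is a minimal PySpark sketch (mine, not from the original post) of partitioning a fact table on the columns most queries filter on. The bucket paths and column names are hypothetical placeholders.

```python
# Minimal sketch: write a fact table partitioned on low-cardinality filter
# columns so downstream readers prune files instead of scanning everything.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_model").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")

(
    orders
    # Keep the fields stakeholders query together in one table...
    .select("order_id", "order_date", "region", "customer_id", "amount")
    .write
    .mode("overwrite")
    # ...and partition on the columns they filter on most often.
    .partitionBy("order_date", "region")
    .parquet("s3://my-bucket/warehouse/fact_orders/")
)
```

How often you compact files or Z-order depends on the table format you choose (Delta, Iceberg, plain Parquet), so decide that alongside the partitioning scheme rather than after the fact.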

3. Accept and adopt Open Source technologies that can replace Expensive SaaS Solutions

If you are looking to replicate the core functionality of Snowflake for your analytics platform, look no further than DuckDB. It can query object storage (S3, Azure Blob, etc.) as well as databases. I have been using the DuckDB Postgres extension with much success and could see it replicating the separation of storage and compute that Snowflake offers. Keeping a pulse on these technologies can help improve your bottom line. This does not include all of the security features that come out of the box with Snowflake and Databricks, so have a talk with your local DevOps engineer or manager about how to secure your implementation. As always, the human element of the job is the most important step. By interacting, exploring, and translating technologies from hype into business value, you can help save money while maintaining nearly the same core functionality as Databricks and/or Snowflake.
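As a rough illustration (my sketch, not the author's setup), here is what that can look like with DuckDB's Python API and its httpfs and postgres extensions. The bucket, region, connection string, and table names are hypothetical, and the security hardening is exactly the part to work through with your DevOps engineer.

```python
# Rough sketch: DuckDB as a lightweight compute engine over data that stays
# in object storage and Postgres. Bucket, region, connection string, and
# table names are hypothetical placeholders.
import duckdb

con = duckdb.connect()  # in-memory engine; the data lives elsewhere

# Query Parquet files sitting in S3-compatible storage.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # credentials via your env/secrets

revenue = con.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM read_parquet('s3://my-bucket/warehouse/fact_orders/*/*/*.parquet',
                      hive_partitioning = true)
    GROUP BY region
    ORDER BY revenue DESC
""").fetchall()

# Attach an existing Postgres database and join it against the lake data
# without copying anything around.
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute("ATTACH 'dbname=analytics host=localhost' AS pg (TYPE postgres)")

top_customers = con.sql("""
    SELECT c.name, SUM(o.amount) AS total
    FROM read_parquet('s3://my-bucket/warehouse/fact_orders/*/*/*.parquet',
                      hive_partitioning = true) AS o
    JOIN pg.public.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
```

The design point is the same one Snowflake sells: storage stays cheap and durable in the lake or the database, and the compute engine is disposable, except here the engine is a free library running wherever you already have a machine.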

That’s all for now! I hope you all have learned to be a little more Anti-Hype. Kieran Healey is a doer at heart, a video game coder at night, a DnD cultist on Sundays, and a lover of technology. When not coding, he is hanging with his dog Tazzie, or out and about in the Houston area. If you would like to get in touch, I have an email I use for all anti-hype questions. Please feel free to reach out to antihypedataguy@gmail.com
