Spark is nearing its end of life. That opinion is controversial, but it needs to be said. Spark was great back when data engineering happened on Hadoop clusters, but it no longer serves the entire data stack. This is where Dask shines: it is more generic and integrates directly with native Python ML packages. If you are using the pyspark library, Databricks is essentially layering patches on top of Scala code. Their efforts are valiant, but in vain. (I haven't personally played with the new distributed pandas API, so I won't comment on it directly.)

The pyspark package is an API layer on top of an API layer (Python -> Scala -> Java). This leads to terrible performance whenever you step outside Spark's native functions, because every row a Python UDF touches has to be serialized across the JVM/Python boundary and back. Dask, on the other hand, is not limited to this paradigm: it treats functions generically, so arbitrary Python code runs directly on the workers. The two sketches below illustrate the difference.

Azure ML isn't really in the same league. I would argue Databricks has been the king for a long time, but using something like Saturn Cloud for your ETL processes is becoming more realistic and far more scalable. Plus, your data scientists and data engineers work on the same platform, which makes for better collaboration.
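To make the layering point concrete, here is a minimal sketch (the dataset size and column names are illustrative, not anything from a real workload) contrasting a plain Python UDF, which forces rows out to Python workers and back, with a native column expression that stays entirely inside the JVM:

```python
# Minimal sketch: Python UDF vs. native Spark expression.
# Dataset and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf-vs-native").getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "value")

# Python UDF: row batches are pickled out to Python worker processes,
# evaluated, and serialized back into the JVM. That round trip is the cost.
double_udf = F.udf(lambda x: x * 2, LongType())
slow = df.withColumn("doubled", double_udf(F.col("value")))

# Native expression: compiled into the JVM query plan, no Python round trip.
fast = df.withColumn("doubled", F.col("value") * 2)

slow.count()  # same result, markedly slower at scale
fast.count()
```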
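By contrast, here is a sketch of the same transform in Dask (again with illustrative names): the function is ordinary pandas code, and Dask simply schedules it across partitions with no cross-language serialization, which is also why native Python ML packages plug in so cleanly.

```python
# Minimal sketch: Dask runs arbitrary Python functions per partition.
# DataFrame contents are hypothetical.
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({"value": range(1_000_000)}), npartitions=8
)

def double_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Ordinary pandas code; Dask just schedules it on each worker.
    return part.assign(doubled=part["value"] * 2)

result = ddf.map_partitions(double_partition).compute()
```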