Introduction
In 1961, the Berlin Wall was constructed to prevent people from leaving East Berlin. It blocked movement, split families, and symbolized a lack of freedom. The wall fell in 1989, and people could finally cross without fear. Its collapse symbolized a new era of freedom and opportunity, where barriers no longer restrained movement. There is a (hyperbolic) analogy to be drawn with what’s currently happening with data science platforms.
Countries that allow people to leave freely provide security and flexibility. Similarly, platforms that let you easily move your data and pipelines in and out offer a kind of long-term security. Vendor lock-in is what happens when a company makes it hard to move your data out of its system. If a platform uses proprietary formats or hides critical processes behind closed tools, leaving can become complicated. This risk grows when prices rise or performance declines, since you have no simple way to exit.
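To make the portability point concrete, here is a minimal sketch, not tied to any vendor. It assumes a small hypothetical "orders" table, writes it to CSV (standing in for any open format), and loads it into an unrelated engine (in-memory SQLite). The table, column names, and helper functions are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for a table exported from a platform.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.5},
    {"order_id": 2, "region": "US", "amount": 80.0},
]


def export_to_csv(records):
    """Serialize records to CSV text, an open format that any tool can read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["order_id", "region", "amount"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def load_into_sqlite(csv_text):
    """Load the CSV text into a different engine (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # Named placeholders are filled from each row dictionary.
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :region, :amount)",
        list(reader),
    )
    return conn


csv_text = export_to_csv(rows)
conn = load_into_sqlite(csv_text)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

Because the intermediate file is in an open format, the second engine needs no cooperation from the first. With a proprietary format, that middle step is exactly where an exit gets expensive.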
Open Source Software (OSS) is one way to avoid this type of barrier. When source code is published under an open license, anyone can use, modify, or share it. That model keeps control in the hands of the community instead of a single private vendor. People who value freedom often support OSS because it spreads power across more voices and encourages honest feedback. This democratic culture is what drives many data engineers toward platforms built on open formats and open development. They know that if something changes or a new need appears, OSS-based tools give them the best chance of leaving without penalty, or of simply improving the tools themselves.
Databricks & Open Source
Databricks frequently highlights its commitment to open source on its official site, and it began with open source at its core. It was founded by the creators of Apache Spark, one of the most widely adopted engines for large-scale data analytics. Spark distributes processing of huge datasets across a cluster, making tasks like data transformation and machine learning faster and more scalable. It created a revolution in the data science community by offering a single unified engine for batch processing, streaming, and advanced analytics. I have been using Spark since 2015, and it has stood the test in hundreds of applications that I have personally implemented.
The impact of Spark helped Databricks build an ecosystem around open standards. Over the years, the company released multiple open-source tools. For example, Delta Lake introduced ACID transactions on data lakes, and MLflow provided an open way to track, package, and deploy machine learning models. Delta Lake is now an open project governed by the Linux Foundation, which supports its long-term sustainability and community oversight. Databricks also developed Koalas, which gave Spark a pandas-like interface. Koalas has since been merged into Apache Spark as the pandas API on Spark, so it is now available to every Spark user. Databricks acquired Redash, an open-source SQL analytics tool, and kept it open source. Finally, Databricks has open-sourced its universal catalog for data and AI: Unity Catalog. What does all of this prove? It shows that Databricks is committed to a culture of OSS and has been since its inception.
In contrast, Snowflake was built as a cloud data warehouse with a proprietary approach from the start. For a long time, Snowflake stored data in its own format, and queries only ran inside its engine. This design gave Snowflake control over data handling and made it more difficult to leave once you loaded your data into their environment.
Counter Points
In a recent discussion, Nick Akincilar countered that the version of certain tools running on the Databricks platform isn’t always the same as the open-source edition. Unity Catalog, for example, has an open-source component, but its platform implementation includes proprietary features for advanced governance that are not available in the open version. Even if the underlying technology has been open-sourced in some form, replicating Databricks exactly can still be tough if you want to move away. This is certainly true, and can be seen on Databricks’ own page, which shows that there is an open-source version and a managed version of each tool. I support keeping up the pressure on Databricks to continue to open-source its innovations on these tools, even if there is always a “premium” version on the platform whose features reach the open version later.
Another point raised in that discussion was that Snowflake is not a stranger to OSS. While it is true that Snowflake has begun venturing into open source, this is a recent phenomenon. Snowflake began offering support for Apache Iceberg, an open table format, in 2022, and developed Polaris in 2024. Its official communications have highlighted both, which points to a shift toward open standards in recent years. These moves suggest Snowflake recognizes how much open standards for data storage and cataloging matter to its clients.
In addition, it’s worth noting that migration tools do exist for moving data in and out of Databricks & Snowflake, including for proprietary formats and components. Migration is not always seamless or cheap, but the existence of these tools demonstrates that once enough demand arises, solutions appear in the market. Though lock-in can still be real, the industry is finding ways to reduce its sting.
Tear Down That Wall!
Snowflake’s recent support for open projects such as Apache Iceberg and Polaris represents a break from the old model of closed systems. I see this as a victory for open competition. I am glad that even established proprietary platforms are recognizing the importance of openness and relying less on vendor lock-in. These moves suggest that competition with Databricks has driven Snowflake toward more open standards. This can benefit everyone by reducing long-term risks around data lock-in.
One might argue that Databricks also has some obstacles in terms of replication. Having both an open-source and a proprietary version of some tools does make it tricky to run exactly the same environment on your own. However, that concern misses the main point about open-source culture. Databricks has been leading the charge for OSS in big data and data science for over a decade now, and that has created trust. If someone created a stronger variant of Delta Lake, Databricks would likely adopt or align with it, since the company relies on community-driven improvements. The open-source culture at Databricks encourages continuous community improvement and innovation, and this is one of the main pillars of my support for Databricks.
In politics, a culture of freedom leads to stability. Countries that encourage a culture of freedom find their governments responsive to public needs, just as companies that commit to a culture of OSS must continually improve or risk losing support from their users.
Similarly, a diverse, community-based development environment prevents any single entity from controlling all the power, which ultimately benefits users. The more people are involved in developing a platform, the harder it is for any single player to abuse that power. Databricks has proven that its growth comes from a large ecosystem of users and developers who want to share code openly. In an environment like that, the best solutions often rise to the top, even if they come from a community member instead of the vendor itself.
Conclusion
The Berlin Wall eventually crumbled, and in the data world we’re seeing something similar as Snowflake embraces open-source options like Iceberg. As proprietary platforms integrate and spearhead more open-source projects, we see progress toward greater freedom, mirroring the historical move toward open borders. Clients celebrate this because it gives them options for storing and moving data without artificial moats. A market where lock-in is weaker means companies can choose the best tool for the job, and vendors have to compete on features because customers can switch if needed.
On the whole, those of us who have an affinity for open source would lean toward Databricks if all else is equal. If long-term flexibility and control over your data are top priorities, Databricks’ commitment to open formats and community-driven projects offers a greater sense of comfort. Databricks’ history with Apache Spark, Delta Lake, MLflow, and Unity Catalog shows it aims to keep the core technology accessible rather than locking it behind proprietary rules.
This does not mean Snowflake is the “wrong” choice, especially if it meets your immediate needs more closely. But if freedom from lock-in and a strong open-source culture rank high on your list, Databricks will likely offer peace of mind in the long run. Ultimately, while both platforms are improving, a culture of open source ensures that power and innovation are distributed across a community of data enthusiasts and practitioners, reducing the risk of any single vendor abusing its position.