
Introduction
As a data scientist, I’ll admit I had my doubts about whether notebooks could ever find their place in a professional coding environment. My skepticism stemmed from the disconnect I felt between the data science world and the realm of established best practices in software development. The question loomed: Could notebooks, known for their interactive and exploratory nature, truly be harnessed for professional coding? My journey with Databricks eventually led me to a different perspective, but it began with valid concerns (in fact, insecurities) and skepticism that many data scientists share.
The Skepticism
At the heart of my skepticism was the prevailing wisdom about professional coding. Best practices were etched in stone, or so it seemed. Version control, continuous integration, disciplined coding practices, and the use of Integrated Development Environments (IDEs) were not just norms but sacred principles. It felt like heresy to suggest that any of that could be done in notebooks.
Notebooks have their place in the data scientist’s toolkit, no doubt. They are perfect for exploration, experimentation, and sharing insights with teammates. However, when it comes to the meticulous and regimented world of professional coding, notebooks appear to fall short. Namely:
- Version Control: How can we possibly maintain version control with notebooks? Unlike traditional code files, whose changes can be tracked line by line, notebooks are living documents that evolve with every execution, and their embedded outputs and metadata make changes hard to follow. The fear of losing track of code changes is a real one.
- Continuous Integration: Continuous integration and continuous delivery (CI/CD) pipelines are the bedrock of professional software development. How can notebooks fit into this highly structured workflow? Their dynamic, ad hoc nature seems incompatible with the rigorous CI/CD processes we rely on.
- Disciplined Coding: Best practices demand disciplined coding habits, adherence to coding standards, and rigorous code reviews. Notebooks, often perceived as a loose mix of code snippets and explanatory text, seem to defy that discipline.
Databricks
Amidst this skepticism about notebooks, Databricks emerged as a transformative technology in the world of data science. Databricks offers an ecosystem where data science can thrive while meeting the standards of practice upheld in the world of professional coding.
At the heart of Databricks’ mission is a commitment to simplify and unify data analytics and machine learning. Databricks envisions a world where data-driven insights are within reach of every organization, regardless of size or industry, and it works toward that vision with a collaborative, integrated platform. By providing tools that bridge the gap between data science and engineering, Databricks aims to democratize data and accelerate innovation, enabling organizations to harness their data for better decision-making and transformative discoveries.
Databricks Repos
Databricks Repos totally transformed the way I interact with notebooks, and ultimately my opinion of them. Repos offers a seamless way to automatically connect and push notebook work to external repositories using the Databricks CLI and GitHub Actions, making version control possible.
It was a game-changer for my work because it lets data scientists (like me) navigate the intricacies of version control and CI/CD pipelines without compromising the flexibility and interactivity that notebooks offer. Databricks Repos became an example of how existing tools and ideas can evolve to meet the changing landscape of technology, bridging the gap between established best practices and the unique needs of data scientists.
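To make that concrete, a CI step (for example, a GitHub Actions job that fires on a push to main) can call the Databricks Repos REST API to fast-forward the workspace copy of a repo to the latest commit. The snippet below is a minimal sketch rather than the only way to wire this up: it assumes DATABRICKS_HOST and DATABRICKS_TOKEN are available as environment variables (for instance, as GitHub Actions secrets) and uses a hypothetical repo ID; the Databricks CLI's repos commands can accomplish the same thing.

```python
# Minimal sketch: sync a Databricks Repo to the latest commit on a branch.
# Assumes DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com) and
# DATABRICKS_TOKEN are set in the environment, e.g. as GitHub Actions secrets.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"].rstrip("/")
TOKEN = os.environ["DATABRICKS_TOKEN"]

REPO_ID = "123456789"  # hypothetical ID of the workspace repo to update
BRANCH = "main"

# PATCH /api/2.0/repos/{repo_id} checks out the latest commit on the branch.
resp = requests.patch(
    f"{HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": BRANCH},
)
resp.raise_for_status()
print(f"Repo {REPO_ID} now tracks the tip of '{BRANCH}'.")
```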
Integration Tools
VS Code
Another pivotal moment was the integration between Databricks and Visual Studio Code (VS Code). This integration made it possible to work within the familiar confines of an Integrated Development Environment (IDE) while harnessing the capabilities of Databricks notebooks. This newfound synergy gave me the space to experiment and explore my data in a notebook environment, and then seamlessly transition my work to an IDE for more structured coding.
With this integration, users gain access to a powerful IDE that is renowned for its versatility and feature-rich environment. VS Code is more than just a text editor; it’s a hub of functionality that can profoundly impact your coding experience. Here are some key reasons why this integration matters:
- Enhanced Code Editing: VS Code provides advanced code editing features like autocompletion, syntax highlighting, and error checking. This means you can write code faster, with fewer mistakes, and with better readability, the latter being a crucial aspect of professional coding.
- Seamless Debugging: Debugging your code becomes a breeze with VS Code. You can set breakpoints, inspect variables, and step through code to identify and fix issues quickly. This level of debugging is often challenging to achieve within a notebook environment.
- Extensive Extensions: VS Code boasts a vast library of extensions that cater to different programming languages and tools. You can customize your IDE to meet your specific needs, such as incorporating extensions that enhance your data science or coding workflow, from Git integration to Jupyter Notebook support.
- Git Integration: Managing version control and collaborating with team members becomes streamlined with VS Code. VS Code seamlessly integrates with Git, allowing you to commit, push, pull, and manage branches—all without leaving your coding environment.
- Integrated Terminals: VS Code provides built-in terminals that enable you to run shell commands and scripts directly from your IDE. This functionality simplifies tasks like data preprocessing, model training, or any command-line operations that are part of your workflow.
In essence, integrating Databricks with VS Code empowers you to work in a familiar and feature-rich coding environment while harnessing the capabilities of Databricks notebooks. This integration made it easier to move from the experimental, exploratory space of notebooks to more structured and professional coding. Used together, Databricks and VS Code create a holistic coding environment that empowers data scientists more than either tool does on its own.
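As a small illustration of that workflow, Databricks Connect (which the Databricks VS Code extension builds on) lets you execute Spark code from the IDE against a remote Databricks cluster. The snippet below is a minimal sketch, assuming databricks-connect version 13 or later is installed and a default profile (host, token, and cluster) is already configured; the table name is purely hypothetical.

```python
# Minimal sketch: run Spark code from VS Code against a Databricks cluster
# via Databricks Connect. Assumes databricks-connect >= 13 is installed and a
# default profile (host, token, cluster) is configured on this machine.
from databricks.connect import DatabricksSession

# The builder picks up credentials from the configured Databricks profile.
spark = DatabricksSession.builder.getOrCreate()

# "sales.transactions" is a hypothetical table, used only for illustration.
df = spark.read.table("sales.transactions")
summary = df.groupBy("region").count()

# The heavy lifting runs on the remote cluster; results come back to the IDE,
# where you can set breakpoints, inspect variables, or feed them into tests.
summary.show()
```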
Databricks Asset Bundles (DABs)
The introduction of Databricks Asset Bundles marked another profound shift in how notebooks fit into professional workflows. This tool streamlined my notebook experiments into a structured, CI/CD-friendly workflow. The emphasis shifted from mere adaptation to actively reshaping best practices to accommodate new tools.
Databricks Asset Bundles represent a significant evolution in the data science and coding landscape by streamlining the management and deployment of resources, pipelines, jobs, and more within the Databricks ecosystem. These bundles introduce a YAML file that serves as a comprehensive blueprint, mapping out the entire infrastructure of projects, from development to staging and production workspaces. This file not only defines the structure of your particular workspace, but also outlines the pipelines, compute resources, models, and experiments that reside within Databricks.
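To make that concrete, here is a rough sketch of the kind of blueprint such a file might describe. In a real project this lives in a databricks.yml YAML file at the project root; it is expressed below as a Python dict purely for illustration, and every name (the bundle, targets, job, notebook path, and workspace URLs) is a hypothetical placeholder.

```python
# Rough sketch of a bundle blueprint. In practice this content lives in a
# databricks.yml YAML file; the dict below only mirrors its overall shape.
# All names and URLs are hypothetical placeholders.
bundle_blueprint = {
    "bundle": {"name": "churn_model"},  # project identity
    "targets": {  # one entry per environment (dev, staging, prod, ...)
        "dev": {
            "mode": "development",
            "workspace": {"host": "https://dev-workspace.example.com"},
        },
        "prod": {
            "mode": "production",
            "workspace": {"host": "https://prod-workspace.example.com"},
        },
    },
    "resources": {  # the jobs, pipelines, models, etc. that get deployed
        "jobs": {
            "nightly_training": {
                "name": "nightly-churn-training",
                "tasks": [
                    {
                        "task_key": "train",
                        "notebook_task": {"notebook_path": "./notebooks/train_model"},
                    }
                ],
            }
        }
    },
}
```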
The true power of Databricks Asset Bundles lies in their ability to bring CI/CD systems into the Databricks environment. Using the bundle commands in the Databricks CLI, data scientists and developers can deploy code, run jobs, and manage resources with ease. These CLI commands can be seamlessly integrated into existing CI/CD pipelines, making it possible to incorporate tests, job executions, and code deployments as automated steps.
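The sketch below shows one way a CI step might drive those commands, assuming the Databricks CLI is installed on the runner and authenticated through environment variables; the target name and job key are hypothetical and would come from your own databricks.yml.

```python
# Minimal sketch of a CI step that validates and deploys a bundle, then runs
# a job, by shelling out to the Databricks CLI. Assumes the CLI is installed
# and authenticated (e.g. via DATABRICKS_HOST / DATABRICKS_TOKEN).
import subprocess

TARGET = "dev"                 # hypothetical target defined in databricks.yml
JOB_KEY = "nightly_training"   # hypothetical job resource key from the bundle


def run(cmd: list[str]) -> None:
    """Run a CLI command and fail the CI step if it exits non-zero."""
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Check the bundle configuration, push code and resources to the workspace,
# then trigger the job defined in the bundle.
run(["databricks", "bundle", "validate", "--target", TARGET])
run(["databricks", "bundle", "deploy", "--target", TARGET])
run(["databricks", "bundle", "run", JOB_KEY, "--target", TARGET])
```

In a GitHub Actions workflow, a script like this would run as a single step after checkout, keeping the notebook code, the job definition, and the deployment logic together in one version-controlled repository.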
This integration empowers teams to enjoy the best of both worlds: the flexibility and interactivity of the Databricks UI and the structured, automated workflows of traditional CI/CD systems. It allows data scientists and developers to work efficiently within their preferred IDE while harnessing the capabilities of Databricks when needed. Whether you need to run tests, deploy code, or orchestrate complex data science workflows, Databricks Asset Bundles provide a cohesive solution that enhances collaboration, streamlines processes, and ensures the seamless integration of data science into the broader software development lifecycle.
New Paradigms
Certainly, challenges persist. Databricks Repos is an evolving tool, and the ability to resolve merge conflicts and create pull requests remains a work in progress. However, this underscores the philosophy that best practices should be adaptable, not static. The paradigms of the past are giving way to new, adaptive practices. These emerging paradigms may differ from their predecessors, but they are resilient and well-suited to the dynamic demands of modern data science.
Additionally, tools like LakeFS are reshaping our perceptions of professional code by introducing robust version control for data. With LakeFS, we may see a shift toward unified workspaces rather than traditional development, staging, and production environments (this paradigm shift in data science development is a topic we’ll explore in a future blog).
In hindsight, my initial skepticism about notebooks in a professional coding setting was rooted in some very valid practical concerns, but Databricks responded to those concerns with innovative solutions. Databricks and its suite of tools have shown me that the data science community is not just keeping pace with change; we are driving that change by adapting tools once shunned by professional coders. Without a doubt, professional coding has given data scientists standards that are invaluable, and yet data scientists continue to reshape our understanding of best practices and of what's possible in the wild world of data.