Striking a balance with ‘open’ at Snowflake


The relative merits of “open” have been hotly debated in our industry for years. There is a sense in some quarters that being open is beneficial by default, but this view does not always fully consider the objectives being served. What matters most to the vast majority of organizations are security, performance, costs, simplicity, and innovation. Open should always be employed in service of those goals, not as the goal in itself.

When we develop products at Snowflake, we evaluate where open standards, open formats, and open source can create the best outcome for our customers. We believe strongly in the positive impact of open and we are grateful for the open source community’s efforts, which have propelled the big data revolution and much more. But open is not the answer in every instance, and by sharing our thinking on this topic we hope to provide a useful perspective to others creating innovative technologies.


Open is often understood to describe two broad elements: open standards and open source. We’ll look at them each in more detail here.

Open standards

Open standards encompass file formats, protocols, and programming models, which include languages and APIs. Although open standards generally provide value to users and vendors alike, it’s important to understand where they serve higher-level priorities and where they do not.

File formats

We agree that open file formats are an important counter to the very real problem of vendor lock-in. Where we differ is in the assertion that those open formats are the optimal way to represent data during processing, and that direct file access should be a key characteristic of a data platform. 

At first glance, the ability to directly access files in a standard, well-known format is appealing, but it becomes troublesome when the format needs to evolve. Consider an enhancement that enables better compression or better processing. How do we coordinate across all possible users and applications to understand the new format?

Or consider a new security capability where data access depends on a broader context. How do we roll out a new privacy capability that reasons through a broader semantic understanding of the data to avoid re-identification of individuals? Is it necessary to coordinate all possible users and applications to adopt these changes in lockstep? What happens if one of these is missed?

Our long experience with these trade-offs gives us a strong conviction about the superior value of providing abstraction and indirection versus exposing raw files and file formats. We strongly believe in API-driven access to data, in higher-level constructs abstracting away physical storage details. This is not about rejecting open; it’s about delivering better value for customers. We balance this with making it very easy to get data in and out in standard formats.

A good illustration of where abstracting away the details of file formats significantly helps end users is compression. An ability to transparently modify the underlying representation of data to achieve better compression translates to storage savings, compute savings, and better performance. Exposing the details of file formats makes it next to impossible to roll out better compression without incurring long migrations, breaking changes, or added complexity for applications and developers. 
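The compression point can be illustrated with a small sketch. Everything in it is hypothetical and for illustration only: the `ColumnStore` class, its method names, and the choice of JSON plus zlib as the internal encoding are assumptions, not a description of how Snowflake stores data. It shows callers reading and writing through an API while the provider changes the stored representation underneath them.

```python
import json
import zlib


class ColumnStore:
    """Hypothetical store: callers see values, never the stored bytes."""

    def __init__(self):
        self._encoding = "plain"  # internal detail, not part of the API
        self._blob = json.dumps([]).encode("utf-8")

    def _decode(self):
        # Pick the decoder that matches the current internal encoding.
        if self._encoding == "zlib":
            raw = zlib.decompress(self._blob)
        else:
            raw = self._blob
        return json.loads(raw.decode("utf-8"))

    def _encode(self, values, encoding):
        raw = json.dumps(values).encode("utf-8")
        self._blob = zlib.compress(raw) if encoding == "zlib" else raw
        self._encoding = encoding

    def append(self, value):
        values = self._decode()
        values.append(value)
        self._encode(values, self._encoding)

    def scan(self):
        return self._decode()

    def stored_bytes(self):
        return len(self._blob)

    def upgrade_compression(self):
        """Provider-side change: re-encode in place; the API is untouched."""
        self._encode(self._decode(), "zlib")


store = ColumnStore()
for _ in range(1000):
    store.append("us-east-1")

before_rows, before_size = store.scan(), store.stored_bytes()
store.upgrade_compression()
after_rows, after_size = store.scan(), store.stored_bytes()
```

Code that calls only `append` and `scan` keeps working after `upgrade_compression` runs, and the stored size shrinks. A consumer that had parsed the raw bytes directly would have broken at that point.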

Similar issues arise when we think about enhancements to security, data governance, data integrity, privacy, and many other areas. The history of database systems offers plenty of examples, like ISAM or CODASYL, showing us that physical access to data leads to an innovation dead end. More recently, adopters of Hadoop found themselves managing costly, complex, and unsecured environments that didn’t deliver the promised performance.

In a world with direct file access, introducing new capabilities translates into delays in realizing the benefits of those capabilities, complexity for application developers, and, potentially, governance breaches. This is another point arguing for abstracting away the internal representation of data to provide more value to customers, while supporting ingestion and export of open file formats. 

Open protocols and APIs

Data access methods are more important than file formats. We all agree that avoiding vendor lock-in is a key customer priority. But while some believe that open formats are the solution, the heavy lifting in any migration is really about code and data access, whether it’s protocols and connectivity drivers, query languages, or business logic. Those who have gone through a system migration can likely attest that the topic of file formats is a red herring.

For us, this is where open matters most — it’s where costly lock-in can be avoided, data governance can be maximized, and greater innovation is possible. Focusing on open protocols and APIs is key to avoiding complexity for users and enabling continuous, transparent innovation.

Open source

The benefits cited for open source include a greater understanding of the technology, increased security through transparency, lower costs, and community development. Open source can deliver against some of these goals, and does so primarily when technology is installed on-premises, but the shift to managed services greatly alters these dynamics.

When it comes to greater understanding of code, consider that a sophisticated query processor is typically built and optimized over several years by dozens of Ph.D. graduates. Making the source code available will not magically allow its users to understand its inner workings, but there may be greater value in surfacing data, metadata, and metrics that provide clarity to customers.

Another aspect of this discussion is the desire to copy and modify source code. This can provide value and optionality to organizations that can invest to build these capabilities, but we’ve also seen it lead to undesirable consequences, including fragmented platforms, less agility to implement changes, and competitive dysfunction. 

Increased security

This has traditionally been one of the main arguments for open source. When an organization deploys software within its security perimeter, source code availability can indeed increase confidence about security. But there is a growing awareness of the risks in software supply chains, and complex technology solutions often aggregate multiple software subsystems without an understanding of the full end-to-end impact on security.

Luckily there is a better model, which is the deployment of technology as managed cloud services. Encapsulation of the inner workings of these services allows for faster evolution and speedy delivery of innovation to customers. With additional focus, managed services can remove the configuration burden and eliminate the effort required for provisioning and tuning. 

Lower cost

Most organizations have recognized by now that not paying a software license does not necessarily mean lower costs. Beyond maintenance and support, that view ignores the cost and complexity of deploying, updating, and break-fixing software. Cost should be measured in terms of total cost and price/performance out of the box. Here, too, managed services are preferable, removing among other things the need to manage versions, work around maintenance windows, and fine-tune software.

Community

One of the most powerful aspects of open source is the notion of community, by which a group of users work collaboratively to improve a technology and help one another. But collaboration does not need to imply source code contribution. We think of community as users helping one another, sharing best practices, and discussing future directions for the technology. 

As the shift from on-premises to the cloud and managed services continues, these topics of control, security, cost, and community recur. What’s interesting is that the original goals of open source are being met in these cloud environments without necessarily providing source code for everyone—which is where we started this discussion. We must not lose sight of the desired outcomes by focusing on tactics that may no longer be the best route to those outcomes.

Open at Snowflake

At Snowflake, we think about first principles, about desired outcomes, about intended and unintended consequences, and, most importantly, about what’s best for our customers. As such, we don’t think of open as a blanket, non-negotiable attribute of our platform, and we are very intentional in choosing where and how we embrace it. 

Our priorities are clear: 

  1. Deliver the highest levels of security and governance; 
  2. Provide industry-leading performance and price/performance through continuous innovation; and 
  3. Set the highest levels of quality, capabilities, and ease of use so our customers can focus on deriving value from data without the need to manage infrastructure. 

We also want to ensure that our customers continue to use Snowflake because they want to and not because they’re locked in. To the extent that open standards, open formats, and open source help us achieve those goals, we embrace them. But when open conflicts with those goals, our priorities dictate against it.

Open standards at Snowflake

With those priorities in mind, we have fully embraced standard file formats, standard protocols, standard languages, and standard APIs. We’re intentional about where and how we do so, and we have invested heavily in the ability to leverage the capabilities of our parallel processing engine so that customers can get their data out of Snowflake quickly should they need or choose to. However, abstracting away the details of our low-level data representation allows us to continually improve our compression and deliver other optimizations in a way that is transparent to users. 

We can also advance the controls for security and data governance quickly, without the burden of managing direct (and brittle) access to files. Similarly, our transactional integrity benefits from our level of abstraction and not exposing underlying files directly to users. 

We also embrace open protocols, languages, and APIs. This includes open standards for data access, traditional APIs such as ODBC and JDBC, and also REST-based access. Similarly, supporting the ANSI SQL standard is key to query compatibility while offering the power of a declarative, higher-level model. Other examples we embrace include enterprise security standards such as SAML, OAuth, and SCIM, and numerous technology certifications.
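As a rough sketch of why standard access interfaces matter for portability, the example below writes application logic against Python's DB-API 2.0 (PEP 249) and plain SQL only. The in-memory `sqlite3` database is a stand-in chosen because it ships with Python; the `orders` table and the `top_regions` function are hypothetical. Other DB-API drivers expose the same `cursor`, `execute`, and `fetchall` calls, though the parameter placeholder style (`?` here) can differ between drivers.

```python
import sqlite3


def top_regions(conn, min_total):
    """Uses only DB-API 2.0 calls and standard SQL."""
    cur = conn.cursor()
    cur.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM orders GROUP BY region "
        "HAVING SUM(amount) >= ? "
        "ORDER BY total DESC",
        (min_total,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows


# Stand-in database for the sketch; any DB-API connection could be passed in.
conn = sqlite3.connect(":memory:")
setup = conn.cursor()
setup.execute("CREATE TABLE orders (region VARCHAR(16), amount INTEGER)")
setup.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("emea", 40), ("apac", 25), ("emea", 35), ("amer", 10)],
)
conn.commit()
setup.close()

result = top_regions(conn, 20)
```

Because `top_regions` touches nothing beyond the connection interface and the query text, moving it to another engine is mostly a matter of swapping the connection and adjusting the placeholder style. The format of the files underneath never enters the picture.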

With proper abstractions in place and open promoted where it matters, open protocols allow us to move faster (because we don’t need to reinvent them), allow our customers to reuse their knowledge, and enable fast innovation by separating the “what” from the “how.”

Open source at Snowflake

We deliver a small number of components that get deployed as software into our customers’ systems, such as our JDBC driver, our Python connector, and our Kafka connector. For all of these we provide the source code. Our goal is to enable maximum security for our customers. We do so by delivering our core platform as a managed service, and we increase peace of mind about installable software by making it open source.

However, a misguided application of open can create costly complexity instead of low-cost ease of use. Offering stable, standard APIs while not opening up our internals allows us to quickly iterate, innovate, and deliver value to customers. It also means customers cannot create dependencies on internal implementation details, whether deliberately or unintentionally, because we encapsulate them behind APIs that follow solid software engineering practices. That is a major benefit for both sides, and it’s key to maintaining our weekly cadence of releases, to continuous innovation, and to resource efficiency. Customers who have migrated to Snowflake tell us consistently that they appreciate those choices.

The interface to our fully managed service, run in its own security perimeter, is the contract between us and our customers. We can do this because we understand every component and devote a great amount of resources to security. Snowflake has been evaluated by security teams across the gamut of company profiles and industries, including highly regulated industries such as healthcare and financial services. The system is not only secure, but the separation of the security perimeter through the clean abstraction of a managed service simplifies the job of securing data and data systems for customers.

On a final note, we love our user groups, our customer councils, and our user conferences. We fully embrace the value of a vibrant community, open communications, open forums, and open discussions. Open source is an orthogonal concept, from which we do not shy away. For example, we collaborated on open sourcing FoundationDB, and made significant contributions to evolving FoundationDB further. 

However, we don’t extrapolate from this to say there is an inherent merit to open source software. We could equally have used a different operational store, and a different model for adapting it to our requirements, if needed. The FoundationDB example illustrates our key thesis: Open is a great collection of initiatives and processes, but it’s one of many tools. It is not the hammer for all nails and is the best choice only in some situations.
