While data lakes and data warehouses are conceptually similar, they are ultimately quite distinct beasts. If a company is looking to house easy-to-query structured data for everyone to use, then a data warehouse is probably its best bet. Conversely, if the company wants to leverage big data in its purest, most flexible form, it is most likely looking for a data lake: kept in its native, unprocessed format, the data can be queried in limitless ways as a business's needs evolve.
However, huge data lakes comprising petabytes of distinct datasets can become unwieldy and difficult to manage. That is the problem fledgling startup Treeverse wants to solve with an open source platform called LakeFS, which is designed to help enterprises manage their data lake much as they manage their code: "transform your object storage into a Git-like repository," as the company puts it. This means version control and other Git-like operations such as branch, commit, merge, and revert, plus full reproducibility of all data and code.
“The number one problem LakeFS solves is the manageability of large-scale data lakes featuring many datasets that are maintained by lots of different people — at this scale, a lot of the workflows people are familiar with start to break,” Treeverse cofounder and CEO Einat Orr told VentureBeat. “The Git-like operations exposed by LakeFS can solve these problems, similar to the way Git allows many developers to collaborate over a large codebase without causing code quality issues.”
Founded out of Tel Aviv in 2020, Treeverse has largely flown under the radar before now, but today the Israeli company revealed that it has raised $23 million in a series A round of funding from Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures. The funding will be used to expedite both the development and the adoption of LakeFS in enterprise data teams, though the company already claims users at organizations such as Volvo, Intuit, and Similarweb.
How it works
As an open source platform, LakeFS is flexible and can be deployed in the cloud (AWS, Azure, or Google Cloud) or on-premises. It also works out of the box with most modern data frameworks, including Kafka, Apache Spark, Amazon Athena, Delta Lake, Databricks, Presto, and Hadoop.
But where exactly does LakeFS sit in the data stack? And what other tools might fit into that stack?
A modern enterprise data stack typically comprises many tools, including data ingestion smarts from companies such as Fivetran and cloud-based data lakes or data warehouses like Snowflake or Google's BigQuery. The process of pooling data from multiple sources (e.g., CRM and marketing tools) and unifying it into a standard format that is easy to run queries and analytics against is commonly done via "extract, transform, and load" (ETL), where the data is transformed before it enters the warehouse, or via "extract, load, and transform" (ELT), where the data is transformed on demand inside a warehouse or lake.
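The ETL/ELT distinction comes down to when the transformation runs. A minimal sketch, with plain Python standing in for the ingestion tool and the warehouse (all data and names here are hypothetical):

```python
# Toy illustration of ETL vs. ELT; no real warehouse is involved.
raw_crm = [{"name": " Ada ", "signup": "2021-07-01"}]
raw_ads = [{"name": "GRACE", "signup": "2021-07-02"}]

def transform(rows):
    # Unify records into a standard format (normalized names).
    return [{"name": r["name"].strip().title(), "signup": r["signup"]} for r in rows]

# ETL: transform first, then load only the cleaned rows into the warehouse.
etl_warehouse = transform(raw_crm) + transform(raw_ads)

# ELT: load the raw rows as-is; transform on demand inside the warehouse/lake.
elt_lake = raw_crm + raw_ads
elt_view = transform(elt_lake)  # computed when a query needs it

print(etl_warehouse == elt_view)  # both paths yield the same unified records
```

The trade-off: ETL stores only the cleaned form, while ELT keeps the raw data around so new transformations can be applied later without re-ingesting.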
LakeFS sits between the ELT technologies and the data lake. "Integrating ELT technologies with LakeFS enables writing new data to a designated branch, and testing it to ensure quality before exposing to consumers," Orr explained. "This workflow provides important guarantees about production data to consumers of the data."
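The branch-then-merge workflow Orr describes can be sketched with a toy in-memory model. This illustrates the Git-like semantics only; the `ToyRepo` class and its methods are inventions for this article, not the LakeFS API:

```python
import copy

class ToyRepo:
    """Minimal stand-in for a versioned object store: each branch maps
    object paths to contents, and commits snapshot a branch's state."""

    def __init__(self):
        self.branches = {"main": {}}
        self.commits = []  # list of (branch, message, snapshot) tuples

    def branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's state.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def put(self, branch, path, data):
        self.branches[branch][path] = data

    def commit(self, branch, message):
        self.commits.append((branch, message, copy.deepcopy(self.branches[branch])))
        return len(self.commits) - 1  # commit id

    def merge(self, source, dest):
        self.branches[dest].update(self.branches[source])

    def revert(self, branch, commit_id):
        # "Time travel": restore the whole branch to a past snapshot.
        self.branches[branch] = copy.deepcopy(self.commits[commit_id][2])

# Write new data to a designated branch, test it, and only then expose it.
repo = ToyRepo()
repo.put("main", "events/2021-07-01.csv", "id,value\n1,10")
base = repo.commit("main", "baseline")

repo.branch("nightly-ingest")
repo.put("nightly-ingest", "events/2021-07-02.csv", "id,value\n2,20")

# Quality gate: consumers of "main" never see the new file if this fails.
if all(obj.startswith("id,value") for obj in repo.branches["nightly-ingest"].values()):
    repo.merge("nightly-ingest", "main")

print(sorted(repo.branches["main"]))
```

Note that the revert operates on the entire repository snapshot, not a single table, which is the distinction Treeverse draws against per-table time travel.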
Existing products on the market comparable to LakeFS include machine learning operations (MLOps) tools such as Pachyderm and DVC, which is developed by a company called Iterative.ai that raised $20 million just last month. However, these are aimed chiefly at data scientists building machine learning models. "LakeFS takes a holistic infrastructure approach and provides data version control capabilities across all providers and consumers of data through the applications they use," Orr said.
Elsewhere, open table storage formats such as Databricks' Delta Lake offer something similar by permitting "time travel" (reverting to data in a previous form) on a per-table basis, whereas LakeFS enables this over an entire data repository that might stretch across thousands of distinct tables.
There has been substantial activity across the broader data engineering space of late. Fishtown Analytics recently rebranded as Dbt Labs and raised $150 million in funding at a $1.5 billion valuation to help analysts transform data in the warehouse, while Airbyte also secured venture backing this year before opening up its data integration platform to support data lakes. And GitLab recently spun out a data integration platform called Meltano as an independent company.
One thing all these commercial companies have in common is that they are built on open source projects. And so the most obvious outstanding question when any young VC-backed company pitches its open source wares is this: What's your business model? For Treeverse, the answer is that there are no immediate plans to monetize, though the longer-term plan is, of course, to build a commercial product on top of LakeFS.
“Our goal is to develop the open source project and foster a vibrant community around it,” Orr explained. “Once we achieve our targets there, we’ll shift focus to providing an enterprise version of LakeFS that offers common premium features like managed-hosting and predefined workflows that bring best practices and ensure high quality data and resilient pipelines.”