Exclusive: Voltron Data brings new power to AI with Theseus distributed query engine



The fictional Voltron robot (from the animated science fiction show of the same name) is all about combining several robot lions into one big robot that is able to accomplish great tasks.

Voltron Data, which made its splashy debut in 2022 with $110 million in funding, is all about bringing the power of multiple open source technologies, including Apache Arrow, Apache Parquet and Ibis, together to help improve data access. Today, Voltron Data is taking the next step, announcing the new Theseus distributed query engine, in a bid to help dramatically accelerate data queries for increasingly demanding AI workloads. 

Theseus is designed to accelerate large-scale data pipelines and queries using GPUs and other hardware accelerators.

“We built Theseus based on the exact same principles of what we were doing open source support for, with modular, composable, accelerated libraries that make data systems better,” Josh Patterson, co-founder and CEO of Voltron Data, told VentureBeat in an exclusive interview. “This is our next product as we continue to go down this journey of trying to be the leading designer and builder of data systems.”


Theseus is built for massive volumes of data

Theseus is optimized for running distributed queries on large datasets of 10 terabytes or more. It targets organizations with petabyte-scale data processing needs, including Fortune 500 companies, government agencies, hedge funds, telcos, and media and entertainment firms.

A key goal of Theseus is to accelerate ETL (extract, transform, load), feature engineering, and other data preparation work to feed downstream AI and analytics systems faster. As AI systems get faster, they need more real-time data transformation.

“A lot of our users are saying their biggest problem today is they’re starving their AI systems because they can’t get data fast enough,” Patterson said. “That was the main driver behind Theseus.” 

A challenge with data queries today is they typically are limited by CPU compute capacity and performance. Theseus looks beyond traditional CPU approaches and makes use of accelerated computing technologies including GPUs. Patterson said that Theseus is “accelerator native” – meaning it is optimized to leverage Nvidia GPUs, networking, storage, and other accelerators. 

According to Patterson, the accelerator-native approach allows Theseus to run queries at scale faster than traditional CPU-based distributed engines like Apache Spark.

One AI use case where Patterson sees Theseus being particularly useful is hyperparameter optimization. He explained that an organization can churn through a lot of parameters for optimization and feature engineering as part of the process of adjusting inputs to build better models.

“The faster you can do feature engineering, the faster you can do ETL the faster you can bring in fresher data, the better your models are,” he said.
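To make that point concrete, here is a minimal sketch of a parameter sweep in plain Python. It is not Theseus code and not anything Voltron Data has published: the toy series, the `rolling_mean` feature, the `score` function and the parameter grid are all hypothetical, chosen only to show that every trial in a sweep repeats the feature engineering step.

```python
from itertools import product

# Toy data, made up for illustration only.
series = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]


def rolling_mean(values, window):
    """Feature engineering step: trailing mean over `window` points."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def score(values, window, weight):
    """Mean squared error when predicting the next value as weight * rolling mean."""
    feats = rolling_mean(values, window)
    targets = values[window:]  # the value that follows each window
    errors = [(weight * f - t) ** 2 for f, t in zip(feats, targets)]
    return sum(errors) / len(errors)


# Hypothetical grid: two feature windows times three model weights.
grid = list(product([2, 3], [1.0, 1.5, 2.0]))

# Each trial recomputes the features, so feature cost multiplies across the grid.
results = {(window, weight): score(series, window, weight) for window, weight in grid}
best = min(results, key=results.get)
```

With six trials the repeated feature step is trivial, but on a large dataset the same loop is dominated by data preparation, which is the stage Patterson describes speeding up.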

Theseus is interoperable from the ground up

Theseus embraces open standards like Apache Arrow, Apache Parquet, and Ibis for interoperability. 

Patterson emphasized that Theseus is not a proprietary, siloed system: it can query data in any Apache Arrow-compatible data lake. He explained that data can be fed directly into many popular machine learning tools and frameworks, including PyTorch and TensorFlow, as well as different types of graph databases.

“We have this seamless way to basically move data in and out of the systems,” Patterson said.

Theseus itself is just the distributed query system. Patterson explained that it doesn’t have its own front-end user interface; rather, it accepts SQL queries and Ibis, so people can map other front ends to it. The basic idea is to enable organizations to easily integrate Theseus into existing workflows.
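As a rough illustration of that separation between front end and engine, the sketch below passes a plain SQL string to a backend through a thin helper. Python's built-in `sqlite3` module stands in for the engine purely so the example runs anywhere; it is not Theseus, and the `events` table, its columns and the `run_query` helper are hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Send a SQL string to whatever engine sits behind `conn` and return the rows."""
    return conn.execute(sql).fetchall()


# sqlite3 is only a stand-in backend (assumption); any SQL engine could sit here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# A feature-engineering style aggregation, expressed as SQL by the "front end".
features = run_query(
    conn,
    "SELECT user_id, COUNT(*) AS n_events, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY user_id",
)
```

The caller only produces SQL and consumes rows, so the engine behind `run_query` could be swapped without changing the front end.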

Going to market with HPE and more partners

Voltron Data is going to market with Theseus via partnerships, the first of which is with Hewlett Packard Enterprise (HPE).

Voltron Data has partnered to bring Theseus to the HPE GreenLake hybrid cloud platform. HPE GreenLake provides the infrastructure for Theseus while also giving customers a way to unify queries across other engines using Ibis.

Looking forward, Patterson said that Voltron Data plans to expand Theseus partnerships and add more functionality like user-defined functions. The goal is tighter integration into full data science pipelines.

“I think 2024 will primarily be about making it faster and easier to integrate with new different parts of the data science pipeline, because that really empowers users,” Patterson said.

Originally appeared on: TheSpuzz
