DuckDB, the in-process analytics database management system used by Google, Facebook, and Airbnb, has released its 0.5.0 iteration.
Developed by academics at Centrum Wiskunde & Informatica (CWI), the Dutch national research institute for mathematics and computer science in Amsterdam, DuckDB is embedded in a host process. There is no DBMS server software to install, update or maintain.
For example, the DuckDB Python package can directly query data held in the Pandas library without importing or copying it. Written in C++, DuckDB is free and open source under the MIT license.
Consultancy and support are provided by DuckDB Labs. Co-founder and CEO Hannes Mühleisen, who also co-authored the code and maintains the project, told The Register he was inspired by SQLite, the serverless OLTP database engine, and saw the opportunity for a similar approach aimed at analytics.
“We worked a lot with data science practitioners and they all had these problems that were no longer theoretical problems in computer science research – they were solved decades ago – but somehow the software wasn’t there for them. With software vendors’ marketing, the technology was locked inside some of these packages, but it wasn’t accessible or it was hidden behind many, many layers of corporate bullshit,” he said.
Mühleisen and his co-founder began to realize that OLAP might require a rethink of database architecture. “We took the idea of in-process data management systems, where the entire database manager runs within the process you’re in, for example Python or even Excel, and we redesigned a system to be first class for OLAP using this approach,” said Mühleisen, who is still a senior researcher at his academic institution.
DuckDB is also often used as part of a larger data management or analysis stack. For example, if someone builds a custom application that collects data and then wants to put an SQL interface on it, in the past they might have had to copy the data and move it to another system, which could cause synchronization issues, he said. But DuckDB can query third-party datasets as if they were its own data. “You can build it on top of an existing application or dataset. And people do it,” he said.
The system's popularity among data tool builders has even spawned its own meme.
First released in 2019, DuckDB has been steadily gaining popularity ever since, with users including Google, Facebook, and Airbnb.
This week the project released its 0.5.0 iteration.
Highlights among the new features include “out-of-core” processing, which aims to address problems that can occur when in-flight data is larger than memory by offloading intermediate results. The project also added join order optimization, a perennial problem in analytical databases. Hyoun Park, CEO and chief analyst at Amalgam Insights, said DuckDB’s differentiation comes from being a small application that works within code-based processes to quickly analyze large data stores.
“This is increasingly important as workloads are deployed, performance is needed in a variety of analytical use cases, and as analytic data continues to double year over year in large organizations,” Park said. “As an open source database that can be easily integrated into specific analytical jobs, DuckDB is well suited to fill the gaps where traditional monolithic OLAP databases are more rigid, more expensive, or require transfer and duplication efforts to support analytical variety.
“DuckDB can often query data directly without intermediate processing, which speeds up processing. From a purely technological standpoint, it is somewhat similar to Actian Vector, which also takes a columnar, vectorized approach to OLAP queries, although Actian is designed to ingest data rather than work within a specific process or workload.”
But there are clear limits on when and where the system should and shouldn’t be used. While in some ways it offers a cost-effective alternative to a data warehouse and can give every data scientist a system on their laptop, it doesn’t necessarily replace enterprise data warehouse systems from companies like Teradata, Oracle, and IBM. The home page clearly states that it should not be used for “large client/server installations for centralized enterprise data warehousing”.
“It is a matter of priorities for your organization or problems with the data. Does it really depend on everyone working on the same data? If so, maybe this isn’t the best solution,” Mühleisen said.
As is common for an open source database, the project comes with an unusual name. While CockroachDB was named for its supposedly unkillable nature, and MongoDB is a contraction of “humongous”, DuckDB was apparently named after Mühleisen’s pet duck, Wilbur, who, incidentally, has appeared in The Guardian newspaper.
The project is working toward its version 1.0, after which there will be no more breaking changes. “I think we will get there with a lot of work. We always say by the end of the year, but I fear that it will not happen this year,” said Mühleisen. ®