Tabular Editor x Databricks (Part 2)

Borp has been asked to develop their first Power BI semantic model using Databricks. But they’ve never used Databricks before – to date they’ve built their career as an Enterprise Power BI Developer using CSV files or relational databases such as SQL Server. 

Databricks all sounds a bit scary. 

What exactly is it? And where does it fit amongst the types of data sources Borp is used to working with?

Databricks: The Data Intelligence Platform

Databricks refers to itself as the Data Intelligence Platform. But what exactly does that mean? Hmm… maybe it’s worth digging a little deeper. In the past, Databricks has also described itself as a Unified Analytics Platform.

What that means is that it’s a one-stop shop for all types of analytics workloads, be that business intelligence, data engineering, data science, or artificial intelligence.

The evolution of data platforms

Historically, you might have needed separate platforms for each of these. Relational databases such as SQL Server or Oracle were ideal for building data warehouses used for business intelligence and reporting purposes.

But with the explosion of big data in the early 2010s, a new approach emerged – the data lake. With the ability to store both structured and unstructured data, data lakes helped overcome some of the limitations of data warehouses, catering to the growing volume, variety, and velocity of data. The file-based nature of data lake storage also made it a cost-effective option, and having a central repository for all this raw data provided a great platform for data science and machine learning initiatives.

However, data lakes did not support ACID transactions, leading to data integrity and reliability issues. And their schema-on-read nature made them complex to query, impacting their usability as well as query performance. 

The need remained to supplement a data lake with a data warehouse: a place where structured, reliable data could be modelled into an easily understandable, queryable, and performant structure.

The rise of the Data Lakehouse

Around the turn of the decade, the next new concept for data platforms began to surface – the Data Lakehouse. “Data Lakehouse” is a portmanteau of Data Lake and Data Warehouse, and the idea is that this architecture gives you the best of both worlds: the reliability and usability of a data warehouse alongside the flexibility and scalability of a data lake.

The enabler for this is the advent of open table formats. These take the flexible storage of file formats such as Parquet or ORC and supplement them with an extra layer of metadata that allows you to interact with the files as though they were a table in a database. Crucially, these new table formats are ACID-compliant, adding the data integrity and reliability of databases to the mix.
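To make that a little more concrete, here’s a minimal sketch in PySpark – the path and column names are invented purely for illustration. A Delta table is simply a folder of Parquet data files plus a _delta_log folder of transaction metadata, and it’s that metadata layer that lets you treat the files like a table:

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is available on the Spark session,
# as it is by default on Databricks.
spark = SparkSession.builder.getOrCreate()

# Writing in the Delta format produces plain Parquet data files
# plus a _delta_log folder of transaction metadata.
orders = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 24.50)],
    ["order_id", "product", "amount"],
)
orders.write.format("delta").mode("overwrite").save("/tmp/demo/orders")

# The metadata layer lets you read those files back as a table,
# with ACID guarantees on reads and writes.
spark.read.format("delta").load("/tmp/demo/orders").show()
```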

Further to this, it’s become common to add a “catalog” layer of metadata on top of open table formats, tying all these tables together as though they were a database themselves.
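On Databricks, that catalog layer surfaces as a familiar three-level namespace (catalog.schema.table). As a rough sketch continuing the example above – the catalog and schema names here are hypothetical, and registering an external location like this assumes the appropriate permissions are in place:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the Delta files from the previous sketch under a
# database-style namespace (hypothetical catalog and schema names).
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders
    USING DELTA
    LOCATION '/tmp/demo/orders'
""")

# Query the files as though they were an ordinary database table.
spark.sql(
    "SELECT product, SUM(amount) AS revenue "
    "FROM main.sales.orders GROUP BY product"
).show()
```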

What has this got to do with Databricks?

Databricks was founded in 2013 by the creators of the open-source data processing engine Apache Spark. Initially, the purpose was to commercialise Spark into a more accessible platform. In 2019, Databricks launched the open-source storage framework Delta Lake, which has since become one of the most popular open table formats.

The use of both Spark and Delta Lake makes Databricks ideal for building a Data Lakehouse. In fact, Databricks is often credited with coining the term!

Bear in mind that “Data Lakehouse” is not a technology-specific term. There are other processing technologies, such as Apache Arrow, and other open table formats, such as Iceberg and Hudi, which can be used to build a Data Lakehouse. There are other vendors in the market, too. Microsoft Fabric is another example, also utilising Spark and Delta Lake, whilst companies such as Dremio and Snowflake use some of the other technologies mentioned to lean into the Data Lakehouse paradigm.

Databricks is certainly one of the pioneers in this space, though. Its “Data Intelligence Platform” label is ultimately a way of describing the fact that the technologies underpinning Databricks can unify all the data work an organisation needs to do under a single umbrella, utilising the Data Lakehouse architecture alongside AI features that help increase your productivity.

What does this mean for Borp?

That’s an awful lot of big words and clever-sounding sentences strung together… Is Borp really feeling any less overwhelmed about getting to grips with Databricks?

Databricks as a platform covers a very wide range of analytics use cases. But the most important thing for Borp to know is this: Databricks is a platform where you can build a data warehouse, just like the platforms Borp has been using for years.

You can connect tools like Power BI and Tabular Editor 3 to Databricks, and it will feel just like interacting with any other data warehouse platform Borp has used previously.
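Whichever tool you use, the connection boils down to the same handful of details: the server hostname of your Databricks workspace, the HTTP path of a SQL warehouse, and a credential. As a rough illustration, here’s the same idea from Python using the databricks-sql-connector package – the hostname, HTTP path, token, and table name below are all placeholders:

```python
from databricks import sql  # pip install databricks-sql-connector

# The same details Power BI or Tabular Editor 3 would ask for
# (all values below are placeholders).
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi...",
) as connection:
    with connection.cursor() as cursor:
        # Query a lakehouse table exactly as you would any data warehouse.
        cursor.execute("SELECT product, amount FROM main.sales.orders LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```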

Should they wish to broaden their horizons and explore other data disciplines, those features are available too – all within a single, soon-to-be-familiar interface.

There is nothing to be afraid of here. Borp is already starting to feel better about this new world. 

In the next part of this series, we’ll take a look at how Borp can start to explore the data available to them in Databricks.
