Microsoft has been very generous with updates in 2023. Many interesting features were announced, but the biggest news was the introduction of Microsoft Fabric – a unified SaaS data platform.

Fabric invites different data personas into one collaboration space. It offers a comprehensive set of experiences, tailored for each persona. If you are, or are planning to be, a Fabric user, do you have to master all of these experiences? Of course not. But if you are constantly on a learning curve, you might consider exploring this area. As mentioned, there are plenty of components to choose from. If you're unsure which direction to take, I'll try to point one out in this article.

Brief introduction to Fabric components

In my very first article, I talked about MS Fabric and how it is connected with Power BI. Let's recap all the components here as well:

Let me begin with the components that we are not going to tackle today. Synapse Real-Time Analytics is very interesting, but it's very much use-case-specific, and not something you can use in your daily work without a specific project on your plate. Synapse Data Science, on the other hand, is a very skill-specific area. Power BI is our field, so I am going to assume we are all in a permanent learning process with this tool 🙂 All of the remaining components will be covered today.

OneLake – Fabric foundation

Before I move to the learning plan, it is important to talk about OneLake. It sits a bit outside of the learning plan itself, because no matter which part of Fabric you want to learn, OneLake is the one concept you must be familiar with anyway.
 
OneLake is simply the Microsoft Fabric data lake, built on top of ADLS Gen 2 (Azure Data Lake Storage). However, it eliminates all the complexities of the Azure realm, like role-based access control (RBAC), redundancy, data regions, etc. Being perfectly honest with you, I am not that sure if removing all these complexities is a good thing. There are a lot of Power BI users who have no idea where their data is located. And it is not only their fault, as Microsoft decided to make Power BI easy to start with, and treated some important topics as complexities that should be removed.
 
What is, without any doubt, a benefit of OneLake is that it simplifies the entire solution by reducing the number of platforms needed to deliver it. It also makes the data a lot more discoverable, which is crucial for building a healthy data culture.
 
OneLake supports any type of file. You can load your CSV or Excel files to the lake and use them for any workload. But the most important file type here is the Delta Parquet format. A Parquet file is a columnar data store, so it works in a similar way to a Power BI Semantic Model: it is very efficient in terms of storage and optimized for analytical workloads. If you create a Lakehouse or a Warehouse in Fabric, they will use this format to store the data in OneLake. The Delta part means that when your data is refreshed, instead of creating a completely new Parquet file containing the entire dataset, the new file contains only the updates versus the original dataset. Therefore, it is storage efficient, it provides a log of all the changes made to your data, and it allows you to "go back in time" using the time travel feature.
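To make time travel more tangible, here is a minimal sketch of how it could look from a Fabric notebook using PySpark (where the `spark` session is predefined). The table path "Tables/sales" is a hypothetical example, and the versions and dates depend on your table's history:

```python
# Minimal sketch of Delta time travel in a Fabric notebook (PySpark).
# "Tables/sales" is a hypothetical Lakehouse table path.

# Current state of the table
df_now = spark.read.format("delta").load("Tables/sales")

# The same table as it looked at an earlier version
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("Tables/sales")
)

# Or as of a specific point in time
df_past = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-11-01")
    .load("Tables/sales")
)
```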
 
From a purely Power BI perspective, OneLake will become a very important, if not the most important, data source for our reports. Not to mention that we can create new Semantic Models directly from data stored in OneLake. These Semantic Models will leverage the power of a new data connectivity mode – Direct Lake. This mode enables us to reuse the existing copy of the data, without bringing it into Power BI through import mode, while providing much better performance than DirectQuery.
 
 

First stop – Dataflows Gen 2

First, I believe the most natural experience for a Power BI developer within MS Fabric is Dataflows Gen 2, which is an offering within the Data Factory experience. Many of you work with Dataflows on a daily basis, and for sure even more of you are familiar with Power Query. Let's have a look at the graphic below, representing a very common scenario.

Dataflows are very often used to promote reusability of data. They allow us to apply transformation steps only once and reuse the result across Power BI Semantic Models. While many would draw a diagram like this one, the real deal is a bit more complex than that:

I already mentioned that a Dataflow allows us to transform data. Indeed, the Power Query engine (Mashup Engine) is the core here. If you are a Power BI Premium user, you may leverage the Enhanced Compute Engine. Thanks to this, staged data is loaded into a SQL cache, where subsequent transformations may happen much faster, especially if we deal with data sources like Excel or CSV files. Unfortunately, it is not that simple. Loading the data into the SQL cache requires some time, so by enabling this feature you might even slow down a Dataflow refresh. As always, take it with a grain of salt, and test your solution to find out which approach is best for your use case. After your data is transformed, it is loaded into Azure Data Lake Storage in the Common Data Model (CDM) format.

This breakdown of a Dataflow allows us to conclude that Dataflows Gen 1 and Gen 2 are not that different. Let's look at the related scenario:

We have two differences here. We will not go too much into the details, as I will cover Dataflows Gen 2 in a separate article. The first is that when a Dataflow Gen 2 is created for the first time in your workspace, a staging area for compute and storage is created as well. You can think of it as the Enhanced Compute Engine for legacy Dataflows. The second is that at the end, instead of ADLS storage, we see a Data Destination. This is because Dataflows Gen 2 allow us to select where we want to ingest our data. So far, the choices are:

  • Azure SQL Database
  • Azure Data Explorer
  • Fabric Lakehouse
  • Fabric Warehouse

I highlighted the Lakehouse, as it is the closest destination to what we have today for legacy Dataflows. There are of course differences between Gen 1 and Gen 2 Dataflows, but in principle they work very much the same way. The user interface is almost identical, and it's powered by our beloved Power Query. That is why this is my pick for you as a first step in mastering Fabric offerings.

Second stop – Fabric Lakehouse

The Lakehouse is part of the Synapse Data Engineering experience. There are several reasons why I picked it as the second area on your learning list:

  • By using it as a data destination, we stay close to the legacy Dataflow scenario.
  • At this moment, data loads to a Lakehouse perform better than loads to a Warehouse.
  • The Lakehouse (as well as the Warehouse) allows us to leverage OneLake, the foundation of Fabric.

Lakehouse creation takes minutes, which is a big deal. Together with the Lakehouse, two more objects are created: a default Semantic Model and a SQL Endpoint. The Semantic Model works in Direct Lake mode, and we may decide which tables will be included in it. We may always create new Models from the same Lakehouse to match our reporting needs. All of them will use the same copy of the data that exists in the Lakehouse. The SQL Endpoint (a TDS endpoint) allows you to consume the data using SQL. It is important to mention that you can't use SQL to perform any operation on Lakehouse tables other than reads. You may also use it to create views from your tables, as you would in a SQL Server database, and grant access to them directly to other data consumers. There are nuances related to Lakehouse sharing capabilities; it took me some time to crack them, but I will not cover that today. Views can also be added to Semantic Models built on top of your lake; however, they will not work in Direct Lake mode, but in DirectQuery instead. Please keep that in mind.

The SQL Endpoint may also be accessed from desktop tools of your choice, like SQL Server Management Studio (SSMS) or Azure Data Studio. I actually recommend doing so when you are querying the data or building your views, as it works much more smoothly than the online experience.
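You can also reach the endpoint programmatically. Below is a minimal sketch using pyodbc from Python; it assumes the ODBC Driver 18 for SQL Server is installed, and the server address, database, and object names are hypothetical placeholders (copy the real endpoint address from your Lakehouse settings in Fabric):

```python
import pyodbc

# Hypothetical SQL Endpoint address and Lakehouse name -
# copy the real ones from the Lakehouse settings in Fabric.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Create a view over a Lakehouse table, just as you would in a SQL Server DB
cursor.execute("""
    CREATE VIEW dbo.vw_sales_clean AS
    SELECT OrderId, CustomerId, Amount
    FROM dbo.sales
    WHERE Amount > 0
""")
conn.commit()

# Read it back - remember that only read operations work on the tables themselves
for row in cursor.execute("SELECT TOP 5 * FROM dbo.vw_sales_clean"):
    print(row)
```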

Sharing data from a Lakehouse is much better than sharing a Gen 1 Dataflow today. To share a legacy Dataflow, we must grant at least the Viewer role at the workspace level. This is not something you would like to do in many cases. With a Lakehouse, especially through the SQL Endpoint, you may share specific views only, so you decide which tables can be queried and which columns and rows are accessible. You don't have to grant any role at the workspace level; instead, you may share with individual users or security groups.

A Lakehouse makes much more sense in terms of re-sharing the data with other users, or redistributing it between all your solutions. On top of that, it allows you to implement the Medallion Architecture:
  • Bronze layer for raw data
  • Silver layer for clean, pre-filtered data
  • Gold layer with data ready for analytical workloads

While people who are used to this architecture tend to go even further than that, I don't think most Power BI use cases need more than three layers. On top of that, I would even say that the "Gold layer" could simply be a Power BI Semantic Model (or a Warehouse for broader usage). But this is just my opinion; feel free to decide what is best for you.
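To give you an idea of what a layer-to-layer step could look like, here is a minimal sketch of a bronze-to-silver transformation in a Fabric notebook with PySpark. The table paths and column names are hypothetical examples:

```python
# Minimal sketch of a bronze-to-silver step in a Fabric notebook (PySpark).
# Table paths and column names are hypothetical examples.
from pyspark.sql import functions as F

# Bronze: raw data, as it arrived
bronze = spark.read.format("delta").load("Tables/bronze_orders")

# Silver: de-duplicated, cleaned and pre-filtered
silver = (
    bronze
    .dropDuplicates(["OrderId"])
    .filter(F.col("Amount").isNotNull())
    .withColumn("OrderDate", F.to_date("OrderDate"))
)

silver.write.format("delta").mode("overwrite").save("Tables/silver_orders")
```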

The combination of Dataflows Gen 2 and a Lakehouse will allow you to build scalable, enterprise-grade solutions, with a relatively low barrier to entry for these technologies.

Third stop – Data Activator

I don't know how many times I have heard a request to implement alerts on data. We could do it with the help of dashboard tiles, but it was not a perfect solution. Data Activator allows you to monitor specific visuals directly in your Power BI report (yes, there is no longer a need for a dashboard). In case you have multiple measures in your visual, you may select which one should be monitored. As part of the standard actions, you may select Email or Teams notifications. But this is just the beginning. You may also set up Power Automate flows that will be triggered by Data Activator alerts, allowing you to build more advanced processes. This is not yet enabled in the preview, but the goal is to have Data Activator working with Lakehouses and Warehouses as well. You may then use it to build processes ensuring proper data quality for your analytics solutions.

Conclusion

Considering everything we covered today, a high-level diagram of your solution could look like this:

This already looks very good, even though it's simplified a lot. Starting with the mentioned areas of Microsoft Fabric, you will be able to take your solutions to another level. If you wonder what to do next, I believe it's something you will figure out while learning what we covered today. For example, Dataflows Gen 2 don't have an out-of-the-box solution for incremental refresh. To deliver it, you may need to start looking into Fabric notebooks; a very short piece of code will help you here (see the sketch below), so you don't need to be a Python pro to start. Once you have more Dataflows created, loading data into Lakehouses, you may want to orchestrate the entire process using Data Pipelines. Playing with the SQL Analytics Endpoint within the Lakehouse allows you to practice your SQL skills, so maybe at some point you will consider creating your first Data Warehouse. We could continue exploring your options even more, but I hope you get my point. Learning Dataflows Gen 2, Lakehouses, and Data Activator is a great foundation to start building Fabric Data Engineering skills.
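Coming back to the incremental refresh example: here is a minimal sketch of how a notebook could merge newly staged rows into a Lakehouse table using the Delta merge API. The table paths and the OrderId key are hypothetical, and the staging table stands in for whatever your Dataflow lands:

```python
# Minimal sketch of an incremental load in a Fabric notebook (PySpark),
# using the Delta merge API. Table paths and column names are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "Tables/sales")

# New or changed rows, e.g. staged by a Dataflow Gen 2
updates = spark.read.format("delta").load("Tables/sales_staging")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.OrderId = s.OrderId")
    .whenMatchedUpdateAll()     # update rows that already exist
    .whenNotMatchedInsertAll()  # insert brand-new rows
    .execute()
)
```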

If you feel a bit overwhelmed by the recent changes, don't worry. Start with small steps and find your own learning path. Whatever your starting point is, begin your journey and head down the road to MS Fabric 🙂

 

I hope you enjoyed this article. As always, thank you for reading, and see you in the next post 🙂

Pawel Wrona

Lead author and founder of the blog | Works as a Power BI Architect in global company | Passionate about Power BI and Microsoft Tech
