Welcome back in next Power Query article. Last time we started to talk about this amazing tool, if you missed that, here is the link:
Introduction to Power Query | Off The Charts (otcbi.net)
In Why should you learn Power Query section, I described a simple scenario, where Power Query is extremely useful. It helps to get rid of manual work and significantly reduce delivery time of both ad-hoc and recurring analysis. I am not covering UI introduction yet on purpose, to show you how intuitive this tool really is. Let’s go once again through Demo Scenario.
Demo Scenario
There are two differences compared to scenario described in previous article: number of files and data retention period, the rest remains the same. There are 312 files containing 3 years of sales for Adventure Works company. Data is stored locally in .csv format. We must prepare the list of Top 5 Customers with highest Sales Amount. There are couple of obstacles here. There is a lot of files available, and we don’t know if columns are arranged in the same way. Price of sold items is available in different file, as well as Customers Names. In our data files we have only IDs of Products and Customers, therefore, we must somehow join them together. Let’s start.
Connect to the Data
Let’s open a fresh instance of Power BI Desktop. Assuming your tool is up to date, you should see similar window:
This is the new welcome screen, announced as General Availability as of June 2024. We could start connecting to the data here, but let’s open a blank report instead.
You can go straight forward to Power Query editor by clicking on Transform Data, but let’s begin with more intuitive option (in my opinion), similar to what we have in Excel – Get Data. Our files are stored in .csv format. There is of course connector for Text/CSV files, but we have 312 of them. Do we need to connect to each file separately? Of course not. Open Get Data -> More -> Folder and provide path to your data:
You will see short list of available files and couple of available options at the bottom:
Cancel is obvious, but what other options are about?
- Load will simply load the data in a format as you can see in the snapshot. This will basically be list of files available in the folder – for sure not what we need.
- Combine & Load will actually get inside each file and join them together before loading to our data model. Sounds more reasonable, but we must apply some transformations before data is loaded.
- Transform Data will open Power Query editor, you can start all the data transformations here.
- Combine & Transform Data will also open Power Query editor, but it will combine the data first, so at least one transformation will already be done. Let’s select this option.
Since we deal with csv files, Power BI will ask few questions to ensure that files will be combined in a proper way:
- You can select file that will be used as a sample. There is 312 of them, but maybe you opened one and you checked that this is the correct format, and all other files should be compared to that one.
- File Origin is very important for Power BI to understand which set of characters was used to generate csv files.
- Delimiter choice is probably obvious here.
- Data Type Detection is a feature that allows Power BI to assign data types to all columns in your data set. It is usually doing a good job, but you can decide to turn it off. For us, since it was supposed to be easy and quick demo, we keep it like that.
- There is an option not to load any file that will result with error during transformation. It may happen for example when content of the file will not be the same as the file used as sample. I assume that in most cases you would like to find the files with errors, so let’s not select that option.
- Once all is confirmed, hit OK button.
We are almost done. Since we must use other two files to pull the information on Product Cost and Customer Names, I will connect other two files quickly without leaving Power Query editor:
You will see very much the same dialog window, with main difference – since we are in Power Query editor, these is no Load/Transform choice:
Now we are done. Let’s have a look what is available to us now in Power Query editor.
Power Query editor – first look
We will not cover everything today. Only what is relevant for this exercise and more will come in dedicated article.
On top we can see the ribbon, with quite familiar look and feel if you are using other Office 365 applications. We will not go through all the options, but only cover some of them when needed. Today I want to highlight following areas:
- Queries – List of all objects related to your transformations. It is common to refer to tables that you want to transform and load as Queries, but this section can contain more items like parameters, functions, etc.
- Other Queries – in this section you can see the tables that we connected to. There is a table with combined sales data. My folder was named csv so this is also applied automatically as a Query name. We can see also Products and Customers tables. You can move Queries around to different folder with more meaningful name. Other Queries was simply created automatically as well.
- Here you can see the function that Power Query invoked to combine our sales data.
- On the right you can find Query Settings, with Properties at the top. Most important property is the name of the Query, but there are others hidden under All Properties button.
- One of the best features of Power Query – List of Applied Steps. Whatever transformation you are performing in Power Query is reflected here. The names you see on the screen are auto generated, but they can be renamed to something more descriptive, which is considered as a good practice.
- There are column statistics that you can enable.
- Formula bar showing M code generated by the selected step in Applied Steps list. While we are not going to focus on coding, I highly recommend enabling the Formula Bar and start looking at the code generated for you. This is I believe the best way to start learning the M language. Soon you will realize that even though you can’t code yet, you can at least tweak the generated code to achieve different results.
To enable column statistics and formula bar, go to View tab in ribbon:
I will use Query Properties to rename my tables to: Top 5 Customers, Products and Customers. Now we can start transforming the data.
Initial transformation
The dataset I have here is very simple. In this pre-eliminary phase I must do only two things. First one is to check if files are combined properly. In the problem statement I mentioned that we don’t know how columns are arranged in each of 312 files of data we have. In this case we must ensure, that all the files have recognized headers. When this is done, Power Query will join the columns of each file based on column names. And this is what we need. Otherwise, Power Query will join Column1, Column2, Column3 etc., and since we don’t know if order of columns is the same in all files, that would be risky. To check, let’s go to sample file within function that Power Query applied automatically to combine our files:
Go to Queries -> Transform Sample File (1) -> search for Promoted Headers (2) step in Applied Steps. If the step is applied like in screenshot, it means that Power Query first sets headers for each file before they are combined. If the step is missing, apply it manually by selecting Use First Row as Headers (3) option from ribbon.
Second thing is related to Customers. We must somehow join the Customer Name information with our data. When I was examining the files, I noticed that Customers table have 3 columns containing the name: Prefix (Mr/Ms), First Name and Last Name. I need only one column with all the information combined. I can select these columns and Merge Columns option from ribbon:
I am using space as a separator and set the column name as Customer Name. My column is now created:
No data cleaning is needed; therefore, we can move to data enrichment.
Data enrichment
Our sales data is incomplete, that’s why we must enrich it with additional information, that will help us calculate highest sales values and present customers using their names. For now, those details are available in separate tables. There is no VLOOKUP function in Power Query, so how to join the information from multiple tables? We can perform Merge operation. It requires two tables that you want to Merge, and columns that will serve as a bridge between them. In our case, both sales data and Products data contain ProductKey column:
Select main table (1), go to Home tab (2) and click on Merge Queries (3). In the Dialog Window select a table that you want to Merge (4). Provide columns that will be used as a link between both tables (5 and 6). There are other options available, but we will not make things more complicated (7). Once all is done, select OK (8).
Power Query creates new column with the name of the Merged table (unless it’s already taken). It contains all the data from Products table for every row where ProductKey was matched in second table. We can expand any detail that we need, but for us only ProductPrice is relevant. To keep the column names clean, let’s uncheck the option Use original column name as prefix.
Now we have our ProductPrice column available to perform required calculations. I will now do the same quickly to pull CustomerName column from Customer table in the same way as done for ProductPrice. The procedure is the same (CustomerKey is the link between tables), so, I will do it offscreen.
Now we can calculate Sales Amount by multiplying ProductPrice and OrderQuantity. To do it using UI let’s do following steps:
Select ProductPrice column (1), go to Add Column tab (2), select Standard (3) and then Multiply (4). Change content type (5) to Use values in a column. Select OrderQuantity (6) and hit OK (7). New column is created with automatic name Multiplication. For sure this must be changed. You can either rename a column at this point, or modify the name of the generated column in formula bar:
This type of operation can be your first interaction with Power Query code (M Language), and as you can see it is fairly easy. Let’s change that name to SalesAmount.
Final transformation
We have our data enriched, and ready for the final touch. At the moment we have all the data combined, but this is definitely too granular, as it contains the details down to OrderLine item. What we need is a summary where we see only CustomerName and total SalesAmount for each of them. We must perform data aggregation and there is an option named Group By in Power Query that will help us.
Select CustomerName column (1), go to Home tab (2) and click on Group By (3). In the Dialog Window you can see, that CustomerName is already populated as an aggregator (4). Provide a column name (5) that you want to create, for me it’s TotalSalesAmount. Select Type of operation to be performed on the source column (6) and select the source column (7). What I need is to sum SalesAmount for each CustomerName. When everything is done, hit OK (8). Advanced settings allow to add more aggregators and aggregations, but in this scenario, this is all I need. The result is the list of CustomerNames with their SalesAmount calculated. Now, let’s sort it in descending order using TotalSalesAmount column. Right click on the column and select Sort Descending:
Last operation, since we must show only 5 Customers with highest score and we have our data sorted starting with highest sales, we can use Keep Top Rows option to select only 5 top cusomers:
Go to Home tab (1), click on Keep Rows (2) and select Keep Top Rows (3). Provide the number of rows to keep, for me it’s 5 and hit OK:
Just like that, we have our analysis ready. You can use it as a recurrent reporting, schedule a refresh and have new data available daily, weekly, you name it. Power Query will be combining all the files in the folder, so you don’t need to be worried about including new data, and it will perform all the transformations recorded in Applied Steps.
Conclusion
As much as screenshots made this article very long, there is not that much of work needed to perform all these operations. After spending some time learning Power Query you will quickly find out, that it takes couple of minutes to perform analysis like this one. If you need the source files, please let me know. However, I believe that you will learn much more if you try to replicate the same using the data you have available somewhere on your end. By bending my steps to your reality, you will learn much more. Still, if you feel otherwise, let me know and I will share the files 🙂
As always, thank you for staying till the end and see you in next article!
Kudos to Pawel for the clear and detailed demonstration of Power Query’s capabilities, Looking forward to more insightful content!
Thank you! Kudos to you for reading all the way through 😀