Two (and a half) ways to remove columns in Power Query

When working with Import Mode in Power BI, you should look at the size of your data model and make sure that it’s not bigger than necessary. There are many techniques you can apply to optimize your model, but there are two extremely simple transformation steps, that have the biggest impact on your model size. You can either reduce the number of rows (by filtering for example) or reduce the number of columns by removing them from dataset.

While this is not article about details of Power BI’s storage engine, it is important to know that Power BI stores the data column by column instead of row by row. That is why number of columns is more critical for the model size than the number of rows.

That’s why you should definitely consider removing unnecessary columns from the model. Today, I’ll walk you through the available options in the Power Query UI and make a case for why each is important.

Scope of the article

If you think about it, there are many non-obvious ways to get rid of columns. For example, you could use Group By function to aggregate the data based on specified columns, resulting in all the columns not mentioned in the formula to be removed. We have also Unpivot function, that will result in reducing the number of columns by increasing the number of rows. What is going to be interesting for us in this article are the simplest possible scenarios, where we will specify columns to be removed from dataset without any other special effects: Choose Column, Remove Columns, 
 

Before we start

If you followed the Introduction to Power Query Articles you know already that I am a huge advocate for enabling the Formula Bar, regardless of your coding experience or aspirations. For the purpose of this discussion Formula Bar must be enabled, so it is easier to understand the difference between mentioned functions and what are the implications of calling them. To enable Formula Bar, go to View tab in Ribbon and select Formula Bar checkbox:
 
Figure 1. Enabling Formula Bar in Power Query.

 

Closer look at available functions

I mentioned three functions that will be covered in this article. All of them are available to you in Home tab of the Ribbon, in Manage Columns section.
Power Query's options to Remove Columns.
Figure 2. Power Query’s options to Remove Columns.
You will find here an option to Choose Columns you need. When you expand available options for Remove Columns you will find a function with the same name and Remove Other Columns. Last two options are also conveniently available to you at the column level, when you right-click on column in the data preview:
 
Column level options to Remove Columns in Power Query.
Figure 3. Column level options to Remove Columns in Power Query.
 
How does each of these functions work? Let’s find out.
 

Choose Columns

This option is probably the most straight forward and the easiest to use, especially in scenario where you deal with a lot of columns and want to select only a few of them. When you use the function, a small window pops up:
Choose Columns in Power Query.
Figure 4. Choose Columns in Power Query.
  1. List of all columns available at that stage.
  2. Select All Columns – you can have all columns selected and unselect a few if needed or unselect all of them and select a few. Either way, this function allows you to choose columns as quick as possible.
  3. Search Bar – when you have a lot of columns it makes it very easy to find columns you want to keep.
  4. Sort Columns – very often overlooked feature. Not only you can select the columns you need but also sort them alphabetically. It helps to work in Data Preview in case you have a lot of columns.

Of course, the result of the function is that it keeps the columns that are selected. For the purpose of further discussion, let’s have a look at the code generated for this step in Formula Bar:

Figure 5. Code generated for Choose Columns function.

 

Remove Columns

Whether we use this function in Ribbon or directly on column level it doesn’t really matter. In both scenarios you must select columns that you want to remove in Data Preview. To select multiple columns, you must left click on each of them while holding Ctrl key. Selecting two columns while holding Shift key will select all the columns between them. You can also mix the keys to select few individual columns and add range of columns to the selection, by holding both Ctrl + Shift, it looks like in below picture:
Figure 6. Select multiple columns in Data Preview.
Figure 6. Select multiple columns in Data Preview.
Once all the columns that you want to remove are selected, sue the Remove Columns function to get rid of them. Let’s look at the code generated by this action:
Remove Columns function - code generated.
Figure 7. Remove Columns function – code generated.

 

Remove Other Columns

Works very similar to Remove Columns function, only the outcome is different. Here, instead of removing selected columns, you select columns that you want to keep, remove the columns that are not selected. Hence the name: Remove Other Columns, meaning other than selected. In this case we see it some weird mix of Remove Columns and Choose Columns functions. How about the generated code?
Remove Other Columns - code generated.
Figure 8. Remove Other Columns – code generated.

There is one interesting behavior while using Remove Other Columns functions, that could prove to be useful to you. When you are selecting columns to keep, you can click on them in an order that you would like to see them. Power Query is smart enough to understand what you are doing, and once your columns are selected, they are also sorted in the order of your clicks:

Figure 9. Remove Other Columns – order of column selection.
 
Figure 10. Remove Other Columns – transformation result with proper order applied.

Maybe this is not revolutionary feature, but I am pretty sure you can make good use of it, and not only reduce the model size but also organize the columns to make them easier to work with, all in a single transformation step.

 

Explaining the title

It’s fair to say that this is right about the time to finally explain the title of this article. We are talking about three different functions (Choose, Remove and Remove Other Columns), yet the title says two (and a half). We technically have three functions, but from the code perspective, we have only two (average is two and a half :)). You probably noticed that code generated for Choose Columns and Remove Other Columns functions is the same. You still can decide which one you like better or use them interchangeably depending on the need, but it is important to know, that code wise they are the same, and Remove Columns function is different. Why is this important?
 
Functions available for us in User Interface aim to make our life a bit easier, but they are still running the code behind the scenes. If you are not careful, the code can simply break, even in our case where the code generated by all the mentioned functions is extremely easy.
  • Choose Columns and Remove Other Columns – calls Table.SelectColumns function accepting two arguments. First is the table that function will work on. The name you see here is the name of the last step in Applied Steps list (the result of transformation step is a table in a given stage of transformation). Second argument is a list of columns that you want to keep.
  • Remove Columns – calls Table.RemoveColumns function. It also accepts two arguments. First again is the table like discussed above. Second argument is the list of columns that you want to remove.
Both functions work on list of columns as an argument. What happens when one of the columns is missing in the dataset?
Figure 11. Missing column error in Power Query.
Luckily, the error message is very easy to understand. Yet, you don’t want to see any errors in your solution. Especially when it’s shipped to other teams, and you want to focus on next work instead of keep fixing existing ones. From this extremely easy code we can draw the following conclusion:

Which one to choose and when

I already spoiled the outcome a bit in the introduction, because I believe both functions are important, and you can make good use of them. You already know a lot in terms of how these functions work, what is the code generated behind the scenes, and what my cause an error.
So, the question to you is, when do you want to receive an error? You probably don’t want to see too many of them, but still want to be aware when something important happens. The way I see it is the following:

Use Choose Columns / Remove Other Columns at the beginning of transformations. Especially when you work with less structured data like excel files, maybe these manual exports from transactional systems, done by multiple people and you can’t guarantee that all the files contain the same set of columns. If you use the Remove Columns step here, and you hardcode the names of columns that are needed, in case one of the columns is missing, you will get an error. Why would anyone like to be bothered by the fact that the column you don’t need is missing? On the other hand, if you select the columns you need, because these are columns needed for the Semantic Model, or for further transformations in Power Query, you would like to know that something has changed in data source, and at least one of the columns is no longer available.

User Remove Columns at the end of the transformations. I already mentioned that we should always remove the columns that are not relevant for the Semantic Model in Power Query, so they are not loaded into the model. After you are done with all the transformations, you might want to “clean” a bit and remove more columns before you hit “Click and Apply” button. You could say that you already selected the columns you need at the beginning, so, you shouldn’t have anything to remove. If this is the case, then of course you don’t have much to do. But maybe in the initial step you selected the columns not only needed for the model but also needed for further transformations. Those columns will no longer be needed when all the transformations are done, but they couldn’t be removed earlier either. Maybe you also created new columns during the data transformations, for example to pre-calculate other columns to split the complex formulas / operations into more digestible chunks. In this case, by the end of the transformations you will end with a bunch of columns that are no longer needed and could be simple removed. Therefore, if you want to use Remove Columns function it is a good moment, as this is much quicker to select a couple of columns to remove, than selecting 20 or more columns to keep. To summarize:

Conclusion

I want to focus not only on what you can do in Power Query, but also how you should do it. Therefore, I will help you explore technical capabilities and introduce you to best practice at the same time. This for sure was quite a lot of time spent on very easy concept. Still, very important, as it should be a part of every transformation. And even with simple code like the one you saw in this article; you still can face some challenges if you don’t know how these function work. Regardless of transformation complexity, always spend some time thinking how to deliver your solutions in best possible way, by making them immune to errors. Try to predict them and deal with them before they happen.
 
As always, thank you for reading and see you in next article 🙂
Picture of Pawel Wrona

Pawel Wrona

Lead author and founder of the blog | Works as a Power BI Architect in global company | Passionate about Power BI and Microsoft Tech

Did you enjoy this article? Share with others :)

5 4 votes
Article Rating
Comments notifications
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Related Posts