MSBI(SSIS,SSRS,SSAS),Tabular Model,DAX,Power BI,Power Pivot,Tableau

Data import

Data import can be used against any data source type. Current Power BI service limitation published file size is 1 GB. Data are loaded to file using column store compression. If you’ll read for report purposes just data required for usage in visuals this is not limiting too much. Even with Power BI Pro license you are limited to 8 refreshes per day. Times can be scheduled either at full hour or half past full. Scheduled time is approximate. When will refresh actually happen depends on service utilization. If you need to refresh data, more often consider using live connection. When using import, data are stored in Power BI file/service. Therefore, there is no need to setup permissions on data source side (service account for load is enough) and you can share data publically or with people outside organization. On the other hand, all data are stored on Power BI. So if you have requirement for Row Level Security, you need to implement it on Power BI side. If you need same logic applied for multiple reports, you have to duplicate it in all the reports. Same applies to calculations, if you need same measure in multiple reports, you need to recreate it every time.

Live connection

There are more limitations for live connection in place. It doesn’t work against all data sources. Current list can be seen here https://powerbi.microsoft.com/en-us/documentation/powerbi-refresh-data/#live-connections-and-directq...

You are also limited to just one data source/database you selected. You can’t combine data from multiple data sources anymore. If you are connected to SQL Database, you can still create logical relationships between objects from that database as well as measures and calculated columns. When you are connected to SQL Server Analysis Services, you are limited just to report layout and even can’t make calculated columns or measures. When connected to SSAS, you could also reach some licensing limitations. Power BI issues DAX queries against SSAS and this isn’t supported in Standard edition of SQL Server before SQL Server 2016. When using live connection, users have to have access to underlying data source. This means you can’t share outside of your organization or publically. Other way around, when you set up security once, all reports using live connection to the same source have the same security model applied automatically. Similarly, you can prepare your data model carefully with all measures and calculated columns on data source side. All reports from the same data source can benefit of it. Refresh frequency is unlimited. Power BI simply shows data as they are in database. However, be careful using live connection against highly transactional OLTP systems. If you suffer from performance problems now, analytical workloads from Power BI won’t help it. Consider enabling features on data source side like Operational Analytics in SQL Server 2016.

Summary

Both methods have limitations, developer should be aware upfront. For import it is mainly refresh frequency, data size and necessity to duplicate same logic (calculations, security) in multiple reports. That could be helped by templates, but if you need to change logic of calculation, you need to change it in multiple reports again. For live connection, not all data sources are supported. If you pick one, then you’re limited just to that one data source for report. If logic is setup on source side, all reports can benefit. My personal preference is use live connection against SSAS models, have the model centralized and maintain logic in one place. For SQL Databases where I don’t have model, I prefer import and use Power BI as a model and vizualization.

One of the most difficult things to deal with in the ETL is importing a file with dynamic columns. The file does not have a set number of columns. The number of columns changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.

Below are some examples of such a file:

File1:Col1,Col2,Col3t1,Col3t2,Col3t3a,a,1,1,1b,b,2,2,2c,c,3,3,3

In the above example the number of columns for Column3 is 3. But it could be 4, 5, 6 or more, we don’t know.

Before we look at the ETL side, we need to look at the data modelling side first. What kind of target table should we create to host these files.

Data Modelling

Essentially there are 2 choices.

1. We could make it normalised like this:

Table1:Col1,Col2,Col3Type,Col3Valuea,a,t1,1b,b,t2,1c,c,t3,1a,a,t1,2b,b,t2,2c,c,t3,2a,a,t1,3b,b,t2,3c,c,t3,3

In principle we design the structure so that we have a fixed number of columns. It doesn’t matter how many columns in the source file, we will always have 4 columns in the table.

The drawback of this design is when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:

select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3

from

( select Col1, Col2, Col3Type, Col3Value

from Table3

) P1

pivot

( sum(Col3Value) for Col3Type in (t1, t2, t3)

) as P2

Output:Col1,Col2,Col3t1,Col3t2,Col3t3b,a,1,1,1a,b,2,2,2c,c,3,3,3

If we have 2 tables like that we need to join them together on the common column(s) and it becomes even trickier. If we are dealing with 1 million rows column, we risk a serious performance degradation. This might not be a problem for a data mart, but if it is for a transaction system where millisecond response time is expected, this design does not serve the purpose.

Alternative Design

The alternative is to find out how many columns Col3 is. We know it could be 3 to 5, but could it be 100 columns. Spending 5 minutes with the subject matter expert to understand the characteristic of Col3, is well worth it. I have seen this dynamic columns case 3 times (the first time was about 12 years ago) and on every case they followed the book/theory (to normalise) and everytime they regretted it. On the second case (about 8 years ago) when I was involved in troubleshooting the performance issue, in the end we changed the table structure to denormalised the table structure.

So, I would recommend to find out how many columns Col3 is. 90% of the chance is: it is not that many. The response I usually get from the SME is something like this: “Most likely Col3 varies between 3 and 5 columns, it certainly won’t exceed 8”. There is a natural meaning of this data, that’s why the range is limited. For example: Col3 could be the number of teams, or the number of projects, or the number of web sites, or the number of product types. In the business world most of the data is within a certain range. You can’t have 100 teams for example. It’s physically impossible for a company as it is limited by budget constraints.

So once we have that information (the maximum number of columns), then we create the table according to this limitation. Let’s assume that the SME said that the max is 8 columns. So we design Table1 with 10 columns (just in case) as follows:

create table table1

( Col1 varchar(10),

  Col2 varchar(10),

  Col3t1 int,

  Col3t2 int,

  Col3t3 int,

  Col3t4 int,

  Col3t5 int,

  Col3t6 int,

  Col3t7 int,

  Col3t8 int,

  Col3t9 int,

  Col3t10 int

)

Unless it is a staging table, don’t forget to add a PK column. It is a good design to be able to uniquely identify a row at all times. In this example I don’t add a PK column because the target is a staging table.

A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimic the structure of the source file, and able to accommodate the dynamic nature of Col3.

The Data Load

I’m going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the table with 4 columns will have to wait until next time I’m afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8, Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns please contact me.

Because of the dynamic nature of Col3 in the source file, the best way is to use the Script Component as a Source. There are a lot of sources on the internet Script Component. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.

The main concept is as follows: First create a Connection Manager for a File Connection (no need to use Flat File connection). Name this connection File1, like this:

Then create a workflow, and drag the Script Component into the workflow. When asked, specify as source.

Then:

On the Connection Managers tab, set the connection to File1 (that’s the middle column). On the first column, name it as “Conn1”.
On the Inputs and Outputs tab, rename the “Output 0” as File1Output. Then create the Output Columns one by one, as per the structure of Table2:

Col1, Col2, Col3t1, Col3t2, … Col3t10.
In this example we will convert the data type for col3 from string to int not in the Script Component, but later on using Derived Column / Data Conversion transformation. So for all columns, set the data type to String, length 10.
On the Script tab, click on the Edit Script and edit it as follows. Don’t forget to add “Using System.IO;” and if you want to use MessageBox for debugging you need to add “using System.Windows.Forms;”. Remember that you can’t debug an SSIS Script Component (see my post here), but you can debug SSIS Script Task.

private StreamReader SR;

private string File1;

public override void AcquireConnections(object Transaction)

{

  // Get the connection for File1

  IDTSConnectionManager100 CM = this.Connections.File1Conn;

  File1 = (string)CM.AcquireConnection(null);

}

public override void PreExecute()

{

  // Create a reader for File1

  base.PreExecute();

  SR = new StreamReader(File1);

}

public override void CreateNewOutputRows()

{

  // Declare variables

  string nextLine;

  string[] columns;

  char[] delimiters;

  int Col3Count;

  String[] Col3Value = new string[10];

  // Set the delimiter

  delimiters = ",".ToCharArray();

  // Read the first line (header)

  nextLine = SR.ReadLine();

  // Split the line into columns

  columns = nextLine.Split(delimiters);

  // Find out how many Col3 there are in the file

  Col3Count = columns.Length - 2;

  // MessageBox.Show(Col3Count.ToString());

  // Read the second line and loop until the end of the file

  nextLine = SR.ReadLine();

  while (nextLine != null)

  {

    // Split the line into columns

    columns = nextLine.Split(delimiters);

    {

      // Add a row

      File1OutputBuffer.AddRow();

      // Set the Col3Value variable to the values in the file

      for (int i = 0; i <= Col3Count - 1; i++)

      {

        Col3Value[i] = columns[i + 2];

      }

      // Set the values of the Script Component output according to the file content

      File1OutputBuffer.Col1 = columns[0];

      File1OutputBuffer.Col2 = columns[1];

      File1OutputBuffer.Col3t1 = Col3Value[0];

      File1OutputBuffer.Col3t2 = Col3Value[1];

      File1OutputBuffer.Col3t3 = Col3Value[2];

      File1OutputBuffer.Col3t4 = Col3Value[3];

      File1OutputBuffer.Col3t5 = Col3Value[4];

      File1OutputBuffer.Col3t6 = Col3Value[5];

      File1OutputBuffer.Col3t7 = Col3Value[6];

      File1OutputBuffer.Col3t8 = Col3Value[7];

      File1OutputBuffer.Col3t9 = Col3Value[8];

      File1OutputBuffer.Col3t10 = Col3Value[9];

    }

    // Read the next line

    nextLine = SR.ReadLine();

  }

}

public override void PostExecute()

{

  // Close the reader

  base.PostExecute();

  SR.Close();

}

Save the script, close SSIS script editor, click OK to get out of the Script Transformation Editor, back to the Data Flow.
I would suggest not to put the output into a table in the database directly, but put the output to a file first to check the result. This way you don’t need to worry about data conversion between string and int etc. Create a flat file connection with 12 columns. All the columns are string, with length = 10, as follows:
Run it. If it fails, check the source file and make it like this:

Also check the script, make sure it is like above.
Create an OLE DB Destination or SQL Server Destination (see my post here re why practically speaking we can’t use SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using SQL Server Destination (because you are doing it to your local instance) then notice that on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you’ll get this error: Unable to prepare SSIS bulk insert, so something like that.
Create a data conversion to convert Col3 columns from string to int, like this:
Connect the output of the data conversion to the DB and map the DC output columns to the DB columns like this:
Now execute it again, and check the target table:
Overall the Data Flow looks like this: