One of the most difficult things to deal with in ETL is importing a file with dynamic columns: the file does not have a fixed number of columns, and the number changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.
Below is an example of such a file:
File1:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3
In the above example Col3 has 3 columns (Col3t1 to Col3t3). But it could be 4, 5, 6 or more; we don't know in advance.
Before we look at the ETL side, we need to look at the data modelling side first: what kind of target table should we create to host these files?
Data Modelling
Essentially there are 2 choices.
1. We could make it normalised like this:
Table1:
Col1,Col2,Col3Type,Col3Value
a,a,t1,1
a,a,t2,1
a,a,t3,1
b,b,t1,2
b,b,t2,2
b,b,t3,2
c,c,t1,3
c,c,t2,3
c,c,t3,3
In principle we design the structure so that we have a fixed number of columns. It doesn't matter how many columns there are in the source file; we will always have 4 columns in the table.
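As a minimal sketch, this normalised Table1 could be created like this (the data types are my assumptions, chosen to match the denormalised DDL later in this post):

create table Table1
( Col1 varchar(10)
, Col2 varchar(10)
, Col3Type varchar(10)  -- which Col3 slot the value came from: t1, t2, t3, ...
, Col3Value int         -- the value itself
)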
The drawback of this design shows up when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:
select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
from
( select Col1, Col2, Col3Type, Col3Value
  from Table1
) P1
pivot
( sum(Col3Value)
  for Col3Type in (t1, t2, t3)
) as P2
Output:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3
If we have 2 tables like that, we need to join them together on the common column(s), and it becomes even trickier. If we are dealing with a million rows, we risk serious performance degradation. This might not be a problem for a data mart, but for a transaction system where millisecond response times are expected, this design does not serve the purpose.
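To illustrate the point, suppose there were a second normalised table of the same shape; call it Table2 (this second table is an assumption made purely for the sake of the example). Reproducing the two files side by side means pivoting both tables and then joining the pivoted results on the common columns, along these lines:

select A.Col1, A.Col2,
       A.Col3t1, A.Col3t2, A.Col3t3,
       B.Col3t1 as T2Col3t1, B.Col3t2 as T2Col3t2, B.Col3t3 as T2Col3t3
from
( select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
  from ( select Col1, Col2, Col3Type, Col3Value from Table1 ) P1
  pivot ( sum(Col3Value) for Col3Type in (t1, t2, t3) ) as P2
) A
inner join
( select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
  from ( select Col1, Col2, Col3Type, Col3Value from Table2 ) P3
  pivot ( sum(Col3Value) for Col3Type in (t1, t2, t3) ) as P4
) B
  on A.Col1 = B.Col1 and A.Col2 = B.Col2

Every additional table of this shape means another pivot before the join, which is where the performance risk comes from on large row counts.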
Alternative Design
The alternative is to find out how many columns Col3 can have. We know it could be 3 to 5, but could it be 100? Spending 5 minutes with the subject matter expert to understand the characteristics of Col3 is well worth it. I have seen this dynamic-columns case 3 times (the first time was about 12 years ago); in every case they followed the book/theory (normalised), and every time they regretted it. In the second case (about 8 years ago), when I was involved in troubleshooting the performance issue, in the end we denormalised the table structure.
So, I would recommend finding out how many columns Col3 can have. Ninety percent of the time it is not that many. The response I usually get from the SME is something like: "Most likely Col3 varies between 3 and 5 columns; it certainly won't exceed 8." The data has a natural meaning, and that is why the range is limited. For example, Col3 could be the number of teams, projects, web sites, or product types. In the business world most data falls within a certain range; you can't have 100 teams, for example, because a company is limited by its budget.
So once we have that information (the maximum number of columns), we create the table according to it. Let's assume the SME said the max is 8 columns, so we design Table1 with 10 Col3 columns (just in case) as follows:
create table table1
( Col1 varchar(10)
, Col2 varchar(10)
, Col3t1 int
, Col3t2 int
, Col3t3 int
, Col3t4 int
, Col3t5 int
, Col3t6 int
, Col3t7 int
, Col3t8 int
, Col3t9 int
, Col3t10 int
)
Unless it is a staging table, don't forget to add a PK column; it is good design to be able to uniquely identify a row at all times. In this example I don't add a PK column because the target is a staging table.
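If the target were not a staging table, a sketch of the same structure with a surrogate key might look like this (the key column name and the identity approach are my assumptions, not something from the example above):

create table table1
( table1_key int identity(1,1) not null primary key
, Col1 varchar(10)
, Col2 varchar(10)
, Col3t1 int
, Col3t2 int
, Col3t3 int
, Col3t4 int
, Col3t5 int
, Col3t6 int
, Col3t7 int
, Col3t8 int
, Col3t9 int
, Col3t10 int
)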
A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimics the structure of the source file and is able to accommodate the dynamic nature of Col3.
The Data Load
I'm going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the table with 4 columns, will have to wait until next time, I'm afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8, Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns, please contact me.
Because of the dynamic nature of Col3 in the source file, the best way is to use the Script Component as a Source. There are a lot of resources about the Script Component on the internet. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.
The main concept is as follows: first create a Connection Manager for a File Connection (no need to use a Flat File connection). Name this connection File1, like this:
Then create a Data Flow, and drag a Script Component onto it. When asked, specify it as a Source.
Then:
- On the Connection Managers tab, set the Connection Manager (the middle column) to File1. In the Name column (the first column), name it "File1Conn"; this is the name the script below refers to (this.Connections.File1Conn).
- On the Inputs and Outputs tab, rename "Output 0" to File1Output. Then create the Output Columns one by one, as per the structure of Table1: Col1, Col2, Col3t1, Col3t2, …, Col3t10.
- In this example we will convert the data type of the Col3 columns from string to int not in the Script Component, but later on, using a Derived Column / Data Conversion transformation. So for all columns, set the data type to String with length 10.
- On the Script tab, click on Edit Script and edit it as follows. Don't forget to add "using System.IO;", and if you want to use MessageBox for debugging you need to add "using System.Windows.Forms;". Remember that you can't debug an SSIS Script Component (see my post here), but you can debug an SSIS Script Task.
private StreamReader SR;
private string File1;

public override void AcquireConnections(object Transaction)
{
    // Get the connection for File1
    IDTSConnectionManager100 CM = this.Connections.File1Conn;
    File1 = (string)CM.AcquireConnection(null);
}

public override void PreExecute()
{
    // Create a reader for File1
    base.PreExecute();
    SR = new StreamReader(File1);
}

public override void CreateNewOutputRows()
{
    // Declare variables
    string nextLine;
    string[] columns;
    char[] delimiters;
    int Col3Count;
    string[] Col3Value = new string[10];

    // Set the delimiter
    delimiters = ",".ToCharArray();

    // Read the first line (header)
    nextLine = SR.ReadLine();

    // Split the line into columns
    columns = nextLine.Split(delimiters);

    // Find out how many Col3 columns there are in the file
    Col3Count = columns.Length - 2;
    // MessageBox.Show(Col3Count.ToString());

    // Read the second line and loop until the end of the file
    nextLine = SR.ReadLine();
    while (nextLine != null)
    {
        // Split the line into columns
        columns = nextLine.Split(delimiters);

        // Add a row
        File1OutputBuffer.AddRow();

        // Set the Col3Value variable to the values in the file
        for (int i = 0; i <= Col3Count - 1; i++)
        {
            Col3Value[i] = columns[i + 2];
        }

        // Set the values of the Script Component output according to the file content
        File1OutputBuffer.Col1 = columns[0];
        File1OutputBuffer.Col2 = columns[1];
        File1OutputBuffer.Col3t1 = Col3Value[0];
        File1OutputBuffer.Col3t2 = Col3Value[1];
        File1OutputBuffer.Col3t3 = Col3Value[2];
        File1OutputBuffer.Col3t4 = Col3Value[3];
        File1OutputBuffer.Col3t5 = Col3Value[4];
        File1OutputBuffer.Col3t6 = Col3Value[5];
        File1OutputBuffer.Col3t7 = Col3Value[6];
        File1OutputBuffer.Col3t8 = Col3Value[7];
        File1OutputBuffer.Col3t9 = Col3Value[8];
        File1OutputBuffer.Col3t10 = Col3Value[9];

        // Read the next line
        nextLine = SR.ReadLine();
    }
}

public override void PostExecute()
{
    // Close the reader
    base.PostExecute();
    SR.Close();
}
- Save the script, close the SSIS script editor, and click OK to get out of the Script Transformation Editor, back to the Data Flow.
- I would suggest not putting the output into a database table directly, but writing it to a file first to check the result. This way you don't need to worry about data conversion between string and int, etc. Create a flat file connection with 12 columns, all string with length 10, as follows:
- Run it. If it fails, check the source file and make sure it looks like the example above. Also check the script and make sure it matches the code above.
- Create an OLE DB Destination or SQL Server Destination (see my post here on why, practically speaking, we can't use the SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using the SQL Server Destination (because you are loading to your local instance), then on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you'll get this error: "Unable to prepare SSIS bulk insert", or something like that.
- Create a Data Conversion transformation to convert the Col3 columns from string to int, like this (a T-SQL alternative is sketched after this list):
- Connect the output of the data conversion to the DB and map the DC output columns to the DB columns like this:
- Now execute it again, and check the target table:
- Overall the Data Flow looks like this:
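As an alternative to the Data Conversion transformation in the steps above, if you land the file into a staging table with all-string columns first, the string-to-int conversion can be done in T-SQL when moving the rows into the final table. Here is a sketch, assuming a staging table called table1_staging with the same 12 columns, all varchar(10) (both the staging table and its name are my assumptions):

insert into table1 (Col1, Col2, Col3t1, Col3t2, Col3t3, Col3t4, Col3t5, Col3t6, Col3t7, Col3t8, Col3t9, Col3t10)
select Col1, Col2,
       -- nullif guards against empty strings when the file has fewer than 10 Col3 columns
       cast(nullif(Col3t1, '') as int),
       cast(nullif(Col3t2, '') as int),
       cast(nullif(Col3t3, '') as int),
       cast(nullif(Col3t4, '') as int),
       cast(nullif(Col3t5, '') as int),
       cast(nullif(Col3t6, '') as int),
       cast(nullif(Col3t7, '') as int),
       cast(nullif(Col3t8, '') as int),
       cast(nullif(Col3t9, '') as int),
       cast(nullif(Col3t10, '') as int)
from table1_staging;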