English Deutsch

 

MS-DOS is alive – within the SSIS ForEachLoop-Container!

Stumbling upon the legacy of a 20 year old operating system is odd enough, but stumbling in Integration Services is even more fun!

But first things first: if you need to process multiple files you will sooner or later put the ForEach-Loop-Container to use. Using the ForEachLoop, you can iterate over files with a certain name structure within a folder quite easily. A scenario for importing multiple files might involve reading multiple Excel-files within a folder. A possible configuration of the ForEach-Loop-Container is shown below:

The rest of the package then might look like this: within the loop a variable with the current file path is read in order to print out all found files in a message box after the container:

I have yet to show you the content of our folder, so here it is:

As expected, the package finds both Excel files – awesome! But wait, did we really expect the package to find the Excel 2007 file as well? After all the file has the extension .xlsx, which is actually quite helpful since both Excel file types differ significantly and might need separate processing within SSIS. So what if we wanted to exclude Excel 2007 files in the first place? Double-checking the configuration of the ForEach Loop Container reveals no helpful information – no matter what expression we insert into the “Files” field, the container cannot be modified to catch .xls files only. The expression *.xls always behaves like *.xls*!

At this point, Books Online once again prove to be helpful and reveal that the observed behavior is purely intentional – the expression behaves just like it would using the DIR-command in Windows (see http://msdn.microsoft.com/en-us/library/ms187670.aspx). Due to compatibility issues, the DIR-command is implemented in such a manner, that it “also returns files that have the same extension with additional characters appended” – the MS-DOS 8.3 file name restriction strikes again!

As a consequence of this behavior, we need to differentiate between file extensions within the ForEach-Loop-Container itself. For example, we could add a Precedence Constraint which evaluates an expression that checks whether the last three letters of the current file name are really x, l and s:

Using this modification, the package runs as expected and only finds a single Excel file, ignoring the Excel 2007 file:

So, is it a bug or a feature?! Once again SSIS has been proven to be at times, but in the end that is what makes it well worth loving – you always discover something new along the way!

Pearls on Codeplex – today: Enhanced SSIS Execute Package Task

Divide et impera (divide and conquer) – this well-known technique used for problem solving has been around for ages and has well made its way into modern software. For example, the Integration Services offer various possibilities for splitting large tasks into smaller, less complex and more manageable steps. You may use different data-flows for different tasks, structure your tasks using sequence containers or even call other SSIS packages from within the current one.

But with increasing package complexity (lots of connection managers and package variables) and heavy usage of package configuration comes a less manageable package – if you rely solely on built-in SSIS-Tasks. The possibility of calling other SSIS packages comes “out of the box” with SSIS 2005 and 2008, but passing variables and connection managers to the “child package” can become quite a hassle (lots of expression-hacking). It is even impossible to evaluate return values.

This “feature gap” is overcome by a Codeplex project which todays post is dedicated to: the “Enhanced SSIS Execute Package Task”. You can find the relatively unknown project under http://ssisexec.codeplex.com/ – see “Downloads” for SSIS 2005 and SSIS 2008 R2 versions. The following Books Online link might assist you in installing the task (the download consists of a dll and a slim readme): http://msdn.microsoft.com/en-us/library/ms403356.aspx. Once you set-up everything correctly, you can add the task to your list of Control Flow Tasks via a right click into the Toolbox -> Choose Items.

The component comes with a graphical interface where you can map variables and connection managers between packages. You can call packages which are stored in either your file system or msdb on a SQL Server. This way, passing an entire set of configurations to child packages can be achieved in a very comfortable manner. The following screenshot shows an example of how a mapping between packages might look like: the parent package passes its “Excel File”-connection (whose connection string has been set by package configuration) to the child package.

Beware: As of today the project is still in Alpha-status – I do not recommend using the task in a productive environment! However, the author has yet to receive any negative feedback, so it seems to run stable. I am sure he will appreciate your feedback (negative or positive) when you tried the component yourself – happy testing!

Excel and leading zeros vs. SSIS

Are you familiar with the difficulties Excel has with leading zeros? It typically takes a fair amount of formatting in order to get Excel to display columns with leading zeros correctly, as Excel treats columns with digits as numeric values. If you use SSIS and insert into an Excel destination, leading zeros will disappear upon opening the workbook no matter how you handle the column in SSIS.

A quick workaround is shown in the screenshots below. Using a “Derived Column” task, we simply create a new text-column (e.g. nvarchar) and fill it with the appropriate content embedded in quotes with a leading equal sign (mind the escape-sequence!).

As a result, Excel from now on interprets the columns’ content as a formula and the leading zero is displayed correctly.

SSAS 2005 MDX Query improvement– evil MEMBERS and even more evil: STRTOMEMBER

“The best MDX is the one you don’t write” (rumoured quote: Mosha Pasumansky) – but who believes that this quote has to be taken absolutely literally?

I was facing a customer dimension with 172,000 customers (not many indeed) dedicated to roundabout 200 sales representatives. The goal was to detect all customers, who didn’t achieve a turnover in a given month, but did so in the rest of the year. Translated into Adventure Works Cube, replacing the sales representative with dimension “Product”, the query first looked like this:

WITH

MEMBER
TurnOverDecember03
AS

(

[Measures].[Internet Sales Amount]

,STRTOMEMBER(‘[Date].[Calendar].[Month].&[2003]&[12]‘, CONSTRAINED)

)

SET Customers AS

EXCEPT(EXISTS

(NONEMPTY([Customer].[Customer].[Customer].MEMBERS

,[Measures].[Internet Sales Amount])

,[Product].[Product Categories].[Category].&[1])

,NONEMPTY([Customer].[Customer].[Customer].MEMBERS

,TurnOverDecember03))


SELECT

{

[Measures].[Internet Sales Amount]

} ON 0,

{

Customers

} ON 1 FROM [Adventure Works]


WHERE [Date].[Calendar].[Calendar Year].&[2003]

For the sake of readability, i decided to create a calculated member, “TurnOverDecember03″ to filter the set of customers.

The origin query (not the example) took about 45 seconds to display the data, which isn’t acceptable at all. Executing the query above with “MDX Studio” resulted in 44,240 calculated cells and 2.6 sec duration.

While searching for query improvement i decided to paste the expression for TurnOverDecember03 in the customers set, obeying Mosha’s quote.

WITH

SET Customers AS

EXCEPT(

EXISTS(NONEMPTY([Customer].[Customer].[Customer].MEMBERS

,[Measures].[Internet Sales Amount])

,[Product].[Product Categories].[Category].&[1])

,NONEMPTY([Customer].[Customer].[Customer].MEMBERS

,(

[Measures].[Internet Sales Amount]

,STRTOMEMBER(‘[Date].[Calendar].[Month].&[2003]&[12]‘)

)))


SELECT

{

[Measures].[Internet Sales Amount]

} ON 0,

{

Customers

} ON 1 FROM [Adventure Works]


WHERE [Date].[Calendar].[Calendar Year].&[2003]

Executing the second query in “MDX Studio” resulted in only 7,272 calculated cells but also 2.6 sec duration. Saving the CONSTRAINED option for STRTOMEMBER resulted to a duration of 0.6 seconds.

So, what happened here? Obviously, using a “Help member” leads to a calculation of the term NONEMPTY([Customer].[Customer].[Customer].MEMBERS,TurnOverDecember03) for EACH member of the customers dimension – which decreases the query’s performance the more members a dimension contains.

Furthermore, using STRTOMEMBER with CONSTRAINED also decreases the query’s performance. That is why one should take Mosha’s quote REALLY literally. In this case here, just saving the words “MEMBER, AS, STRTOMEMBER and CONSTRAINED” lead to a query improvement of almost 500 % and a 86% reduction of calculations.

Besides, in SSAS 2008 and it’s really performance boosting “block computing” the mentioned problem doesn’t occur at all, you can use “Help Members” without shame, both queries mentioned above calculated 7,272 cells only.

MDX ParallelPeriod – don’t leave your levels – others might not be there

Getting prior dimension members in a time dimension isn’t a voodoo affair. MDX provides us with a nice function, called “PARALLELPERIOD”, which “Returns a member from a prior period in the same relative position as a specified member” – e.g.: creating a query on “AdventureWorks Cube 2008″ with the following statement:

WITH

 

MEMBER LevelMonth_OneYearAgo AS

PARALLELPERIOD([Date].[Calendar].[Month],12,

[Date].[Calendar].CURRENTMEMBER).MEMBER_CAPTION

 

MEMBER LevelYear_OneYearAgo AS

PARALLELPERIOD([Date].[Calendar].[Calendar Year],1,

[Date].[Calendar].CURRENTMEMBER).MEMBER_CAPTION

 

SELECT {

LevelMonth_OneYearAgo

,LevelYear_OneYearAgo

} ON 0 ,

{

[Date].[Calendar].[Month].&[2003]&[8]

}ON 1 FROM [Adventure Works]

returns:

Although the hierarchy level “[Date].[Calendar].[Calendar Year]” isn’t mentioned on any axis, its values can be seen because of its attribute relationship to hierarchy level “[Date].[Calendar].[Month]“.

So, getting results with PARALLELPERIOD is possible despite NOT choosing the same hierarchylevel as mentioned on the query’s axis – if attribute relationships are designed between these levels .

But what happens if prior periods are not available? In the AdventureWorks OLAP Database, the timedimension Date’s starting member is July 1st, 2001. Replacing the month on AXIS 1 in the query above with “[Date].[Calendar].[Month].&[2002]&[6]” returns the following:

The result for level month is correct, there is no June 2001 in AdventureWorks, but what happened to the member created with PARALLELPERIOD on level “Year”? December 2001 is shown which is definitely NOT the current months last year value. In this case here, the last month of the prior year is displayed, but one can’t rely on this effect. Please observe the following example (and give it a try on your own AdventureWorks Cube)

The goal is to retrieve the well-known “Sales amount” and last years “Sales Amount” at any given month in AdventureWorks Cube between 2001 and 2002. Browsing the cubes values returns the following:

There are no values available for January 2001 till June 2001. Creating a similar MDX query like:

WITH

//LAG: Returns the member that is a specified number of positions

//(here: 12) BEFORE a specified member along the member’s dimension/hierarchy.

MEMBER [Measures].[Lag_Month] AS

([Measures].[Sales Amount],

[Date].[Calendar].CURRENTMEMBER.LAG(12))

 

//PARALLELPERIOD: Returns a member from a prior period

//in the same relative position as a specified member

MEMBER [Measures].[PP_LevelMonth] AS

([Measures].[Sales Amount],

PARALLELPERIOD([Date].[Calendar].[Month],12,

[Date].[Calendar].CURRENTMEMBER))

 

MEMBER [Measures].[PP_LevelYear] AS

([Measures].[Sales Amount],

PARALLELPERIOD([Date].[Calendar].[Calendar Year],1,

[Date].[Calendar].CURRENTMEMBER))

 

SELECT {

[Measures].[Sales Amount]

,[Measures].[Lag_Month]

,[Measures].[PP_LevelMonth]

,[Measures].[PP_LevelYear]

} ON 0,

{ //Level month on axis 1

[Date].[Calendar].[Month].&[2001]&[7]

    : [Date].[Calendar].[Month].&[2002]&[12]

} ON 1 FROM [Adventure Works]

returns the following:

The values for “Sales Amount” of the member created with PARALLELPRIOD on level “Year” are in disorder, starting at January 2002 with the first existing value for “Sales Amount” in 2001 (July 2001), but not starting at the node for July 2002, which is the first node having both values, the current years and last years “Sales Amount”.

Members created with MDX function “LAG” or with AXIS 1 corresponding level “Month” return correct results.

Therefore, don’t mix up the levels, even if the implicit possibilities of MDX tempt you to do so.

Slowly Changing Dimension – now even faster!

Have you ever dealt with incrementally filling an existing Data Warehouse? Then you probably have encountered the problem of Slowly Changing Dimensions (short: SCD). Mostly small changes which affect records of a dimension are the primary cause for an SCD to be used. One can even differentiate between a few cases so generally speaking, using an SCD requires special attention.

SQL Server Integration Services therefore supplies the developer with an SCD-wizard, which allows for quick and uncomplicated usage of the SCD. But if you ever used the wizard, you may have encountered the following mayor drawbacks:

1. Changes to the component cause connections to the following components to be dropped. These connections then have to be manually reproduced.

2. Bigger changes to dimensional data are handled with very low performance.

This is why the SCD Component is sometimes replaced with workarounds and has yet to break out of its shadowy existence between Script Components and SQL-Tasks.

In the past months, a project named KimballsMethodSCD (http://kimballscd.codeplex.com/) has evolved on Microsofts Community-Platform Codeplex (http://www.codeplex.com), in order to help the Slowly Changing Dimension redeem itself within SSIS. The components author is Todd McDermid (http://toddmcdermid.blogspot.com/), while Ralph Kimball delivered the components name. He is author of the book The Data Warehouse Toolkit and his SCD-Best Practices have had substantial influence on the development of the component.

But now on to the add-on: the installation is pretty straight-forward and flawless, thanks to the supplied msi-Installer. The KimballMethodSCD-Component is then available as a Dataflow-Component. An exhaustive illustration of the component would deliver enough substance for ten Blog-entries, so I am going mention only the “pearls”:

- implementation of Kimballs Best Practices: unknown member support; support of a “reason for change”-column; simple row-auditing for insert- and update-operations

- the existing dimension is read from a dataflow instead of a connection manager which provides for more flexibility and better cache-behavior

- changes are non-destructive to the following components

- flexible column comparisons are possible: e.g. columns can be compared case (un-) sensitive and space (un-) sensitive

- a substantial gain in performance up to the factor of 100 (!) can be easily achieved with bigger dimensions

While the first points mostly help the developer with design, the gain in performance is a nice side-effect and can be accredited to the add-on’s proper implementation: by using multiple threads and sort-optimization execution times are possible which lie far beyond those of the standard-wizard. The wizard uses a slow row-by-row lookup for the dimensional table, while KimballMethodSCD reads the entire table in one stream.

The example above, which can be downloaded with the add-on via Codeplex, shows the impressive performance-gain: a dimension table was loaded with 120.000 records. Next, the source file was manipulated to 33 SCD1 changes and 42 SCD2 changes, affecting only a small percentage of the original records. Upon execution, the SSIS standard wizard took 29 minutes while KimballMethodSCD only needed 15 seconds! While the system had to compensate for a 10 times higher demand in main memory, the user was rewarded with a gain in performance with the factor 118.

The execution times, the extended functionality (according to Ralph Kimballs Best Practices) and last but not least the simplicity of KimballMethodSCD make the component a valuable alternative to the standard wizard and should help propagate the usage of Slowly Changing Dimension within SSIS.

Parameterized DataReader Source in SSIS

When integrating different sources with Integration Services of SQL Server 2005, one sometimes encounters ODBC-sources, which can be queried using a DataReader-Connection. But when it comes to parameterization, one quickly encounters the limits of the DataReader: While parameterization is possible with OLE-DB sources, there is no obvious way which applies to the DataReader connection.

The following example shows the problem: We want to control the condition on SellStartYear via an external parameter.

The editor neither offers a direct parameterization nor indirect by using a string variable as SQL command.

The following trick can be used here:

The Dataflow containing the source offers the configuration of different values via its properties window:

The SQL commands of all contained sources can be controlled by expressions. In our example it would be the property [Data Reader Source].[SQLCommand]. Attention should be paid to the Name-property of the source (in this case “Data Reader Source”) – it has to be static in order to make the parameterization work. Now the SQL-query can be parameterized in any possible way.

Here we stored the year in an external String-variable, which we can configure dynamically.

In order not to lose the overview, the Expression Highlight feature of BIDS-Helper can be very useful here. It flags the Dataflow after a successful configuration with a magenta-colored triangle. Now one can easily determine which SSIS-objects are controlled by expressions.

The same workaround can be used for ADO.NET sources with Integration Services of SQL Server 2008. Here, the editor has been upgraded to the level of OLE DB-sources, but unfortunately nothing has changed in respect to the missing functionality regarding parameterization.