Thursday, 14 August 2014

SQL: More PIVOT

In the previous article you saw PIVOT used with the months of the year, which is a common use of the operator.

Let's look at another example, again using Northwind. This time what I'd like to achieve is a list of employees as columns and the number of orders each has taken in each month/year. Let's start by looking at the raw data

SELECT OrderDate, OrderId, CustomerId, EmployeeId FROM Orders
ORDER BY OrderDate, EmployeeId
Gives us some output that isn't quite the data we are looking for
OrderDate               OrderId     CustomerId EmployeeId
----------------------- ----------- ---------- -----------
1996-07-04 00:00:00.000 10248       VINET      5
1996-07-05 00:00:00.000 10249       TOMSP      6
1996-07-08 00:00:00.000 10251       VICTE      3
1996-07-08 00:00:00.000 10250       HANAR      4
1996-07-09 00:00:00.000 10252       SUPRD      4
...
1998-05-06 00:00:00.000 11074       SIMOB      7
1998-05-06 00:00:00.000 11075       RICSU      8

(830 row(s) affected)
So let's extract the year and month from the OrderDate and then sort by them
SELECT
    YEAR(OrderDate) AS YY, 
    MONTH(OrderDate) AS MM, 
    OrderId, 
    CustomerId, 
    EmployeeId FROM Orders
ORDER BY 
    YEAR(OrderDate), 
    MONTH(OrderDate),
    EmployeeId
Giving us something a little bit better
YY          MM          OrderId     CustomerId EmployeeId
----------- ----------- ----------- ---------- -----------
1996        7           10258       ERNSH      1
1996        7           10265       BLONP      2
1996        7           10266       WARTH      3
1996        7           10251       VICTE      3
1996        7           10253       HANAR      3
1996        7           10256       WELLI      3
1996        7           10257       HILAA      4
...
We can see that EmployeeId 1 took one order in July 1996, EmployeeId 3 took four, and so on. We can summarise this with
SELECT
    YEAR(OrderDate) AS YY, 
    MONTH(OrderDate) AS MM, 
    EmployeeId, 
    COUNT(*) AS NumberOfOrders 
FROM Orders
GROUP BY 
    YEAR(OrderDate), Month(OrderDate), EmployeeId
ORDER BY 
    YEAR(OrderDate), Month(OrderDate), EmployeeId
Producing
YY          MM          EmployeeId  NumberOfOrders
----------- ----------- ----------- --------------
1996        7           1           1
1996        7           2           1
1996        7           3           4
1996        7           4           7
...
Now to PIVOT this we can simply run this
SELECT * FROM
(
    SELECT 
        YEAR(OrderDate) As YY, 
        MONTH(OrderDate) AS MM, 
        CustomerId,
        EmployeeId from Orders) As Source
PIVOT
(
    COUNT(CustomerId) FOR EmployeeId IN ( [1], [2], [3], [4] )
) AS PivotTable
This gives us the output for Employees 1, 2, 3 and 4
YY          MM          1           2           3           4
----------- ----------- ----------- ----------- ----------- -----------
1997        1           3           4           7           8
1996        8           5           2           2           5
1997        9           8           7           4           5
...
1996        7           1           1           4           7
1998        4           8           18          10          10

(23 row(s) affected)
Not quite right. I'd rather it was in some sort of order
SELECT * FROM
(
    SELECT 
        YEAR(OrderDate) As YY,
        MONTH(OrderDate) AS MM,
        CustomerId,
        EmployeeId 
    FROM Orders) As Source
PIVOT
(
    COUNT(CustomerId) FOR EmployeeId IN ( [1], [2], [3], [4] )
) AS PivotTable
ORDER BY YY, MM
One thing to note - in the ORDER BY clause you can use the output column names (YY and MM) - you don't have to repeat the YEAR() and MONTH() expressions. The output will be
YY          MM          1           2           3           4
----------- ----------- ----------- ----------- ----------- -----------
1996        7           1           1           4           7
1996        8           5           2           2           5
1996        9           5           5           1           3
...
There are, however, nine employees, so to include them all we need to modify our query.
SELECT * FROM
(
    SELECT 
        YEAR(OrderDate) As YY, 
        MONTH(OrderDate) AS MM, 
        CustomerId, 
        EmployeeId FROM Orders) As Source
PIVOT
(
    COUNT(CustomerId) FOR EmployeeId 
         IN ( [1], [2], [3], [4], [5], [6], [7], [8], [9] )
) AS PivotTable
ORDER BY YY, MM
Finally giving us
YY    MM    1     2     3     4     5     6     7     8     9
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1996  7     1     1     4     7     3     2     0     2     2
1996  8     5     2     2     5     0     4     1     6     0
1996  9     5     5     1     3     1     3     2     3     0
...
This creates an issue: what happens when a new employee joins? We would need to modify our query every time. The next article will be about creating a dynamic PIVOT - no real rocket science, just generated SQL.
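As a taste of that approach, here is a minimal sketch - it builds the column list from the EmployeeId values that actually exist and then executes the generated statement (the next article will do this properly)
DECLARE @cols nvarchar(max), @sql nvarchar(max)

-- Build the list of pivot columns ([1], [2], ...) from the data itself.
-- Building a string up in a variable like this is a common shortcut, not guaranteed behaviour.
SELECT @cols = COALESCE(@cols + ', ', '') + QUOTENAME(CAST(EmployeeId AS nvarchar(10)))
FROM (SELECT DISTINCT EmployeeId FROM Orders) AS E
ORDER BY EmployeeId

SET @sql = N'SELECT * FROM
(
    SELECT YEAR(OrderDate) AS YY, MONTH(OrderDate) AS MM, CustomerId, EmployeeId FROM Orders
) AS Source
PIVOT
(
    COUNT(CustomerId) FOR EmployeeId IN (' + @cols + N')
) AS PivotTable
ORDER BY YY, MM'

EXEC sp_executesql @sql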

Tuesday, 29 July 2014

SQL: Introduction to the PIVOT keyword

I haven't written in a while. Normally my articles have covered some type of more advanced coding, which I have directed the talented apprentices I worked with to look at. However, I have moved back to development, so there hasn't been the same impetus. But I think I will keep going with some more advanced topics. This one is on the use of PIVOT, and I have a few more planned in a series of articles on it.

As before I will be using the Northwind database. Let's set an initial target: we want the total Freight cost in each month. So in the database we can run this SQL to get the list of orders along with the order date and freight cost.
SELECT OrderId, OrderDate, Freight FROM Orders
Produces
OrderId     OrderDate               Freight
----------- ----------------------- ---------------------
10248       1996-07-04 00:00:00.000 32.38
10249       1996-07-05 00:00:00.000 11.61
10250       1996-07-08 00:00:00.000 65.83
...
11076       1998-05-06 00:00:00.000 38.28
11077       1998-05-06 00:00:00.000 8.53

(830 row(s) affected)
We can get the month using the MONTH function (or DATEPART), like this
SELECT OrderId, Month(OrderDate) AS MonthNumber, Freight FROM Orders
Producing
OrderId     MonthNumber Freight
----------- ----------- ---------------------
10248       7           32.38
10249       7           11.61
10250       7           65.83
...
11076       5           38.28
11077       5           8.53

(830 row(s) affected)
And this data could be summarised using this
SELECT MONTH(OrderDate) AS MonthNumber, SUM(Freight) AS TotalFreight FROM Orders
GROUP BY MONTH(OrderDate)
ORDER BY 1
giving us this output
MonthNumber TotalFreight
----------- ---------------------
1           7702.42
2           5874.39
3           7267.83
4           9332.67
5           4146.48
6           1852.65
7           3746.90
8           4475.44
9           4360.53
10          5466.12
11          4160.71
12          6556.55
The PIVOT operator (at its simplest) will allow us to "pivot" the data - we will take the month numbers and make each one a column heading.
SELECT *
FROM 
  (SELECT MONTH(OrderDate) AS MonthNumber, Freight FROM Orders) AS Source
PIVOT
(SUM(Freight)
 FOR MonthNumber 
    IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12])) AS PivotTable
And now we have our Month Numbers as columns
1        2        3        4        5        6        
-------- -------- -------- -------- -------- -------- 
7702.42  5874.39  7267.83  9332.67  4146.48  1852.65  

7        8        9        10       11       12
-------- -------- -------- -------- -------- --------
3746.90  4475.44  4360.53  5466.12  4160.71  6556.55
Let's look at the syntax of the PIVOT statement (or at least this simple example). We have what is called an "in-line" view for our source data. This is simply our MonthNumber and the Freight cost. If you run this as a separate statement you will get a list of orders (in this case without the OrderId) and the freight cost.

We then have the PIVOT keyword and after this (in brackets) the details of what we are pivoting. You are required (i.e. YOU NEED) to have an aggregation function. In this case we are using SUM (to get the total freight cost for each month). Then we have
FOR MonthNumber IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12])
This defines where the pivot values will come from (in this case the MonthNumber in our source data) and which column values will be shown (the values 1, 2, 3, etc. up to 12). The values in the pivot column must be convertible to a string (or more specifically nvarchar) - so they can't be image or text, etc.

You need to have the table aliases (Source and PivotTable here) - even though you don't refer to them. The PIVOT operation will perform the grouping - you do not specify a GROUP BY clause.

There will be a few more articles in this series, looking at more complex examples, a dynamic PIVOT and using UNPIVOT as well as a bit more detail. I like to start with an example rather than the theory. Before those articles here is a slightly more complex example
SELECT *
FROM 
   (SELECT YEAR(OrderDate) AS Year, MONTH(OrderDate) AS MonthNumber, Freight FROM Orders) 
       AS Source
PIVOT
(SUM(Freight)
 FOR MonthNumber 
    IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12])) AS PivotTable
The only change is that an extra column has been added to our source - Year. And if you run this you get
Year   1         2         3         4         5         6         
------ --------- --------- --------- --------- --------- --------- 
1998   5463.44   4272.94   5379.02   6393.57   685.08    NULL      
1996   NULL      NULL      NULL      NULL      NULL      NULL      
1997   2238.98   1601.45   1888.81   2939.10   3461.40   1852.65   

7         8         9         10        11        12
--------- --------- --------- --------- --------- ---------
NULL      NULL      NULL      NULL      NULL      NULL
1288.18   1397.17   1123.48   1520.59   2151.86   2798.59
2458.72   3078.27   3237.05   3945.53   2008.85   3757.96
Suddenly we have a Year output as well - no extra GROUP BY (this has been done for us). We can order this (add an ORDER BY clause).
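For example, adding an ORDER BY to the same statement
SELECT *
FROM 
   (SELECT YEAR(OrderDate) AS Year, MONTH(OrderDate) AS MonthNumber, Freight FROM Orders) 
       AS Source
PIVOT
(SUM(Freight)
 FOR MonthNumber 
    IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12])) AS PivotTable
ORDER BY Year
puts the years in order.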

One thing to note is that we have NULLs in our output. When we use the SUM aggregate function, if there is nothing to SUM (i.e. no orders for that month and year) the output will be NULL.

Let's go back to our source data and run this

SELECT 
     YEAR(OrderDate) AS Year, 
     MONTH(OrderDate) AS MonthNumber, 
     SUM(Freight) AS TotalFreight 
FROM Orders
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY 1
You get output
Year        MonthNumber TotalFreight
----------- ----------- ---------------------
1996        7           1288.18
1996        8           1397.17
1996        9           1123.48
...
1998        4           6393.57
1998        5           685.08

(23 row(s) affected)
There is no entry for January 1996 - the row simply doesn't exist - so in our source data there is nothing for a nifty ISNULL to work on. It is unfortunately the same inside the PIVOT itself (although we will look at some techniques for handling this in a later article).
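As a taste of one such technique: the pivoted columns can be referenced in an outer SELECT, so you can wrap them in ISNULL there. A minimal sketch
SELECT Year,
   ISNULL([1], 0) AS [1], ISNULL([2], 0) AS [2], ISNULL([3], 0) AS [3],
   ISNULL([4], 0) AS [4], ISNULL([5], 0) AS [5], ISNULL([6], 0) AS [6],
   ISNULL([7], 0) AS [7], ISNULL([8], 0) AS [8], ISNULL([9], 0) AS [9],
   ISNULL([10], 0) AS [10], ISNULL([11], 0) AS [11], ISNULL([12], 0) AS [12]
FROM 
   (SELECT YEAR(OrderDate) AS Year, MONTH(OrderDate) AS MonthNumber, Freight FROM Orders) 
       AS Source
PIVOT
(SUM(Freight)
 FOR MonthNumber 
    IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12])) AS PivotTable
ORDER BY Year
replaces the NULLs with zeros in the output.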

However, if you use the COUNT aggregate function you get zeros
SELECT * 
FROM 
 (SELECT YEAR(OrderDate) AS Year, MONTH(OrderDate) AS MonthNumber, Freight FROM Orders) 
    AS Source
PIVOT
(COUNT(Freight) 
 FOR MonthNumber 
    IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12])) AS PivotTable
ORDER BY 1
Will produce
Year        1           2           3           4           5           6           
----------- ----------- ----------- ----------- ----------- ----------- ----------- 
1996        0           0           0           0           0           0           
1997        33          29          30          31          32          30          
1998        55          54          73          74          14          0           

7           8           9           10          11          12
----------- ----------- ----------- ----------- ----------- -----------
22          25          23          26          25          31
33          33          37          38          34          48
0           0           0           0           0           0

Monday, 24 February 2014

SQL: Using a Collation within a SELECT list or WHERE clause

Sometimes when you do comparisons on string data within a table you want them to be case sensitive. However, for a lot of queries (or table structures, to be more specific) the comparisons will be case insensitive. For example, the following query on the Employees table in Northwind
SELECT EmployeeID, LastName, FirstName FROM Employees
WHERE FirstName IN ('Andrew', 'robert')
Will return two rows:
EmployeeID  LastName             FirstName
----------- -------------------- ----------
2           Fuller               Andrew
7           King                 Robert
Even though I asked for FirstName of 'robert' - I got Robert (with an uppercase R).

This is due to the collation sequence. If you use sp_help Employees it will show you the structure of the table. I am using this query to return the structure (it is part of the query within the system stored procedure sp_help).
SELECT
  'Column_name' = name,
  'Type' = type_name(user_type_id),
  'Computed' = CASE WHEN ColumnProperty(object_id, name, 'IsComputed') = 0 
     THEN 'No' ELSE 'Yes' END,
  'Length' = convert(int, max_length),
  'Nullable' = CASE WHEN is_nullable = 0 THEN 'No' ELSE 'Yes' END,
  'Collation' = collation_name
FROM 
  sys.all_columns
WHERE
  object_id = 245575913
AND 
  name IN ('EmployeeID', 'LastName', 'FirstName')
(If you are going to do this check your object_id is correct).
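If you'd rather not hard-code the number, the built-in OBJECT_ID function will look it up for you
SELECT OBJECT_ID('dbo.Employees') AS EmployeesObjectId
and you can use that expression directly in the WHERE clause instead of the literal value.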

The structure comes back as
Column_name  Type   Computed Length      Nullable Collation
------------ ------ -------- ----------- -------- --------------------
EmployeeID   int    No       4           No       NULL
LastName     nvarch No       40          No       Latin1_General_CI_AS
FirstName    nvarch No       20          No       Latin1_General_CI_AS
The collation here is Latin1_General_CI_AS - the CI tells us it is case insensitive (AS means accent sensitive, AI would be accent insensitive, and if you see WS/WI this is width sensitivity/insensitivity and KS/KI kana sensitivity/insensitivity respectively).
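As an aside, you can list all the available collations (with a description of what the abbreviations mean) using the built-in sys.fn_helpcollations function
SELECT name, description FROM sys.fn_helpcollations()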

But what if I want to perform queries that are case sensitive? After all, when I display the data it appears in the case in which it is stored.

I can convert a column to a different collation - this is done by adding, after the column name, the COLLATE keyword and the name of the collation I want. Thus
SELECT EmployeeID, LastName, FirstName,
FirstName COLLATE Latin1_General_CS_AS AS CS_FirstName
FROM Employees
WHERE FirstName IN ('Andrew', 'robert')
Adds an additional column (with an alias CS_FirstName):
EmployeeID  LastName             FirstName  CS_FirstName
----------- -------------------- ---------- ------------
2           Fuller               Andrew     Andrew
7           King                 Robert     Robert
But I can't use a column alias in a WHERE clause, so using something like this
SELECT EmployeeID, LastName, FirstName,
FirstName COLLATE Latin1_General_CS_AS AS CS_FirstName
FROM Employees
WHERE FirstName COLLATE Latin1_General_CS_AS IN ('Andrew', 'robert')
will now give us the correct results.
EmployeeID  LastName             FirstName  CS_FirstName
----------- -------------------- ---------- ------------
2           Fuller               Andrew     Andrew
Of course you don't need the computed column in the final output.
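That is, for the case sensitive search you can simply write
SELECT EmployeeID, LastName, FirstName
FROM Employees
WHERE FirstName COLLATE Latin1_General_CS_AS IN ('Andrew', 'robert')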

Saturday, 28 December 2013

Playing with Prolog

If you ask a number of computing science graduates (especially from the 90's) a few will have used the Prolog language at some point. Prolog is a logic language - think of it as defining a lot of facts and matching the fact with some data (like doing a database lookup). You run a "query" by matching your rules to your facts.

In my final year of my degree I worked on a project to create a "Graphical Interface for Prolog". The project represented (I think) 25% of the final score for the year and this project made me a developer. The idea for the project came from a poll of possible projects - I was told at the end that they thought the work that would be done in this project would be mainly theoretical. However, I ended up producing a "product" - in fact two products. (I also ended up winning the IBM prize that year for the best project).

We already had a Prolog interpreter for the classes we did in Artificial Intelligence - and the source code for it, which was written in C by Fernando Pereira whilst he was doing a PhD at Edinburgh University. I took his code and added some hooks into it to allow it to be extended. This gave a mechanism to add new methods working with the Prolog data types. I then wrote a set of graphical predicates (essentially functions that said whether they worked or not - and could pass back data).

For a few years I had been meaning to try and get this up and running again. I did have the source code - but on 5.25" floppy disks (which I can't read any more). But I did have the printouts.

I looked around on the internet for a Prolog interpreter and found Prolog.NET, for which there was source code (written in C#). So, it being the holidays, I decided on doing a bit of sofa coding - although after four or five hours I do wish I had just sat at a desk with multiple monitors and saved myself from all the papers thrown about the floor!

So, here are a few instructions if you want to do your own extensions. The main piece of code to add to is Library.cs which has a list of predicates and functions. I added the following code.
standard.Add("graphics", 1, new PredicateDelegate(Graphics.graphics), false);
standard.Add("axes", 2, new PredicateDelegate(Graphics.axes2), false);
standard.Add("grset", 2, new PredicateDelegate(Graphics.grset), false);
standard.Add("point", 1, new PredicateDelegate(Graphics.point), false);
standard.Add("line", 2, new PredicateDelegate(Graphics.line), false);
standard.Add("rectangle", 2, new PredicateDelegate(Graphics.rectangle), false);
standard.Add("wait", 1, new PredicateDelegate(Graphics.wait), false);
These predicates that I am defining (graphics, axes, grset, point, line, rectangle) are the same ones I did twenty years ago - I just need to implement them. My implementation will be with Windows Forms - but without events. This makes it difficult to get mouse button presses, so I have another predicate (wait) which will wait for a period of time.

Other examples of code already written are in the folder LibraryMethods. I added a file Graphics.cs and changed the namespace and the class definition as below
namespace Prolog
{
    internal static class Graphics
    {
       // Your code
    }
}
I initially started "hacking" the code, trying to keep track of a Windows Form instance within my Graphics class. But I stopped. The class was now doing more than one thing - dealing with the Prolog predicates as well as working out the Windows Forms - and it was getting confusing.

Looking at my old project in C I had a set of methods called s1wg - Simple One Window Graphics. The implementation behind this was with X-Windows. Guess what - I decided to use the same "interface" - although now within a class. It was actually good to see that the implementation I did all these years ago (before getting real experience) was essentially implementing the S of SOLID (Single Responsibility Principle). I could write this code separately (in another project) and check it was working without worrying about the Prolog part (and that was giving me some problems as well). So s1wg is a simple class with methods to
  • Constructor
  • Close down (close the window)
  • Create a window (with specified position, size and title)
  • Draw a point
  • Draw a line (between two positions)
  • Draw a rectangle (with a position, width and height)
I then went back to work with the Prolog parts. The predicates we have are
  • graphics - with a parameter to initialise or close (e.g. graphics(initialise) or graphics(close)).
  • axes - to define virtual axes (e.g. axes([-1,-8],[8,2])).
  • grset - to set something (e.g. grset(size,[400,400]) sets the window size in pixels)
  • point - to draw a point using virtual co-ordinates (e.g. point([3,3])).
  • line - to draw a line between virtual co-ordinates (e.g. line([2,5],[5,6])).
  • rectangle - to draw a rectangle at a particular position with a specified width and height (e.g. rectangle([1,3],[5,6])).
  • wait - taking a parameter for a delay in milliseconds (e.g. wait(100)).
I'll get the code put up somewhere in case anyone wants to try this. Here is however an example of some of the code
public static bool line(WamMachine machine, WamReferenceTarget[] arguments)
{
    Debug.Assert(arguments.Length == 2);

    WamCompoundTerm lowerLeft = arguments[0].Dereference() as WamCompoundTerm;
    WamCompoundTerm upperRight = arguments[1].Dereference() as WamCompoundTerm;

    if (lowerLeft == null || upperRight == null)
        return false;

    if (lowerLeft.Functor == Functor.ListFunctor && 
        upperRight.Functor == Functor.ListFunctor)
    {
        if (lowerLeft.Children.Length != 2 || upperRight.Children.Length != 2)
            return false;

        double? llX = GetDouble(lowerLeft.Children[0]);
        WamCompoundTerm t = lowerLeft.Children[1].Dereference() as WamCompoundTerm;
        double? llY = GetDouble(t.Children[0]);

        double? urX = GetDouble(upperRight.Children[0]);
        t = upperRight.Children[1].Dereference() as WamCompoundTerm;
        double? urY = GetDouble(t.Children[0]);

        if (llX == null || llY == null || urX == null || urY == null)
            return false;

        s1wg.DrawLine(ScaleX((double)llX), ScaleY((double)llY), 
                      ScaleX((double)urX), ScaleY((double)urY));
    }
    return true;
}
This piece of code is checking for two parameters, which are Prolog lists, e.g. line([3,4],[7,8]). From this you need to extract the values (as WamCompoundTerm), check that we have lists, and then that both lists have two entries. I wrote a method to extract a double from each element. I had major problems with doubles, so ended up writing this bit of code
private static double? GetDouble(WamReferenceTarget argument)
{
    WamValueDouble operandDouble = argument.Dereference() as WamValueDouble;
    Double d = 0.0d;

    if (operandDouble == null)
    {
        WamValueInteger operandInteger = argument.Dereference() as WamValueInteger;
        if (operandInteger == null)
        {
            WamValueString operandString = argument.Dereference() as WamValueString;
            if (operandString != null)
            {
                if (Double.TryParse(operandString.Value, out d))
                {
                    return d;
                }
                else 
                    return null;
            }
            else
                return null;
        }
        else
            return (double)operandInteger.Value;
    }
    else
        return operandDouble.Value;
}
Ok - definitely needs cleaning up. I check if I can get a double (which I cannot ever get working), then check if we have an integer - if it isn't an integer I check for a string that can be converted into a double. Note I am using the nullable double functionality. I am pretty sure I told my apprentices that if you start using structures like this it is probably a "code smell" of a bad design!

There are some other methods (ScaleX and ScaleY) that translate between the virtual and real co-ordinates.

What was the point of this graphical interface? It allowed you to see problems solved graphically. My first one isn't really a logic problem, but it shows the use of look-ups. This exercise is "chain encoded images" - for a chain of numbers I draw lines in a direction determined by each number.

Firstly, let's have some "facts"
vector(["0"],[1,0]).
vector(["1"],[1,1]).
vector(["2"],[0,-1]).
vector(["3"],[-1,1]).
vector(["4"],[-1,0]).
vector(["5"],[-1,-1]).
vector(["6"],[0,1]).
vector(["7"],[1,-1]).
These are a list of vectors - if I specify "1" I will move 1 unit on the X axis and 1 unit on the Y axis. So if you execute the command
:-vector(["0"],C).
You will get output that C is [1,0]. Next I will have a predicate newposition which will calculate a new position
newposition([X1,Y1],[DX,DY],[X2,Y2]) :-
    (X2 is (X1 + DX)),
    (Y2 is (Y1 + DY)).
Running the following command
:-newposition([3,4],[1,2],NewPosition).
Then you will get the NewPosition as [4,6]. Hooking this up to vector you can use this like
:-vector(["2"],Delta),
newposition([4,2],Delta,NewPosition).
will give the output
Delta = [0,-1]
NewPosition = [4,1]
One of my main issues with this implementation was getting lists working. Lists in Prolog are managed as the first element in the list (the head) and the remaining items (the tail). The code to draw the chain consists of two predicates. Prolog is a very recursive language. The first version of this predicate does the main work
drawchain([H|T],S,P) :-
    vector([H],D),
    newposition(P,D,N),
    line(P,N),
    wait(100),
    drawchain(T,S,N).
Here the drawchain predicate takes a list as the first parameter [H|T], where H is the head of the list (the first element) and T is the tail (the remaining elements). The remaining parameters are S (for start position) and P (for current position). Take each line in turn
  • vector([H],D) uses the head element to get the value D (the delta/change from position)
  • newposition(P,D,N) calculates the new position N using the current position P and delta D
  • line(P,N) draws a line between P and N
  • wait(100) will wait 100 milliseconds
  • drawchain(T,S,N) will call the drawchain predicate with parameters S (the start position) and N (the new position we just calculated) with T being the tail of the list.
The next predicate is the stop condition - when the list is empty. In this case we want the predicate to draw a line from the current position to the start position (the "parameters" P and S respectively).
drawchain([],S,P) :-
    line(P,S).
Now to execute this I have the predicate below
chains :-
  Chain is ["0","1","6","3","4","5","2"],
  axes([-5,-2],[11,6]),
  graphics(initialise),
  drawchain(Chain,[0,0],[0,0]).
Which is executed with
:-chains.
And you get this.

Chain Encoded Images


It's been really quite fun to play with this. Not sure if I have wasted a day of my time, but it was interesting to see the work I did all those years ago in operation. I have other examples to get up and running with this (Towers of Hanoi, Binary Trees and Travelling Salesman) which I implemented with this interface, as well as completing the graphics predicates (text, filled shapes, etc.).

The good news is that I did in less than a day work that I did over a three month period in 1990 (albeit with a little bit of an idea now of what I am doing).

Friday, 6 December 2013

First view of parallel programming in C#

I had my current apprentice group sitting an assessment in software testing yesterday. Whilst they were doing this I thought I would investigate parallel computing in C#/.NET.

This is something I was very heavily involved with from about 1997 to 2000. At the time I was working at Schlumberger with MPP (massively parallel processing) machines - mainly IBM SP/2 (now IBM e-series), Silicon Graphics Origin servers and Sun SparcStations. In the office (in 1999) we had a sixteen processor IBM SP/2 - but I also worked closely with IBM, SGI and Sun, visiting them to use servers with up to 500 processors, including several weeks spent at IBM in Poughkeepsie in upstate New York and at Cray Park (SGI's HQ at the time) in Minnesota.

What I was looking at was something I could do on my own laptop, which has an Intel i5 with 4 cores. Could I get a speed-up, and how easy would it be to implement? Previously I used libraries like MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) to provide a common approach for all the platforms that we supported.

I looked for some reference material (see http://www.microsoft.com/en-gb/download/details.aspx?id=19222) which I could work through, but needed some examples to see it in operation. This article gives an example - however, I would strongly recommend reading that material to understand what some of the issues might be.

If you are reading this looking for asynchronous programming in another thread (i.e. doing some work/calculation whilst loading a page) then this is the wrong article. This will be about solving a problem - and I'll start with a simple one, but eventually maybe use it for some mathematical puzzle.

The problem that I will look to solve is the addition of an array of numbers to get a sum. I'll work with an array of 500,000,000 shorts, initialised via a loop with the current millisecond of the clock. Something more "real" or even "random" would be better, but I just need some numbers in the array.

The code to do this will be

private static short[] numbers = new short[500000000];

static void InitialiseTask()
{
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = (short) DateTime.Now.Millisecond;
    }
}

To perform the sum of this array the sequential code would be

static void SequentialTask()
{
    Console.WriteLine("Starting Sequential Task");
    long sum = 0;
    for (int i = 0; i < numbers.Length; i++)
    {
        sum += numbers[i];
    }
    Console.WriteLine("Sum {0}", sum);
}

I am going to package these up with a method to record start and end times

static void Task(Action task)
{
    DateTime startTime;
    Console.WriteLine("Starting {0}", task.Method);
    startTime = DateTime.Now;

    // Run the task
    task.Invoke();
    Console.WriteLine("{0}", task.Method);
    Console.WriteLine("-- START {0:hh:mm:ss.fff} FINISH {1:hh:mm:ss.fff} MS {2}",
        startTime, 
        DateTime.Now, DateTime.Now.Subtract(startTime).TotalMilliseconds);
}

So that my main program would be

static void Main(string[] args)
{
    Task(InitialiseTask);
    Task(SequentialTask);
}

And running this gave this result



So let's turn the sequential task into a parallel task. We will split this up into two summations - the first one will sum the first 250,000,000 values and the second one the last 250,000,000 values. We will then add these two sums together to get the final sum.

We have the method below, which takes a task number as a parameter - this determines which part of the array we will sum. The return type is a long - not an int or a short, which would be too small.

static long ParallelSum(int taskNumber)
{
    Console.WriteLine("Starting Parallel Task {0}", taskNumber);
    long sum = 0;
    int start, end;
    if (taskNumber == 0)
    {
        start = 0;
        end = numbers.Length / 2;
    }
    else
    {
        start = numbers.Length / 2;
        end = numbers.Length;
    }

    for (int i = start; i < end; i++)
    {
        sum += numbers[i];
    }
    Console.WriteLine("Sum for task {0} is {1}", taskNumber, sum);
    return sum;
}

I know I have explicitly set the start/end points of the loop – rather than calculate them. Next for the .Net magic (I always tell my apprentices that there is no magic, but this is quite cool).

static void ParallelTask()
{
    Console.WriteLine("Starting Parallel Tasks");

    long[] sums = new long[2];
    Parallel.For(0, 2, task =>
        {
            sums[task] = ParallelSum(task);
        });

    long totalSum = 0;
    foreach (long sum in sums)
        totalSum += sum;

    Console.WriteLine("Total Sum {0}", totalSum);
}

We use the magic Parallel.For(inclusiveStart, exclusiveEnd, body). This will run each iteration - but in parallel. The body (a delegate taking the iteration index) is run once per iteration. So here ParallelSum will be run with arguments 0 and 1 (i.e. from and including the inclusiveStart of 0 up to, but not including, the exclusiveEnd of 2).

I have created a local array variable to hold the summation from each task, and each task stores its result there as it finishes. Once all the tasks have completed, a total sum is calculated.

Modifying the main program to perform the parallel task as below

static void Main(string[] args)
{
    Task(InitialiseTask);
    Task(SequentialTask);
    Task(ParallelTask);
}

Giving output of



With the sequential task taking 1621 milliseconds and the parallel task 1296. In a further run (where I didn't do anything else on my laptop) the results were 1588 ms against 844, which is even more impressive. I am however concerned about the nearly six minutes taken to set the values in the array. So let's make a parallel version of that. The main program will now look like this

static void Main(string[] args)
{
    Task(ParallelInitialiseTask);
    Task(SequentialTask);
    Task(ParallelTask);
}

The task to initialise the array in parallel

static void ParallelInitialiseTask()
{
    Console.WriteLine("Starting Parallel Tasks");
    Parallel.For(0, 2, ParallelInitialise);
}

static void ParallelInitialise(int taskNumber)
{
    Console.WriteLine("Starting Parallel Task {0}", taskNumber);
    int start, end;
    if (taskNumber == 0)
    {
        start = 0;
        end = numbers.Length / 2;
    }
    else
    {
        start = numbers.Length / 2;
        end = numbers.Length;
    }

    for (int i = start; i < end; i++)
        numbers[i] = (short)DateTime.Now.Millisecond;
}

When you run this the results are



Showing a good speed up.

The splitting up of the problem (i.e. each task working with its own part of the array) is known as partitioning using domain decomposition. Here we have multiple tasks doing the same thing – each task handling a portion of the data. 

My processor has two cores and four logical processors – so the next step is to look at splitting the problem up more. I’d also like to look at the tools available to view the task running. Maybe more next time, if I get a few hours of fun!

I also need to investigate what thread safe really means, locking, and review all the costs involved. The article mentioned earlier has a lot of detail - and hopefully I can get off the line with some simple examples, although I need to dig out some other typical problems to solve.

Note: A server like a SP/2 was essentially multiple RS/6000 workstations networked together with a big fast networking switch – and a Silicon Graphics Origin has what could be described as a shared memory architecture, but each piece of memory will be associated with a processor – a processor accessing the memory local to that processor will get quicker access than to other memory. 

Thursday, 19 September 2013

SQL: User Defined Functions - Multi Statement Table Valued

Nearly finished this ad-hoc series of articles. This started as an exploration of CROSS APPLY and took me back to using user defined functions and how to use functions that return tables.

Our scalar user defined function returned a single value, but we could have complex code. It was akin to a stored procedure, with the difference that we could use the returned value within our SQL. The inline table valued function gave us the ability to return a result set that we could use within our FROM clauses. It was similar to a view (and could be used like a view, in that it could be updateable as well - depending on the SQL).

This type of function - multi statement table valued - declares a table variable, which we populate within the function and then return. We can put additional code within the function rather than a single SELECT statement, so we can return a complex result set and use it within our SQL. To do this without a function, you would create a table variable (or a temporary table) and execute a stored procedure to populate it.

I always like a problem to solve, rather than just showing some syntax. So I'd like a function that returns, for a particular employee (in Northwind again), the employee's formatted name, their manager's name and the number of people that report to the employee.

The first part - the employee name and manager name - can be found with this SQL
SELECT 
        E.EmployeeID, 
        E.TitleOfCourtesy + ' ' + E.FirstName + ' ' + E.LastName,
        M.TitleOfCourtesy + ' ' + M.FirstName + ' ' + M.LastName AS ManagerName
    FROM Employees E
    LEFT OUTER JOIN Employees M ON (E.ReportsTo = M.EmployeeID)
    WHERE E.EmployeeID=2
And then to get the number of reports, we can use this
SELECT COUNT(*)
    FROM Employees
    WHERE ReportsTo = 2
It is possible to do this as one query (with a sub-query), but for the sake of this example we are going to perform both SQL statements within this Multi Statement Table Valued User Defined function (try saying that after a few beers!).
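For reference, the single-query version might look something like this, with a correlated sub-query for the report count (we won't use it here)
SELECT 
        E.EmployeeID, 
        E.TitleOfCourtesy + ' ' + E.FirstName + ' ' + E.LastName AS EmployeeName,
        M.TitleOfCourtesy + ' ' + M.FirstName + ' ' + M.LastName AS ManagerName,
        (SELECT COUNT(*) FROM Employees R WHERE R.ReportsTo = E.EmployeeID) AS NumberOfReports
    FROM Employees E
    LEFT OUTER JOIN Employees M ON (E.ReportsTo = M.EmployeeID)
    WHERE E.EmployeeID = 2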

As a function this is what we would write
CREATE FUNCTION EmployeeInfo(@EmployeeID int)
RETURNS @EmployeeDetails TABLE
   (EmployeeID int,
    EmployeeName nvarchar(20),
    ManagerName nvarchar(20),
    NumberOfReports int
   )
AS
BEGIN
    INSERT INTO @EmployeeDetails
        SELECT 
            E.EmployeeID, 
            E.TitleOfCourtesy + ' ' + E.FirstName + ' ' + E.LastName,
            M.TitleOfCourtesy + ' ' + M.FirstName + ' ' + M.LastName AS ManagerName,
            NULL
        FROM Employees E
        LEFT OUTER JOIN Employees M ON (E.ReportsTo = M.EmployeeID)
        WHERE E.EmployeeID=@EmployeeID

    DECLARE @NumberOfReports int

    SELECT @NumberOfReports = COUNT(*)
    FROM Employees
    WHERE ReportsTo = @EmployeeID

    UPDATE @EmployeeDetails SET NumberOfReports = @NumberOfReports
    RETURN
END
This can be executed with
SELECT * FROM dbo.EmployeeInfo(5)
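And because it returns a table, the function can also be applied row by row with CROSS APPLY (which is where this series started) - for example, to get the details for every employee
SELECT EI.*
FROM Employees E
CROSS APPLY dbo.EmployeeInfo(E.EmployeeID) AS EI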

Wednesday, 18 September 2013

SQL: User Defined Functions - Inline Table Valued

The next step on our User Defined Functions journey (this didn't start as a series of articles, but a quick demo of the CROSS APPLY statement!) is inline table valued functions.

They are very like a view. So let's start with an example of a view. Again from Northwind, we will work with a view that returns for each order
  • Order ID
  • Customer ID
  • Customer Name
  • Order Total (which is calculated from the Order Details)
To package this as a view we will have this SQL
CREATE VIEW OrderSummary
AS 
SELECT 
   O.OrderID, C.CustomerID, C.CompanyName,
   CONVERT(money,SUM((OD.UnitPrice * OD.Quantity) * (1-OD.Discount))) AS OrderTotal
FROM Orders O
JOIN [Order Details] OD ON (O.OrderID = OD.OrderID)
JOIN Customers C ON (C.CustomerID = O.CustomerID)
GROUP BY 
 O.OrderID, C.CustomerID, C.CompanyName
Giving us the following output
OrderID     CustomerID CompanyName                              OrderTotal
----------- ---------- ---------------------------------------- ---------------------
10643       ALFKI      Alfreds Futterkiste                      814.50
10692       ALFKI      Alfreds Futterkiste                      878.00
10702       ALFKI      Alfreds Futterkiste                      330.00
...
10998       WOLZA      Wolski  Zajazd                           686.00
11044       WOLZA      Wolski  Zajazd                           591.60

(830 row(s) affected)
Thus we can use this within our SQL statements almost as if it is another table (and thus JOIN to it). If we want to restrict the output we use a WHERE clause, such as
SELECT * FROM OrderSummary
WHERE OrderTotal BETWEEN 140 and 145
ORDER BY OrderTotal
Which corresponds to these five rows
OrderID     CustomerID CompanyName                              OrderTotal
----------- ---------- ---------------------------------------- ---------------------
10809       WELLI      Wellington Importadora                   140.00
10349       SPLIR      Split Rail Beer & Ale                    141.60
10585       WELLI      Wellington Importadora                   142.50
10321       ISLAT      Island Trading                           144.00
10334       VICTE      Victuailles en stock                     144.80

(5 row(s) affected)
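And because the view behaves like a table, we can JOIN to it as well - for example, joining back to Orders to pick up the employee who took each order
SELECT OS.OrderID, OS.CompanyName, OS.OrderTotal, O.EmployeeID
FROM OrderSummary OS
JOIN Orders O ON (O.OrderID = OS.OrderID)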
Now to do the exact same thing with a function, we create one that returns a TABLE
CREATE FUNCTION fOrderSummary1()
RETURNS TABLE
AS
RETURN(
   SELECT 
      O.OrderID, C.CustomerID, C.CompanyName,
      CONVERT(money,SUM((OD.UnitPrice * OD.Quantity) * (1-OD.Discount))) AS OrderTotal
   FROM Orders O
   JOIN [Order Details] OD ON (O.OrderID = OD.OrderID)
   JOIN Customers C ON (C.CustomerID = O.CustomerID)
   GROUP BY 
      O.OrderID, C.CustomerID, C.CompanyName
)
And we can execute this, and restrict the rows with
SELECT * FROM dbo.fOrderSummary1()
WHERE OrderTotal BETWEEN 140 and 145
ORDER BY OrderTotal
With the same output. So nothing more than we had before. However, we can create a function that takes parameters. In this case we can pass the lower and upper bounds.
CREATE FUNCTION fOrderSummary2(@Lower int, @Upper int)
RETURNS TABLE
AS
RETURN(
   SELECT 
      O.OrderID, C.CustomerID, C.CompanyName,
      CONVERT(money,SUM((OD.UnitPrice * OD.Quantity) * (1-OD.Discount))) AS OrderTotal
   FROM Orders O
   JOIN [Order Details] OD ON (O.OrderID = OD.OrderID)
   JOIN Customers C ON (C.CustomerID = O.CustomerID)
   GROUP BY 
      O.OrderID, C.CustomerID, C.CompanyName
   HAVING CONVERT(money,SUM((OD.UnitPrice * OD.Quantity) * (1-OD.Discount))) 
          BETWEEN @Lower AND @Upper
)
Which we can now use as
SELECT * FROM dbo.fOrderSummary2(140, 145)
ORDER BY OrderTotal
Which removes the WHERE clause from the main SQL - and hides it away inside the function.

The rules for inline table-valued functions are similar to those for a view - if the SELECT statement is updateable, then the function will also be updateable.
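As a sketch of what that means (the function fCustomersInCountry below is made up purely for illustration): an inline function over a single table
CREATE FUNCTION fCustomersInCountry(@Country nvarchar(15))
RETURNS TABLE
AS
RETURN(
   SELECT CustomerID, CompanyName, ContactName, Country
   FROM Customers
   WHERE Country = @Country
)
should, like the equivalent view, accept updates through it - something along the lines of
UPDATE dbo.fCustomersInCountry('UK')
SET ContactName = UPPER(ContactName)
The OrderSummary functions above, with their GROUP BY and calculated column, are not updateable - just as the equivalent view isn't.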