Thursday, January 10, 2013

OBIEE and Linear Regression with Oracle DB


I needed to create some basic prediction with OBIEE and decided to use Linear Regression. Something like this:
 but with substantial data.
I wanted to check Oracle DB regression functions in OBIEE for some time, so now I have an excuse.

2 disclaimers:
1. To do prediction with Oracle DB, you should prefer Oracle Advanced Analytic (Data Mining). (Seehere for additional information).
2. If you are looking for general Linear Regression with OBIEE that is not DB dependent, you should read Kurt Wolff Statistical Analysis Using Linear Regression in OBIEE

If you are still with me...Lets start working.

In Oracle DB there is a set of Linear Regression Functions (here in 11g documentation, you'll find them in10g as well ). We will focus on 2 of them regr_slope and regr_intercept. They do exactly what their names implies. 
Basic version of both expect 2 numeric parameters and finds the line that can be described byY=a*X+b.
Unfortunately regr_slope(X,Y) is actually 1/a and regr_intercept(X,Y) is not b but rather the point on X where Y=0 (a*regr_intercept+b=0). So if you do some basic math you will find out that for Y=a*X+b using these 2 functions you get:
Y=(X-regr_intercept)/regr_slope(regr_slope=1/a and regr_intercept = -b/a)
   
To use this sort of functions in a OBIEE Analysis we will use EVALUATE_AGGR function in OBIEE. Unlike regular EVALUATE function, that allows us to run DB functions in OBIEE. The EVALUATE_AGGR does it for aggregation functions. It usually forced the DB to run the SQL query without it, and then run the aggregation function on top of the result.

Lets start with a basic Analysis. Mine will run on top of SH schema in sample Oracle DB.
We will complicate things, step by step.

So if we start with 2 columns Year (a number) and Amount_sold, the relevant functions are:
REGR_SLOPE(Year, Amount_Sold)
REGR_INTERCEPT(Year, Amount_sold)

and the relevant formulas in OBIEE (I removed the folder names for clarity):
EVALUATE_AGGR('regr_slope(%1,%2)' as DOUBLE, "Year", "Amount_Sold")
EVALUATE_AGGR('regr_intercept(%1,%2)' as DOUBLE, "Year", "Amount_Sold")

Change the Data Format in Column Properties for regr_slope so you can see several places after the Decimal Point or you will see a 0.
The formula that combines it all is (remember Y=(X-regr_intercept)/regr_slope):
("Year"-EVALUATE_AGGR('regr_intercept(%1,%2)' as DOUBLE, "Year", "Amount_Sold"))
/EVALUATE_AGGR('regr_slope(%1,%2)' as DOUBLE, "Year", "Amount_Sold") .

What do I do when I want "prediction" as well. The prediction is actually extending the regression line to future dates. So all I have to do is change the Business Model Diagram and make the join Outer Join (instead of Inner).
And the result is:



Now lets complicate things:
A. Add Calendar month
B. Add Chanel Class and require separate regression for each.

What is the problem with adding Calendar Month? We can't just add it to the Analysis, we have to add the Month to the regression functions. Since they are only 2 parameters we will add it to the Year. We need uniform spread of the month during the year. I prefer to add the following calculation for Year+Month calculation:
Year+(Month-1)/12.
There is one more problem. Month is an integer, we want to divide it by 12. To make it really work we need to turn it to dual, so the actual date value will be:
Year+cast(month-1 as double)/12

The regr_slope for example is now:
EVALUATE_AGGR('regr_slope(%1,%2)' as DOUBLE, "YEAR"+cast("MONTH_No"-1 as double)/12, "Amount_Sold")

And the complete function is the next 6 lines:
("YEAR"+cast("MONTH_No"-1 as double)/12-EVALUATE_AGGR('regr_intercept(%1,%2)' as DOUBLE, "YEAR"+cast("MONTH_No"-1 as double)/12, "Amount_Sold"))
/
EVALUATE_AGGR('regr_slope(%1,%2)' as DOUBLE, "YEAR"+cast("MONTH_No"-1 as double)/12, "Amount_Sold")


It's not that terrible. For clarity, lets replace "YEAR"+cast("MONTH_No"-1 as double)/12 with XX:
(XX-EVALUATE_AGGR('regr_intercept(%1,%2)' as DOUBLE, XX, "Amount_Sold"))
/
EVALUATE_AGGR('regr_slope(%1,%2)' as DOUBLE, XX, "Amount_Sold")

 So we have now:


Or in the Outer Join Version:


And now the last part, adding Channel Class. This is actually the most problematic part. Why? Because of the way EVALUATE_AGGR works.
When we add Channel Class to the Analysis, we actually want a separate line for each Class. So the relevant solution is running the regression functions with "over (partition by CLASS)" extension. Unfortunately when doing it with our EVALUATE_AGGR we get a Cartesian Product of the result sets. In plain English - OBIEE creates an SQL that doesn't know how to relate the regression result to the relevant Class and for each class returns all the combinations. I tried everything I could (including this

http://gerardnico.com/wiki/dat/obiee/vertical_fragmentation_sql), but in vain. If you can do better, please let me know.

Luckily for me, in this case the regular Evaluate works as well.
So I created the following calculation:
("Year"-EVALUATE('regr_intercept(%1,%2) over (partition by %3)' as DOUBLE, "Year", "Amount","CHANNEL_CLASS"))
/EVALUATE('regr_slope(%1,%2) over (partition by %3)' as DOUBLE, "Year", "Amount","CHANNEL_CLASS")

And it worked: