Sunday 11 November 2012

Big Data Analytics - innaproriate analysis

I thought I wasn't going to rant about this again for a while but three encounters this week have fanned the flames again.

I don't know how many Twitter and LinkedIn posts I have made on Big Data + Analytics over recent months but its definitely an area on an increasing trend in the business world. However, the reality is most of the business world struggles to extract any meaningful  'knowledge' from all of the 'data' that is 'collated' from business activities.

Why is that - because the main analysis tools used are spreadsheets - an in particular - Excel. Now don't get me wrong Excel is a phenomenal software package - but in my view in some instances it is being used to produce models that are way outside of its domain of appropriateness.

What do I mean by that? Well - three events this week have highlighted the tip of the iceberg for me. All of these are being addressed, I hasten to add, but I don't think I am alone in my experiences.

1 The first was when I was sat in a meeting looking at the projected image of some analysis using Excel, upon which we were making decisions that affected the direction of the business. One of a myriad of cells was being concentrated on - and the value in the cell was 'zero'. Everyone in the room knew that wasn't right so we all sat there for 5 minutes discussing why this was so. Now this could have been a simple mistake somewhere on one of the supporting sheets but the effect it had was to throw the whole of the analysis into question. How could we then believe any of the other numbers. Therein lies the first 'rant fact' - it is difficult to manage traceability in these sorts of tools.

2 The second was when I was asked to comment and add to a sheet for some supporting data input into a model. Someone was collating data to help build up a spreadsheet model and was emailing around for changes to the data sheet. Of course no one person holds all of this in their head so people were passing on the sheet for refinement. The version that came to me for input was 'blah filename - Copy - Copy - Copy'. Therein lies the second 'rant fact' - if not part of some wider process, configuration and version control can get out of hand pretty quickly.

3 The third and for me the most serious came from checking through to try and understand a part of a model that didn't appear to be functioning as expected (see 'rant fact' 1). When I looked into the sheets in question - without even going into any of the equation set being used - I found one sheet with 100 columns and 600 rows of manually input driven data entries - that's 60,000 places for making an input error on that sheet alone and there were more sheets!  Therein lies the third 'rant fact' - data quality within this environment is difficult to control.

The issue is that Excel in particular is so easy to fire up and start bashing away at, that we forget that we are in some cases building complex calculation engines. In some instances these engines are not using any 'design' process at all. There is no overarching systems design process and even at a simplistic level there is no recognition of fundamental modelling techniques that would improve modelling and therefore output quality, namely, consideration of the following;

1 Functional model development - what is the sheet set up to do - even a simple flowchart would help never mind some functional breakdown of the calculation set.

2 Data model development - what data, where from, what format, type views to force thinking about quality control of data, a database maybe!

3 Physical model of the hardware - how does the overall 'system', including data input, connect, store and process the information.  Maybe using email and collating feedback on a laptop is not the best system configuration.

All these activities add time and cost to model development and because their results are intangible and difficult to measure can get left out in the rush to get the answer out. However, the question is, would you put your own money at risk on the basis of this 'answer'?

What is the solution? Well certainly don't let learner drivers loose in the F1 racing car for a start - but there must also be some way of providing an easily accessed development environment that can   be used to translate formula into readable and understandable code - formula translation - now that could catch on (sorry couldn't resist!).

No comments:

Post a Comment