Exploratory Data Analysis

It doesn’t take much to trigger me into a rant about the weaknesses of reports on data and “dashboards” purporting to be “analytics” or “business intelligence”. Lots of pie charts and line graphs with added bling are as the proverbial red rag to a bull.

Until recently my response was to demand more rigorous statistics: hypothesis testing, confidence limits, tests for reverse causality (while recognising that causality is a slippery concept in complex systems). Having recently spent some time thinking about using data analysis to gain actionable insights, particularly in the setting of an educational institution, I have come to see that this response is too shallow. It embeds an assumption of a linear process: ask a question, operationalise it in terms of data and statistics, and crunch some numbers. As my previous post indicates, I don’t suppose all questions are approachable. Actually, thinking back to the ways I’ve done a little text and data mining in the past, it wasn’t quite like this either.

The label “exploratory data analysis” captures the antithesis to the linear process. It was popularised in statistical circles by John W Tukey in the early 1960s, and he used it as the title of a highly influential book. Tukey was trying to challenge a statistical community that was very focused on hypothesis testing and other forms of “confirmatory data analysis”. He argued that statisticians should do both, approaching data with flexibility and an open frame of mind, and he saw a well-stocked toolkit of graphical methods as essential for exploration (Tukey invented a number of plot types that are now widely used).

Tukey read a paper entitled “The Technical Tools of Statistics” at the 125th Anniversary Meeting of the American Statistical Association in 1964. It anticipated the development of computational tools (e.g. R and RapidMiner), is well worth a read, and has timeless gems like:

“Some of my friends felt that I should be very explicit in warning you of how much time and money can be wasted on computing, how much clarity and insight can be lost in great stacks of computer output. In fact, I ask you to remember only two points:

  1. The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful.
  2. Most uses of the classical tools of statistics have been, are, and will be, made by those who know not what they do.”

There is a correspondence between the open-minded and flexible approach to exploratory data analysis that Tukey advocated and the Grounded Theory (GT) Method of the social sciences. To me, as a non-social scientist, GT seems to be trying a bit too hard to be a Methodology (academic disputes and all), but the premise is appealing: use both inductive and deductive reasoning, and go into a research question free of the prejudice of a hypothesis that you intend to test (prove? how often is data analysed to find a justification for a prejudice?).

Although GT is really focussed on qualitative research, some of the practical methods that the GT originators and practitioners have proposed might be applicable to data captured in IT systems and useful to practitioners of analytics. I quite like the dictum of “no talk” (see the Wikipedia entry for an explanation).

My take home, then, is something like: if we are serious about analytics we need to be thinking about both exploratory data analysis and confirmatory data analysis, and the label “analytics” is certainly inappropriate if neither is occurring. For exploratory data analysis we need: visualisation tools, an open mind and an inquisitive nature.

4 thoughts on “Exploratory Data Analysis”

  1. Is provenance also an issue? If not all data is equal, and some sources are more authoritative than others… Although maybe that belongs to a larger issue of data quality and assigning probabilities/weights/degrees of confidence?

    I was once on a work placement as a database administrator where data from event attendee registration cards was mixed with more reliable data sources, and provenance mattered. Possibly even more so in the case of linked data.

  2. Tavis-
    I think it is a big issue. And I think “quality” can cover a multitude of sins, so in an ideal world you’d want to know and carry forward the information about the context of collection, through the storage, …. It’s not just typos and missing data (or what was done to accommodate missing data, a.k.a. “imputation”); it’s easy to lose track of what data means. Ask me what I earned last year. Ask my employer. Check my bank account… Ask HMRC. How many different answers do you want?

    As for some of the linked data out there: far too much scraped HTML and guesswork concealed inside very-authoritative-looking RDF!

    “Not all data is equal…” Definitely.

    Cheers, Adam

  3. Bravo – I really agree with the spirit of this post!

    I like the way Tony Hirst talks about allowing the data to tell its story. Such data narratives may contain ‘secrets’ that can be more readily opened up through a variety of visualisation tools.

    The key is not just being open to such approaches but also preserving enough data to support them, as opposed to aggregating or otherwise limiting the data preserved to that which aligns with preconceived lines of enquiry.
