Super Nerdy: K-Means Clustering Of Distillery Profiles

Add it to the “terroir isn’t a thing in scotch/regions are meaningless” pile. Over at a big data blog, Luba Gloukhov did a k-means clustering of 86 whiskies.

What’s that, you ask?

K-means clustering is a technique for analyzing datasets where the goal is to sort items into a chosen number of groups based on mathematically calculated distances between their attributes. Essentially, the program runs through the data set and figures out which elements are most alike.
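For the curious, here is a minimal sketch of the idea in plain Python. It is not the code from the original analysis; the three flavor dimensions and the six score rows are made up purely for illustration, and the helper names are my own.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two flavor profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a list of profiles."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means: assign to nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen profiles
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical (smoky, sweet, floral) scores on a 0-4 scale.
profiles = [
    (4, 1, 0), (3, 1, 0), (4, 0, 1),  # smoky-leaning
    (0, 3, 3), (1, 4, 2), (0, 3, 4),  # sweet and floral
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The first three profiles end up sharing one label and the last three the other. Which group is called 0 and which is called 1 depends on the random starting points.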

This work is more accessible and understandable in the form of David Wishart’s Whisky Classified, which grouped several distilleries together by flavor profile. While Wishart’s is a great effort and certainly one of the best introductions to the concept, the challenge is the number of data points.

It’d be interesting to see this approach applied to larger, more frequently updated data sets such as the Malt Maniacs’ data, though what is missing from most of these is an agreed-upon set of flavor variables that can be scored.

If the concept is over your head or you don’t dig reading code samples, just look at the map plots, which show a pretty scattershot distribution across Scotland by flavor. There’s obviously a cluster in Speyside, but it’d be more useful to do a zoomed-in view there.

There’s certainly a lot more fun data mining to be done here, but it’s an interesting start.

3 thoughts on “Super Nerdy: K-Means Clustering Of Distillery Profiles”

  1. This is an interesting effort. However, it is based on data from “Whisky Classified” by David Wishart, and that presents several problems for the analysis:
    1. The dataset is from an older edition of the book, I would guess; it’s impossible to tell from the chain of references. The bottlings are old and about half of them are not the current standard editions, e.g. Longmorn 15yo, Glenkinchie 10yo, old Glen Ord 12yo (not Singleton), Dufftown 15yo Flora & Fauna (not Singleton), etc. Maybe they have been updated in the latest (2012) edition of the book, but not in this dataset.
    2. Wishart rates individual bottles/expressions rather than a distillery. So you see where a certain Aberlour 10yo and a certain Macallan 10yo cluster, but this may not be representative of the distillery output.
    This is especially true of ex-sherry expressions, such as 27-yo Balmenach Highland Selection Limited Edition, or a 16yo Dailuaine Flora & Fauna.
    3. There are some typos and data errors – some distillery names are misspelled, so I’d guess the same QC issues apply to the data. My spot checks against the book found only occasional errors, such as Bowmore, which got Medicinal=1 in the dataset instead of the 2 it has in the book – in case you wonder why it didn’t group with the other Islays, but more on this later.
    Clynelish is another special case. It grouped with the Islay cluster, which at first was mind boggling. I compared the Clynelish entries and they were nothing like in the book (where the OB 14yo was used). There was no entry swap, I checked. My conclusion is that this Clynelish was in fact a Brora, rated for an even earlier edition than my own 2006 book.
    4. Technical aspects of the analysis: the 12 dimensions are treated as equal – but should they be? “Tobacco” seems an unnecessary distraction, with only 10 whiskies getting 1 point and the other 76 getting 0 points on the 0-4 scale. Scaling is not only unnecessary but misleading, since all 12 dimensions are already evaluated on a common 0-4 scale.
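
    To illustrate the scaling point, here is a rough sketch. The Tobacco column uses the counts mentioned above (10 whiskies at 1, 76 at 0); the "smoky" column is invented to stand in for a dimension that uses the whole 0-4 range, and I assume z-score scaling with the population standard deviation, which may differ slightly from what the original code does.

```python
from statistics import mean, pstdev

# Tobacco as described above: 10 whiskies score 1, the other 76 score 0.
tobacco = [1] * 10 + [0] * 76
# Hypothetical dimension that spreads evenly over the full 0-4 range.
smoky = [i % 5 for i in range(86)]

def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

tobacco_z = standardize(tobacco)
smoky_z = standardize(smoky)

# How far apart do a raw score of 1 and a raw score of 0 sit after scaling?
tobacco_gap = tobacco_z[0] - tobacco_z[-1]  # first entry is 1, last entry is 0
smoky_gap = smoky_z[1] - smoky_z[0]         # entries are 1 and 0

print(f"one Tobacco point after scaling: {tobacco_gap:.2f}")
print(f"one Smoky point after scaling:   {smoky_gap:.2f}")
```

    Under these assumptions a single Tobacco point is stretched to roughly 3.1 units while a single point on the made-up smoky column shrinks to roughly 0.7, so the rarely used dimension ends up pulling much harder on the distances than it did on the raw 0-4 scale.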

    Now on to the ratings. The 4-cluster solution in the original analysis is meaningful, but it doesn’t really group distilleries. The four clusters are: 1. Peaty (smoky+medicinal); 2. Ex-sherry (sweet+honey+winey); 3. Ex-bourbon (floral+sweet); 4. Non-Islay Peaty (smoky+honey+medicinal), i.e. the likes of Springbank, Glen Scotia, Fettercairn, Glen Garioch, Ardmore, Bowmore (due to the data entry error), Highland Park, Old Pulteney, Bruichladdich.

    So I did learn something interesting – namely, that Highland Park used to be peatier with less sherry character, and Old Pulteney and Glen Garioch used to be peatier than they are today.

    I also redid the analysis with the corrected Bowmore entry – the results are now quite different. The 4-cluster solution gives:
    1. Peaty (smoky+medicinal, not sweet, n=12) – the previous peaty cluster (usual suspects) plus Bowmore, Glen Scotia, Old Pulteney, Springbank, Isle of Jura, Oban.
    2. Ex-sherry (winey, big body, n=11) – Aberlour, Dalmore, Glendronach, Glenfarclas, Macallan, Mortlach, but also the particular expressions of whiskies we wouldn’t naturally add to this list: Balmenach, Dailuaine, Glendullan, Royal Lochnagar, Strathisla.
    3. Ex-bourbon, no peat (floral, fruity, not winey, less body, n=35) – everything from AnCnoc, Arran, to Bunnahabhain (?), Cragganmore, Benriach, Glenfiddich, Glen Elgin, Glenmorangie, Speyburn, Tobermory and Tullibardine – quite the mixed bag.
    4. Whiskies with some ex-sherry blended in or with some peat (malty, somewhat floral, somewhat winey, smokier, n=35) – another mixed bag, from Aberfeldy and Balvenie to Edradour, Glenlivet, Glenrothes, Highland Park, Knockando, Longmorn, Tomatin, etc.

    A solution with 5 clusters is also quite nice, separating the hard-core peated whiskies from the less-peated ones (Ardmore, Bowmore, Bruichladdich, Highland Park, Springbank, etc.), then the heavy ex-sherry cluster (n=8 including Macallan, Mortlach, Dalmore and Glendronach), the non-sherry, and the some-sherry clusters.

    In any case, you get the sense of the strengths and the limitations of this exercise.

    1. Fantastic analysis, Florin. Thanks so much for sharing! I’ve been playing around with this stuff in a very lightweight way in my work but I can see you’ve got a substantial leg up on it.

      My impressions were similar to yours: it’s an interesting if flawed effort, but it could be a fun stepping stone to something more meaningful.

      The Whisky Classified dataset is an interesting one, though as you point out, it’s got some problems with quality and it’s pretty stale with regard to bottlings.

      I agree in general that a four-cluster grouping is limiting precisely because of the peating levels. Five might be a sweet spot. I wonder what would happen if you popped out a few more; I suspect it’d quickly devolve into nothing very meaningful. Wishart IIRC used, what, ten clusters in his book?

      I think perhaps the more achievable goal would be to take the Malt Maniacs dataset, essentially calculate your similarity to the scorers there, and use that as the basis for a recommendation engine. Of course, the challenge there is actually having access to many of the whiskies they have….
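
      A toy sketch of what I mean, with everything invented: the reviewer names, whisky names and scores are placeholders rather than real Malt Maniacs data, and mean absolute difference is just one simple choice of similarity measure.

```python
def mean_abs_diff(mine, theirs):
    """Average score gap over the whiskies both people have rated."""
    shared = mine.keys() & theirs.keys()
    if not shared:
        return None  # nothing in common, so no basis for comparison
    return sum(abs(mine[w] - theirs[w]) for w in shared) / len(shared)

def recommend(mine, reviewers, top_n=3):
    """Find the closest reviewer and list their top whiskies I have not rated."""
    distances = {}
    for name, scores in reviewers.items():
        d = mean_abs_diff(mine, scores)
        if d is not None:
            distances[name] = d
    closest = min(distances, key=distances.get)
    unseen = {w: s for w, s in reviewers[closest].items() if w not in mine}
    picks = sorted(unseen, key=unseen.get, reverse=True)[:top_n]
    return closest, picks

# Hypothetical scores on a 100-point scale.
my_scores = {"whisky_a": 88, "whisky_b": 72, "whisky_c": 91}
reviewers = {
    "reviewer_1": {"whisky_a": 86, "whisky_b": 75, "whisky_c": 90,
                   "whisky_d": 93, "whisky_e": 70},
    "reviewer_2": {"whisky_a": 70, "whisky_b": 90, "whisky_c": 65,
                   "whisky_d": 60, "whisky_e": 92},
}
closest, picks = recommend(my_scores, reviewers)
print(closest, picks)
```

      Here reviewer_1 tracks my made-up scores most closely, so that reviewer's highest-rated whisky I have not tried comes out on top.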

      Anyway, a fun thing to consider at the end of the week. Thanks again for your input, it’s a good set of criticisms against the first effort here. I get the impression that Luba isn’t necessarily as malt-experienced so I suspect some of these points may have skated by (as they would if you’re not a huge nerd for this.)

      1. A Malt Maniacs analysis is an interesting idea. MAO and I are toying with the idea at the moment, although from a different perspective – i.e., finding the characteristics that determine a whisky score. The challenge in analyzing Malt Maniacs data from the perspective of clustering is that they don’t score separate components of flavor, like Wishart did. One could in principle take the verbal reviews and mine them for occurrences of meaningful words (e.g., malt, breakfast, cereals, or sweet, dessert, raisins, sultanas, etc.). I have a vague recollection that someone already did such an exercise and even published it some 5 years back, but I can’t find a reference.
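
        The word-mining idea could be sketched roughly as follows. The grouping of the example words into "malty" and "sweet" and the sample review text are my own placeholders, not taken from any real review, and a real attempt would need a far larger vocabulary.

```python
import re
from collections import Counter

# Hypothetical mapping from a flavor dimension to the words that signal it.
FLAVOR_WORDS = {
    "malty": {"malt", "breakfast", "cereals"},
    "sweet": {"sweet", "dessert", "raisins", "sultanas"},
}

def flavor_counts(review):
    """Count how often each flavor's keywords occur in a review."""
    tokens = Counter(re.findall(r"[a-z]+", review.lower()))
    return {
        flavor: sum(tokens[word] for word in words)
        for flavor, words in FLAVOR_WORDS.items()
    }

# Made-up tasting note for illustration.
sample_review = (
    "Malt and breakfast cereals on the nose, "
    "then sweet raisins and more raisins."
)
counts = flavor_counts(sample_review)
print(counts)
```

        The resulting counts could then stand in for the per-dimension scores that the reviews themselves lack, though raw word counts would probably need some normalizing for review length.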

        I forgot that Wishart actually did the clustering; I will redo the analysis to see if I recover his clusters or something similar. I’m curious whether he updated the bottles in the 2012 version, but not enough to spend the $20 for the new edition. I’m not crazy about this book, but it did come in handy twice in the last two weeks, so maybe I will get it. Inputting/updating the dataset will not be difficult, probably 1/2h of work. (If he only updated the photos I’ll just return the book!)

        I should say that the only reason I could redo Luba’s analysis is that, to Luba’s credit, the code was published – this way my effort was minimal.
