Market Basket Analysis with IBM Netezza Analytics

PLEASE NOTE: This article is outdated and does not reflect changes made in IBM Netezza Analytics V2

 

IBM Netezza’s analytics package became available earlier this year.  The documentation on it is okay but I think they went a little light on the examples that each vertical they support can build upon.  For retail, probably the most common analytic challenge we all face is market basket analysis.

Finding product affinities is very important because we can use that data to identify up-sell or cross-sell opportunities.  The classic example is diapers and beer.  If these items are frequently purchased together, then what happens if you put a higher margin beer near the diapers?  Well, you hope that someone who was going to buy beer anyways purchases the higher margin item.  And for those that didn’t plan on buying beer, perhaps the better product placement will entice them to.

The association part isn’t the challenge so much as the volume of data we now have and the number of item sets people want to analyze.  SQL is great and finding how many times item 1 and item 2 appear in the same basket — but what about item 1, 2, 3, 4 & 5?  Of the 5 items, which represents the best opportunity for up-sell and which for cross-sell?

Luckily, the analytics function fpgrowth can help.  Like all of IBM Netezza’s analytic functions, they are presented as stored procedures that take a  number of input parameters and output a table that you can then use as the basis of your analysis.  Using the TPC-DS data set, an example of how you might run this appears below.  This will create a series of tables all having the prefix OCT05_MBA for any product associations having a minimum absolute support of 500 transactions, using SS_TICKET_NUMBER as the transaction ID and SS_ITEM_SK as the item ID.

TPCDS(ADMIN)=> call nza..fpgrowth('intable=store_sales,tid=ss_ticket_number,item=ss_item_sk,supportType=absolute,support=500,pfx=OCT05_MBA');
 NOTICE:
 RUNNING FPGrowth algorithm:
 DATASET : "STORE_SALES"
 Transaction column : "SS_TICKET_NUMBER"
 Item column : "SS_ITEM_SK"
 Group by : <none>
 Minimum support : 500  transactions
 Min frequent itemset size : 0
 Max frequent itemset size : 1000000000
 Level of conditional dbs : 1
 Result tables prefix : "OCT05_MBA_FP_"
FPGROWTH
 ----------
 22123
 (1 row)

The PFX parameter instructs the procedure to create tables having that prefix.  A quick search will show us the item set ranges it identified:

TPCDS(ADMIN)=> \dt OCT05_MBA_FP
 List of relations
 Name        | Type  | Owner
 -------------------+-------+--------
 OCT05_MBA_FP_SET1 | TABLE | USER1
 OCT05_MBA_FP_SET2 | TABLE | USER1
 OCT05_MBA_FP_SET3 | TABLE | USER1
 (3 rows)

SET1 contains the support for item sets containing a single item; SET2 contains two item set and SET3 contains three item sets.  You can use these together to identify the confidence for a particular set.

The tables all look very similar — with the exception of the number of item columns.  Using SET3:

TPCDS(ADMIN)=> select * from OCT05_MBA_FP_SET3 order by sup desc limit 5;
 ITEM1 | ITEM2 | ITEM3 | SUP | GRP
 -------+-------+-------+-----+-----
 80413 | 15877 | 47005 | 548 |   0
 59137 | 10675 | 52789 | 538 |   0
 63439 | 11491 | 27931 | 537 |   0
 33211 | 58615 | 54007 | 534 |   0
 76705 | 92431 | 51703 | 534 |   0

The “SUP” column is the number of transactions that contained all three of these products.  To identify the confidence, you’d need to query the SET2 table for two of the items in the set and compare.  For example:

TPCDS(ADMIN)=> select * from OCT05_MBA_FP_SET2 where (item1,item2) = ( 80413,47005);
 ITEM1 | ITEM2 | SUP | GRP
 -------+-------+-----+-----
 80413 | 47005 | 592 |   0

So the three products appeared in the same transaction 548 times out of a possible 592 baskets.

Some other options that you might consider using are:

maxsetsize=INTEGER

This will limit the analysis to a fixed item set size, as opposed to exploring every existing item set size.

supportType=[ absolute | percent ]

Here you can specify whether or not you want the support value you specify to be absolute (number of transactions containing the item set) or as a percentage of all transactions.

by = COLUMN NAME

This lets you identify associations within a particular group.  For example, you could do a store-level market basket analysis with:

TPCDS(ADMIN)=> call nza..fpgrowth('intable=store_sales,tid=ss_ticket_number,item=ss_item_sk,supportType=absolute,support=500,pfx=OCT05_MBA,by=SS_STORE_SK');

This will then populate the GRP column of the resulting set tables with the actual value.  For example:

ITEM1 | ITEM2  | SUP | GRP
 -------+--------+-----+-----
 56491 |  25807 |  11 |   4
 56491 |  73903 |  13 |   4
 56491 |  90613 |  12 |   4
 38275 | 100765 |  12 |   4

These affinities exist within store #4.  You could use this to compare the dominant sets at the store (region, district, brand) levels.

Advertisements
This entry was posted in analytics and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s