PLEASE NOTE: This article is outdated and does not reflect changes made in IBM Netezza Analytics V2
IBM Netezza’s analytics package became available earlier this year. The documentation on it is okay but I think they went a little light on the examples that each vertical they support can build upon. For retail, probably the most common analytic challenge we all face is market basket analysis.
Finding product affinities is very important because we can use that data to identify up-sell or cross-sell opportunities. The classic example is diapers and beer. If these items are frequently purchased together, then what happens if you put a higher margin beer near the diapers? Well, you hope that someone who was going to buy beer anyways purchases the higher margin item. And for those that didn’t plan on buying beer, perhaps the better product placement will entice them to.
The association part isn’t the challenge so much as the volume of data we now have and the number of item sets people want to analyze. SQL is great and finding how many times item 1 and item 2 appear in the same basket — but what about item 1, 2, 3, 4 & 5? Of the 5 items, which represents the best opportunity for up-sell and which for cross-sell?
Luckily, the analytics function fpgrowth can help. Like all of IBM Netezza’s analytic functions, they are presented as stored procedures that take a number of input parameters and output a table that you can then use as the basis of your analysis. Using the TPC-DS data set, an example of how you might run this appears below. This will create a series of tables all having the prefix OCT05_MBA for any product associations having a minimum absolute support of 500 transactions, using SS_TICKET_NUMBER as the transaction ID and SS_ITEM_SK as the item ID.
TPCDS(ADMIN)=> call nza..fpgrowth('intable=store_sales,tid=ss_ticket_number,item=ss_item_sk,supportType=absolute,support=500,pfx=OCT05_MBA'); NOTICE: RUNNING FPGrowth algorithm: DATASET : "STORE_SALES" Transaction column : "SS_TICKET_NUMBER" Item column : "SS_ITEM_SK" Group by : <none> Minimum support : 500 transactions Min frequent itemset size : 0 Max frequent itemset size : 1000000000 Level of conditional dbs : 1 Result tables prefix : "OCT05_MBA_FP_"
FPGROWTH ---------- 22123 (1 row)
The PFX parameter instructs the procedure to create tables having that prefix. A quick search will show us the item set ranges it identified:
TPCDS(ADMIN)=> \dt OCT05_MBA_FP List of relations Name | Type | Owner -------------------+-------+-------- OCT05_MBA_FP_SET1 | TABLE | USER1 OCT05_MBA_FP_SET2 | TABLE | USER1 OCT05_MBA_FP_SET3 | TABLE | USER1 (3 rows)
SET1 contains the support for item sets containing a single item; SET2 contains two item set and SET3 contains three item sets. You can use these together to identify the confidence for a particular set.
The tables all look very similar — with the exception of the number of item columns. Using SET3:
TPCDS(ADMIN)=> select * from OCT05_MBA_FP_SET3 order by sup desc limit 5; ITEM1 | ITEM2 | ITEM3 | SUP | GRP -------+-------+-------+-----+----- 80413 | 15877 | 47005 | 548 | 0 59137 | 10675 | 52789 | 538 | 0 63439 | 11491 | 27931 | 537 | 0 33211 | 58615 | 54007 | 534 | 0 76705 | 92431 | 51703 | 534 | 0
The “SUP” column is the number of transactions that contained all three of these products. To identify the confidence, you’d need to query the SET2 table for two of the items in the set and compare. For example:
TPCDS(ADMIN)=> select * from OCT05_MBA_FP_SET2 where (item1,item2) = ( 80413,47005); ITEM1 | ITEM2 | SUP | GRP -------+-------+-----+----- 80413 | 47005 | 592 | 0
So the three products appeared in the same transaction 548 times out of a possible 592 baskets.
Some other options that you might consider using are:
This will limit the analysis to a fixed item set size, as opposed to exploring every existing item set size.
supportType=[ absolute | percent ]
Here you can specify whether or not you want the support value you specify to be absolute (number of transactions containing the item set) or as a percentage of all transactions.
by = COLUMN NAME
This lets you identify associations within a particular group. For example, you could do a store-level market basket analysis with:
TPCDS(ADMIN)=> call nza..fpgrowth('intable=store_sales,tid=ss_ticket_number,item=ss_item_sk,supportType=absolute,support=500,pfx=OCT05_MBA,by=SS_STORE_SK');
This will then populate the GRP column of the resulting set tables with the actual value. For example:
ITEM1 | ITEM2 | SUP | GRP -------+--------+-----+----- 56491 | 25807 | 11 | 4 56491 | 73903 | 13 | 4 56491 | 90613 | 12 | 4 38275 | 100765 | 12 | 4
These affinities exist within store #4. You could use this to compare the dominant sets at the store (region, district, brand) levels.