Leveraging Netezza’s in-database analytic capabilities can significantly reduce the time required to execute SPSS streams. By pushing the analytics to the data, we eliminate the need to pull the data out of the table and onto the SPSS server, where execution would otherwise take place. Without that option, we’re constantly compromising on how much data to analyze, knowing that much of the run time is spent simply moving data across the network.
Using K-Means as an example, I ran a test against the Netezza-provided census income demo data. I folded it over a few times to balloon the table to roughly 12.8M records. That isn’t a huge number of individuals to cluster, but it is enough to illustrate the value of in-database modeling.
Client environment: Windows 7 64-bit, IBM SPSS Modeler 14.2 FP2, 8 GB of RAM and a quad-core Intel i5 2.67 GHz CPU, connected via VPN
Database environment: IBM Netezza 1000-12, 6.0.5 P5
Table details: 12,769,472 unique individual records containing income & demographic information
First things first: for the SPSS K-Means model to work, we have to read the data so that the columns are properly recognized and thus usable.
This step alone — reading the data — took 700 seconds!
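To put that number in perspective, here is a quick back-of-the-envelope calculation of the effective read rate. It uses only the two figures quoted in this post (the row count from the table details and the 700-second read time); the variable names are just for illustration.

```python
# Effective read throughput, using the figures quoted in this post.
rows = 12_769_472      # records in the test table
read_seconds = 700     # time SPSS took to read the data

rows_per_second = rows / read_seconds
print(f"{rows_per_second:,.0f} rows per second")
```

That works out to a little over 18,000 rows per second, a rate that reflects the client, the VPN and the network together rather than the database alone.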
Next, we add the K-Means node to the stream and configure it to create 10 clusters using a maximum of 5 iterations. This is done by adjusting the cluster and iteration settings on the Model and Expert tabs.
When ready, click Run. You’ll notice that once again we have to read through all of the records, all 12.8M of them. Only once that is complete can processing begin. All told, it took more than 32 minutes to read through the records and segment them into 10 clusters using 5 iterations.
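For readers who want to see what "10 clusters, maximum of 5 iterations" means in practice, below is a minimal sketch of the textbook K-Means loop (Lloyd's algorithm) in plain Python. It is not the SPSS or Netezza implementation, whose initialization and distance handling may differ; the function name, the seed parameter and the toy data are all made up for illustration.

```python
import random

def kmeans(points, k, max_iter, seed=0):
    """Illustrative Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k points chosen at random from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing before max_iter was reached
        labels = new_labels
    return centroids, labels

# Toy data: 150 two-dimensional points around three made-up centers.
random.seed(1)
data = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]
centroids, labels = kmeans(data, k=10, max_iter=5)
print(len(centroids), len(labels))
```

Note that every iteration makes a full pass over the data in the assignment step, which is why where the data lives matters so much for run time.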
And now the Netezza in-database model. First, we’ll review the fields to ensure that the ID is properly recognized and that all other fields in the table are inputs. Please note that we don’t have to read the data first. Since the data is in-database, SPSS doesn’t really need to understand what the fields are or how they will be used.
One difference between Netezza’s in-database K-Means and the SPSS version is that Netezza stores its results in a table. For this reason, you’ll need to specify a table name to hold the resulting cluster summary.
Next we indicate that we’d like 10 clusters and a maximum of 5 iterations, just as we did with the SPSS version of K-Means. Once done, click Run and watch the clock.
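The settings in the dialog boil down to a handful of values: an input table, an ID column, an output table, the number of clusters and the iteration cap. As a rough sketch of how those might map onto a hand-written call, the snippet below assembles a statement for the Netezza Analytics K-Means stored procedure. This is an assumption-laden illustration: the procedure name `nza..KMEANS` and the parameter names are as I understand that interface and should be checked against the documentation for your installed version, the table, column and model names are hypothetical, and this is not necessarily the SQL that SPSS itself generates.

```python
def build_kmeans_call(intable, outtable, id_column, model, k, maxiter):
    """Assemble a CALL statement for an in-database K-Means run.

    Assumption: the procedure is exposed as nza..KMEANS and accepts
    'name=value' pairs in a single string; verify the parameter names
    against your Netezza Analytics documentation before using this.
    """
    params = {
        "model": model,        # name under which the model is stored
        "intable": intable,    # table holding the records to cluster
        "outtable": outtable,  # table that receives the cluster results
        "id": id_column,       # column that uniquely identifies each record
        "k": k,                # number of clusters
        "maxiter": maxiter,    # maximum number of iterations
    }
    param_string = ", ".join(f"{name}={value}" for name, value in params.items())
    return f"CALL nza..KMEANS('{param_string}');"

# Hypothetical names; 10 clusters and 5 iterations as in this post.
statement = build_kmeans_call("CENSUS_INCOME", "CENSUS_KM_OUT", "ID", "CENSUS_KM", 10, 5)
print(statement)
```

The point of the sketch is simply that nothing but a short statement crosses the network; the records themselves never leave the appliance.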
In the screenshot below we can see the entire process took 58 seconds to complete (59 if you round up).
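Putting the two timings side by side gives a rough speedup figure. The calculation below uses only the numbers reported above; since the SPSS run was recorded as "32m+", treating it as exactly 32 minutes makes the result a lower bound.

```python
# Rough speedup, using the timings reported in this post.
spss_seconds = 32 * 60   # SPSS run took 32+ minutes, so this is a lower bound
netezza_seconds = 58     # in-database run

speedup = spss_seconds / netezza_seconds
print(f"at least {speedup:.0f}x faster")
```

In other words, the in-database run was at least 33 times faster on this test, on a single table and a single configuration.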