Generating TPC-DS database and data in Netezza

“”The TPC-DS benchmark models the decision support system of a retail product supplier, including queries and data maintenance.”

via TPC-DS.

Working in retail, I’ve always wondered if a very generic data set existed that I could use on development systems just to prove some theories out — without worrying about PCI, P, PI (etc, etc, etc) security requirements.

I came across the TPC-DS (DS stands for decision support) and it includes a set of utilities to generate the data as well as a data model that you can easily create.

  1. Download this package:
  2. Copy the package to your Netezza’s Linux host
  3. mkdir /sandbox/tpcds_data; unzip -d /sandbox/tpcds_data
  4. cd /sandbox/tpcds_data; make -f Makefile.suite

You will now have the necessary binaries in the /sandbox/tpcds_data directory.

Next, create a database on your Netezza system and run the file tpcds.sql.  I’d suggest the feature Enable Random Table Distribute — which will randomly distribute any table not having an explicit distribute on clause (as opposed to selecting the first column or the primary key).  This can be done by editing tpcds.sql and adding:

set enable_random_table_distribute = true;

Once that has been done:

nzsql <<EOF
  create database tpcds_data;
  \c tpcds_data
  \i tpcds.sql

You’ll get a few warnings that primary key constraints aren’t enforced.  Once completed, you’ll have an empty shell of a database to load data into.

Now, to generate the data you’ll need the binary dsdgen — found in the /sandbox/tpcds_data directory.  You can specify how large of a database you want to create very easily with the ‘-scale’ parameter.  To create 100GB of data using 4 parallel threads, run the following commands (assuming of course your /nzscratch file system has the free space required):

/sandbox/tpcds_data/dsdgen -scale 100 -parallel 4 -dir /nzscratch/tpcds -child 1 &
/sandbox/tpcds_data/dsdgen -scale 100 -parallel 4 -dir /nzscratch/tpcds -child 2 &
/sandbox/tpcds_data/dsdgen -scale 100 -parallel 4 -dir /nzscratch/tpcds -child 3 &
/sandbox/tpcds_data/dsdgen -scale 100 -parallel 4 -dir /nzscratch/tpcds -child 4 &

You can monitor the progress with:

du -hs /nzscratch/tpcds

Once all four background jobs have completed, you can then load the data into your newly created database.

for THREAD in 1 2 3 4 ; do
         for DF in `ls *${THREAD}_4.dat` ; do
                 TBLNAME=`basename ${DF} _${THREAD}_4.dat`
                 nzload -db tpcds_data -t ${TBLNAME} -df ${DF} -delim '|'

Now you have a sample set of retail-like data that you can test with.

This entry was posted in General, Uncategorized and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s