TPC-H: Schema and Indexes

The TPC-H benchmark is often used a method for customers to evaluate data warehouse products to make purchasing decisions. Because it is such a crucial benchmark, it is important to understand the challenges it presents for database vendors. Unfortunately, the public information about tuning for TPC-H is rather sparse and it is generally hard to come by good documentation. Vendors do not like to be compared with other vendors – so their secrecy is understandable.

In this blog series, I will try to shed some light on the TPC-H benchmark, what I think is wrong with it, and provide some of my thoughts about the challenges you face when tuning it.

The Schema and Data

The basic TPC-H schema can be downloaded from here: http://www.tpc.org/tpch/spec/tpch2.17.0.pdf

Individual vendors implement the exact DDL for differently. But, what you are allowed to do with the schema is highly regulated by the TPC-H standard.

For your reference, here is the schema from the v2.17.0 of TPC-H along with the row counts from each table.

One of the problems with TPC-H is that its schema has some odd design choices and restrictions imposed.

Allow me to explain…

Normalisation

The schema represents a simple data warehouse dealing with sales, customers and suppliers. For anyone familiar with Kimball style data modeling, it only takes a quick glance at the TPC-H schema to realise that this is NOT the way things are done in the real world. Let me elaborate: The ORDERS and LINEITEM table are normalised in TPC-H. In a properly designed Kimball data warehouse, you would denormalise these two tables into a SALES table. Interestingly, if you look at the TPC-DS benchmark (which at the time of writing is still not widely adopted) it implements a proper star schema in almost exactly this way.

Similarly, in a proper star schema design NATION should be denormalised into REGION, CUSTOMER and SUPPLIER.

Scale Factor (SF)

When evaluating TPC-H results, it is important to understand the concept of Scale Factor (SF). The scale factor measures the size of the input data. Note that this not necessarily the final size of the database – which will be vendor dependent. For example, a SF = 1 database (1GB input) may become significantly smaller after it has been loaded into a column store database.

The major tables in the schema grow linearly with SF. Here is a list of data and table sizes at different scale factors:

SF	100	1000	10.000	100.000
Input Data Size	100GB	1TB	10TB	100TB
CUSTOMER rows	15M	150M	1.5B	15B
PARTS rows	20M	200M	2B	20B
LINEITEM rows	600M	6B	60B	600B
SUPPLIER rows	1M	10M	100M	1B

It should be obvious that this way to scale a database is completely silly. In case it is not clear to you why that is so:

An average Wallmart store has 142K items (PARTS) (source: http://news.walmart.com/news-archive/2005/01/07/our-retail-divisions). A 10TB warehouse loaded according to TPC-H would have 2B PARTS – hardly realistic.
There are currently 7B people on the planet. No one dealing in sales (not even Amazon, source: http://www.fool.com/investing/general/2014/05/24/how-many-customers-does-amazon-have.aspx) has 1.5B customers. Yet, at SF 10.000, only 10TB of data, 1.5B is the size of CUSTOMER. At SF 100.000 this is even more meaningless, as there are now twice as many CUSTOMER entries in the table as there a people on the planet.
There are around 300M Americans, if Wallmart has 10TB of data (which they do) then if we follow the scaling of SUPPLIER from a TPC-H SF 10.000 this would mean that every 3rd American would be a supplier (not a customer) to Wallmart

Clearly, there is something very broken about the way TPC-H scales up its tables. Obviously, ORDERS and LINEITEM needs to grow by the scale factor, big warehouses get big from having lots of transactions stored. But, it does not follow that almost every other table in the schema needs to grow linearly too.

It is interesting to note that the TPC-H standard admits that there are limits to how how SF can go. Quoting section 4.1.3.1:

Comment 2: The maximum size of the test database for a valid performance test is currently set at 100000 (i.e., SF = 100,000). The TPC recognizes that additional benchmark development work is necessary to allow TPC-H to scale beyond that limit.

I would argue that this is an understatement. This benchmark makes very little real world sense beyond SF 1000 – and even then, the size of SUPPLIER (at 10M suppliers) is a stretch.

Data Distribution

The TPC council maintains the source code for a tool called dbgen that allows you to generate CSV files which can be used to load the TPC-H data. The tool and the data it generates is documented in section 4.2 of the TPC-H standard.

Section 4.2.3 is of particular interest, as it describes the generated distribution of the data. Consider that LINEITEM and ORDERS is about 80% of the total dataset and that these tables would typically make up a fact table in a traditional, properly denormalised, data warehouse. Because touching data in LINEITEM and ORDERS is a major part of the workload, it is important that the data in those tables represents something that you are likely to see in the real world.

Transactional data often has very high degrees of correlation between columns, by the nature of where it comes from: human activity. Such highly correlated columns are often put to good use by data compression routines and column stores.

However, TPC-H is again unrealistic. Let us see why:

Every CUSTOMER is equally likely to be on a transaction. In the real world, some customers shop more than others
The L_QUANTITY column is uniformly, randomly distributed. In the real world, the quantity is highly skewed and often correlated with the PART you buy
L_DISCOUNT is uniformly, randomly distributed. In the real world the discount you get is highly correlated with the PART you buy
L_TAX is uniformly, randomly distributed. In the real world, the tax you pay is correlated both with the product and the country the customer is from.
There is no correlation between O_CUSTOMER and L_PARTKEY. In the real world, customers tend to cluster around the products they buy.
While the different dates (L_SHIPDATE, L_COMMITDATE etc.) are correlated with O_ORDERDATE their offset is random. In the real world, the time it takes to move orders is highly skewed

Because TPC-H is essentially a random, uncorrelated dataset (with a few exceptions like L_SUPPKEY and the date columns) the compression you achieve on TPC-H is in no way an expression of the compression rates you will see on real data.

Indexing and Partitioning

TPC-H is very strict about the indexing and partitioning strategy it allows a vendor to use. This is to some degree understandable, as allowing tricks like materialised views would make it trivial to tune the workload.

Clauses 1.4.2.2, 1.4.2.3 and 1.5.4 define what you are allowed to do when tuning the schema. Basically, these are the indexes and partitions you can create:

You may index a primary key (and those keys are strictly defined)
You may index a foreign key
You may partition any table on one and only one column that has the type date. This partitioning can be done down to the day level.

The standard does not define what type of index may be used, so a vendor is free to use B-trees, Hash indexes or any other trick. However, there is a requirement that you cannot explicitly define the index type in the DDL. This means that the database must either use a default or have some build in logic that picks index types. For example, MySQL will automatically create indexes on any column declared as a foreign key – and this is allowed by the benchmark as long as the rule is applied consistently to all foreign keys that are declared.

A lot of indexing tricks are explicitly disallowed, these include:

Join indexes
Materialised/indexed views
Computed columns
Partitioning on more than one column
Vertical partitioning (though column stores on every column are allowed
Indexes on more than one column, with the exception of the composite primary key in LINEITEM (L_ORDERKEY, L_LINENUMBER).
Any index on a non key or foreign key column

Overall, I think the TPC-H indexing rules are as fair as they can be. It is hard to come up with a rule set that prevents “cheating” by letting vendors implement highly aggregated data views or excessive indexing. However, it should be noted that in the real world, dimension tables are often heavily indexed (with no significant storage overhead) and that this is something the TPC-H explicitly disallows. To add insult to injury, the dimension tables are already too large with the SF – which makes queries filtering on PARTS and CUSTOMER (such as Q20 and Q22) largely irrelevant when evaluating real world scenarios.

Summary

In this blog, I have provided you some background on the TPC-H benchmark. As should be clear, I do not consider TPC-H to be a representation of a real world data warehouse system. It is a blunt, even dangerous, tool for comparing database vendors with each other when evaluating a data warehousing platform.

However, I am still in favour of the TPC-H benchmark as a tool to help database vendors develop their database engines. TPC-H is designed to stress a database by sending hard queries at it [insert evil cackle here]. This helps a vendor tune for queries they might not have foreseen and it provides a way to validate that the database does well even under unrealistically hard workloads. As a test suite, it is a good way to explore where the problems lie. But any TPC-H result should always be evaluated by someone who understand if the weaknesses exposed are relevant for a real world workload or not, particularly if the results of a TPC-H test are relevant to the customer building a warehouse.

In future blog entries, I will explore the 22 queries that are part of the benchmark and highlight some of the challenges involved in tuning them.

The post TPC-H: Schema and Indexes appeared first on Fighting Bad Data Modeling.