Commit 64d0b8b0 authored by Bruce Momjian

Attached is an update to contrib/tablefunc. It implements a new hashed
version of crosstab. This fixes a major deficiency in real-world use of
the original version. Easiest to understand with an illustration:

Data:
-------------------------------------------------------------------
select * from cth;
  id | rowid |        rowdt        |   attribute    |      val
----+-------+---------------------+----------------+---------------
   1 | test1 | 2003-03-01 00:00:00 | temperature    | 42
   2 | test1 | 2003-03-01 00:00:00 | test_result    | PASS
   3 | test1 | 2003-03-01 00:00:00 | volts          | 2.6987
   4 | test2 | 2003-03-02 00:00:00 | temperature    | 53
   5 | test2 | 2003-03-02 00:00:00 | test_result    | FAIL
   6 | test2 | 2003-03-02 00:00:00 | test_startdate | 01 March 2003
   7 | test2 | 2003-03-02 00:00:00 | volts          | 3.1234
(7 rows)

Original crosstab:
-------------------------------------------------------------------
SELECT * FROM crosstab(
   'SELECT rowid, attribute, val FROM cth ORDER BY 1,2',4)
AS c(rowid text, temperature text, test_result text, test_startdate
text, volts text);
  rowid | temperature | test_result | test_startdate | volts
-------+-------------+-------------+----------------+--------
  test1 | 42          | PASS        | 2.6987         |
  test2 | 53          | FAIL        | 01 March 2003  | 3.1234
(2 rows)

Hashed crosstab:
-------------------------------------------------------------------
SELECT * FROM crosstab(
   'SELECT rowid, attribute, val FROM cth ORDER BY 1',
   'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, temperature int4, test_result text, test_startdate
timestamp, volts float8);
  rowid | temperature | test_result |   test_startdate    | volts
-------+-------------+-------------+---------------------+--------
  test1 |          42 | PASS        |                     | 2.6987
  test2 |          53 | FAIL        | 2003-03-01 00:00:00 | 3.1234
(2 rows)

Notice that the original crosstab slides data over to the left in the
result tuple when it encounters missing data. In order to work around
this you have to make your source SQL do all sorts of contortions
(cartesian join of distinct rowid with distinct attribute; left join
that back to the real source data). The new version avoids this by
building a hash table using a second distinct attribute query.
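
For comparison, the workaround described above (padding the source so
that every rowid has one row per attribute) might look roughly like the
query below against the cth table. This is only a sketch and is not part
of the patch: the aliases r, a and s are illustrative, and it assumes the
original crosstab(text, int) passes NULL values through to the result.

```sql
-- Sketch of the pre-patch workaround for the original crosstab(text, int).
-- r x a is the cartesian product of distinct rowids and distinct attributes;
-- the LEFT JOIN back to cth yields a NULL val wherever an attribute is
-- missing, so each rowid contributes exactly one row per attribute.
SELECT * FROM crosstab(
   'SELECT r.rowid, a.attribute, s.val
      FROM (SELECT DISTINCT rowid FROM cth) AS r
     CROSS JOIN (SELECT DISTINCT attribute FROM cth) AS a
      LEFT JOIN cth AS s
        ON s.rowid = r.rowid AND s.attribute = a.attribute
     ORDER BY 1, 2', 4)
AS c(rowid text, temperature text, test_result text, test_startdate
text, volts text);
```

The ORDER BY 1, 2 keeps the attributes in the same alphabetical order as
the column definition list, which the positional version depends on.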

The new version also allows for "extra" columns (see the README) and
allows the result columns to be coerced into differing datatypes if they
are suitable (as shown above).

In testing a "real-world" data set (69 distinct rowids, 27 distinct
categories/attributes, multiple missing data points) I saw about a
5-fold improvement in execution time (from about 2200 ms old, to 440 ms
new).

I left the original version intact because: 1) backward compatibility,
2) it is probably slightly faster if you know that you have no missing
attributes.

README and regression test adjustments included. If there are no
objections, please apply.

Joe Conway
parent add932ee
@@ -333,6 +333,125 @@ AS ct(row_name text, category_1 text, category_2 text, category_3 text);
==================================================================
Name
crosstab(text, text) - returns a set of row_name, extra, and
category value columns
Synopsis
crosstab(text source_sql, text category_sql)
Inputs
source_sql
A SQL statement which produces the source set of data. The SQL statement
must return one row_name column, one category column, and one value
column. It may also have one or more "extra" columns.
The row_name column must be first. The category and value columns
must be the last two columns, in that order. "extra" columns must be
columns 2 through (N - 2), where N is the total number of columns.
The "extra" columns are assumed to be the same for all rows with the
same row_name. The values returned are copied from the first row
with a given row_name and subsequent values of these columns are ignored
until row_name changes.
e.g. source_sql must produce a set something like:
SELECT row_name, extra_col, cat, value FROM foo;
 row_name   extra_col    cat    value
----------+------------+-----+---------
 row1       extra1       cat1   val1
 row1       extra1       cat2   val2
 row1       extra1       cat4   val4
 row2       extra2       cat1   val5
 row2       extra2       cat2   val6
 row2       extra2       cat3   val7
 row2       extra2       cat4   val8
category_sql
A SQL statement which produces the distinct set of categories. The SQL
statement must return one category column only. category_sql must produce
at least one result row or an error will be generated. category_sql
must not produce duplicate categories or an error will be generated.
e.g. SELECT DISTINCT cat FROM foo;
cat
-------
cat1
cat2
cat3
cat4
Outputs
Returns setof record, which must be defined with a column definition
in the FROM clause of the SELECT statement, e.g.:
SELECT * FROM crosstab(source_sql, cat_sql)
AS ct(row_name text, extra text, cat1 text, cat2 text, cat3 text, cat4 text);
the example crosstab function produces a set something like:
                    <== values columns ==>
 row_name  extra   cat1   cat2   cat3   cat4
---------+-------+------+------+------+------
 row1      extra1  val1   val2          val4
 row2      extra2  val5   val6   val7   val8
Notes
1. source_sql must be ordered by row_name (column 1).
2. The number of values columns is determined at run-time. The
column definition provided in the FROM clause must provide for
the correct number of columns of the proper data types.
3. Missing values (i.e. not enough adjacent rows of same row_name to
fill the number of result values columns) are filled in with nulls.
4. Extra values (i.e. source rows with category not found in category_sql
result) are skipped.
5. Rows with a null row_name column are skipped.
Example usage
create table cth(id serial, rowid text, rowdt timestamp, attribute text, val text);
insert into cth values(DEFAULT,'test1','01 March 2003','temperature','42');
insert into cth values(DEFAULT,'test1','01 March 2003','test_result','PASS');
insert into cth values(DEFAULT,'test1','01 March 2003','volts','2.6987');
insert into cth values(DEFAULT,'test2','02 March 2003','temperature','53');
insert into cth values(DEFAULT,'test2','02 March 2003','test_result','FAIL');
insert into cth values(DEFAULT,'test2','02 March 2003','test_startdate','01 March 2003');
insert into cth values(DEFAULT,'test2','02 March 2003','volts','3.1234');
SELECT * FROM crosstab
(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1'
)
AS
(
rowid text,
rowdt timestamp,
temperature int4,
test_result text,
test_startdate timestamp,
volts float8
);
 rowid |          rowdt           | temperature | test_result |      test_startdate      | volts
-------+--------------------------+-------------+-------------+--------------------------+--------
 test1 | Sat Mar 01 00:00:00 2003 |          42 | PASS        |                          | 2.6987
 test2 | Sun Mar 02 00:00:00 2003 |          53 | FAIL        | Sat Mar 01 00:00:00 2003 | 3.1234
(2 rows)
==================================================================
Name
connectby(text, text, text, text, int[, text]) - returns a set
representing a hierarchy (tree structure)
@@ -123,6 +123,79 @@ SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''
test2 | val5 | val6 | val7 | val8
(2 rows)
--
-- hash based crosstab
--
create table cth(id serial, rowid text, rowdt timestamp, attribute text, val text);
NOTICE: CREATE TABLE will create implicit sequence 'cth_id_seq' for SERIAL column 'cth.id'
insert into cth values(DEFAULT,'test1','01 March 2003','temperature','42');
insert into cth values(DEFAULT,'test1','01 March 2003','test_result','PASS');
-- the next line is intentionally left commented and is therefore a "missing" attribute
-- insert into cth values(DEFAULT,'test1','01 March 2003','test_startdate','28 February 2003');
insert into cth values(DEFAULT,'test1','01 March 2003','volts','2.6987');
insert into cth values(DEFAULT,'test2','02 March 2003','temperature','53');
insert into cth values(DEFAULT,'test2','02 March 2003','test_result','FAIL');
insert into cth values(DEFAULT,'test2','02 March 2003','test_startdate','01 March 2003');
insert into cth values(DEFAULT,'test2','02 March 2003','volts','3.1234');
-- return attributes as plain text
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature text, test_result text, test_startdate text, volts text);
 rowid |          rowdt           | temperature | test_result | test_startdate | volts
-------+--------------------------+-------------+-------------+----------------+--------
 test1 | Sat Mar 01 00:00:00 2003 | 42          | PASS        |                | 2.6987
 test2 | Sun Mar 02 00:00:00 2003 | 53          | FAIL        | 01 March 2003  | 3.1234
(2 rows)
-- this time without rowdt
SELECT * FROM crosstab(
'SELECT rowid, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, temperature text, test_result text, test_startdate text, volts text);
 rowid | temperature | test_result | test_startdate | volts
-------+-------------+-------------+----------------+--------
 test1 | 42          | PASS        |                | 2.6987
 test2 | 53          | FAIL        | 01 March 2003  | 3.1234
(2 rows)
-- convert attributes to specific datatypes
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
 rowid |          rowdt           | temperature | test_result |      test_startdate      | volts
-------+--------------------------+-------------+-------------+--------------------------+--------
 test1 | Sat Mar 01 00:00:00 2003 |          42 | PASS        |                          | 2.6987
 test2 | Sun Mar 02 00:00:00 2003 |          53 | FAIL        | Sat Mar 01 00:00:00 2003 | 3.1234
(2 rows)
-- source query and category query out of sync
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth WHERE attribute IN (''temperature'',''test_result'',''test_startdate'') ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp);
 rowid |          rowdt           | temperature | test_result |      test_startdate
-------+--------------------------+-------------+-------------+--------------------------
 test1 | Sat Mar 01 00:00:00 2003 |          42 | PASS        |
 test2 | Sun Mar 02 00:00:00 2003 |          53 | FAIL        | Sat Mar 01 00:00:00 2003
(2 rows)
-- if category query generates no rows, get expected error
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth WHERE attribute = ''a'' ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
ERROR: load_categories_hash: provided categories SQL must return 1 column of at least one row
-- if category query generates more than one column, get expected error
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT rowdt, attribute FROM cth ORDER BY 2')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
ERROR: load_categories_hash: provided categories SQL must return 1 column of at least one row
--
-- connectby
--
-- test connectby with text based hierarchy
CREATE TABLE connectby_text(keyid text, parent_keyid text);
\copy connectby_text from 'data/connectby_text.data'
@@ -38,6 +38,61 @@ SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''
SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''group1'' ORDER BY 1,2;', 3) AS c(rowid text, att1 text, att2 text, att3 text);
SELECT * FROM crosstab('SELECT rowid, attribute, val FROM ct where rowclass = ''group1'' ORDER BY 1,2;', 4) AS c(rowid text, att1 text, att2 text, att3 text, att4 text);
--
-- hash based crosstab
--
create table cth(id serial, rowid text, rowdt timestamp, attribute text, val text);
insert into cth values(DEFAULT,'test1','01 March 2003','temperature','42');
insert into cth values(DEFAULT,'test1','01 March 2003','test_result','PASS');
-- the next line is intentionally left commented and is therefore a "missing" attribute
-- insert into cth values(DEFAULT,'test1','01 March 2003','test_startdate','28 February 2003');
insert into cth values(DEFAULT,'test1','01 March 2003','volts','2.6987');
insert into cth values(DEFAULT,'test2','02 March 2003','temperature','53');
insert into cth values(DEFAULT,'test2','02 March 2003','test_result','FAIL');
insert into cth values(DEFAULT,'test2','02 March 2003','test_startdate','01 March 2003');
insert into cth values(DEFAULT,'test2','02 March 2003','volts','3.1234');
-- return attributes as plain text
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature text, test_result text, test_startdate text, volts text);
-- this time without rowdt
SELECT * FROM crosstab(
'SELECT rowid, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, temperature text, test_result text, test_startdate text, volts text);
-- convert attributes to specific datatypes
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
-- source query and category query out of sync
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth WHERE attribute IN (''temperature'',''test_result'',''test_startdate'') ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp);
-- if category query generates no rows, get expected error
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT attribute FROM cth WHERE attribute = ''a'' ORDER BY 1')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
-- if category query generates more than one column, get expected error
SELECT * FROM crosstab(
'SELECT rowid, rowdt, attribute, val FROM cth ORDER BY 1',
'SELECT DISTINCT rowdt, attribute FROM cth ORDER BY 2')
AS c(rowid text, rowdt timestamp, temperature int4, test_result text, test_startdate timestamp, volts float8);
--
-- connectby
--
-- test connectby with text based hierarchy
CREATE TABLE connectby_text(keyid text, parent_keyid text);
\copy connectby_text from 'data/connectby_text.data'
@@ -34,6 +34,7 @@
*/
extern Datum normal_rand(PG_FUNCTION_ARGS);
extern Datum crosstab(PG_FUNCTION_ARGS);
extern Datum crosstab_hash(PG_FUNCTION_ARGS);
extern Datum connectby_text(PG_FUNCTION_ARGS);
#endif /* TABLEFUNC_H */
@@ -52,6 +52,11 @@ RETURNS setof record
AS 'MODULE_PATHNAME','crosstab'
LANGUAGE 'C' STABLE STRICT;
CREATE OR REPLACE FUNCTION crosstab(text,text)
RETURNS setof record
AS 'MODULE_PATHNAME','crosstab_hash'
LANGUAGE 'C' STABLE STRICT;
CREATE OR REPLACE FUNCTION connectby(text,text,text,text,int,text)
RETURNS setof record
AS 'MODULE_PATHNAME','connectby_text'