NAME

pg_comparator - network and time efficient table content comparison.


SYNOPSIS

pg_comparator [options as --help --option --man] conn1 conn2


DESCRIPTION

This script performs a network and time efficient comparison of two possibly large tables on two servers. It makes only sense to use it if the expected differences are small.

The implementation is quite generic: multi-column keys, no assumption of data types other that they can be cast to text, subset of columns can be used for the comparison, handling of null values...


OPTIONS

Options allow to request help or to adjust some internal parameters. Short one-letter options are also available, usually with the first letter of the option name.

--aggregate=xor

Aggregation function to be used for summaries. Must operate on the result of the checksum function.

--ask-pass

Ask for passwords interactively.

--checksum='cksum8(%s)'

Checksum function to be used. The quality of this function in term of bit-spreading and uniformity is important for the quality of the results. A poor function might miss differences because of collisions or result in a more costly search. Cryptographic hash functions such as MD5 or SHA1 are a good choice.

--cleanup

Drop checksum tables. Useful after --notemp.

--factor=7

Folding factor: log2 of the number of rows grouped together at each stage. Default chosen after some basic tests on medium-size cases.

--help

Show short help.

--man

Show manual page.

--max-report=32

Maximum search effort, search will stop if above this threshold: it gives up if a single query at any level involves more than this many differences. Use 0 for no limit.

--max-levels=0

Maximum number of levels used. Allows to cut-off folding. Default is 0, meaning no cut-off. Setting a value of 1 only uses the checksum table, without summaries.

--null="COALESCE(%s::TEXT,'null')"

String processing template to take care of null values. If you set --null='%s', null values are set as they appear, which might damage the results in null columns are used because multi-column checksums will all be 0.

--option

Show option summary.

--prefix=cmp

Name prefix for comparison tables. May be schema-qualified.

--report --noreport

Report keys as they are found. Default is to report.

--separator=:

Separator string when concatenating columns.

--source='DBI:Pg:dbname=%b;host=%h;port=%p;'

DBI source template. Changing this might allow to use this command with another type of database, as the SQL used is quite standard. Tags are interpreted as follow: %b is base, %h is host, %p is port, %u is login.

--source2=...

DBI source template for second connection. Default is same as previous --source option.

--assume-size=...

Assume the provided value as the table size, thus skipping the COUNT query.

--temporary --notemporary

Whether to use temporary tables. Default is to use. If you don't, the tables are kept at the end, so they will have to be deleted by hand.

--threads

Use threads: a synonymous for ``segmentation fault'':-) It seems that DBI or DBD::Pg does not like threads at all...

--stats

Show various statistics.

--verbose

Be verbose about what is happening. The more you ask, the more verbose.

--version

Show version information.

--where=...

SQL boolean condition for partial comparison.


ARGUMENTS

The two arguments describe database connections with the following URL-like syntax, where square brackets denote optional parts. Many parts are optional with a default.

  [login[:pass]@][host[:port]]/base/[schema.]table[?key:cols]
login

Login to use when connecting to database. Default is username.

pass

Password to use when connecting to database. Default is none.

host

Hostname to connect to. Default is localhost.

port

Tcp-ip port to connect to. Default is 5432 for PostgreSQL.

base

Database catalog to connect to. Default is username.

schema.table

The possibly schema-qualified table to use for comparison. Default is same as first connection.

keys

Comma-separated list of key columns. Default is same as first connection.

cols

Comma-separated list of columns to compare. Default is same as first connection.


OUTPUT

The output of the command consists of lines describing the differences found between the two tables. They are expressed in term of insertions, updates or deletes and of tuple keys.

UPDATE k

Key k tuple is updated from table 1 to table 2. It exists in both tables with different values.

INSERT k

Key k tuple does not appear in table 2, but only in table 1. It must be inserted in table 2 to synchronize it wrt table 1.

DELETE k

Key k tuple appears in table 2, but not in table 1. It must be deleted from 2 to synchronize it wrt table 1.

In case of key-checksum or data-checksum collision, false positive or false negative results may occur. Changing the checksum function would help in such cases.


DEPENDENCES

Three support functions are needed on the database:

  1. The COALESCE function takes care of NULL values in columns. It may be changed with the --null option.

  2. A checksum function must be used to reduce and distribute key and columns values. It may be changed with the --checksum option.

    PostgreSQL MD5 function can be used for this purpose.

    Three other suitable implementations are available for PostgreSQL and can be loaded into the server by processing share/contrib/checksum.sql. The three functions cksum2, cksum4 and cksum8 differ on the size of the resulting checksum expressed in bytes. The default is to use the cksum8 version.

  3. An aggregate function is used to summarize checksums for a range of rows. It must operate on the result of the checksum function. It may be changed with the --aggregate option.

    Suitable implementations of a exclusive-or xor aggregate are available for PostgreSQL and can be loaded into the server by processing share/contrib/xor_aggregate.sql.

Moreover several perl modules are useful to run this script:

Getopt::Long for option management.

DBI and DBD::Pg to connect to PostgreSQL.

Term::ReadPassword for ask-pass option.

Pod::Usage for doc self-extraction (--man --opt --help).


EXAMPLES

Compare tables calvin and hobbes in database family on localhost, with key id and columns c1 and c2:

    ./pg_comparator /family/calvin?id:c1,c2 /family/hobbes

Compare tables calvin in default database on localhost and the same table in default database on sablons, with key id and column data:

    ./pg_comparator localhost/family/calvin?id:data sablons/family/calvin


ALGORITHM

The aim of the algorithm is to compare the content of two tables, possibly on different remote servers, with minimum network traffic. It is performed in three phases.

  1. A checksum table is computed on each side for the target table.

  2. A fist level summary table is computed on each side by aggregating chunks of the checksum table. Other levels of summary aggregations are then performed till there is only one row in the last table, which then stores a global checksum for the whole initial target tables.

  3. Starting from the upper summary tables, aggregated checksums are compared from both sides to look for differences, down to the initial checksum table. Keys of differing tuples are displayed.

CHECKSUM TABLE

The first phase computes the initial cheksum table t(0) on each side. Assuming that key is the table key columns, and cols is the table data columns that are to be checked for differences, then it is performed by querying target table t as follow:

  CREATE TABLE t(0) AS
  SELECT key AS id, checksum(key) AS idc, checksum(key || cols) AS cks
  FROM t;

The inititial key is kept, as it will be used to show differing keys at the end. The rational for the idc column is to randomize the key-values distribution so as to balance aggrates in the next phase. The key must appear in the cheksum also, otherwise content exchanged between two keys would not be detected in some cases.

SUMMARY TABLES

Now we compute a set of cascading summary tables by grouping f (folding factor) checksums together at each stage. The grouping is based on a mask on the idc column to take advantage of the checksum randomization. Starting from p=0 we build:

  CREATE TABLE t(p+1) AS
  SELECT idc & mask(p+1) AS idc, XOR(cks)
  FROM t(p)
  GROUP BY idc & mask(p+1);

The mask(p) is defined so that it groups together on average f checksums together: mask(0) = ceil2(size); mask(p) = mask(p-1)/f; This leads to a hierarchy of tables, each one being a smaller summary of the previous one:

level 0

checksum table, size rows, i.e. as many rows as the target table.

level 1

first summary table, (size/f) rows.

level p

intermediate summary table, (size/f**p) rows.

level n-1

one before last summary table, less than f rows.

level n

last summary table, mask is 0, 1 row.

It is important that the very same masks are used so that aggregations are the same, allowing to compare matching contents on both sides.

SEARCH FOR DIFFERENCES

After all these support tables are built on both sides comes the search for differences. When checking the checksum summary of the last tables (level n) with only one row, it is basically a comparison of the cheksum of the whole table contents. If they match, then both tables are equal, and we are done. Otherwise, if these checksums differ, some investigation is needed to detect offending keys.

The investigation is performed by going down the table hierarchy and looking for all idc for which there was a difference in the checksum on the previous level. The same query is performed on both side at each stage:

  SELECT idc, cks
  FROM t(p)
  WHERE idc & mask(p+1) IN (idc-with-diff-checksums-from-level-p+1)
  ORDER BY idc, cks;

And the results from both sides are merged together. When doing the merge procedure, four cases can arise:

  1. Both idc and cks match. Then there is no difference.

  2. Although idc does match, cks does not. Then this idc is to be investigated at the next level, as the checksum summary differs. If we are already at the last level, then the offending key can be shown.

  3. No idc match, one supplemental idc in the first side. Then this idc correspond to key(s) that must be inserted for syncing the second table wrt the first.

  4. No idc match, one supplemental idc in the second side. Then this idc correspond to key(s) that must be deleted for syncing the second table wrt the first.

Cases 3 and 4 are simply symmetrical, and it is only an interpretation to decide whether it is an insert or a delete, taking the first side as the reference.

IMPLEMENTATION ISSUES

The checksum implementation gives integers, which are constant length and easy to manipulate afterwards.

The xor aggregate is a good choice because there is no overflow issue with it and it takes into account all bits of the input.

Null values must be taken care appropriatelly.

The folding factor and all modules are taken as power of two...

There is a special management of large chunks of deletes or inserts which is implemented although not detailed in the algorithmic overview above nor the complexity analysis below.


ANALYSIS

Let n be the number of rows, r the row size, f the folding factor and k the number of differences to be detected. Then ISTM that:

network volume

is better than k*f*ceil(log(n)/log(f)): it is independent of r, the lower f the better, and you want k<<n.

number of requests

maximum is 6+2*ceil(log(n)/log(f)), minimum is 6+ceil(log(n)/log(f)) for equal tables.

disk I/O traffic

is about n*r+n*ln(n)*(f/(f-1)). Here a not too small f is better, as it reduces both the number of requests and of disk I/Os;

The choice of f is indeed a tradeoff.


REFERENCES

This script and algorithm was somehow inspired by:

Taming the Distributed Database Problem: A Case Study Using MySQL by Giuseppe Maxia in Sys Admin vol 13 num 8, Aug 2004, pp 29-40. See http://www.perlmonks.org/index.pl for details.

In the above paper, three algorithms are presented. The first one compares two tables with a checksum technique. The second one finds UPDATE or INSERT differences based on a 2-level (checksum and summary) table hierarchy. The algorithm is asymmetrical, as different queries are performed on the two tables to be compared. It seems that the network traffic volume is in k*(f+(n/f)+r), that it has a probabilistically-buggy merge procedure, and that it makes assumptions about the distribution of key values. The third algorithm looks for DELETE differences based on counting, with the implicit assumption that there are only such differences.

The algorithm used here implements all three tasks. It is fully symmetrical. It finds UPDATE, DELETE and INSERT between the two tables. The checksum and summary hierarchical level idea is reused and generalized so as to reduce the algorithmic complexity.

From the implementation standpoint, the script is as parametric as possible thru many options, and makes as few assumptions as possible about table structures, types and values.


SEE ALSO

Some products implement such features: http://www.programurl.com/software/sql-server-comparison.htm http://www.dbbalance.com/db_comparison.htm http://www.dkgas.com/dbdiff.htm http://www.sql-server-tool.com/ http://sourceforge.net/projects/mysqltoolkit


BUGS

All softwares have bugs. This is a software, hence it has bugs. Reporting bugs is good practice, so tell me if you find one!


VERSIONS

See http://pgfoundry.org/projects/pg-comparator/ for the latest version. My personnal site for the tool is http://www.coelho.net/pg_comparator/.

version 1.4.4 03/06/2008

Manual connection string parsing.

version 1.4.3 17/02/2008

Grumble! wrong tar pushed out.

version 1.4.2 17/02/2008

Minor makefile fix asked for by Roberto C. Sanchez.

version 1.4.1 14/02/2008

Minor fix for 8.3 by Roberto C. Sanchez.

version 1.4 24/12/2007

Port for 8.2. Better documentation. Fix masq bug: although the returned answer was correct, the table folding was not. DELETE/INSERT messages exchanged so as to match a 'sync' or 'copy' semantics.

version 1.3 31/08/2004

Project moved to http://pgfoundry.org/. Use cksum8 checksum function by default. Minor doc updates.

version 1.2 27/08/2004

Added --show-all-keys option for handling big chunks of deletes or inserts.

version 1.1 26/08/2004

Fix algorithmic bug: checksums must also include the key, otherwise exchanged data could be not detected if the keys were to be grouped together.

Algorithmic section added to manual page. Thanks to Giuseppe Maxia who asked for it.

Various code cleanups.

version 1.0 25/08/2004

Initial revision.


COPYRIGHT

Copyright (c) 2004-2008, Fabien Coelho <fabien at coelho dot net> http://www.coelho.net/

This softwere is distributed under the terms of the BSD Licence. Basically, you can do whatever you want, but you have to keep the license... and I'm not responsible for any consequences. Beware, you may lose your data or your hairs because of this software! See the LICENSE file enclosed with the distribution for details.

If you are very happy with this software, I would appreciate a postcard saying so (see my webpage for current address).