pg_comparator - network and time efficient table content comparison.
pg_comparator [options as --help --option --man] conn1 conn2
This script performs a network and time efficient comparison of two possibly large tables on two servers. It makes only sense to use it if the expected differences are small.
The implementation is quite generic: multi-column keys, no assumption of data types other that they can be cast to text, subset of columns can be used for the comparison, handling of null values...
Options allow to request help or to adjust some internal parameters. Short one-letter options are also available, usually with the first letter of the option name.
--aggregate=xorAggregation function to be used for summaries. Must operate on the result of the checksum function.
--ask-passAsk for passwords interactively.
--checksum='cksum8(%s)'Checksum function to be used. The quality of this function in term of bit-spreading and uniformity is important for the quality of the results. A poor function might miss differences because of collisions or result in a more costly search. Cryptographic hash functions such as MD5 or SHA1 are a good choice.
--cleanupDrop checksum tables. Useful after --notemp.
--factor=7Folding factor: log2 of the number of rows grouped together at each stage. Default chosen after some basic tests on medium-size cases.
--helpShow short help.
--manShow manual page.
--max-report=32Maximum search effort, search will stop if above this threshold: it gives up if a single query at any level involves more than this many differences. Use 0 for no limit.
--max-levels=0Maximum number of levels used. Allows to cut-off folding. Default is 0, meaning no cut-off. Setting a value of 1 only uses the checksum table, without summaries.
--null="COALESCE(%s::TEXT,'null')"String processing template to take care of null values. If you set --null='%s', null values are set as they appear, which might damage the results in null columns are used because multi-column checksums will all be 0.
--optionShow option summary.
--prefix=cmpName prefix for comparison tables. May be schema-qualified.
--report --noreportReport keys as they are found. Default is to report.
--separator=:Separator string when concatenating columns.
--source='DBI:Pg:dbname=%b;host=%h;port=%p;'DBI source template. Changing this might allow to use this command with another type of database, as the SQL used is quite standard. Tags are interpreted as follow: %b is base, %h is host, %p is port, %u is login.
--source2=...DBI source template for second connection. Default is same as previous --source option.
--assume-size=...Assume the provided value as the table size, thus skipping the COUNT query.
--temporary --notemporaryWhether to use temporary tables. Default is to use. If you don't, the tables are kept at the end, so they will have to be deleted by hand.
--threadsUse threads: a synonymous for ``segmentation fault'':-) It seems that DBI or DBD::Pg does not like threads at all...
--statsShow various statistics.
--verboseBe verbose about what is happening. The more you ask, the more verbose.
--versionShow version information.
--where=...SQL boolean condition for partial comparison.
The two arguments describe database connections with the following URL-like syntax, where square brackets denote optional parts. Many parts are optional with a default.
[login[:pass]@][host[:port]]/base/[schema.]table[?key:cols]
Login to use when connecting to database. Default is username.
Password to use when connecting to database. Default is none.
Hostname to connect to. Default is localhost.
Tcp-ip port to connect to. Default is 5432 for PostgreSQL.
Database catalog to connect to. Default is username.
The possibly schema-qualified table to use for comparison. Default is same as first connection.
Comma-separated list of key columns. Default is same as first connection.
Comma-separated list of columns to compare. Default is same as first connection.
The output of the command consists of lines describing the differences found between the two tables. They are expressed in term of insertions, updates or deletes and of tuple keys.
Key k tuple is updated from table 1 to table 2. It exists in both tables with different values.
Key k tuple does not appear in table 2, but only in table 1. It must be inserted in table 2 to synchronize it wrt table 1.
Key k tuple appears in table 2, but not in table 1. It must be deleted from 2 to synchronize it wrt table 1.
In case of key-checksum or data-checksum collision, false positive or false negative results may occur. Changing the checksum function would help in such cases.
Three support functions are needed on the database:
The COALESCE function takes care of NULL values in columns.
It may be changed with the --null option.
A checksum function must be used to reduce and distribute key and columns values. It may be changed with the --checksum option.
PostgreSQL MD5 function can be used for this purpose.
Three other suitable implementations are available for PostgreSQL
and can be loaded into the server by processing share/contrib/checksum.sql.
The three functions cksum2, cksum4 and cksum8 differ on the size
of the resulting checksum expressed in bytes. The default is to use
the cksum8 version.
An aggregate function is used to summarize checksums for a range of rows. It must operate on the result of the checksum function. It may be changed with the --aggregate option.
Suitable implementations of a exclusive-or xor aggregate are available
for PostgreSQL and can be loaded into the server by processing
share/contrib/xor_aggregate.sql.
Moreover several perl modules are useful to run this script:
Getopt::Long for option management.
DBI and DBD::Pg to connect to PostgreSQL.
Term::ReadPassword for ask-pass option.
Pod::Usage for doc self-extraction (--man --opt --help).
Compare tables calvin and hobbes in database family on localhost, with key id and columns c1 and c2:
./pg_comparator /family/calvin?id:c1,c2 /family/hobbes
Compare tables calvin in default database on localhost and the same table in default database on sablons, with key id and column data:
./pg_comparator localhost/family/calvin?id:data sablons/family/calvin
The aim of the algorithm is to compare the content of two tables, possibly on different remote servers, with minimum network traffic. It is performed in three phases.
A checksum table is computed on each side for the target table.
A fist level summary table is computed on each side by aggregating chunks of the checksum table. Other levels of summary aggregations are then performed till there is only one row in the last table, which then stores a global checksum for the whole initial target tables.
Starting from the upper summary tables, aggregated checksums are compared from both sides to look for differences, down to the initial checksum table. Keys of differing tuples are displayed.
The first phase computes the initial cheksum table t(0) on each side. Assuming that key is the table key columns, and cols is the table data columns that are to be checked for differences, then it is performed by querying target table t as follow:
CREATE TABLE t(0) AS SELECT key AS id, checksum(key) AS idc, checksum(key || cols) AS cks FROM t;
The inititial key is kept, as it will be used to show differing keys at the end. The rational for the idc column is to randomize the key-values distribution so as to balance aggrates in the next phase. The key must appear in the cheksum also, otherwise content exchanged between two keys would not be detected in some cases.
Now we compute a set of cascading summary tables by grouping f (folding factor) checksums together at each stage. The grouping is based on a mask on the idc column to take advantage of the checksum randomization. Starting from p=0 we build:
CREATE TABLE t(p+1) AS SELECT idc & mask(p+1) AS idc, XOR(cks) FROM t(p) GROUP BY idc & mask(p+1);
The mask(p) is defined so that it groups together on average f
checksums together: mask(0) = ceil2(size); mask(p) = mask(p-1)/f;
This leads to a hierarchy of tables, each one being a smaller summary
of the previous one:
checksum table, size rows, i.e. as many rows as the target table.
first summary table, (size/f) rows.
intermediate summary table, (size/f**p) rows.
one before last summary table, less than f rows.
last summary table, mask is 0, 1 row.
It is important that the very same masks are used so that aggregations are the same, allowing to compare matching contents on both sides.
After all these support tables are built on both sides comes the search for differences. When checking the checksum summary of the last tables (level n) with only one row, it is basically a comparison of the cheksum of the whole table contents. If they match, then both tables are equal, and we are done. Otherwise, if these checksums differ, some investigation is needed to detect offending keys.
The investigation is performed by going down the table hierarchy and looking for all idc for which there was a difference in the checksum on the previous level. The same query is performed on both side at each stage:
SELECT idc, cks FROM t(p) WHERE idc & mask(p+1) IN (idc-with-diff-checksums-from-level-p+1) ORDER BY idc, cks;
And the results from both sides are merged together. When doing the merge procedure, four cases can arise:
Both idc and cks match. Then there is no difference.
Although idc does match, cks does not. Then this idc is to be investigated at the next level, as the checksum summary differs. If we are already at the last level, then the offending key can be shown.
No idc match, one supplemental idc in the first side.
Then this idc correspond to key(s) that must be inserted
for syncing the second table wrt the first.
No idc match, one supplemental idc in the second side.
Then this idc correspond to key(s) that must be deleted
for syncing the second table wrt the first.
Cases 3 and 4 are simply symmetrical, and it is only an interpretation to decide whether it is an insert or a delete, taking the first side as the reference.
The checksum implementation gives integers, which are constant length and easy to manipulate afterwards.
The xor aggregate is a good choice because there is no overflow issue with it and it takes into account all bits of the input.
Null values must be taken care appropriatelly.
The folding factor and all modules are taken as power of two...
There is a special management of large chunks of deletes or inserts which is implemented although not detailed in the algorithmic overview above nor the complexity analysis below.
Let n be the number of rows, r the row size, f the folding factor and k the number of differences to be detected. Then ISTM that:
is better than k*f*ceil(log(n)/log(f)): it is independent of r, the lower f the better, and you want k<<n.
maximum is 6+2*ceil(log(n)/log(f)), minimum is 6+ceil(log(n)/log(f)) for equal tables.
is about n*r+n*ln(n)*(f/(f-1)). Here a not too small f is better, as it reduces both the number of requests and of disk I/Os;
The choice of f is indeed a tradeoff.
This script and algorithm was somehow inspired by:
Taming the Distributed Database Problem: A Case Study Using MySQL by Giuseppe Maxia in Sys Admin vol 13 num 8, Aug 2004, pp 29-40. See http://www.perlmonks.org/index.pl for details.
In the above paper, three algorithms are presented. The first one compares two tables with a checksum technique. The second one finds UPDATE or INSERT differences based on a 2-level (checksum and summary) table hierarchy. The algorithm is asymmetrical, as different queries are performed on the two tables to be compared. It seems that the network traffic volume is in k*(f+(n/f)+r), that it has a probabilistically-buggy merge procedure, and that it makes assumptions about the distribution of key values. The third algorithm looks for DELETE differences based on counting, with the implicit assumption that there are only such differences.
The algorithm used here implements all three tasks. It is fully symmetrical. It finds UPDATE, DELETE and INSERT between the two tables. The checksum and summary hierarchical level idea is reused and generalized so as to reduce the algorithmic complexity.
From the implementation standpoint, the script is as parametric as possible thru many options, and makes as few assumptions as possible about table structures, types and values.
Some products implement such features: http://www.programurl.com/software/sql-server-comparison.htm http://www.dbbalance.com/db_comparison.htm http://www.dkgas.com/dbdiff.htm http://www.sql-server-tool.com/ http://sourceforge.net/projects/mysqltoolkit
All softwares have bugs. This is a software, hence it has bugs. Reporting bugs is good practice, so tell me if you find one!
See http://pgfoundry.org/projects/pg-comparator/ for the latest version. My personnal site for the tool is http://www.coelho.net/pg_comparator/.
Manual connection string parsing.
Grumble! wrong tar pushed out.
Minor makefile fix asked for by Roberto C. Sanchez.
Minor fix for 8.3 by Roberto C. Sanchez.
Port for 8.2. Better documentation. Fix masq bug: although the returned answer was correct, the table folding was not. DELETE/INSERT messages exchanged so as to match a 'sync' or 'copy' semantics.
Project moved to http://pgfoundry.org/. Use cksum8 checksum function by default. Minor doc updates.
Added --show-all-keys option for handling big chunks of deletes or inserts.
Fix algorithmic bug: checksums must also include the key, otherwise exchanged data could be not detected if the keys were to be grouped together.
Algorithmic section added to manual page. Thanks to Giuseppe Maxia who asked for it.
Various code cleanups.
Initial revision.
Copyright (c) 2004-2008, Fabien Coelho <fabien at coelho dot net> http://www.coelho.net/
This softwere is distributed under the terms of the BSD Licence. Basically, you can do whatever you want, but you have to keep the license... and I'm not responsible for any consequences. Beware, you may lose your data or your hairs because of this software! See the LICENSE file enclosed with the distribution for details.
If you are very happy with this software, I would appreciate a postcard saying so (see my webpage for current address).