Migrating DSPAM database to MySQL 4.1

DSPAM 3.4 and newer versions come with a special MySQL database schema to improve performance on MySQL 4.1 and newer versions. This page describes how I migrated from the old database schema to the new one without loosing my training data.

Warning: Please note that I can't guarantee that the process described below will preserve your data! You may loose everything - make backups before. You have been warned!

Schema Differences

The two schemas don't differ that much. The tables dspam_token_data and dspam_signature_data show minor modifications (comparing the 'speed' to '4.1' schema):

  1. dspam_token_data: token is now bigint unsigned (was char(20)), the table has the new parameter PACK_KEYS=1.
  2. dspam_signature_data: the table has the new parameters max_rows=2500000 avg_row_length=8096.

Basic Overview

  1. If you want to switch the database machine while upgrading to the new MySQL schema, use mysqldump to get a copy of your existing dspam database and load it on the new machine.
  2. Do some 'ALTER TABLE' statements to change the database schema.

The modifications seem to be very easy, I first thought that some ALTER TABLE statements would suffice. Unfortunately, changing dspam_token_data wasn't that easy because MySQL complained: 'ERROR 1062 (23000): Duplicate entry 'XXXXXX-0' for key 1 '.

Detailled Procedure

After every step I noted the time it took MySQL to perform the action on my database (2.4 million records in dspam_token_data, 500k records in dspam_signature_data, sql dump is 4.3G big). The times will vary greatly depending on your computing power, the current load and the size of your database but these times may give a first impression how long it will take. Loading the data in a new database took me (approx) 1:40h.

Furthermore please remember that you should have some free space on your disk because the "alter table" statements will create a temporary copy of your tables so you should calculate that your disk space use for your dspam database will double (approximately) during the schema migration.

Old MySQL versions had only very limited built-in character support. But DSPAM will store tokens which will contain non-ascii characters therefore you should ensure that the character set definitions of the new database matches the old one. How do you determine which character you should use? My old MySQL 3.23.58 did not support any commands like 'SHOW CHARACTER SETS' but with 'SHOW VARIABLES;' you will see a variable named 'character_set' which should be the wanted value. Now you can modify the new database so that it will use the correct character set regardless of your global server settings (which you may not want to modify):

ALTER DATABASE dspam CHARACTER SET latin1 COLLATE latin1_bin;

step by step:

  1. Drop the index to get around the 'duplicate entry' error: drop index `id_token_data_01` on dspam_token_data; (23 seconds)
  2. Modify dspam_token_data: alter table dspam_token_data modify token bigint unsigned not NULL; (12 seconds)
  3. Add the PACK_KEYS option: alter table dspam_token_data PACK_KEYS=1; (13 seconds)
  4. Re-add the index: alter table dspam_token_data add primary key (uid,token); (40 seconds)
  5. Modify dspam_signature_data: alter table dspam_signature_data max_rows=2500000 avg_row_length=8096; This operation is by far the most "expensive" one (besides loading the initial data): It took 15.3 minutes at my computer.

Hopefully, the procedure above worked without any errors. At least it worked for me.