DSPAM 3.4 and newer versions come with a special MySQL database schema to improve performance on MySQL 4.1 and newer versions. This page describes how I migrated from the old database schema to the new one without loosing my training data.
The two schemas don't differ that much. The tables dspam_token_data and dspam_signature_data show minor modifications (comparing the 'speed' to '4.1' schema):
bigint unsigned
(was char(20)
), the table has the new parameter PACK_KEYS=1
.
max_rows=2500000 avg_row_length=8096
.
The modifications seem to be very easy, I first thought that some ALTER TABLE
statements would suffice. Unfortunately, changing dspam_token_data wasn't that easy because MySQL complained: 'ERROR 1062 (23000): Duplicate entry 'XXXXXX-0' for key 1 '.
After every step I noted the time it took MySQL to perform the action on my database (2.4 million records in dspam_token_data, 500k records in dspam_signature_data, sql dump is 4.3G big). The times will vary greatly depending on your computing power, the current load and the size of your database but these times may give a first impression how long it will take. Loading the data in a new database took me (approx) 1:40h.
Furthermore please remember that you should have some free space on your disk because the "alter table" statements will create a temporary copy of your tables so you should calculate that your disk space use for your dspam database will double (approximately) during the schema migration.
Old MySQL versions had only very limited built-in character support. But DSPAM will store tokens which will contain non-ascii characters therefore you should ensure that the character set definitions of the new database matches the old one. How do you determine which character you should use? My old MySQL 3.23.58 did not support any commands like 'SHOW CHARACTER SETS' but with 'SHOW VARIABLES;' you will see a variable named 'character_set' which should be the wanted value. Now you can modify the new database so that it will use the correct character set regardless of your global server settings (which you may not want to modify):
ALTER DATABASE dspam CHARACTER SET latin1 COLLATE latin1_bin;
drop index `id_token_data_01` on dspam_token_data;
(23 seconds)
alter table dspam_token_data modify token bigint unsigned not NULL;
(12 seconds)
alter table dspam_token_data PACK_KEYS=1;
(13 seconds)
alter table dspam_token_data add primary key (uid,token);
(40 seconds)
alter table dspam_signature_data max_rows=2500000 avg_row_length=8096;
This operation is by far the most "expensive" one (besides loading the initial data): It took 15.3 minutes at my computer.
Hopefully, the procedure above worked without any errors. At least it worked for me.