Friday, January 29, 2010

What is CHECK TABLE doing with InnoDB tables?

Recently we had a case where a customer got some corrupted blocks in his InnoDB tables. His largest tables where quite big, about 30 to 100 Gbyte. Why he got this corrupted blocks we did not find out yet (disk broken?).

When you have corrupted blocks in InnoDB, it is mandatory to get rid of them again. Otherwise your database can crash suddenly.
If you are lucky only "normal" tables are concerned. So you can dump, drop, recreate and load them again as described in the InnoDB recovery procedure in the MySQL documentation [1].
If you are not so lucky you have to recreate your complete database or go back to an old backup and do a restore with a Point-in-Time-Recovery (PITR).

To find out if some tables are corrupted MySQL provides 2 tools: The innochecksum utility [2] and the mysqlcheck utility [3] or you can use the CHECK TABLE command manually (which is used by mysqlcheck).

I wanted to know how CHECK TABLE works in detail. So I looked first in the MySQL documentation [4]. But unfortunately the MySQL documentation does not go into details that much very often on such specific questions.

So I dug into the code. The interesting lines you can find in the files handler/ha_innodb.cc and row/row0mysql.c. In the following snippets I have cut out a lot of detail stuff.

The function ha_innobase::check is the interface between the CHECK TABLE command and the InnoDB storage engine and does the call of the InnoDB table check:

// handler/ha_innodb.cc

int ha_innobase::check( THD* thd )
{

build_template(prebuilt, NULL, table, ROW_MYSQL_WHOLE_ROW);

ret = row_check_table_for_mysql(prebuilt);

if (ret == DB_SUCCESS) {
return(HA_ADMIN_OK);
}

return(HA_ADMIN_CORRUPT);
}

The function row_check_table_for_mysql does the different checks on an InnoDB table:

  • First it checks if the ibd file is missing.

  • Then the first index (dict_table_get_first_index) is checked on its consistency (btr_validate_index) by walking through all page tree levels. In InnoDB the first (primary) index is always equal to the table (= data).

  • If the index is consistent several other checks are performed (row_scan_and_check_index):

    • If entries are in ascendant order.

    • If unique constraint is not broken.

    • And the number of index entries is calculated.

  • Then the next and all other (secondary) indexes of the table are done in the same way.

  • At the end a WHOLE Adaptive Hash Index check for ALL InnoDB tables (btr_search_validate) is done for every CHECK TABLE!

// row/row0mysql.c

ulint row_check_table_for_mysql( row_prebuilt_t* prebuilt )
{

if ( prebuilt->table->ibd_file_missing ) {
fprintf(stderr, "InnoDB: Error: ...", prebuilt->table->name);
return(DB_ERROR);
}

index = dict_table_get_first_index(table);

while ( index != NULL ) {

if ( ! btr_validate_index(index, prebuilt->trx) ) {
ret = DB_ERROR;
}
else {

if ( ! row_scan_and_check_index(prebuilt, index, &n_rows) ) {
ret = DB_ERROR;
}

if ( index == dict_table_get_first_index(table) ) {
n_rows_in_table = n_rows;
}
else if ( n_rows != n_rows_in_table ) {

ret = DB_ERROR;

fputs("Error: ", stderr);
dict_index_name_print(stderr, prebuilt->trx, index);
fprintf(stderr, " contains %lu entries, should be %lu\n", n_rows, n_rows_in_table);
}
}

index = dict_table_get_next_index(index);
}

if ( ! btr_search_validate() ) {
ret = DB_ERROR;
}

return(ret);
}

A little detail which is NOT discussed in the code above is that the fatal lock wait timeout is set from 600 seconds (10 min) to 7800 seconds (2 h 10 min).

/* Enlarge the fatal lock wait timeout during CHECK TABLE. */
mutex_enter(&kernel_mutex);
srv_fatal_semaphore_wait_threshold += 7200; /* 2 hours */
mutex_exit(&kernel_mutex);

As far as I understand this has 2 impacts:
  1. CHECK TABLE for VERY large tables (> 200 - 400 Gbyte) will most probably fail because it will exceed the fatal lock timeout. This becomes more probable when you have bigger tables, slower disks, less memory or do not make use of your memory appropriately.

  2. Because srv_fatal_semaphore_wait_threshold is a global variable, during every CHECK TABLE the fatal lock wait timeout is set high for the whole system. Long enduring InnoDB locks will be detected late or not at all during a long running CHECK TABLE command.


If this is something which should be fixed to get a higher reliability of the system I cannot judge and is up to the InnoDB developers. But when you hit such symptoms during long running CHECK TABLE commands consider this.
For the first finding I have filed a feature request [5]. This "problem" was introduced long time ago with bug #2694 [6] in MySQL 4.0, Sep 2004. Thanks to Axel and Shane for their comments.
If you want to circumvent this situation you have either to recompile MySQL with higher values or you can use the concept of a pluggable User Defined Function (UDF) which I have described earlier [7], [8], [9].

An other detail is that at the end of each CHECK TABLE command a check of all Adaptive Hash Indexes of all tables is done. I do not know how expensive it is to check all Adaptive Hash Indexes, especially when they are large. But having a more optimized code there could help to speed up the CHECK TABLE command for a small percentage?

These information are valid up to MySQL/InnoDB 5.1.41 and the InnoDB plug-in 1.0.5.

Literature

[1] Forcing InnoDB Recovery
[2] innochecksum — Offline InnoDB File Checksum Utility
[3] mysqlcheck
[4] CHECK TABLE
[5] Bug #50723: InnoDB CHECK TABLE fatal semaphore wait timeout possibly too short for big table
[6] Bug #2694: CHECK TABLE for Innodb table can crash server
[7] MySQL useful add-on collection using UDF
[8] Using MySQL User-Defined Functions (UDF) to get MySQL internal informations
[9] MySQL useful add-on collection using UDF

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

Friday, January 22, 2010

MySQL on VMware Workstation/DRBD vs. VMWare ESX Server/SAN

Or an active-active fail-over cluster à la VMware.

Today I have learned about a totally crazy/cool looking architecture where the expensive VMware ESX server was replace by a free/cheap VMware Workstation version in combination with DRBD.

Basically DRBD we name "the little man's SAN" and that is exactly what this customer is doing. He replaced the SAN with DRBD and now he can easily move one VMware instance to the other host. Possibly it is not that flexible and powerful as an ESX Server but also not so expensive...

The architecture looks as follows:


According to this customer it works stable on about a dozed of installations and they have not experienced any troubles during the fail-overs.

Please let me know your experience, thoughts or concerns with this architecture...

PS: When you consider such an architecture do not expect a very good performance!

Wednesday, January 20, 2010

The battle against Oracle is probably over but has the real war begun yet?

According to different sources from the web the decision about the Oracle - Sun merger has been approved by the European commission soon. So at least in the West it is clear what is going on. Let us see what the East decides... [1], [2].

Oracles arch-enemy Microsoft has already brought its weapons in position against the target with its: "Microsoft offers Oracle-phobes MySQL migration tool" [3], [4]. So far so good. Nothing new, nothing special.

What made me a bit edgy was the following Oracle blog series about their Oracle Warehouse Builder (OWB):
  • OWB 11gR2 – MySQL Open Connectivity [5]
  • OWB 11gR2 – MySQL Bulk Extract [6]
OWB seems to be a great tool to move data around from different sources, to mix them and to extract some useful results.

It looks like with the new 11gR2 release there "... were significant changes to mapping to support native heterogeneous connectivity to systems ...".

What interests me more is what is MySQL related about it. In the second part of the series I find some text passages like:

"Next on the MySQL series, let's look at bulk extract to file from MySQL. ... and also rather than just unloading to file, we can optimize for other systems such as Oracle and create Load Code Templates that extract in bulk from MySQL (or whatever) transfer to the Oracle system and create an external table as the staging table. ... Hopefully this gives you a taster for the capabilities of the bulk extract capabilities using MySQL as an illustration."

This makes me just a bit nervous. But hopefully I am just overreacting because of the current situation... So let us carefully watch how it continues!

PS: .oO(How would it sound like this: "... and also rather than just unloading Oracle data to file, we can optimize for MySQL and create Load Code Templates that extract in bulk from Oracle (or whatever) transfer to the MySQL system ..."?)

1 Yahoo! NEWS
2 The Wall Street Journal
3 Microsoft offers Oracle-phobes MySQL migration tool
4 Free Download: Microsoft SQL Server Migration Assistant
5 OWB 11gR2 – MySQL Open Connectivity
6 OWB 11gR2 – MySQL Bulk Extract

Thursday, January 7, 2010

MySQL reporting to syslog

There are 2 different possible situations you can face when you have to deal with MySQL and syslog:
  1. MySQL is used as back-end for syslog to store the logging information. [6]
  2. MySQL itself should report to the syslog.
In this blog article we look at the second situation: How can you make MySQL reporting to the syslog.

Since the version 5.1.20 MySQL is capable to log to the syslog [1], [2]. This is done by the MySQL angel process mysqld_safe.

You can enable the syslog when you add the syslog parameter to the MySQL configuration file (my.cnf) in the mysqld_safe section:

[mysqld_safe]
syslog


Currently MySQL is not capable to log to more than one logging facility at the same time. So you have to decide if you want to log either to the error log or to the syslog.

If you specify both, syslog and error-log, at the same time you will receive an error message if you start mysqld like this:

bin/mysqld_safe --defaults-file=/etc/my.cnf

But I assume that most of the MySQL users are using some kind of start/stop wrapper script like the mysql.server as follows:

/etc/init.d/mysql start

or

rcmysql start

So you will never see the error message indicating you having a conflict between the syslog and the error log. And you are possibly wondering why it is not logging to the syslog. For this problem I have filed a bug report [3].

If you cannot wait for the fix, this excerpt of mysqld_safe should help [8]:

364 if [ $want_syslog -eq 1 ]
365 then
366 # User explicitly asked for syslog, so warn that it isn't used
367 logging=file # This line you have to add!
368 log_error "Can't log to error log and syslog at the same time. Remove all --log-error configuration options for --syslog to take effect."
369 fi


Logging to an other logging facility than daemon


Mysqld_safe uses the logger command to log the error messages to the syslog. With ps you will find a command which looks like this:

logger -t mysqld -p daemon.error

The logger command uses 2 parameters -t for tag and -p for priority. The tag can be influenced with the syslog-tag parameter [4].
The priority parameter configures into which facility and on which level the message should be logged.

An exceprt from the logger man page [7]:

Enter the message with the specified priority. The priority may be specified numerically or as a ''facility.level'' pair. For example, ''-p local3.info'' logs the message(s) as informational level in the local3 facility. The default is ''user.notice.''

Unfortunately the logging facility is hard-coded in the the mysqld_safe script as daemon.error and daemon.notice. It would be nice to have this parameter configurable as well thus I have filed a feature request for it [5].

When you cannot wait, this code snippets from mysqld_safe possibly will help you [8]:

26a27
> syslog_facility=daemon
110c111
<>&2
---
> log_generic $syslog_facility.error "$@" >&2
114c115
<> log_generic $syslog_facility.notice "$@"
128c129
< cmd="">&1 | logger -t '$syslog_tag_mysqld' -p daemon.error"
---
> cmd="$cmd 2>&1 | logger -t '$syslog_tag_mysqld' -p '$syslog_facility'.error"
189a191
> --syslog-facility=*) syslog_facility="$val" ;;
366a369
> logging=file


With the modified mysqld_safe you can change your MySQL configuration file (my.cnf) file as follows:

[mysqld_safe]
syslog
syslog-facility = local3


to log to the local3 logging facility for example.

To activate this facility you possibly have to adapt your syslog configuration file (in my case: /etc/rsyslog.d/50-default.conf, in other cases: /etc/syslog.conf) as follows:

local3.* /var/log/database

To make these changes active you have to restart the syslog daemon:

kill -KILL $(cat /var/run/rsyslogd.pid)

In my case kill -HUP was not strong enough because it did NOT display configuration errors in the log file.

After the restart of the syslog daemon you should find the created empty log file and you can test if the logging works with the following command:

logger -p local3.info test

Now MySQL should log everything to your syslog facility local3.

How to make MySQL logging to the error log AND the syslog facility?


Under normal circumstances mysqld_safe can not log to more than one logging facility.

With the following syslog configuration you can make MySQL logging to both facilities, the error log AND the syslog. Add the following lines to your syslog configuration file:

$FileOwner mysql
$FileGroup dba
local3.* /home/mysql/product/mysql-5.1.42/data/error.log

# Set the owner and group back to its original values
$FileOwner syslog
$FileGroup adm
local3.* /var/log/database


Then restart the syslog daemon as described above and you will see logging to both location. The only drawback is, that the messages in the MySQL error log look like the typical syslog messages and not like the typical MySQL error log messages any more. But this should be somehow configurable with the syslog parameters in its configuration file [9].

If you need more assistance with logging to the syslog please feel free and drop me a line.

Literature


[1] MySQL error log
[2] mysqld_safe syslog parameter
[3] Bug #50083: error-log and syslog conflict in mysqld_safe is not reported to the log file.
[4] mysqld_safe syslog-tag parameter
[5] Bug #50080: syslog priority cannot be configured
[6] Writing syslog messages to MySQL
[7] Logger man page
[8] Modified mysqld_safe for advance syslog logging
[9] Examples