
7 MySQL Table Types

As of MySQL Version 3.23.6, you can choose between three basic table formats (ISAM, HEAP and MyISAM). Newer MySQL versions may support additional table types (BDB or InnoDB), depending on how you compile them. When you create a new table, you can tell MySQL which table type it should use for the table. MySQL will always create a .frm file to hold the table and column definitions. Depending on the table type, the index and data will be stored in other files.

To use InnoDB tables, you must specify at least the innodb_data_file_path startup option. See section 7.6.2 InnoDB startup options.

The default table type in MySQL is MyISAM. If you are trying to use a table type that is not compiled-in or activated, MySQL will instead create a table of type MyISAM. This is a very useful feature when you want to copy tables between different SQL servers that support different table types (like copying tables to a slave that is optimized for speed by not having transactional tables). This automatic table changing can, however, also be very confusing for new MySQL users. We plan to fix this in MySQL 4.0 by giving a warning when a table type is automatically changed.

You can convert tables between different types with the ALTER TABLE statement. See section 6.5.4 ALTER TABLE Syntax.

Note that MySQL supports two different kinds of tables: transaction-safe tables (BDB, InnoDB) and not transaction-safe tables (HEAP, ISAM, MERGE, and MyISAM).

Advantages of transaction-safe tables (TST):

Safer. Even if MySQL crashes or you get hardware problems, you can get your data back, either by automatic recovery or from a backup plus the transaction log.
You can combine many statements and accept these all in one go with the COMMIT command.
You can execute ROLLBACK to ignore your changes (if you are not running in auto commit mode).
If an update fails, all your changes will be restored. (With NTST tables, all changes that have taken place are permanent.)

Advantages of not transaction-safe tables (NTST):

Much faster as there is no transaction overhead.
Will use less disk space as there is no overhead of transactions.
Will use less memory to do updates.

You can combine TST and NTST tables in the same statements to get the best of both worlds.

7.1 MyISAM Tables

MyISAM is the default table type in MySQL Version 3.23. It is based on the ISAM code and has a lot of useful extensions.

The index is stored in a file with the .MYI (MYIndex) extension, and the data is stored in a file with the .MYD (MYData) extension. You can check and repair MyISAM tables with the myisamchk utility. See section 4.4.6.7 Using myisamchk for Crash Recovery. You can compress MyISAM tables with myisampack to make them smaller. See section 4.7.4 myisampack, The MySQL Compressed Read-only Table Generator.

The following is new in MyISAM:

There is a flag in the MyISAM file that indicates whether or not the table was closed correctly. If mysqld is started with --myisam-recover, MyISAM tables will automatically be checked and/or repaired on open if the table wasn't closed properly.
You can INSERT new rows in a table that doesn't have free blocks in the middle of the data file, at the same time other threads are reading from the table (concurrent insert). A free block can come from an update of a dynamic length row with much data to a row with less data or when deleting rows. When all free blocks are used up, all future inserts will be concurrent again.
Support for big files (63-bit), provided that the filesystem/operating system supports big files.
All data is stored with the low byte first. This makes the data machine and OS independent. The only requirement is that the machine uses two's-complement signed integers (as every machine for the last 20 years has) and IEEE floating-point format (also totally dominant among mainstream machines). The only area of machines that may not support binary compatibility are embedded systems (because they sometimes have peculiar processors). There is no big speed penalty in storing data low byte first; the bytes in a table row are normally unaligned and it doesn't take that much more power to read an unaligned byte in order than in reverse order. The actual fetch-column-value code is also not time critical compared to other code.
All number keys are stored with the high byte first. This gives better index compression.
Internal handling of one AUTO_INCREMENT column. MyISAM will automatically update this on INSERT/UPDATE. The AUTO_INCREMENT value can be reset with myisamchk. This will make AUTO_INCREMENT columns faster (at least 10 %) and old numbers will not be reused as with the old ISAM. Note that when an AUTO_INCREMENT is defined on the end of a multi-part-key the old behavior is still present.
When inserted in sorted order (as when you are using an AUTO_INCREMENT column) the key tree will be split so that the high node only contains one key. This will improve the space utilization in the key tree.
BLOB and TEXT columns can be indexed.
NULL values are allowed in indexed columns. This takes 0-1 bytes/key.
Maximum key length is currently 500 bytes by default (this can be changed by recompiling). In cases of keys longer than 250 bytes, a bigger key block size than the default of 1024 bytes is used for this key.
Maximum number of keys/table is 32 as default. This can be enlarged to 64 without having to recompile myisamchk.
myisamchk will mark tables as checked if one runs it with --update-state. myisamchk --fast will only check those tables that don't have this mark.
myisamchk -a stores statistics for key parts (and not only for whole keys as in ISAM).
Dynamic size rows will now be much less fragmented when mixing deletes with updates and inserts. This is done by automatically combining adjacent deleted blocks and by extending blocks if the next block is deleted.
myisampack can pack BLOB and VARCHAR columns.
You can put the data file and index file in different directories to get more speed (with the DATA/INDEX DIRECTORY="path" option to CREATE TABLE). See section 6.5.3 CREATE TABLE Syntax.

MyISAM also supports the following things, which MySQL will be able to use in the near future:

Support for a true VARCHAR type; a VARCHAR column starts with a length stored in 2 bytes.
Tables with VARCHAR may have fixed or dynamic record length.
VARCHAR and CHAR may be up to 64K. All key segments have their own language definition. This will enable MySQL to have different language definitions per column.
A hashed computed index can be used for UNIQUE. This will allow you to have UNIQUE on any combination of columns in a table. (You can't search on a UNIQUE computed index, however.)

Note that index files are usually much smaller with MyISAM than with ISAM. This means that MyISAM will normally use less system resources than ISAM, but will need more CPU when inserting data into a compressed index.

The following options to mysqld can be used to change the behavior of MyISAM tables. See section 4.5.5.4 SHOW VARIABLES.

Option	Meaning
`--myisam-recover=#`	Automatic recovery of crashed tables.
`-O myisam_sort_buffer_size=#`	Buffer used when recovering tables.
`--delay-key-write-for-all-tables`	Don't flush key buffers between writes for any MyISAM table.
`-O myisam_max_extra_sort_file_size=#`	Used to help MySQL decide when to use the slow but safe key cache index create method. NOTE that this parameter is given in megabytes!
`-O myisam_max_sort_file_size=#`	Don't use the fast sort index method to create an index if the temporary file would get bigger than this. NOTE that this parameter is given in megabytes!

The automatic recovery is activated if you start mysqld with --myisam-recover=#. See section 4.1.1 mysqld Command-line Options. On open, the table is checked if it's marked as crashed or if the open count variable for the table is not 0 and you are running with --skip-locking. If either of the above is true, the following happens.

The table is checked for errors.
If an error is found, MySQL tries a fast repair (with sorting and without re-creating the data file) of the table.
If the repair fails because of an error in the data file (for example a duplicate key error), MySQL tries again, but this time it re-creates the data file.
If the repair fails, MySQL retries once more with the old repair method (write row by row without sorting), which should be able to repair any type of error with little disk requirements.

If the recovery can't recover all rows from a previously completed statement and you didn't specify FORCE as an option to myisam-recover, then the automatic repair will abort with an error message in the error file:

Error: Couldn't repair table: test.g00pages

If you had used the FORCE option in this case, you would instead have got a warning in the error file:

Warning: Found 344 of 354 rows when repairing ./test/g00pages

Note that if you run automatic recovery with the BACKUP option, you should have a cron script that automatically moves files with names like `tablename-datetime.BAK' from the database directories to backup media.

See section 4.1.1 mysqld Command-line Options.

7.1.1 Space Needed for Keys

MySQL can support different index types, but the normal type is ISAM or MyISAM. These use a B-tree index, and you can roughly calculate the size for the index file as (key_length+4)/0.67, summed over all keys. (This is for the worst case when all keys are inserted in sorted order and we don't have any compressed keys.)
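The rule of thumb above can be turned into a quick calculation. The following Python sketch is only an illustration: the function name and parameters are hypothetical, and it assumes that "summed over all keys" means one entry per row for every index on the table.

```python
def estimate_index_bytes(key_lengths, row_count, pointer_bytes=4, fill_factor=0.67):
    """Rough worst-case index file size in bytes.

    key_lengths: length in bytes of each index defined on the table.
    row_count:   number of rows (assumed: one key entry per row per index).
    Applies (key_length + 4) / 0.67 to every key entry, as in the text.
    """
    per_row = sum((length + pointer_bytes) / fill_factor for length in key_lengths)
    return per_row * row_count


if __name__ == "__main__":
    # Example: one 4-byte integer key and one 20-byte string key, one million rows.
    print(round(estimate_index_bytes([4, 20], 1_000_000)), "bytes (worst case)")
```

Compressed keys usually make the real file smaller than this estimate.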

String indexes are space compressed. If the first index part is a string, it will also be prefix compressed. Space compression makes the index file smaller than the above figures if the string column has a lot of trailing space or is a VARCHAR column that is not always used to the full length. Prefix compression is used on keys that start with a string. Prefix compression helps if there are many strings with an identical prefix.

In MyISAM tables, you can also prefix compress numbers by specifying PACK_KEYS=1 when you create the table. This helps when you have many integer keys that have an identical prefix when the numbers are stored high-byte first.

7.1.2 MyISAM Table Formats

MyISAM supports 3 different table types. Two of them are chosen automatically depending on the type of columns you are using. The third, compressed tables, can only be created with the myisampack tool.

7.1.2.1 Static (Fixed-length) Table Characteristics

This is the default format. It's used when the table contains no VARCHAR, BLOB, or TEXT columns.

This is the simplest and most secure format. It is also the fastest of the on-disk formats, because the data is easy to find on disk. Looking something up with an index in the static format is very simple: just multiply the row number by the row length.

Also, when scanning a table it is very easy to read a constant number of records with each disk read.

The security is evidenced if your computer crashes when writing to a fixed-size MyISAM file, in which case myisamchk can easily figure out where each row starts and ends. So it can usually reclaim all records except the partially written one. Note that in MySQL all indexes can always be reconstructed:

All CHAR, NUMERIC, and DECIMAL columns are space-padded to the column width.
Very quick.
Easy to cache.
Easy to reconstruct after a crash, because records are located in fixed positions.
Doesn't have to be reorganized (with myisamchk) unless a huge number of records are deleted and you want to return free disk space to the operating system.
Usually requires more disk space than dynamic tables.

7.1.2.2 Dynamic Table Characteristics

This format is used if the table contains any VARCHAR, BLOB, or TEXT columns or if the table was created with ROW_FORMAT=dynamic.

This format is a little more complex because each row has a header that records how long the row is. One record can also end up at more than one location when it is made longer by an update.

You can use OPTIMIZE TABLE or myisamchk to defragment a table. If you have static data that you access/change a lot in the same table as some VARCHAR or BLOB columns, it might be a good idea to move the dynamic columns to other tables just to avoid fragmentation:

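All string columns are dynamic (except those shorter than 4 bytes).
Each record is preceded by a bitmap indicating which string columns are empty ('') and which numeric columns are zero. (This is not the same as a column containing a NULL value.) If a string column has a length of zero after removal of trailing spaces, or a numeric column has a value of zero, it is marked in the bitmap and not saved to disk. Non-empty strings are saved as a length byte plus the string contents.
Usually takes less disk space than fixed-length tables.
Each record uses only as much space as is required. If a record becomes larger, it is split into as many pieces as are required. This results in record fragmentation.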
If you update a row with information that extends the row length, the row will be fragmented. In this case, you may have to run myisamchk -r from time to time to get better performance. Use myisamchk -ei tbl_name for some statistics.
Not as easy to reconstruct after a crash, because a record may be fragmented into many pieces and a link (fragment) may be missing.
The expected row length is:
```
3
+ (number of columns + 7) / 8
+ (number of char columns)
+ packed size of numeric columns
+ length of strings
+ (number of NULL columns + 7) / 8
```
There is a penalty of 6 bytes for each link. A dynamic record is linked whenever an update causes an enlargement of the record. Each new link will be at least 20 bytes, so the next enlargement will probably go in the same link. If not, there will be another link. You may check how many links there are with myisamchk -ed. All links may be removed with myisamchk -r.
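The expected row length formula and the per-link penalty can be combined into a small calculator. This Python sketch is a hedged illustration, not MyISAM code: the function and parameter names are hypothetical, and it assumes the divisions by 8 are integer divisions (bitmaps rounded up to whole bytes).

```python
def expected_dynamic_row_length(n_columns, n_char_columns, packed_numeric_bytes,
                                string_bytes, n_null_columns, n_links=0):
    """Expected on-disk length in bytes of one dynamic-format row.

    Follows the formula in the text, plus 6 bytes for each link
    created when an update enlarged the record.
    """
    length = (
        3
        + (n_columns + 7) // 8        # bitmap of empty/zero columns (assumed integer division)
        + n_char_columns              # one length byte per char column
        + packed_numeric_bytes        # packed size of numeric columns
        + string_bytes                # total length of the stored strings
        + (n_null_columns + 7) // 8   # bitmap for NULL columns (assumed integer division)
    )
    return length + 6 * n_links


if __name__ == "__main__":
    # 5 columns, 2 of them char columns holding 30 bytes of text in total,
    # 8 bytes of packed numeric data, 1 NULL column, no links.
    print(expected_dynamic_row_length(5, 2, 8, 30, 1))
```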

7.1.2.3 Compressed Table Characteristics

This is a read-only type that is generated with the optional myisampack tool (pack_isam for ISAM tables):

All MySQL distributions, even those that existed before MySQL went GPL, can read tables that were compressed with myisampack.
Compressed tables take very little disk space. This minimizes disk usage, which is very nice when using slow disks (like CD-ROMs).
Each record is compressed separately (very little access overhead). The header for a record is fixed (1-3 bytes) depending on the biggest record in the table. Each column is compressed differently. Some of the compression types are:
- There is usually a different Huffman table for each column.
- Suffix space compression.
- Prefix space compression.
- Numbers with value 0 are stored using 1 bit.
- If values in an integer column have a small range, the column is stored using the smallest possible type. For example, a BIGINT column (8 bytes) may be stored as a TINYINT column (1 byte) if all values are in the range 0 to 255.
- If a column has only a small set of possible values, the column type is converted to ENUM.
- A column may use a combination of the above compressions.
Can handle fixed- or dynamic-length records, but not BLOB or TEXT columns.
Can be uncompressed with myisamchk.
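To illustrate the "smallest possible type" idea from the list above, the following Python sketch picks the narrowest integer type that can hold every value in a column. It is not myisampack's actual algorithm: the function name is hypothetical, and it assumes non-negative values stored unsigned, as in the 0 to 255 example.

```python
# MySQL integer types and their storage sizes in bytes.
INTEGER_TYPES = [
    ("TINYINT", 1),
    ("SMALLINT", 2),
    ("MEDIUMINT", 3),
    ("INT", 4),
    ("BIGINT", 8),
]


def smallest_unsigned_type(values):
    """Return (type_name, bytes) of the narrowest type holding all values."""
    if not values:
        raise ValueError("need at least one value")
    if min(values) < 0:
        raise ValueError("this sketch only handles non-negative values")
    largest = max(values)
    for name, size in INTEGER_TYPES:
        if largest < 256 ** size:  # an unsigned type of `size` bytes holds 0 .. 256**size - 1
            return name, size
    raise ValueError("value does not fit in 8 bytes")


if __name__ == "__main__":
    print(smallest_unsigned_type([0, 17, 255]))
    print(smallest_unsigned_type([0, 256]))
```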

7.1.3 MyISAM table problems

The file format that MySQL uses to store data has been extensively tested, but there are always circumstances that may cause database tables to become corrupted.

7.1.3.1 Corrupted MyISAM tables

Even though the MyISAM table format is very reliable (all changes to a table are written before the SQL statement returns), you can still get corrupted tables if any of the following things happen:

The mysqld process being killed in the middle of a write.
Unexpected shutdown of the computer (for example, if the computer is turned off).
A hardware error.
You are using an external program (like myisamchk) on a live table.
A software bug in the MySQL or MyISAM code.

Typical symptoms of a corrupt table are:

You get the error "Incorrect key file for table: '...'. Try to repair it" while selecting data from the table.
Queries don't find rows in the table or return incomplete data.

You can check if a table is ok with the command CHECK TABLE. See section 4.4.4 CHECK TABLE Syntax.

You can repair a corrupted table with REPAIR TABLE. See section 4.4.5 REPAIR TABLE Syntax. You can also repair a table with the myisamchk command when mysqld is not running. See myisamchk syntax.

If your tables get corrupted a lot, you should try to find the reason for this! See section A.4.1 What To Do If MySQL Keeps Crashing.

In this case the most important thing to know is whether the table got corrupted because mysqld died (you can easily verify this by checking whether there is a recent "restarted mysqld" row in the mysqld error file). If this isn't the case, then you should try to make a test case of this. See section G.1.6 Making a test case when you experience table corruption.

7.1.3.2 Clients is using or hasn't closed the table properly

Each MyISAM .MYI file has in the header a counter that can be used to check if a table has been closed properly.

If you get the following warning from CHECK TABLE or myisamchk:

# clients is using or hasn't closed the table properly

this means that this counter has come out of sync. This doesn't mean that the table is corrupted, but means that you should at least do a check on the table to verify that it's ok.

The counter works as follows:

The first time a table is updated in MySQL, a counter in the header of the index files is incremented.
The counter is not changed during further updates.
When the last instance of a table is closed (because of a FLUSH or because there isn't room in the table cache) the counter is decremented if the table has been updated at any point.
When you repair the table or check the table and it was ok, the counter is reset to 0.
To avoid problems with interaction with other processes that may do a check on the table, the counter is not decremented on close if it was 0.

In other words, the only ways this can go out of sync are:

The MyISAM tables are copied without a LOCK and FLUSH TABLES.
MySQL has crashed between an update and the final close (Note that the table may still be ok, as MySQL always issues writes for everything between each statement).
Someone has done a myisamchk --repair or myisamchk --update-state on a table that was in use by mysqld.
Many mysqld servers are using the table and one has done a REPAIR or CHECK of the table while it was in use by another server. In this setup the CHECK is safe to do (even if you will get the warning from other servers), but REPAIR should be avoided as it currently replaces the data file with a new one, which is not signaled to the other servers.
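The counter rules described in this section can be modelled in a few lines. The following Python sketch is a simplified simulation under stated assumptions, not MyISAM code: the class and method names are hypothetical, and a crash is modelled as losing the in-memory state while the header value stays on disk.

```python
class OpenCounter:
    """Toy model of the 'closed properly' counter in a .MYI header."""

    def __init__(self):
        self.count = 0        # value stored in the index file header
        self.updated = False  # in-memory: has this server updated the table since open?

    def update(self):
        # Only the first update after open increments the counter.
        if not self.updated:
            self.count += 1
            self.updated = True

    def close_last_instance(self):
        # Decrement only if the table was updated, and never below 0.
        if self.updated and self.count > 0:
            self.count -= 1
        self.updated = False

    def check_ok_or_repair(self):
        # A successful check or a repair resets the counter.
        self.count = 0

    def crash(self):
        # The server dies: in-memory state is lost, the header is not corrected.
        self.updated = False

    def warns(self):
        # A non-zero counter on an unused table triggers the warning.
        return self.count != 0


if __name__ == "__main__":
    table = OpenCounter()
    table.update()
    table.crash()
    print("warning after crash:", table.warns())
    table.check_ok_or_repair()
    print("warning after check:", table.warns())
```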

7.2 MERGE Tables

MERGE tables are new in MySQL Version 3.23.25. The code is still in gamma, but should be reasonably stable.

A MERGE table is a collection of identical MyISAM tables that can be used as one. You can only SELECT, DELETE, and UPDATE from the collection of tables. If you DROP the MERGE table, you are only dropping the MERGE specification.

Note that DELETE FROM merge_table used without a WHERE will only clear the mapping for the table, not delete everything in the mapped tables. (We plan to fix this in 4.0).

By identical tables we mean that all tables are created with identical column and key information. You can't put a MERGE over tables where the columns are packed differently, that don't have exactly the same columns, or that have the keys in a different order. Some of the tables can, however, be compressed with myisampack. See section 4.7.4 myisampack, The MySQL Compressed Read-only Table Generator.

When you create a MERGE table, you will get a .frm table definition file and a .MRG table list file. The .MRG just contains a list of the index files (.MYI files) that should be used as one. All used tables must be in the same database as the MERGE table itself.

For the moment you need to have SELECT, UPDATE, and DELETE privileges on the tables you map to a MERGE table.

MERGE tables can help you solve the following problems:

Easily manage a set of log tables. For example, you can put data from different months into separate files, compress some of them with myisampack, and then create a MERGE to use these as one.
Give you more speed. You can split a big read-only table based on some criteria and then put the different table parts on different disks. A MERGE table on this could be much faster than using the big table. (You can, of course, also use a RAID to get the same kind of benefits.)
Do more efficient searches. If you know exactly what you are looking for, you can search in just one of the split tables for some queries and use the MERGE table for others. You can even have many different MERGE tables active, with possibly overlapping files.
More efficient repairs. It's easier to repair the individual files that are mapped to a MERGE file than trying to repair a really big file.
Instant mapping of many files as one. A MERGE table uses the index of the individual tables. It doesn't need to maintain an index of its own. This makes MERGE table collections VERY fast to make or remap. Note that you must specify the key definitions when you create a MERGE table!
If you have a set of tables that you join to a big table on demand or batch, you should instead create a MERGE table on them on demand. This is much faster and will save a lot of disk space.
Go around the file size limit for the operating system.
You can create an alias/synonym for a table by just using MERGE over one table. There shouldn't be any really notable performance impacts of doing this (only a couple of indirect calls and memcpy's for each read).

The disadvantages with MERGE tables are:

You can't use INSERT on MERGE tables, as MySQL can't know in which of the tables we should insert the row.
You can only use identical MyISAM tables for a MERGE table.
MERGE tables use more file descriptors. If you are using a MERGE that maps over 10 tables and 10 users are using this, you are using 10*10 + 10 file descriptors. (10 data files for 10 users and 10 shared index files.)
Key reads are slower. When you do a read on a key, the MERGE handler will need to issue a read on all underlying tables to check which one most closely matches the given key. If you then do a 'read-next' then the merge table handler will need to search the read buffers to find the next key. Only when one key buffer is used up, the handler will need to read the next key block. This makes MERGE keys much slower on eq_ref searches, but not much slower on ref searches. See section 5.2.1 EXPLAIN Syntax (Get Information About a SELECT).
You can't do DROP TABLE, ALTER TABLE or DELETE FROM table_name without a WHERE clause on any of the tables that are mapped by a MERGE table that is 'open'. If you do this, the MERGE table may still refer to the original table and you will get unexpected results.
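The file descriptor arithmetic from the list above generalizes to any number of mapped tables and concurrent users. A small Python sketch, assuming (as the example in the text does) one data file descriptor per mapped table per user plus one shared index file descriptor per mapped table; the function name is hypothetical:

```python
def merge_file_descriptors(n_tables, n_users):
    """File descriptors used by one MERGE table.

    Each user opens every data file; the index files are shared.
    """
    data_descriptors = n_tables * n_users
    index_descriptors = n_tables
    return data_descriptors + index_descriptors


if __name__ == "__main__":
    # The example from the text: 10 mapped tables, 10 users.
    print(merge_file_descriptors(10, 10))
```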

The following example shows you how to use MERGE tables:

CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, message CHAR(20));
CREATE TABLE t2 (a INT AUTO_INCREMENT PRIMARY KEY, message CHAR(20));
INSERT INTO t1 (message) VALUES ("Testing"),("table"),("t1");
INSERT INTO t2 (message) VALUES ("Testing"),("table"),("t2");
CREATE TABLE total (a INT NOT NULL, message CHAR(20), KEY(a)) TYPE=MERGE UNION=(t1,t2);

Note that we didn't create a UNIQUE or PRIMARY KEY in the total table as the key isn't going to be unique in the total table.

Note that you can also manipulate the .MRG file directly from the outside of the MySQL server:

shell> cd /mysql-data-directory/current-database
shell> ls -1 t1.MYI t2.MYI > total.MRG
shell> mysqladmin flush-tables

Now you can do things like:

mysql> select * from total;
+---+---------+
| a | message |
+---+---------+
| 1 | Testing |
| 2 | table   |
| 3 | t1      |
| 1 | Testing |
| 2 | table   |
| 3 | t2      |
+---+---------+

To remap a MERGE table you can do one of the following:

DROP the table and re-create it
Use ALTER TABLE table_name UNION(...)
Change the .MRG file and issue a FLUSH TABLE on the MERGE table and all underlying tables to force the handler to read the new definition file.

7.3 ISAM Tables

You can also use the deprecated ISAM table type. This will disappear rather soon because MyISAM is a better implementation of the same thing. ISAM uses a B-tree index. The index is stored in a file with the .ISM extension, and the data is stored in a file with the .ISD extension. You can check/repair ISAM tables with the isamchk utility. See section 4.4.6.7 Using myisamchk for Crash Recovery.

ISAM has the following features/properties:

Compressed and fixed-length keys
Fixed and dynamic record length
16 keys with 16 key parts/key
Max key length 256 (default)
Data is stored in machine format; this is fast, but is machine/OS dependent.

Most of the things true for MyISAM tables are also true for ISAM tables. See section 7.1 MyISAM Tables. The major differences compared to MyISAM tables are:

ISAM tables are not binary portable across OS/Platforms.
Can't handle tables > 4G.
Only support prefix compression on strings.
Smaller key limits.
Dynamic tables get more fragmented.
Tables are compressed with pack_isam rather than with myisampack.

If you want to convert an ISAM table to a MyISAM table so that you can use utilities such as mysqlcheck, use an ALTER TABLE statement:

mysql> ALTER TABLE tbl_name TYPE = MYISAM;

7.4 HEAP Tables

HEAP tables use a hashed index and are stored in memory. This makes them very fast, but if MySQL crashes you will lose all data stored in them. HEAP is very useful for temporary tables!

The MySQL internal HEAP tables use 100% dynamic hashing without overflow areas. There is no extra space needed for free lists. HEAP tables also don't have problems with delete + inserts, which normally is common with hashed tables:

mysql> CREATE TABLE test TYPE=HEAP SELECT ip,SUM(downloads) as down
        FROM log_table GROUP BY ip;
mysql> SELECT COUNT(ip),AVG(down) FROM test;
mysql> DROP TABLE test;

Here are some things you should consider when you use HEAP tables:

You should always specify MAX_ROWS in the CREATE statement to ensure that you do not accidentally use all memory.
Indexes will only be used with = and <=> (but are VERY fast).
HEAP tables can only use whole keys to search for a row; compare this to MyISAM tables where any prefix of the key can be used to find rows.
HEAP tables use a fixed record length format.
HEAP doesn't support BLOB/TEXT columns.
HEAP doesn't support AUTO_INCREMENT columns.
HEAP doesn't support an index on a NULL column.
You can have non-unique keys in a HEAP table (this isn't common for hashed tables).
HEAP tables are shared between all clients (just like any other table).
You can't search for the next entry in order (that is, to use the index to do an ORDER BY).
Data for HEAP tables are allocated in small blocks. The tables are 100% dynamic (on inserting). No overflow areas and no extra key space are needed. Deleted rows are put in a linked list and are reused when you insert new data into the table.
You need enough extra memory for all HEAP tables that you want to use at the same time.
To free memory, you should execute DELETE FROM heap_table, TRUNCATE heap_table or DROP TABLE heap_table.
MySQL cannot find out approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a HEAP table.
To ensure that you accidentally don't do anything foolish, you can't create HEAP tables bigger than max_heap_table_size.

The memory needed for one row in a HEAP table is:

SUM_OVER_ALL_KEYS(max_length_of_key + sizeof(char*) * 2)
+ ALIGN(length_of_row+1, sizeof(char*))

sizeof(char*) is 4 on 32-bit machines and 8 on 64-bit machines.
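The per-row formula above can be evaluated for a concrete table. This Python sketch is an illustration only: the function names are hypothetical, and it assumes ALIGN(n, size) rounds n up to the next multiple of size.

```python
def align(n, boundary):
    """Round n up to the next multiple of boundary (assumed meaning of ALIGN)."""
    return (n + boundary - 1) // boundary * boundary


def heap_row_bytes(key_lengths, row_length, pointer_size=8):
    """Memory needed for one HEAP row, following the formula in the text.

    key_lengths:  maximum length in bytes of each key on the table.
    row_length:   length of the fixed-size row in bytes.
    pointer_size: sizeof(char*), 4 on 32-bit and 8 on 64-bit machines.
    """
    key_bytes = sum(length + pointer_size * 2 for length in key_lengths)
    return key_bytes + align(row_length + 1, pointer_size)


if __name__ == "__main__":
    # One 4-byte key and a 24-byte row, on 64-bit and 32-bit machines.
    print(heap_row_bytes([4], 24, pointer_size=8))
    print(heap_row_bytes([4], 24, pointer_size=4))
```

Multiplying the result by MAX_ROWS gives a rough upper bound to compare against max_heap_table_size.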

7.5 BDB or Berkeley_DB Tables

7.5.1 Overview of BDB Tables

Support for BDB tables is included in the MySQL source distribution starting from Version 3.23.34 and is activated in the MySQL-Max binary.

BerkeleyDB, available at http://www.sleepycat.com/, has provided MySQL with a transactional table handler. By using BerkeleyDB tables, your tables may have a greater chance of surviving crashes, and you also get COMMIT and ROLLBACK on transactions. The MySQL source distribution comes with a BDB distribution that has a couple of small patches to make it work more smoothly with MySQL. You can't use a non-patched BDB version with MySQL.

We at MySQL AB are working in close cooperation with Sleepycat to keep the quality of the MySQL/BDB interface high.

When it comes to supporting BDB tables, we are committed to helping our users locate the problem and create a reproducible test case for any problems involving BDB tables. Any such test case will be forwarded to Sleepycat, who in turn will help us find and fix the problem. As this is a two-stage operation, any problems with BDB tables may take a little longer for us to fix than for other table handlers. However, as the BerkeleyDB code itself has been used by many other applications than MySQL, we don't envision any big problems with this. See section 1.3.5.6 Support for other table handlers.

7.5.2 Installing BDB

If you have downloaded a binary version of MySQL that includes support for BerkeleyDB, simply follow the instructions for installing a binary version of MySQL. See section M.1 Installing a MySQL Binary Distribution. See section 4.7.5 mysqld-max, An extended mysqld server.

To compile MySQL with Berkeley DB support, download MySQL Version 3.23.34 or newer and configure MySQL with the --with-berkeley-db option. See section 2.3 Installing a MySQL Source Distribution.

cd /path/to/source/of/mysql-3.23.34
./configure --with-berkeley-db

Please refer to the manual provided with the BDB distribution for more updated information.

Even though Berkeley DB is in itself well tested and reliable, the MySQL interface is still considered beta quality. We are actively improving and optimizing it to get it stable very soon.

7.5.3 BDB startup options

If you are running with AUTOCOMMIT=0 then your changes in BDB tables will not be updated until you execute COMMIT. Instead of COMMIT you can execute ROLLBACK to forget your changes. See section 6.7.1 BEGIN/COMMIT/ROLLBACK Syntax.

If you are running with AUTOCOMMIT=1 (the default), your changes will be committed immediately. You can start an extended transaction with the BEGIN WORK SQL command, after which your changes will not be committed until you execute COMMIT (or decide to ROLLBACK the changes).

The following options to mysqld can be used to change the behavior of BDB tables:

Option	Meaning
`--bdb-home=directory`	Base directory for BDB tables. This should be the same directory you use for --datadir.
`--bdb-lock-detect=#`	Berkeley lock detect. One of (DEFAULT, OLDEST, RANDOM, or YOUNGEST).
`--bdb-logdir=directory`	Berkeley DB log file directory.
`--bdb-no-sync`	Don't synchronously flush logs.
`--bdb-no-recover`	Don't start Berkeley DB in recover mode.
`--bdb-shared-data`	Start Berkeley DB in multi-process mode (Don't use `DB_PRIVATE` when initializing Berkeley DB)
`--bdb-tmpdir=directory`	Berkeley DB tempfile name.
`--skip-bdb`	Don't use Berkeley DB.
`-O bdb_max_lock=1000`	Set the maximum number of locks possible. See section 4.5.5.4 `SHOW VARIABLES`.

If you use --skip-bdb, MySQL will not initialize the Berkeley DB library and this will save a lot of memory. Of course, you cannot use BDB tables if you are using this option.

Normally you should start mysqld without --bdb-no-recover if you intend to use BDB tables. This may, however, give you problems when you try to start mysqld if the BDB log files are corrupted. See section 2.4.2 Problems Starting the MySQL Server.

With bdb_max_lock you can specify the maximum number of locks (10000 by default) you can have active on a BDB table. You should increase this if you get errors of type bdb: Lock table is out of available locks or Got error 12 from ... when you do long transactions or when mysqld has to examine a lot of rows to calculate the query.

You may also want to change binlog_cache_size and max_binlog_cache_size if you are using big multi-line transactions. See section 6.7.1 BEGIN/COMMIT/ROLLBACK Syntax.

7.5.4 Some characteristics of BDB tables

To be able to roll back transactions, BDB maintains log files. For maximum performance you should place these on a different disk from your databases by using the --bdb-logdir option.
MySQL performs a checkpoint each time a new BDB log file is started, and removes any log files that are not needed for current transactions. One can also run FLUSH LOGS at any time to checkpoint the Berkeley DB tables. For disaster recovery, one should use table backups plus MySQL's binary log. See section 4.4.1 Database Backups. Warning: If you delete old log files that are in use, BDB will not be able to do recovery at all and you may lose data if something goes wrong.
MySQL requires a PRIMARY KEY in each BDB table to be able to refer to previously read rows. If you don't create one, MySQL will create and maintain a hidden PRIMARY KEY for you. The hidden key has a length of 5 bytes and is incremented for each insert attempt.
If all columns you access in a BDB table are part of the same index or part of the primary key, then MySQL can execute the query without having to access the actual row. In a MyISAM table the above holds only if the columns are part of the same index.
The PRIMARY KEY will be faster than any other key, as the PRIMARY KEY is stored together with the row data. As the other keys are stored as the key data + the PRIMARY KEY, it's important to keep the PRIMARY KEY as short as possible to save disk and get better speed.
LOCK TABLES works on BDB tables as with other tables. If you don't use LOCK TABLES, MySQL will issue an internal multiple-write lock on the table to ensure that the table will be properly locked if another thread issues a table lock.
Internal locking in BDB tables is done on page level.
SELECT COUNT(*) FROM table_name is slow as BDB tables don't maintain a count of the number of rows in the table.
Scanning is slower than with MyISAM tables as data in BDB tables is stored in B-trees and not in a separate data file.
The application must always be prepared to handle cases where any change of a BDB table may make an automatic rollback and any read may fail with a deadlock error.
Keys are not compressed to previous keys as with ISAM or MyISAM tables. In other words, the key information will take a little more space in BDB tables compared to MyISAM tables which don't use PACK_KEYS=0.
There are often holes in the BDB table to allow you to insert new rows in the middle of the key tree. This makes BDB tables somewhat larger than MyISAM tables.
The optimizer needs to know an approximation of the number of rows in the table. MySQL solves this by counting inserts and maintaining this in a separate segment in each BDB table. If you don't do a lot of DELETEs or ROLLBACKs this number should be accurate enough for the MySQL optimizer, but as MySQL only stores the number on close, it may be wrong if MySQL dies unexpectedly. It should not be fatal even if this number is not 100 % correct. You can update the number of rows by executing ANALYZE TABLE or OPTIMIZE TABLE. See section 4.5.2 ANALYZE TABLE Syntax. See section 4.5.1 OPTIMIZE TABLE Syntax.
If the disk becomes full while you are using a BDB table, you will get an error (probably error 28) and the transaction should roll back. This is in contrast with MyISAM and ISAM tables where mysqld will wait for enough free disk before continuing.
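
The point in the list above, that an application must be prepared for automatic rollbacks and deadlock errors, can be sketched as a retry loop. This is only a minimal Python illustration, not MySQL client code: `DeadlockError`, `run_with_retry` and `flaky_transaction` are hypothetical names standing in for whatever your client library provides.

```python
class DeadlockError(Exception):
    """Hypothetical stand-in for the deadlock error a client library raises."""


def run_with_retry(transaction, max_attempts=3):
    """Run `transaction` (a callable), retrying it when a deadlock is reported.

    The whole transaction is re-run from the start, because the failed
    attempt is assumed to have been rolled back already.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error


# Usage: a fake transaction that deadlocks twice and then succeeds.
calls = {"count": 0}


def flaky_transaction():
    calls["count"] += 1
    if calls["count"] < 3:
        raise DeadlockError("simulated deadlock")
    return "committed"


result = run_with_retry(flaky_transaction)
```

The retry limit is a design choice: an unbounded loop could hide a persistent locking problem, so the sketch re-raises the error after a fixed number of attempts.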

7.5.5 Some things we need to fix for BDB in the near future:

It's very slow to open many BDB tables at the same time. If you are going to use BDB tables, you should not have a very big table cache (> 256 ?) and you should use --no-auto-rehash with the mysql client. We plan to partly fix this in 4.0.
SHOW TABLE STATUS doesn't yet provide that much information for BDB tables.
Optimize performance.
Change to not use page locks at all when we are scanning tables.

7.5.6 Operating systems supported by BDB

If, after having built MySQL with support for BDB tables, you get the following error in the log file when you start mysqld:

bdb: architecture lacks fast mutexes: applications cannot be threaded
Can't init dtabases

This means that BDB tables are not supported for your architecture. In this case you have to rebuild MySQL without BDB table support.

NOTE: The following list is not complete; we will update it as we get more information.

Currently we know that BDB tables work with the following operating systems:

Linux 2.x Intel
Solaris SPARC
SCO OpenServer
SCO UnixWare 7.0.1

It doesn't work with the following operating systems:

Linux 2.x Alpha
Mac OS X

7.5.7 Errors You May Get When Using BDB Tables

If you get the following error in the hostname.err log when starting mysqld:
```
bdb:  Ignoring log file: .../log.XXXXXXXXXX: unsupported log version #
```
it means that the new BDB version doesn't support the old log file format. In this case you have to delete all BDB log files from your database directory (the files that have the format log.XXXXXXXXXX) and restart mysqld. We also recommend that you do a mysqldump --opt of your old BDB tables, delete the old tables, and restore the dump.
If you are not running in auto_commit mode and you delete a table that is in use by another thread, you may get the following error messages in the MySQL error file:
```
001119 23:43:56  bdb:  Missing log fileid entry
001119 23:43:56  bdb:  txn_abort: Log undo failed for LSN: 1 3644744: Invalid
```
This is not fatal but we don't recommend that you delete tables if you are not in auto_commit mode, until this problem is fixed (the fix is not trivial).

7.6 InnoDB Tables

7.6.1 InnoDB Tables Overview

InnoDB has been included in the MySQL source distribution since Version 3.23.34a and is enabled in the MySQL-max binary.

If you have downloaded a binary version of MySQL that includes support for InnoDB (mysqld-max), simply follow the instructions for installing a binary version of MySQL. See section M.1 Installing a MySQL Binary Distribution. See section 4.7.5 mysqld-max, An extended mysqld server.

To compile MySQL with InnoDB support, download MySQL-3.23.37 or newer and configure MySQL with the --with-innodb option. See section 2.3 Installing a MySQL Source Distribution.

cd /path/to/source/of/mysql-3.23.37
./configure --with-innodb

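To get InnoDB running, you have to specify where InnoDB tables are stored with the innodb_data_file_path option. This can be given on the command line or in the MySQL option file. See section 7.6.2 InnoDB Startup Options. If you have compiled InnoDB into MySQL but do not specify this option, mysqld prints the following message at startup: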

Can't initialize InnoDB as 'innodb_data_file_path' is not set

InnoDB provides MySQL with a transaction-safe table handler with commit, rollback, and crash recovery capabilities. InnoDB does locking on row level, and also provides an Oracle-style consistent non-locking read in SELECTs, which increases transaction concurrency. There is no need for lock escalation in InnoDB, because row level locks in InnoDB fit in very small space.

InnoDB has been designed for maximum performance when processing large data volumes. Its CPU efficiency is probably not matched by any other disk-based relational database engine.

You can find the latest information about InnoDB at http://www.innodb.com. The most up-to-date version of the InnoDB manual is always placed there, and you can also order commercial support for InnoDB.

Technically, InnoDB is a database backend placed under MySQL. InnoDB has its own buffer pool for caching data and indexes in main memory. InnoDB stores its tables and indexes in a tablespace, which may consist of several files. This is different from, for example, MyISAM tables where each table is stored as a separate file.

InnoDB is distributed under the GNU GPL License Version 2 (of June 1991). In the source distribution of MySQL, InnoDB appears as a subdirectory.

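7.6.2 InnoDB Startup Options

Beginning with MySQL-3.23.37, the prefix of the options changed from innobase_... to innodb_...!

To use InnoDB tables you must specify configuration parameters in the [mysqld] section of the configuration file `my.cnf'. See section 4.1.2 my.cnf Option Files.

The only required parameter for using InnoDB is innodb_data_file_path; set the other options if you want better performance.

Suppose you have a Windows NT machine with 128M of RAM and a 10GB hard disk. Below is an example of possible InnoDB parameters in `my.cnf' for that case: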

innodb_data_file_path = ibdata1:2000M;ibdata2:2000M
innodb_data_home_dir = c:\ibdata
set-variable = innodb_mirrored_log_groups=1
innodb_log_group_home_dir = c:\iblogs
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=30M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
innodb_log_arch_dir = c:\iblogs
innodb_log_archive=0
set-variable = innodb_buffer_pool_size=80M
set-variable = innodb_additional_mem_pool_size=10M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50

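Data files must be smaller than 4G! (On some operating systems, smaller than 2G.) The total size of the data files has to be >= 10 MB. InnoDB does not create directories automatically: you have to create them yourself.

Suppose you have a Linux machine with 512M of RAM and 20GB hard disks (at directory paths `/', `/dr2', and `/dr3'). Below is an example of possible InnoDB parameters in `my.cnf' for that case: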

innodb_data_file_path = ibdata/ibdata1:2000M;dr2/ibdata/ibdata2:2000M
innodb_data_home_dir = /
set-variable = innodb_mirrored_log_groups=1
innodb_log_group_home_dir = /dr3/iblogs
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=50M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
innodb_log_arch_dir = /dr3/iblogs
innodb_log_archive=0
set-variable = innodb_buffer_pool_size=400M
set-variable = innodb_additional_mem_pool_size=20M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50

Note that we have placed the two data files on different disks. The reason for the name innodb_data_file_path is that you can also specify paths to your data files, and innodb_data_home_dir is just textually catenated before your data file paths, adding a possible slash or backslash in between. InnoDB will fill the tablespace formed by the data files from bottom up. In some cases it will improve the performance of the database if all data is not placed on the same physical disk. Putting log files on a different disk from data is very often beneficial for performance.
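
The textual joining of innodb_data_home_dir and a data file path described above can be sketched as follows. This is a minimal Python illustration under one assumption: the separator is added only when the home directory does not already end with it. The function name `full_data_file_path` is hypothetical and is not part of MySQL or InnoDB.

```python
def full_data_file_path(home_dir, file_path, sep="/"):
    """Textually join home_dir and file_path, adding `sep` only if needed."""
    if home_dir == "" or home_dir.endswith(sep):
        return home_dir + file_path
    return home_dir + sep + file_path


# The Linux example: home dir '/', two data files on different disks.
linux = [full_data_file_path("/", p)
         for p in ("ibdata/ibdata1", "dr2/ibdata/ibdata2")]

# The Windows NT example: home dir 'c:\ibdata', backslash separator.
windows = full_data_file_path("c:\\ibdata", "ibdata1", sep="\\")
```

With the Linux settings this yields `/ibdata/ibdata1` and `/dr2/ibdata/ibdata2`, which is why the two files can end up on different disks even though they share one home directory.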

The meanings of the configuration parameters are the following:

`innodb_data_home_dir`	The common top directory path for all InnoDB data files.
`innodb_data_file_path`	Paths to individual data files and their sizes. (InnoDB splits the data it stores across several files; the individual data files are the files that actually hold that data.) The full directory path to each data file is acquired by concatenating `innodb_data_home_dir` to the paths specified here. The file sizes are specified in megabytes, hence the 'M' after the size specification above. Do not set a file size bigger than 4000M. Most operating systems cannot handle files larger than 2000M. InnoDB also understands the abbreviation 'G'; 1G means 1024M. The sum of the sizes of the data files must be at least 10 MB.
`innodb_mirrored_log_groups`	Number of identical copies of log groups kept for the database. Currently this can only be set to 1.
`innodb_log_group_home_dir`	Directory path to InnoDB log files.
`innodb_log_files_in_group`	Number of log files in the log group. InnoDB writes to the files in a circular fashion. The recommended value is 3.
`innodb_log_file_size`	Size of each log file in a log group, in megabytes. Sensible values range from 1M to the size of the buffer pool specified below. The bigger the value, the less checkpoint flush activity is needed in the buffer pool, saving disk i/o. But bigger log files also mean that recovery will be slower in case of a crash. The file size restriction is the same as for a data file.
`innodb_log_buffer_size`	The size of the buffer which InnoDB uses to write log to the log files on disk. Sensible values range from 1M to half the combined size of log files. A big log buffer allows large transactions to run without a need to write the log to disk until the transaction commit. Thus, if you have big transactions, making the log buffer big will save disk i/o.
`innodb_flush_log_at_trx_commit`	Normally this is set to 1, meaning that at a transaction commit the log is flushed to disk, and the modifications made by the transaction become permanent, and survive a database crash. If you are willing to compromise this safety, and you are running small transactions, you may set this to 0 to reduce disk i/o to the logs.
`innodb_log_arch_dir`	The directory where fully written log files would be archived if we used log archiving. Currently this must be set to the same value as `innodb_log_group_home_dir`.
`innodb_log_archive`	This value should currently be set to 0. As recovery from a backup is done by MySQL using its own log files, there is currently no need to archive InnoDB log files.
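`innodb_buffer_pool_size`	The size of the memory buffer InnoDB uses to cache data and indexes of its tables. The bigger you set this, the less disk i/o is needed to access data in tables. On a dedicated database server you may set this parameter up to 80% of the machine's physical memory size. Do not set it too large, though, because competition for physical memory may cause paging in the operating system.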
`innodb_additional_mem_pool_size`	Size of a memory pool InnoDB uses to store data dictionary information and other internal data structures. A sensible value for this might be 2M, but the more tables you have in your application the more you will need to allocate here. If InnoDB runs out of memory in this pool, it will start to allocate memory from the operating system, and write warning messages to the MySQL error log.
`innodb_file_io_threads`	Number of file i/o threads in InnoDB. Normally, this should be 4, but on Windows NT disk i/o may benefit from a larger number.
`innodb_lock_wait_timeout`	Timeout in seconds an InnoDB transaction may wait for a lock before being rolled back. InnoDB automatically detects transaction deadlocks in its own lock table and rolls back the transaction. If you use `LOCK TABLES` command, or other transaction-safe table handlers than InnoDB in the same transaction, then a deadlock may arise which InnoDB cannot notice. In cases like this the timeout is useful to resolve the situation.
`innodb_flush_method`	(Available from 3.23.40 up.) The default value for this is `fdatasync`. Another option is `O_DSYNC`.
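
The size rules stated for innodb_data_file_path (sizes in 'M' or 'G', 1G = 1024M, no file above 4000M, at least 10M in total) can be checked before starting the server. The following Python sketch is only an illustration of those rules as written in this section; the function names are hypothetical and InnoDB's own parsing may differ in details.

```python
def parse_size_mb(text):
    """Convert a size such as '2000M' or '1G' to megabytes (1G = 1024M)."""
    unit = text[-1].upper()
    number = int(text[:-1])
    if unit == "M":
        return number
    if unit == "G":
        return number * 1024
    raise ValueError("unsupported size suffix in %r" % text)


def parse_data_file_path(value):
    """Split 'path:size;path:size' into a list of (path, size_in_mb) pairs."""
    files = []
    for spec in value.split(";"):
        # Split on the last colon so a drive letter such as 'c:' survives.
        path, size = spec.rsplit(":", 1)
        files.append((path, parse_size_mb(size)))
    return files


def check_data_files(files):
    """Return a list of problems according to the limits stated in the text."""
    problems = []
    for path, size_mb in files:
        if size_mb > 4000:
            problems.append("%s: %dM exceeds the 4000M limit" % (path, size_mb))
    if sum(size_mb for _, size_mb in files) < 10:
        problems.append("total size is below 10M")
    return problems


files = parse_data_file_path("ibdata1:2000M;ibdata2:2000M")
problems = check_data_files(files)
```

The sketch does not check the stricter 2000M limit, because that limit depends on the operating system.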

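7.6.3 Creating the InnoDB Tablespace

Suppose you have installed MySQL and have edited `my.cnf' so that it contains the necessary InnoDB configuration parameters. Before starting MySQL you should check that the directories you have specified for InnoDB data files and log files exist and that the permissions on those directories are correct. InnoDB cannot create directories, only files. Check also that you have enough disk space for the data and log files.

When you now start MySQL, InnoDB will create your data files and log files. InnoDB will print something like the following: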

~/mysqlm/sql > mysqld
InnoDB: The first specified data file /home/heikki/data/ibdata1 did not exist:
InnoDB: a new database to be created!
InnoDB: Setting file /home/heikki/data/ibdata1 size to 134217728
InnoDB: Database physically writes the file full: wait...
InnoDB: Data file /home/heikki/data/ibdata2 did not exist: new to be created
InnoDB: Setting file /home/heikki/data/ibdata2 size to 262144000
InnoDB: Database physically writes the file full: wait...
InnoDB: Log file /home/heikki/data/logs/ib_logfile0 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile0 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile1 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile1 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile2 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile2 size to 5242880
InnoDB: Started
mysqld: ready for connections

A new InnoDB database has now been created. You can connect to the MySQL server with the usual MySQL client programs like mysql. When you shut down the MySQL server with `mysqladmin shutdown', InnoDB output will be like the following:

010321 18:33:34  mysqld: Normal shutdown
010321 18:33:34  mysqld: Shutdown Complete
InnoDB: Starting shutdown...
InnoDB: Shutdown completed

You can now look at the data files and logs directories and you will see the files created. The log directory will also contain a small file named `ib_arch_log_0000000000'. That file resulted from the database creation, after which InnoDB switched off log archiving. When MySQL is started again, the output will be like the following:

~/mysqlm/sql > mysqld
InnoDB: Started
mysqld: ready for connections

7.6.3.1 If Something Goes Wrong in Database Creation

If something goes wrong in an InnoDB database creation, you should delete all files created by InnoDB. This means all data files, all log files, the small archived log file, and, if you have already created some InnoDB tables, also the corresponding `.frm' files for these tables from the MySQL database directories. Then you can try the InnoDB database creation again.

7.6.4 Creating InnoDB Tables

Suppose you have started the MySQL client with the command mysql test. To create a table in the InnoDB format you must specify TYPE = InnoDB in the table creation SQL command:

CREATE TABLE CUSTOMER (A INT, B CHAR (20), INDEX (A)) TYPE = InnoDB;

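This SQL command will create a table and an index on column A into the InnoDB tablespace consisting of the data files you specified in `my.cnf'. In addition MySQL will create a file `CUSTOMER.frm' in the MySQL database directory `test'. Internally, InnoDB will add to its own data dictionary an entry for table 'test/CUSTOMER'. Thus you can create a table of the same name CUSTOMER in another database of MySQL, and the table names will not collide inside InnoDB.

You can query the amount of free space in the InnoDB tablespace by issuing the table status command of MySQL for any table you have created with TYPE = InnoDB. The amount of free space in the tablespace then appears in the table comment section in the output of SHOW. An example: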

SHOW TABLE STATUS FROM test LIKE 'CUSTOMER'

Note that the statistics SHOW gives about InnoDB tables are only approximate: they are used in SQL optimization. Table and index reserved sizes in bytes are accurate, though.

NOTE: DROP DATABASE does not currently work for InnoDB tables! You must drop the tables individually. Also take care not to delete or add `.frm' files to your InnoDB database manually: use CREATE TABLE and DROP TABLE commands. InnoDB has its own internal data dictionary, and you will get problems if the MySQL `.frm' files are out of sync with the InnoDB internal data dictionary.

7.6.4.1 Converting MyISAM Tables to InnoDB

InnoDB does not have a special optimization for separate index creation. Therefore it does not pay to export and import the table and create indexes afterwards. The fastest way to convert a table to InnoDB is to do the inserts directly to an InnoDB table, that is, use ALTER TABLE ... TYPE=INNODB, or create an empty InnoDB table and insert the rows with INSERT INTO ... SELECT * FROM ....

When inserting, it is a good idea to insert big tables in several pieces:

INSERT INTO newtable SELECT * FROM oldtable WHERE yourkey > something
                                             AND yourkey <= somethingelse;

After all data has been inserted you can rename the tables.
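
Splitting a big insert into pieces, as described above, amounts to generating one INSERT ... SELECT per key range. The following Python sketch only builds the statement text; it assumes an integer key column, and the function name and the range boundaries are hypothetical values chosen for illustration.

```python
def chunked_insert_statements(new_table, old_table, key, low, high, step):
    """Return one INSERT ... SELECT per key range (low, low+step], ... up to high.

    Rows with key <= low are not covered, so `low` must be smaller than
    the smallest key in the table.
    """
    statements = []
    start = low
    while start < high:
        end = min(start + step, high)
        statements.append(
            "INSERT INTO %s SELECT * FROM %s WHERE %s > %d AND %s <= %d;"
            % (new_table, old_table, key, start, key, end)
        )
        start = end
    return statements


stmts = chunked_insert_statements("newtable", "oldtable", "yourkey", 0, 250, 100)
```

Each range is open at the lower end and closed at the upper end, matching the `>` and `<=` comparisons in the statement above, so no row is inserted twice.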

During the conversion of big tables you should set the InnoDB buffer pool size big to reduce disk i/o. Do not allocate more than 80% of the physical memory to it, though. You should also set the InnoDB log files and the log buffer big.

Make sure you do not run out of tablespace: InnoDB tables take more disk space than MyISAM tables. If an ALTER TABLE runs out of space, it will start a rollback, and that can take hours if it is disk-bound. In inserts InnoDB uses the insert buffer to merge secondary index records to indexes in batches. That saves a lot of disk i/o. In rollback no such mechanism is used, and the rollback can take 30 times longer than the insertion.

In the case of a runaway rollback, if you do not have valuable data in your database, it is better that you kill the database process and delete all InnoDB data and log files and all InnoDB table `.frm' files, and start your job again, rather than wait for millions of disk i/os to complete.

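7.6.5 Adding and Removing InnoDB Data and Log Files

You can increase the size of the InnoDB data area. To add more space to the tablespace, you have to add a new data file. To do this, shut down your MySQL server, edit the `my.cnf' file to add a new data file to innodb_data_file_path, and then start the MySQL server again.

Currently you cannot remove a data file from InnoDB. To decrease the size of your database you have to use mysqldump to dump your tables, create a new database, and import the dumped tables into it.

If you want to change the size of your InnoDB log files, you have to shut down the MySQL server (make sure that it shuts down without errors). Copy the old log files to a safe place in case something went wrong in the shutdown and you need them to recover the database. Then delete the old log files from the log file directory, edit `my.cnf', and start the MySQL server again. InnoDB will tell you at startup that it is creating new log files.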

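7.6.6 Backing up and Recovering an InnoDB Database

The key to safe database management is taking regular backups. To take a 'binary' backup of your database you have to do the following:

Shut down MySQL and make sure that it shuts down without errors.
Copy all your data files to a safe place.
Copy all your InnoDB log files to a safe place.
Copy your `my.cnf' configuration file to a safe place.
Copy the `.frm' files of your InnoDB tables to a safe place.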
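
The copy steps of the list above can be sketched in Python. This is a minimal illustration, not a backup tool: it assumes the MySQL server has already been shut down cleanly, and the function name `binary_backup` and the file list you pass to it are hypothetical. Only standard library calls are used.

```python
import os
import shutil


def binary_backup(files, backup_dir):
    """Copy each file in `files` into `backup_dir`, keeping base names.

    Assumes the MySQL server has already been shut down without errors;
    this function does not (and cannot) check that.
    """
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for path in files:
        target = os.path.join(backup_dir, os.path.basename(path))
        if os.path.exists(target):
            # Two source files with the same name (for example CUSTOMER.frm
            # in two databases) would silently overwrite each other.
            raise FileExistsError(target)
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

You would pass it the data files, the InnoDB log files, `my.cnf' and the `.frm' files of your InnoDB tables. Because all files land in one directory, tables of the same name in different databases need separate backup directories.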

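There is currently no online or incremental backup tool available for InnoDB. This is on the TODO list.

In addition to taking the binary backups described above, you should also regularly take dumps of your tables with `mysqldump'. The reason for this is that a binary file may be corrupted without you noticing it. Dumped tables are stored in text files which are human-readable and much simpler than database binary files. Seeing table corruption in dumped files is easier, and the chance of serious data corruption in them is smaller.

A good idea is to take the dumps at the same time you take a binary backup of your database. You have to shut out all clients from your database to get a consistent snapshot of all your tables into your dumps. Then you can take the binary backup, and you will then have a consistent snapshot of your database in two formats.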

To be able to recover your InnoDB database to the present from the binary backup described above, you have to run your MySQL database with the general logging and log archiving of MySQL switched on. Here by the general logging we mean the logging mechanism of the MySQL server which is independent of InnoDB logs.

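To recover from a crash of your MySQL server process, the only thing you have to do is to restart it. InnoDB will automatically check the logs and perform a roll-forward of the database. InnoDB will automatically roll back uncommitted transactions which were present at the time of the crash. During recovery, InnoDB will print out something like the following: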

~/mysqlm/sql > mysqld
InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 0 13674004
InnoDB: Doing recovery: scanned up to log sequence number 0 13739520
InnoDB: Doing recovery: scanned up to log sequence number 0 13805056
InnoDB: Doing recovery: scanned up to log sequence number 0 13870592
InnoDB: Doing recovery: scanned up to log sequence number 0 13936128
...
InnoDB: Doing recovery: scanned up to log sequence number 0 20555264
InnoDB: Doing recovery: scanned up to log sequence number 0 20620800
InnoDB: Doing recovery: scanned up to log sequence number 0 20664692
InnoDB: 1 uncommitted transaction(s) which must be rolled back
InnoDB: Starting rollback of uncommitted transactions
InnoDB: Rolling back trx no 16745
InnoDB: Rolling back of trx no 16745 completed
InnoDB: Rollback of uncommitted transactions completed
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Apply batch completed
InnoDB: Started
mysqld: ready for connections

If your database gets corrupted or your disk fails, you have to do the recovery from a backup. In the case of corruption, you should first find a backup which is not corrupted. From a backup do the recovery from the general log files of MySQL according to instructions in the MySQL manual.

7.6.6.1 Checkpoints

InnoDB implements a checkpoint mechanism called a fuzzy checkpoint. InnoDB will flush modified database pages from the buffer pool in small batches; there is no need to flush the buffer pool in one single batch, which would in practice stop processing of user SQL statements for a while.

In crash recovery InnoDB looks for a checkpoint label written to the log files. It knows that all modifications to the database before the label are already present on the disk image of the database. Then InnoDB scans the log files forward from the place of the checkpoint applying the logged modifications to the database.

InnoDB writes to the log files in a circular fashion. All committed modifications which make the database pages in the buffer pool different from the images on disk must be available in the log files in case InnoDB has to do a recovery. This means that when InnoDB starts to reuse a log file in the circular fashion, it has to make sure that the database page images on disk already contain the modifications logged in the log file InnoDB is going to reuse. In other words, InnoDB has to make a checkpoint and often this involves flushing of modified database pages to disk.

The above explains why making your log files very big may save disk i/o in checkpointing. It can make sense to set the total size of the log files as big as the buffer pool or even bigger. The drawback in big log files is that crash recovery can last longer because there will be more log to apply to the database.
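
The relation between log size and checkpoint activity described above can be illustrated with a toy model of a circular log. The Python sketch below is not InnoDB's algorithm: it assumes that every log record has the same size and that a forced checkpoint flushes everything, moving the checkpoint up to the current log sequence number. All names are hypothetical.

```python
def forced_checkpoints(total_log_size, record_size, num_records):
    """Count checkpoints forced by log reuse in a simple circular-log model."""
    lsn = 0          # log sequence number: total bytes written so far
    checkpoint = 0   # everything before this LSN is already in the data files
    forced = 0
    for _ in range(num_records):
        if lsn + record_size - checkpoint > total_log_size:
            # The next record would overwrite log that is still needed
            # for recovery, so a checkpoint must happen first.
            checkpoint = lsn
            forced += 1
        lsn += record_size
    return forced


# Same workload, two log sizes.
small = forced_checkpoints(total_log_size=1000, record_size=100, num_records=100)
large = forced_checkpoints(total_log_size=5000, record_size=100, num_records=100)
```

In this model the larger log forces fewer checkpoints for the same workload, which is the saving in disk i/o the text refers to. The model ignores the longer recovery time that comes with bigger log files.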

7.6.7 Moving an InnoDB database to another machine

InnoDB data and log files are binary-compatible on all platforms if the floating point number format on the machines is the same.

You can move an InnoDB database simply by copying all the relevant files, which we already listed in the previous section on backing up a database. If the floating point formats on the machines are different but you have not used FLOAT or DOUBLE data types in your tables then the procedure is the same: just copy the relevant files. If the formats are different and your tables contain floating point data, you have to use `mysqldump' and `mysqlimport' to move those tables.

A performance tip is to switch off the auto commit when you import data into your database, assuming your tablespace has enough space for the big rollback segment the big import transaction will generate. Do the commit only after importing a whole table or a segment of a table.

7.6.8 InnoDB transaction model

In the InnoDB transaction model the goal has been to combine the best sides of a multiversioning database with traditional two-phase locking. InnoDB does locking on row level and runs queries by default as non-locking consistent reads, in the style of Oracle. The lock table in InnoDB is stored so space-efficiently that lock escalation is not needed: typically several users are allowed to lock every row in the database, or any random subset of the rows, without InnoDB running out of memory.

In InnoDB all user activity happens inside transactions. If the auto commit mode is used in MySQL, then each SQL statement will form a single transaction. If the auto commit mode is switched off, then we can think that a user always has a transaction open. If the user issues the SQL COMMIT or ROLLBACK statement, that ends the current transaction, and a new one starts. Both statements will release all InnoDB locks that were set during the current transaction. A COMMIT means that the changes made in the current transaction are made permanent and become visible to other users. A ROLLBACK on the other hand cancels all modifications made by the current transaction.

7.6.8.1 Consistent read

A consistent read means that InnoDB uses its multiversioning to present to a query a snapshot of the database at a point in time. The query will see the changes made by exactly those transactions that committed before that point of time, and no changes made by later or uncommitted transactions. The exception to this rule is that the query will see the changes made by the transaction itself which issues the query.

When a transaction issues its first consistent read, InnoDB assigns the snapshot, or the point of time, which all consistent reads in the same transaction will use. In the snapshot are all transactions that committed before assigning the snapshot. Thus the consistent reads within the same transaction will also be consistent with respect to each other. You can get a fresher snapshot for your queries by committing the current transaction and after that issuing new queries.

Consistent read is the default mode in which InnoDB processes SELECT statements. A consistent read does not set any locks on the tables it accesses, and therefore other users are free to modify those tables at the same time a consistent read is being performed on the table.

7.6.8.2 Locking reads

A consistent read is not convenient in some circumstances. Suppose you want to add a new row into your table CHILD, and make sure that the child already has a parent in table PARENT.

Suppose you use a consistent read to read the table PARENT and indeed see the parent of the child in the table. Can you now safely add the child row to table CHILD? No, because it may happen that meanwhile some other user has deleted the parent row from the table PARENT, and you are not aware of that.

The solution is to perform the SELECT in a locking mode, LOCK IN SHARE MODE.

SELECT * FROM PARENT WHERE NAME = 'Jones' LOCK IN SHARE MODE;

Performing a read in share mode means that we read the latest available data, and set a shared mode lock on the rows we read. If the latest data belongs to a yet uncommitted transaction of another user, we will wait until that transaction commits. A shared mode lock prevents others from updating or deleting the row we have read. After we see that the above query returns the parent 'Jones', we can safely add his child to table CHILD, and commit our transaction. This example shows how to implement referential integrity in your application code.

Let us look at another example: we have an integer counter field in a table CHILD_CODES which we use to assign a unique identifier to each child we add to table CHILD. Obviously, using a consistent read or a shared mode read to read the present value of the counter is not a good idea, since then two users of the database may see the same value for the counter, and we will get a duplicate key error when we add the two children with the same identifier to the table.

In this case there are two good ways to implement the reading and incrementing of the counter: (1) update the counter first by incrementing it by 1 and only after that read it, or (2) read the counter first with a lock mode FOR UPDATE, and increment after that:

SELECT COUNTER_FIELD FROM CHILD_CODES FOR UPDATE;
UPDATE CHILD_CODES SET COUNTER_FIELD = COUNTER_FIELD + 1;

A SELECT ... FOR UPDATE will read the latest available data setting exclusive locks on each row it reads. Thus it sets the same locks a searched SQL UPDATE would set on the rows.
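
Why the read and the increment must happen under one exclusive lock can be shown with an in-process analogy. The Python sketch below is not database code: a `threading.Lock` plays the role of the exclusive row lock that SELECT ... FOR UPDATE would set, and all names are hypothetical.

```python
import threading

counter_row = {"COUNTER_FIELD": 0}
row_lock = threading.Lock()   # plays the role of the exclusive row lock
assigned = []


def next_child_id():
    # Like SELECT ... FOR UPDATE followed by UPDATE: the read and the
    # increment both happen while the exclusive lock is held.
    with row_lock:
        value = counter_row["COUNTER_FIELD"]
        counter_row["COUNTER_FIELD"] = value + 1
    return value


def worker():
    for _ in range(100):
        assigned.append(next_child_id())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no two threads can read the counter between another thread's read and write, every identifier in `assigned` is handed out exactly once. Reading without the lock would correspond to the consistent read or shared mode read that the text warns against.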

7.6.8.3 Next-key locking: avoiding the phantom problem

In row level locking InnoDB uses an algorithm called next-key locking. InnoDB does the row level locking so that when it searches or scans an index of a table, it sets shared or exclusive locks on the index records it encounters. Thus the row level locks are more precisely called index record locks.

The locks InnoDB sets on index records also affect the 'gap' before that index record. If a user has a shared or exclusive lock on record R in an index, then another user cannot insert a new index record immediately before R in the index order. This locking of gaps is done to prevent the so-called phantom problem. Suppose I want to read and lock all children with identifier bigger than 100 from table CHILD, and update some field in the selected rows.

SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;

Suppose there is an index on table CHILD on column ID. Our query will scan that index starting from the first record where ID is bigger than 100. Now, if the locks set on the index records did not lock out inserts made in the gaps, a new child might meanwhile be inserted to the table. If now I in my transaction execute

SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;

again, I will see a new child in the result set the query returns. This is against the isolation principle of transactions: a transaction should be able to run so that the data it has read does not change during the transaction. If we regard a set of rows as a data item, then the new 'phantom' child would break this isolation principle.

When InnoDB scans an index it can also lock the gap after the last record in the index. This is exactly what happens in the previous example: the locks set by InnoDB will prevent any insert to the table where ID would be bigger than 100.

You can use the next-key locking to implement a uniqueness check in your application: if you read your data in share mode and do not see a duplicate for a row you are going to insert, then you can safely insert your row and know that the next-key lock set on the successor of your row during the read will prevent anyone meanwhile inserting a duplicate for your row. Thus the next-key locking allows you to 'lock' the non-existence of something in your table.
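
The gap locking described in this section can be modelled on a sorted list of index keys. The following Python sketch is a simplified model, not InnoDB's lock manager: it assumes that each locked record also locks the open gap before it and that the scan additionally locks the gap after the last index record. The function names and the sample keys are hypothetical.

```python
import bisect


def locked_gaps(index_keys, scan_from):
    """Return (locked_records, gaps) for a locking scan of keys > scan_from.

    Gaps are (low, high) pairs of open intervals; None means unbounded.
    """
    keys = sorted(index_keys)
    first = bisect.bisect_right(keys, scan_from)   # first key > scan_from
    locked = keys[first:]
    gaps = []
    prev = keys[first - 1] if first > 0 else None
    for key in locked:
        gaps.append((prev, key))   # the gap before each locked record
        prev = key
    gaps.append((prev, None))      # the gap after the last index record
    return locked, gaps


def insert_blocked(new_key, locked, gaps):
    """True if inserting new_key would have to wait in this model."""
    if new_key in locked:
        return True
    for low, high in gaps:
        if (low is None or new_key > low) and (high is None or new_key < high):
            return True
    return False


# Index on CHILD.ID, scanned by: SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE
locked, gaps = locked_gaps([50, 90, 102, 110], scan_from=100)
```

In this model an insert of ID 105 or 200 is blocked, so no phantom can appear in the range the query read. An insert of ID 95 is blocked as well, because it falls into the gap before the locked record 102, while an insert of ID 60 is not affected.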

7.6.8.4 Locks set by different SQL statements in InnoDB

SELECT ... FROM ... : this is a consistent read, reading a snapshot of the database and setting no locks.
SELECT ... FROM ... LOCK IN SHARE MODE : sets shared next-key locks on all index records the read encounters.
SELECT ... FROM ... FOR UPDATE : sets exclusive next-key locks on all index records the read encounters.
INSERT INTO ... VALUES (...) : sets an exclusive lock on the inserted row; note that this lock is not a next-key lock and does not prevent other users from inserting to the gap before the inserted row. If a duplicate key error occurs, sets a shared lock on the duplicate index record.
INSERT INTO T SELECT ... FROM S WHERE ... sets an exclusive (non-next-key) lock on each row inserted into T. Does the search on S as a consistent read, but sets shared next-key locks on S if the MySQL logging is on. InnoDB has to set locks in the latter case because in roll-forward recovery from a backup every SQL statement has to be executed in exactly the same way as it was done originally.
CREATE TABLE ... SELECT ... performs the SELECT as a consistent read or with shared locks, like in the previous item.
REPLACE is done like an insert if there is no collision on a unique key. Otherwise, an exclusive next-key lock is placed on the row which has to be updated.
UPDATE ... SET ... WHERE ... : sets an exclusive next-key lock on every record the search encounters.
DELETE FROM ... WHERE ... : sets an exclusive next-key lock on every record the search encounters.
LOCK TABLES ... : sets table locks. In the implementation the MySQL layer of code sets these locks. The automatic deadlock detection of InnoDB cannot detect deadlocks where such table locks are involved: see the next section below. See also section 13 'InnoDB restrictions' about the following: since MySQL does not know about row level locks, it is possible that you get a table lock on a table where another user currently has row level locks. But that does not put transaction integrity into danger.

7.6.8.5 Deadlock detection and rollback

InnoDB automatically detects a deadlock of transactions and rolls back the transaction whose lock request was the last one to build a deadlock, that is, a cycle in the waits-for graph of transactions. InnoDB cannot detect deadlocks where a lock set by a MySQL LOCK TABLES statement is involved, or where a lock set by a table handler other than InnoDB is involved. You have to resolve these situations using innodb_lock_wait_timeout set in `my.cnf'.

When InnoDB performs a complete rollback of a transaction, all the locks of the transaction are released. However, if just a single SQL statement is rolled back as a result of an error, some of the locks set by the SQL statement may be preserved. This is because InnoDB stores row locks in a format where it cannot afterwards know which was set by which SQL statement.
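
The cycle in the waits-for graph described in this section can be illustrated with a small sketch. This is a toy model in Python, not InnoDB's actual code: the function names `closes_cycle` and `request_lock` and the dictionary representation are assumptions made for illustration, and each transaction is assumed to wait for at most one other transaction. The transaction whose request would close the cycle is named as the one to roll back.

```python
def closes_cycle(waits_for, requester, holder):
    """Return True if adding the edge requester -> holder creates a cycle.

    waits_for maps a transaction name to the transaction it is waiting for
    (toy model: each transaction waits for at most one other).
    """
    seen = set()
    current = holder
    while current is not None and current not in seen:
        if current == requester:
            return True
        seen.add(current)
        current = waits_for.get(current)
    return False


def request_lock(waits_for, requester, holder):
    """Record a lock wait, or return the requester as the rollback victim."""
    if closes_cycle(waits_for, requester, holder):
        return requester  # the last request to build the cycle is rolled back
    waits_for[requester] = holder
    return None


waits = {}
print(request_lock(waits, "A", "B"))  # None: A now waits for B
print(request_lock(waits, "B", "C"))  # None: B now waits for C
print(request_lock(waits, "C", "A"))  # C: this request would close the cycle
```

A lock taken with LOCK TABLES or by another table handler would simply be missing from `waits_for` in this model, which is why such deadlocks go undetected and have to be broken by a timeout.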

7.6.8.6 An example of how the consistent read works in InnoDB

When you issue a consistent read, that is, an ordinary SELECT statement, InnoDB will give your transaction a timepoint according to which your query sees the database. Thus, if transaction B deletes a row and commits after your timepoint was assigned, you will still see the row as if it had not been deleted. The same applies to inserts and updates.

You can advance your timepoint by committing your transaction and then doing another SELECT.

This is called multiversioned concurrency control.

                  User A                 User B

              set autocommit=0;      set autocommit=0;
time
|             SELECT * FROM t;
|             empty set
|                                    INSERT INTO t VALUES (1, 2);
|
v             SELECT * FROM t;
              empty set
                                     COMMIT;

              SELECT * FROM t;
              empty set

              COMMIT;

              SELECT * FROM t;
              ----------------------
              |     1    |    2   |
              ----------------------

Thus user A sees the row inserted by B only when B has committed the insert, and A has committed his own transaction so that the timepoint is advanced past the commit of B.

If you want to see the 'freshest' state of the database, you should use a locking read:

SELECT * FROM t LOCK IN SHARE MODE;
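
The timeline in this section can be mimicked with a toy model. The sketch below is an illustration only, not how InnoDB is implemented: the classes `ToyDatabase` and `ToyTransaction` are hypothetical, only committed inserts are modelled, and a transaction's own changes and deletes are ignored.

```python
class ToyDatabase:
    """Toy model: every row remembers the commit number that made it visible."""

    def __init__(self):
        self.commit_no = 0   # number of commits so far
        self.rows = []       # list of (row, commit number)

    def commit_insert(self, pending_rows):
        """Commit a transaction that inserted pending_rows."""
        self.commit_no += 1
        for row in pending_rows:
            self.rows.append((row, self.commit_no))


class ToyTransaction:
    def __init__(self, db):
        self.db = db
        self.timepoint = None  # assigned by the first consistent read

    def select_all(self):
        """Consistent read: see only rows committed up to our timepoint."""
        if self.timepoint is None:
            self.timepoint = self.db.commit_no
        return [row for row, no in self.db.rows if no <= self.timepoint]

    def commit(self):
        """Committing drops the timepoint; the next read gets a fresh one."""
        self.timepoint = None


db = ToyDatabase()
a = ToyTransaction(db)
first = a.select_all()       # user A's timepoint is assigned here
db.commit_insert([(1, 2)])   # user B inserts (1, 2) and commits
second = a.select_all()      # still the old snapshot
a.commit()
third = a.select_all()       # new timepoint, B's row is now visible
print(first, second, third)  # [] [] [(1, 2)]
```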

7.6.9 Performance tuning tips

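1. If the Unix `top' or the Windows `Task Manager' shows that CPU usage under your workload is below 70%, the workload is probably disk-bound. You may be making too many transaction commits, or the buffer pool may be too small. Making the buffer pool bigger can help, but do not make it bigger than 80% of physical memory.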

2. Wrap several modifications into one transaction. InnoDB must flush the log to disk at each transaction commit, if that transaction made modifications to the database. Since the rotation speed of a disk is typically at most 167 revolutions/second, that constrains the number of commits to the same 167/second if the disk does not fool the operating system.

3. If you can afford the loss of some of the latest committed transactions, you can set the `my.cnf' parameter innodb_flush_log_at_trx_commit to zero. InnoDB still tries to flush the log about once per second, though the flush is not guaranteed.

4. Make your log files big, even as big as the buffer pool. When InnoDB has written the log files full, it has to write the modified contents of the buffer pool to disk in a checkpoint. Small log files will cause many unnecessary disk writes. The drawback in big log files is that recovery time will be longer.

5. Also the log buffer should be quite big, say 8 MB.

6. (Relevant from 3.23.39 up.) On some versions of Linux and Unix, flushing files to disk with the Unix fdatasync and similar methods is surprisingly slow. The default method InnoDB uses is the fdatasync function. If you are not satisfied with the database write performance, you may try setting innodb_flush_method to O_DSYNC in the `my.cnf' file, though O_DSYNC seems to be slower on most systems.

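7. When importing data into InnoDB, make sure that MySQL is not running with autocommit=1, because then every single insert requires a log flush to disk. Add this line at the beginning of the SQL import file:

set autocommit=0;

and this line at the end:

commit;

If you use the `mysqldump' option --opt, you will get dump files that are fast to import into an InnoDB table even without wrapping them in set autocommit=0; ... commit; as shown above.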

8. Beware of big rollbacks of mass inserts: InnoDB uses the insert buffer to save disk i/o in inserts, but in a corresponding rollback no such mechanism is used. A disk-bound rollback can take 30 times the time of the corresponding insert. Killing the database process will not help because the rollback will start again at the database startup. The only way to get rid of a runaway rollback is to increase the buffer pool so that the rollback becomes CPU-bound and runs fast, or delete the whole InnoDB database.

9. Beware also of other big disk-bound operations. Use DROP TABLE or TRUNCATE (from MySQL-4.0 up) to empty a table, not DELETE FROM yourtable.

10. If you need to insert many rows, use the multi-line INSERT to reduce the communication overhead between the client and the server:

INSERT INTO yourtable VALUES (1, 2), (5, 5);

This tip is valid for inserts into any table type, not just InnoDB.
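
The estimate in tip 2, and the reason tip 7 matters, can be worked through with a few lines of arithmetic. This is a rough sketch: the figure of 10,000 rpm is an assumption (it is the rotation speed that corresponds to about 167 revolutions per second), and the model assumes at most one log flush per disk revolution with nothing else limiting throughput.

```python
def max_commits_per_second(rpm):
    """At best one log flush per disk revolution (the assumption in tip 2)."""
    return rpm / 60.0


def rows_per_second(rpm, rows_per_transaction):
    """Upper bound on inserted rows/second when commits are the bottleneck."""
    return max_commits_per_second(rpm) * rows_per_transaction


print(round(max_commits_per_second(10000)))  # 167
print(round(rows_per_second(10000, 1)))      # 167 with one row per commit
print(round(rows_per_second(10000, 1000)))   # 166667 with 1000 rows per commit
```

In practice other limits take over long before the last figure is reached; the point is only that the commit rate stops being the bottleneck once many rows share one transaction.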

7.6.9.1 The InnoDB Monitor

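Starting from version 3.23.41, InnoDB includes the InnoDB Monitor, which prints information on the internal state of InnoDB. When it is switched on, the MySQL server prints data to the standard output every 10 seconds. This data is useful in performance tuning.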

The printed information includes data on:

table and record locks held by each active transaction,
lock waits of transactions,
semaphore waits of threads,
pending file i/o requests,
buffer pool statistics, and
purge and insert buffer merge activity of the main thread of InnoDB.

You can start the InnoDB Monitor with the following SQL command:

CREATE TABLE innodb_monitor(a int) type = innodb;

and stop it with:

DROP TABLE innodb_monitor;

The CREATE TABLE syntax is just a way to pass a command to the InnoDB engine through the MySQL SQL parser: the created table is not relevant at all for InnoDB Monitor. If you shut down the database when the monitor is running, and you want to start the monitor again, you have to drop the table before you can issue a new CREATE TABLE to start the monitor. This syntax may change in a future release.

A sample output of the InnoDB Monitor:

================================
010809 18:45:06 INNODB MONITOR OUTPUT
================================
--------------------------
LOCKS HELD BY TRANSACTIONS
--------------------------
LOCK INFO:
Number of locks in the record hash table 1294
LOCKS FOR TRANSACTION ID 0 579342744
TABLE LOCK table test/mytable trx id 0 582333343 lock_mode IX

RECORD LOCKS space id 0 page no 12758 n bits 104 table test/mytable index
PRIMARY trx id 0 582333343 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 74; 1-byte offs FALSE;
info bits 0
 0: len 4; hex 0001a801; asc ;; 1: len 6; hex 000022b5b39f; asc ";; 2: len 7;
hex 000002001e03ec; asc ;; 3: len 4; hex 00000001;
...
-----------------------------------------------
CURRENT SEMAPHORES RESERVED AND SEMAPHORE WAITS
-----------------------------------------------
SYNC INFO:
Sorry, cannot give mutex list info in non-debug version!
Sorry, cannot give rw-lock list info in non-debug version!
-----------------------------------------------------
SYNC ARRAY INFO: reservation count 6041054, signal count 2913432
4a239430 waited for by thread 49627477 op. S-LOCK file NOT KNOWN line 0 
Mut ex 0 sp 5530989 r 62038708 sys 2155035; rws 0 8257574 8025336; rwx 0 1121090 1848344
-----------------------------------------------------
CURRENT PENDING FILE I/O'S
--------------------------
Pending normal aio reads:
Reserved slot, messages 40157658 4a4a40b8
Reserved slot, messages 40157658 4a477e28
...
Reserved slot, messages 40157658 4a4424a8
Reserved slot, messages 40157658 4a39ea38
Total of 36 reserved aio slots
Pending aio writes:
Total of 0 reserved aio slots
Pending insert buffer aio reads:
Total of 0 reserved aio slots
Pending log writes or reads:
Reserved slot, messages 40158c98 40157f98
Total of 1 reserved aio slots
Pending synchronous reads or writes:
Total of 0 reserved aio slots
-----------
BUFFER POOL
-----------
LRU list length 8034 
Free list length 0 
Flush list length 999 
Buffer pool size in pages 8192
Pending reads 39 
Pending writes: LRU 0, flush list 0, single page 0
Pages read 31383918, created 51310, written 2985115
----------------------------
END OF INNODB MONITOR OUTPUT
============================
010809 18:45:22 InnoDB starts purge
010809 18:45:22 InnoDB purged 0 pages

Some notes on the output:

If the section LOCKS HELD BY TRANSACTIONS reports lock waits, then your application may have lock contention. The output can also help to trace reasons for transaction deadlocks.
Section SYNC INFO will report reserved semaphores if you compile InnoDB with UNIV_SYNC_DEBUG defined in univ.i.
Section SYNC ARRAY INFO reports threads waiting for a semaphore and statistics on how many times threads have needed a spin or a wait on a mutex or a rw-lock semaphore. A big number of threads waiting for semaphores may be a result of disk i/o, or contention problems inside InnoDB. Contention can be due to heavy parallelism of queries, or problems in operating system thread scheduling.
Section CURRENT PENDING FILE I/O'S lists pending file i/o requests. A large number of these indicates that the workload is disk i/o -bound.
Section BUFFER POOL gives you statistics on pages read and written. You can calculate from these numbers how many data file i/o's your queries are currently doing.
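
As a sketch of the calculation mentioned for the BUFFER POOL section, the following Python fragment parses the 'Pages read' line from two monitor outputs and derives average i/o rates. The helper names are hypothetical, the second sample line is invented for illustration, and the 10-second interval is taken from the monitor's print interval described above.

```python
import re

PAGES_LINE = re.compile(r"Pages read (\d+), created (\d+), written (\d+)")


def parse_pages(line):
    """Extract the three counters from a BUFFER POOL 'Pages read' line."""
    match = PAGES_LINE.search(line)
    if match is None:
        raise ValueError("not a 'Pages read' line: %r" % line)
    read, created, written = (int(group) for group in match.groups())
    return {"read": read, "created": created, "written": written}


def io_per_second(earlier, later, seconds):
    """Average page reads and writes per second between two samples."""
    return {
        "reads": (later["read"] - earlier["read"]) / seconds,
        "writes": (later["written"] - earlier["written"]) / seconds,
    }


first = parse_pages("Pages read 31383918, created 51310, written 2985115")
# The second sample is invented for illustration; it is not real output.
second = parse_pages("Pages read 31385918, created 51320, written 2985615")
print(io_per_second(first, second, 10))  # {'reads': 200.0, 'writes': 50.0}
```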

7.6.10 Implementation of multiversioning

Since InnoDB is a multiversioned database, it must keep information about old versions of rows in the tablespace. This information is stored in a data structure we call a rollback segment, after an analogous data structure in Oracle.

InnoDB internally adds two fields to each row stored in the database. A 6-byte field tells the transaction identifier for the last transaction which inserted or updated the row. Also a deletion is internally treated as an update where a special bit in the row is set to mark it as deleted. Each row also contains a 7-byte field called the roll pointer. The roll pointer points to an undo log record written to the rollback segment. If the row was updated, then the undo log record contains the information necessary to rebuild the content of the row before it was updated.

InnoDB uses the information in the rollback segment to perform the undo operations needed in a transaction rollback. It also uses the information to build earlier versions of a row for a consistent read.

Undo logs in the rollback segment are divided into insert and update undo logs. Insert undo logs are only needed in transaction rollback and can be discarded as soon as the transaction commits. Update undo logs are used also in consistent reads, and they can be discarded only after there is no transaction present for which InnoDB has assigned a snapshot that in a consistent read could need the information in the update undo log to build an earlier version of a database row.

You must remember to commit your transactions regularly. Otherwise InnoDB cannot discard data from the update undo logs, and the rollback segment may grow too big, filling up your tablespace.

The physical size of an undo log record in the rollback segment is typically smaller than the corresponding inserted or updated row. You can use this information to calculate the space need for your rollback segment.

In our multiversioning scheme a row is not physically removed from the database immediately when you delete it with an SQL statement. Only when InnoDB can discard the update undo log record written for the deletion can it also physically remove the corresponding row and its index records from the database. This removal operation is called a purge, and it is quite fast, usually taking the same order of time as the SQL statement which did the deletion.

7.6.11 Table and index structures

Every InnoDB table has a special index called the clustered index where the data of the rows is stored. If you define a PRIMARY KEY on your table, then the index of the primary key will be the clustered index.

If you do not define a primary key for your table, InnoDB will internally generate a clustered index where the rows are ordered by the row id InnoDB assigns to the rows in such a table. The row id is a 6-byte field which monotonically increases as new rows are inserted. Thus the rows ordered by the row id will be physically in the insertion order.

Accessing a row through the clustered index is fast, because the row data will be on the same page where the index search leads us. In many databases the data is traditionally stored on a different page from the index record. If a table is large, the clustered index architecture often saves a disk i/o when compared to the traditional solution.

In InnoDB, the records in non-clustered indexes (we also call them secondary indexes) contain the primary key value for the row. InnoDB uses this primary key value to search for the row in the clustered index. Note that if the primary key is long, the secondary indexes will use more space.

7.6.11.1 Physical structure of an index

All indexes in InnoDB are B-trees where the index records are stored in the leaf pages of the tree. The default size of an index page is 16 kB. When new records are inserted, InnoDB tries to leave 1/16 of the page free for future insertions and updates of the index records.

If index records are inserted in a sequential (ascending or descending) order, the resulting index pages will be about 15/16 full. If records are inserted in a random order, then the pages will be between 1/2 and 15/16 full. If the fillfactor of an index page drops below 1/2, InnoDB will try to contract the index tree to free the page.
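
The fill factors above translate directly into a rough page count. The sketch below is a back-of-the-envelope estimate only: the function name `leaf_pages` and the example of one million 100-byte records are hypothetical, and page headers and per-record overhead are ignored.

```python
import math

PAGE_SIZE = 16 * 1024  # default index page size from the text


def leaf_pages(n_records, record_bytes, fill_fraction):
    """Rough number of leaf pages, ignoring page headers and other overhead."""
    usable = PAGE_SIZE * fill_fraction
    records_per_page = int(usable // record_bytes)
    return math.ceil(n_records / records_per_page)


# One million 100-byte index records (hypothetical numbers).
print(leaf_pages(1000000, 100, 15 / 16))  # 6536 pages after sequential inserts
print(leaf_pages(1000000, 100, 1 / 2))    # 12346 pages at the 1/2 lower bound
```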

7.6.11.2 Insert buffering

It is a common situation in a database application that the primary key is a unique identifier and new rows are inserted in the ascending order of the primary key. Thus the insertions to the clustered index do not require random reads from a disk.

On the other hand, secondary indexes are usually non-unique and insertions happen in a relatively random order into secondary indexes. This would cause a lot of random disk i/o's without a special mechanism used in InnoDB.

If an index record should be inserted to a non-unique secondary index, InnoDB checks if the secondary index page is already in the buffer pool. If that is the case, InnoDB will do the insertion directly to the index page. But, if the index page is not found from the buffer pool, InnoDB inserts the record to a special insert buffer structure. The insert buffer is kept so small that it entirely fits in the buffer pool, and insertions can be made to it very fast.

The insert buffer is periodically merged to the secondary index trees in the database. Often we can merge several insertions on the same page of the index tree, and hence save disk i/o's. It has been measured that the insert buffer can speed up insertions to a table up to 15 times.
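
Why merging saves disk i/o can be seen in a toy count of page reads. This is a simplified model and not InnoDB's algorithm: the function `count_page_reads` is hypothetical, a page read without the insert buffer is assumed not to stay cached, and the merge is assumed to read each affected page exactly once.

```python
from collections import defaultdict


def count_page_reads(inserts, cached_pages, use_insert_buffer):
    """Count secondary-index page reads needed for a batch of inserts.

    inserts is a list of page numbers that the new index records belong to.
    cached_pages is the set of index pages already in the buffer pool.
    """
    reads = 0
    buffered = defaultdict(int)  # page number -> buffered record count
    for page in inserts:
        if page in cached_pages:
            continue             # insert directly, no disk read
        if use_insert_buffer:
            buffered[page] += 1  # defer the work until the merge
        else:
            reads += 1           # toy assumption: page is read and not kept
    # Merge: one read per distinct page, however many records it receives.
    reads += len(buffered)
    return reads


inserts = [7, 3, 7, 9, 3, 7, 1, 9]  # hypothetical target pages
print(count_page_reads(inserts, {1}, use_insert_buffer=False))  # 7
print(count_page_reads(inserts, {1}, use_insert_buffer=True))   # 3
```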

7.6.11.3 Adaptive hash indexes

If a database fits almost entirely in main memory, then the fastest way to perform queries on it is to use hash indexes. InnoDB has an automatic mechanism which monitors index searches made to the indexes defined for a table, and if InnoDB notices that queries could benefit from building of a hash index, such an index is automatically built.

But note that the hash index is always built based on an existing B-tree index on the table. InnoDB can build a hash index on a prefix of any length of the key defined for the B-tree, depending on what search pattern InnoDB observes on the B-tree index. A hash index can be partial: it is not required that the whole B-tree index is cached in the buffer pool. InnoDB will build hash indexes on demand to those pages of the index which are often accessed.

In a sense, through the adaptive hash index mechanism InnoDB adapts itself to ample main memory, coming closer to the architecture of main memory databases.

7.6.11.4 Physical record structure

Each index record in InnoDB contains a header of 6 bytes. The header is used to link consecutive records together, and also in row level locking.
Records in the clustered index contain fields for all user-defined columns. In addition, there is a 6-byte field for the transaction id and a 7-byte field for the roll pointer.
If the user has not defined a primary key for a table, then each clustered index record also contains a 6-byte row id field.
Each secondary index record also contains all the fields defined for the clustered index key.
A record also contains a pointer to each field of the record. If the total length of the fields in a record is < 128 bytes, then the pointer is 1 byte, else 2 bytes.
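
The sizes listed above can be added up to estimate the per-record overhead in the clustered index. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes that the system fields (transaction id, roll pointer, row id) each have a field pointer and count toward the 128-byte threshold, which the text does not state explicitly.

```python
def clustered_record_overhead(n_user_fields, total_field_bytes, has_primary_key):
    """Bytes InnoDB adds to one clustered index record, per the list above."""
    header = 6
    trx_id = 6
    roll_pointer = 7
    row_id = 0 if has_primary_key else 6
    # Assumption of this sketch: system fields get a field pointer too.
    n_fields = n_user_fields + 2 + (0 if has_primary_key else 1)
    data_bytes = total_field_bytes + trx_id + roll_pointer + row_id
    pointer_size = 1 if data_bytes < 128 else 2
    return header + trx_id + roll_pointer + row_id + n_fields * pointer_size


print(clustered_record_overhead(3, 40, True))    # 24
print(clustered_record_overhead(3, 40, False))   # 31
print(clustered_record_overhead(3, 200, True))   # 29
```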

7.6.11.5 How an auto-increment column works in InnoDB

After a database startup, when a user first does an insert to a table T where an auto-increment column has been defined, and the user does not provide an explicit value for the column, then InnoDB executes SELECT MAX(auto-inc-column) FROM T, and assigns that value incremented by one to the column and the auto-increment counter of the table. We say that the auto-increment counter for table T has been initialized.

InnoDB follows the same procedure in initializing the auto-increment counter for a freshly created table.

Note that if the user specifies the value 0 for the auto-increment column in an insert, InnoDB treats the row as if the value had not been specified.

After the auto-increment counter has been initialized, if a user inserts a row where he explicitly specifies the column value, and the value is bigger than the current counter value, then the counter is set to the specified column value. If the user does not explicitly specify a value, then InnoDB increments the counter by one and assigns its new value to the column.

The auto-increment mechanism, when assigning values from the counter, bypasses locking and transaction handling. Therefore you may also get gaps in the number sequence if you roll back transactions which have got numbers from the counter.

The behavior of auto-increment is not defined if a user gives a negative value to the column or if the value becomes bigger than the maximum integer that can be stored in the specified integer type.
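
The counter rules in this section can be followed step by step in a toy model. The class `ToyAutoIncrement` below is hypothetical and is not InnoDB code; it assumes that MAX() on an empty table counts as 0, and it leaves out the undefined cases (negative values and overflow).

```python
class ToyAutoIncrement:
    """Toy model of the auto-increment counter rules described above."""

    def __init__(self, existing_values):
        self.values = list(existing_values)  # column values already stored
        self.counter = None                  # uninitialized after startup

    def insert(self, explicit=None):
        """Insert one row and return its auto-increment column value."""
        if explicit == 0:
            explicit = None                  # 0 counts as "no value given"
        if explicit is None:
            if self.counter is None:
                # SELECT MAX(col) FROM T; an empty table is assumed to give 0.
                self.counter = max(self.values, default=0)
            self.counter += 1
            value = self.counter
        else:
            if self.counter is not None and explicit > self.counter:
                self.counter = explicit
            value = explicit
        self.values.append(value)
        return value

    def rollback_last(self):
        """Undo the last insert; the counter deliberately stays unchanged."""
        self.values.pop()


t = ToyAutoIncrement([1, 2, 5])
print(t.insert())     # 6: counter initialized from MAX = 5
print(t.insert(0))    # 7: 0 is treated as unspecified
print(t.insert(20))   # 20: counter jumps to the explicit value
print(t.insert())     # 21
t.rollback_last()     # the row with 21 disappears ...
print(t.insert())     # 22: ... but the number is not reused, leaving a gap
```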

7.6.12 File space management and disk i/o

7.6.12.1 Disk i/o

In disk i/o InnoDB uses asynchronous i/o. On Windows NT it uses the native asynchronous i/o provided by the operating system. On Unix, InnoDB uses simulated asynchronous i/o built into InnoDB: InnoDB creates a number of i/o threads to take care of i/o operations, such as read-ahead. In a future version we will add support for simulated aio on Windows NT and native aio on those versions of Unix which have one.

On Windows NT InnoDB uses non-buffered i/o. That means that the disk pages InnoDB reads or writes are not buffered in the operating system file cache. This saves some memory bandwidth.

Starting from 3.23.41 InnoDB uses a novel file flush technique called doublewrite. It adds safety to crash recovery after an operating system crash or a power outage, and improves performance on most Unix flavors by reducing the need for fsync operations.

Doublewrite means that before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB will in recovery find a good copy of the page in the doublewrite buffer.
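
The recovery idea can be sketched with in-memory dictionaries standing in for the data file and the doublewrite buffer. This is an illustration only: the helper names are hypothetical, and the use of a CRC32 checksum to recognize a torn page is an assumption of the sketch, not something the text specifies.

```python
import zlib


def with_checksum(payload):
    """Prefix a page payload with its CRC32 so torn writes can be detected."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_intact(page):
    """True if the stored checksum matches the rest of the page."""
    return len(page) >= 4 and zlib.crc32(page[4:]).to_bytes(4, "big") == page[:4]


def recover(data_file, doublewrite_buffer):
    """Replace torn pages in the data file with good copies from the buffer."""
    for page_no, copy in doublewrite_buffer.items():
        if not is_intact(data_file.get(page_no, b"")) and is_intact(copy):
            data_file[page_no] = copy
    return data_file


new_page = with_checksum(b"new contents of page 7")
doublewrite = {7: new_page}                 # step 1: written and flushed first
data_file = {7: new_page[:10] + b"\0" * 5}  # step 2: a crash tore this write
recover(data_file, doublewrite)
print(data_file[7] == new_page)
```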

Starting from 3.23.41 you can also use a raw disk partition as a data file, though this has not been tested yet. When you create a new data file you have to put the keyword newraw immediately after the data file size in innodb_data_file_path. The partition must be at least as large as the size you specify. Note that 1M in InnoDB is 1024 x 1024 bytes, while in disk specifications 1 MB usually means 1,000,000 bytes.

innodb_data_file_path=hdd1:3Gnewraw;hdd2:2Gnewraw

When you start the database again you MUST change the keyword to raw. Otherwise InnoDB will write over your partition!

innodb_data_file_path=hdd1:3Graw;hdd2:2Graw

Using a raw disk you can on some Unixes perform non-buffered i/o.
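
The difference between InnoDB's units and disk vendor units is easy to check before choosing a size. In this sketch the helper names are hypothetical, and reading G as 1024 x 1024 x 1024 bytes is an assumption extrapolated from the definition of 1M given above.

```python
UNITS = {"M": 1024 ** 2, "G": 1024 ** 3}  # G is assumed by analogy with M


def innodb_size_bytes(spec):
    """Convert an InnoDB size such as '3G' or '500M' to bytes."""
    return int(spec[:-1]) * UNITS[spec[-1]]


def partition_big_enough(partition_bytes, spec):
    """True if the raw partition can hold a data file of the given size."""
    return partition_bytes >= innodb_size_bytes(spec)


print(innodb_size_bytes("3G"))                    # 3221225472
print(partition_big_enough(3 * 1000 ** 3, "3G"))  # False: a "3 GB" disk is too small
```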

There are two read-ahead heuristics in InnoDB: sequential read-ahead and random read-ahead. In sequential read-ahead InnoDB notices that the access pattern to a segment in the tablespace is sequential. Then InnoDB will post in advance a batch of reads of database pages to the i/o system. In random read-ahead InnoDB notices that some area in a tablespace seems to be in the process of being fully read into the buffer pool. Then InnoDB posts the remaining reads to the i/o system.

7.6.12.2 File space management

The data files you define in the configuration file form the tablespace of InnoDB. The files are simply concatenated to form the tablespace; there is no striping in use. Currently you cannot directly instruct where the space is allocated for your tables, except by using the following fact: from a newly created tablespace InnoDB will allocate space starting from the low end.

The tablespace consists of database pages whose default size is 16 kB. The pages are grouped into extents of 64 consecutive pages. The 'files' inside a tablespace are called segments in InnoDB. The name of the rollback segment is somewhat misleading because it actually contains many segments in the tablespace.

For each index in InnoDB we allocate two segments: one is for non-leaf nodes of the B-tree, the other is for the leaf nodes. The idea here is to achieve better sequentiality for the leaf nodes, which contain the data.

When a segment grows inside the tablespace, InnoDB allocates the first 32 pages to it individually. After that InnoDB starts to allocate whole extents to the segment. InnoDB can add to a large segment up to 4 extents at a time to ensure good sequentiality of data.
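
The allocation rule can be turned into a rough estimate of how much space a segment reserves. This is a simplified sketch: the function `pages_reserved` is hypothetical, it assumes that every page beyond the first 32 comes from a whole 64-page extent, and it ignores the batching of up to 4 extents at a time.

```python
import math

PAGE_SIZE = 16 * 1024     # default page size from the text
PAGES_PER_EXTENT = 64
INDIVIDUAL_PAGES = 32     # pages allocated one by one before extents are used


def pages_reserved(pages_needed):
    """Pages reserved for a segment: 32 single pages, then whole extents."""
    if pages_needed <= INDIVIDUAL_PAGES:
        return pages_needed
    extents = math.ceil((pages_needed - INDIVIDUAL_PAGES) / PAGES_PER_EXTENT)
    return INDIVIDUAL_PAGES + extents * PAGES_PER_EXTENT


print(PAGE_SIZE * PAGES_PER_EXTENT)  # 1048576 bytes in one extent
print(pages_reserved(20))            # 20
print(pages_reserved(33))            # 96
print(pages_reserved(200))           # 224
```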

Some pages in the tablespace contain bitmaps of other pages, and therefore a few extents in an InnoDB tablespace cannot be allocated to segments as a whole, but only as individual pages.

When you issue a query SHOW TABLE STATUS FROM ... LIKE ... to ask for available free space in the tablespace, InnoDB will report the space which is certainly usable in totally free extents of the tablespace. InnoDB always reserves some extents for clean-up and other internal purposes; these reserved extents are not included in the free space.

When you delete data from a table, InnoDB will contract the corresponding B-tree indexes. It depends on the pattern of deletes if that frees individual pages or extents to the tablespace, so that the freed space is available for other users. Dropping a table or deleting all rows from it is guaranteed to release the space to other users, but remember that deleted rows can be physically removed only in a purge operation after they are no longer needed in transaction rollback or consistent read.

7.6.12.3 Defragmenting a table

If there are random insertions or deletions in the indexes of a table, the indexes may become fragmented. By fragmentation we mean that the physical ordering of the index pages on the disk is not close to the alphabetical ordering of the records on the pages, or that there are many unused pages in the 64-page blocks which were allocated to the index.

It can speed up index scans if you periodically use mysqldump to dump the table to a text file, drop the table, and reload it from the dump. Another way to do the defragmenting is to ALTER the table type to MyISAM and back to InnoDB again. Note that a MyISAM table must fit in a single file on your operating system.

If the insertions to an index are always ascending and records are deleted only from the end, then the file space management algorithm of InnoDB guarantees that fragmentation in the index will not occur.

7.6.13 Error handling

The error handling in InnoDB is not always the same as specified in the ANSI SQL standards. According to the ANSI standard, any error during an SQL statement should cause the rollback of that statement. InnoDB sometimes rolls back only part of the statement. The following list specifies the error handling of InnoDB.

If you run out of file space in the tablespace, you will get the MySQL 'Table is full' error and InnoDB rolls back the SQL statement.
A transaction deadlock or a timeout in a lock wait will give 'Table handler error 1000000' and InnoDB rolls back the SQL statement.
A duplicate key error only rolls back the insert of that particular row, even in a statement like INSERT INTO ... SELECT .... This will probably change so that the SQL statement will be rolled back if you have not specified the IGNORE option in your statement.
A 'row too long' error rolls back the SQL statement.
Other errors are mostly detected by the MySQL layer of code, and they roll back the corresponding SQL statement.

7.6.14 Some restrictions on InnoDB tables

SHOW TABLE STATUS does not give accurate statistics on InnoDB tables, except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimization.
If you try to create a unique index on a prefix of a column you will get an error:
```
CREATE TABLE T (A CHAR(20), B INT, UNIQUE (A(5))) TYPE = InnoDB;
```
If you create a non-unique index on a prefix of a column, InnoDB will create an index over the whole column.
INSERT DELAYED is not supported for InnoDB tables.
The MySQL LOCK TABLES operation does not know of InnoDB row level locks set in already completed SQL statements: this means that you can get a table lock on a table even if there still exist transactions of other users which have row level locks on the same table. Thus your operations on the table may have to wait if they collide with these locks of other users. Also a deadlock is possible. However, this does not endanger transaction integrity, because the row level locks set by InnoDB will always take care of the integrity. Also, a table lock prevents other transactions from acquiring more row level locks (in a conflicting lock mode) on the table.
You cannot have a key on a BLOB or TEXT column.
A table cannot contain more than 1000 columns.
DELETE FROM TABLE does not regenerate the table but instead deletes all rows, one by one, which is not that fast. In future versions of MySQL you can use TRUNCATE which is fast.
Before dropping a database with InnoDB tables one has to drop the individual InnoDB tables first.
The default database page size in InnoDB is 16 kB. By recompiling the code one can set it from 8 kB to 64 kB. The maximum row length is slightly less than half of a database page in versions <= 3.23.40 of InnoDB. Starting from source release 3.23.41 BLOB and TEXT columns are allowed to be < 4 GB, the total row length must also be < 4 GB. InnoDB does not store fields whose size is <= 30 bytes on separate pages. After InnoDB has modified the row by storing long fields on separate pages, the remaining length of the row must be slightly less than half a database page.
The maximum data or log file size is 2 GB or 4 GB depending on how large files your operating system supports. Support for > 4 GB files will be added to InnoDB in a future version.
The maximum tablespace size is 4 billion database pages. This is also the maximum size for a table. The minimum tablespace size is 10 MB.
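
The size limits in this list can be put into concrete numbers. In this sketch the function names are hypothetical, and "4 billion" pages is read as 2**32, which is an assumption; with the default 16 kB page this gives a 64 TB tablespace, and at 4 GB per data file it would take 16384 files to reach that limit.

```python
PAGE_SIZE = 16 * 1024  # default page size from the text
MAX_PAGES = 2 ** 32    # "4 billion" pages read as 2**32 (an assumption)


def max_tablespace_bytes(page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Largest tablespace, and therefore largest table, in bytes."""
    return page_size * max_pages


def files_needed(total_bytes, max_file_bytes):
    """Number of data files needed to reach total_bytes (rounded up)."""
    return -(-total_bytes // max_file_bytes)


print(max_tablespace_bytes() // 1024 ** 4)                # 64 (TB of 1024**4 bytes)
print(files_needed(max_tablespace_bytes(), 4 * 1024 ** 3))  # 16384
print(PAGE_SIZE // 2)  # 8192: the in-page row length must stay slightly below this
```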

7.6.15 InnoDB contact information

Contact information of Innobase Oy, producer of the InnoDB engine. Website: http://www.innodb.com. Email: Heikki.Tuuri@innodb.com

phone: 358-9-6969 3250 (office) 358-40-5617367 (mobile)
InnoDB Oy Inc.
World Trade Center Helsinki
Aleksanterinkatu 17
P.O.Box 800
00101 Helsinki
Finland
