MySQL Index Merge Optimization Practices

  sonic0002        2024-09-10 04:52:26

In production databases, one often sees SQL whose WHERE condition combines:

an equality condition on a normal (secondary) index + a primary key range query + ORDER BY ... LIMIT

Although using the secondary index alone would often be more efficient, in some cases the optimizer chooses index merge instead. This article explores those situations.

Index Merge Official Introduction  

The Index Merge access method retrieves rows with multiple range scans and merges their results into one.

Generally, for a single table, the optimizer chooses one index. However, in the case of index merge, the optimizer can use multiple indexes to retrieve data and merge their results.

Here, the server first retrieves primary key values (rowids) through the secondary indexes, sorts and merges them, and then fetches rows from the table by primary key, which reduces random IO.

Merge Sort Algorithm  

Before introducing the methods and algorithms of index merge, let's briefly review the merge sort algorithm; it makes index merge in MySQL easier to understand.

Merge sort is an application of the divide-and-conquer principle: it recursively divides the list into smaller sub-lists, sorts those sub-lists, and then merges them back into one ordered list.
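A minimal sketch of the algorithm in C++ (an illustration of the principle only, not MySQL source code):

#include <cstddef>
#include <vector>

// Merge two already-sorted lists in one linear pass.
static std::vector<long> merge(const std::vector<long> &a,
                               const std::vector<long> &b)
{
  std::vector<long> out;
  out.reserve(a.size() + b.size());
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size())          // pick the smaller head first
    out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
  while (i < a.size()) out.push_back(a[i++]);   // drain whichever side is left
  while (j < b.size()) out.push_back(b[j++]);
  return out;
}

// Recursively split, sort each half, then merge the sorted halves.
std::vector<long> merge_sort(const std::vector<long> &v)
{
  if (v.size() <= 1)
    return v;                                   // base case: already sorted
  std::vector<long> left(v.begin(), v.begin() + v.size() / 2);
  std::vector<long> right(v.begin() + v.size() / 2, v.end());
  return merge(merge_sort(left), merge_sort(right));
}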

Index Merge in MySQL

In MySQL, the index merge algorithm has the following variations:

1. index_merge_intersection: performs an intersection of indexes; it appears in the execution plan as Extra: Using intersect(...) and corresponds to the QUICK_ROR_INTERSECT_SELECT class in the source code.

2. index_merge_union: performs a union of indexes; it appears as Extra: Using union(...) and corresponds to the QUICK_ROR_UNION_SELECT class in the source code.

3. index_merge_sort_union: performs a sorted union; it appears as Extra: Using sort_union(...) and corresponds to the QUICK_INDEX_MERGE_SELECT class in the source code.

In the first two methods, the class name contains the keyword ROR, which stands for Rowid-Ordered Retrieval: the result set returned by each individual index scan is already ordered by primary key. Following the merge sort algorithm above, merging such result sets needs no recursive sorting, only the final linear merge step over the ordered lists.
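Because each ROR scan returns rowids in primary-key order, an intersection reduces to a single two-pointer pass, essentially the merge step shown above. A minimal sketch of the idea (an illustration only, not the actual QUICK_ROR_INTERSECT_SELECT implementation):

#include <cstddef>
#include <vector>

// Intersect two rowid streams that are each ordered by primary key,
// as in Rowid-Ordered Retrieval: advance whichever stream is behind,
// emit a rowid when both streams agree. No sorting is required.
std::vector<long> ror_intersect(const std::vector<long> &idx1_rowids,
                                const std::vector<long> &idx2_rowids)
{
  std::vector<long> rowids;
  std::size_t i = 0, j = 0;
  while (i < idx1_rowids.size() && j < idx2_rowids.size())
  {
    if (idx1_rowids[i] == idx2_rowids[j])     // row matches both conditions
    {
      rowids.push_back(idx1_rowids[i]);
      ++i; ++j;
    }
    else if (idx1_rowids[i] < idx2_rowids[j])
      ++i;                                    // advance the smaller side
    else
      ++j;
  }
  return rowids;                              // still in primary-key order
}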

For the third method, since the per-index results are not sorted by the primary key, the rowids must be sorted (and deduplicated) before the merge.

For the result sets in the first two cases to be ordered by the primary key, the WHERE condition must meet the following criteria:

  • The secondary index condition should include all indexed fields and be an equality match. For example, for an index idx(a,b,c), the where condition should be a=? AND b=? AND c=?.
  • There must be a primary key range query.

The intersection algorithm index_merge_intersection is advantageous when the conditions connected by AND yield a smaller primary key result set compared to a single index. The union algorithm index_merge_union is beneficial when conditions connected by OR result in a lower cost than a full table scan.

If neither of the first two algorithms is applicable and the index merge is still more cost-effective, MySQL chooses the sort union algorithm. This method requires deduplication and sorting, using a tree structure. Below is an explanation of the source code in sql/opt_range.cc, specifically the function QUICK_INDEX_MERGE_SELECT::read_keys_and_merge().

This function drives the index merge: it scans all the indexes used (excluding the clustered primary key, or CPK) and merges the rowids they return.

If a record can be retrieved through the primary key range condition, that row is skipped.

Otherwise, the unique_add method is executed. It inserts the rowid into a tree structure, which performs the sorting and deduplication.

/* skip row if it will be retrieved by clustered PK scan */
if (pk_quick_select && pk_quick_select->row_in_ranges())
  continue;

cur_quick->file->position(cur_quick->record);
result= unique->unique_add((char*)cur_quick->file->ref);
if (result)
  DBUG_RETURN(1);

Internally, unique_add calls tree_insert:

inline bool unique_add(void *ptr)
{
  DBUG_ENTER("unique_add");
  DBUG_PRINT("info", ("tree %u - %lu", tree.elements_in_tree, max_elements));
  if (tree.elements_in_tree > max_elements && flush())
    DBUG_RETURN(1);
  DBUG_RETURN(!tree_insert(&tree, ptr, 0, tree.custom_arg));
}

In tree_insert, the traversal loop contains the following check: if the current element is not the tree's null node, it is compared with the value to be inserted; if they are equal, the loop breaks, ensuring that no duplicate values are inserted into the tree.

for (;;)
{
  if (element == &tree->null_element ||
      (cmp = (*tree->compare)(custom_arg, ELEMENT_KEY(tree,element), key)) == 0)
    break;
    ...
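Putting these pieces together, the essence of sort_union can be sketched with a self-balancing tree. The sketch below uses std::set purely as a stand-in for the server's Unique/tree structure:

#include <set>
#include <vector>

// Sketch of the sort_union idea: rowids arrive from multiple index
// scans in no particular order; inserting them into a balanced tree
// both deduplicates (inserting an equal key is a no-op) and sorts
// them, so the subsequent table reads happen in rowid order.
std::vector<long> sort_union(const std::vector<std::vector<long>> &scans)
{
  std::set<long> unique_rowids;        // stand-in for MySQL's Unique tree
  for (const std::vector<long> &scan : scans)
    for (long rowid : scan)
      unique_rowids.insert(rowid);     // duplicate rowids are dropped here
  return {unique_rowids.begin(), unique_rowids.end()};  // sorted, deduped
}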

Example SQL for the Three Algorithms:

-- can use index_merge_intersection/index_merge_union algo
SELECT * FROM innodb_table
  WHERE primary_key < 10 AND/OR key1 = 20;


SELECT * FROM innodb_table
  WHERE (key1_part1 = 1 AND key1_part2 = 2) AND/OR key2 = 2;

-- can use index_merge_sort_union algo
SELECT * FROM innodb_table
  WHERE (key1_part1 = 2 OR key1_part1 = 7) OR key2_part1 = 4;

SELECT * FROM innodb_table
  WHERE key1_part1 < 10 OR key2_part1 < 20;

In the index_merge_intersection scenario, if one of the conditions is a primary key range, the primary key condition serves only as a filter; the rows satisfying it are not separately extracted and merged. Specifically, while retrieving rowids from the secondary index, MySQL checks whether each rowid falls within the primary key range: if it does, the row is retained; otherwise, the rowid is discarded.
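A minimal sketch of that filtering step (an illustration under the assumptions above, not the server's actual control flow):

#include <vector>

struct PkRange { long low; };   // models a condition such as "id >= low"

// When intersecting a secondary index with a primary key range, no
// separate primary key result set is built: each rowid fetched from
// the secondary index is simply tested against the range.
std::vector<long> filter_by_pk_range(const std::vector<long> &idx_rowids,
                                     const PkRange &range)
{
  std::vector<long> kept;
  for (long rowid : idx_rowids)
    if (rowid >= range.low)     // rowid inside the PK range: keep the row
      kept.push_back(rowid);
  return kept;                  // rows are fetched by these rowids afterwards
}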

Here, we return to the issue mentioned at the beginning of the article, where the `WHERE` condition includes: an equality check on a regular index, a range query on the primary key, and an `ORDER BY LIMIT` clause. In many cases, using a single secondary index can be faster than index merging.

The main reasons are as follows:

  • Index merging must scan all the records that satisfy the secondary index condition before anything can be sorted and returned, so the LIMIT cannot cut the scan short.
  • With a single secondary index, because of the `LIMIT` clause, it is not necessary to scan all the matching records: the scan stops once enough rows have been retrieved, which can be far more efficient (see the sketch below).
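The contrast, sketched under the same illustrative assumptions as above (rowids from the single secondary index arrive in primary key order, matching the case study later in this article):

#include <cstddef>
#include <vector>

// A single ordered index scan can stop as soon as LIMIT rows survive
// the filter; an index merge would have had to collect (and then sort)
// every matching rowid before returning the first row.
std::vector<long> scan_with_limit(const std::vector<long> &ordered_rowids,
                                  long pk_low, std::size_t limit)
{
  std::vector<long> out;
  for (long rowid : ordered_rowids)  // already in the ORDER BY id order
  {
    if (out.size() == limit)
      break;                         // early exit: LIMIT satisfied
    if (rowid >= pk_low)             // apply the primary key filter
      out.push_back(rowid);
  }
  return out;
}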

In fact, MySQL's initial cost calculation does not take the LIMIT clause into account. After the cost calculation, however, some scenarios trigger a rechecking_index_usage step, and one of its triggers is LOW_LIMIT: a recheck is considered when the number of rows requested by the LIMIT is lower than the table's "fanout" (rows_fetched * filter_effect). After this evaluation, the optimizer decides whether to change the execution plan.

The code snippet for rechecking index usage is as follows:

        if ((tab->type() == JT_ALL || tab->type() == JT_RANGE ||
             tab->type() == JT_INDEX_MERGE || tab->type() == JT_INDEX_SCAN) &&
            tab->use_quick != QS_RANGE)
        {
          /*
            We plan to scan (table/index/range scan).
            Check again if we should use an index. We can use an index if:

            1a) There is a condition that range optimizer can work on, and
            1b) There are non-constant conditions on one or more keys, and
            1c) Some of the non-constant fields may have been read
                already. This may be the case if this is not the first
                table in the join OR this is a subselect with
                non-constant conditions referring to an outer table
                (dependent subquery)
                or,
            2a) There are conditions only relying on constants
            2b) This is the first non-constant table
            2c) There is a limit of rows to read that is lower than
                the fanout for this table, predicate filters included
                (i.e., the estimated number of rows that will be
                produced for this table per row combination of
                previous tables)
            2d) The query is NOT run with FOUND_ROWS() (because in that
                case we have to scan through all rows to count them anyway)
          */
          enum { DONT_RECHECK, NOT_FIRST_TABLE, LOW_LIMIT }
          recheck_reason= DONT_RECHECK;

          assert(tab->const_keys.is_subset(tab->keys()));

          const join_type orig_join_type= tab->type();
          const QUICK_SELECT_I *const orig_quick= tab->quick();

          if (cond &&                                                // 1a
              (tab->keys() != tab->const_keys) &&                      // 1b
              (i > 0 ||                                              // 1c
               (join->select_lex->master_unit()->item &&
                cond->used_tables() & OUTER_REF_TABLE_BIT)))
            recheck_reason= NOT_FIRST_TABLE;
          else if (!tab->const_keys.is_clear_all() &&                // 2a
                   i == join->const_tables &&                        // 2b
                   (join->unit->select_limit_cnt <
                    (tab->position()->rows_fetched *
                     tab->position()->filter_effect)) &&               // 2c
                   !join->calc_found_rows)                             // 2d
            recheck_reason= LOW_LIMIT;


    ...

            if (recheck_reason == LOW_LIMIT)
            {
              int read_direction= 0;

              /*
                If the current plan is to use range, then check if the
                already selected index provides the order dictated by the
                ORDER BY clause.
              */
              if (tab->quick() && tab->quick()->index != MAX_KEY)
              {
                const uint ref_key= tab->quick()->index;

                read_direction= test_if_order_by_key(join->order,
                                                     tab->table(), ref_key);
                /*
                  If the index provides order there is no need to recheck
                  index usage; we already know from the former call to
                  test_quick_select() that a range scan on the chosen
                  index is cheapest. Note that previous calls to
                  test_quick_select() did not take order direction
                  (ASC/DESC) into account, so in case of DESC ordering
                  we still need to recheck.
                */
                if ((read_direction == 1) ||
                    (read_direction == -1 && tab->quick()->reverse_sorted()))
                {
                  recheck_reason= DONT_RECHECK;
                }
              }
              /*
                We do a cost based search for an ordering index here. Do this
                only if prefer_ordering_index switch is on or an index is
                forced for order by
              */
              if (recheck_reason != DONT_RECHECK &&
                  (tab->table()->force_index_order ||
                   thd->optimizer_switch_flag(
                       OPTIMIZER_SWITCH_PREFER_ORDERING_INDEX)))
              {
                int best_key= -1;
                ha_rows select_limit= join->unit->select_limit_cnt;

                /* Use index specified in FORCE INDEX FOR ORDER BY, if any. */
                if (tab->table()->force_index)
                  usable_keys.intersect(tab->table()->keys_in_use_for_order_by);

                /* Do a cost based search on the indexes that give sort order */
                test_if_cheaper_ordering(tab, join->order, tab->table(),
                                         usable_keys, -1, select_limit,
                                         &best_key, &read_direction,
                                         &select_limit);
                if (best_key < 0)
                  recheck_reason= DONT_RECHECK; // No usable keys
                else
                {
                  // Only usable_key is the best_key chosen
                  usable_keys.clear_all();
                  usable_keys.set_bit(best_key);
                  interesting_order= (read_direction == -1 ? ORDER::ORDER_DESC :
                                      ORDER::ORDER_ASC);
                }
              }
            } 

Case Study

A common scenario where index merging is less efficient:

-- table structure
CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c1` char(64) NOT NULL,
  `c2` char(64) NOT NULL,
  `c3` char(64) NOT NULL,
  `c4` char(64) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_c1` (`c1`)
) ENGINE=InnoDB AUTO_INCREMENT=978914 DEFAULT CHARSET=utf8mb4


-- records count
mysql> select count(*) from t where c1='d';
+----------+
| count(*) |
+----------+
|    65536 |
+----------+
1 row in set (0.12 sec)
mysql> select count(*) from t where id>=600000;
+----------+
| count(*) |
+----------+
|   313385 |
+----------+
1 row in set (0.39 sec)
mysql> select count(*) from t where c1 ='d' and id>=600000;
+----------+
| count(*) |
+----------+
|    26112 |
+----------+
1 row in set (0.11 sec)


-- Note: because the intersection produces rowids in primary key order, the returned records are already sorted even without the ORDER BY id clause. Nevertheless, the execution plan still shows Using filesort, meaning an additional sort is performed.
mysql> explain select * from t where c1 ='d' and id>=600000  order by id limit 1000;
+----+-------------+-------+------------+-------------+----------------+----------------+---------+------+-------+----------+--------------------------------------------------------------+
| id | select_type | table | partitions | type        | possible_keys  | key            | key_len | ref  | rows  | filtered | Extra                                                        |
+----+-------------+-------+------------+-------------+----------------+----------------+---------+------+-------+----------+--------------------------------------------------------------+
|  1 | SIMPLE      | t     | NULL       | index_merge | PRIMARY,idx_c1 | idx_c1,PRIMARY | 260,4   | NULL | 26667 |   100.00 | Using intersect(idx_c1,PRIMARY); Using where; Using filesort |
+----+-------------+-------+------------+-------------+----------------+----------------+---------+------+-------+----------+--------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)


-- execution time: 170ms
1000 rows in set (0.17 sec)

-- slow log: 27112 rows examined
# Time: 2024-06-13T19:00:38.741184+08:00
# User@Host: root[root] @ VMS184912 [10.61.250.57]  Id:   148
# Query_time: 0.174327  Lock_time: 0.000802 Rows_sent: 1000  Rows_examined: 27112
SET timestamp=1718276438;
select * from t where c1 ='d' and id>=600000  order by id limit 1000;


-- If the ORDER BY clause is removed, the query takes 10ms; the execution plan still shows index merge but no longer has Using filesort.
mysql> explain select * from t where c1 ='d' and id>=600000 limit 1000;
+----+-------------+-------+------------+-------------+----------------+----------------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type        | possible_keys  | key            | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+-------+------------+-------------+----------------+----------------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | t     | NULL       | index_merge | PRIMARY,idx_c1 | idx_c1,PRIMARY | 260,4   | NULL | 26667 |   100.00 | Using intersect(idx_c1,PRIMARY); Using where |
+----+-------------+-------+------------+-------------+----------------+----------------+---------+------+-------+----------+----------------------------------------------+
1 row in set, 1 warning (0.00 sec)
-- execution time
1000 rows in set (0.01 sec)

-- Slow log content: Only 1,000 rows are needed, so it doesn't require reading all the records that match the secondary index.
# Time: 2024-06-13T19:00:59.217427+08:00
# User@Host: root[root] @ VMS184912 [10.61.250.57]  Id:   148
# Query_time: 0.010773  Lock_time: 0.000145 Rows_sent: 1000  Rows_examined: 1000
SET timestamp=1718276459;
select * from t where c1 ='d' and id>=600000 limit 1000;

-- If the original SQL query is forced to use the secondary index
mysql> explain select * from t force index(idx_c1) where c1 ='d' and id>=600000  order by id limit 1000;
+----+-------------+-------+------------+-------+---------------+--------+---------+------+-------+----------+-----------------------+
| id | select_type | table | partitions | type  | possible_keys | key    | key_len | ref  | rows  | filtered | Extra                 |
+----+-------------+-------+------------+-------+---------------+--------+---------+------+-------+----------+-----------------------+
|  1 | SIMPLE      | t     | NULL       | range | idx_c1        | idx_c1 | 260     | NULL | 53334 |   100.00 | Using index condition |
+----+-------------+-------+------------+-------+---------------+--------+---------+------+-------+----------+-----------------------+
1 row in set, 1 warning (0.01 sec)


-- execution time: 10ms
1000 rows in set (0.01 sec)

Optimizations

  • Evaluate whether the `ORDER BY` clause is necessary: in some cases the result is already ordered by the primary key when a single secondary index is used, so `ORDER BY` can be omitted; whether this holds depends on the index actually chosen.
  • Modify the `id` range condition: rewriting it as `id + 0 >= 600000` makes the predicate non-indexable on the primary key, which forces the use of the secondary index.

Summary

In most cases, MySQL chooses a single index when accessing a table. However, when a WHERE clause with a range condition meets the following conditions, it may use two indexes through index merge:

1. Secondary index conditions are met

The WHERE clause must include all indexed fields and they must be matched using equality conditions. For example, if the index is idx(a,b,c), the WHERE condition must be a=? AND b=? AND c=? for the index merge to be used.

2. Primary key range queries

When the primary key field is queried using range conditions such as >=, <=, or BETWEEN, MySQL may combine this query with the equality match conditions from the secondary index to utilize multiple indexes.


Author: Zhang Luodan, passionate about database technology, constantly exploring, with hopes of writing more in-depth articles and delivering more valuable content in the future!

For more technical articles, visit ActionSky Open Source Documentation.
