Analyses your database queries and schema and suggests indices improvements. You can use index-digest
as your database linter. The goal is to provide the user with actionable reports instead of just a list of statistics and schema details. Inspired by Percona's pt-index-usage
.
NEW You can also use index-digest
as GitHub's Action.
index-digest
does the following:
- it checks the schema of all tables in a given database and suggests improvements (e.g. removal of redundant indices, adding a primary key to ease replication, dropping tables with just a single column or no rows)
- if provided with SQL queries log (via
--sql-log
option) it:- checks if all tables, columns and indices are used by these queries
- reports text columns with character set different than
utf
- reports queries that do not use indices
- reports queries that use filesort, temporary file or full table scan
- reports queries that are not quite kosher (e.g.
LIKE "%foo%"
,INSERT IGNORE
,SELECT *
,HAVING
clause, highOFFSET
in pagination queries)
- if run with
--analyze-data
switch it:- reports tables with old data (by querying for
MIN()
value of time column) where data retency can be reviewed - reports tables with not up-to-date data (by querying for
MAX()
value of time column)
- reports tables with old data (by querying for
- if run with
--check-empty-databases
switch it:- report empty databases on the current MySQL server
This tool supports MySQL 5.7, 8.0, 8.1, Percona Server 8.0 and MariaDB 10.1, 10.2, 10.5, 10.6 and runs under Python 3.8+.
Results can be reported in a human-readable form, as YAML or sent to syslog and later aggregated & processed using ELK stack.
pip install indexdigest
git clone [email protected]:macbre/index-digest.git && cd index-digest
sudo apt-get install libmysqlclient-dev python3-dev virtualenv
virtualenv -ppython3 env
source env/bin/activate
make install
When using MacOS, you should follow this mysql_config
installation steps.
We assume that the test database is running locally on port 53306. You can use the following to test your changes locally before pushing them (this one uses MySQL 8.0.20):
docker run --rm -p 53306:3306 --health-cmd="mysqladmin ping" --health-interval=10s --health-timeout=5s --health-retries=3 -e "MYSQL_ALLOW_EMPTY_PASSWORD=yes" -e "MYSQL_DATABASE=index_digest" --name=index_digest_mysql mysql:8.0.22 "--default-authentication-plugin=mysql_native_password"
Wait until the server is up and running.
mysql --protocol=tcp --port=53306 -u root --password="" -v < setup.sql
./sql/populate.sh
mysql --protocol=tcp --port=53306 -uindex_digest -pqwerty index_digest -v -e '\s; SHOW TABLES; SHOW DATABASES;'
make test
$ docker run --network=host -t macbre/index-digest:latest mysql://index_digest:qwerty@debian/index_digest | head -n 20
------------------------------------------------------------
Found 61 issue(s) to report for "index_digest" database
------------------------------------------------------------
MySQL v5.7.22 at debian
index-digest v1.2.0
------------------------------------------------------------
redundant_indices → table affected: 0004_id_foo
✗ "idx" index can be removed as redundant (covered by "PRIMARY")
- redundant: UNIQUE KEY idx (item_id, foo)
- covered_by: PRIMARY KEY (item_id, foo)
- schema: CREATE TABLE `0004_id_foo` (
`item_id` int(9) NOT NULL AUTO_INCREMENT,
`foo` varbinary(16) NOT NULL DEFAULT '',
PRIMARY KEY (`item_id`,`foo`),
UNIQUE KEY `idx` (`item_id`,`foo`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
- table_data_size_mb: 0.015625
- table_index_size_mb: 0.015625
...
$ index_digest -h
index_digest
Analyses your database queries and schema and suggests indices improvements.
Usage:
index_digest DSN [--sql-log=<file>] [--format=<formatter>] [--analyze-data] [--checks=<checks> | --skip-checks=<skip-checks>] [--tables=<tables> | --skip-tables=<skip-tables>]
index_digest (-h | --help)
index_digest --version
Options:
DSN Data Source Name of database to check
--sql-log=<file> Text file with SQL queries to check against the database
--format=<formatter> Use a given results formatter (plain, syslog, yaml)
--analyze-data Run additional checks that will query table data (can be slow!)
--checks=<list> Comma-separated lists of checks to report
--skip-checks=<list> Comma-separated lists of checks to skip from report
--tables=<list> Comma-separated lists of tables to report
--skip-tables=<list> Comma-separated lists of tables to skip from report
-h --help Show this screen.
--version Show version.
Examples:
index_digest mysql://username:password@localhost/dbname
index_digest mysql://index_digest:qwerty@localhost/index_digest --sql-log=sql.log
index_digest mysql://index_digest:qwerty@localhost/index_digest --skip-checks=non_utf_columns
index_digest mysql://index_digest:qwerty@localhost/index_digest --analyze-data --checks=data_too_old,data_not_updated_recently
index_digest mysql://index_digest:qwerty@localhost/index_digest --analyze-data --skip-tables=DATABASECHANGELOG,DATABASECHANGELOGLOCK
Visit <https://github.com/macbre/index-digest>
It's a text file with a single SQL query in each line (no line breaks are allowed). Lines that do start with --
(SQL comment) are ignored. The file can be generated using query-digest
when --sql-log
output mode is selected.
An example:
-- A comment
select * from 0002_not_used_indices order by id
select * from 0002_not_used_indices where foo = 'foo' and id = 2
select count(*) from 0002_not_used_indices where foo = 'foo'
/* foo bar */ select * from 0002_not_used_indices where bar = 'foo'
INSERT IGNORE INTO `0070_insert_ignore` VALUES ('123', 9, '2017-01-01');
From MySQL slow query log
MySQL's slow query log needs to be pre-processed first (to remove comments and timestamps):
cat mysql-slow.log | egrep -v '^(SET timestamp|#|throttle: )' > queries.log
Then you can run index_digest --sql-log=queries.log ...
.
index-digest
can return results in various formats (use --format
to choose one).
Emits human-readable report to a console. You can disable colored and bold text by setting env variable ANSI_COLORS_DISABLED=1
.
Pushes JSON-formatted messages via syslog, so they can be aggregated using ELK stack.
Use SYSLOG_IDENT
env variable to customize syslog's ident
messages are sent with (defaults to index-digest
).
Dec 28 15:59:58 debian index-digest[17485]: {"meta": {"version": "index-digest v0.1.0", "database_name": "index_digest", "database_host": "debian", "database_version": "MySQL v5.7.20"}, "report": {"type": "redundant_indices", "table": "0004_id_foo", "message": "\"idx\" index can be removed as redundant (covered by \"PRIMARY\")", "context": {"redundant": "UNIQUE KEY idx (id, foo)", "covered_by": "PRIMARY KEY (id, foo)", "schema": "CREATE TABLE `0004_id_foo` (\n `id` int(9) NOT NULL AUTO_INCREMENT,\n `foo` varbinary(16) NOT NULL DEFAULT '',\n PRIMARY KEY (`id`,`foo`),\n UNIQUE KEY `idx` (`id`,`foo`)\n) ENGINE=InnoDB DEFAULT CHARSET=latin1", "table_data_size_mb": 0.015625, "table_index_size_mb": 0.015625}}}
Outputs YML file with results and metadata.
You can select which checks should be reported by the tool by using --checks
command line option. Certain checks can also be skipped via --skip-checks
option. Refer to index_digest --help
for examples.
Number of checks: 24
redundant_indices
: reports indices that are redundant and covered by othernon_utf_columns
: reports text columns that have characters encoding set tolatin1
(utf is the way to go)missing_primary_index
: reports tables with no primary or unique key (see MySQL bug #76252 and Wikia/app#9863). Primary keys can be enforced on MySQL config level since 8.0.13 (viasql_require_primary_key
variable).test_tables
: reports tables that seem to be test leftovers (e.g.some_guy_test_table
)single_column
: reports tables with just a single columnempty_tables
: reports tables with no rowsgeneric_primary_key
: reports tables with a primary key onid
column (a more meaningful name should be used)use_innodb
: reports table using storage engines different thanInnoDB
(a default for MySQL 5.5+ and MariaDB 10.2+)low_cardinality_index
: reports indices with low cardinality
You need to provide SQL log file via
--sql-log
option
not_used_columns
: checks which columns were not used by SELECT queriesnot_used_indices
: checks which indices are not used by SELECT queriesnot_used_tables
: checks which tables are not used by SELECT queriesqueries_not_using_index
: reports SELECT queries that do not use any indexqueries_using_filesort
: reports SELECT queries that require filesort (a sort can’t be performed from an index and quicksort is used)queries_using_temporary
: reports SELECT queries that require a temporary table to hold the resultqueries_using_full_table_scan
: reports SELECT queries that require a full table scanselects_with_like
: reports SELECT queries that useLIKE '%foo'
conditions (they can not use an index)insert_ignore
: reports queries usingINSERT IGNORE
select_star
: reports queries usingSELECT *
having_clause
: reports queries usingHAVING
clausehigh_offset_selects
: report SELECT queries using high OFFSET
You need to use
--analyze-data
command line switch. Please note that these checks will query your tables. These checks can take a while if queried columns are not indexed.
data_too_old
: reports tables that have really old data, maybe it's worth checking if such long data retention is actually needed (defaults to three months threshold, can be customized viaINDEX_DIGEST_DATA_TOO_OLD_THRESHOLD_DAYS
env variable)data_not_updated_recently
: reports tables that were not updated recently, check if it should be up-to-date (defaults a month threshold, can be customized viaINDEX_DIGEST_DATA_NOT_UPDATED_RECENTLY_THRESHOLD_DAYS
env variable)
You need to use
--check-empty-databases
command line switch.
empty_database
: reports databases that have noBASE TABLE
tables (as provided byinformation_schema.TABLES
)
$ index_digest mysql://index_digest:qwerty@localhost/index_digest --sql-log sql/0002-not-used-indices-log
------------------------------------------------------------
Found 85 issue(s) to report for "index_digest" database
------------------------------------------------------------
MySQL v5.7.21 at debian
index-digest v1.0.0
------------------------------------------------------------
redundant_indices → table affected: 0004_id_foo
✗ "idx" index can be removed as redundant (covered by "PRIMARY")
- redundant: UNIQUE KEY idx (id, foo)
- covered_by: PRIMARY KEY (id, foo)
- schema: CREATE TABLE `0004_id_foo` (
`id` int(9) NOT NULL AUTO_INCREMENT,
`foo` varbinary(16) NOT NULL DEFAULT '',
PRIMARY KEY (`id`,`foo`),
UNIQUE KEY `idx` (`id`,`foo`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
- table_data_size_mb: 0.015625
- table_index_size_mb: 0.015625
------------------------------------------------------------
redundant_indices → table affected: 0004_id_foo_bar
✗ "idx_foo" index can be removed as redundant (covered by "idx_foo_bar")
- redundant: KEY idx_foo (foo)
- covered_by: KEY idx_foo_bar (foo, bar)
- schema: CREATE TABLE `0004_id_foo_bar` (
`id` int(9) NOT NULL AUTO_INCREMENT,
`foo` varbinary(16) NOT NULL DEFAULT '',
`bar` varbinary(16) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `idx_foo` (`foo`),
KEY `idx_foo_bar` (`foo`,`bar`),
KEY `idx_id_foo` (`id`,`foo`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
- table_data_size_mb: 0.015625
- table_index_size_mb: 0.046875
------------------------------------------------------------
missing_primary_index → table affected: 0034_querycache
✗ "0034_querycache" table does not have any primary or unique index
- schema: CREATE TABLE `0034_querycache` (
`qc_type` varbinary(32) NOT NULL,
`qc_value` int(10) unsigned NOT NULL DEFAULT '0',
`qc_namespace` int(11) NOT NULL DEFAULT '0',
`qc_title` varchar(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL DEFAULT '',
KEY `qc_type` (`qc_type`,`qc_value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
------------------------------------------------------------
test_tables → table affected: 0075_some_guy_test_table
✗ "0075_some_guy_test_table" seems to be a test table
- schema: CREATE TABLE `0075_some_guy_test_table` (
`id` int(9) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
------------------------------------------------------------
single_column → table affected: 0074_bag_of_ints
✗ "0074_bag_of_ints" has just a single column
- schema: CREATE TABLE `0074_bag_of_ints` (
`id` int(9) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
------------------------------------------------------------
empty_tables → table affected: 0089_empty_table
✗ "0089_empty_table" table has no rows, is it really needed?
- schema: CREATE TABLE `0089_empty_table` (
`id` int(9) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
------------------------------------------------------------
generic_primary_key → table affected: 0094_generic_primary_key
✗ "0094_generic_primary_key" has a primary key called id, use a more meaningful name
- schema: CREATE TABLE `0094_generic_primary_key` (
`id` int(9) NOT NULL AUTO_INCREMENT,
`foo` varchar(16) NOT NULL DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
------------------------------------------------------------
use_innodb → table affected: 0036_use_innodb_myisam
✗ "0036_use_innodb_myisam" uses MyISAM storage engine
- schema: CREATE TABLE `0036_use_innodb_myisam` (
`item_id` int(9) NOT NULL AUTO_INCREMENT,
`foo` int(8) DEFAULT NULL,
PRIMARY KEY (`item_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
- engine: MyISAM
------------------------------------------------------------
not_used_indices → table affected: 0002_not_used_indices
✗ "test_id_idx" index was not used by provided queries
- not_used_index: KEY test_id_idx (test, id)
------------------------------------------------------------
not_used_tables → table affected: 0020_big_table
✗ "0020_big_table" table was not used by provided queries
- schema: CREATE TABLE `0020_big_table` (
`id` int(9) NOT NULL AUTO_INCREMENT,
`val` int(9) NOT NULL,
`text` char(5) NOT NULL,
PRIMARY KEY (`id`),
KEY `text_idx` (`text`)
) ENGINE=InnoDB AUTO_INCREMENT=100001 DEFAULT CHARSET=utf8
- table_size_mb: 5.03125
- rows_estimated: 100405
------------------------------------------------------------
insert_ignore → table affected: 0070_insert_ignore
✗ "INSERT IGNORE INTO `0070_insert_ignore` VALUES (9,..." query uses a risky INSERT IGNORE
- query: INSERT IGNORE INTO `0070_insert_ignore` VALUES (9, '123', '2017-01-01');
- schema: CREATE TABLE `0070_insert_ignore` (
`id` int(9) NOT NULL,
`text` char(5) NOT NULL,
`time` datetime DEFAULT NULL,
UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
------------------------------------------------------------
non_utf_columns → table affected: 0032_latin1_table
✗ "name" text column has "latin1" character set defined
- column: name
- column_character_set: latin1
- column_collation: latin1_swedish_ci
- schema: CREATE TABLE `0032_latin1_table` (
`item_id` int(9) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`utf8_column` varchar(255) CHARACTER SET utf8 COLLATE utf8_polish_ci NOT NULL,
`ucs2_column` varchar(255) CHARACTER SET ucs2 DEFAULT NULL,
`utf8mb4_column` varchar(255) CHARACTER SET utf8mb4 DEFAULT NULL,
`utf16_column` varchar(255) CHARACTER SET utf16 DEFAULT NULL,
`utf32_column` varchar(255) CHARACTER SET utf32 DEFAULT NULL,
`binary_column` varbinary(255) DEFAULT NULL,
`latin_blob` blob,
PRIMARY KEY (`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
------------------------------------------------------------
(...)
------------------------------------------------------------
queries_using_filesort → table affected: 0020_big_table
✗ "SELECT val, count(*) FROM 0020_big_table WHERE id ..." query used filesort
- query: SELECT val, count(*) FROM 0020_big_table WHERE id BETWEEN 10 AND 20 GROUP BY val
- explain_extra: Using where; Using temporary; Using filesort
- explain_rows: 11
- explain_filtered: None
- explain_key: PRIMARY
------------------------------------------------------------
queries_using_temporary → table affected: 0020_big_table
✗ "SELECT val, count(*) FROM 0020_big_table WHERE id ..." query used temporary
- query: SELECT val, count(*) FROM 0020_big_table WHERE id BETWEEN 10 AND 20 GROUP BY val
- explain_extra: Using where; Using temporary; Using filesort
- explain_rows: 11
- explain_filtered: None
- explain_key: PRIMARY
------------------------------------------------------------
queries_using_full_table_scan → table affected: 0020_big_table
✗ "SELECT * FROM 0020_big_table" query triggered full table scan
- query: SELECT * FROM 0020_big_table
- explain_rows: 9041
------------------------------------------------------------
selects_with_like → table affected: 0020_big_table
✗ "SELECT * FROM 0020_big_table WHERE text LIKE '%00'" query uses LIKE with left-most wildcard
- query: SELECT * FROM 0020_big_table WHERE text LIKE '%00'
- explain_extra: Using where
- explain_rows: 100623
------------------------------------------------------------
select_star → table affected: bar
✗ "SELECT t.* FROM bar AS t" query uses SELECT *
- query: SELECT t.* FROM bar AS t;
------------------------------------------------------------
having_clause → table affected: sales
✗ "SELECT s.cust_id,count(s.cust_id) FROM SH.sales s ..." query uses HAVING clause
- query: SELECT s.cust_id,count(s.cust_id) FROM SH.sales s GROUP BY s.cust_id HAVING s.cust_id != '1660' AND s.cust_id != '2'
(...)
------------------------------------------------------------
low_cardinality_index → table affected: 0020_big_table
✗ "num_idx" index on "num" column has low cardinality, check if it is needed
- column_name: num
- index_name: num_idx
- index_cardinality: 2
- schema: CREATE TABLE `0020_big_table` (
`item_id` int(9) NOT NULL AUTO_INCREMENT,
`val` int(9) NOT NULL,
`text` char(5) NOT NULL,
`num` int(3) NOT NULL,
PRIMARY KEY (`item_id`),
KEY `text_idx` (`text`),
KEY `num_idx` (`num`)
) ENGINE=InnoDB AUTO_INCREMENT=100001 DEFAULT CHARSET=utf8
- value_usage: 33.24788541334185
(...)
------------------------------------------------------------
data_too_old → table affected: 0028_data_too_old
✗ "0028_data_too_old" has rows added 184 days ago, consider changing retention policy
- diff_days: 184
- data_since: 2017-08-17 12:03:44
- data_until: 2018-02-17 12:03:44
- date_column_name: timestamp
- schema: CREATE TABLE `0028_data_too_old` (
`item_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
`cnt` int(8) unsigned NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1
- rows: 4
- table_size_mb: 0.015625
------------------------------------------------------------
data_not_updated_recently → table affected: 0028_data_not_updated_recently
✗ "0028_data_not_updated_recently" has the latest row added 40 days ago, consider checking if it should be up-to-date
- diff_days: 40
- data_since: 2017-12-29 12:03:44
- data_until: 2018-01-08 12:03:44
- date_column_name: timestamp
- schema: CREATE TABLE `0028_data_not_updated_recently` (
`item_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
`cnt` int(8) unsigned NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
- rows: 3
- table_size_mb: 0.015625
------------------------------------------------------------
high_offset_selects → table affected: page
✗ "SELECT /* CategoryPaginationViewer::processSection..." query uses too high offset impacting the performance
- query: SELECT /* CategoryPaginationViewer::processSection */ page_namespace,page_title,page_len,page_is_redirect,cl_sortkey_prefix FROM `page` INNER JOIN `categorylinks` FORCE INDEX (cl_sortkey) ON ((cl_from = page_id)) WHERE cl_type = 'page' AND cl_to = 'Spotify/Song' ORDER BY cl_sortkey LIMIT 927600,200
- limit: 200
- offset: 927600
------------------------------------------------------------
empty_database → table affected: index_digest_empty
✗ "index_digest_empty" database has no tables
------------------------------------------------------------
Queries performed: 100
Want to add your entry here? Submit a pull request
- By running
index-digest
at Wikia on shared database clusters (including tables storing ~450 mm of rows with 300+ GiB of data) we were able to reclaim around 1.25 TiB of MySQL storage space across all replicas.