Skip to content
This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

[15721] Index Suggestion #1347

Open
wants to merge 182 commits into
base: master
Choose a base branch
from

Conversation

sivaprasadsudhir
Copy link
Contributor

@sivaprasadsudhir sivaprasadsudhir commented May 5, 2018

This PR implements a search based algorithm for suggesting the best set of indexes for a given workload based on the AutoAdmin paper from Microsoft.

The main workflow and classes are given below:

  • The helper classes and methods for our algorithm are defined in brain/index_selection_util[.h/.c]. This includes the IndexSelectionKnobs (tunable knobs of the algorithm), HypotheticalIndexObject (a hypothetical index class), IndexConfiguration (the set of hypothetical indexes). Multiple helper methods and overloaded operators for IndexConfiguration are defined here

  • IndexSelection is the main wrapper around our tool which takes in a workload (set of queries) and the tunable parameters of the algorithm. It returns the best index configuration through the main external API GetBestIndexes

  • The methods for running the search based algorithm are present in brain/index_selection[.h/.c]. The three modules which are used for each search iteration are:

    • GenerateCandidateIndexes generates the admissible indexes (indexes that benefit at least one query in the workload) from the provided queries and prune the useless ones
    • Enumerate that gets the top k indexes for the workload which would reduce the cost of executing them through a combination of exhaustive search (ExhaustiveEnumeration) and greedy search (GreedySearch)
    • GenerateMultiColumnIndexes generates multi-column indexes from the single column indexes by doing a cross product and adds it into the result which will be used for the next iteration.
  • The WhatIfIndex in brain/what_if_index[.h/.c] returns best physical plan tree and the cost associated with a hypothetical index configuration. This is possible with the set of appropriate changes made to optimizer/optimizer[.h/.c]

  • The IndexSelectionContext in brain/index_selection_context[.h/.c] memoizes the cost of the query for a given configuration to reduce the number of calls sent to WhatIfAPI

  • Integrating this module to work with the self-driving infrastructure of Peloton (Brain) is a work in progress

vkonagar and others added 29 commits May 12, 2018 13:23
catalog cache eviction is not done properly.
# Conflicts:
#	src/brain/what_if_index.cpp
warning: the specified comparator type does not provide a const call
operator [-Wuser-defined-warnings]
DEFUALT_SCHEMA_NAME can't be found error. Fix this when merging with
master.
DEFUALT_SCHEMA_NAME can't be found error. Fix this when merging with
master.
Copy link
Member

@linmagit linmagit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the code quality is good! The documentation is very good. I left some comments to fix. Besides the comments, there're also two things:

  1. I didn't check all the files, but it looks like you didn't use forward-declaration to reduce the dependency. You should check where you're only using pointers in the .h file and forward-declare the classes and move the includes to the .cpp file as much as possible.
  2. Some tests on Jenkins are failing. PLease fix them as well.

}
}

Workload *w;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this w used?

// Column is a table column name.
// 2. GROUP BY (if present)
// 3. ORDER BY (if present)
// 4. all updated columns for UPDATE query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only get the columns in the where clause of the UPDATE query, not all updated columns for UPDATE query, right? The code looks correct, but the comment looks wrong?

auto result = std::hash<std::string>()(key.second->GetInfo());
for (auto index : indexes) {
// TODO[Siva]: Use IndexObjectHasher to hash this
result ^= std::hash<std::string>()(index->ToString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to fix this now.

// The mapping from the object to the shared pointer
std::unordered_map<HypotheticalIndexObject,
std::shared_ptr<HypotheticalIndexObject>,
IndexObjectHasher> map_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we decide to do it this way instead of directly storing a set of HypotheticalIndexObjects? Is it for efficiency consideration?

}

PELOTON_ASSERT(index->column_oids.size() > 0);
auto response = request.send().wait(client.getWaitScope());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the response and through some warning if it does not succeed?

// Get index objects
bool InsertIndexObject(std::shared_ptr<IndexCatalogObject> index_object);
bool EvictIndexObject(oid_t index_oid);
bool EvictIndexObject(const std::string &index_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again. These should be protected and only used by what-if API through friend class.

@@ -79,6 +67,9 @@ class TableCatalogObject {
inline oid_t GetDatabaseOid() { return database_oid; }
inline uint32_t GetVersionId() { return version_id; }

// NOTE: should be only used by What-if API.
void SetValidIndexObjects(bool is_valid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protected.

// try get from cache
auto pg_table = Catalog::GetInstance()
->GetSystemCatalogs(database_oid)
->GetTableCatalog();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct way to do this. We should not directly access the catalog instance. We should get the database catalog object from txn->catalog_cache_, then get table objects, and then index objects. Everything should go through the local catalog cache of the transaction but no the global catalog instance.


// TODO: Avoid using this function.
// Copied from SQL testing util.
// Execute a SQL query end-to-end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a hack.... Is there a way to fix this?

for (size_t idx = 0; idx < key_attr_list.size(); ++idx) {
// If index cannot further reduce scan range, break
if (idx == op->key_column_id_list.size() ||
key_attr_list[idx] != op->key_column_id_list[idx]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenboy It looks like this thing requires the key_column_id_list has exactly the same order as key_attr_list? But in fact, an index with col(a, b, c) can also help a query with predicates(b=2 AND a=1 AND c=3), right?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants