Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

curvefs/metaserver: now we support rocksdb storage. #1214

Merged
merged 1 commit into from
Apr 10, 2022
Merged

curvefs/metaserver: now we support rocksdb storage. #1214

merged 1 commit into from
Apr 10, 2022

Conversation

Wine93
Copy link
Contributor

@Wine93 Wine93 commented Mar 26, 2022

No description provided.

# metaserver max memory quota bytes (default: 30GB)
storage.max_memory_quota_bytes=32212254720
# metaserver max disk quota bytes (default: 2TB)
storage.max_disk_quota_bytes=2199023255552
# storage root dir, rocksdb's all SST and other files store under it
storage.data_dir=/tmp # __CURVEADM_TEMPLATE__ ${prefix}/data/storage __CURVEADM_TEMPLATE__
# rocksdb column failmy's write_buffer_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Family ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Family ?

fixed.

storage.rocksdb.unordered_write_buffer_size=134217728
# rocksdb column failmy's max_write_buffer_number
# for store inode which exclude its s3chunkinfo list (default: 5)
storage.rocksdb.unordered_max_write_buffer_number=5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the unit? bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the unit? bytes?

yep~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, what is the meaning of 5?

storage.rocksdb.unordered_write_buffer_size=134217728
# rocksdb column failmy's max_write_buffer_number
# for store inode which exclude its s3chunkinfo list (default: 5)
storage.rocksdb.unordered_max_write_buffer_number=5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the unit? bytes?

Copy link
Contributor Author

@Wine93 Wine93 Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the unit? bytes?

yep~

// StreamStatus::RECEIVE_OK
auto iter = out->find(chunkIndex);
if (iter == out->end()) {
out->insert({chunkIndex, list});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out->insert({chunkIndex, std::move(list});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out->insert({chunkIndex, std::move(list});

fixed.

if (iter == out->end()) {
out->insert({chunkIndex, list});
} else {
merge(list, &iter->second);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list can also move into iter->second

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list can also move into iter->second

fixed.

buffer.clear();
buffer.append(std::to_string(key.chunkIndex));
buffer.append(":");
buffer.append(svalue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void TrivialDelete(void*) {}

buffer.append_user_data(svalue.data(), svalue.size(), TrivialDeleter)

this can also reduce extra copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void TrivialDelete(void*) {}

buffer.append_user_data(svalue.data(), svalue.size(), TrivialDeleter)

this can also reduce extra copy.

practical tips! i will improve it.

return EQUAL(fsid) && EQUAL(parentinodeid) &&
EQUAL(name) && lhs.txid() <= rhs.txid();
}

bool MemoryDentryStorage::IsSameDentry(const Dentry& lhs, const Dentry& rhs) {
inline bool DentryStorage::IsSameDentry(const Dentry& lhs, const Dentry& rhs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for inline, it's not as we think about it from spelling, it just means that a function's definition can appear in more than one translation unit.

and if you really want a function to be unfolded in the caller, you should use magic from compiler, like __attribute__((always_inline)) in gcc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for inline, it's not as we think about it from spelling, it just means that a function's definition can appear in more than one translation unit.

and if you really want a function to be unfolded in the caller, you should use magic from compiler, like __attribute__((always_inline)) in gcc

yep, fixed.

auto parentId = dentry.parentinodeid();
auto exclude = dentry.name();
auto txId = dentry.txid();
MetaStatusCode DentryStorage::List(const Dentry& dentry,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list doesn't hold lock when processing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list doesn't hold lock when processing?

fixed.

MetaStatusCode InodeManager::PaddingInodeS3ChunkInfo(int32_t fsId,
uint64_t inodeId,
Inode* inode) {
NameLockGuard lg(inodeLock_, GetInodeLockName(fsId, inodeId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you hold lock in here, every operations of current inode will be blocked.

Copy link
Contributor Author

@Wine93 Wine93 Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you hold lock in here, every operations of current inode will be blocked.

the lock is redundant for PaddingInodeS3ChunkInfo operator, i will remove it.


std::string EncodeNumber(size_t num);

inline size_t DecodeNumber(std::string str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::string& str

and should you check str's length? if str.length() < sizeof(size_t), this will cause undefined behavior

Copy link
Contributor Author

@Wine93 Wine93 Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::string& str

and should you check str's length? if str.length() < sizeof(size_t), this will cause undefined behavior

fixed

@wu-hanqing
Copy link
Contributor

BTW, avoid sharing everything, you just need to clarify object's lifetime, if you can find out when it starts and when it ends, then a unique_ptr or raw pointer is enough

@Wine93
Copy link
Contributor Author

Wine93 commented Mar 31, 2022

BTW, avoid sharing everything, you just need to clarify object's lifetime, if you can find out when it starts and when it ends, then a unique_ptr or raw pointer is enough

I will pay attention it and improve it in this branch.

@h0hmj h0hmj self-requested a review April 1, 2022 02:05
# metaserver max memory quota bytes (default: 30GB)
storage.max_memory_quota_bytes=32212254720
# metaserver max disk quota bytes (default: 2TB)
storage.max_disk_quota_bytes=2199023255552
# storage root dir, rocksdb's all SST and other files store under it
storage.data_dir=./storage # __CURVEADM_TEMPLATE__ ${prefix}/data/storage __CURVEADM_TEMPLATE__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ./storage as the default value seems bad

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ./storage as the default value seems bad

FYH

@@ -819,6 +829,49 @@ void MetaServerClientImpl::UpdateXattrAsync(const Inode &inode,
excutor->DoAsyncRPCTask(taskDone);
}

bool MetaServerClientImpl::ParseStreamBuffer(butil::IOBuf* buffer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make function name more specific.
Client::ParseStreamBuffer is too generic
it seems only for parsing s3chunkinfo related.
maybe somthing like ParseS3MetaBuffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make function name more specific. Client::ParseStreamBuffer is too generic it seems only for parsing s3chunkinfo related. maybe somthing like ParseS3MetaBuffer?

yep, i will improve it.

prefix?= "$(PWD)"

download:
@wget https://curve-build.nos-eastchina1.126.net/rocksdb-6.28.fb.tar.gz && tar -zxvf rocksdb-6.28.fb.tar.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

library is static or shared? any dependence?
arch is amd64?

return MetaStatusCode::OK;
}

auto txn = kvStorage_->BeginTransaction();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems when kvstorage type is memory, BeginTransaction/EndTransaction are not implemented as expected.
when request is from compaction, AppendS3ChunkInfoList is not atomic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems when kvstorage type is memory, BeginTransaction/EndTransaction are not implemented as expected. when request is from compaction, AppendS3ChunkInfoList is not atomic.

yep, maybe we can using MVCC to implement a simple transaction for our memoy storage in the next release version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i forget we have raft for metadata operations : )
it should be okay.

@@ -1,5 +1,5 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange filename? config.h~HEAD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange filename

i will remove it.

// see issue: https://github.com/apache/incubator-brpc/issues/392
auto channel = std::make_shared<brpc::Channel>();
brpc::ChannelOptions options;
options.connection_type = "pooled";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, you can use options.connection_group to separate normal channel and streaming channel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate normal channel and streaming channel

yep, i will improve it.

storage.rocksdb.unordered_write_buffer_size=134217728
# rocksdb column failmy's max_write_buffer_number
# for store inode which exclude its s3chunkinfo list (default: 5)
storage.rocksdb.unordered_max_write_buffer_number=5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, what is the meaning of 5?

} else if (!StringToUll(out.to_string(), chunkIndex)) {
LOG(ERROR) << "invalid stream buffer: invalid chunkIndex";
return false;
} else if (!list->ParseFromString(buffer->to_string())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, i will improve it.

auto merge = [](const S3ChunkInfoList& from, S3ChunkInfoList* to) {
for (size_t i = 0; i < from.s3chunks_size(); i++) {
auto chunkinfo = to->add_s3chunks();
*chunkinfo = std::move(from.s3chunks(i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might not work, because the result of from.s3chunks(i) is const, and you can move from a const variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might not work, because the result of from.s3chunks(i) is const, and you can move from a const variable.

fixed:

 *chunkinfo = std::move(*from->mutable_s3chunks(i));

}
}
{
WriteLockGuard writeLockGuard(rwLock_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should double-check in write lock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should double-check in write lock

fixed.

}
}
{
WriteLockGuard writeLockGuard(rwLock_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

fixed.

ReadLockGuard readLockGuard(rwLock_);
std::thread th =
std::thread(&MetaStoreImpl::SaveBackground, this, path, done);
WriteLockGuard writeLockGuard(rwLock_);
Copy link
Contributor

@wu-hanqing wu-hanqing Apr 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this lock doesn't help, even you build iterators within the lock, because this lock is released after this function is returned.

And the dump works without lock, so data that iterators point to can be modified or deleted.

For memory storage, the possibility is relatively low because you forked. But as for rocksdb, user's requests and dump work on the same database concurrently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, rocksdb's problem has been fixed by snapshot, #1228

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now for memory storage, we will hold write lock until child process forked, it can gurantee all data consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if metastore has stored lots of data, this will block other requests, anyway, we can leave it later.

Copy link
Contributor Author

@Wine93 Wine93 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if metastore has stored lots of data, this will block other requests, anyway, we can leave it later.

actually, we only return the iterator of all datas execpt the partition iterator which is small enough, and iterate all data in the child process, so it will not lead to a performance problem.

@Wine93
Copy link
Contributor Author

Wine93 commented Apr 6, 2022

recheck

#
# storage settings
#
# metaserver storage data directory
storage.data_dir=/tmp/storage # __CURVEADM_TEMPLATE__ ${prefix}/data __CURVEADM_TEMPLATE__
# metaserver storage type, "momory" or "rocksdb"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: memory?

fixed

Status MemoryStorage::HGet(const std::string& name,
const std::string& key,
ValueType* value) {
if (options_.compression) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (options_.compression) {
    GET_SERALIZED(UnorderedSeralizedContainer, name, key, value);
    return;
}
GET(UnorderedContainer, name, key, value);

add an explicit return here, it's more readable.

and I think those macros can be replaced by template

Copy link
Contributor Author

@Wine93 Wine93 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an explicit return here, it's more readable.

yep, i will improve it, but the return value of those functions is Status not void, so we should return Status::OK()

and I think those macros can be replaced by template

Actually, i implement it with templates at first, but it's not easy to deal with GetContainer(), the code become more heavy, so i trun to use macros.
maybe it's not an elegant implementation, i had mark a todo and will improve it in the next commit.

Comment on lines +66 to +73
class ValueWrapper {
public:
explicit ValueWrapper(const ValueType& value) {
RETURN_IF_CONVERT_VALUE_SUCCESS(Dentry);
RETURN_IF_CONVERT_VALUE_SUCCESS(Inode);
RETURN_IF_CONVERT_VALUE_SUCCESS(S3ChunkInfoList);
}

std::shared_ptr<ValueType> Message() const {
return value_;
}

private:
std::shared_ptr<ValueType> value_;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dynamic cast has runtime costs and the const_cast may cause undefined behavior.

but currently, you use ValueType as the interface, so it's not easy to change this.

maybe, std::any(needs C++17) or absl::any is the solution.

and, use unique_ptr is enough for this case, and Message() just return the pointer.

Copy link
Contributor Author

@Wine93 Wine93 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dynamic cast has runtime costs and the const_cast may cause undefined behavior.

but currently, you use ValueType as the interface, so it's not easy to change this.

maybe, std::any(needs C++17) or absl::any is the solution.

maybe replace the type of interface param value toabsl::any is a good solution, but i looking through the documentation of absl::any, it states that absl::any can only constructed by copying value, it means all data will be copyed:

// An `absl::any` can only store a type that is copy-constructible; move-only
// types are not allowed within an `any` object.

and, use unique_ptr is enough for this case, and Message() just return the pointer.

actually we need assignment operate, so shared_ptr is our choice:

#define SET(TYPE, NAME, KEY, VALUE)                   \
do {                                                  \
    auto container = GET_CONTAINER(TYPE, NAME);       \
    auto valueWrapper = ValueWrapper(VALUE);          \
    auto ret = container->emplace(KEY, valueWrapper); \
    if (!ret.second) {                                \
        ret.first->second = valueWrapper;             \  // <--- here
    }                                                 \
    return Status::OK();                              \
} while (0)

ReadLockGuard readLockGuard(rwLock_);
std::thread th =
std::thread(&MetaStoreImpl::SaveBackground, this, path, done);
WriteLockGuard writeLockGuard(rwLock_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if metastore has stored lots of data, this will block other requests, anyway, we can leave it later.

#
# storage settings
#
# metaserver storage data directory
storage.data_dir=/tmp/storage # __CURVEADM_TEMPLATE__ ${prefix}/data __CURVEADM_TEMPLATE__
# metaserver storage type, "momory" or "rocksdb"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo “momory”

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently rocksdb is just cheaper replacement of memory cache.
"metaserver storage" may be misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo “momory”

it already fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently rocksdb is just cheaper replacement of memory cache. "metaserver storage" may be misleading.

fixed.

@Wine93
Copy link
Contributor Author

Wine93 commented Apr 9, 2022

recheck

1 similar comment
@Wine93
Copy link
Contributor Author

Wine93 commented Apr 10, 2022

recheck

@Wine93 Wine93 merged commit 79e09d3 into opencurve:master Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants