curvefs/metaserver: now we support rocksdb storage. #1214
Conversation
curvefs/conf/metaserver.conf
Outdated
# metaserver max memory quota bytes (default: 30GB)
storage.max_memory_quota_bytes=32212254720
# metaserver max disk quota bytes (default: 2TB)
storage.max_disk_quota_bytes=2199023255552
# storage root dir, rocksdb's all SST and other files store under it
storage.data_dir=/tmp  # __CURVEADM_TEMPLATE__ ${prefix}/data/storage __CURVEADM_TEMPLATE__
# rocksdb column failmy's write_buffer_size
typo: Family ?
fixed.
storage.rocksdb.unordered_write_buffer_size=134217728
# rocksdb column failmy's max_write_buffer_number
# for store inode which exclude its s3chunkinfo list (default: 5)
storage.rocksdb.unordered_max_write_buffer_number=5
what is the unit? bytes?
yep~
so, what is the meaning of 5?
// StreamStatus::RECEIVE_OK
auto iter = out->find(chunkIndex);
if (iter == out->end()) {
    out->insert({chunkIndex, list});
out->insert({chunkIndex, std::move(list)});
fixed.
if (iter == out->end()) {
    out->insert({chunkIndex, list});
} else {
    merge(list, &iter->second);
list can also be moved into iter->second.
fixed.
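A minimal sketch of the two suggestions combined (the map type and the surrounding helper are assumptions for illustration): list is moved in both branches instead of being copied.

#include <cstdint>
#include <map>
#include <utility>

void MergeChunkList(uint64_t chunkIndex,
                    S3ChunkInfoList list,  // taken by value so it can be moved from
                    std::map<uint64_t, S3ChunkInfoList>* out) {
    auto iter = out->find(chunkIndex);
    if (iter == out->end()) {
        // no entry yet: move the whole list into the map
        out->insert({chunkIndex, std::move(list)});
    } else {
        // entry exists: move the elements in one by one instead of copying
        for (int i = 0; i < list.s3chunks_size(); i++) {
            *iter->second.add_s3chunks() = std::move(*list.mutable_s3chunks(i));
        }
    }
}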
curvefs/src/metaserver/metastore.cpp
Outdated
buffer.clear();
buffer.append(std::to_string(key.chunkIndex));
buffer.append(":");
buffer.append(svalue);
void TrivialDelete(void*) {}
buffer.append_user_data(svalue.data(), svalue.size(), TrivialDelete);
This can also avoid an extra copy.
Practical tip! I will improve it.
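A minimal sketch of the zero-copy append suggested above, assuming svalue outlives the IOBuf's use of it (the no-op deleter means the IOBuf never frees the referenced bytes); the const_cast is needed because append_user_data takes a void*.

#include <butil/iobuf.h>
#include <cstdint>
#include <string>

static void TrivialDelete(void*) {}

void AppendChunkValue(butil::IOBuf* buffer, uint64_t chunkIndex, const std::string& svalue) {
    buffer->clear();
    buffer->append(std::to_string(chunkIndex));
    buffer->append(":");
    // reference svalue's bytes directly instead of copying them into the IOBuf
    buffer->append_user_data(const_cast<char*>(svalue.data()), svalue.size(), TrivialDelete);
}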
    return EQUAL(fsid) && EQUAL(parentinodeid) &&
           EQUAL(name) && lhs.txid() <= rhs.txid();
}

bool MemoryDentryStorage::IsSameDentry(const Dentry& lhs, const Dentry& rhs) {
inline bool DentryStorage::IsSameDentry(const Dentry& lhs, const Dentry& rhs) {
For inline, it's not what you'd think from the spelling; it just means that a function's definition can appear in more than one translation unit. And if you really want a function to be expanded into the caller, you should use compiler magic like __attribute__((always_inline)) in GCC.
yep, fixed.
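A minimal illustration of the distinction (the functions are toy examples, not from this PR): inline only relaxes the one-definition rule so the definition may appear in multiple translation units, e.g. in a header; it does not force expansion at the call site, which needs the compiler-specific attribute.

// ODR-friendly: safe to define in a header included by many translation units
inline bool IsEven(int x) { return x % 2 == 0; }

// forces GCC/Clang to expand the call in the caller
__attribute__((always_inline)) inline bool IsEvenForced(int x) {
    return x % 2 == 0;
}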
auto parentId = dentry.parentinodeid();
auto exclude = dentry.name();
auto txId = dentry.txid();
MetaStatusCode DentryStorage::List(const Dentry& dentry,
list doesn't hold lock when processing?
fixed.
MetaStatusCode InodeManager::PaddingInodeS3ChunkInfo(int32_t fsId,
                                                     uint64_t inodeId,
                                                     Inode* inode) {
    NameLockGuard lg(inodeLock_, GetInodeLockName(fsId, inodeId));
If you hold the lock here, every operation on the current inode will be blocked.
The lock is redundant for the PaddingInodeS3ChunkInfo operation, I will remove it.
std::string EncodeNumber(size_t num);

inline size_t DecodeNumber(std::string str) {
Use const std::string& str. And should you check str's length? If str.length() < sizeof(size_t), this will cause undefined behavior.
fixed
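A minimal sketch of the bounds-checked decode discussed above; it assumes the number was encoded as sizeof(size_t) raw bytes by EncodeNumber, and switches to a bool return plus out-parameter so a short input can be rejected instead of causing undefined behavior.

#include <cstring>
#include <string>

inline bool DecodeNumber(const std::string& str, size_t* num) {
    if (str.length() < sizeof(size_t)) {
        return false;  // too short: refuse to read past the end
    }
    std::memcpy(num, str.data(), sizeof(size_t));
    return true;
}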
BTW, avoid sharing everything; you just need to clarify the object's lifetime. If you can figure out when it starts and when it ends, then a unique_ptr or raw pointer is enough.
I will pay attention to it and improve it in this branch.
# metaserver max memory quota bytes (default: 30GB)
storage.max_memory_quota_bytes=32212254720
# metaserver max disk quota bytes (default: 2TB)
storage.max_disk_quota_bytes=2199023255552
# storage root dir, rocksdb's all SST and other files store under it
storage.data_dir=./storage  # __CURVEADM_TEMPLATE__ ${prefix}/data/storage __CURVEADM_TEMPLATE__
Using ./storage as the default value seems bad.
FYH
@@ -819,6 +829,49 @@ void MetaServerClientImpl::UpdateXattrAsync(const Inode &inode,
    excutor->DoAsyncRPCTask(taskDone);
}

bool MetaServerClientImpl::ParseStreamBuffer(butil::IOBuf* buffer,
Make the function name more specific; Client::ParseStreamBuffer is too generic, and it seems to be only for parsing s3chunkinfo-related data. Maybe something like ParseS3MetaBuffer?
yep, i will improve it.
prefix?= "$(PWD)"

download:
	@wget https://curve-build.nos-eastchina1.126.net/rocksdb-6.28.fb.tar.gz && tar -zxvf rocksdb-6.28.fb.tar.gz
Is the library static or shared? Any dependencies? Is the arch amd64?
    return MetaStatusCode::OK;
}

auto txn = kvStorage_->BeginTransaction();
It seems that when the kvstorage type is memory, BeginTransaction/EndTransaction are not implemented as expected. When the request comes from compaction, AppendS3ChunkInfoList is not atomic.
Yep, maybe we can use MVCC to implement a simple transaction for our memory storage in the next release version.
I forgot we have raft for metadata operations :) It should be okay.
@@ -1,5 +1,5 @@
/*
strange filename? config.h~HEAD?
i will remove it.
// see issue: https://github.com/apache/incubator-brpc/issues/392
auto channel = std::make_shared<brpc::Channel>();
brpc::ChannelOptions options;
options.connection_type = "pooled";
Maybe you can use options.connection_group to separate the normal channel and the streaming channel.
yep, i will improve it.
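A minimal sketch of the suggestion, keeping the pooled workaround from the linked brpc issue; the endpoint handling is illustrative, the point is that a dedicated connection_group keeps streaming connections out of the pool used by ordinary RPCs.

#include <brpc/channel.h>
#include <memory>
#include <string>

std::shared_ptr<brpc::Channel> NewStreamingChannel(const std::string& endpoint) {
    auto channel = std::make_shared<brpc::Channel>();
    brpc::ChannelOptions options;
    options.connection_type = "pooled";
    options.connection_group = "streaming";  // isolate from normal channels
    if (channel->Init(endpoint.c_str(), &options) != 0) {
        return nullptr;  // failed to initialize the channel
    }
    return channel;
}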
} else if (!StringToUll(out.to_string(), chunkIndex)) {
    LOG(ERROR) << "invalid stream buffer: invalid chunkIndex";
    return false;
} else if (!list->ParseFromString(buffer->to_string())) {
you can check whether https://github.com/apache/incubator-brpc/blob/4839ec2add93b439ccfe2761ce76a8914952992a/src/brpc/protocol.h#L207 is useful to reduce memory copy in here.
yep, i will improve it.
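One possible way to avoid the extra copy from buffer->to_string() (whether the helper in the linked protocol.h is preferable is left open): parse the message straight from the IOBuf through a zero-copy input stream.

#include <butil/iobuf.h>

bool ParseListFromIOBuf(const butil::IOBuf& buffer, S3ChunkInfoList* list) {
    // wraps the IOBuf blocks without flattening them into a contiguous string
    butil::IOBufAsZeroCopyInputStream wrapper(buffer);
    return list->ParseFromZeroCopyStream(&wrapper);
}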
auto merge = [](const S3ChunkInfoList& from, S3ChunkInfoList* to) {
    for (size_t i = 0; i < from.s3chunks_size(); i++) {
        auto chunkinfo = to->add_s3chunks();
        *chunkinfo = std::move(from.s3chunks(i));
This might not work, because the result of from.s3chunks(i) is const, and you can't move from a const variable.
fixed:
*chunkinfo = std::move(*from->mutable_s3chunks(i));
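A minimal sketch of the fixed merge; it assumes the lambda now takes the source list by non-const pointer, otherwise mutable_s3chunks() is unavailable and the move silently degrades to a copy.

auto merge = [](S3ChunkInfoList* from, S3ChunkInfoList* to) {
    for (int i = 0; i < from->s3chunks_size(); i++) {
        // mutable_s3chunks(i) yields a non-const element, so std::move is effective
        *to->add_s3chunks() = std::move(*from->mutable_s3chunks(i));
    }
};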
    }
}
{
    WriteLockGuard writeLockGuard(rwLock_);
You should double-check under the write lock.
fixed.
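A minimal, self-contained sketch of the double-check pattern being asked for; the container and value types here are illustrative, not the metaserver's real ones.

#include <map>
#include <memory>
#include <string>

class Cache {
 public:
    std::shared_ptr<std::string> GetOrCreate(const std::string& key) {
        {
            ReadLockGuard readLockGuard(rwLock_);
            auto iter = table_.find(key);
            if (iter != table_.end()) {
                return iter->second;  // fast path under the read lock
            }
        }
        WriteLockGuard writeLockGuard(rwLock_);
        auto iter = table_.find(key);  // check again: another writer may have
        if (iter == table_.end()) {    // inserted between the two locks
            iter = table_.emplace(key, std::make_shared<std::string>("")).first;
        }
        return iter->second;
    }

 private:
    RWLock rwLock_;  // curve's read-write lock, as used in the snippet above
    std::map<std::string, std::shared_ptr<std::string>> table_;
};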
    }
}
{
    WriteLockGuard writeLockGuard(rwLock_);
ditto
fixed.
ReadLockGuard readLockGuard(rwLock_);
std::thread th =
    std::thread(&MetaStoreImpl::SaveBackground, this, path, done);
WriteLockGuard writeLockGuard(rwLock_);
I think this lock doesn't help, even if you build the iterators within the lock, because the lock is released after this function returns. And the dump works without the lock, so the data that the iterators point to can be modified or deleted. For memory storage the possibility is relatively low because you forked, but for rocksdb, user requests and the dump work on the same database concurrently.
Yep, the rocksdb problem has been fixed with a snapshot, see #1228.
Now for memory storage we will hold the write lock until the child process is forked, which guarantees all the data is consistent.
If the metastore has stored lots of data, this will block other requests; anyway, we can leave it for later.
Actually, we only return iterators over all the data, except for the partition iterator, which is small enough, and we iterate all the data in the child process, so it will not lead to a performance problem.
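A minimal sketch of the fork-based save described in this thread; BuildIterators() and DumpAll() are hypothetical stand-ins for the metastore's real members. The write lock is held only until the child is forked, so the child's copy-on-write view is a consistent snapshot while the parent keeps serving requests.

#include <sys/wait.h>
#include <unistd.h>

bool SaveByFork() {
    pid_t pid;
    {
        WriteLockGuard writeLockGuard(rwLock_);
        BuildIterators();  // only the small iterators, e.g. for partitions
        pid = fork();
    }
    if (pid < 0) {
        return false;                 // fork failed
    } else if (pid == 0) {
        _exit(DumpAll() ? 0 : 1);     // child: iterate and dump all data
    }
    int status = 0;
    waitpid(pid, &status, 0);         // the background thread in the parent waits
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}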
recheck
curvefs/conf/metaserver.conf
Outdated
#
# storage settings
#
# metaserver storage data directory
storage.data_dir=/tmp/storage  # __CURVEADM_TEMPLATE__ ${prefix}/data __CURVEADM_TEMPLATE__
# metaserver storage type, "momory" or "rocksdb"
typo: memory?
fixed
Status MemoryStorage::HGet(const std::string& name,
                           const std::string& key,
                           ValueType* value) {
    if (options_.compression) {
if (options_.compression) {
    GET_SERALIZED(UnorderedSeralizedContainer, name, key, value);
    return;
}
GET(UnorderedContainer, name, key, value);
Add an explicit return here, it's more readable.
And I think those macros can be replaced by templates.
add an explicit return here, it's more readable.
Yep, I will improve it, but the return value of those functions is Status, not void, so we should return Status::OK().
and I think those macros can be replaced by template
Actually, I implemented it with templates at first, but it's not easy to deal with GetContainer(); the code became heavier, so I turned to macros. Maybe it's not an elegant implementation; I have marked a TODO and will improve it in the next commit.
class ValueWrapper {
 public:
    explicit ValueWrapper(const ValueType& value) {
        RETURN_IF_CONVERT_VALUE_SUCCESS(Dentry);
        RETURN_IF_CONVERT_VALUE_SUCCESS(Inode);
        RETURN_IF_CONVERT_VALUE_SUCCESS(S3ChunkInfoList);
    }

    std::shared_ptr<ValueType> Message() const {
        return value_;
    }

 private:
    std::shared_ptr<ValueType> value_;
};
The dynamic cast has runtime costs and the const_cast may cause undefined behavior. But currently you use ValueType as the interface, so it's not easy to change this. Maybe std::any (needs C++17) or absl::any is the solution.
Also, a unique_ptr is enough for this case, and Message() can just return the pointer.
The dynamic cast has runtime costs and the const_cast may cause undefined behavior. But currently you use ValueType as the interface, so it's not easy to change this. Maybe std::any (needs C++17) or absl::any is the solution.
Maybe replacing the type of the interface param value with absl::any is a good solution, but looking through the documentation of absl::any, it states that an absl::any can only be constructed by copying the value, which means all the data will be copied:
// An `absl::any` can only store a type that is copy-constructible; move-only
// types are not allowed within an `any` object.
Also, a unique_ptr is enough for this case, and Message() can just return the pointer.
Actually we need the assignment operator, so shared_ptr is our choice:
#define SET(TYPE, NAME, KEY, VALUE)                        \
    do {                                                   \
        auto container = GET_CONTAINER(TYPE, NAME);        \
        auto valueWrapper = ValueWrapper(VALUE);           \
        auto ret = container->emplace(KEY, valueWrapper);  \
        if (!ret.second) {                                 \
            ret.first->second = valueWrapper; /* <--- here */ \
        }                                                  \
        return Status::OK();                               \
    } while (0)
curvefs/conf/metaserver.conf
Outdated
#
# storage settings
#
# metaserver storage data directory
storage.data_dir=/tmp/storage  # __CURVEADM_TEMPLATE__ ${prefix}/data __CURVEADM_TEMPLATE__
# metaserver storage type, "momory" or "rocksdb"
typo “momory”
Currently rocksdb is just a cheaper replacement for the in-memory cache. "metaserver storage" may be misleading.
typo “momory”
It's already fixed.
Currently rocksdb is just a cheaper replacement for the in-memory cache. "metaserver storage" may be misleading.
fixed.
recheck