RDB$BLOB_UTIL system package. #281
Conversation
Wouldn't it be better for this package to be a little more compatible with Oracle in terms of names and usage?
(force-pushed from 857bfea to 9cff3dd)
Firebird 5? What's stopping you from adding this feature to Firebird 4.0?
## Function `NEW`

`RDB$BLOB_UTIL.NEW` is used to create a new BLOB. It returns a handle (an integer bound to the transaction) that should be used with the other functions of the package.
Do we really need such an artificial (for SQL) concept as a "handle" here? Every blob is represented by a blob ID, which actually is a handle. Passing a blob here and there inside PSQL (except assigning it to a table field) is just a matter of copying its ID; the contents are not touched. So tra_blob_util_map may just store blob IDs created/opened with the RDB$BLOB_UTIL package, and all package functions may declare inputs/outputs as just BLOB instead of an INTEGER handle. Do I miss anything?
A blob id is used in the client with a handle. A handle in this context is an id mapped to the blb class inside the engine. A blb has information like the current position. RDB$BLOB_UTIL handles model this concept in PSQL.
A blob id for this would be very confusing: many different variables would have the same id, so how could one have multiple parallel seek/read operations on the same blob id?
Also, a blob id is implicitly copied (depending on the blob charset) when it is passed as an argument.
Input parameters:
- `SEGMENTED` type `BOOLEAN NOT NULL`
- `TEMP_STORAGE` type `BOOLEAN NOT NULL`
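To illustrate, here is a minimal PSQL sketch of creating a handle with `NEW` as described in this PR; the names and signatures are still under discussion in this thread, so treat this as tentative:

```sql
execute block
as
    declare h integer;
begin
    -- Create a streamed (SEGMENTED = FALSE) blob in temporary
    -- storage (TEMP_STORAGE = TRUE); the result is an INTEGER
    -- handle bound to the current transaction.
    h = rdb$blob_util.new(false, true);
end
```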
Perhaps we should be prepared for the tablespaces feature, so that blobs could be created in the explicitly specified (by name) tablespace which can be either "permanent" or "temporary". This requires more thinking.
A new parameter with a default. Also, named arguments as in Oracle (`=>`) would be very interesting.
The problem is that `TEMP_STORAGE` may conflict with the tablespace, e.g. `TEMP_STORAGE = TRUE` but `TABLESPACE = MY_BLOB_SPACE`. This looks error-prone.
Do you know how the same problem is going to be resolved in regard to storage specified in BPB?
I think the lower-level options (the `TEMP_STORAGE` parameter or `isc_bpb_storage_temp`) should override the DDL-level default storage.
## Function `OPEN_BLOB`

`RDB$BLOB_UTIL.OPEN_BLOB` is used to open an existing BLOB for reading. It returns a handle that should be used with the other functions of the package.
Please just "OPEN", not "OPEN_BLOB". We already have just "NEW" and the package is named RDB$BLOB_UTIL ;-)
I would use it if it's not a reserved word.
OK, but let's be consistent then and rename `NEW` -> `NEW_BLOB`, to complement `OPEN_BLOB`, `MAKE_BLOB` and a possible `APPEND_BLOB` ;-)
Input parameters:
- `HANDLE` type `INTEGER NOT NULL`
- `DATA` type `VARBINARY(32767) NOT NULL`
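A hedged sketch of appending data through a handle, per the `HANDLE`/`DATA` parameters above (a tentative API, still under review in this PR):

```sql
execute block
as
    declare h integer;
begin
    h = rdb$blob_util.new(false, true);
    -- APPEND is a procedure in the current proposal, hence
    -- EXECUTE PROCEDURE; DATA is a VARBINARY(32767) chunk.
    execute procedure rdb$blob_util.append(h, 'first chunk');
    execute procedure rdb$blob_util.append(h, 'second chunk');
end
```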
I believe we need yet another routine -- something like APPEND_BLOB -- to concatenate a whole other blob if it's longer than 32KB. Here "BLOB" in the name again seems redundant ;-) so better naming ideas are welcome. Or we should find a way to make APPEND polymorphic in regard to its input.
Perhaps we could create a VARIANT type which could be used for system routine arguments -- and also as a general data type.
System functions can already work this way, but they do not have stored metadata.
While the `VARIANT` type might be an interesting idea per se, it requires some serious thinking and discussion, and it could be overkill for this particular need if we need to release v5 really soon. So I'd be happier with `APPEND_TEXT` (or `APPEND_STRING` if you wish) and a separate `APPEND_BLOB`.
Return type: `INTEGER NOT NULL`.

## Procedure `APPEND`
I would be more happy to see all routines defined as functions, even if they don't return anything useful, just because of this typing difference:

`execute procedure rdb$blob_util.append(...);`

vs

`rdb$blob_util.seek(...);`
What should we return? True or 0?
`Integer NULL`, maybe. On the other hand, we could add the SQL-standard `CALL` in addition to our legacy `EXECUTE PROCEDURE`, and the typing would be much easier ;-)
`CALL` would be good to fix `EXECUTE PROCEDURE`, but since our procedures are not identical to SQL ones, I will open a discussion in devel.
If `LENGTH` is passed with a positive number, it returns a VARBINARY with `LENGTH` as its maximum length.

If `LENGTH` is `NULL`, it returns just a segment of the BLOB with a maximum length of 32765.
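A sketch of the read loop implied by the description above. This assumes a `READ(HANDLE, LENGTH)` function that returns `NULL` at end of blob, and that `OPEN_BLOB` takes the source blob as its argument; neither signature is shown verbatim in this excerpt, so both are assumptions:

```sql
execute block returns (chunk varbinary(32765))
as
    declare b blob;   -- assumed to hold an existing blob
    declare h integer;
begin
    h = rdb$blob_util.open_blob(b);
    -- With a NULL LENGTH, each call returns one segment
    -- (up to 32765 bytes); NULL signals end of blob.
    chunk = rdb$blob_util.read(h, null);
    while (chunk is not null) do
    begin
        suspend;
        chunk = rdb$blob_util.read(h, null);
    end
end
```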
IIRC, blob segments may be up to 64KB in length. Will a longer-than-32KB segment be truncated to 32KB, with the next READ call returning the remaining part? Can there be any consequences if a segment is split into multiple parts? For example, one source segment would be written as two segments in the target blob. Blob filters may not be able to decode half-chunks properly (firstly it's about the built-in transliteration filters -- think about splitting in the middle of a multi-byte character -- although perhaps they never deal with chunks longer than 32KB).
We could increase the max VARCHAR length to 64KB - 2. But also, is there an impediment to having the max dsc_length of dtype_varying be 64KB?
As I understand it, there should not be many places just reading and incrementing dsc_length, so dtype_cstring / dtype_varying does not need to have the constant size added to dsc_length.
It would simplify various places that subtract and re-add that value.
Well, the segment split problem remains anyway if the `LENGTH` argument is not NULL (and less than the segment size). So perhaps we don't need to do anything special right now. Those who use filtered blobs should either prefer under-32KB segments or avoid using this package.
## Function `MAKE_BLOB`

`RDB$BLOB_UTIL.MAKE_BLOB` is used to create a BLOB from a BLOB handle created with `NEW`, followed by its content added with `APPEND`. After `MAKE_BLOB` is called, the handle is destroyed and should not be used with the other functions.
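Putting it together, a tentative end-to-end sketch of the lifecycle described above (`NEW` -> `APPEND` -> `MAKE_BLOB`), per this PR's draft API; names may still change per the review comments:

```sql
execute block returns (b blob sub_type text)
as
    declare h integer;
begin
    h = rdb$blob_util.new(false, true);
    execute procedure rdb$blob_util.append(h, 'Hello, ');
    execute procedure rdb$blob_util.append(h, 'world!');
    -- MAKE_BLOB returns the finished blob and invalidates the
    -- handle; using it afterwards is an error.
    b = rdb$blob_util.make_blob(h);
    suspend;
end
```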
With parameters being BLOB rather than INTEGER handle, I'd just call it "CLOSE". And maybe think about "auto-close" scenarios in some cases.
src/jrd/BlobUtil.cpp (outdated):

```cpp
	IExternalContext* context, const AppendInput::Type* in, void*)
{
	const auto tdbb = JRD_get_thread_data();
	Attachment::SyncGuard guard(tdbb->getAttachment(), FB_FUNCTION);
```
Unrelated to this PR: could it make sense (from the performance POV) to avoid EngineCheckout for system packages?
It could be avoided for simple cases. Do you mean something different?
> It could be avoided for simple cases. Do you mean something different?

I was talking about avoiding SyncGuard for simple cases.
But yes, we'd better avoid EngineCheckout for system packages in general.
src/jrd/BlobUtil.cpp (outdated):

```cpp
if (in->data.length > 0)
	blob->BLB_put_data(tdbb, (const UCHAR*) in->data.str, in->data.length);
else if (in->data.length == 0 && !(blob->blb_flags & BLB_stream))
	blob->BLB_put_segment(tdbb, (const UCHAR*) in->data.str, 0);
```
While zero-length segments are allowed by the engine, does it make sense to copy them?
I see you do not copy them when replicating, but if a user can explicitly create them and gbak preserves them, I think they should be preserved.
OK, I don't mind.
I've added some comments in the hope of closing the remaining questions. Naming consistency (whether the "_BLOB" suffix in method names should be mandatory) also needs some agreement.
This package needs some adjustments after
Please note that some functions have … Maybe it makes sense to add …
I do not agree with these changes. Why try to cast the return result to the RDB$BLOB_APPEND variant? Let this package work with handles and in a different way than RDB$BLOB_APPEND.
Cast … So, with the exception of …
I don't mind deleting append, but in this case it would be necessary to provide a blob_write procedure for new blobs. Let this package be a kind of analogue of the blob API. In this case, you will not be able to abandon make_blob. In addition, I am against changing the types of open. Otherwise, it will work, but here the fish was wrapped. And by the way, it would be nice to have a blob_info procedure, similar to the API, but returning all the information at once, such as the blob type, number of segments, length, and so on.
Why are you insisting on a write function if APPEND_BLOB can do it?
That would make sense to me.
How about the BLOB_INFO procedure, which returns information about a BLOB:
What is the use case to get this info in PSQL? There is already

> What is the use case to get this info in PSQL?

This may be good.
This may be required for your own BLOB_UTILS package. BLOB navigation with SEEK is only possible for stream BLOBs. As for length, I meant the information that can be obtained through isc_blob_info. Although, considering that it still doesn't work for long BLOBs > 2GB, it's probably not necessary. In fact, I would like only this:
But also useless IMHO. Implementation of SEEK for segmented BLOBs is not that complicated.
src/jrd/names.h (outdated):

```cpp
NAME("RDB$BLOB_UTIL_HANDLE", nam_butil_handle)
NAME("RDB$BLOB", nam_blob)
NAME("RDB$VARBINARY_MAX", nam_varbinary_max)
NAME("RDB$LONG_NUMBER", nam_long_number)
```
Maybe name it RDB$HANDLE instead? To me, "long number" suggests something like "longer than usual", e.g. INT64, and "NUMBER" does not necessarily mean INTEGER in the SQL world.
Well, now I see that `RDB$BLOB_UTIL_HANDLE` is not used at all and `RDB$LONG_NUMBER` acts as both a handle and a mode/offset/length too. Is this correct?
Correct for `RDB$BLOB_UTIL_HANDLE`.

For `RDB$LONG_NUMBER` I want to create a generic domain, as I think it does not make sense to create one domain for `OFFSET` and another for `LENGTH`. For `MODE` it would make sense, but I'm just reusing it there too. I have now renamed it to `RDB$INTEGER`.