
RDB$BLOB_UTIL system package. #281

Merged
merged 15 commits into from
Dec 16, 2022

Conversation

asfernandes
Member

No description provided.

@aafemt
Contributor

aafemt commented Aug 23, 2020

Wouldn't it be better for this package to be a little more compatible with Oracle in terms of names and usage...?


@sim1984 sim1984 left a comment


Firebird 5? What's stopping you from adding this feature to Firebird 4.0?

## Function `NEW`

`RDB$BLOB_UTIL.NEW` is used to create a new BLOB. It returns a handle (an integer bound to the transaction) that should be used with the other functions of the package.
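For illustration, here is a hypothetical PSQL sketch of obtaining a handle. It follows the draft signature `NEW(SEGMENTED, TEMP_STORAGE)` discussed in this PR, which may differ from what was finally shipped:

```sql
-- Hypothetical sketch; signatures follow the draft in this PR.
execute block returns (h integer)
as
begin
  -- Create a segmented blob in temporary storage; the returned
  -- handle is an integer bound to the current transaction.
  h = rdb$blob_util.new(true, true);
  suspend;
end
```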

Member

Do we really need such an artificial (for SQL) concept as a "handle" here? Every blob is represented by a blob ID, which actually is a handle. Passing a blob here and there inside PSQL (except assigning it to a table field) is just a matter of copying its ID; the contents are not touched. So tra_blob_util_map may just store blob IDs created/opened with the RDB$BLOB_UTIL package, and all package functions may declare inputs/outputs as just BLOB instead of an INTEGER handle. Do I miss anything?

Member Author

A blob id is used in the client with a handle. A handle in this context is an id plus the blb class inside the engine. A blb has information like the current position. RDB$BLOB_UTIL handles model this concept in PSQL.

A blob id for this would be very confusing. Many different variables would have the same id, so how could one have multiple parallel seeks/reads on the same blob id?

Also, a blob id is implicitly copied, depending on the blob charset, when it is passed as an argument.


Input parameter:
- `SEGMENTED` type `BOOLEAN NOT NULL`
- `TEMP_STORAGE` type `BOOLEAN NOT NULL`
Member

Perhaps we should be prepared for the tablespaces feature, so that blobs could be created in the explicitly specified (by name) tablespace which can be either "permanent" or "temporary". This requires more thinking.

Member Author

A new parameter with a default. Also, named arguments as in Oracle would be very interesting.

Member

The problem is that TEMP_STORAGE may conflict with the tablespace, e.g. TEMP_STORAGE = TRUE but TABLESPACE = MY_BLOB_SPACE. This looks error-prone.

Member Author

Do you know how the same problem is going to be resolved in regard to storage specified in BPB?

Member

I think the lower-level options (TEMP_STORAGE parameter or isc_bpb_storage_temp) should override the DDL-level default storage.


## Function `OPEN_BLOB`

`RDB$BLOB_UTIL.OPEN_BLOB` is used to open an existing BLOB for reading. It returns a handle that should be used with the other functions of the package.
Member

Please just "OPEN", not "OPEN_BLOB". We already have just "NEW" and the package is named RDB$BLOB_UTIL ;-)

Member Author

@asfernandes asfernandes May 15, 2021

I would use it if it's not a reserved word.

Member

OK, but let's be consistent then and rename NEW -> NEW_BLOB, to complement OPEN_BLOB, MAKE_BLOB and possible APPEND_BLOB ;-)

Input parameters:
- `HANDLE` type `INTEGER NOT NULL`
- `DATA` type `VARBINARY(32767) NOT NULL`

Member

I believe we need yet another routine -- something like APPEND_BLOB -- to concatenate a whole other blob if it's longer than 32KB. Here "BLOB" in the name again seems redundant ;-) so better naming ideas are welcome. Or we should find a way to make APPEND polymorphic in regard to its input.

Member Author

@asfernandes asfernandes May 15, 2021

Perhaps we could create a VARIANT type which could be used for system routine arguments - and also as a general data type.

System functions already can work in this way, but they do not have stored metadata.

Member

While the VARIANT type might be an interesting idea per se, it requires some serious thinking and discussions and it could be an overkill for this particular need if we need to release v5 really soon. So I'd be more happy with APPEND_TEXT (or APPEND_STRING if you wish) and separate APPEND_BLOB.


Return type: `INTEGER NOT NULL`.

## Procedure `APPEND`
Member

I would be more happy to see all routines defined as functions even if they don't return anything useful, just because of this typing difference:
execute procedure rdb$blob_util.append(...);
vs
rdb$blob_util.seek(...);

Member Author

@asfernandes asfernandes May 15, 2021

What should we return? True or 0?

Member

Integer NULL, maybe. On the other hand, we could add the SQL-standard CALL in addition to our legacy EXECUTE PROCEDURE, and the typing would be much easier ;-)

Member Author

CALL would be good to fix EXECUTE PROCEDURE, but since our procedures are not identical to SQL ones, I will open a discussion in devel.

If `LENGTH` is passed with a positive number, it returns a VARBINARY with that number as its maximum length.

If `LENGTH` is `NULL`, it returns just a segment of the BLOB, with a maximum length of 32765.
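Under the semantics above, a read loop might look like the following hypothetical PSQL sketch. The `READ(HANDLE, LENGTH)` signature follows this draft, and the assumption that `READ` returns `NULL` once the blob is exhausted is mine; it is not stated here:

```sql
-- Hypothetical sketch; the NULL-at-end behaviour of READ is an assumption.
execute block (b blob sub_type binary = ?) returns (total integer)
as
  declare h integer;
  declare chunk varbinary(100);
begin
  total = 0;
  h = rdb$blob_util.open_blob(:b);
  while (true) do
  begin
    chunk = rdb$blob_util.read(h, 100);  -- read up to 100 bytes per call
    if (chunk is null) then leave;       -- assumed end-of-blob signal
    total = total + octet_length(chunk);
  end
  suspend;
end
```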

Member

IIRC, blob segments may be up to 64KB in length. Will a longer-than-32KB segment be truncated to 32KB, with the next READ call returning the remaining part? Can there be any consequences if a segment is split into multiple parts? For example, one source segment will be written as two segments in the target blob. Blob filters may not be able to decode half-chunks properly (firstly it's about built-in transliteration filters -- think about splitting in the middle of a multi-byte character -- although perhaps they never deal with chunks longer than 32KB).

Member Author

We could increase max VARCHAR to 64KB - 2. But also, is there an impediment to having the max dsc_length of dtype_varying be 64KB?

As I understand it, there should not be many places just reading and incrementing dsc_length, so dtype_cstring / dtype_varying does not need to have the constant size added to dsc_length.

It would simplify various places that subtract and re-add that value.

Member

Well, the segment split problem remains anyway, if the LENGTH argument is not NULL (and less than the segment size). So perhaps we don't need to do anything special right now. Those who use filtered blobs should either prefer under-32KB segments or avoid using this package.

## Function `MAKE_BLOB`

`RDB$BLOB_UTIL.MAKE_BLOB` is used to create a BLOB from a BLOB handle created with `NEW`, after its content has been added with `APPEND`. After `MAKE_BLOB` is called, the handle is destroyed and should not be used with the other functions.
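Putting the pieces together, the NEW / APPEND / MAKE_BLOB lifecycle described here could be sketched as follows (hypothetical PSQL, based on the draft signatures in this PR; the final package shipped with different names):

```sql
-- Hypothetical sketch of the draft handle-based write path.
execute block returns (b blob sub_type binary)
as
  declare h integer;
begin
  h = rdb$blob_util.new(false, true);     -- stream blob, temporary storage
  execute procedure rdb$blob_util.append(h, x'DEADBEEF');
  b = rdb$blob_util.make_blob(h);         -- the handle is destroyed here
  suspend;
end
```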

Member

With parameters being BLOB rather than INTEGER handle, I'd just call it "CLOSE". And maybe think about "auto-close" scenarios in some cases.

IExternalContext* context, const AppendInput::Type* in, void*)
{
const auto tdbb = JRD_get_thread_data();
Attachment::SyncGuard guard(tdbb->getAttachment(), FB_FUNCTION);
Member

Unrelated to this PR: could it make sense (from the performance POV) to avoid EngineCheckout for system packages?

Member Author

It could be avoided for simple cases. Do you mean something different?

Member Author

It could be avoided for simple cases. Do you mean something different?

I was talking about avoiding SyncGuard for simple cases.

But yes, we'd better avoid EngineCheckout for system packages in general.

if (in->data.length > 0)
blob->BLB_put_data(tdbb, (const UCHAR*) in->data.str, in->data.length);
else if (in->data.length == 0 && !(blob->blb_flags & BLB_stream))
blob->BLB_put_segment(tdbb, (const UCHAR*) in->data.str, 0);
Member

While zero-length segments are allowed by the engine, does it make sense to copy them?

Member Author

I see you do not copy them when replicating, but if the user can explicitly do it and gbak preserves it, I think it should be preserved.

Member

OK, I don't mind.

@dyemanov
Member

I've added some comments with a hope to close the remaining questions. Naming consistency (whether "_BLOB" suffix in method names should be mandatory) also needs some agreement.

@asfernandes
Member Author

This package needs some adjustments after APPEND_BLOB, so that the same thing is not implemented in different ways:

  • Currently RDB$BLOB_UTIL.NEW returns a handle, which requires the use of RDB$BLOB_UTIL.APPEND and RDB$BLOB_UTIL.MAKE_BLOB.

RDB$BLOB_UTIL.NEW should be replaced by RDB$BLOB_UTIL.NEW_BLOB, which creates (only creates, does not append) a blob in a very similar way to what APPEND_BLOB does, but with the option to create it in temporary space and segmented/stream, as RDB$BLOB_UTIL.NEW was allowing.

  • RDB$BLOB_UTIL.APPEND and RDB$BLOB_UTIL.MAKE_BLOB should be removed, letting APPEND_BLOB and BLB_close_on_read do this work.

  • RDB$BLOB_UTIL.CANCEL should be adapted in relation to the changed RDB$BLOB_UTIL.NEW. It should be split into two functions: RDB$BLOB_UTIL.CLOSE (for opened handles) and RDB$BLOB_UTIL.CANCEL_BLOB (for BLB_close_on_read).

Please note that some functions have _BLOB suffixes and some do not. This is because some functions operate on blobs and some on handles.

Maybe it makes sense to add _HANDLE suffixes too. That would also help deal with reserved words; for example, CLOSE is a reserved word.

@sim1984

sim1984 commented Oct 21, 2022

I do not agree with these changes. Why try to cast the return result to the RDB$BLOB_APPEND variant? Let this package work with handles and in a different way than RDB$BLOB_APPEND.

@asfernandes
Member Author

I do not agree with these changes. Why try to cast the return result to the RDB$BLOB_APPEND variant? Let this package work with handles and in a different way than RDB$BLOB_APPEND.

Cast RDB$BLOB_APPEND? I said to remove RDB$BLOB_UTIL.APPEND. This job can be done by APPEND_BLOB, whether we like it or not.

So, with the exception of RDB$BLOB_UTIL.NEW_BLOB (which fulfills a job not done by APPEND_BLOB), the package will be mostly for reading.

@sim1984

sim1984 commented Oct 22, 2022

I don't mind deleting APPEND, but in this case it would be necessary to provide a BLOB_WRITE procedure for new blobs. Let this package be a kind of analogue of the blob API. In this case, you will not be able to abandon MAKE_BLOB. In addition, I am against changing the types of OPEN. Otherwise it will work, but here the fish was wrapped. And by the way, it would be nice to have a BLOB_INFO procedure, similar to the API, but returning all the information at once, such as the blob type, number of segments, length, and so on.

@asfernandes
Member Author

Why are you insisting on a write function if APPEND_BLOB can do it?

@cincuranet
Member

Maybe it makes sense to add _HANDLE suffixes too. That would also help deal with reserved words; for example, CLOSE is a reserved word.

That would make sense to me.

@asfernandes asfernandes requested a review from dyemanov December 12, 2022 01:18
@sim1984

sim1984 commented Dec 12, 2022

How about a BLOB_INFO procedure, which returns information about a BLOB:

  • blob type (segmented, streamed)
  • blob length
  • number of segments (if any)
  • blob placement (temporary, permanent)

@asfernandes
Member Author

How about the BLOB_INFO procedure, which returns information about a BLOB:

  • blob type (segmented, streamed)

What is the use case to get this info in PSQL?

  • blob length

There is already CHAR_LENGTH and OCTET_LENGTH.

  • number of segments (if any);

What is the use case to get this info in PSQL?

  • blob placement (temporary, permanent).

This may be good.

@sim1984

sim1984 commented Dec 12, 2022

What is the use case to get this info in PSQL?

This may be required for your own BLOB_UTIL package. BLOB navigation with SEEK is only possible for stream BLOBs.

About length, I meant the information that can be obtained through isc_blob_info. Although considering that it still doesn't work for long BLOBs > 2GB, it's probably not necessary.

In fact, I would like only this:

  • blob type (segmented, streamed)
  • blob placement (temporary, permanent).

@aafemt
Contributor

aafemt commented Dec 12, 2022

This may be good.

But also useless IMHO.

Implementation of SEEK for segmented BLOBs is not that complicated.

src/jrd/names.h
NAME("RDB$BLOB_UTIL_HANDLE", nam_butil_handle)
NAME("RDB$BLOB", nam_blob)
NAME("RDB$VARBINARY_MAX", nam_varbinary_max)
NAME("RDB$LONG_NUMBER", nam_long_number)
Member

Maybe name it RDB$HANDLE instead? For me, "long number" suggests something like "longer than usual", e.g. INT64, and also "NUMBER" does not necessarily mean INTEGER in the SQL world.

Member

Well, now I see that RDB$BLOB_UTIL_HANDLE is not used at all and RDB$LONG_NUMBER acts as both a handle and a mode/offset/length too. Is this correct?

Member Author

Correct for RDB$BLOB_UTIL_HANDLE.

For RDB$LONG_NUMBER, I want a generic domain, as I think it does not make sense to create one domain for OFFSET and another for LENGTH. For MODE a dedicated domain would make sense, but I'm just reusing it there too. I have now renamed it to RDB$INTEGER.
