Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement file embedding #414

Merged
merged 11 commits into from
Dec 16, 2023
Merged

Implement file embedding #414

merged 11 commits into from
Dec 16, 2023

Conversation

RhoSigma-QB64
Copy link
Member

@RhoSigma-QB64 RhoSigma-QB64 commented Dec 13, 2023

$EMBED:'filename','handle' and _EMBEDDED$("handle")
Closing #383

$EMBED:'filename','handle' and _EMBEDDED$("handle")
@RhoSigma-QB64 RhoSigma-QB64 added the enhancement New feature or request label Dec 13, 2023
@RhoSigma-QB64 RhoSigma-QB64 self-assigned this Dec 13, 2023
@RhoSigma-QB64 RhoSigma-QB64 linked an issue Dec 13, 2023 that may be closed by this pull request
try again...
forget AddQuotes$(), rather make file.bas self-contained using CHR$(34)
as suggested by Matt...
ideally these should have been part of the 1st commit
Copy link
Contributor

@mkilgore mkilgore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along with my comments, I'd be interested to see at least some basic tests of $EMBED and _EMBEDDED$() added, it should be pretty simple I think (embed a file, read it, and verify the contents are correct). Otherwise this may not get tested on Linux and Mac OS before the next release.

@@ -3548,6 +3635,60 @@ DO

END IF 'QB64 Metacommands

'Check that Embed-Handle is a literal string and follows conventions,
'also check if such a handle was defined and mark it as used then.
IF INSTR(a3u$, qb64prefix$ + "EMBEDDED$") > 0 THEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in understanding that this code is only necessary so that we can exclude unnecessary $EMBEDs?

I'm personally tempted to say we just shouldn't do that, it seems like the string literal requirement will introduce a lot of confusion for little benefit. This would be the only function that works that way, and func__embedded() still takes a string argument at runtime anyway so it's not really necessary to do this.

NEXT vc&
'--- process remaining BYTEs ---
IF cntB& > 0 THEN
PRINT #dff%, "static const uint8_t "; handle$; "B[] = {"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we don't just output the whole file this way? It seems a lot simpler overall since you'd only have one array to deal with. Once it's compiled it doesn't really matter if it was originally one array or multiple.

Going along with that, if we only had one array then you could use qbs_new_txt() with the single array and avoid needing to allocating a new buffer to copy the data into. In the non-compressed case it would mean the data could \be used as-is with no copying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mkilgore. We can just create a byte array and have the whole file inside it. We can use sizeof to get the size of the array. Whether the data is compressed or not can be simply stored in a bool inside the function that we are generating. Uncompressed data can be directly used without the need to do two memcpys.

qbs *GetArrayData_peLogo(qbs *handle)
{
    if (!qbs_equal(handle, qbs_new_txt("peLogo"))) {return qbs_new_txt("");}

    qbs  *data = qbs_new(17394, 1);
    void *buff = data -> chr;

    memcpy(buff, &peLogoL0[1], peLogoL0[0] << 2);
    buff += (peLogoL0[0] << 2);
    memcpy(buff, &peLogoB[1], peLogoB[0]);

    return func__inflate(data, 17438, 1);
}

END IF
PRINT #dff%, ""
IF packed% THEN
PRINT #dff%, " return func__inflate(data, "; LTRIM$(STR$(LEN(filedata$))); ", 1);"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to qbs_free() the data string here, since inflate creates a new one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func__inflate() returns a tmp qbs. Do we still need to qbs_free() that?

qbs *ret = qbs_new(uncompsize, 1);

I am asking this because I am working on something and I see a lot of libqb stuff creating strings this way and returning it.

The only time, it is not done this way is when the qbs is static inside the function and allocated only once during the lifetime of the program.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I'll be honest, I don't know for sure, there's some spooky stuff that goes on with the string allocation and cleanup.

My thinking was that the data string that we pass to func__inflate() is created earlier in this function using qbs_new(), but that's not the string that gets returned - func__inflate() calls qbs_new() itself and returns a new string. The data string is never used again after the func__inflate() call and seems like it's a leak 🤷

FOR vc& = 0 TO cntV&
IF vc& = cntV& THEN numL& = (cntL& MOD 8180): ELSE numL& = 8180
PRINT #dff%, "static const uint32_t "; handle$; "L"; LTRIM$(STR$(vc&)); "[] = {"
PRINT #dff%, " "; LTRIM$(STR$(numL& * 8)); ","
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the code would be easier to read if you stored the lengths in separate variables rather than in the arrays themselves. But also, you can use sizeof to get the size of the array (in bytes) when defined this way, you don't actually need to store the length.

PRINT #dff%, " if (!qbs_equal(handle, qbs_new_txt("; AddQuotes$(handle$); "))) {return qbs_new_txt("; MKI$(&H2222); ");}"
PRINT #dff%, ""
PRINT #dff%, " qbs *data = qbs_new("; LTRIM$(STR$(fl&)); ", 1);"
PRINT #dff%, " void *buff = data -> chr;"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing, but this should really be a char *. gcc lets you do pointer math on void * for some reason but it's not ideal.

CLOSE #sff%
'--- try to compress ---
compdata$ = _DEFLATE$(filedata$)
IF LEN(compdata$) < LEN(filedata$) THEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to let the caller pick if compression is done - going along with some of my other comments, if the compression is skipped then you could get away with no copying of the data. That could be something that the developer explicitly wants to avoid the runtime and memory cost (though we probably want to discourage embedding files so large that it matters anyway...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the data is stored uncompressed, is there some way we could get a pointer and size to wherever it's embedded in memory? Then one could point a mem block to it with M = _MEM(pointer, size), and access that embedded data directly, without ever having to assign it to another actual variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That goes to my other comment about the multiple C arrays vs. a single one. If you store the entire thing in a single array uncompressed then you can create a STRING out of it using qbs_new_txt() and that does no copying. In that situation _EMBEDDED$() would return a STRING which is directly backed by the original data, and only if you modify the string will the data need to be copied.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to let the caller pick if compression is done - going along with some of my other comments, if the compression is skipped then you could get away with no copying of the data. That could be something that the developer explicitly wants to avoid the runtime and memory cost (though we probably want to discourage embedding files so large that it matters anyway...)

I agree. We should let the caller decide. From my observations even an MP3 file would compress if you pass it through _DEFLATE but only by a few bytes. In those cases you end up decompressing the data twice which can contribute a significant performance hit.

OPEN "A", #eflFF, tmpdir$ + "embedded.cpp"
'append the internal retrieval function for _EMBEDDED$()
PRINT #eflFF, "qbs *func__embedded(qbs *handle) {"
PRINT #eflFF, " qbs *result = qbs_new_txt("; MKI$(&H2222); ");"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just call this on the return line, otherwise I'm pretty sure you need to qbs_free() it when it doesn't get used.

PRINT #eflFF, ""
FOR i = 0 TO eflUB
IF embedFileList$(eflFile, i) <> "" AND VAL(embedFileList$(eflUsed, i)) > 0 THEN
PRINT #eflFF, " result = GetArrayData_"; embedFileList$(eflHand, i); "(handle);"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth producing a hash table to look-up the correct function rather than the loop.

@@ -3413,6 +3418,88 @@ DO
RETURN
END IF

IF LEFT$(a3u$, 6) = "$EMBED" THEN
'redim/clear the $EMBED array on each new pass
IF linenumber <= eflLastLine THEN eflLastLine = 0: REDIM SHARED embedFileList$(3, 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should clear this out earlier when all the other stuff gets cleared before compilation (Ex. maybe around line 1646, that's where I clear the midi stuff).

The user might build a program that uses $EMBED and then build one that does not (or remove the line). When that happens, the embedFileList$ won't get reset on the next compile since there will be no $EMBED lines to trigger this.

eflUB = UBOUND(embedFileList$, 2)
FOR i = 0 TO eflUB
IF embedFileList$(eflFile, i) = EmbedFile$ THEN
a$ = "File '" + EmbedFile$ + "' was already embedded in line"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'm not sure we should ban this, I think you could reasonably want to use the same file for several embeds (Ex. You reuse the same image for different handles, so that you can easily change them separately in the future). I'd rather just allow this now and then in the future we could optimize it so that we only embed one copy of each file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't you just need one embed, even in that case?

image1 = _LOADIMAGE(_Embedded(foo$), 32, "memory")
image2 = _LOADIMAGE(_Embedded(foo$), 32, "memory")
image3 = _LOADIMAGE(_Embedded(foo$), 32, "memory")

The embed is just the image file itself in memory, and just like if it was on the drive, you'd still need to _LoadImage it before it'd become a valid, working image handle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is not a "can you" but rather "should you". If you have multiple embeds that serve completely different purposes then you don't want them to share a handle regardless of whether they use the same file - in the future you may change the file for one of the handles but not both, and you wouldn't want to have to then figure out which spots are supposed to use which handle.

IMO It's the same as saying "two CONSTs shouldn't have the same value, if they do then just use the same CONST in both places". It's technically true, you don't need more than one CONST with a particular value, but it misses the point that if one CONST is named test_file and another is named logo they might temporarily (or always) have the same value but they don't serve the same purpose and combining them wouldn't be correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's like having two CONSTs with the same value. I think it's more like putting the same font file inside your folder and then naming it cour(2).ttf, cout(3).ttf, cour(4).ttf.... At the end of the day, does it matter which of those files you use to load into a font handle? They're all the exact same font, and unless you have some plan to overwrite those files, there's absolutely no reason to have more than that one.

I honestly don't think there's any way to overwrite (and save) embedded data inside a program, with that program itself. Windows doesn't allow that type of virus-like behavior at all.

I'd think a better way to think of it would be like with DATA statements. Would you ever include the same 1000 lines of data, just to place them under a separate label? Or would you just make multiple labels for the same data?

label_1:
label_2:
label_3:

DATA "dog", "frog", "fish", "wish", "chicken"
...999 more lines of data

Then, if you really need to, you can always RESTORE label_1, RESTORE label_2, or RESTORE label_3, but you still wouldn't need to have that same 1000 lines repeated again and again in the same program.

At the most, I think allowing multiple labels for the same embed might not be so terrible. $EMBED:'foo.ext', 'handle1', 'handle2', 'handle3' But the idea of including the same file multiple times, just to have it listed under a separate label to read back from, seems like a bloat of exe size and a waste of an user's memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember, the handle is only in use really with _EMBEDDED, which will decompress the data and then assign it to wherever you want. For what you're talking about, all you'd need is:

a$ = _EMBEDDED$(foo$)
b$ = _EMBEDDED$(foo$)

Use a$ all you want in one spot. Use b$ all you want in a completely different spot. foo$ itself is embedded and wouldn't ever really change on you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're actually in agreement :P On the backend I agree, we should only keep one copy of the file embedded and serve it up for all the handles that use that file, but that's also an optimization we could do later-on if Rho doesn't want to do it here.

If we error on using the same file though, as it currently works, then introducing that optimization won't help people who already duplicated their files to get around the error.

Copy link
Contributor

@a740g a740g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to address the performance / memory costs.
Allow the user the choice of compressing the data or not.

CLOSE #sff%
'--- try to compress ---
compdata$ = _DEFLATE$(filedata$)
IF LEN(compdata$) < LEN(filedata$) THEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to let the caller pick if compression is done - going along with some of my other comments, if the compression is skipped then you could get away with no copying of the data. That could be something that the developer explicitly wants to avoid the runtime and memory cost (though we probably want to discourage embedding files so large that it matters anyway...)

I agree. We should let the caller decide. From my observations even an MP3 file would compress if you pass it through _DEFLATE but only by a few bytes. In those cases you end up decompressing the data twice which can contribute a significant performance hit.

NEXT vc&
'--- process remaining BYTEs ---
IF cntB& > 0 THEN
PRINT #dff%, "static const uint8_t "; handle$; "B[] = {"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mkilgore. We can just create a byte array and have the whole file inside it. We can use sizeof to get the size of the array. Whether the data is compressed or not can be simply stored in a bool inside the function that we are generating. Uncompressed data can be directly used without the need to do two memcpys.

qbs *GetArrayData_peLogo(qbs *handle)
{
    if (!qbs_equal(handle, qbs_new_txt("peLogo"))) {return qbs_new_txt("");}

    qbs  *data = qbs_new(17394, 1);
    void *buff = data -> chr;

    memcpy(buff, &peLogoL0[1], peLogoL0[0] << 2);
    buff += (peLogoL0[0] << 2);
    memcpy(buff, &peLogoB[1], peLogoB[0]);

    return func__inflate(data, 17438, 1);
}

- As suggested by @mkilgore , moved the embed list array reset out of the $EMBED block
- Imposed a 20% least ratio for compression
- Moved the handle comparison into `func__embedded()` to avoid some unnecessary function calls
simple test case, which just embeds the `test.output` file and prints it
@RhoSigma-QB64 RhoSigma-QB64 merged commit 7bb6fb3 into main Dec 16, 2023
@RhoSigma-QB64 RhoSigma-QB64 deleted the file-embedding branch December 16, 2023 08:31
@RhoSigma-QB64 RhoSigma-QB64 restored the file-embedding branch December 16, 2023 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

Successfully merging this pull request may close these issues.

$Embed and _Embed
4 participants