Implement file embedding #414

RhoSigma-QB64 · 2023-12-13T19:53:25Z

$EMBED:'filename','handle' and _EMBEDDED$("handle")
Closing #383

$EMBED:'filename','handle' and _EMBEDDED$("handle")

try again...

forget AddQuotes$(), rather make file.bas self-contained using CHR$(34)

as suggested by Matt...

ideally these should have been part of the 1st commit

mkilgore

Along with my comments, I'd be interested to see at least some basic tests of $EMBED and _EMBEDDED$() added, it should be pretty simple I think (embed a file, read it, and verify the contents are correct). Otherwise this may not get tested on Linux and Mac OS before the next release.

mkilgore · 2023-12-13T20:32:45Z

source/qb64pe.bas

@@ -3548,6 +3635,60 @@ DO

    END IF 'QB64 Metacommands

+    'Check that Embed-Handle is a literal string and follows conventions,
+    'also check if such a handle was defined and mark it as used then.
+    IF INSTR(a3u$, qb64prefix$ + "EMBEDDED$") > 0 THEN


Am I right in understanding that this code is only necessary so that we can exclude unnecessary $EMBEDs?

I'm personally tempted to say we just shouldn't do that, it seems like the string literal requirement will introduce a lot of confusion for little benefit. This would be the only function that works that way, and func__embedded() still takes a string argument at runtime anyway so it's not really necessary to do this.

mkilgore · 2023-12-13T20:34:37Z

source/utilities/file.bas

+    NEXT vc&
+    '--- process remaining BYTEs ---
+    IF cntB& > 0 THEN
+        PRINT #dff%, "static const uint8_t "; handle$; "B[] = {"


Is there a reason we don't just output the whole file this way? It seems a lot simpler overall since you'd only have one array to deal with. Once it's compiled it doesn't really matter if it was originally one array or multiple.

Going along with that, if we only had one array then you could use qbs_new_txt() with the single array and avoid needing to allocating a new buffer to copy the data into. In the non-compressed case it would mean the data could \be used as-is with no copying.

I agree with @mkilgore. We can just create a byte array and have the whole file inside it. We can use sizeof to get the size of the array. Whether the data is compressed or not can be simply stored in a bool inside the function that we are generating. Uncompressed data can be directly used without the need to do two memcpys.

qbs *GetArrayData_peLogo(qbs *handle) { if (!qbs_equal(handle, qbs_new_txt("peLogo"))) {return qbs_new_txt("");} qbs *data = qbs_new(17394, 1); void *buff = data -> chr; memcpy(buff, &peLogoL0[1], peLogoL0[0] << 2); buff += (peLogoL0[0] << 2); memcpy(buff, &peLogoB[1], peLogoB[0]); return func__inflate(data, 17438, 1); }

mkilgore · 2023-12-13T20:37:58Z

source/utilities/file.bas

+    END IF
+    PRINT #dff%, ""
+    IF packed% THEN
+        PRINT #dff%, "    return func__inflate(data, "; LTRIM$(STR$(LEN(filedata$))); ", 1);"


I think we need to qbs_free() the data string here, since inflate creates a new one.

func__inflate() returns a tmp qbs. Do we still need to qbs_free() that?

qbs *ret = qbs_new(uncompsize, 1);

I am asking this because I am working on something and I see a lot of libqb stuff creating strings this way and returning it.

The only time, it is not done this way is when the qbs is static inside the function and allocated only once during the lifetime of the program.

Well I'll be honest, I don't know for sure, there's some spooky stuff that goes on with the string allocation and cleanup.

My thinking was that the data string that we pass to func__inflate() is created earlier in this function using qbs_new(), but that's not the string that gets returned - func__inflate() calls qbs_new() itself and returns a new string. The data string is never used again after the func__inflate() call and seems like it's a leak 🤷

mkilgore · 2023-12-13T20:42:01Z

source/utilities/file.bas

+    FOR vc& = 0 TO cntV&
+        IF vc& = cntV& THEN numL& = (cntL& MOD 8180): ELSE numL& = 8180
+        PRINT #dff%, "static const uint32_t "; handle$; "L"; LTRIM$(STR$(vc&)); "[] = {"
+        PRINT #dff%, "    "; LTRIM$(STR$(numL& * 8)); ","


IMO the code would be easier to read if you stored the lengths in separate variables rather than in the arrays themselves. But also, you can use sizeof to get the size of the array (in bytes) when defined this way, you don't actually need to store the length.

mkilgore · 2023-12-13T20:43:59Z

source/utilities/file.bas

+    PRINT #dff%, "    if (!qbs_equal(handle, qbs_new_txt("; AddQuotes$(handle$); "))) {return qbs_new_txt("; MKI$(&H2222); ");}"
+    PRINT #dff%, ""
+    PRINT #dff%, "    qbs  *data = qbs_new("; LTRIM$(STR$(fl&)); ", 1);"
+    PRINT #dff%, "    void *buff = data -> chr;"


Small thing, but this should really be a char *. gcc lets you do pointer math on void * for some reason but it's not ideal.

mkilgore · 2023-12-13T20:49:50Z

source/utilities/file.bas

+    CLOSE #sff%
+    '--- try to compress ---
+    compdata$ = _DEFLATE$(filedata$)
+    IF LEN(compdata$) < LEN(filedata$) THEN


It might be nice to let the caller pick if compression is done - going along with some of my other comments, if the compression is skipped then you could get away with no copying of the data. That could be something that the developer explicitly wants to avoid the runtime and memory cost (though we probably want to discourage embedding files so large that it matters anyway...)

If the data is stored uncompressed, is there some way we could get a pointer and size to wherever it's embedded in memory? Then one could point a mem block to it with M = _MEM(pointer, size), and access that embedded data directly, without ever having to assign it to another actual variable.

That goes to my other comment about the multiple C arrays vs. a single one. If you store the entire thing in a single array uncompressed then you can create a STRING out of it using qbs_new_txt() and that does no copying. In that situation _EMBEDDED$() would return a STRING which is directly backed by the original data, and only if you modify the string will the data need to be copied.

It might be nice to let the caller pick if compression is done - going along with some of my other comments, if the compression is skipped then you could get away with no copying of the data. That could be something that the developer explicitly wants to avoid the runtime and memory cost (though we probably want to discourage embedding files so large that it matters anyway...)

I agree. We should let the caller decide. From my observations even an MP3 file would compress if you pass it through _DEFLATE but only by a few bytes. In those cases you end up decompressing the data twice which can contribute a significant performance hit.

mkilgore · 2023-12-13T20:53:46Z

source/qb64pe.bas

+OPEN "A", #eflFF, tmpdir$ + "embedded.cpp"
+'append the internal retrieval function for _EMBEDDED$()
+PRINT #eflFF, "qbs *func__embedded(qbs *handle) {"
+PRINT #eflFF, "    qbs *result = qbs_new_txt("; MKI$(&H2222); ");"


I think you should just call this on the return line, otherwise I'm pretty sure you need to qbs_free() it when it doesn't get used.

mkilgore · 2023-12-13T20:55:05Z

source/qb64pe.bas

+PRINT #eflFF, ""
+FOR i = 0 TO eflUB
+    IF embedFileList$(eflFile, i) <> "" AND VAL(embedFileList$(eflUsed, i)) > 0 THEN
+        PRINT #eflFF, "    result = GetArrayData_"; embedFileList$(eflHand, i); "(handle);"


It might be worth producing a hash table to look-up the correct function rather than the loop.

mkilgore · 2023-12-13T21:04:56Z

source/qb64pe.bas

@@ -3413,6 +3418,88 @@ DO
            RETURN
        END IF

+        IF LEFT$(a3u$, 6) = "$EMBED" THEN
+            'redim/clear the $EMBED array on each new pass
+            IF linenumber <= eflLastLine THEN eflLastLine = 0: REDIM SHARED embedFileList$(3, 10)


I think we should clear this out earlier when all the other stuff gets cleared before compilation (Ex. maybe around line 1646, that's where I clear the midi stuff).

The user might build a program that uses $EMBED and then build one that does not (or remove the line). When that happens, the embedFileList$ won't get reset on the next compile since there will be no $EMBED lines to trigger this.

mkilgore · 2023-12-13T21:22:54Z

source/qb64pe.bas

+            eflUB = UBOUND(embedFileList$, 2)
+            FOR i = 0 TO eflUB
+                IF embedFileList$(eflFile, i) = EmbedFile$ THEN
+                    a$ = "File '" + EmbedFile$ + "' was already embedded in line"


Personally I'm not sure we should ban this, I think you could reasonably want to use the same file for several embeds (Ex. You reuse the same image for different handles, so that you can easily change them separately in the future). I'd rather just allow this now and then in the future we could optimize it so that we only embed one copy of each file.

Wouldn't you just need one embed, even in that case?

image1 = _LOADIMAGE(_Embedded(foo$), 32, "memory")
image2 = _LOADIMAGE(_Embedded(foo$), 32, "memory")
image3 = _LOADIMAGE(_Embedded(foo$), 32, "memory")

The embed is just the image file itself in memory, and just like if it was on the drive, you'd still need to _LoadImage it before it'd become a valid, working image handle.

My thinking is not a "can you" but rather "should you". If you have multiple embeds that serve completely different purposes then you don't want them to share a handle regardless of whether they use the same file - in the future you may change the file for one of the handles but not both, and you wouldn't want to have to then figure out which spots are supposed to use which handle.

IMO It's the same as saying "two CONSTs shouldn't have the same value, if they do then just use the same CONST in both places". It's technically true, you don't need more than one CONST with a particular value, but it misses the point that if one CONST is named test_file and another is named logo they might temporarily (or always) have the same value but they don't serve the same purpose and combining them wouldn't be correct.

I don't think it's like having two CONSTs with the same value. I think it's more like putting the same font file inside your folder and then naming it cour(2).ttf, cout(3).ttf, cour(4).ttf.... At the end of the day, does it matter which of those files you use to load into a font handle? They're all the exact same font, and unless you have some plan to overwrite those files, there's absolutely no reason to have more than that one.

I honestly don't think there's any way to overwrite (and save) embedded data inside a program, with that program itself. Windows doesn't allow that type of virus-like behavior at all.

I'd think a better way to think of it would be like with DATA statements. Would you ever include the same 1000 lines of data, just to place them under a separate label? Or would you just make multiple labels for the same data?

label_1:
label_2:
label_3:

DATA "dog", "frog", "fish", "wish", "chicken"
...999 more lines of data

Then, if you really need to, you can always RESTORE label_1, RESTORE label_2, or RESTORE label_3, but you still wouldn't need to have that same 1000 lines repeated again and again in the same program.

At the most, I think allowing multiple labels for the same embed might not be so terrible. $EMBED:'foo.ext', 'handle1', 'handle2', 'handle3' But the idea of including the same file multiple times, just to have it listed under a separate label to read back from, seems like a bloat of exe size and a waste of an user's memory.

Remember, the handle is only in use really with _EMBEDDED, which will decompress the data and then assign it to wherever you want. For what you're talking about, all you'd need is:

a$ = _EMBEDDED$(foo$)
b$ = _EMBEDDED$(foo$)

Use a$ all you want in one spot. Use b$ all you want in a completely different spot. foo$ itself is embedded and wouldn't ever really change on you.

I think we're actually in agreement :P On the backend I agree, we should only keep one copy of the file embedded and serve it up for all the handles that use that file, but that's also an optimization we could do later-on if Rho doesn't want to do it here.

If we error on using the same file though, as it currently works, then introducing that optimization won't help people who already duplicated their files to get around the error.

a740g

We'll need to address the performance / memory costs.
Allow the user the choice of compressing the data or not.

a740g · 2023-12-14T04:40:43Z

source/utilities/file.bas

+    CLOSE #sff%
+    '--- try to compress ---
+    compdata$ = _DEFLATE$(filedata$)
+    IF LEN(compdata$) < LEN(filedata$) THEN


It might be nice to let the caller pick if compression is done - going along with some of my other comments, if the compression is skipped then you could get away with no copying of the data. That could be something that the developer explicitly wants to avoid the runtime and memory cost (though we probably want to discourage embedding files so large that it matters anyway...)

I agree. We should let the caller decide. From my observations even an MP3 file would compress if you pass it through _DEFLATE but only by a few bytes. In those cases you end up decompressing the data twice which can contribute a significant performance hit.

a740g · 2023-12-14T05:52:38Z

source/utilities/file.bas

+    NEXT vc&
+    '--- process remaining BYTEs ---
+    IF cntB& > 0 THEN
+        PRINT #dff%, "static const uint8_t "; handle$; "B[] = {"


I agree with @mkilgore. We can just create a byte array and have the whole file inside it. We can use sizeof to get the size of the array. Whether the data is compressed or not can be simply stored in a bool inside the function that we are generating. Uncompressed data can be directly used without the need to do two memcpys.

qbs *GetArrayData_peLogo(qbs *handle) { if (!qbs_equal(handle, qbs_new_txt("peLogo"))) {return qbs_new_txt("");} qbs *data = qbs_new(17394, 1); void *buff = data -> chr; memcpy(buff, &peLogoL0[1], peLogoL0[0] << 2); buff += (peLogoL0[0] << 2); memcpy(buff, &peLogoB[1], peLogoB[0]); return func__inflate(data, 17438, 1); }

@mkilgore

- As suggested by @mkilgore , moved the embed list array reset out of the $EMBED block - Imposed a 20% least ratio for compression - Moved the handle comparison into `func__embedded()` to avoid some unnecessary function calls

simple test case, which just embeds the `test.output` file and prints it

Implement file embedding

38eed18

$EMBED:'filename','handle' and _EMBEDDED$("handle")

RhoSigma-QB64 added the enhancement New feature or request label Dec 13, 2023

RhoSigma-QB64 requested review from mkilgore, SteveMcNeill and a740g December 13, 2023 19:53

RhoSigma-QB64 self-assigned this Dec 13, 2023

RhoSigma-QB64 linked an issue Dec 13, 2023 that may be closed by this pull request

$Embed and _Embed #383

Closed

RhoSigma-QB64 added 4 commits December 13, 2023 21:42

Fixing tests

34bbdae

try again...

Fixing tests

1358716

forget AddQuotes$(), rather make file.bas self-contained using CHR$(34)

Fixing Makefile

acfcd4f

as suggested by Matt...

Add missing changes

2d9018c

ideally these should have been part of the 1st commit

mkilgore reviewed Dec 14, 2023

View reviewed changes

a740g reviewed Dec 14, 2023

View reviewed changes

RhoSigma-QB64 added 6 commits December 15, 2023 00:22

Minor flow changes

45ffc8c

- As suggested by @mkilgore , moved the embed list array reset out of the $EMBED block - Imposed a 20% least ratio for compression - Moved the handle comparison into `func__embedded()` to avoid some unnecessary function calls

Adds an test case

84cd581

simple test case, which just embeds the `test.output` file and prints it

Fixing the test

d37d0ff

Fixed include path separators

84c9761

Avoid OS differences in tests

395af5c

Another try...

0adcd6c

SteveMcNeill approved these changes Dec 15, 2023

View reviewed changes

RhoSigma-QB64 merged commit 7bb6fb3 into main Dec 16, 2023

RhoSigma-QB64 deleted the file-embedding branch December 16, 2023 08:31

RhoSigma-QB64 restored the file-embedding branch December 16, 2023 09:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement file embedding #414

Implement file embedding #414

RhoSigma-QB64 commented Dec 13, 2023 •

edited

Loading

mkilgore left a comment

mkilgore Dec 13, 2023

mkilgore Dec 13, 2023

a740g Dec 14, 2023

mkilgore Dec 13, 2023

a740g Dec 14, 2023

mkilgore Dec 14, 2023

mkilgore Dec 13, 2023

mkilgore Dec 13, 2023

mkilgore Dec 13, 2023

SteveMcNeill Dec 14, 2023

mkilgore Dec 14, 2023

a740g Dec 14, 2023

mkilgore Dec 13, 2023

mkilgore Dec 13, 2023

mkilgore Dec 13, 2023

mkilgore Dec 13, 2023

SteveMcNeill Dec 14, 2023

mkilgore Dec 14, 2023

SteveMcNeill Dec 14, 2023

SteveMcNeill Dec 14, 2023

mkilgore Dec 14, 2023

a740g left a comment

a740g Dec 14, 2023

a740g Dec 14, 2023

Implement file embedding #414

Implement file embedding #414

Conversation

RhoSigma-QB64 commented Dec 13, 2023 • edited Loading

mkilgore left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a740g left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RhoSigma-QB64 commented Dec 13, 2023 •

edited

Loading