Questions: What's the magic behind SingleFileZ? #1275

giiutfff · 2023-10-08T15:03:52Z

I couldn't find any documentation about the logic behind the recently added feature "Self-Extracting ZIP Files Added to SingleFile (version 1.22)". Switching to the new SingleFileZ format also feels a little intimidating, considering that Mozilla dropped MAFF. So I'd like to know what exactly this new format is, how it works, and what its future looks like.

Binary Data in HTML?

I didn't know you could actually embed binary data in HTML before SingleFileZ. It's really impressive. How did you do this?! Why use <xmp>? Isn't it deprecated? What is the base64 data in <sfz-extra-data>? And why use the ISO-8859-1 encoding?

Redirect HTTP Request??

Here are some pieces of saved HTML (extracted using 7z):
<link rel=stylesheet href="stylesheet_0.css">
<img src=images/1.svg>
How does it redirect these URLs? I never knew that these URLs could be intercepted by JavaScript. I tried to read the source code but I'm too stupid to understand it. Could you please explain the method? Is it future-proof (for example, opening a saved file decades later with future systems and software)?

gildas-lormeau (Maintainer) · 2023-10-08T16:23:37Z

Thanks for the questions; they give me a chance to briefly document how the format works. The files generated by SingleFileZ are ZIP files. The ZIP specification allows arbitrary data to be inserted before and after the payload, and SingleFileZ uses this to disguise the ZIP file as an HTML file. The resulting HTML page is invalid because it contains binary data, but the HTML specification allows for this case. The file also contains a script that unzips the ZIP payload when the page is displayed in a browser.
This script needs to read the ZIP payload as binary data and unzip it. To read the page in binary form, the script can use window.fetch(), but this doesn't work for "security reasons" in Chromium-based browsers when the page is displayed from the filesystem.
To get around this problem, I've implemented a mechanism that reads the binary data directly from the DOM. When the page is encoded in UTF-8, all invalid byte sequences are transformed into "U+FFFD REPLACEMENT CHARACTER". This makes UTF-8 very impractical for this purpose because a lot of data is lost. On the other hand, when the page is encoded in ISO-8859-1, this only happens to the character with code 0 (NUL); the other invalid characters are not transformed into "U+FFFD REPLACEMENT CHARACTER" and can be recovered. The last problem to be solved is the CR and CR+LF sequences. They are all replaced by LF characters, so data must be included in the page to restore them. This is the role of the <sfz-extra-data> tag, which contains this data as well as the offset of the start of the ZIP payload.
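
The newline problem described above can be sketched as follows. This is only an illustration, not SingleFile's actual code: the function names and the "list of offsets" format are assumptions, and the real layout of <sfz-extra-data> is not reproduced here. The idea is that the writer records where CR and CR+LF occurred before the parser turns them into LF, so the reader can put them back.

```javascript
// Hypothetical helpers: names and the offset-list format are illustrative only.

// Mimic the parser's newline normalization: CR+LF and lone CR both become LF.
// Returns the normalized bytes plus the positions needed to undo the change.
function normalizeNewlines(bytes) {
  const normalized = [];
  const crOffsets = [];   // positions in `normalized` where a lone CR became LF
  const crlfOffsets = []; // positions in `normalized` where CR+LF became LF
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0x0d) {
      if (bytes[i + 1] === 0x0a) {
        crlfOffsets.push(normalized.length);
        i++; // skip the LF of the CR+LF pair
      } else {
        crOffsets.push(normalized.length);
      }
      normalized.push(0x0a);
    } else {
      normalized.push(bytes[i]);
    }
  }
  return { normalized, crOffsets, crlfOffsets };
}

// Undo the normalization using the recorded positions.
function restoreNewlines(normalized, crOffsets, crlfOffsets) {
  const lone = new Set(crOffsets);
  const pairs = new Set(crlfOffsets);
  const restored = [];
  normalized.forEach((byte, index) => {
    if (lone.has(index)) {
      restored.push(0x0d);
    } else if (pairs.has(index)) {
      restored.push(0x0d, 0x0a);
    } else {
      restored.push(byte);
    }
  });
  return Uint8Array.from(restored);
}

// Example: binary data containing a CR+LF pair, a lone CR and a plain LF.
const original = Uint8Array.from([0x50, 0x0d, 0x0a, 0x4b, 0x0d, 0x03, 0x0a]);
const { normalized, crOffsets, crlfOffsets } = normalizeNewlines(original);
const restored = restoreNewlines(normalized, crOffsets, crlfOffsets);
```

Plain LF bytes need no entry in either list, which keeps the extra data small when the payload contains few CR bytes.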

Paths like stylesheet_0.css or images/1.svg in the HTML page are replaced with data: or blob: URIs by the script just before displaying the page.
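
As a rough sketch of that replacement step (assuming the files have already been unzipped into memory), the snippet below swaps relative paths for data: URIs. The names `toDataUri` and `rewriteResourceUrls` and the shape of the `files` map are made up for this example; the real script may be organized differently and can also use blob: URIs.

```javascript
// Hypothetical sketch: function names and the shape of `files` are assumptions.

// Build a data: URI from raw bytes.
function toDataUri(bytes, mimeType) {
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return "data:" + mimeType + ";base64," + btoa(binary);
}

// Replace relative paths found in `src`/`href` attributes with data: URIs.
// `elements` is any iterable of objects exposing getAttribute/setAttribute,
// for example document.querySelectorAll("[src], link[href]") in a browser.
// `files` maps a path inside the archive to { bytes, mimeType }.
function rewriteResourceUrls(elements, files) {
  for (const element of elements) {
    for (const name of ["src", "href"]) {
      const path = element.getAttribute(name);
      const file = path === null ? undefined : files.get(path);
      if (file) {
        element.setAttribute(name, toDataUri(file.bytes, file.mimeType));
      }
    }
  }
}
```

Attributes whose value is not a path in the archive are left untouched, so absolute URLs in the page keep working.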

Since the saved page is a ZIP file, I think the format is quite future-proof for the coming decades. Backward compatibility of HTML/JS should also ensure that the script in the saved page works for a long time.
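
To illustrate why generic tools such as 7z can still open the file: ZIP readers typically locate the archive by searching backward from the end of the file for the "end of central directory" record, so leading HTML does not get in the way. The snippet below is a simplified sketch of that search only (the function name is made up, and real readers also validate the other fields of the record).

```javascript
// Signature of the ZIP "end of central directory" record: "PK\x05\x06".
const EOCD_SIGNATURE = [0x50, 0x4b, 0x05, 0x06];
// Size of the fixed part of the record (without the optional comment).
const EOCD_MIN_LENGTH = 22;

// Scan backward for the end-of-central-directory record.
// Returns its byte offset, or -1 if no signature is found.
function findEndOfCentralDirectory(bytes) {
  for (let offset = bytes.length - EOCD_MIN_LENGTH; offset >= 0; offset--) {
    if (EOCD_SIGNATURE.every((value, index) => bytes[offset + index] === value)) {
      return offset;
    }
  }
  return -1;
}

// Demo: an empty ZIP archive (a bare 22-byte record) wrapped in HTML-like text.
const encoder = new TextEncoder();
const prefix = encoder.encode("<!DOCTYPE html><html><body><!--");
const emptyZip = Uint8Array.from([...EOCD_SIGNATURE, ...new Array(18).fill(0)]);
const suffix = encoder.encode("--></body></html>");
const page = Uint8Array.from([...prefix, ...emptyZip, ...suffix]);
const eocdOffset = findEndOfCentralDirectory(page);
```

In this demo the record is found right after the HTML prefix, at offset `prefix.length`.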

giiutfff (Author) · 2023-10-08T17:56:56Z

Thank you for the explanation.

nettybun · 2023-10-11T20:28:40Z

Thanks for the write-up. I also found myself digging through recent commits like "merge sfz code" to see how you managed the universal format. Really impressive workarounds! I'll have to keep reading about the ISO-8859-1 encoding and why some characters like 0 and CR+LF are lost. Glad it shipped.

PS. This code made me smile :)

	async function base64ToUint32Array(data) {
		return new Uint32Array(await (await fetch("data:application/octet-stream;base64," + data)).arrayBuffer());
	}

gildas-lormeau (Maintainer) · 2023-10-12T22:58:17Z

@heyheyhello You're welcome. I'm glad my code made you smile :)

For the record, here is the test page I used to find the best encoding: https://jsfiddle.net/7qv3y20z/. It shows how binary content is altered when read from the DOM, depending on the encoding of the page.

Here is the code of the test (HTML first, then JavaScript).

<!DOCTYPE html>
<html>

  <head>
    <title>Test binary content</title>
    <style>
      body {
        font-family: monospace;
      }
    </style>
  </head>

  <body>
    <iframe hidden></iframe>
  </body>

</html>

const ENCODINGS = [
  "utf-8",
  "ibm866",
  "iso-8859-2",
  "iso-8859-3",
  "iso-8859-4",
  "iso-8859-5",
  "iso-8859-6",
  "iso-8859-7",
  "iso-8859-8",
  "iso-8859-8i",
  "iso-8859-10",
  "iso-8859-13",
  "iso-8859-14",
  "iso-8859-15",
  "iso-8859-16",
  "koi8-r",
  "koi8-u",
  "macintosh",
  "windows-874",
  "windows-1250",
  "windows-1251",
  "windows-1252",
  "windows-1253",
  "windows-1254",
  "windows-1255",
  "windows-1256",
  "windows-1257",
  "windows-1258",
  "x-mac-cyrillic",
  "gbk",
  "gb18030",
  "big5",
  "euc-jp",
  "iso-2022-jp",
  "shift-jis",
  "euc-kr",
  "utf-16be",
  "utf-16le",
  "x-user-defined"
];

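// Index of the encoding currently being tested.
let encodingIndex = 0;
// Each test page posts back the character codes it read from its comment node.
onmessage = ({
  data
}) => {
  // Unload the finished test page.
  document.querySelector("iframe").src = "about:blank";
  // Byte N was written at position N, so any other value read back is a difference.
  const difference = [];
  data.forEach((value, index) => {
      if (value != index) {
          difference.push({
              expected: index,
              read: value
          });
      }
  });
  // Report the result for this encoding in a collapsible section.
  document.body.innerHTML +=
      "<details><summary>" + ENCODINGS[encodingIndex] + " (" +
      difference.length + " differences)</summary>" + 
      JSON.stringify(difference) + "<br><br></details>";
  // Move on to the next encoding, if any.
  encodingIndex++;
  if (encodingIndex < ENCODINGS.length) {
      runNext();
  }
};
runNext();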

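// Build a test page declared in the current encoding, with the bytes 0..255
// embedded in an HTML comment, and load it in the iframe.
function runNext() {
  const blob = new Blob([
      "<!DOCTYPE html> <html><head><meta charset=\"",
      ENCODINGS[encodingIndex],
      "\"></head><body><!--",
      // Raw bytes 0x00..0xFF, written as-is into the comment.
      new Uint8Array((new Array(256).fill(0).map((value, index) => index))),
      "--><script>(",
      // This function is stringified into the page: it reads the comment back
      // from the DOM and sends the character codes to the parent window.
      () => {
          const commentData = Array.from(document.body.firstChild.textContent).map((value) => value.charCodeAt(0));
          parent.postMessage(commentData, "*");
      },
      ")()<\/script></body></html>"
  ], {
      type: "text/html"
  });
  document.querySelector("iframe").src = URL.createObjectURL(blob);
}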
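
To make the ISO-8859-1 result concrete, here is a small sketch of how the text read from the DOM can be mapped back to bytes. It relies on two assumptions: browsers treat the "iso-8859-1" label as windows-1252 (as defined by the WHATWG Encoding Standard), and byte 0 is the only value that comes back as U+FFFD, as described earlier in this thread. The names are made up for the example, and CR/LF handling is left out because it needs separate data.

```javascript
// Code points that the windows-1252 decoder assigns to bytes 0x80..0x9F.
// All other bytes map to the code point with the same numeric value.
const HIGH_RANGE_CODE_POINTS = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178
];
const REPLACEMENT_CHARACTER = 0xfffd;

// Forward direction: the code point expected in the DOM text for a given byte
// (ignoring CR/LF normalization).
function byteToCodePoint(byte) {
  if (byte === 0) {
    return REPLACEMENT_CHARACTER; // NUL is replaced when the page is parsed
  }
  if (byte >= 0x80 && byte <= 0x9f) {
    return HIGH_RANGE_CODE_POINTS[byte - 0x80];
  }
  return byte;
}

// Reverse lookup table: every byte maps to a distinct code point,
// so the mapping can be inverted without loss.
const CODE_POINT_TO_BYTE = new Map();
for (let byte = 0; byte < 256; byte++) {
  CODE_POINT_TO_BYTE.set(byteToCodePoint(byte), byte);
}

// Convert text read from the DOM back into the original bytes.
function textToBytes(text) {
  return Uint8Array.from(Array.from(text, (character) => {
    const byte = CODE_POINT_TO_BYTE.get(character.codePointAt(0));
    if (byte === undefined) {
      throw new RangeError("Character outside the expected table: " + character);
    }
    return byte;
  }));
}
```

Because U+FFFD can only come from byte 0 under these assumptions, it maps back to 0 unambiguously, which is not possible with UTF-8, where many different invalid sequences collapse to the same replacement character.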

gildas-lormeau (Maintainer) · 2023-10-25T22:45:33Z

For the record, I've updated the FAQ to explain how the format works, see https://github.com/gildas-lormeau/SingleFile/blob/master/faq.md#how-does-the-self-extracting-zip-format-work

gildas-lormeau (Maintainer) · 2024-01-02T02:45:01Z

Mainly because I can, I've started to play with support for PNG files. Here's an example of a page that is also a ZIP file and a PNG file: https://gildas-lormeau.github.io/
