Commit
Normalize ingested API data to unicode NFC format
Some Unicode characters can be represented in either "NFC" or "NFD" form. The C and D stand for composed and decomposed. Composed means a character like "ü" is a single code point, whereas decomposed means a plain "u" followed by a combining diaeresis. The pair is rendered as a single character even though it is actually two code points.

In the HFS+ era of macOS, all filenames were stored in a decomposed (NFD-style) form. That is valid, but it is not what Linux systems typically use, which is NFC. As of APFS, either form is accepted and the filesystem does not modify or normalize the paths it is given.

However, in the case of mktorrent (probably because mktorrent uses libraries built into macOS), paths read from the APFS disk, which were written in NFC, are normalized back to NFD when mktorrent generates the encoded data for the torrent. This means that a torrent built with mktorrent on macOS will fail to verify when loaded in a torrent client on a Linux machine whenever a path within that torrent contains a string that differs between its NFC and NFD forms.

This is not something I can fix here, even though this commit makes sure we write data in NFC form. There is a GitHub issue open on the mktorrent repo: pobrn/mktorrent#14

For now, it needs to be clear that torrents must be generated on a Linux box until we can confirm that mktorrent makes the right normalization decisions.
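To illustrate the kind of normalization described above, here is a minimal sketch. It assumes the ingested API data has already been decoded into Python strings, dicts, and lists. The helper names `normalize_nfc` and `differs_between_forms` are hypothetical and are not taken from this repository.

```python
import unicodedata


def normalize_nfc(value):
    """Recursively normalize every string in a JSON-like structure to NFC.

    Hypothetical helper: assumes the data consists of str, dict, list,
    and scalar values, as produced by a typical JSON decoder.
    """
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {normalize_nfc(k): normalize_nfc(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_nfc(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged


def differs_between_forms(text):
    """Return True if the NFC and NFD forms of text are different strings.

    Hypothetical helper: paths for which this is True are the ones that
    could be affected by the mktorrent behaviour described above.
    """
    return unicodedata.normalize("NFC", text) != unicodedata.normalize("NFD", text)


composed = "\u00fc"     # "ü" as a single code point (NFC)
decomposed = "u\u0308"  # "u" followed by COMBINING DIAERESIS (NFD)

# The two spellings render identically but are different strings.
assert composed != decomposed
assert normalize_nfc(decomposed) == composed
```

Under these assumptions, a check such as `differs_between_forms(path)` could be used to flag the paths that are at risk when a torrent is built on macOS; plain ASCII paths are the same in both forms and are unaffected.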