Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sum changes when mp3 tag metadata changes #94

Open
krixano opened this issue Jul 9, 2022 · 10 comments
Open

Sum changes when mp3 tag metadata changes #94

krixano opened this issue Jul 9, 2022 · 10 comments

Comments

@krixano
Copy link

krixano commented Jul 9, 2022

I copied an mp3 file and changed the tag info (album, title, and artist), and hashed the origin and the copy, and they came up as different values.

The format of the files was ID3v2.3

@krixano
Copy link
Author

krixano commented Jul 9, 2022

@wader
Copy link
Contributor

wader commented Jul 12, 2022

No sure i follow, if some tag info was changed the hash should change also?

@krixano
Copy link
Author

krixano commented Jul 12, 2022

@wader The hash function in the library is not supposed to give a different result when tags change, as specified by the library's documentation. The hash function in the library is intended to only hash (SHA1) the audio portion without the header or any tags. This is literally specified right in the README:

This package also provides a metadata-invariant checksum for audio files: only the audio data is used to construct the checksum.

The fact that you don't follow just means you didn't read all of the documentation.

@krixano
Copy link
Author

krixano commented Jul 12, 2022

This issue has been reported previously as well:

#53

@wader
Copy link
Contributor

wader commented Jul 12, 2022

Ok sorry, i read "hashed" as hashing the whole file, not using the Sum*() functions. As i mention in #53 i think the issue is that sizeToEndOffset does something wrong. Looking at it now it seems to restore the original position incorrectly, seek to end-128, seeks back -current_pos bytes, ends up at start of file? also don't see how the id3v1 footer is skipped in the SumID3v2 case? your files has both header and footer that differ.

$ fq 'input_filename, with_entries(.value |= (tobytes | {md5: (tomd5 | tohex), size: .size}))' 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e.mp3 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e\ copy.mp3
"0a672d92a0770ef72959ac0ec19d69ca13c6fb9e.mp3"
{
  "footers": {
    "md5": "c4f932b0678e613d3eb5c819ec644220",
    "size": 128
  },
  "frames": {
    "md5": "721a41215e0d17d0ff1fd3a007a85514",
    "size": 4590290
  },
  "headers": {
    "md5": "e20a90cf3679a47711d84bd409ead1fb",
    "size": 361
  }
}
"0a672d92a0770ef72959ac0ec19d69ca13c6fb9e copy.mp3"
{
  "footers": {
    "md5": "823d9cfbc26600bc778dd39ad93e9beb",
    "size": 128
  },
  "frames": {
    "md5": "721a41215e0d17d0ff1fd3a007a85514",
    "size": 4590290
  },
  "headers": {
    "md5": "d633ad9c7f1e7fabce46241d92c5cbb7",
    "size": 361
  }
}

@krixano
Copy link
Author

krixano commented Jul 12, 2022

I changed the tags with Windows, so idk if that matters at all - if windows changes the footer for some reason. I don't know much about audio file formats, so I didn't even know there was a footer, lol. What is that "fq" tool you are using?

Ok sorry, i read "hashed" as hashing the whole file, not using the Sum*() functions.

Sorry, should have been more specific in the body of the issue. I said "hashed" because that's basically what Sum does, afaik. Anyways, I'm considering just using a different library to fingerprint the audio, but it seems golang doesn't have very many except this one. And this one has the added bonus of taking care of the headers and tags.

@wader
Copy link
Contributor

wader commented Jul 13, 2022

I changed the tags with Windows, so idk if that matters at all - if windows changes the footer for some reason. I don't know much about audio file formats, so I didn't even know there was a footer, lol.

Ok, don't think windows or not should matter. But it seems like both headers and footers got chanted quite a lot. I will attach a "structural" diff below.

The mp3 "format" is quite a messy, in summary it's usually zero or more tags spread out between mp3 frames :) an mp3 decoder usually scans for mp3 frame sync point, decodes it, find next sync, repeat. So it ignores tags and other things.

What is that "fq" tool you are using?

It's https://github.com/wader/fq a tool i've been working on for doing these kind of things, I spend quite a lot of day time job digging into broken media files :) ... also free time

Sorry, should have been more specific in the body of the issue. I said "hashed" because that's basically what Sum does, afaik. Anyways, I'm considering just using a different library to fingerprint the audio, but it seems golang doesn't have very many except this one. And this one has the added bonus of taking care of the headers and tags.

No worries 👍

Was a while since i did a survey of tagging golang packages but i guess this one is probably the most complete.

I can try fix the sum/hashing code to properly skip header/footer tag or your already looking into it?

fq is not really superfast at decoding but you could use it to compare:

# show md5 hex string for bytes between first frames first byte and last frames last byte
$ fq -r '.frames | tomd5 | tohex' 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e.mp3
721a41215e0d17d0ff1fd3a007a85514
# technically there could be non-mp3-frame between between them. if you really want just the mp3 frames bytes you can do something like [.frames[]] | tomd5 | ... 
# same but output filename and md5
$ fq -r '.frames | tomd5 | tohex | "\(input_filename): \(.)"' 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e*
0a672d92a0770ef72959ac0ec19d69ca13c6fb9e copy.mp3: 721a41215e0d17d0ff1fd3a007a85514
0a672d92a0770ef72959ac0ec19d69ca13c6fb9e.mp3: 721a41215e0d17d0ff1fd3a007a85514
# do a structural diff between the two files
# -s slurps the two inputs (they will automatically be detect as mp3 and decoded) into an array
# diff first and second input
# footer diff shows it's been weirdly half padded with white space
# header diff is a bit messier as frames and changed
$ fq -s 'diff(.[0]; .[1])' 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e.mp3 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e\ copy.mp3
{
  "footers": {
    "0": {
      "album_name": {
        "a": "Under My Skin",
        "b": "                              "
      },
      "artist": {
        "a": "Avril Lavigne",
        "b": "                              "
      },
      "comment": {
        "a": "",
        "b": "                            "
      },
      "song_name": {
        "a": "He Wasn't",
        "b": "He Wasnt                      "
      },
      "year": {
        "a": "",
        "b": "    "
      }
    }
  },
  "headers": {
    "0": {
      "frames": {
        "1": {
          "id": {
            "a": "TPE1",
            "b": "TYER"
          },
          "size": {
            "a": 29,
            "b": 1
          },
          "text": {
            "a": "Avril Lavigne",
            "b": ""
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          }
        },
        "2": {
          "id": {
            "a": "TIT2",
            "b": "TRCK"
          },
          "size": {
            "a": 21,
            "b": 2
          },
          "text": {
            "a": "He Wasn't",
            "b": "4"
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          }
        },
        "3": {
          "id": {
            "a": "TALB",
            "b": "TCON"
          },
          "size": {
            "a": 29,
            "b": 20
          },
          "text": {
            "a": "Under My Skin",
            "b": "Indie / Alternative"
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          }
        },
        "4": {
          "description": {
            "b": ""
          },
          "id": {
            "a": "TYER",
            "b": "COMM"
          },
          "language": {
            "b": "eng"
          },
          "size": {
            "a": 3,
            "b": 5
          },
          "text": {
            "a": ""
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          },
          "value": {
            "b": ""
          }
        },
        "5": {
          "id": {
            "a": "TRCK",
            "b": "TIT2"
          },
          "size": {
            "a": 5,
            "b": 9
          },
          "text": {
            "a": "4",
            "b": "He Wasnt"
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          }
        },
        "6": {
          "id": {
            "a": "TCON",
            "b": "TCOM"
          },
          "size": {
            "a": 41,
            "b": 14
          },
          "text": {
            "a": "Indie / Alternative",
            "b": "Avril Lavigne"
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          }
        },
        "7": {
          "id": {
            "a": "TPE2",
            "b": "TLEN"
          },
          "size": {
            "a": 29,
            "b": 7
          },
          "text": {
            "a": "Avril Lavigne",
            "b": "179773"
          },
          "text_encoding": {
            "a": "utf16",
            "b": "iso_8859-1"
          }
        },
        "8": {
          "a": {
            "flags": {
              "compression": false,
              "encryption": false,
              "file_alter_preservation": false,
              "grouping_identity": false,
              "read_only": false,
              "tag_alter_preservation": false,
              "unused0": 0,
              "unused1": 0
            },
            "id": "TCOM",
            "size": 29,
            "text": "Avril Lavigne",
            "text_encoding": "utf16"
          }
        },
        "9": {
          "a": {
            "flags": {
              "compression": false,
              "encryption": false,
              "file_alter_preservation": false,
              "grouping_identity": false,
              "read_only": false,
              "tag_alter_preservation": false,
              "unused0": 0,
              "unused1": 0
            },
            "id": "TLEN",
            "size": 7,
            "text": "179773",
            "text_encoding": "iso_8859-1"
          }
        }
      },
      "padding": {
        "b": "<155>AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
      }
    }
  }
}

Other commands that might be useful:

# show most of decode tree of a file (`da` to show full)
$ fq d 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e.mp3
# show only footers
$ fq '.footers | d' 0a672d92a0770ef72959ac0ec19d69ca13c6fb9e*

@wader
Copy link
Contributor

wader commented Jul 13, 2022

Did a quick hack just to verify. This gives same sha1 for the files for me:

func SumID3v2(r io.ReadSeeker) (string, error) {
	n, err := sizeToEndOffset(r, 128)
	if err != nil {
		return "", fmt.Errorf("error determining read size to ID3v1 header: %v", err)
	}

	header, offset, err := readID3v2Header(r)
	if err != nil {
		return "", fmt.Errorf("error reading ID3v2 header: %v", err)
	}

	_, err = r.Seek(int64(header.Size)+int64(offset), io.SeekStart)
	if err != nil {
		return "", fmt.Errorf("error seeking to end of ID3V2 header: %v", err)
	}

	// TODO: remove this check?????
	if n < 0 {
		return "", fmt.Errorf("file size must be greater than 128 bytes for MP3: %v bytes", n)
	}

	frameBytes := n - (int64(offset) + int64(header.Size))

	h := sha1.New()
	_, err = io.CopyN(h, r, frameBytes)
	if err != nil {
		return "", fmt.Errorf("error reading %v bytes: %v", n, err)
	}
	return hashSum(h), nil
}

To do this more properly I think one also needs to check if a footer exist, currently i think it will always skip the last 128 bytes.

@krixano
Copy link
Author

krixano commented Jul 14, 2022

Cool! Thanks for looking into this!

@krixano
Copy link
Author

krixano commented Sep 13, 2023

So, I have looked at these files again and compared them to other files. It looks like the LAME encoder on Linux puts both ID3v2 tags at the start of the file and ID3v1 tags at the end of the file, as well as a LAME Header just after the ID3v2 tags.

@dhowden @wader Do you know if this library supports files that have both types of tags at the same time? And currently the Sum feature doesn't take into account that there could be ID3v1 tags along with ID3v2 tags, so it would be cool if this was taken into account and committed into this repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants