2024-06-25 sndhdr update and HD/CD/DVD Image files #87

Merged
merged 6 commits into from
Jul 11, 2024

Conversation

NebularNerd
Contributor

@NebularNerd NebularNerd commented Jun 27, 2024

Should close #85

SNDHDR Parity update (and HD/CD/DVD Image files)

.aif/.aiff/.aiffc/.8svx:

These are IFF based files, all start with 0x464f524d/FORM
AIFF files are a hodgepodge of formats and specs all thrown under the same label. Different compression styles, or similar compression styles with the wrong FourCC, can render a file unplayable on certain software.

I've updated/tidied the database to recognise the additional AIFF or AIFC header at byte 8. With possible enhancements under V2 we could perform further matches to detail the compression used and possibly even bitrates etc.

  • .aif I've removed the trailing 00 to allow it access to the multi-part section for better confidences
  • .aiff @cdgriffith had already added AIFF and 8SVX at byte 8 in multi-part, corrected MIME for AIFF
  • Removed ["41494646", 8, ".aif", "audio/x-aiff", "AIFF/Amiga/Mac audio"], ["38535658", 8, ".aif", "audio/x-aiff", "AIFF/Amiga/Mac audio"] and ["41494643", 8, ".aiffc", "audio/x-aifc", "AIFC audio"] as these are covered by other matches in the 0x464f524d/FORM match format.
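To illustrate the FORM-plus-FourCC layout described above, here is a minimal sketch of a two-step check: FORM at byte 0, then the sub-type at byte 8. The function name `identify_form_type` and the label table are hypothetical and only for illustration, this is not how the library itself performs the multi-part match.

```python
from typing import Optional

# FourCC found at byte 8 of a FORM container, mapped to a short label.
# The labels are illustrative, not the descriptions used in the database.
FORM_TYPES = {b"AIFF": "AIFF", b"AIFC": "AIFC", b"8SVX": "8SVX"}


def identify_form_type(header: bytes) -> Optional[str]:
    """Return the IFF sub-type label, or None if this is not a known FORM file."""
    # Bytes 0-3 must be 0x464f524d / FORM; bytes 4-7 hold the chunk size.
    if len(header) < 12 or header[:4] != b"FORM":
        return None
    # Bytes 8-11 carry the FourCC that tells the flavours apart.
    return FORM_TYPES.get(header[8:12])
```

Any other FourCC at byte 8 simply returns None, so an unrelated IFF file does not get mislabelled as audio.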

.au:

The existing fingerprint should match all files
No changes, we could extract more info but looking at how sndhdr does it I'll leave that for a V2 upgrade

.hcom:

There is almost no information on this format, and what exists is basically the same data as linked below, presented in differing formats. From what I can see it's some old Apple Mac format, possibly used in apps and games.

The sndhdr test looks for two headers, one is in the Mac header, the other in the Mac data fork. For the time being I have added them as two separate tests, this will give a low-ish confidence score, however, in the absence of test files there is little more I can do.

If anyone ever reads this and has some sample files, I'll take a look to improve this match.

.sndt:

After a lot of digging I found this format seems to belong to a very old Win 3.1 era program called SoundTool/SNDTOOL, I managed to source a copy buried in a shareware .iso at archive.org. Downloading it and comparing a sample file included to the ones below seems to indicate this is the source of these files.

.voc/.wav:

No changes required, existing fingerprint will match any VOC/WAV file.
V2 Improvements could look to decode audio data for sample rate etc...

.sb/.ub/.ulaw:

Cannot add. As far as I can guess the intentions of the sndhdr authors, .sb and .ub are intended to be signed or unsigned byte-streams. This means they are simply a stream of bytes holding audio data; you need to already know the correct bitrate etc. to decode them back to audio.

.ulaw is essentially a CODEC used in various audio containers such as AIFF and AU, this again means there is no specific ulaw file format.

In these cases, there is not a lot we can do to detect these files. It would basically require creating an audio decoder similar to sndhdr or Audacity, VLC etc. to fully process and try to understand these files. This could be possible with V2 but this would take on a life of its own.

.sndr:

I have no idea on this, I've added the header match from sndhdr but again without test files or knowledge of the program they came from we can't go any better than that.

Again, if anyone reading this has any test files or the program that made them, I'll take a look and improve.

Other formats:

Honestly this PR is a bit of a so-so one, so let's add some extra stuff to make it more exciting.

.vhdx:

The updated version of the older .vhd format used by Microsoft Hyper-V and Virtual PC, nice simple header of 0x7668647866696c65 / vhdxfile.

.qcow/.qcow2/.qed:

QEMU's Hard drive image formats. Simple headers with version numbers

  • 0x514649fb00000001 / QFI\xfb\x00\x00\x00\x01 for QCOW Image
  • 0x514649fb00000002 / QFI\xfb\x00\x00\x00\x02 for QCOW2
  • 0x514649fb00000003 / QFI\xfb\x00\x00\x00\x03 for QCOW3 (Still .qcow2 extension)
  • 0x514544 / QED for QEMU Enhanced Disk Image
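As a rough sketch of how the version number in these headers could be read, the snippet below treats the four bytes after QFI\xfb as a big-endian integer, which is what the hex values listed above imply. `identify_qemu_image` is a hypothetical helper name and not part of the library.

```python
import struct
from typing import Optional

QCOW_MAGIC = b"QFI\xfb"  # 0x514649fb
QCOW_VERSIONS = {1: "QCOW", 2: "QCOW2", 3: "QCOW3"}


def identify_qemu_image(header: bytes) -> Optional[str]:
    """Return a label for a QEMU disk image header, or None if unrecognised."""
    if header[:3] == b"QED":  # 0x514544
        return "QED"
    if len(header) >= 8 and header[:4] == QCOW_MAGIC:
        # Bytes 4-7 are the version, stored big-endian.
        (version,) = struct.unpack(">I", header[4:8])
        return QCOW_VERSIONS.get(version)
    return None
```

Reading the version as an integer rather than matching three separate eight-byte strings means an unknown future version falls through to None instead of being silently mislabelled.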

.luks

Linux Unified Key Setup is another HD Image format, there are two versions LUKS1 and LUKS2.

  • 0x4c554b53babe0001 / LUKS\xba\xbe\x00\x01 for LUKS1
  • 0x4c554b53babe0002 / LUKS\xba\xbe\x00\x02 for LUKS2
    It's an interesting format that has an embedded .json, future V2 functionality could interrogate the files to display encryption type and other data.
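A minimal sketch of splitting the LUKS header into magic and version, assuming the last two bytes of the signatures above are a big-endian version field (which is what the two hex values suggest). `luks_version` is a hypothetical name used only for this example, and it does not touch the embedded JSON.

```python
import struct
from typing import Optional

LUKS_MAGIC = b"LUKS\xba\xbe"  # 0x4c554b53babe


def luks_version(header: bytes) -> Optional[int]:
    """Return 1 or 2 for a LUKS1/LUKS2 header, or None for anything else."""
    if len(header) < 8 or header[:6] != LUKS_MAGIC:
        return None
    # Bytes 6-7 hold the version, stored big-endian.
    (version,) = struct.unpack(">H", header[6:8])
    return version if version in (1, 2) else None
```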

.vdi

Sun/Oracle HD Image for use with VirtualBox. Nice long headers to match against. There is no official document on the format it seems but a good breakdown is available, linked below.

  • 0x3c3c3c2053756e2078564d205669727475616c426f78204469736b20496d616765203e3e3e / <<< Sun xVM VirtualBox Disk Image >>> for older Sun images
  • 0x3c3c3c204f7261636c6520564d205669727475616c426f78204469736b20496d616765203e3e3e / <<< Oracle VM VirtualBox Disk Image >>> for newer Oracle images

As far as I can see there is only one version (1.1) with the same image signature starting at byte 64 for both flavours, I've included it as a multi-part for completeness.

.vmdk

There are already entries in the .json for VMware .vmdk files. I have tidied and adjusted some to better match real-world files:

  • ["4b444d", 0, ".vmdk", "application/octet-stream", "VMware 4 Virtual Split Disk file"] Removed as the correct match is below it in the .json
  • ["23204469736b2044657363726970746f", 0, ".vmdk", "application/octet-stream", "VMware 4 Virtual Split Disk file"] Corrected to include better match using the full term # Disk DescriptorFile and changed the name to VMware Image Descriptor File
  • ["23204469736b2044", 0, ".vmdk", "application/octet-stream", "VMware Virtual Disk description"] Removed as above fix is better match
  • Removed 3 and 4 from the 434f5744/COWD and 4b444d56/KDMV labels. These files do different jobs; they are not for different versions of VMware

.dmg

The venerable archive format of Mac OS machines, the existing entry would only ever work for the file it came from. The correct way to identify a .dmg is to use a footer match at -512 for koly.
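To show what a footer match at -512 means in practice, here is a small sketch that seeks from the end of a file-like object and compares the bytes found there. `footer_matches` is a hypothetical helper for illustration and is not the library's own footer logic.

```python
import os
from typing import BinaryIO


def footer_matches(stream: BinaryIO, signature: bytes, offset: int) -> bool:
    """Check for `signature` at a negative `offset` measured from the end of the stream."""
    stream.seek(0, os.SEEK_END)
    size = stream.tell()
    # A file shorter than the footer distance cannot contain the match.
    if offset >= 0 or size < -offset:
        return False
    stream.seek(offset, os.SEEK_END)
    return stream.read(len(signature)) == signature
```

For a .dmg the call would look like `footer_matches(open(path, "rb"), b"koly", -512)`, where `path` is whatever file is being tested.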

["7801730d626260", 0, ".dmg", "application/octet-stream", "MacOS X image file"] and ["", 0, ".dmg", "application/octet-stream", "MacOS X image file"] removed, new entry in footer added

OK, Even more formats

I'll note here the CD/DVD images are a real pain in the backside, lots of overlapping headers and proprietary info. This is a good start for later V2 fun.

MagicISO Image Format .uif

A seemingly much hated proprietary format for storing images of CD/DVD's. Can't find any test files or documentation, however, there is UIF2ISO which converts the files to regular ISO. Digging in the source seems to show a header at byte 0 of 0x73696262 / sibb with another match at byte 8 of 0x72686c62 / rhlb if it's encrypted.

If I ever come across a real file to test against I'll confirm this but the code has been around a long time so it's pretty safe to assume it's correct.

PowerISO Direct Access Archive .daa

Another proprietary format for storing images of CD/DVD's, much like .uif it's also pretty unpopular. The author of UIF2ISO also created a tool to deal with them called DAA2ISO.
Simple header of 0x444141 / DAA at byte 0

gBurner Image .gbi

Another proprietary format for storing images of CD/DVD's, it appears to be quite similar to .daa as DAA2ISO handles both.
Simple header of 0x474249 / GBI at byte 0

Apple HyperCard Stack .hc

While I was looking for data on another .hc extension, HyperCards popped up, so we'll add them in while we're here. HyperCards were almost a precursor to web pages, able to store text and images in a clickable, searchable database. Header of 0x5354414b / STAK at byte 4.

VeraCrypt File Container .hc

An encrypted image container, we can only add this as an extension as the VERA header at byte 64 and all data following is encrypted by the 64 byte salt.

Nero Disc images .nrg

Nero was once one of the most popular CD/DVD burning tools, the .nrg was their own custom image format. These use Footer matches for the two versions 0x4e45524f / NERO at -8 and 0x4e455235 / NER5 at -12 for v1 and v2 images.
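The two footer positions above can be checked in one pass over the tail of the file. The sketch below works on bytes already read into memory; `nrg_version` is a hypothetical name for illustration only.

```python
from typing import Optional


def nrg_version(data: bytes) -> Optional[int]:
    """Return 2 for a NER5 footer at -12, 1 for a NERO footer at -8, else None."""
    # v2 images: 0x4e455235 / NER5 starts 12 bytes from the end.
    if len(data) >= 12 and data[-12:-8] == b"NER5":
        return 2
    # v1 images: 0x4e45524f / NERO starts 8 bytes from the end.
    if len(data) >= 8 and data[-8:-4] == b"NERO":
        return 1
    return None
```

Only the last 12 bytes are needed, so for large images it is enough to read the tail of the file rather than the whole thing.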

Compressed ISO images .isz

Created by EZB Systems for use in their various products, this is an open specification for producing ZLIB compressed version of ISO images. Header is 0x49735a21 / IsZ! at byte 0

DiscJuggler images .cdi

Padus DiscJuggler was a professional mastering solution for CD and DVD. Because its .cdi image format is highly flexible, it was adopted as the de facto format for archiving Dreamcast games. There appear to be a few versions. Adding as an extension only: looking at the source for cdi2nero, it's a complex format that would need a partial port of that app to understand, and libMirage confirms this.

CloneCD Control File .ccd, Image .img and Subchannel Info .sub

CloneCD is another powerful CD/DVD image tool. The .ccd contains various metadata relating to the .img file. Official specs on the format are non-existent it seems, I've inferred the matches from samples from a range of sources. Much like .cdi above some form of decoding may be the way to go in the future, looking at libMirage confirms this idea.

  • .ccd has 0x5b436c6f6e6543445d / [CloneCD] as its first line. There are versions on the next line, but as it's a text file, spacing/tabs could cause match issues. A regex solution would be best for extracting that info.
  • .sub files all appear to start with 0xffffffffffffffffffffffff then a few bytes after which may be some sort of versioning.
  • .img files appear to have a couple of different starting blocks, the two added seem to match a range of test .img files.
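A hedged sketch of the regex idea for .ccd files follows. It assumes the line after [CloneCD] has the form `Version=<number>`, which is an assumption on my part and is not confirmed by any official spec; `ccd_version` and `CCD_VERSION` are hypothetical names.

```python
import re
from typing import Optional

# Assumed layout: "[CloneCD]" then a "Version=<n>" line.
# \s* tolerates the spaces, tabs and CR/LF differences that break a fixed byte match.
CCD_VERSION = re.compile(rb"\A\[CloneCD\]\s*Version\s*=\s*(\d+)", re.IGNORECASE)


def ccd_version(data: bytes) -> Optional[int]:
    """Return the version number from a .ccd header, or None if it does not match."""
    match = CCD_VERSION.match(data)
    return int(match.group(1)) if match else None
```

A plain byte match on [CloneCD] would still be the primary test; the regex only pulls out the extra detail when the assumed layout is present.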

BlindWrite images .b5t / .b6t and BlindRead images .bwt

BlindWrite and its predecessor BlindRead are another set of CD/DVD imaging tools. Much like CloneCD they can produce various files to preserve important information about the source disk. Most of these will be extension only for the time being as I lack sample files and cannot find much about the format.

  • .BWS BlindRead Sub Channel Data
  • .BWT BlindRead Control File
  • .BWI BlindRead Image File
  • .B5T BlindWrite 5 Stream File, libMirage gives a header of 0x425754352053545245414d205349474e / BWT5 STREAM SIGN
  • .B5I BlindWrite 5 Image File (Tentatively adding header 0xffffffffffffffffffffffff based on source code to b5i2iso)
  • .B6T BlindWrite 6 Stream File libMirage gives a header of 0x425754352053545245414d205349474e / BWT5 STREAM SIGN
  • .B6I BlindWrite 6 Image File

WinOnCD images .c2d

While browsing the libMirage source for other formats, this one was in the list. This was an early entry into the CD mastering market, it changed hands a couple of times from Roxio to Adaptec. Two headers: 0x4164617074656320436551756164726174205669727475616c43442046696c65 / Adaptec CeQuadrat VirtualCD File and 0x526f78696f20496d6167652046696c6520466f726d617420332e30 / Roxio Image File Format 3.0

Adaptec Easy CD/DVD Creator image file .cif

Another CD/DVD creator software purchased by Adaptec from Corel, header info from libMirage. This uses a RIFF header, then 0x696d6167 / imag at byte 8.
There are earlier versions of the format that used .cl2, .cl3 and .cl4, but there is no info on these formats beyond that. Will add as extension only until sample files are found.

Alcohol 120% image file .mds and GameJack image file .xmd

Another powerful CD/DVD image creator, like BlindWrite and CloneCD it can make near perfect copies of most discs.
There's not much info on GameJack, it's either a licensed or questionable clone of Alcohol.

  • .mds and .xmd are the control file 0x4d454449412044455343524950544f5201 / MEDIA DESCRIPTOR\x01
  • .mdf This is the main image, it's already in the .json

Daemon Tools image file .mdx

Pretty much one of the most popular virtual drive tools, it's been around for a very long time.

  • .mdx uses 0x4d454449412044455343524950544f5202 / MEDIA DESCRIPTOR\x02 which is nearly identical to Alcohol's except for the last byte
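Since the Alcohol/GameJack and Daemon Tools headers differ only in the byte after MEDIA DESCRIPTOR, a sketch of telling them apart could look like this. `identify_media_descriptor` and the returned labels are hypothetical, chosen just for the example.

```python
from typing import Optional

MEDIA_DESCRIPTOR = b"MEDIA DESCRIPTOR"  # 16 bytes, shared by both headers

# Byte 16 separates the two families; the labels are illustrative only.
FLAVOURS = {
    0x01: "Alcohol 120% / GameJack control file (.mds/.xmd)",
    0x02: "Daemon Tools image (.mdx)",
}


def identify_media_descriptor(header: bytes) -> Optional[str]:
    """Return a label based on the byte following MEDIA DESCRIPTOR, else None."""
    if len(header) < 17 or header[:16] != MEDIA_DESCRIPTOR:
        return None
    return FLAVOURS.get(header[16])
```

Note that .mds and .xmd share the exact same 17 bytes, so the header alone cannot separate those two; only the file extension can.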

Apple Toast File .toast

Toast is an early CD burning software package for Macs, it's changed hands many times over the years.
Early toast files have a header of 0x45520200 / ER\x02\x00. Later toast files are simply .iso with a different name.

Links:

@NebularNerd NebularNerd changed the title 2024-06-25 sndhdr update and HD Image files 2024-06-25 sndhdr update and HD/CD/DVD Image files Jun 30, 2024
@cdgriffith
Owner

cdgriffith commented Jul 11, 2024

Thank you for all these additions!

@cdgriffith cdgriffith merged commit 1d94f59 into cdgriffith:develop Jul 11, 2024
9 checks passed
@cdgriffith cdgriffith mentioned this pull request Jul 11, 2024
cdgriffith added a commit that referenced this pull request Jul 11, 2024
- Adding #87 sndhdr update and HD/CD/DVD Image files (thanks to Andy - NebularNerd)
- Adding #88 Add .caf mime type (thanks to William Bonnaventure)
- Fixing #89 add py.typed to package_data (thanks to Sebastian Kreft)

---------

Co-authored-by: Sebastian Kreft <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: William Bonnaventure <[email protected]>