2024-08-25: Lots of new formats #99

Merged

Conversation

NebularNerd
Contributor

@NebularNerd NebularNerd commented Aug 25, 2024

Updates

This update is primarily for new formats; it's a bit of a mix this time around. Moving forward my update texts are changing as well: I'm using VSCode to create the blurb, aiming for a clearer layout 😎

Lots of new matches and some fixes for older entries. Some of the ASCII translations of the hex have been left out as they broke GitHub when I was pasting them in.

On the subject of matching, it's becoming clearer that, as part of the v2 rebuild plan, .zip, .xml and the Microsoft Compound File formats all need some form of unpacking/decoding to allow better matching and fewer alternative confidences. Some of these are now giving 20-30 matches.

Formats

Canon Camera RAW 2

Extensions: .cr2
Magic: Intel TIFF* then a second marker of 0x435202 (CR followed by byte 0x02) at byte 8

An update to Canon's original RAW format, this uses a beefed-up Intel* TIFF file. Like TIFF there is a lot of info we can extract if we wanted to in later v2.0 expansion ideas.

*There may possibly be Motorola encoded .cr2 files out there as well going by one source, but my 350D files are Intel flavoured so I've only added that for now.
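As a rough illustration of the two-part match described above, the check can be sketched in plain Python. This is not PureMagic's own code: the function name and sample bytes are hypothetical, and the only facts used are the little-endian TIFF header (`II*\x00`) and the `CR` + 0x02 marker at byte 8.

```python
INTEL_TIFF = b"II*\x00"  # little-endian ("Intel") TIFF header, bytes 0-3
CR2_MARKER = b"CR\x02"   # 0x435202, expected at byte 8


def looks_like_cr2(data: bytes) -> bool:
    """Return True when both the TIFF header and the CR2 marker are present."""
    return data[:4] == INTEL_TIFF and data[8:11] == CR2_MARKER


# Hypothetical sample: TIFF header, 4 filler bytes, then the CR2 marker.
sample = b"II*\x00\x10\x00\x00\x00CR\x02\x00" + b"\x00" * 16
print(looks_like_cr2(sample))
```

A plain TIFF passes the first test but fails the second, which is what lets the .cr2 entry outscore the generic TIFF entry.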

Panasonic RAW and RAW2 and LEICA RAW

Extensions: .raw .rw2 .rwl
Magic: 0x49495500

There are entries for these file extensions in the .json; however, I suspect they are either duff entries or will only match the files from which they were sourced. Files from my own Panasonic FZ1000 and various test files on the links below all start with the magic above. I have not removed the existing entries as I may be wrong about them not being valid. The LEICA cameras are basically posh Panasonics and use the same file format with a different extension; all other details are the same.

If anyone comes across Panasonic RAWs that don't match, please leave a comment so we can take a look.

Comic Book Archives

Extensions: .cb7 .cba .cbr .cbt .cbz

These are simply archives containing image files in numerical order, the extension gives away the parent formats of 7-Zip, Ace, RAR, TAR and Zip. Headers are identical to the parent formats they use.

PRC, Mobipocket and Amazon Kindle eBooks

Extensions: .prc .mobi .azw .azw1 .azw3 .azw4 .tpz .kfx .kcr

This is a weird hodgepodge of formats. Starting with the original AportisDoc document .pdc, U.S. Robotics .prc and Mobipocket SA .mobi formats, which hail from the PalmPilot era, they eventually morphed into Amazon Kindle files with only the extensions to tell them apart (there are deeper changes, but on the surface they are essentially the same). To be annoying, a lot of eBooks have the .mobi or .azw extension when they should really have something else; this will affect FILE based scores, as those use the extension as part of the scoring.

Starting with the KF8 .azw3 format, files could be MOBI or dual format MOBI/EPUB but still have the same extension; KF10/KFX .azw4 / .kfx files are a completely new format. There are even more subformats/subversions than I have added, but I need to learn more about them or get samples, or PureMagic needs new features to dig deeper into the files.

PalmDoc

Extensions: .prc
Magic: 0x5445587452454164 / TEXtREAd at byte 60

Pretty much the granddaddy of them all: the PalmDOC eBook format from the PalmPilot series of handhelds. Technically they are a subformat of Palm DOC .pdb (see below), but the header is what classes them as a PRC eBook. They will conflict match with AportisDoc document .pdc as they are one and the same filetype; U.S. Robotics used the format as the basis for the Palm operating system. Just to be awkward, a .prc may be a .pdb and vice-versa.

I've also added .prc as an extension only due to it being used to store all manner of data on Palm Pilots.

MOBI and early Kindle eBooks

Extensions: .mobi,.azw,.azw3 (MOBI)
Magic: 0x424f4f4b4d4f4249 / BOOKMOBI at byte 60 and a footer of 0xe98e0d0a at -4

The most common format in this batch: most eBooks you come across are MOBI (aka Mobi6 format) regardless of their extension. Some old MOBI files may have the extension .prc or .pdb from their PalmPilot roots.

Topaz DRM eBooks

Extensions: .azw1 and .tpz
Magic: 0x54505a / TPZ at different offsets per file

These are DRM encrypted files delivered via Whispernet or downloaded to your PC. I have a single .azw file in my Kindle library which is DRM'd (others are newer .azw4). I need more samples, but based on DeDRM we should be looking for the TPZ magic. It's not at a fixed position, so I'm adding these as extension only for now; v2 upgrades should let us test for this.

Kindle KF8 eBooks

Extensions: .azw3
Magic: 0x424f4f4b4d4f4249 / BOOKMOBI at byte 60 and a footer of 0x434f4e54424f554e44415259e98e0d0a (CONTBOUNDARY followed by 0xe98e0d0a) at -16

These are dual format MOBI/ePub eBooks that have the tag BOUNDARY at the end of the MOBI data. However, this is not at a fixed position, so searching for it would require a v2 upgrade. Handily, they also have a longer footer than regular MOBI files, so we'll use that instead. 😊
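To make the MOBI versus KF8 distinction concrete, here is a minimal sketch using only the header and footer values quoted above. The function name and return labels are hypothetical and this is not how PureMagic stores or scores its entries.

```python
from typing import Optional

MOBI_HEADER = b"BOOKMOBI"                   # expected at byte 60
MOBI_FOOTER = bytes.fromhex("e98e0d0a")     # last 4 bytes of a regular MOBI
KF8_FOOTER = b"CONTBOUNDARY" + MOBI_FOOTER  # last 16 bytes of a KF8 file


def classify_mobi(data: bytes) -> Optional[str]:
    """Return 'kf8', 'mobi', or None based on header and footer only."""
    if data[60:68] != MOBI_HEADER:
        return None
    # Test the longer footer first: a KF8 footer also ends with the MOBI footer.
    if data.endswith(KF8_FOOTER):
        return "kf8"
    if data.endswith(MOBI_FOOTER):
        return "mobi"
    return None


# Hypothetical minimal byte strings, not real eBooks.
base = b"\x00" * 60 + MOBI_HEADER + b"\x00" * 10
print(classify_mobi(base + MOBI_FOOTER), classify_mobi(base + KF8_FOOTER))
```

The ordering of the two footer tests matters, since every KF8 footer would otherwise be reported as plain MOBI.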

Amazon Print Replica eBook (aka Kindle Format 10/KF10/KFX)

Extensions: .azw4, .kfx
Magic: 0xea44524d494f4eeee00100eaee9e8183de9a86be97de95848d50726f74656374656444617461

This is the current Kindle format. All my files downloaded through Kindle for Windows still use .azw for the extension, so again FILE based scores will be affected. However, with a ridiculously long match you'll be more than certain it's this format. There is a version number, but we'd need a regex to report just the digits, as they seem to follow the pattern v1.1blahblahblah or v1.85blahblahbah. Given how many versions there could be, going with fixed strings would mean a lot of extra data in the database.
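The regex idea could look something like the sketch below. It assumes the version text has already been located in the file (where it sits is not covered here) and that it follows the `v<major>.<minor>` pattern described above; the function name and sample strings are hypothetical.

```python
import re
from typing import Optional

# Capture only the digits after the leading "v", ignoring whatever follows.
VERSION_RE = re.compile(rb"v(\d+\.\d+)")


def extract_version(chunk: bytes) -> Optional[str]:
    """Return the 'major.minor' digits from a chunk of bytes, or None."""
    match = VERSION_RE.search(chunk)
    return match.group(1).decode("ascii") if match else None


print(extract_version(b"v1.85blahblahblah"))
```

One pattern covers every version, which avoids storing a fixed string per release.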

Kindle Cloud Reader and Kindle for Mac

Extensions: .kcr

As the label suggests these are another wrapper for Kindle files. From limited info they are an .azk wrapped in DRM. I have no samples for these, so adding the extension only for now.

Kindle Preview file

Extensions: .azk

A PK zip based file format used by Kindle Previewer and older iOS Kindle apps. Again, no samples available so extension only for now.

Sundry files

These are files that you'll find with some eBooks, none are eBooks themselves but provide functionality to them.

  • .voucher appears to be the DRM key for KFX eBooks, all start 0xe00100eaee9e8183de9a86be97de95848d50726f74656374656444617461
  • .mbpV2 is a metadata file, it stores the last position and annotations. It's a basic JSON data file starting 0x7b226d6435223a22 / {"md5":"
  • .mbp is the original MOBI metadata file; like its newer brother above, it does the same job. I have no sample files so adding as an extension only for now.
  • .azw.res these are Resource Containers that hold external data files such as high-res images, part of the AZW6 specification originally aimed at Japanese manga and graphic novels; western comics adopted the same format to offer higher quality images. Header of 0x434f4e540200
  • .azw.md these are Metadata Containers, they use the same header as .azw.res.
  • .phl are Amazon Kindle Popular Highlights Files, these are an XML file that show how many people highlight certain passages etc... All start with 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d3822207374616e64616c6f6e653d22796573223f3e / <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  • .azw9.res Resource containers for Kindle on the Mac, no samples so adding as an extension for now.
  • .azw9.md Metadata containers for Kindle on the Mac, no samples so adding as an extension for now.

This is a proper Rabbit Hole job, it took way longer than I thought it would and there is still more to uncover. This pile of links covers most of what I dug out. As samples become available and new features are added to PureMagic we can do more with this bunch of formats. Some information is contradictory so I expect there will be tweaks to this lot over time.

Palm OS Database

Extensions: .pdb
The primary data format for the PalmPilot (also Handspring Visor and Sony CLIÉ) series of handheld devices. A bit like RIFF and IFF, it's a container format that wraps around many types of data. .prc, .mobi and AportisDoc document .pdc files are a form of PDB, but as they get lumped in with all the other eBook formats, I left them above. All files share the same extension with just the byte 60 header changing. There are later PDB files that use zTXT (such as Weasel), but that is another kettle of fish entirely. All PDB files use the same mimetype, with PalmOS deciding what to do once it looks at the subformat tag.

Much like the eBooks above, the extension does not mean a lot, a Palm File could easily be an application and still have a .prc extension for example.

Subformats

All these start at byte 60

  • Palm Pilot Applications: 0x6170706c / appl

  • Palm Pilot zTXT Compressed file: 0x7a545854 / zTXT

  • GrayPaint 0x444154414772503f / DATAGrP?

  • Adobe Reader 0x2e70646641444245 / .pdfADBE

  • BDicty (Dictionary Reader) 0x42566f6b42444943 / BVokBDIC

  • DB (Database program) 0x4442393944424f53 / DB99DBOS

  • eReader (aka Palm Reader) 0x504e526450507273 / PNRdPPrs

  • eReader 0x4461746150507273 / DataPPrs

  • FireViewer (ImageViewer) 0x76494d4756696577 / vIMGView

  • HanDBase 0x506d4442506d4442 / PmDBPmDB

  • InfoView 0x496e666f494e4442 / InfoINDB

  • iSilo 0x546f476f546f476f / ToGoToGo

  • iSilo 3 0x53446f6353696c58 / SDocSilX

  • JFile 0x4a6244624a426173 / JbDbJBas

  • JFile Pro 0x4a6644624a46696c / JfDbJFil

  • LIST 0x444154414c536462 / DATALSdb

  • MobileDB 0x4d6462314d646231 / Mdb1Mdb1

  • Plucker 0x44617461506c6b72 / DataPlkr

  • PQA 0x70716120636c7072 / pqa clpr

  • QuickSheet 0x4461746153707264 / DataSprd

  • SuperMemo 0x534d3031534d656d / SM01SMem

  • TealDoc 0x54455874546c4463 / TEXtTlDc

  • TealInfo 0x496e666f546c4966 / InfoTlIf

  • TealMeal 0x44617461546c4d6c / DataTlMl

  • TealPaint 0x44617461546c5074 / DataTlPt

  • ThinkDB 0x6461746154444250 / dataTDBP

  • Tides 0x5464617454696465 / TdatTide

  • TomeRaider 0x546f526154525057 / ToRaTRPW, these may also have a .tr extension

  • Weasel 0x7a54585447506c6d / zTXTGPlm

  • WordSmith 0x42444f4357726453 / BDOCWrdS
    Not an exhaustive list but like RIFF and IFF there are going to always be more.

  • Justsolve: MOBI

  • MobileRead Wiki: PDB
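Since every PDB subformat above is identified by the tag at byte 60, the list can be pictured as a lookup table. The sketch below copies a handful of the tags from the list; the dictionary layout and function name are hypothetical and do not reflect how the entries are stored in the .json.

```python
from typing import Optional

PDB_TAG_OFFSET = 60

# Eight-byte tags, copied from the subformat list above (subset only).
PDB_TAGS = {
    b"TEXtREAd": "PalmDoc",
    b"BOOKMOBI": "MOBI",
    b"DataPlkr": "Plucker",
    b"ToGoToGo": "iSilo",
    b"SDocSilX": "iSilo 3",
    b"ToRaTRPW": "TomeRaider",
    b"zTXTGPlm": "Weasel",
}

# Four-byte tags, used only when no eight-byte tag matches.
PDB_SHORT_TAGS = {
    b"appl": "Palm Pilot Application",
    b"zTXT": "Palm Pilot zTXT Compressed file",
}


def pdb_subformat(data: bytes) -> Optional[str]:
    """Look up the tag at byte 60, preferring the longer, more specific match."""
    tag = data[PDB_TAG_OFFSET:PDB_TAG_OFFSET + 8]
    if tag in PDB_TAGS:
        return PDB_TAGS[tag]
    return PDB_SHORT_TAGS.get(tag[:4])


header = b"\x00" * 60  # hypothetical filler standing in for the PDB name fields
print(pdb_subformat(header + b"zTXTGPlm"), pdb_subformat(header + b"zTXTabcd"))
```

Checking the eight-byte tags first keeps Weasel (zTXTGPlm) from being reported as a generic zTXT file.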

TomeRaider eBooks

Extensions: .tr .tr2 .tr3
Magic:
.pdb have 0x546f526154525057 / ToRaTRPW at byte 60 (as above)
.tr and .tr2 have 0x370000106d000010d2160010dcf4ddfcd1 at byte 0
.tr3 have 0x5452334454523343 / TR3DTR3C at byte 60

This came up while doing the Palm Doc entries. TomeRaider is another eBook format that started life on the PalmPilot series of devices. There are three formats, the .pdb version, then later on TR2 and TR3. TR2's and the old PDB version may both use .tr when not on a Palm device. Calibre cannot read any of these files (not that I can find a TR2 sample but I imagine it also does not work) which is a shame, maybe a new project for me to look into...

FictionBook 2 and FictionBook 3

Extensions: .fb2 .fb2.zip .fbz
Magic:
.fb2 has 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e0a3c46696374696f6e426f6f6b / <?xml version="1.0" encoding="UTF-8"?> <FictionBook
.fbz and .fb2.zip are just normal PK Zip files with an FB2 inside
.fb3 are also just normal PK Zip files with a similar structure to an ePub

Another eBook format that is popular in Russia but nearly unused anywhere else. FictionBook 2's are an XML file with everything stored within it as a monolithic block, the compressed variants are simply a zip file with a single FB2 inside. FictionBook 3's are a similar idea to ePub in that they are just a zip file with a structured layout. Yay! more PK Zip matches....
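To show the kind of unpacking that would separate a zipped FB2 from any other PK Zip file, here is a sketch using Python's standard `zipfile` module. PureMagic does not do this today as far as this description goes; the function name is hypothetical and the "single .fb2 member" rule is taken from the paragraph above.

```python
import io
import zipfile


def zip_holds_single_fb2(data: bytes) -> bool:
    """Return True if data is a zip archive containing exactly one .fb2 file."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        names = archive.namelist()
    return len(names) == 1 and names[0].lower().endswith(".fb2")


def make_zip(member_name: str) -> bytes:
    """Build a tiny in-memory zip with one member, for demonstration only."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        archive.writestr(member_name, "<FictionBook/>")
    return out.getvalue()


print(zip_holds_single_fb2(make_zip("book.fb2")))
```

Reading only the member list keeps the check cheap, since nothing inside the archive has to be decompressed.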

Windows Help Files

Extensions: .hlp .gid .cnt
Magic:
.hlp and .gid both have 0x3f5f0300 at byte 0, then 0x0000ffffffff at byte 6
.cnt has 0x3a42617365 / :Base at byte 0

This was already in the database but .hlp was split over two entries; I've condensed them into one superior match. .gid are a metadata file that stores the last window position and size (but not read position). There is not much info, but looking at the samples created when I open .hlp files, they all have the same starting layout. There are some other tags we can look for later to enhance confidence between both files.

.cnt files are plain text files containing the chapters for a Help file; they add a graphical Table of Contents (TOC) tab to the Search/Find tabs under Win95. Assuming there are no blank lines or lines without a colon (:) at the start of the file, the first line should always match the magic.
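The two Help-related matches can be sketched as follows, using only the byte values quoted above. The function names and the sample bytes are hypothetical.

```python
HLP_MAGIC = bytes.fromhex("3f5f0300")      # at byte 0
HLP_SECOND = bytes.fromhex("0000ffffffff")  # at byte 6
CNT_MAGIC = b":Base"                        # at byte 0


def looks_like_hlp_or_gid(data: bytes) -> bool:
    """Both parts must match; .hlp and .gid share this layout."""
    return data[:4] == HLP_MAGIC and data[6:12] == HLP_SECOND


def looks_like_cnt(data: bytes) -> bool:
    """True when the first line starts with the :Base tag."""
    return data.startswith(CNT_MAGIC)


# Hypothetical sample: magic, two filler bytes, then the second marker.
hlp_sample = HLP_MAGIC + b"\x10\x00" + HLP_SECOND
print(looks_like_hlp_or_gid(hlp_sample), looks_like_cnt(b":Base example.hlp\r\n"))
```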

MS Reader eBook

Extensions: .lit

Already in the .json, just added the mimetype application/x-ms-reader

Sony Broad Band eBook (aka BBeB)

Extensions: .lrf .lrs .lrx
Magic:
.lrf has 0x4c00520046000000 at byte 0
.lrx and .lrs no samples, extension only for now

A proprietary eBook format from Sony and Canon mainly aimed at the Sony Librié. .lrs are XML files that can be read as an eBook, but are aimed at being the source files for the other two extensions. .lrf is the compiled form and .lrx is compiled with DRM.

Rocket eBook

Extensions: .rb
Magic:
eBooks have 0xb00cb00c (reads as "bookbook")
System files use 0xb00cc0de (reads as "bookcode") or 0xb00cf00d (reads as "bookfood")

Another proprietary format, this is for the NuvoMedia Rocket eBook reading device, reportedly the first dedicated eBook reader released in 1997. There are possibly DRM versions of the file that may differ from these entries.

Text Compression for Reader eBook (aka Psion Series 3 eBook)

Extensions: .tcr
Magic: 0x2121382d4269742121 / !!8-Bit!! at byte 0

A text compression format I stumbled across while looking into Rocket eBooks. Quite possibly the oldest format in this PR, it harks from the days of the Psion Series 3 and 5.

Shanda Bambook eBook (aka SuperNote Book)

Extensions: .snb
Magic: 0x534e425030303042 / SNBP000B at byte 0

This is an eBook format for the amazingly named Shanda Bambook, a Chinese eBook reader. All info and test files I've got fail to work in Calibre, would be nice to have a working sample file.

Cheat Engine Trainer Data

Extensions: .CETRAINER
Magic: 0x3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d227574662d38223f3e0d0a3c43686561745461626c65 / <?xml version="1.0" encoding="utf-8"?> <CheatTable at byte 0

These are another XML format document used by CheatEngine for storing a trainer before being compiled into an executable. Longer match to help prevent false positives against other XML based files.

Quake PAK files

Extensions: .pak .bsp .mdl .lmp .dem .map .rc .spr
Magic:
.pak has 0x5041434b / PACK at byte 0, then may have 0x4944504f / IDPO, 0x52494646 / RIFF or 0x49425350 / IBSP at byte 12
.bsp has 0x1d000000 or 0x1c000000 typically at byte 0, other versions may exist
.mdl has 0x4944504f / IDPO at byte 0
.map has 0x7b0a22 / { " at byte 0 assuming no comment lines before
.spr has 0x49445350 / IDSP at byte 0
.lmp, .dem .rc have no fixed headers, extensions only

The Quake PACK format shares a header with many other file types, we have entries in the JSON already but I've added these extra markers to help boost confidences. Not all files have them but it helps those that do.
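The "extra marker" idea for .pak can be sketched as a primary match plus an optional secondary one. The function name and return shape are hypothetical; the markers and offsets are the ones listed above.

```python
from typing import Optional, Tuple

PAK_MAGIC = b"PACK"                           # at byte 0
PAK_SECONDARY = (b"IDPO", b"RIFF", b"IBSP")   # optional, at byte 12


def pak_markers(data: bytes) -> Tuple[bool, Optional[bytes]]:
    """Return (is_pak, secondary_marker); the marker is None when absent."""
    if data[:4] != PAK_MAGIC:
        return (False, None)
    marker = data[12:16]
    return (True, marker if marker in PAK_SECONDARY else None)


# Hypothetical sample: PACK, 8 filler bytes, then a model marker at byte 12.
print(pak_markers(b"PACK" + b"\x00" * 8 + b"IDPO"))
```

A file with only the PACK header still matches; the secondary marker just gives a caller something extra to raise confidence with.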

Other Quake files

  • .bsp are compiled level files

  • .mdl are 3D models used for characters, monsters, weapons etc...

  • .lmp are various image related files

  • .dem are recorded demos (or movies) of levels

  • .map are un-compiled map files that are used to make .bsp, they look a little like JSON at a glance

  • .rc are Resource files, basically a scripting language

  • .spr are sprite files

  • QuakeWiki: Quake File Formats

  • Description of .BSP Files

  • Quake map source

  • Quake Sprite format

Python Pickle

Extensions: .pickle
Magic:
Protocol 0 has 0x28 / ( at byte 0
Protocol 1 has 0x7d71 / }q at byte 0
Protocol 2 has 0x8002 at byte 0
Protocol 3 has 0x8003 at byte 0
Protocol 4 has 0x8004 at byte 0
Protocol 5 has 0x8005 at byte 0
All end 0x2e / . at -1

Pickle is a data dump format for Python, there is an existing extension only but we can remove that now. The headers are small but thanks to the footer always being a . there should be no issues. Justsolve's protocol 1 file seems to not match my generated files when using protocol=1 (looks like it's a 0), going with my files for the magic. I've left the extension in for now to allow for fringe cases or later
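The header and footer pairs above can be checked against freshly generated pickles. This sketch pickles a small dict at each protocol; the function name is hypothetical, and note that the protocol 0 and 1 headers depend on the top-level object being pickled, so other object types may start differently.

```python
import pickle
from typing import Optional

# Header bytes from the list above, mapped to protocol number.
PICKLE_HEADERS = {
    b"(": 0,
    b"}q": 1,
    b"\x80\x02": 2,
    b"\x80\x03": 3,
    b"\x80\x04": 4,
    b"\x80\x05": 5,
}


def pickle_protocol(data: bytes) -> Optional[int]:
    """Guess the pickle protocol from the header, requiring the '.' footer."""
    if not data.endswith(b"."):
        return None
    for header, protocol in PICKLE_HEADERS.items():
        if data.startswith(header):
            return protocol
    return None


# Protocol 5 needs Python 3.8 or newer.
for protocol in range(6):
    blob = pickle.dumps({"a": 1}, protocol=protocol)
    print(protocol, pickle_protocol(blob))
```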

Smacker video

Extensions: .smk
Magic: Either 0x534d4b32 / SMK2 or 0x534d4b34 / SMK4 at byte 0

A popular video file format from the mid-90s; loads of early CD games used it due to its decent compression and, for the time, fairly decent quality. There are two versions; I'm not sure what the later one added.

Bink video

Extensions: .bik .bk2 .bik2
Magic: 0x42494b / BIK at byte 0

Another popular video file format from early to mid CD era games, this replaced Smacker. There seems to be some confusion over the amount of FourCC's this format has: BINK BIKb BIKf BIKg BIKh BIKi BIKd are all considered valid. The samples I found all used BIKi but for now I have gone with just BIK and the extension .bik until more samples appear, this covers most potential files out there.

AmigaGuide

Extensions: .guide
Magic: 0x40646174616261736520616d69676167756964652e6775696465 / @database amigaguide.guide at byte 0

The AmigaGuide document was made for creating navigable help files, they work much like Windows Help files.

CRI Movie 2

Extensions: .usm
Magic: 0x43524944 / CRID at byte 0

Another proprietary video format used in various games, especially those coming from Japanese studios. It's an annoying format, as on later Windows versions the audio no longer plays back due to the weird, semi off-standard codecs they used.

Adobe flash video file

Extensions: .flv
Magic: 0x464c5601 (FLV followed by byte 0x01) at byte 0, then 04, 01 or 05 for audio, video or both at byte 4

This is a tidy-up and improvement of existing entries. There were two .flv entries but one lacked the last byte pair, so I've removed that from the JSON. I've also added small secondary matches for extra confidence boosts.
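The secondary match at byte 4 can be read as a small lookup, sketched below with the three values listed above. The function name and labels are hypothetical.

```python
from typing import Optional

FLV_MAGIC = b"FLV\x01"  # 0x464c5601 at byte 0

# Byte 4 values from the description above.
FLV_STREAMS = {0x04: "audio", 0x01: "video", 0x05: "audio and video"}


def flv_streams(data: bytes) -> Optional[str]:
    """Return the stream description for an FLV header, or None if not FLV."""
    if len(data) < 5 or data[:4] != FLV_MAGIC:
        return None
    return FLV_STREAMS.get(data[4], "unknown")


print(flv_streams(b"FLV\x01\x05\x00\x00\x00\x09"))
```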

Microsoft Works files

Extensions: .wdb .wks .xlr .wps
Magic:
Early .wdb versions have 0x20540200000005540200 at byte 0
Later .wdb .wps and .xlr versions have 0xd0cf11e0a1b11ae1 at byte 0
Early .wps have 0x01fe at byte 0

Microsoft Works was a cut-down budget office suite that offered everything you needed in one package. It saved documents in a semi-proprietary format that MS did support in Office but later dropped. .wdb were the Works equivalent to Access, .wks/.xlr were spreadsheets, and .wps was a text file.

There are some differing versions: early formats were just for Works, while later ones were still Works specific but used the Microsoft Compound File format. Identifying those may be trickier as we need to decode the CLSID identifier from the file. In fact the Compound File format is the basis for many, many formats, much like RIFF or IFF, so expect conflicting matches. Definitely a candidate for v2 identification upgrades.

JPEG XR, Windows Media Photo and Microsoft HD Photo File Format

Extensions: .jxr .wdp .hdp
Magic:
All files should have 0x4949bc01 (II followed by 0xbc01) at byte 0
.jxr also has 0x574d50484f544f00 / WMPHOTO at byte 90
.hdp I cannot find any samples, extension only for now.

Another member of the JPEG family, derived from the Windows Media Photo and Microsoft HD Photo formats; it's part MS, part JPEG, part butchered TIFF. The format is a mess.

JPEG-LS

Extensions: .jls
Magic: 0xffd8fff7 / ÿØÿ÷ at byte 0

Another JPEG format that is also not quite a format: it's a subset of regular JPEG and also has roots in HP's own lossless codec (which apparently is in one of the old Mars rovers). JustSolve magic suggestions match the output from the HP Reference encoder linked there, and from the CharLS WebAssembly demo linked below. XnView would not view them despite claiming support, but the online demo could successfully read images converted with the HP encoder. I've gone with the longer magic based on the test files, which should allow it to win confidence over regular .jpg.

Amiga Floppy Disk Images

Extensions: .adf
Magic: 0x444f53 / DOS at byte 0 then 0x01 to 0x07 for various Amiga filesystems at byte 3

We have entries in the json but they can be improved. One was too specific and the other lacked mimetypes, so let's fix that and make some enhancements. I've left a basic match for fringe cases and multi-parted the variations.
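The "basic match plus multi-part variations" approach can be pictured like this. The function name and return shape are hypothetical; the only values used are the DOS header and the 0x01 to 0x07 range at byte 3 given above.

```python
from typing import Optional, Tuple

ADF_MAGIC = b"DOS"                 # at byte 0
ADF_VARIANTS = range(0x01, 0x08)   # filesystem byte at byte 3, 0x01 to 0x07


def match_adf(data: bytes) -> Tuple[bool, Optional[int]]:
    """Return (basic_match, filesystem_byte); the byte is None outside 0x01-0x07."""
    if data[:3] != ADF_MAGIC:
        return (False, None)
    if len(data) > 3 and data[3] in ADF_VARIANTS:
        return (True, data[3])
    return (True, None)


# Hypothetical four-byte sample with filesystem byte 0x01.
print(match_adf(b"DOS\x01"))
```

The basic match still catches fringe files whose fourth byte falls outside the listed range, while the variant byte allows a more specific, higher-confidence entry.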

Amiga Harddisk Images

Extensions: .hdf
Magic:
0x5244534b / RDSK at byte 0 for Amiga Filesystems
0x504653 / PFS for Professional Filesystem 3 (PFS3)
0x504453 / PDS for Professional Filesystem 3 (PDS3)
0x534653 / SFS for Smart File System

I thought I had added these before but evidently not, like their floppy brethren these have many formats.

Fixes

There are also some small changes to various entries: fixing spelling errors, unifying names or adding mimetypes.

@NebularNerd NebularNerd changed the base branch from master to develop August 25, 2024 16:44
because more is always better
@cdgriffith
Owner

Thank you for all these updates! Only request is to replace tabs with spaces to keep current formatting the same.

@NebularNerd
Contributor Author

Thank you for all these updates! Only request is to replace tabs with spaces to keep current formatting the same.

I've done a search and replace to remove all tabs and converted them to spaces, looks a lot more uniform now. Force of habit to hit the Tab key.

@cdgriffith cdgriffith merged commit c1d49eb into cdgriffith:develop Aug 31, 2024
9 checks passed
@cdgriffith cdgriffith mentioned this pull request Sep 25, 2024
cdgriffith added a commit that referenced this pull request Sep 25, 2024
- Adding #99 New file support (thanks to Andy - NebularNerd)
- Fixing #100 FITS files no longer had mime type (thanks to ejeschke)
---------

Co-authored-by: Andy <[email protected]>