Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update added support for big excel files #191

Merged
merged 2 commits into from
Nov 13, 2021

Conversation

ThomasObenaus
Copy link
Contributor

Adds support for big excel files

By adding the possibility to configure the amount of local headers checked

There the problem can be (depending on the library generating the file) that the headers are more than 10 and the relevant ones for identifying it as an excel file are at place 11 or above.

Reason for the Change

We have some bigger excel documents (which I can't share due to compliance restrictions) that aren't detected correctly.
After analyzing the problem it turned out that it is not enough for those kind of documents to check only the first 10 local headers.

How we use it with that change

// increase the read limit (default is 3072) and the amount of MSO file headers checked to support big excel files as well
mimetype.SetLimit(20480)
mimetype.SetMaxMSOFileHeaders(100)
detectedMimeType := mimetype.Detect(data)

… configure the amount of local headers checked

There the problem can be (depending on the library generating the file) that the headers are more than 10 and the relevant ones for identifying it as an excel file are at place 11 or above.
@gabriel-vasile
Copy link
Owner

gabriel-vasile commented Oct 11, 2021

I'm concerned not many users of the library are familiar MSO file headers and will / can be bothered to learn about it.

I think it would be great to fix this issue without adding the SetMaxMSOFileHeaders function. I see 2 steps for this:

First, we can remove the 10 headers limit and search in all available headers. I added this limit, but thinking back on it I'm not sure about it. There already is a limit for the max number of bytes read from input which also limits the number of headers.
Second, we can add more known excel headers. If you'd like, you can add them yourself, or tell me the library used to generate the Excels and I'll have a look.

@ThomasObenaus Do you think this approach will work for you?

@ThomasObenaus
Copy link
Contributor Author

@gabriel-vasile I adjusted my PR according to your request.

@ThomasObenaus
Copy link
Contributor Author

ThomasObenaus commented Oct 20, 2021

Sadly I can't tell you how the file was created since it was uploaded by a customer.
And as I said I can't provide the file itself either.

But I can provide (at least) the file structure of that xlsx document:

image

I assume there where just too much files in the customXML folder. So the search never found any of the xlsxSigFiles within the first 10 files.

@gabriel-vasile gabriel-vasile merged commit 04f0fdf into gabriel-vasile:master Nov 13, 2021
@gabriel-vasile
Copy link
Owner

Thank you for your contribution @ThomasObenaus!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants