Skip binaries files on filesystem scan #201

baruchiro · 2024-02-11T11:21:46Z

Steps to reproduce:

1. Build 2ms with go build -o 2ms main.go
2. Run a filesystem scan with ./2ms filesystem --path . --log-level debug
3. One of the scanned files is the ./2ms executable itself.
4. After ~4 minutes (it is a long time!) you will receive a lot of results from the binary.

There are two problems here:

1. The scan takes a very long time.
2. There are a lot of false positives, because the binary content produces byte sequences that look like secrets.

nargov · 2024-02-22T15:17:35Z

Hi,

I was thinking of tackling this one using this library.
While the http package has a MIME type sniffing function, this library also exposes the hierarchy of MIME types, so it can tell binary from text directly.

What do you think?

baruchiro · 2024-02-28T07:32:29Z

> I was thinking of tackling this one using this library. While the http package has a MIME type sniffing function, this library also exposes the hierarchy of MIME types, so it can tell binary from text directly.

@nargov, from their documentation:

> Only use libraries like mimetype as a last resort. Content type detection using magic numbers is slow, inaccurate, and non-standard

I don't want to hurt our performance, and at the very least this library would make us read each file twice.

I'm looking for a way to reduce the time spent scanning binaries, without a big performance cost on one hand and without doing magic behind the user's back on the other.
For example, the last time we saw this kind of problem, we added the max-target-megabytes flag to skip large files.
Here, the only thing I can think of is to somehow measure how long the task for a specific file takes, and warn in the log about a potential performance issue.

What do you think?

By the way, I'm sorry for the late response, I was sick. I appreciate your help!

nargov · 2024-02-28T08:24:53Z

As an alternative, I see https://pkg.go.dev/net/http#DetectContentType reads at most 512 bytes to detect the MIME type. Think it's good enough?

baruchiro · 2024-02-28T09:36:55Z

OK, I think we can create a POC for that. Here is what I'm thinking:

- We should avoid reading the file twice! We need to reuse the []byte.
- We need to decide which MIME types are ignored.
- We need to be sure the MIME type identification does not lead to unexpected results (files being skipped unexpectedly).
- Do we want to allow controlling which MIME types will be skipped?
- We need to test how it affects the performance.
- Can we check if and how KICS handled this situation?

You don't have to answer all the questions before you start developing.

baruchiro · 2024-03-12T09:36:00Z

Another option would be to ignore lines that are too long. On one hand, such lines might come from a binary file. On the other hand, they could come from a minified JS file.

baruchiro added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Feb 11, 2024

baruchiro mentioned this issue on Feb 12, 2024 in: Performance issues when scanning a big repository #185 (Open)
