Support reporting of fuzzing-time stats (by fuzzer.py) #648
Conversation
I fixed some of the more important issues; others (like reusing the stats implementation in other AFL-based fuzzers) can be done later.
So we won't have stats for the libFuzzer and honggfuzz fuzzers? Starting somewhere is fine, but we need to add those as well over time.
Not in this patch. But they can definitely be added. This API is meant to be supportable by most fuzzers and is meant to be optional as well, so we don't increase onboarding complexity.
Sounds perfect; we definitely want to add this incrementally, and not everything needs to integrate with this.
I think this is ready to land, I'm just going to test it with a local experiment first.
```diff
@@ -62,3 +68,4 @@ class Snapshot(Base):
     trial_id = Column(Integer, ForeignKey('trial.id'), primary_key=True)
     trial = sqlalchemy.orm.relationship('Trial', back_populates='snapshots')
     edges_covered = Column(Integer, nullable=False)
+    fuzzer_stats = Column(JSON, nullable=True)
```
Are we planning to keep this? If yes, what is the benefit of having a JSON here?
IMO having JSONs here will be a large source of complexity (e.g. in queries) and bugs.
Consider having a `number_of_executions` column (Integer, nullable) instead for this PR, which would be more normal form than `{'execs_per_sec': float}` JSONs. Later, other metrics should have their own column too (like CPU/mem usage at `time`).
Related:
https://www.thomascerqueus.fr/json-fields-relational-databases-postgresql/
https://stackoverflow.com/questions/15367696/storing-json-in-database-vs-having-a-new-column-for-each-key
https://softwareengineering.stackexchange.com/questions/215429/is-it-wise-to-store-a-big-lump-of-json-on-a-database-row
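For concreteness, a minimal sketch of what this normalized alternative could look like, assuming the SQLAlchemy declarative model from the diff above (the extra column names are illustrative, not from the PR):

```python
# Hypothetical sketch: one typed, nullable column per metric on
# Snapshot, instead of a single JSON blob.
from sqlalchemy import Column, ForeignKey, Integer, Float
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Snapshot(Base):
    __tablename__ = 'snapshot'
    trial_id = Column(Integer, ForeignKey('trial.id'), primary_key=True)
    edges_covered = Column(Integer, nullable=False)
    # Suggested replacement for the fuzzer_stats JSON column:
    number_of_executions = Column(Integer, nullable=True)
    # Later, other metrics would get their own typed columns too, e.g.:
    cpu_usage = Column(Float, nullable=True)
```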
I think it depends on whether we will have custom stats and whether we want to allow fuzzers to use those. If we go with Laszlo's approach, I don't think we should add all these columns to the Snapshot table; maybe some other table type. Should we have a FuzzerStats table object here, so that we can add columns to it over time?
> Are we planning to keep this?

I think so.

> If yes, what is the benefit of having a JSON here?

It offers a lot more flexibility. It means that stats can be added and removed without changing the SQL schema.

> IMO having JSONs here will be a large source of complexity (e.g. in queries) and bugs.

I suspect this won't actually be the case, since we tend to do most analysis on dataframes; once we clean things up so that JSON fields work the same as other fields in dataframes, we won't have a problem.

In https://stackoverflow.com/questions/15367696/storing-json-in-database-vs-having-a-new-column-for-each-key they describe a use case for JSON fields that I think is applicable:

> Saving arbitrary key/value user preferences (where the value can be boolean, textual, or numeric, and you don't want to have separate columns for different data types)

I think I want to do something similar in the next iteration. There I envision allowing fuzzers to store arbitrary key/values as stats.

Personally, I don't think I will ever want to use these stats in an SQL query. For me, using pandas is easier, and this is what most users will do as well. Thus I don't think we lose anything from using a JSON field.
We do gain flexibility, though, so on balance I think there is a stronger case for JSON fields.
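For example (a hypothetical sketch with made-up data; the column and key names just mirror this PR), flattening a JSON stats column into ordinary dataframe columns is short with pandas:

```python
# Flatten a JSON fuzzer_stats column so its keys behave like any other
# dataframe column (illustrative data, not from a real experiment).
import json

import pandas as pd

snapshots = pd.DataFrame({
    'trial_id': [1, 2],
    'edges_covered': [1200, 1340],
    'fuzzer_stats': ['{"execs_per_sec": 1543.2}', '{"execs_per_sec": 987.5}'],
})

# Parse the JSON strings, then expand each key into its own column.
stats = pd.json_normalize(snapshots['fuzzer_stats'].map(json.loads).tolist())
df = snapshots.drop(columns='fuzzer_stats').join(stats)
# df now has columns: trial_id, edges_covered, execs_per_sec.
```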
I think in person you preferred using the number of execs rather than avg execs, since the latter can be derived from the former (which conveys more info). I think it's probably better to store stats in the format in which they will be consumed. So assuming avg execs is more useful to users than total execs (and maybe I'm wrong), it's better to store the average rather than store the total and compute the average before graphing. What do you think?
> It means that stats can be added and removed without changing the SQL schema.

That is what I think we should avoid. Would we allow arbitrary JSON, or enforce a schema? If the former, how are we going to use that? The PR suggests the latter, as it implements manual schema checking in `common/fuzzer_stats.py` (`SCHEMA = {'execs_per_sec': float}`). In this case what's not clear is: why would we implement this manually, adding all this code and complexity, when this is what the DB schema gives us out of the box?
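For reference, such manual checking presumably looks something like this (a hypothetical sketch; the PR's actual `common/fuzzer_stats.py` may differ):

```python
# Hypothetical sketch of manual schema checking for fuzzer stats JSON.
import json

SCHEMA = {'execs_per_sec': float}


def validate_fuzzer_stats(stats_json):
    """Parses |stats_json| and checks every key and type against SCHEMA."""
    stats = json.loads(stats_json)
    for key, value in stats.items():
        if key not in SCHEMA:
            raise ValueError('Unknown stat: %s.' % key)
        if not isinstance(value, SCHEMA[key]):
            raise ValueError('Stat %s is not a %s.' % (key, SCHEMA[key]))
    return stats
```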
> > IMO having JSONs here will be a large source of complexity (e.g. in queries) and bugs.
>
> I suspect this won't actually be the case, since we tend to do most analysis on dataframes; once we clean things up so that JSON fields work the same as other fields in dataframes, we won't have a problem.

Yes, this is what I was referring to: using the DB schema, we get the dataframe automatically, with no additional code needed. With JSON, we need to write custom code that validates and converts arbitrary schemaless JSONs to dataframe columns manually, handling all possible errors and so on, which I think will be a massive source of complexity and bugs.
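In other words, with typed columns the dataframe comes for free (a hypothetical sketch; the database URL and table/column names are assumed):

```python
# With per-metric columns, pandas reads the table directly; no JSON
# parsing or custom validation code is needed.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///experiment.db')  # assumed DB
df = pd.read_sql(
    'SELECT trial_id, edges_covered, number_of_executions FROM snapshot',
    engine)
```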
> I think I want to do something similar in the next iteration. There I envision allowing fuzzers to store arbitrary key/values as stats.

What do you mean by arbitrary? Does arbitrary mean that it's relevant only to a particular fuzzer? If yes, how would we use that? Or is it just for the developer of a particular engine? If it's a fuzzer-engine-specific stat, only for the developer of that engine to be used by their private scripts and not by fuzzbench code, then JSON indeed sounds good.

However, for common stats (that are relevant to all engines) we should carefully choose the ones that we want to allow engines to provide (and have a schema for them) to ensure uniformity and normalized data, so we can compare engines along those metrics (e.g. in the reports). To me, these would be basically:
- number of execs,
- cpu usage,
- mem usage

(Is there anything else?)

> arbitrary key/value as stats.

Why key/value? What key would you use other than `time` or `number_of_execs` as a key?
"specific" in the column name is weird; I think `fuzzer_stats` would already imply fuzzer-specific stats. And CPU and memory columns will probably come later, as we add capturing hooks? For now, I think just JSON is fine, since we are adding fuzzer stats.
> "specific" in the column name is weird; I think `fuzzer_stats` would already imply fuzzer-specific stats. And CPU and memory columns will probably come later, as we add capturing hooks?

sgtm. We might still consider some alternative name that differentiates a bit better between "common" and "specific" stats; e.g. `edges_covered` is also a "fuzzer statistic", but it's a "common" one, as opposed to, say, the AFL-specific stability. Maybe something like `extra_stats` or `custom_stats` or `user_defined_stats` would capture this a bit better.

> For now, I think just JSON is fine, since we are adding fuzzer stats.

Which stats are we adding first?
I mean we will have JSON stats for now, since more common stats like cpu and mem are not hooked and will require custom capturing code. `num_execs` can come as a common one, but that would need code across all fuzzers; I think this PR is just doing AFL as a start. Once we add it to all fuzzers, they can add code to set this `num_execs` column.
Yeah, I'm of two minds about this question, since you (@lszekeres) raised valid points. But I think it would be simpler to handle all of the stats which use custom capturing in the same way (JSON).

I also realized there is a pretty good reason we don't want to rely on the schema for type validation: we won't save stats to the db in the runner, we will do this in the measurer. So I don't think we can wait until saving to the db for validation, because that happens so far from the error.
Do we agree to separate common metrics from fuzzer-specific ones?

If yes, then re. type/schema validation, let's consider the two cases (a sketch of both hooks follows this list):

1. Fuzzer-specific metrics wouldn't have a schema; users would be allowed to provide arbitrary JSONs, right? I.e., if `json.dumps()` succeeds, it's valid and good to go. (We could also statically pytype-check that the `fuzzer.py`'s `get_custom_stats()` returns a `dict`.)
2. For common metrics, let's look at them case by case:
   - Stuff like `cpu_usage` and `mem_usage` wouldn't be provided by the `fuzzer.py`, but by fuzzbench itself. No special type/schema validation is necessary.
   - For `num_execs` we need to rely on `fuzzer.py`. I'd collect this data by letting fuzzers define a `def get_num_execs() -> int` function. Early type/schema validation is done by pytype (`-> int`) at linting time.
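A minimal sketch of the two proposed `fuzzer.py` hooks (hypothetical; neither function exists in fuzzbench, and the example values are made up):

```python
# Hypothetical fuzzer.py hooks illustrating the two-case proposal above.


def get_custom_stats() -> dict:
    """Returns engine-specific stats as an arbitrary JSON-serializable
    dict; pytype only checks that this returns a dict."""
    return {'stability': 98.7}  # E.g. an AFL-specific metric.


def get_num_execs() -> int:
    """Returns the total number of executions; pytype validates the
    '-> int' annotation at linting time."""
    return 1234567
```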
Local experiment worked fine.
```diff
@@ -76,3 +79,14 @@ def test_fuzz_function_errors():
     # so fail if we see one. If that is not the case than this assert
     # should be removed.
     assert not isinstance(error.value, TypeError)
+
+
+def test_afl_get_stats(tmp_path):
```
What do we think about putting the test for each fuzzer's `get_stats` here? The more intuitive alternative is to create `fuzzers/afl/test_fuzzer.py` and put it there.
But I don't really love the idea of polluting fuzzer directories with test files.
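For what it's worth, such a test might look roughly like this (a hypothetical sketch; it assumes AFL's `fuzzer.py` exposes a `get_stats(output_corpus, fuzzer_log)` that parses AFL's `fuzzer_stats` file and returns a JSON string, as sketched further below):

```python
# Hypothetical sketch of a get_stats test using pytest's tmp_path fixture.
import json

from fuzzers.afl import fuzzer


def test_afl_get_stats(tmp_path):
    """get_stats() reports execs_per_sec from AFL's fuzzer_stats file."""
    (tmp_path / 'fuzzer_stats').write_text('execs_per_sec    : 1543.20\n')
    stats_json = fuzzer.get_stats(str(tmp_path), 'fuzzer-log.txt')
    assert json.loads(stats_json)['execs_per_sec'] == 1543.20
```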
Currently, the only stats FuzzBench reports per cycle/snapshot are those recorded by the measurer. This means we cannot report some useful fuzzing-time stats, such as executions/sec, stability, CPU/RAM, or I/O usage.
This PR adds an optional API for `fuzzer.py` to report extra stats in a uniform way that allows fuzzers to be compared on these stats.
For now it only supports executions per second, a float that we call `avg_execs`.
We can use the same pipeline in `runner.py` to report fuzz-time stats that are fuzzer independent, such as RAM, disk, IO, and CPU.
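To give a flavor of the API (a hypothetical sketch, not the exact code from this PR), an AFL `fuzzer.py` might implement it along these lines, parsing AFL's `fuzzer_stats` file:

```python
# Hypothetical sketch: report executions per second by parsing AFL's
# fuzzer_stats file, which contains 'key : value' lines such as
# 'execs_per_sec    : 1543.20'.
import json
import os


def get_stats(output_corpus, fuzzer_log):
    """Returns fuzzing-time stats for AFL as a JSON string."""
    afl_stats = {}
    with open(os.path.join(output_corpus, 'fuzzer_stats')) as stats_file:
        for line in stats_file:
            key, _, value = line.partition(':')
            afl_stats[key.strip()] = value.strip()
    return json.dumps({'execs_per_sec': float(afl_stats['execs_per_sec'])})
```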
TODO before this is ready for review:
We also need to figure out how to report this. I think we could maybe link to another page where we generate stats in aggregate and for each benchmark. The exact way we report it can wait, though.