Improve canonicalization performance #3119
best = [refined_coloring]
color_score = tuple(c.key() for c in refined_coloring)

if best_score is None or best_score < color_score:
The right-hand side of the `or` is never evaluated, because `best_score is None` is always true here. Can you please review this?
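For readers unfamiliar with the short-circuit behaviour this comment relies on, here is a minimal sketch in plain Python. The helper `is_new_best` and the sample tuples are hypothetical and only mirror the shape of the condition under review; they are not taken from rdflib.

```python
def is_new_best(best_score, color_score):
    # Same shape as the reviewed condition: when best_score is None,
    # Python returns True without evaluating the comparison on the right.
    return best_score is None or best_score < color_score


color_score = (1, 2, 3)  # hypothetical score tuple

# With best_score = None the right-hand side is skipped entirely.
first = is_new_best(None, color_score)

# Evaluated on its own, the right-hand side would fail for None,
# which shows it was never reached in the call above.
try:
    None < color_score
    rhs_raised = False
except TypeError:
    rhs_raised = True

# Only once best_score holds a tuple does the comparison actually run.
later = is_new_best((1, 2, 4), color_score)
```

If `best_score` really is `None` every time this line is reached, the condition is equivalent to an unconditional `True`, which is the point the reviewer asks the author to check.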
Agreed that there are large performance issues using A few changes to

Hi @ddeschepper, I will put the use of the canon algorithm behind a flag when calling the longturtle serializer and set it to false by default. This should address the performance issues for those who don't need deterministic outputs.

Hi @ddeschepper, please can you review this PR #3197 and see if it addresses your concerns with performance? By default, canonicalization is no longer applied when using the
We're noticing big performance issues when using longturtle serialization on some graphs. I've been able to narrow this down to the performance of canonicalization, which is also tracked in issue #2528.
Looking into it I found that the current implementation of the `_traces` method of `_TripleCanonicalizer` causes much of the performance impact.

This PR reduces the complexity of `_traces`, which leads to a performance gain of at least an order of magnitude in our worst cases (100s -> 4s). All rdflib tests still pass, and additionally, I've tested these changes with our set of a few hundred examples that are longturtle serialized, which causes no changes in the serialization output.

The author of the linked issue has created a performance test that, with the current code, gives the following results on my machine:
where my new version results in: