Deserializing JSON to an AnalyzeResult is 20x slower in azure-ai-documentintelligence than in azure-ai-formrecognizer #36765
Thanks for the feedback, we'll investigate asap.
As a baseline for "reasonably performant while type-safe", I created a Pydantic model [1] with the same structure as the DocumentIntelligence AnalyzeResult. Deserialisation from JSON this time was 25x faster. Here is the timing comparison code:

```python
import pathlib

import orjson
import azure.ai.documentintelligence.models

# Timer is a simple timing context manager; PydanticAnalyzeResult is the
# hand-built Pydantic model [1].
obj = orjson.loads(pathlib.Path(<path>).read_bytes())

with Timer('Pydantic version of AnalyzeResult') as t1:
    result1 = PydanticAnalyzeResult(**obj)
# ['Pydantic version of AnalyzeResult' took 513.7 ms]

with Timer('DocumentIntelligence AnalyzeResult') as t2:
    result2 = azure.ai.documentintelligence.models.AnalyzeResult(obj)
# ['DocumentIntelligence AnalyzeResult' took 0m12.93s]

print(f"Ratio: {round(t2.elapsed / t1.elapsed, 1)}x")
# Ratio: 25.2x

assert result1.pages[0].lines[0].content == result2.pages[0].lines[0].content
```

[1] This Pydantic model was created directly from the DocIntelligence JSON using
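(The `Timer` helper above isn't part of either SDK. A minimal sketch of such a context manager, assuming it only needs to expose an `elapsed` attribute in seconds and print the duration, could be:

```python
import time

class Timer:
    """Minimal timing context manager (an assumed helper, not from either SDK)."""

    def __init__(self, label: str):
        self.label = label
        self.elapsed = 0.0  # seconds, set on exit

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.elapsed = time.perf_counter() - self._start
        print(f"['{self.label}' took {self.elapsed * 1000:.1f} ms]")
```

The original evidently also formats longer durations as minutes; that's omitted here for brevity.)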
If I may add one further comment: could you please test with the features
Thank you so much for this report @stevesimmons - this information is very helpful! We've already merged one perf improvement to the generated models - so if we regenerate this library we should already get that one for free. I think there will still be a fair amount of work to get the performance here where we want it to be. @swathipil and I will keep digging!
Here is the comparison of first vs second runs of
Unless this slowdown is due to a simple caching step that's not working, maybe it's worth considering switching from the dynamic metaclass approach to a straightforward Pydantic model. You know the schemas for all the sub-objects in AnalyzeResult. Plus, Pydantic supports validator functions that can transparently upgrade old OCR results from the formrecognizer schema to documentintelligence's (e.g. reformatting polygons from
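To illustrate the validator idea, here is a minimal sketch (Pydantic v2 assumed; the `Line` model and its field subset are invented for illustration, and it assumes the older formrecognizer shape was a list of `{"x": ..., "y": ...}` points while documentintelligence uses flat coordinate lists):

```python
from typing import List, Optional

from pydantic import BaseModel, field_validator

class Line(BaseModel):
    content: str
    # documentintelligence-style polygon: flat [x0, y0, x1, y1, ...] list
    polygon: Optional[List[float]] = None

    @field_validator("polygon", mode="before")
    @classmethod
    def upgrade_polygon(cls, value):
        # Transparently accept the older formrecognizer shape, a list of
        # {"x": ..., "y": ...} point dicts, and flatten it to coordinates.
        if value and isinstance(value[0], dict):
            return [c for pt in value for c in (pt["x"], pt["y"])]
        return value

# Both of these yield polygon == [0.0, 0.0, 1.0, 0.0]:
Line(content="hi", polygon=[0, 0, 1, 0])
Line(content="hi", polygon=[{"x": 0, "y": 0}, {"x": 1, "y": 0}])
```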
Great - thank you for this additional data and all your investigations - we will investigate!
+1 to this, especially Pydantic compatibility! Please keep us posted - thanks!
@annatisch It takes
Great, thanks @YalinLi0312 - good to see a little bit of progress.
@annatisch @swathipil I tested on the same file with the regenerated package from the latest codegen (PR link); it takes
Hi @stevesimmons, our new release with perf updates is ready: https://pypi.org/project/azure-ai-documentintelligence/1.0.0b4/ Thanks
Hi @stevesimmons. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.
Hi @stevesimmons - Have you had a chance to test the new release, and are you seeing an improvement in perf? Additionally, would you be able to provide a sample of the OCR JSON output for both documentintelligence and formrecognizer? We have a sample file that we're running off of, but wondering if you might have a specific file that highlights any additional differences.

Update from our side: we are continuing to investigate and will keep you updated on any findings. Thanks!
Hi @stevesimmons. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario, please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
I've tested the new 1.0.0b4 version against the 32 files I am working on today.
Hi,

I noticed an unexpected and quite serious performance degradation while migrating from `azure-ai-formrecognizer` (version 3.3.3) to `azure-ai-documentintelligence` (version 1.0.0b3), both the latest versions. I am using Python 3.11.4.

Reloading a saved JSON OCR result is 21x slower in the new beta version of `azure-ai-documentintelligence` compared with `azure-ai-formrecognizer`. For me, this is slow enough to make the new code unusable, which also blocks use of the new `2024-02-29-preview` models.

Here are test results for a 50-page document, whose serialised OCR JSON output is 12MB in size:

I'd expect the two load times to be similar.

I used the Python profiler to get an idea where all the time is spent. It looks like `_model_base.Model` is doing a ton of work repeatedly figuring out dynamically generated base classes at the start of `_model_base.__new__()` for every single object instance in the OCR results. Can some of this be cached? (A sketch of the kind of per-class caching I mean follows below.)

Here is profiling output (slower than above because of the profiling hooks):
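Returning to the caching question above: a minimal sketch of the general per-class memoisation technique (the `_resolve_bases` helper is hypothetical, standing in for whatever base-class computation `_model_base.__new__()` performs; this is not the SDK's actual code):

```python
import functools

@functools.lru_cache(maxsize=None)
def _resolve_bases(cls):
    # Hypothetical stand-in for the per-instance base-class discovery done in
    # _model_base.Model.__new__(); lru_cache makes it run once per class
    # instead of once per object instance.
    return tuple(b for b in cls.__mro__ if b is not object)

class CachedModel:
    def __new__(cls, *args, **kwargs):
        _resolve_bases(cls)  # cheap after the first call for each class
        return super().__new__(cls)
```

With millions of sub-objects in a large OCR result, moving this work from per-instance to per-class is exactly the kind of saving that should close most of the gap.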