Deserializing JSON to an AnalyzeResult is 20x slower in azure-ai-documentintelligence than in azure-ai-formrecognizer #36765

Open
stevesimmons opened this issue Aug 5, 2024 · 15 comments
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Document Intelligence needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@stevesimmons commented Aug 5, 2024

Hi,

I noticed an unexpected and quite serious performance degradation while migrating from azure-ai-formrecognizer (version 3.3.3) to azure-ai-documentintelligence (version 1.0.0b3), both the latest versions. I am using Python 3.11.4.

Reloading a saved JSON OCR result is 21x slower in the new beta version of azure-ai-documentintelligence compared with azure-ai-formrecognizer. For me, this is slow enough to make the new code unusable, which also blocks use of the new 2024-02-29-preview models.

Here are test results for a 50 page document, whose serialised OCR JSON output is 12MB in size:

import json
import pathlib

import azure.ai.documentintelligence.models
import azure.ai.formrecognizer

# Load a saved 12MB OCR result for a 50-page document
obj = json.loads(pathlib.Path("src.ocr.json").read_bytes())

# Form Recognizer 3.3.3 takes 1.66 secs to reload AnalyzeResult
result_fr = azure.ai.formrecognizer.AnalyzeResult.from_dict(obj)

# Document Intelligence 1.0.0b3 takes 35.6 secs (21.5x longer)
result_di = azure.ai.documentintelligence.models.AnalyzeResult(obj)

I'd expect the two load times to be similar.

I used the Python profiler to get an idea where all the time is spent. It looks like the _model_base.Model is doing a ton of work repeatedly figuring out dynamically generated base classes at the start of _model_base.__new__() for every single object instance in the OCR results. Can some of this be cached?
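For reference, output like the table below can be produced with the standard-library profiler. This is a hedged sketch: `load_result` is a placeholder stand-in for the actual `AnalyzeResult(obj)` call, not the SDK code itself.

```python
# Sketch of producing a profile like the one below with cProfile/pstats.
# load_result() is a placeholder for the real AnalyzeResult(obj) call.
import cProfile
import io
import pstats

def load_result():
    # placeholder workload standing in for model deserialization
    return [{"k": i} for i in range(10_000)]

profiler = cProfile.Profile()
profiler.enable()
result = load_result()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(20)
report = stream.getvalue()
print(report)
```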

Here is profiling output (slower than above because of the profiling hooks):

          112692731 function calls (99259951 primitive calls) in 138.527 seconds

   Ordered by: internal time

   ncalls       tottime  percall  cumtime  percall filename:lineno(function)
8456260/166821   26.026    0.000   87.971    0.001 copy.py:128(deepcopy)
        532080   23.577    0.000   27.177    0.000 _model_base.py:518(<dictcomp>)
1625780/166821    8.324    0.000   86.489    0.001 copy.py:227(_deepcopy_dict)
        532080    7.293    0.000   44.040    0.000 _model_base.py:515(__new__)
1208569/409967    6.768    0.000   13.617    0.000 _model_base.py:434(_serialize)
        532080    5.109    0.000    7.003    0.000 _model_base.py:521(<dictcomp>)
      18289276    4.771    0.000    4.771    0.000 {method 'get' of 'dict' objects}
      17230602    4.501    0.000    4.501    0.000 {built-in method builtins.hasattr}
       1011497    3.782    0.000    4.337    0.000 _model_base.py:597(_get_deserialize_callable_from_annotation)
      10796013    3.349    0.000    3.349    0.000 {built-in method builtins.isinstance}
 365259/166820    3.285    0.000   70.740    0.000 copy.py:259(_reconstruct)
       2303084    3.225    0.000    4.282    0.000 copy.py:243(_keep_alive)
      13229249    3.152    0.000    3.152    0.000 {built-in method builtins.id}
      537948/6    2.753    0.000  103.032   17.172 _model_base.py:749(_deserialize_with_callable)
        860113    1.995    0.000    2.234    0.000 _model_base.py:106(_serialize_datetime)
  312045/98139    1.795    0.000   77.118    0.001 copy.py:201(_deepcopy_list)
        473301    1.677    0.000    4.015    0.000 _model_base.py:463(_get_rest_field)
      166821/1    1.618    0.000  138.212  138.212 _model_base.py:485(__init__)
        365259    1.582    0.000    1.582    0.000 {method '__reduce_ex__' of 'object' objects}
        371128    1.489    0.000    2.242    0.000 typing.py:1611(__getitem__)
       6847519    1.475    0.000    1.475    0.000 {method 'items' of 'dict' objects}
        860113    1.317    0.000    1.317    0.000 _model_base.py:42(_timedelta_as_isostr)
        896701    1.314    0.000    1.785    0.000 _model_base.py:467(<genexpr>)
       6153176    1.276    0.000    1.276    0.000 copy.py:182(_deepcopy_atomic)
      166821/1    1.200    0.000  103.033  103.033 _model_base.py:496(<dictcomp>)
       3724560    1.192    0.000    1.192    0.000 {method 'items' of 'mappingproxy' objects}
        844677    1.103    0.000    1.767    0.000 _model_base.py:844(_get_deserialize_callable_from_annotation)
      370155/6    1.055    0.000  103.032   17.172 _model_base.py:771(_deserialize)
       1064158    1.040    0.000    1.040    0.000 {built-in method builtins.getattr}
        365259    0.815    0.000   31.077    0.000 copyreg.py:104(__newobj__)
 266198/183654    0.784    0.000    7.622    0.000 _model_base.py:438(<dictcomp>)
       3149490    0.760    0.000    0.760    0.000 {method 'append' of 'list' objects}
  166820/14561    0.746    0.000  101.664    0.007 _model_base.py:551(_deserialize)
     473301/10    0.706    0.000  103.032   10.303 _model_base.py:472(_create_value)
        166821    0.660    0.000   88.631    0.001 _model_base.py:341(__init__)
        371128    0.522    0.000    0.522    0.000 typing.py:349(inner)
        730518    0.514    0.000    2.198    0.000 copy.py:264(<genexpr>)
        897338    0.476    0.000    0.762    0.000 {built-in method builtins.issubclass}
       1262047    0.471    0.000    0.471    0.000 _model_base.py:813(_rest_name)
        532080    0.456    0.000    0.456    0.000 {method 'update' of 'dict' objects}
        473301    0.411    0.000    2.015    0.000 {built-in method builtins.next}
  166820/14561    0.394    0.000  102.626    0.007 _model_base.py:624(_deserialize_model)
        532080    0.359    0.000    0.359    0.000 {built-in method __new__ of type object at 0x7f2dc68eba20}
   82258/61624    0.319    0.000    8.557    0.000 _model_base.py:436(<listcomp>)
        104899    0.308    0.000    9.988    0.000 _models.py:2425(__init__)
  176516/14567    0.249    0.000  103.020    0.007 _model_base.py:725(<genexpr>)
       36515/6    0.246    0.000  103.032   17.172 _model_base.py:719(_deserialize_sequence)
        166821    0.226    0.000    0.226    0.000 _model_base.py:489(<dictcomp>)
             1    0.199    0.199  138.411  138.411 _models.py:259(__init__)
        640122    0.196    0.000    0.196    0.000 {method 'values' of 'dict' objects}
        166820    0.189    0.000    0.250    0.000 _model_base.py:309(get_deserializer)
           242    0.167    0.001   60.233    0.249 _models.py:2249(__init__)
  166820/14561    0.166    0.000  101.836    0.007 _model_base.py:731(_deserialize_default)
        166820    0.162    0.000    0.162    0.000 {built-in method _abc._abc_subclasscheck}
        537949    0.151    0.000    0.151    0.000 {built-in method builtins.len}
         26819    0.139    0.000   22.247    0.001 _models.py:2730(__init__)
        166820    0.127    0.000    0.191    0.000 _model_base.py:430(_is_model)
        166820    0.125    0.000    0.287    0.000 <frozen abc>:121(__subclasscheck__)
             1    0.115    0.115  138.527  138.527 <string>:1(<module>)
        371128    0.097    0.000    0.097    0.000 typing.py:2246(cast)
         13582    0.084    0.000   14.761    0.001 _models.py:2298(__init__)
         13936    0.076    0.000   11.766    0.001 _models.py:1787(__init__)
          6605    0.032    0.000    5.923    0.001 _models.py:2629(__init__)
           439    0.020    0.000   10.620    0.024 _models.py:2556(__init__)
           298    0.014    0.000   14.492    0.049 _models.py:2495(__init__)
           973    0.001    0.000   83.690    0.086 _model_base.py:647(_deserialize_with_optional)
             1    0.000    0.000  138.527  138.527 {built-in method builtins.exec}
             1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
@github-actions github-actions bot added customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 5, 2024
@xiangyan99 xiangyan99 added Client This issue points to a problem in the data-plane of the library. Document Intelligence and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Aug 5, 2024
@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Aug 5, 2024
@xiangyan99 (Member)

Thanks for the feedback, we’ll investigate asap.

@stevesimmons (Author)

As a baseline for "reasonably performant while type-safe", I created a Pydantic model [1] with the same structure as the DocumentIntelligence AnalyzeResult. Deserialisation from JSON this time was 25x faster.

Here is the timing comparison code:

import pathlib

import orjson

import azure.ai.documentintelligence.models

# Timer is a small custom context manager that records elapsed time
obj = orjson.loads(pathlib.Path(<path>).read_bytes())

with Timer('Pydantic version of AnalyzeResult') as t1:
    result1 = PydanticAnalyzeResult(**obj)
# ['Pydantic version of AnalyzeResult' took 513.7 ms]

with Timer('DocumentIntelligence AnalyzeResult') as t2:
    result2 = azure.ai.documentintelligence.models.AnalyzeResult(obj)
# ['DocumentIntelligence AnalyzeResult' took 0m12.93s]

print(f"Ratio: {round(t2.elapsed / t1.elapsed, 1)}x")
# Ratio: 25.2x
assert result1.pages[0].lines[0].content == result2.pages[0].lines[0].content

[1] This Pydantic model was created directly from the DocIntelligence JSON using datamodel-code-generator.

@stevesimmons (Author)

If I may add one further comment, please can you test with the features DocumentAnalysisFeature.LANGUAGES and DocumentAnalysisFeature.STYLE_FONT activated? They increase the size of the OCR results, making the slowdown more apparent.

@annatisch (Member)

Thank you so much for this report @stevesimmons - this information is very helpful!
Another datapoint I would be curious to see is whether the deserialization time improves on a second call (i.e. if you called AnalyzeResult(obj) twice in a row). I believe the deserialization caches a lot of type info the first time round, which should make the second attempt much faster. Obviously that's still not a good experience and work is still needed here, especially in scenarios like this where the API is very minimal, so chances are you may only need to deserialize this once.
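The repeat-call experiment could be sketched as follows (hedged: a plain dict copy stands in for azure.ai.documentintelligence.models.AnalyzeResult(obj), which is the call whose first-run caching is in question):

```python
# Sketch: time two consecutive deserializations to see whether first-call
# caching helps. deserialize() is a stand-in for AnalyzeResult(obj).
import time

def deserialize(obj):
    # placeholder for azure.ai.documentintelligence.models.AnalyzeResult(obj)
    return {k: list(v) if isinstance(v, list) else v for k, v in obj.items()}

obj = {"apiVersion": "2024-02-29-preview", "pages": [{"lines": []}]}

start = time.perf_counter()
first = deserialize(obj)
mid = time.perf_counter()
second = deserialize(obj)
end = time.perf_counter()

print(f"first call:  {mid - start:.6f}s")
print(f"second call: {end - mid:.6f}s")
```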

We've already merged one perf improvement to the generated models - so if we regenerate this library we should already get that one for free.
We also have another change in PR that should further improve this. Once this is reviewed and merged we should again regenerate and reassess. @YalinLi0312 could you test the provided code snippet with the updated autorest changes to see how much difference it makes?

I think there will still be a fair amount of work to get the performance here where we want it to be. @swathipil and I will keep digging!

@stevesimmons (Author)

Here is the comparison of first vs second runs of AnalyzeResult(obj) on a 50-page document. No real difference between them. And roughly 1/20th the speed of a Pydantic model.

['pydantic version of AnalyzeResult' took 532.9 ms]
['DocumentIntelligence AnalyzeResult, first time' took 0m11.60s]
['DocumentIntelligence AnalyzeResult, second time' took 0m11.33s]

Unless this slowdown is due to a simple cache step that's not working, maybe it's worth considering switching from the dynamic metaclass approach to a straightforward Pydantic model. You know the schemas for all the subobjects in AnalyzeResult. Plus Pydantic supports validator functions that can transparently upgrade old OCR results from the formrecognizer schema to documentintelligence's (e.g. reformatting polygons from list[Point] to list[float], etc).
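The validator idea could look roughly like this (a hedged sketch assuming Pydantic v2; `Line` here is an illustrative model, not the SDK's, and accepts a polygon in either the old formrecognizer shape, a list of {"x", "y"} points, or the new documentintelligence flat-list-of-floats shape):

```python
# Sketch of a Pydantic v2 "before" validator that transparently upgrades
# old-schema polygons (list of Point dicts) to the new flat float list.
from pydantic import BaseModel, field_validator

class Line(BaseModel):
    content: str
    polygon: list[float]

    @field_validator("polygon", mode="before")
    @classmethod
    def flatten_points(cls, value):
        # old schema: list of {"x": ..., "y": ...} -> [x0, y0, x1, y1, ...]
        if value and isinstance(value[0], dict):
            return [c for pt in value for c in (pt["x"], pt["y"])]
        return value

old_style = Line(content="hi", polygon=[{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}])
new_style = Line(content="hi", polygon=[1.0, 2.0, 3.0, 4.0])
print(old_style.polygon == new_style.polygon)  # both normalize to the flat form
```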

@annatisch (Member)

Great - thank you for this additional data and all your investigations - we will investigate!

@jleguina commented Aug 8, 2024

+1 to this, especially Pydantic compatibility!

Please keep us posted - thanks!

@YalinLi0312 (Member)

@annatisch It takes 11.91s with DI version 1.0.0b3 and 8.14s with the updated generator changes in my test, about 1.46x faster.

@annatisch (Member)

Great thanks @YalinLi0312 - good to see a little bit of progress.
It looks like the other perf updates to Model that were merged a couple of days ago haven't been included - perhaps they haven't been released in autorest yet. Would be good to retry with those once they're available.

@YalinLi0312 (Member) commented Aug 21, 2024

@annatisch @swathipil I tested the same file with the package regenerated from the latest codegen (PR link); it now takes 7.95s.

@YalinLi0312 (Member)

Hi @stevesimmons , our new release with perf updates is ready: https://pypi.org/project/azure-ai-documentintelligence/1.0.0b4/
Please give it a try and welcome any feedback!

Thanks

@xiangyan99 xiangyan99 added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Sep 6, 2024
@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Sep 6, 2024
github-actions bot commented Sep 6, 2024

Hi @stevesimmons. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

@swathipil swathipil added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. labels Sep 9, 2024
@swathipil (Member) commented Sep 10, 2024

Hi @stevesimmons - Have you had a chance to test the new release, and are you seeing improvement in perf?

Additionally, would you be able to provide a sample of the OCR JSON output for both documentintelligence and formrecognizer? We have a sample file that we're running against, but we're wondering if you might have a specific file that highlights any additional differences.

Update from our side: We are continuing to investigate and will keep you updated on any findings. Thanks!

@swathipil swathipil added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Sep 10, 2024
github-actions bot commented:
Hi @stevesimmons. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@stevesimmons (Author)

I've tested the new 1.0.0b4 version against the 32 files I am working on today.
The performance is considerably better than b3, though still several times slower than my basic Pydantic model.
It would probably need to double in speed again for me to consider using the new AnalyzeResult.

Loaded 15043c1df389006c OCR in 111ms (compared with 839ms using Azure Doc Int API version 1.0.0b4)
Loaded 2a85d24fb765fd3d OCR in 26ms (compared with 840ms using Azure Doc Int API version 1.0.0b4)
Loaded 029147ad86efb1ff OCR in 367ms (compared with 2,325ms using Azure Doc Int API version 1.0.0b4)
Loaded 0d4c42d63dd20ffe OCR in 19ms (compared with 474ms using Azure Doc Int API version 1.0.0b4)
Loaded 012f6ac07cfb7963 OCR in 29ms (compared with 546ms using Azure Doc Int API version 1.0.0b4)
Loaded 07ecc075841d47b7 OCR in 574ms (compared with 3,197ms using Azure Doc Int API version 1.0.0b4)
Loaded de49996967a47f00 OCR in 97ms (compared with 2,471ms using Azure Doc Int API version 1.0.0b4)
Loaded 14b061938dfc7236 OCR in 49ms (compared with 1,138ms using Azure Doc Int API version 1.0.0b4)
Loaded 050c36121ea71146 OCR in 410ms (compared with 3,129ms using Azure Doc Int API version 1.0.0b4)
Loaded 0018d3c2807935fa OCR in 59ms (compared with 1,435ms using Azure Doc Int API version 1.0.0b4)
Loaded 0373b5bb4116de97 OCR in 294ms (compared with 2,130ms using Azure Doc Int API version 1.0.0b4)
Loaded 1f78b963792a66f4 OCR in 31ms (compared with 598ms using Azure Doc Int API version 1.0.0b4)
Loaded 0063a91af1314ece OCR in 14ms (compared with 298ms using Azure Doc Int API version 1.0.0b4)
Loaded 70e0bd3c9e304954 OCR in 231ms (compared with 1,104ms using Azure Doc Int API version 1.0.0b4)
Loaded 14b02012d78a7ed0 OCR in 255ms (compared with 1,381ms using Azure Doc Int API version 1.0.0b4)
Loaded 9399b7f371e00cab OCR in 492ms (compared with 5,136ms using Azure Doc Int API version 1.0.0b4)
Loaded 0f822b76c0ff9c67 OCR in 388ms (compared with 2,252ms using Azure Doc Int API version 1.0.0b4)
Loaded 37c9053b5fe887f3 OCR in 13ms (compared with 323ms using Azure Doc Int API version 1.0.0b4)
Loaded 0006f81a6a44e94b OCR in 228ms (compared with 684ms using Azure Doc Int API version 1.0.0b4)
Loaded 361de1e819b2d870 OCR in 13ms (compared with 313ms using Azure Doc Int API version 1.0.0b4)
Loaded 904267d94c742ce5 OCR in 246ms (compared with 1,039ms using Azure Doc Int API version 1.0.0b4)
Loaded 1a6062cdfb853f82 OCR in 584ms (compared with 3,561ms using Azure Doc Int API version 1.0.0b4)
Loaded 03528a5619961c11 OCR in 238ms (compared with 274ms using Azure Doc Int API version 1.0.0b4)
Loaded 70573a0f261fc803 OCR in 67ms (compared with 1,761ms using Azure Doc Int API version 1.0.0b4)
Loaded 05a2d94b41a9d4f9 OCR in 13ms (compared with 344ms using Azure Doc Int API version 1.0.0b4)
Loaded 29fa6c5e10bace1d OCR in 46ms (compared with 1,238ms using Azure Doc Int API version 1.0.0b4)
Loaded 741bbc725edbe985 OCR in 9ms (compared with 237ms using Azure Doc Int API version 1.0.0b4)
Loaded 96661f1489af3aca OCR in 277ms (compared with 2,032ms using Azure Doc Int API version 1.0.0b4)
Loaded 15c827591adb6518 OCR in 213ms (compared with 373ms using Azure Doc Int API version 1.0.0b4)
Loaded 04a48d0ee7e49d1f OCR in 349ms (compared with 3,104ms using Azure Doc Int API version 1.0.0b4)
Loaded 4dc554962484138d OCR in 13ms (compared with 340ms using Azure Doc Int API version 1.0.0b4)
Loaded 35f4f94925ead8bf OCR in 317ms (compared with 2,816ms using Azure Doc Int API version 1.0.0b4)

@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Sep 19, 2024
@github-actions github-actions bot removed the needs-author-feedback Workflow: More information is needed from author to address the issue. label Sep 19, 2024