[Hackathon 8th No.40] Reproduce the MixTeX model in PaddleOCR #15581
Thanks for your contribution!
Pull Request Overview
This PR implements the MixTeX model in PaddleOCR by adding new inference, training, and vocabulary generation scripts as well as the corresponding model components, data processing logic, loss function, and documentation.
- Added inference script (predict_mixtex.py) for evaluating MixTeX models.
- Integrated new model components including the backbone, head, and loss for MixTeX.
- Provided data processing tools, configuration files, and updated documentation in both English and Chinese.
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tools/infer/predict_mixtex.py | Added inference script and initialization for MixTeX recognition. |
tools/generate_mixtex_vocab.py | Created script to generate frequency-based vocabulary for MixTeX. |
ppocr/postprocess/mixtex_postprocess.py | Added post-processing utility for decoding model outputs into LaTeX strings. |
ppocr/modeling/heads/mixtex_head.py | Implemented a transformer decoder head for MixTeX with both training and inference modes. |
ppocr/modeling/backbones/mixtex_backbone.py | Introduced an efficient convolutional backbone for feature extraction. |
ppocr/modeling/architectures/mixtex.py | Assembled the MixTeX architecture by combining the backbone and head. |
ppocr/losses/mixtex_loss.py | Developed a custom loss function integrating cross-entropy with label smoothing. |
ppocr/data/imaug/mixtex_process.py | Added data processing and tokenization routines for MixTeX training/inference. |
docs/algorithm/formula_recognition/mixtex_en.md | Updated English documentation for the MixTeX model. |
docs/algorithm/formula_recognition/mixtex.md | Updated Chinese documentation for the MixTeX model. |
doc/mixtex/mixtex_architecture.txt | Provided textual architecture diagram for MixTeX. |
doc/mixtex/mixtex_architecture.png | Added placeholder image for the MixTeX architecture diagram. |
configs/rec/mixtex/mixtex_base.yml | Provided configuration file for training and evaluation of MixTeX. |
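
For readers unfamiliar with the loss described for ppocr/losses/mixtex_loss.py, the following is a minimal sketch of cross-entropy with uniform label smoothing for a single token. It is not taken from the PR: the function name `label_smoothed_cross_entropy`, the `epsilon` default, and the uniform smoothing scheme are assumptions chosen for illustration, and the PR's actual implementation may differ (for example in how padding tokens are masked).

```python
import math


def label_smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution (hypothetical helper).

    The smoothed target puts (1 - epsilon) on the true class and spreads
    epsilon uniformly over all K classes.
    """
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_z for x in logits]

    num_classes = len(logits)
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        q = epsilon / num_classes
        if i == target:
            q += 1.0 - epsilon
        loss -= q * log_p
    return loss
```

With `epsilon=0.0` this reduces to ordinary cross-entropy, which is a quick way to sanity-check any implementation of the smoothed variant.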
Comments suppressed due to low confidence (1)
tools/infer/predict_mixtex.py:51
- The attribute name 'rec_algorith' appears to contain a typo; consider renaming it to 'rec_algorithm' for clarity.
self.rec_algorith = "MixTeX"
# Break text into tokens based on whitespace and special characters
tokens = []
current_token = ""
special_chars = ['\\', '{', '}', '_', '^', '&']
[nitpick] The list of special characters used for tokenizing LaTeX in the 'encode' method is less comprehensive than the one used in the vocabulary generation script; consider aligning the tokenization rules to avoid potential inconsistencies.
- special_chars = ['\\', '{', '}', '_', '^', '&']
+ special_chars = ['\\', '{', '}', '_', '^', '&', '%', '$', '#', '~', '[', ']', '(', ')', '|', '<', '>', '=', '+', '-', '*', '/', ':', ';', ',', '.', '!']
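
To illustrate why the mismatch matters, here is a minimal sketch of a whitespace-and-special-character tokenizer in the spirit of the fragment under review. The `tokenize` function and both constant names are hypothetical and not part of the PR; the longer list is simply the one from the suggestion and is assumed, not confirmed, to match what tools/generate_mixtex_vocab.py uses.

```python
# Hypothetical constants: the short list is the one in the reviewed code,
# the long list is the one proposed in the suggestion.
ENCODE_SPECIAL_CHARS = ['\\', '{', '}', '_', '^', '&']
VOCAB_SPECIAL_CHARS = ENCODE_SPECIAL_CHARS + [
    '%', '$', '#', '~', '[', ']', '(', ')', '|', '<', '>', '=',
    '+', '-', '*', '/', ':', ';', ',', '.', '!',
]


def tokenize(text, special_chars):
    """Split text on whitespace and emit each special character as its own token."""
    tokens = []
    current_token = ""
    for ch in text:
        if ch.isspace() or ch in special_chars:
            # Flush whatever has accumulated before the delimiter.
            if current_token:
                tokens.append(current_token)
                current_token = ""
            if not ch.isspace():
                tokens.append(ch)
        else:
            current_token += ch
    if current_token:
        tokens.append(current_token)
    return tokens


short = tokenize("a+b=c", ENCODE_SPECIAL_CHARS)
full = tokenize("a+b=c", VOCAB_SPECIAL_CHARS)
print(short)  # ['a+b=c']
print(full)   # ['a', '+', 'b', '=', 'c']
```

Under these assumptions, a vocabulary built with the longer list would contain `a`, `+`, `b`, `=`, and `c` but not `a+b=c`, so encoding with the shorter list would map the whole expression to an unknown token. Sharing a single constant between the two code paths would rule out this kind of drift.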
Please submit an RFC design document first.
No description provided.