robinbg commented Jun 4, 2025

No description provided.

paddle-bot bot commented Jun 4, 2025

Thanks for your contribution!

GreatV requested a review from Copilot on June 4, 2025 23:47
Copilot AI left a comment

Pull Request Overview

This PR implements the MixTeX model in PaddleOCR by adding new inference, training, and vocabulary generation scripts as well as the corresponding model components, data processing logic, loss function, and documentation.

  • Added inference script (predict_mixtex.py) for evaluating MixTeX models.
  • Integrated new model components including the backbone, head, and loss for MixTeX.
  • Provided data processing tools, configuration files, and updated documentation in both English and Chinese.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tools/infer/predict_mixtex.py | Added inference script and initialization for MixTeX recognition. |
| tools/generate_mixtex_vocab.py | Created script to generate frequency-based vocabulary for MixTeX. |
| ppocr/postprocess/mixtex_postprocess.py | Added post-processing utility for decoding model outputs into LaTeX strings. |
| ppocr/modeling/heads/mixtex_head.py | Implemented a transformer decoder head for MixTeX with both training and inference modes. |
| ppocr/modeling/backbones/mixtex_backbone.py | Introduced an efficient convolutional backbone for feature extraction. |
| ppocr/modeling/architectures/mixtex.py | Assembled the MixTeX architecture by combining the backbone and head. |
| ppocr/losses/mixtex_loss.py | Developed a custom loss function integrating cross-entropy with label smoothing. |
| ppocr/data/imaug/mixtex_process.py | Added data processing and tokenization routines for MixTeX training/inference. |
| docs/algorithm/formula_recognition/mixtex_en.md | Updated English documentation for the MixTeX model. |
| docs/algorithm/formula_recognition/mixtex.md | Updated Chinese documentation for the MixTeX model. |
| doc/mixtex/mixtex_architecture.txt | Provided textual architecture diagram for MixTeX. |
| doc/mixtex/mixtex_architecture.png | Added placeholder image for the MixTeX architecture diagram. |
| configs/rec/mixtex/mixtex_base.yml | Provided configuration file for training and evaluation of MixTeX. |

Comments suppressed due to low confidence (1)

tools/infer/predict_mixtex.py:51

  • The attribute name 'rec_algorith' appears to contain a typo; consider renaming it to 'rec_algorithm' for clarity.
self.rec_algorith = "MixTeX"

# Break text into tokens based on whitespace and special characters
tokens = []
current_token = ""
special_chars = ['\\', '{', '}', '_', '^', '&']

Copilot AI Jun 4, 2025

[nitpick] The list of special characters used for tokenizing LaTeX in the 'encode' method is shorter than the one used in the vocabulary generation script; consider aligning the two lists so that text is split the same way at encoding time as it was when the vocabulary was built.

Suggested change
- special_chars = ['\\', '{', '}', '_', '^', '&']
+ special_chars = ['\\', '{', '}', '_', '^', '&', '%', '$', '#', '~', '[', ']', '(', ')', '|', '<', '>', '=', '+', '-', '*', '/', ':', ';', ',', '.', '!']
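
One possible way to keep the two lists from drifting apart again is to define the special characters once and have both the encoder and the vocabulary script use that definition. The sketch below is illustrative only: `SPECIAL_CHARS` and `tokenize` are hypothetical names that do not come from the PR, and because the body of the PR's `encode` loop is not shown here, the splitting rule (whitespace separates tokens, each special character becomes its own token) is an assumption based on the code comment above.

```python
# Hypothetical shared constant (not from the PR): the 27 characters from the
# suggested change, defined once so that every caller splits text identically.
SPECIAL_CHARS = frozenset("\\{}_^&%$#~[]()|<>=+-*/:;,.!")


def tokenize(text, special_chars=SPECIAL_CHARS):
    """Split on whitespace and emit each special character as its own token.

    Assumed behavior; the PR's actual loop may treat some characters differently.
    """
    tokens = []
    current = ""
    for ch in text:
        if ch.isspace() or ch in special_chars:
            # A separator ends the token being accumulated, if any.
            if current:
                tokens.append(current)
                current = ""
            # Special characters are kept as tokens; whitespace is dropped.
            if ch in special_chars:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


# With the shared list, '+' is split out as its own token.
print(tokenize("a+b"))  # ['a', '+', 'b']
# With the shorter list from the current 'encode' method, it is not.
print(tokenize("a+b", special_chars=frozenset("\\{}_^&")))  # ['a+b']
```

The second call shows the kind of mismatch the comment is about: with the shorter list, `a+b` stays a single token, which a vocabulary built with the longer list would presumably not contain.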

luotao1 (Collaborator) commented Jun 10, 2025

Please submit an RFC design document first.
