[Hackathon 8th No.40] Reproduce the MixTeX model in PaddleOCR #15581

robinbg · 2025-06-04T18:52:58Z

No description provided.

paddle-bot · 2025-06-04T18:53:03Z

Thanks for your contribution!

Copilot

Pull Request Overview

This PR implements the MixTeX model in PaddleOCR by adding new inference, training, and vocabulary generation scripts as well as the corresponding model components, data processing logic, loss function, and documentation.

Added inference script (predict_mixtex.py) for evaluating MixTeX models.
Integrated new model components including the backbone, head, and loss for MixTeX.
Provided data processing tools, configuration files, and updated documentation in both English and Chinese.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

File	Description
tools/infer/predict_mixtex.py	Added inference script and initialization for MixTeX recognition.
tools/generate_mixtex_vocab.py	Created script to generate frequency-based vocabulary for MixTeX.
ppocr/postprocess/mixtex_postprocess.py	Added post-processing utility for decoding model outputs into LaTeX strings.
ppocr/modeling/heads/mixtex_head.py	Implemented a transformer decoder head for MixTeX with both training and inference modes.
ppocr/modeling/backbones/mixtex_backbone.py	Introduced an efficient convolutional backbone for feature extraction.
ppocr/modeling/architectures/mixtex.py	Assembled the MixTeX architecture by combining the backbone and head.
ppocr/losses/mixtex_loss.py	Developed a custom loss function integrating cross-entropy with label smoothing.
ppocr/data/imaug/mixtex_process.py	Added data processing and tokenization routines for MixTeX training/inference.
docs/algorithm/formula_recognition/mixtex_en.md	Updated English documentation for the MixTeX model.
docs/algorithm/formula_recognition/mixtex.md	Updated Chinese documentation for the MixTeX model.
doc/mixtex/mixtex_architecture.txt	Provided textual architecture diagram for MixTeX.
doc/mixtex/mixtex_architecture.png	Added placeholder image for the MixTeX architecture diagram.
configs/rec/mixtex/mixtex_base.yml	Provided configuration file for training and evaluation of MixTeX.
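
The summary above describes the loss as cross-entropy combined with label smoothing. As a rough illustration of that idea only, here is a minimal pure-Python sketch for a single token. It is not taken from ppocr/losses/mixtex_loss.py; the function name, the `epsilon` parameter, and the uniform smoothing convention are all assumptions made for the example.

```python
import math


def label_smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy of one prediction against a smoothed target.

    Assumption: the target distribution keeps 1 - epsilon on the true
    class and spreads epsilon uniformly over all K classes. Other
    smoothing conventions exist; the PR may use a different one.
    """
    k = len(logits)

    # Numerically stable log-softmax over the K logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    loss = 0.0
    for i, log_p in enumerate(log_probs):
        # Smoothed target probability for class i.
        q = epsilon / k + ((1.0 - epsilon) if i == target else 0.0)
        loss -= q * log_p
    return loss
```

With `epsilon=0` this reduces to ordinary cross-entropy. A positive `epsilon` penalizes very confident predictions, which is the usual motivation for smoothing in sequence decoders.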

Comments suppressed due to low confidence (1)

tools/infer/predict_mixtex.py:51

The attribute name 'rec_algorith' appears to contain a typo; consider renaming it to 'rec_algorithm' for clarity.

self.rec_algorith = "MixTeX"

Copilot · 2025-06-04T23:48:22Z

ppocr/data/imaug/mixtex_process.py

+        # Break text into tokens based on whitespace and special characters
+        tokens = []
+        current_token = ""
+        special_chars = ['\\', '{', '}', '_', '^', '&']


[nitpick] The list of special characters used for tokenizing LaTeX in the 'encode' method is less comprehensive than the one used in the vocabulary generation script; consider aligning the tokenization rules to avoid potential inconsistencies.

Suggested change

-        special_chars = ['\\', '{', '}', '_', '^', '&']
+        special_chars = ['\\', '{', '}', '_', '^', '&', '%', '$', '#', '~', '[', ']', '(', ')', '|', '<', '>', '=', '+', '-', '*', '/', ':', ';', ',', '.', '!']
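
One way to avoid the inconsistency described in this comment is to define the delimiter set once and have both the encoder and the vocabulary script call the same tokenizer. The sketch below is a hypothetical illustration, not the PR's code: `SPECIAL_CHARS`, `tokenize`, and `build_vocab` are invented names, and the rule that every special character becomes its own token is an assumption based on the comment in the diff ("whitespace and special characters").

```python
from collections import Counter

# Assumption: a single shared constant holding the extended character set
# from the suggested change, so the two call sites cannot drift apart.
SPECIAL_CHARS = frozenset("\\{}_^&%$#~[]()|<>=+-*/:;,.!")


def tokenize(text):
    """Split on whitespace and emit each special character as its own token."""
    tokens = []
    current = ""
    for ch in text:
        if ch.isspace() or ch in SPECIAL_CHARS:
            # Any delimiter ends the token being accumulated.
            if current:
                tokens.append(current)
                current = ""
            # Special characters are kept; whitespace is dropped.
            if ch in SPECIAL_CHARS:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def build_vocab(corpus, min_freq=1):
    """Frequency-ordered vocabulary built with the same tokenizer."""
    counts = Counter(tok for line in corpus for tok in tokenize(line))
    return [tok for tok, n in counts.most_common() if n >= min_freq]
```

Because `build_vocab` goes through `tokenize`, any token the encoder can produce is one the vocabulary script has also seen, given the same corpus.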

luotao1 · 2025-06-10T03:18:51Z

Please submit an RFC design document first.

Add MixTeX multi-modal LaTeX formula recognition model implementation

5cf7f7f

paddle-bot bot added the contributor label Jun 4, 2025

GreatV requested a review from Copilot June 4, 2025 23:47

Copilot AI reviewed Jun 4, 2025

luotao1 mentioned this pull request Jun 5, 2025

[Hackathon 8th] Open Source Contribution Individual Challenge PaddlePaddle/Paddle#71310

Closed
