Releases: VikParuchuri/marker
Marker Bugfixes and Improvements to `pdftext`
What's Changed
- Fix `chunk_convert.sh` to handle `output_dir` correctly by @Leon-Sander in #415
- `pdftext` Improvements and Misc Bugfixes by @VikParuchuri and @iammosespaulr in #422
- Blank page and TOC bugfixes
- Fix README.md and updated examples
- Update to the latest pdftext release, incorporating heuristic-based segmentation for enhanced performance and accuracy
- Update surya and tabled dependencies, incorporating various bugfixes.
New Contributors
- @Leon-Sander made their first contribution in #415
Full Changelog: v1.0.2...v1.1.0
Bugfixes - python 3.10 compatibility, quotes, images
- Fix issue with python 3.10
- Fix positions of quote characters
- Change default image output type to JPEG for speed and smaller filesize with minimal quality loss
Bugfixes and parsing improvements
- Fix lots of misc bugs, including encoding, empty page problems, and image rendering
- Improve list processing with joining and nesting
- Add in blockquotes
- Slightly improve performance
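
As a rough illustration of what "nesting" list items can mean, here is a minimal sketch that turns a flat run of list items into a tree using each item's left indent. The `ListItem` class, the `indent` field, and the `tolerance` parameter are all hypothetical names chosen for this example; this is not marker's actual `ListProcessor` logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ListItem:
    text: str
    indent: float  # hypothetical: left edge of the item, e.g. in points
    children: List["ListItem"] = field(default_factory=list)


def nest_items(items: List[ListItem], tolerance: float = 2.0) -> List[ListItem]:
    """Attach each item to the nearest preceding item that is clearly less indented."""
    roots: List[ListItem] = []
    stack: List[ListItem] = []  # chain of currently open ancestors
    for item in items:
        # Drop ancestors that are at the same depth as this item, or deeper.
        while stack and stack[-1].indent >= item.indent - tolerance:
            stack.pop()
        if stack:
            stack[-1].children.append(item)
        else:
            roots.append(item)
        stack.append(item)
    return roots


flat = [
    ListItem("Fruit", 10),
    ListItem("Apple", 25),
    ListItem("Pear", 25),
    ListItem("Vegetables", 10),
]
tree = nest_items(flat)
```

Here `tree` holds two top-level items, with "Apple" and "Pear" nested under "Fruit". The tolerance keeps small indent differences between sibling items from being read as a new nesting level.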
What's Changed
- Fix marker server by @VikParuchuri in #396
- Add `ListGroup` joining processor and refactor `Text` joining processor by @iammosespaulr in #402
- Misc fixes by @VikParuchuri in #397
- Add Blockquote Processor by @iammosespaulr in #404
- Add Nested Lists support to ListProcessor by @iammosespaulr in #410
- Marker Improvements and Bugfixes by @iammosespaulr in #403
Full Changelog: v1.0.0...v1.0.1
Marker v1!
This is the release of marker v1, a complete rewrite from scratch.
- 2x faster due to a new layout model
- Consistent internal schema for blocks and pages
- Modular architecture with processors and renderers that can easily be overridden
- JSON chunk and markdown output
- Lots of unit tests
- Much higher output quality
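
To make the "processors and renderers that can easily be overridden" idea concrete, here is a minimal sketch of that general pattern. Every name in it (`Block`, `Document`, `strip_whitespace`, `markdown_renderer`, `convert`) is hypothetical and chosen for illustration; it does not reproduce marker's real classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Block:
    # Hypothetical minimal block: a type label plus its text.
    block_type: str
    text: str


@dataclass
class Document:
    blocks: List[Block] = field(default_factory=list)


# A processor mutates the document in place; a renderer turns it into a string.
Processor = Callable[[Document], None]
Renderer = Callable[[Document], str]


def strip_whitespace(doc: Document) -> None:
    """Example processor: trim surrounding whitespace on every block."""
    for block in doc.blocks:
        block.text = block.text.strip()


def markdown_renderer(doc: Document) -> str:
    """Example renderer: headings get a '# ' prefix, other blocks are plain text."""
    lines = []
    for block in doc.blocks:
        prefix = "# " if block.block_type == "heading" else ""
        lines.append(prefix + block.text)
    return "\n\n".join(lines)


def convert(doc: Document, processors: List[Processor], renderer: Renderer) -> str:
    """Run each processor in order, then hand the document to the renderer."""
    for process in processors:
        process(doc)
    return renderer(doc)


doc = Document([Block("heading", "  Title "), Block("text", " Body text. ")])
result = convert(doc, [strip_whitespace], markdown_renderer)
```

Because `convert` only depends on the two callable types, swapping in a different renderer (for example, one that emits JSON) or adding a processor does not require touching the rest of the pipeline.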
What's Changed
- feat: API server file upload support by @tjbck in #332
- Upgrade line joining by @iammosespaulr in #344
- Surya Layout model and batch multiplier updates by @iammosespaulr in #335
- Initial document skeleton by @VikParuchuri in #345
- Add PDF Provider by @iammosespaulr in #346
- Add Layout Merging by @iammosespaulr in #348
- Vik v2 by @VikParuchuri in #349
- Layout Merging fixes and tests by @iammosespaulr in #350
- Vik v2 by @VikParuchuri in #351
- Decouple Span from Line by @iammosespaulr in #352
- Vik v2 by @VikParuchuri in #353
- Add simple line and span renderer, add blocktype class by @VikParuchuri in #357
- Add markdown renderer, swap how ids are named by @VikParuchuri in #358
- Fix markdown output by @VikParuchuri in #359
- Add OCR Builder by @iammosespaulr in #356
- Output images, clean up other output formats by @VikParuchuri in #362
- Vik v2 by @VikParuchuri in #364
- Cleanup and speed up tests by @iammosespaulr in #363
- Add CI tests by @iammosespaulr in #366
- Add debug utils, fix output quality issues by @VikParuchuri in #367
- Allow Overriding Node Classes by @iammosespaulr in #368
- Reorganize tests by @VikParuchuri in #369
- Minor debugging and misc fixes by @iammosespaulr in #370
- Chunk JSON output by @VikParuchuri in #371
- Vik v2 by @VikParuchuri in #372
- Add code processor, fix issues with structure by @VikParuchuri in #375
- Add Line merging across Pages and Columns by @iammosespaulr in #373
- PDF Converter Initialization refactor + Tests by @iammosespaulr in #379
- Wire up convert_single by @VikParuchuri in #380
- Fix tests by @VikParuchuri in #381
- Add Docstrings for Processors, Builders and Converters and `-l` to list them from the `convert.py` CLI + Misc Fixes by @iammosespaulr in #382
- Fix broken text by @VikParuchuri in #383
- Fix marker app by @VikParuchuri in #384
- Fix marker server by @VikParuchuri in #385
- Misc Bugfixes by @iammosespaulr in #386
- Vik v2 by @VikParuchuri in #387
- Update tests by @iammosespaulr in #388
- Additional Fixes by @iammosespaulr in #390
- Vik v2 by @VikParuchuri in #391
- Marker v2 by @VikParuchuri in #392
- Improve comparison performance by @VikParuchuri in #394
- Dev by @VikParuchuri in #395
Full Changelog: v0.3.10...v1.0.0
Performance improvements, API server
- Improve performance by 10-15%
- Add a simple API server for local use-cases
Flatten PDF, fix page separators, fix torch/transformers bugs
- Fix issues with transformers 4.46 and torch 2.5
- Improve page separators - they now appear at the start of the page, and show the page number
- Flatten form fields into the PDF before extracting markdown
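
The page separator behavior can be sketched as follows: each page's markdown is preceded by a separator line that carries the page number. The `join_pages` function and the `--- page N ---` separator format are assumptions made for this example, not marker's actual output format.

```python
from typing import List


def join_pages(pages: List[str], start_page: int = 0) -> str:
    """Join per-page markdown, placing a numbered separator before each page."""
    parts = []
    for offset, page_markdown in enumerate(pages):
        page_number = start_page + offset
        # Hypothetical separator format; the real format may differ.
        separator = f"--- page {page_number} ---"
        parts.append(f"{separator}\n\n{page_markdown.strip()}")
    return "\n\n".join(parts)


output = join_pages(["First page text.", "Second page text."])
```

Putting the separator at the start of a page, rather than at the end of the previous one, means a consumer splitting the output on separators always knows which page the following text belongs to.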
Fix table bug
- Fix bug that caused conversion to fail when start_page was set and the document had tables
Undo threads
Threads cause issues on a small % of devices
Speedups, bug fixes
- Fix some edge case OCR bugs
- ~20% end to end speedup by improving layout and text detection
Fix OCR bugs
- Fix bbox issue with OCR and resizing
- Fix issue with layout bboxes missing after OCR