Dev #395

Merged
merged 266 commits into from
Nov 27, 2024
266 commits
8e7ddcd
clean up new column detection
iammosespaulr Nov 11, 2024
0606b9f
remove threshold for newline (instead have it be dynamic)
iammosespaulr Nov 11, 2024
ef53665
cleanup
iammosespaulr Nov 12, 2024
e9bdac1
update pyproject
iammosespaulr Nov 12, 2024
e723a52
update indentation thresholding [skip ci]
iammosespaulr Nov 12, 2024
06139f0
Merge pull request #344 from VikParuchuri/dev-mose/line-joining
VikParuchuri Nov 12, 2024
c198419
Merge pull request #335 from VikParuchuri/dev-mose/compilation-updates
VikParuchuri Nov 12, 2024
69d4c17
Start on v2
VikParuchuri Nov 12, 2024
b083c10
Initial document provider
VikParuchuri Nov 12, 2024
9b3b257
Refactor how structure works
VikParuchuri Nov 12, 2024
7b7fc2b
add pdf provider
iammosespaulr Nov 13, 2024
db2c797
add block type for line and span [skip ci]
iammosespaulr Nov 13, 2024
c988b8c
add type for block_type
iammosespaulr Nov 13, 2024
3360ff6
allow arbitrary types for models [skip ci]
iammosespaulr Nov 13, 2024
2cf577e
use hf dataset for tests [skip ci]
iammosespaulr Nov 13, 2024
eaff612
add document builder test and associated fixes [skip ci]
iammosespaulr Nov 13, 2024
fb7152e
typo [skip ci]
iammosespaulr Nov 13, 2024
65b8c56
update page_id
iammosespaulr Nov 13, 2024
baa19bc
Merge pull request #346 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 13, 2024
a571886
Add structure
VikParuchuri Nov 13, 2024
63cdac1
add layout merging changes
iammosespaulr Nov 14, 2024
0bc2b21
add page structure
iammosespaulr Nov 14, 2024
f4cd1c4
Update tests
VikParuchuri Nov 14, 2024
060d295
Merge pull request #348 from VikParuchuri/dev-mose/marker-v2
iammosespaulr Nov 14, 2024
dd873e9
Merge branch 'v2' into vik_v2
VikParuchuri Nov 14, 2024
92b9a2c
Merge pull request #349 from VikParuchuri/vik_v2
VikParuchuri Nov 14, 2024
7d8bbdc
fix get_block [skip ci]
iammosespaulr Nov 14, 2024
d27b874
Refactor config
VikParuchuri Nov 14, 2024
c1d97f0
Config is basemodel or dict
VikParuchuri Nov 14, 2024
8bcf548
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 14, 2024
944eba0
merge polygons when blocks are linked to other blocks, page rescaling…
iammosespaulr Nov 14, 2024
b3358b1
add test to assert layout parsing and merging
iammosespaulr Nov 14, 2024
f3bdc6d
Merge pull request #350 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 14, 2024
8cbfea5
Add processors
VikParuchuri Nov 14, 2024
6cb51df
Add tests, cleanup impls
VikParuchuri Nov 14, 2024
c8e9850
decouple spans from lines and update tests
iammosespaulr Nov 14, 2024
aa66523
Update structure
VikParuchuri Nov 14, 2024
6451576
Merge branch 'v2' into vik_v2
VikParuchuri Nov 14, 2024
5cfd486
Merge pull request #351 from VikParuchuri/vik_v2
VikParuchuri Nov 14, 2024
55e097a
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 14, 2024
e2e583d
Merge pull request #352 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 14, 2024
d6d8387
bugfixes
VikParuchuri Nov 14, 2024
706a159
update test imports [skip ci]
iammosespaulr Nov 14, 2024
dacd12e
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 14, 2024
24abca2
polygon merging fixes [skip ci]
iammosespaulr Nov 14, 2024
4386612
Fix structure, change blockid to class
VikParuchuri Nov 14, 2024
4be2a5b
Add initial renderer
VikParuchuri Nov 14, 2024
0c2f68d
update scope and add marker [skip ci]
iammosespaulr Nov 14, 2024
c9f478a
Add default renderer
VikParuchuri Nov 14, 2024
1246b68
Merge pull request #353 from VikParuchuri/vik_v2
VikParuchuri Nov 14, 2024
2ab60e4
add text extraction
iammosespaulr Nov 14, 2024
983d00f
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 14, 2024
c564341
drop lines and spans from the provider if we detect bad text [skip ci]
iammosespaulr Nov 15, 2024
afe6295
add OCR builder and tests
iammosespaulr Nov 15, 2024
3e5a020
fix tests
iammosespaulr Nov 15, 2024
58cd73e
update test [skip ci]
iammosespaulr Nov 15, 2024
76ea3e5
Add simple line and span renderer, add blocktype class
VikParuchuri Nov 15, 2024
035e3a7
Merge pull request #357 from VikParuchuri/vik_v2
VikParuchuri Nov 15, 2024
c119918
Merge remote-tracking branch 'origin/vik_v2' into dev-mose/marker-v2
iammosespaulr Nov 15, 2024
b04e7b4
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 15, 2024
2d6256c
Add markdown renderer, swap how ids are named
VikParuchuri Nov 15, 2024
7b0adc2
Merge pull request #358 from VikParuchuri/vik_v2
VikParuchuri Nov 15, 2024
7557251
update tests etc
iammosespaulr Nov 15, 2024
22805ae
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 15, 2024
a748e23
Fix markdown output
VikParuchuri Nov 15, 2024
81092a6
Clean up renderers, fix output
VikParuchuri Nov 15, 2024
7556d53
Speed up queries
VikParuchuri Nov 15, 2024
a479f94
switch to blocktype enums
iammosespaulr Nov 16, 2024
99b6c93
fix structure test
iammosespaulr Nov 16, 2024
c7c4a73
fix line block type
iammosespaulr Nov 16, 2024
f32ea04
save state
iammosespaulr Nov 16, 2024
70c4734
Merge remote-tracking branch 'origin/vik_v2' into dev-mose/marker-v2
iammosespaulr Nov 16, 2024
5e49728
missing batch sizes [skip ci]
iammosespaulr Nov 16, 2024
be779cc
get everything working and tests to pass
iammosespaulr Nov 16, 2024
4f731a0
cleanup and all fixes [skip ci]
iammosespaulr Nov 16, 2024
ac5d9f8
add garbled pdf test [skip ci]
iammosespaulr Nov 16, 2024
e95e1ab
use run_ocr [skip ci]
iammosespaulr Nov 16, 2024
ced0c5d
remove force_ocr from document builder [skip ci]
iammosespaulr Nov 16, 2024
373fab0
fix bugs and add typing [skip ci]
iammosespaulr Nov 17, 2024
82f589d
speed up get_blocks [skip ci]
iammosespaulr Nov 17, 2024
fb3a8e7
Merge pull request #359 from VikParuchuri/vik_v2
VikParuchuri Nov 17, 2024
1fde6b6
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 18, 2024
a66451b
[skip ci]
iammosespaulr Nov 18, 2024
ec9d5a8
assert ID's match when retrieving blocks [skip ci]
iammosespaulr Nov 18, 2024
9d06b12
reduce threshold and remove dead code [skip ci]
iammosespaulr Nov 18, 2024
028ef50
refactor block merging [skip ci]
iammosespaulr Nov 18, 2024
be91572
Output images, clean up other output formats
VikParuchuri Nov 18, 2024
19d313c
update pytest ini and conftest to ignore warnings and fix pdf provide…
iammosespaulr Nov 18, 2024
d9d991b
fix table_processor test [skip ci]
iammosespaulr Nov 18, 2024
e662972
Merge pull request #356 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 18, 2024
706bda3
Merge consecutive output tags
VikParuchuri Nov 18, 2024
35dbba1
Merge in v2 changes
VikParuchuri Nov 18, 2024
f4ff48b
Merge pull request #362 from VikParuchuri/vik_v2
VikParuchuri Nov 18, 2024
2fcf587
Add section header processing
VikParuchuri Nov 18, 2024
6bd4e7d
make tests much faster and cleanup [skip ci]
iammosespaulr Nov 18, 2024
aa26320
Add pagination, refactor config
VikParuchuri Nov 18, 2024
56e0f7a
Merge remote-tracking branch 'origin/vik_v2' into dev-mose/marker-v2
iammosespaulr Nov 18, 2024
44d3a4c
Merge pull request #364 from VikParuchuri/vik_v2
VikParuchuri Nov 18, 2024
5df6001
add config to all builders and providers [skip ci]
iammosespaulr Nov 18, 2024
8f65acf
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 18, 2024
6d8e180
Merge pull request #363 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 18, 2024
4a08718
update deps
iammosespaulr Nov 18, 2024
9a793c8
add pytest dev dep, and add CI test workflow
iammosespaulr Nov 18, 2024
3dbbf74
Merge pull request #366 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 18, 2024
8bd872b
Add debug utils, fix output quality issues
VikParuchuri Nov 18, 2024
3e7c4f3
Merge pull request #367 from VikParuchuri/vik_v2
VikParuchuri Nov 18, 2024
ffa3bbc
Reorganize tests
VikParuchuri Nov 18, 2024
b674363
static registry + ability to override nodes
iammosespaulr Nov 18, 2024
c479d53
add test for overriding config and update pyproj deps to include mark…
iammosespaulr Nov 18, 2024
8c71b35
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 18, 2024
ead37b3
use get_block_class everywhere, make registry mp compatible
iammosespaulr Nov 19, 2024
52e4d3f
add multiprocessing tests for block overriding
iammosespaulr Nov 19, 2024
10f6ed0
Merge pull request #368 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 19, 2024
b191e17
Merge in v2
VikParuchuri Nov 19, 2024
f89089c
Merge pull request #369 from VikParuchuri/vik_v2
VikParuchuri Nov 19, 2024
20fec48
fix typo [skip ci]
iammosespaulr Nov 19, 2024
d4564ac
fix block_type in BaseProcessor [skip ci]
iammosespaulr Nov 19, 2024
efcfc24
fixes to debug processors and pdf converter [skip ci]
iammosespaulr Nov 19, 2024
6fdfd97
Merge pull request #370 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 19, 2024
59b6224
Initial chunk JSON output
VikParuchuri Nov 19, 2024
70963a1
Merge v2 changes
VikParuchuri Nov 19, 2024
4fbae5f
initialize registry overrides in the worker [skip ci]
iammosespaulr Nov 19, 2024
43cbd2c
Compute TOC, fix image output
VikParuchuri Nov 19, 2024
4591f31
Add json renderer tests
VikParuchuri Nov 19, 2024
bd18169
Merge pull request #371 from VikParuchuri/vik_v2
VikParuchuri Nov 19, 2024
71648d1
initial text processor [skip ci]
iammosespaulr Nov 19, 2024
8291f57
Add page stats
VikParuchuri Nov 19, 2024
7b2b3d8
Add toc processor, refactor script
VikParuchuri Nov 19, 2024
cc08b4b
Fix issues with spans/lines and providers
VikParuchuri Nov 19, 2024
b8301ee
Fix tests
VikParuchuri Nov 19, 2024
a9e11a6
add line joining logic across pages and columns
iammosespaulr Nov 19, 2024
7e449a1
Fix ocr heuristics, somewhat fix code
VikParuchuri Nov 19, 2024
a72a508
Misc fixes
VikParuchuri Nov 19, 2024
7b817ff
Merge pull request #372 from VikParuchuri/vik_v2
VikParuchuri Nov 19, 2024
28de11a
better heuristics and check next blocks across page boundaries [skip ci]
iammosespaulr Nov 19, 2024
88fbb8e
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 19, 2024
b909e01
parameterize threshold and fix tests [skip ci]
iammosespaulr Nov 19, 2024
e9f8352
fix structure checking [skip ci]
iammosespaulr Nov 19, 2024
0157e2f
fixes and cleanup
iammosespaulr Nov 19, 2024
0286229
Add code processor, fix issues with structure
VikParuchuri Nov 19, 2024
59aac27
fix section header processor and line count threshold [skip ci]
iammosespaulr Nov 19, 2024
744e02f
Add block fallbacks
VikParuchuri Nov 19, 2024
bf9199f
Dedup page numbers
VikParuchuri Nov 20, 2024
5cffaaf
update continuation heuristic
iammosespaulr Nov 20, 2024
d9352be
Merge pull request #375 from VikParuchuri/vik_v2
VikParuchuri Nov 20, 2024
99c5f86
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 20, 2024
bb44846
add some tolerance by rounding down to the nearest int for indent che…
iammosespaulr Nov 20, 2024
dd4db58
fix extra space in <p> tags [skip ci]
iammosespaulr Nov 20, 2024
86c5234
clean up logic and add heuristic to check if the next text block is i…
iammosespaulr Nov 20, 2024
1dd3440
clean up
iammosespaulr Nov 20, 2024
42dac94
more cleanup [skip ci]
iammosespaulr Nov 20, 2024
e0c6ff3
fix thinko
iammosespaulr Nov 20, 2024
c27664a
fix ceil
iammosespaulr Nov 20, 2024
175747c
merge new_block indentation check logic
iammosespaulr Nov 20, 2024
9503e9e
fix column break logic [skip ci]
iammosespaulr Nov 20, 2024
1e157c1
column gap tolerances and gap ratio of 2%
iammosespaulr Nov 20, 2024
c6693c5
Merge pull request #373 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 20, 2024
d62da89
pdf converter initialization + tests
iammosespaulr Nov 20, 2024
8a542cf
Merge remote-tracking branch 'origin/v2' into dev-mose/marker-v2
iammosespaulr Nov 20, 2024
36102d3
update converter interface
iammosespaulr Nov 20, 2024
124b1de
pass in model_dict instead of model_lst
iammosespaulr Nov 20, 2024
25a0ed7
Merge pull request #379 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 20, 2024
9263efc
Initial integration
VikParuchuri Nov 20, 2024
4945edd
Wire up convert_single
VikParuchuri Nov 20, 2024
95d19d0
Update convert script
VikParuchuri Nov 21, 2024
52df53e
Clean up benchmarks
VikParuchuri Nov 21, 2024
7d3de8c
Fix poetry lock
VikParuchuri Nov 21, 2024
c6204ee
Merge pull request #380 from VikParuchuri/vik_v2
VikParuchuri Nov 21, 2024
ec0fe7e
Fix tests
VikParuchuri Nov 21, 2024
c61f195
Merge pull request #381 from VikParuchuri/vik_v2
VikParuchuri Nov 21, 2024
7442256
disable multiprocessing via config
iammosespaulr Nov 21, 2024
44f9eb7
table ocr -> table recognition changes [skip ci]
iammosespaulr Nov 21, 2024
dd83ad1
add printer for config crawling for all processors, builders and conv…
iammosespaulr Nov 21, 2024
39f0999
more class docstrings [skip ci]
iammosespaulr Nov 21, 2024
41f049e
oops typo [skip ci]
iammosespaulr Nov 21, 2024
3b383f8
add docstrings for all the processors
iammosespaulr Nov 21, 2024
72ba6ab
fix help [skip ci]
iammosespaulr Nov 21, 2024
78fd0a7
Fix broken text
VikParuchuri Nov 21, 2024
243ae0b
Merge pull request #382 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 21, 2024
0000792
Review comments
VikParuchuri Nov 21, 2024
44e0322
Minor tweaks
VikParuchuri Nov 21, 2024
1d98e6e
Merge branch 'v2' into vik_v2
VikParuchuri Nov 21, 2024
4dbe9c5
Fix test
VikParuchuri Nov 21, 2024
a0f0ea6
Merge remote-tracking branch 'origin/vik_v2' into vik_v2
VikParuchuri Nov 21, 2024
8d5459f
Merge pull request #383 from VikParuchuri/vik_v2
VikParuchuri Nov 21, 2024
1575e78
Fix marker app
VikParuchuri Nov 21, 2024
c0c0e0c
Merge pull request #384 from VikParuchuri/vik_v2
VikParuchuri Nov 21, 2024
e461cb6
Fix marker server
VikParuchuri Nov 21, 2024
fdb5564
Fix pdftext workers config
VikParuchuri Nov 21, 2024
b721d11
fix batch sizes [skip ci]
iammosespaulr Nov 21, 2024
fc34530
Merge pull request #385 from VikParuchuri/vik_v2
VikParuchuri Nov 21, 2024
b98e0a3
pass in detection batch size appropriately for ocr_extraction [skip ci]
iammosespaulr Nov 21, 2024
1156b68
switch to skip_existing and fix html format outputs
iammosespaulr Nov 21, 2024
e5aa2a5
fix bugs and incr num workers
iammosespaulr Nov 21, 2024
07238e3
fix text encoding issue
iammosespaulr Nov 21, 2024
726be6a
another structure noneType fix
iammosespaulr Nov 21, 2024
a045e17
update poetry lock + use ftfy
iammosespaulr Nov 21, 2024
0afb89a
add skip existing to convert.p
iammosespaulr Nov 21, 2024
d043dc6
Start rewriting the README
VikParuchuri Nov 22, 2024
c38aa41
import ordering [skip ci]
iammosespaulr Nov 23, 2024
364a440
fix hyphenation and remove extra whitespace on the last line [skip ci]
iammosespaulr Nov 25, 2024
a98c6e3
bugfix space [skip ci]
iammosespaulr Nov 25, 2024
93a4fa4
handle hyphenated newlines in the middle of spans [skip ci]
iammosespaulr Nov 25, 2024
69cf00a
Fix footnotes
VikParuchuri Nov 25, 2024
75aca96
Bugfixes
VikParuchuri Nov 25, 2024
a945a5a
structure fixes [skip ci]
iammosespaulr Nov 25, 2024
7d20055
Add debug mode to marker app
VikParuchuri Nov 25, 2024
390c9c5
Continue working on README
VikParuchuri Nov 25, 2024
a3fd0cc
ignoretext upgrades, include streak threshold [skip ci]
iammosespaulr Nov 25, 2024
3a42ea8
Fix bug with out of order OCR lines
VikParuchuri Nov 25, 2024
6ec2959
Bump layout2
VikParuchuri Nov 25, 2024
4ea61c6
ignoretext handle empty blocks
iammosespaulr Nov 25, 2024
26328ff
fix caption merging
iammosespaulr Nov 25, 2024
96d1b81
Merge pull request #386 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 26, 2024
69233b7
Merge branch 'v2' into vik_v2
VikParuchuri Nov 26, 2024
f26834c
Merge pull request #387 from VikParuchuri/vik_v2
VikParuchuri Nov 26, 2024
34ece17
update tests
iammosespaulr Nov 26, 2024
e2f55da
Merge pull request #388 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 26, 2024
28816ef
ignore empty headers [skip ci]
iammosespaulr Nov 26, 2024
eda0738
more ignoretext and line merging upgrades
iammosespaulr Nov 26, 2024
b417dc6
add page header processor and clean up text processor
iammosespaulr Nov 26, 2024
18c64b2
filter out width 0 lines from heuristics
iammosespaulr Nov 26, 2024
360126e
Fix various bugs
VikParuchuri Nov 26, 2024
1203e26
fix bug [skip ci]
iammosespaulr Nov 26, 2024
4373087
Add in missing text
VikParuchuri Nov 26, 2024
37ee636
Fix logic and tests
VikParuchuri Nov 26, 2024
674d57f
Merge pull request #390 from VikParuchuri/dev-mose/marker-v2
VikParuchuri Nov 26, 2024
3d3d4c5
Merge branch 'v2' into vik_v2
VikParuchuri Nov 26, 2024
d1fb7bf
Add examples
VikParuchuri Nov 26, 2024
422ed4b
Merge remote-tracking branch 'origin/vik_v2' into vik_v2
VikParuchuri Nov 26, 2024
2b9560e
Merge pull request #391 from VikParuchuri/vik_v2
VikParuchuri Nov 26, 2024
c9ea515
Update dependencies
VikParuchuri Nov 26, 2024
c78f4af
Merge pull request #392 from VikParuchuri/v2
VikParuchuri Nov 26, 2024
814b80b
fix last line hyphenated check [skip ci]
iammosespaulr Nov 27, 2024
bd95194
Improve comparison performance
VikParuchuri Nov 27, 2024
cbdf29b
Fix test failures
VikParuchuri Nov 27, 2024
12950fd
Merge remote-tracking branch 'origin/dev' into v2_fixes
VikParuchuri Nov 27, 2024
58dfbe9
Fix for bbox sizing
VikParuchuri Nov 27, 2024
a821f3d
Zero height box fix
VikParuchuri Nov 27, 2024
bbcb1bd
fix tests
VikParuchuri Nov 27, 2024
d4d8b0d
Fix OCR thresholds
VikParuchuri Nov 27, 2024
9626ffd
fix line merging area threshold [skip ci]
iammosespaulr Nov 27, 2024
90d779c
Fix OCR thresholds
VikParuchuri Nov 27, 2024
70da706
Merge remote-tracking branch 'origin/v2_fixes' into v2_fixes
iammosespaulr Nov 27, 2024
bbe9624
put fix text in the right spot
VikParuchuri Nov 27, 2024
f7955a8
Merge remote-tracking branch 'origin/v2_fixes' into v2_fixes
VikParuchuri Nov 27, 2024
c6e8190
Add examples
VikParuchuri Nov 27, 2024
2ae81d7
Merge pull request #394 from VikParuchuri/v2_fixes
VikParuchuri Nov 27, 2024
9ced3f0
Bump tabled and surya
VikParuchuri Nov 27, 2024
2e5cc03
Merge branch 'master' into dev
VikParuchuri Nov 27, 2024
@@ -7,7 +7,7 @@ env:
   OCR_ENGINE: "surya"

 jobs:
-  build:
+  benchmark:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
@@ -28,7 +28,7 @@ jobs:
       - name: Run benchmark test
         run: |
           poetry run python benchmarks/overall.py benchmark_data/pdfs benchmark_data/references report.json
-          poetry run python scripts/verify_benchmark_scores.py report.json --type marker
+          poetry run python benchmarks/verify_scores.py report.json --type marker



25 changes: 25 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,25 @@
+name: CI tests
+
+on: [push]
+
+env:
+  TORCH_DEVICE: "cpu"
+  OCR_ENGINE: "surya"
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.11
+      - name: Install python dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Run tests
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: poetry run pytest
5 changes: 5 additions & 0 deletions .gitignore
@@ -9,6 +9,9 @@ wandb
 report.json
 benchmark_data
 debug_data
+temp.md
+temp
+conversion_results
 uploads

 # Byte-compiled / optimized / DLL files
@@ -171,3 +174,5 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
+
+.vscode/
311 changes: 186 additions & 125 deletions README.md

Large diffs are not rendered by default.

89 changes: 31 additions & 58 deletions benchmarks/overall.py
@@ -1,42 +1,26 @@
-import argparse
 import tempfile
 import time
 from collections import defaultdict

+import click
 from tqdm import tqdm
 import pypdfium2 as pdfium

-from marker.convert import convert_single_pdf
+from marker.config.parser import ConfigParser
+from marker.converters.pdf import PdfConverter
 from marker.logger import configure_logging
-from marker.models import load_all_models
-from marker.benchmark.scoring import score_text
-from marker.pdf.extract_text import naive_get_text
+from marker.models import create_model_dict
+from pdftext.extraction import plain_text_output
 import json
 import os
 import subprocess
 import shutil
 from tabulate import tabulate
-import torch
+from scoring import score_text

 configure_logging()


-def start_memory_profiling():
-    torch.cuda.memory._record_memory_history(
-        max_entries=100000
-    )
-
-
-def stop_memory_profiling(memory_file):
-    try:
-        torch.cuda.memory._dump_snapshot(memory_file)
-    except Exception as e:
-        logger.error(f"Failed to capture memory snapshot {e}")
-
-    # Stop recording memory snapshot history.
-    torch.cuda.memory._record_memory_history(enabled=None)
-
-
 def nougat_prediction(pdf_filename, batch_size=1):
     out_dir = tempfile.mkdtemp()
     subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
@@ -46,62 +30,51 @@ def nougat_prediction(pdf_filename, batch_size=1):
     shutil.rmtree(out_dir)
     return data

-
-def main():
-    parser = argparse.ArgumentParser(description="Benchmark PDF to MD conversion. Needs source pdfs, and a reference folder with the correct markdown.")
-    parser.add_argument("in_folder", help="Input PDF files")
-    parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
-    parser.add_argument("out_file", help="Output filename")
-    parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-    # Nougat batch size 1 uses about as much VRAM as default marker settings
-    parser.add_argument("--marker_batch_multiplier", type=int, default=1, help="Batch size multiplier to use for marker when making predictions.")
-    parser.add_argument("--nougat_batch_size", type=int, default=1, help="Batch size to use for nougat when making predictions.")
-    parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
-    parser.add_argument("--profile_memory", action="store_true", help="Profile memory usage", default=False)
-
-    args = parser.parse_args()
-
+@click.command(help="Benchmark PDF to MD conversion.")
+@click.argument("in_folder", type=str)
+@click.argument("reference_folder", type=str)
+@click.argument("out_file", type=str)
+@click.option("--nougat", is_flag=True, help="Run nougat and compare")
+@click.option("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
+def main(in_folder: str, reference_folder: str, out_file: str, nougat: bool, md_out_path: str):
     methods = ["marker"]
-    if args.nougat:
+    if nougat:
         methods.append("nougat")

-    if args.profile_memory:
-        start_memory_profiling()
-
-    model_lst = load_all_models()
-
-    if args.profile_memory:
-        stop_memory_profiling("model_load.pickle")
+    model_dict = create_model_dict()

     scores = defaultdict(dict)
-    benchmark_files = os.listdir(args.in_folder)
+    benchmark_files = os.listdir(in_folder)
     benchmark_files = [b for b in benchmark_files if b.endswith(".pdf")]
     times = defaultdict(dict)
     pages = defaultdict(int)

     for idx, fname in tqdm(enumerate(benchmark_files)):
         md_filename = fname.rsplit(".", 1)[0] + ".md"

-        reference_filename = os.path.join(args.reference_folder, md_filename)
+        reference_filename = os.path.join(reference_folder, md_filename)
         with open(reference_filename, "r", encoding="utf-8") as f:
             reference = f.read()

-        pdf_filename = os.path.join(args.in_folder, fname)
+        pdf_filename = os.path.join(in_folder, fname)
         doc = pdfium.PdfDocument(pdf_filename)
         pages[fname] = len(doc)

+        config_parser = ConfigParser({"output_format": "markdown"})
         for method in methods:
             start = time.time()
             if method == "marker":
-                if args.profile_memory:
-                    start_memory_profiling()
-                full_text, _, out_meta = convert_single_pdf(pdf_filename, model_lst, batch_multiplier=args.marker_batch_multiplier)
-                if args.profile_memory:
-                    stop_memory_profiling(f"marker_memory_{idx}.pickle")
+                converter = PdfConverter(
+                    config=config_parser.generate_config_dict(),
+                    artifact_dict=model_dict,
+                    processor_list=None,
+                    renderer=config_parser.get_renderer()
+                )
+                full_text = converter(pdf_filename).markdown
             elif method == "nougat":
-                full_text = nougat_prediction(pdf_filename, batch_size=args.nougat_batch_size)
+                full_text = nougat_prediction(pdf_filename, batch_size=1)
             elif method == "naive":
-                full_text = naive_get_text(doc)
+                full_text = plain_text_output(doc, workers=1)
             else:
                 raise ValueError(f"Unknown method {method}")

@@ -110,13 +83,13 @@ def main():
             score = score_text(full_text, reference)
             scores[method][fname] = score

-            if args.md_out_path:
+            if md_out_path:
                 md_out_filename = f"{method}_{md_filename}"
-                with open(os.path.join(args.md_out_path, md_out_filename), "w+") as f:
+                with open(os.path.join(md_out_path, md_out_filename), "w+") as f:
                     f.write(full_text)

     total_pages = sum(pages.values())
-    with open(args.out_file, "w+") as f:
+    with open(out_file, "w+") as f:
         write_data = defaultdict(dict)
         for method in methods:
             total_time = sum(times[method].values())
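
The bookkeeping in the benchmark loop above can be tried without marker or any models installed. The sketch below is only an illustration under stated assumptions: `md_name`, `run_benchmark`, and `fake_marker` are hypothetical names that do not appear in the PR, and the per-file timing assignment is an assumption, since the lines that fill `times` are not shown in the diff.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable


def md_name(pdf_name: str) -> str:
    """Derive the reference markdown name the way the loop does: swap the last extension for .md."""
    return pdf_name.rsplit(".", 1)[0] + ".md"


def run_benchmark(files: Iterable[str], converters: Dict[str, Callable[[str], str]]):
    """Run every converter on every .pdf file and record output and wall-clock time.

    `converters` maps a method name to a callable; these are stand-ins for
    the real marker / nougat calls (assumption, not the PR's code).
    """
    outputs = defaultdict(dict)
    times = defaultdict(dict)
    for fname in files:
        if not fname.endswith(".pdf"):
            continue  # same filter as the benchmark_files list comprehension
        for method, convert in converters.items():
            start = time.time()
            outputs[method][fname] = convert(fname)
            # Assumed bookkeeping: elapsed seconds per method and file.
            times[method][fname] = time.time() - start
    return outputs, times


def fake_marker(path: str) -> str:
    """Hypothetical converter that returns placeholder markdown."""
    return f"# converted {path}"


outputs, times = run_benchmark(["a.pdf", "notes.txt", "b.v2.pdf"], {"marker": fake_marker})
total_time = sum(times["marker"].values())
```

Keeping the converters behind a plain `name -> callable` mapping mirrors the `methods` list in the script and makes it easy to swap in a real converter later.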
4 changes: 0 additions & 4 deletions marker/benchmark/scoring.py → benchmarks/scoring.py
@@ -1,8 +1,4 @@
-import math
-
 from rapidfuzz import fuzz
-import re
-import regex
 from statistics import mean

 CHUNK_MIN_CHARS = 25
File renamed without changes.
2 changes: 0 additions & 2 deletions chunk_convert.sh
@@ -36,8 +36,6 @@ for (( i=0; i<$NUM_DEVICES; i++ )); do
     export NUM_WORKERS
     echo "Running convert.py on GPU $DEVICE_NUM"
     cmd="CUDA_VISIBLE_DEVICES=$DEVICE_NUM marker $INPUT_FOLDER $OUTPUT_FOLDER --num_chunks $NUM_DEVICES --chunk_idx $DEVICE_NUM --workers $NUM_WORKERS"
-    [[ -n "$METADATA_FILE" ]] && cmd="$cmd --metadata_file $METADATA_FILE"
-    [[ -n "$MIN_LENGTH" ]] && cmd="$cmd --min_length $MIN_LENGTH"
     eval $cmd &

     sleep 5