File names #26

Closed
finkf opened this issue Dec 6, 2018 · 2 comments · Fixed by #48

@finkf

finkf commented Dec 6, 2018

There are three files, {0006,0007,0008}.xml, that all belong to the same filegroup gt. If I run ocrd-tesserocr-recognize on the filegroup gt with the output filegroup tess, recognize looks up the files of the input filegroup in the mets.xml file. If for some reason the files are not returned in numerical order (perhaps because they were not added to the workspace in numerical order), for example as 0007, 0008, 0006, recognize generates the files tess-0001.xml (from 0007.xml), tess-0002.xml (from 0008.xml) and tess-0003.xml (from 0006.xml).

This destroys the mapping between the gt pages and the OCR output pages.

A simple solution would be to use:

self.workspace.add_file(
  ID=ID,
  file_grp=self.output_file_grp,
  basename=self.output_file_grp + '-' + os.path.basename(input_file.url),
  mimetype=MIMETYPE_PAGE,
  content=to_xml(pcgts),
)

to add the new files to the workspace.
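
To illustrate the difference, here is a minimal standalone sketch. It does not use the OCR-D API at all; the helper names `counter_names` and `basename_names` are made up for this example, and the counter-based scheme is only assumed to mirror the behaviour described above.

```python
import os


def counter_names(input_urls, output_grp):
    """Name outputs by their position in the returned list (assumed current behaviour)."""
    return ["%s-%04d.xml" % (output_grp, n)
            for n, _ in enumerate(input_urls, start=1)]


def basename_names(input_urls, output_grp):
    """Name outputs after the input file's basename (the proposal above)."""
    return [output_grp + "-" + os.path.basename(url) for url in input_urls]


# Files returned out of numerical order, as in the report.
urls = ["gt/0007.xml", "gt/0008.xml", "gt/0006.xml"]

print(counter_names(urls, "tess"))
# ['tess-0001.xml', 'tess-0002.xml', 'tess-0003.xml']
print(basename_names(urls, "tess"))
# ['tess-0007.xml', 'tess-0008.xml', 'tess-0006.xml']
```

With the basename-based names, each output file can be traced back to its input page regardless of the order in which the files are returned.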

@kba
Member

kba commented Dec 6, 2018

Sounds reasonable. I would change the ID though, so the mapping between filename and ID is retained.
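
Something along these lines might work, as a rough sketch only: derive both the ID and the basename from the input file's stem so they stay in sync. The helper `output_id_and_basename` and the `<group>_<stem>` ID pattern are assumptions for illustration, not an existing convention in the code.

```python
import os


def output_id_and_basename(input_url, output_grp):
    """Build a matching (ID, basename) pair from the input file's name.

    Hypothetical helper: the ID pattern '<group>_<stem>' is an assumption.
    """
    stem, ext = os.path.splitext(os.path.basename(input_url))
    file_id = "%s_%s" % (output_grp, stem)
    basename = "%s-%s%s" % (output_grp, stem, ext)
    return file_id, basename


print(output_id_and_basename("gt/0007.xml", "tess"))
# ('tess_0007', 'tess-0007.xml')
```

The returned pair could then be passed as `ID` and `basename` to the `add_file` call shown above.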

@finkf
Author

finkf commented Dec 6, 2018

That's OK as well.
