overlapping regions #7

bertsky · 2021-02-01T16:32:27Z

Not sure if this is a bug at all. I've used your pretrained BBZ model to segment pages in similar data: Börsenblatt des Deutschen Buchhandels. These also have 2-column layouts besides the 3- and 4-column layouts of Berliner Börsenzeitung, and the advertisement parts look very different. But I assumed the domains are close enough for pages like this.

The bbz-segment results (via full Origami pipeline and compose --page-xml) do look very good in general. This is truly amazing work!

But some errors leave me puzzled:

[Image table omitted; columns: original | origami]

(Sorry, cannot get these to render with equal width in GFM...)

Here is what I don't understand:

How is it possible for the page segmentation to create overlapping regions? Since the model is basically a pixel classifier, it should force some flat partitioning. (Or is this really just about convex hull vs finding an alpha shape?)
Why did the table detector not pick up the full regular structure on the right? What would the cells and lines need to look like? (Or is this related to the perspective distortion? Or does it expect or get triggered by fg column separators?)
How is it even possible for Origami to create a PAGE-XML for the original image, if you even included page-level dewarping in between (i.e. how do you keep track of the coordinate system)?
Why are text lines sometimes split in the middle (vertically)? (Happens a couple of times per page.)
Why are text region outlines smaller than their constituent text line contours?
Why are text line contours overlapping each other? Would it be possible to get tight, non-overlapping polygonal contours?
Is there a way to export the separator regions, too? Similarly, would it be possible to represent tables as recursive TableRegions instead of recursive TextRegions?


poke1024 · 2021-02-02T10:14:00Z

How is it possible for the page segmentation to create overlapping regions? Since the model is basically a pixel classifier, it should force some flat partitioning. (Or is this really just about convex hull vs finding an alpha shape?)

The pixel classification is turned into polygonal regions, which then go through dilation and erosion operations. These operations can lead to overlaps. For a high-level overview, see http://ceur-ws.org/Vol-2723/long20.pdf
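To illustrate the effect, here is a minimal sketch using shapely. It is not Origami's actual set of operations; the shapes and the buffer distance are made up. Two regions that tile the page cleanly start to overlap as soon as each one is dilated:

```python
from shapely.geometry import box

# Two regions forming a flat partition, as a pixel classifier
# would yield them: they share an edge but no area.
left = box(0, 0, 100, 200)
right = box(100, 0, 200, 200)
shared_before = left.intersection(right).area

# Hypothetical dilation step: grow each polygon by 5 units.
left_dilated = left.buffer(5)
right_dilated = right.buffer(5)
shared_after = left_dilated.intersection(right_dilated).area

print(shared_before, shared_after > 0)
```

An erosion of the same size (a negative buffer) would undo the overlap in this simple case, but once other operations run in between, the result is no longer guaranteed to be a partition.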

The set of polygonal operations that finally lead to overlaps is defined here:

origami/origami/custom/layouts/bbz.py

Line 59 in 544485b

return Transformer([

Implementation of the different operations starts here:

origami/origami/batch/detect/layout.py

Line 310 in 544485b

class Transformer:

Why did the table detector not pick up the full regular structure on the right? What would the cells and lines need to look like? (Or is this related to the perspective distortion? Or does it expect or get triggered by fg column separators?)

Are we talking about the second example page? It looks to me like the pixel classifier already gets this wrong, i.e. classifies this as text. This would mean that our BBZ table training data did not generalize well for this case.

How is it even possible for Origami to create a PAGE-XML for the original image, if you even included page-level dewarping in between (i.e. how do you keep track of the coordinate system)?

The PAGE-XML is indeed output for the warped page, coordinates are transformed from dewarped into warped space for the export.

The dewarping transformation is basically a grid of dewarped points that models dewarping through linear interpolation and works both ways, i.e. warped -> dewarped and dewarped -> warped. So, for each regular grid point, there is one dewarped grid point, and this mapping of quadrilaterals defines the dewarping. This mapping is available at all post-dewarping stages of the pipeline (it's saved into a separate file).
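As a rough illustration of the idea (not the code in dewarp.py), the following sketch maps a point through such a grid by bilinear interpolation inside one cell. The function name, the grid layout, and the cell size are assumptions for the example, and only one direction is shown; the opposite direction would have to find the containing quadrilateral and invert the interpolation.

```python
def map_point(grid, cell_size, x, y):
    """Map (x, y) from the regular grid's space to the space of the
    stored grid points, by bilinear interpolation within one cell.

    grid[j][i] is the mapped position of the regular grid node at
    (i * cell_size, j * cell_size).
    """
    rows, cols = len(grid), len(grid[0])
    # Locate the cell; clamp so border points fall into a valid cell.
    i = min(max(int(x // cell_size), 0), cols - 2)
    j = min(max(int(y // cell_size), 0), rows - 2)
    # Fractional position inside the cell.
    u = x / cell_size - i
    v = y / cell_size - j
    p00, p10 = grid[j][i], grid[j][i + 1]
    p01, p11 = grid[j + 1][i], grid[j + 1][i + 1]
    return tuple(
        (1 - u) * (1 - v) * a + u * (1 - v) * b + (1 - u) * v * c + u * v * d
        for a, b, c, d in zip(p00, p10, p01, p11)
    )


# One cell of size 100 whose right edge is shifted down by 10.
grid = [
    [(0.0, 0.0), (100.0, 10.0)],
    [(0.0, 100.0), (100.0, 110.0)],
]
center = map_point(grid, 100, 50, 50)
print(center)  # (50.0, 55.0)
```

Transforming a polygon for export then amounts to mapping each of its vertices through the grid.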

Implementation is at https://github.com/poke1024/origami/blob/master/origami/core/dewarp.py where the Transformer class implements the actual interpolation, see

origami/origami/core/dewarp.py

Line 143 in 544485b

class Transformer:

.

Why are text lines sometimes split in the middle (vertically)? (Happens a couple of times per page.)

This is probably related to fine tuning of polygonal operations (also see questions below). Either the constituent text line polygons do not get merged in the first place (you might want to look into segment.zip which contains the raw pixel classifier output as png, image ratio is wrong there), or they get merged and then get split again by an operation called FixSpillOverH (see

origami/origami/batch/detect/layout.py

Line 928 in 544485b

class FixSpillOverH(FixSpillOver):

and

origami/origami/custom/layouts/bbz.py

Line 59 in 544485b

return Transformer([

), which tries to find whitespace columns in regions and splits along them to fix spillover from the pixel classification. Removing that FixSpillOverH (or changing its parameters) from the Transformer in

origami/origami/custom/layouts/bbz.py

Line 59 in 544485b

return Transformer([

might fix this. The idea of FixSpillOverH is that sometimes, in the pixel classifier, blocks do get merged which should not, and this tries to fix it - but sometimes it fixes too much.
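The general technique of looking for whitespace columns can be sketched as follows. This is a simplified stand-in, not the FixSpillOverH implementation; the function name and the min_width parameter are hypothetical. It sums the ink in each pixel column of a region and reports interior runs of empty columns that are wide enough to split along:

```python
def find_whitespace_columns(mask, min_width):
    """Return (start, end) x-ranges of interior runs of empty columns.

    mask is a list of rows with 1 for ink and 0 for background.
    Runs touching the left or right border are ignored, since
    they are margins rather than gaps between blocks.
    """
    ink_per_column = [sum(col) for col in zip(*mask)]
    width = len(ink_per_column)
    gaps = []
    x = 0
    while x < width:
        if ink_per_column[x] == 0:
            start = x
            while x < width and ink_per_column[x] == 0:
                x += 1
            interior = start > 0 and x < width
            if interior and x - start >= min_width:
                gaps.append((start, x))
        else:
            x += 1
    return gaps


mask = [
    [1, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 1],
]
gaps = find_whitespace_columns(mask, min_width=3)
print(gaps)  # [(3, 6)]
```

With a small threshold, a wide gap between two words on every line of a short region looks just like a column gutter, which is one way such a fix can split too much.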

Why are text region outlines smaller than their constituent text line contours?

This is a good question. My best ad hoc guess is the --contours-buffer in

origami/origami/batch/detect/lines.py

Line 158 in 544485b

@click.option(

which expands text line contours by some amount.

Why are text line contours overlapping each other? Would it be possible to get tight, non-overlapping polygonal contours?

Yes. The place in the code to experiment with is

origami/origami/custom/layouts/bbz.py

Line 59 in 544485b

return Transformer([

where you might want to remove operations or add a full overlap merge operation.

Instead of the current implementation you could use a

Transformer([
	OverlapMerger(0)
])

which means merging all overlapping regions (starting at any overlap > 0) and not doing any dilations or erosions. This might be worth experimenting with.
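For intuition, merging all overlapping regions can be sketched with plain shapely as below. This is not Origami's OverlapMerger; in particular, treating the threshold as an intersection area is an assumption made for the example. Polygons are grouped transitively (A overlaps B, B overlaps C) and each group is replaced by its union:

```python
from shapely.geometry import box
from shapely.ops import unary_union


def merge_overlapping(polygons, min_overlap_area=0.0):
    """Union all polygons connected by overlaps above the threshold."""
    parent = list(range(len(polygons)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(polygons)):
        for b in range(a + 1, len(polygons)):
            shared = polygons[a].intersection(polygons[b]).area
            if shared > min_overlap_area:
                parent[find(a)] = find(b)

    groups = {}
    for idx, poly in enumerate(polygons):
        groups.setdefault(find(idx), []).append(poly)
    return [unary_union(members) for members in groups.values()]


regions = [box(0, 0, 10, 10), box(8, 0, 20, 10), box(30, 0, 40, 10)]
merged = merge_overlapping(regions)
print(len(merged))  # 2
```

Regions that merely touch along an edge have zero shared area and stay separate here, so the output polygons no longer overlap each other.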

The current default set of operations is fine-tuned towards some border cases encountered in the BBZ layout.

Is there a way to export the separator regions, too? Similarly, would it be possible to represent tables as recursive TableRegions instead of recursive TextRegions?

Not in the API or exports at this point, but after running the "contours" stage, you can unzip contours.0.zip (in the .out folder that Origami created in your docs folder) which contains separators/H and separators/V folders, which contain all separators as polygonal wkt files (read using shapely.wkt.loads). Coordinates are in warped space at this point.
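A minimal sketch for reading these files straight from the archive, without unzipping first. The helper name is hypothetical, and it assumes the separators/H and separators/V folders sit at the archive root and that every file in them holds one WKT geometry:

```python
import zipfile

from shapely import wkt


def load_separators(zip_file):
    """Return {"H": [...], "V": [...]} with geometries parsed from WKT."""
    separators = {"H": [], "V": []}
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            for kind in separators:
                if name.startswith(f"separators/{kind}/"):
                    text = archive.read(name).decode("utf-8")
                    separators[kind].append(wkt.loads(text))
    return separators
```

Calling load_separators on the contours.0.zip of a page would then give two lists of shapely geometries, which still need to be treated as warped-space coordinates.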

In terms of PageXML export, there is some simple support for exporting TableRegions (see

origami/origami/batch/detect/compose.py

Line 145 in 544485b

class TableRegion:

), but this assumes that the earlier stages and the pixel classifier classify a region as table, which might go wrong in this use case.

I would need to look into this in more detail to give a better answer.

poke1024 added the documentation Improvements or additions to documentation label Feb 3, 2021
