Retrieve bounding box of text on a page #5643
Comments
You would want to use getTextContent() https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L812 instead -- it is used by TextLayerBuilder. TextItem has transform and width/height https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L567. Does this help?
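A minimal sketch of how the items from getTextContent() could be combined into one page-wide box, which is what the margin question needs. It assumes unrotated text, so that transform[4] and transform[5] give the item's origin and width/height are in PDF user units. The helper name pageTextBounds is hypothetical and not part of pdf.js.

```javascript
// Union of the boxes of all text items on a page, in PDF user units.
// ASSUMPTION: text is not rotated, so (transform[4], transform[5]) is treated
// as the bottom-left corner of each item (descenders below the baseline are ignored).
// pageTextBounds is a hypothetical helper, not a pdf.js function.
function pageTextBounds(items) {
  let bounds = null;
  for (const item of items) {
    // Skip items that contain no visible characters.
    if (!item.str || item.str.trim() === "") continue;
    const left = item.transform[4];
    const bottom = item.transform[5];
    const right = left + item.width;
    const top = bottom + item.height;
    if (bounds === null) {
      bounds = { left, bottom, right, top };
    } else {
      bounds.left = Math.min(bounds.left, left);
      bounds.bottom = Math.min(bounds.bottom, bottom);
      bounds.right = Math.max(bounds.right, right);
      bounds.top = Math.max(bounds.top, top);
    }
  }
  return bounds; // null when the page has no visible text
}
```

With a loaded page this might be called as page.getTextContent().then(tc => pageTextBounds(tc.items)). The margins then follow by comparing the result with the page's own box.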
That's not a use case that is needed for the PDF viewer (yet). You have to implement it on your side.
@yurydelendik Thanks, that'll do the trick.
I just found CSS3's |
Is the unit of the translations is |
PDF user units; you have to use PageViewport to map them to the screen presentation, see e.g. https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159
Hm, I can't seem to get it right. The transform property on the individual element reads
where I don't know how to interpret it. I'm certainly not supposed to stretch the element by a factor of almost 10, right? Curiously, the |
Sorry, we did not extend the documentation of getTextContent that far (hoping we can improve the API). Currently we operate under the assumption that users of the advanced features will have some knowledge of computer graphics. Let me help, based on the https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159 code, and let's start from a simple example:
But for your task you need to take into account all four corners of the rectangle, and the tx calculation must be performed a little differently. It's unfortunate that width and height are scaled by fontHeight, but you can easily revert that and calculate the positions of [0,0], [width/fontHeight,0], [0,height/fontHeight], and [width/fontHeight, height/fontHeight] for the bounds calculation.
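The four-corner calculation described above could be sketched as follows. This is an illustration under stated assumptions, not the viewer's own code: fontHeight is taken to be the length of the second column of the item's transform, the matrices use the six-number PDF order [a, b, c, d, e, f], and the names multiply, applyToPoint and textItemBounds are hypothetical helpers.

```javascript
// Multiply two affine matrices in PDF order [a, b, c, d, e, f].
// The result applies m2 first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map a point through a matrix: x' = a*x + c*y + e, y' = b*x + d*y + f.
function applyToPoint(m, x, y) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// item: { transform, width, height } as found in textContent.items.
// viewportTransform: six-number matrix mapping PDF user units to screen pixels.
function textItemBounds(item, viewportTransform) {
  // ASSUMPTION: fontHeight is the length of the transform's second column.
  const fontHeight = Math.hypot(item.transform[2], item.transform[3]);
  // Undo the fontHeight scaling of width and height, as suggested in the thread.
  const w = item.width / fontHeight;
  const h = item.height / fontHeight;
  const m = multiply(viewportTransform, item.transform);
  // Transform all four corners so that rotated text is handled too.
  const corners = [[0, 0], [w, 0], [0, h], [w, h]].map(([x, y]) => applyToPoint(m, x, y));
  const xs = corners.map((p) => p[0]);
  const ys = corners.map((p) => p[1]);
  // Screen coordinates: y grows downwards, so "top" is the smallest y.
  return {
    left: Math.min(...xs),
    top: Math.min(...ys),
    right: Math.max(...xs),
    bottom: Math.max(...ys),
  };
}
```

In practice the second argument would presumably be viewport.transform of the PageViewport used for rendering. Taking the minimum and maximum over all four corners keeps the box correct when the page or the text is rotated.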
Aha, I didn't know that. How do I retrieve the font height of an object? Does that coincide with |
From https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164: |
When inspecting
while the actual translation in x-direction exceeds 100px. Any idea what might be causing this? |
I created a little plunk at http://plnkr.co/edit/m9oPJg80XHeeQ1nquxkz which confirms exactly what you're saying, @yurydelendik. However, in my full application, I still get the wrong offsets. I'm investigating why, and it might be due to the |
@nschloe it seems like you did some work on text extraction. How far did you get?
Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in the textContent.item[x].transform doesn't make any sense to me because it sets a scale in both the x and y directions equal to font height. I don't know why you'd ever want to do canvas operations in those units. But that's beside the point.) The numbers that matter are:
In order to do any operations on the canvas using these values, you have to (1) fix the
Where
Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There's probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what's going on. :) Hope that helps.
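A rough sketch of the "add the height, then scale, then fix for origin" recipe, for anyone who wants to try it. It is a reconstruction under assumptions rather than the code from the comment above: the page is unrotated, x and y are read from transform[4] and transform[5], the same scale applies in both directions, and canvasHeight is the canvas height in pixels. The function name canvasRectForItem is hypothetical.

```javascript
// Convert a text item's box into the (x, y, width, height) that strokeRect expects.
// ASSUMPTIONS: unrotated page, uniform scale, canvasHeight in pixels.
// canvasRectForItem is a hypothetical helper, not part of pdf.js.
function canvasRectForItem(item, canvasHeight, scale) {
  const x = item.transform[4]; // left edge in PDF user units
  const y = item.transform[5]; // bottom edge in PDF user units (origin is bottom left)
  return {
    x: x * scale,
    // Add the height to reach the top edge, scale, then flip for the top-left canvas origin.
    y: canvasHeight - (y + item.height) * scale,
    width: item.width * scale,
    height: item.height * scale,
  };
}
```

The result can be passed straight to a 2D context, e.g. ctx.strokeRect(r.x, r.y, r.width, r.height). For rotated pages this shortcut no longer holds and the full matrix has to be applied to all four corners.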
For others digging around for what the transform means: the components of a transformation matrix are described on page 142 of the reference:
(there's an accompanying chart in the reference as well) And the vector itself is defined thus:
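For readers following along, here is how a six-number matrix acts on a point under the PDF convention. It is written from the general convention for [a b c d e f] matrices, and it is an assumption on my part that the transform arrays in text items follow the same order.

```latex
\begin{bmatrix} x' & y' & 1 \end{bmatrix}
=
\begin{bmatrix} x & y & 1 \end{bmatrix}
\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix},
\qquad
\begin{aligned}
x' &= a\,x + c\,y + e \\
y' &= b\,x + d\,y + f
\end{aligned}
```

Read this way, e and f are the translation, while a, b, c and d carry scaling and rotation. For unrotated text, a and d are therefore simply the horizontal and vertical scale.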
Closing since this is answered now.
I have a use case where I would like to get the bounding box of each word within an item. For example, the str of an item can be "hello world!", but the transformation only gives the coordinates of the entire string. In my use case, I would like to get the coordinates of each of "hello", "world", and "!". These words are not selected or highlighted.
@haijun-ucsd have you found a way to achieve this?
I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of their div children (which represent rows of text). That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?