Retrieve bounding box of text on a page #5643

Closed
nschloe opened this issue Jan 13, 2015 · 18 comments

@nschloe
Contributor

nschloe commented Jan 13, 2015

I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of their div children (which represent rows of text).
That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

@yurydelendik
Contributor

You would want to use getTextContent() https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L812 instead -- it is used by TextLayerBuilder. TextItem has transform and width/height https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L567. Does this help?

That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

That's not a use case that is needed for the PDF viewer (yet). You have to implement it on your side.

@nschloe
Contributor Author

nschloe commented Jan 14, 2015

@yurydelendik Thanks, that'll do the trick.
Is there documentation on transform (an array of length 6)? I suppose those are entries in a transformation matrix, but I'm not sure which.

@nschloe
Contributor Author

nschloe commented Jan 14, 2015

I just found CSS3's transform and assume that the same logic is applied.
http://www.w3schools.com/cssref/css3_pr_transform.asp
https://dev.opera.com/articles/understanding-the-css-transforms-matrix/

@nschloe
Contributor Author

nschloe commented Jan 14, 2015

Is the unit of the translations px or some PDF-related unit (e.g., in)?

@yurydelendik
Contributor

PDF user units. You have to use PageViewport to map them to the screen presentation, see e.g. https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159

@nschloe
Contributor Author

nschloe commented Jan 14, 2015

Hm, I can't seem to get it right. The transform property on the individual element reads

element transform: [9.9626, 0, 0, 9.9626, 74.558, 673.034]

and I don't know how to interpret it. I'm certainly not supposed to stretch the element by a factor of almost 10, right? Curiously, the 9.9626 always coincides with the height property. What to make of this?

@yurydelendik
Contributor

Sorry, we did not extend the documentation of getTextContent that far (we are hoping to improve the API). Currently we operate under the assumption that users of the advanced features have some knowledge of computer graphics. Let me help, based on the code at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159, and let's start with a simple example:

var text; // assigned elsewhere, e.g. pdfPage.getTextContent().then(function (t) { text = t; });
var viewport = pdfPage.getViewport(1.0 /* scale */);
// find all text start points
var xs = [], ys = [];
text.items.forEach(function (item) {
  // combine the viewport matrix with the item's own matrix
  var tx = PDFJS.Util.transform(viewport.transform, item.transform);
  // avoiding tx * [0,0,1] taking x, y directly from the transform
  xs.push(tx[4]); ys.push(tx[5]);
});
// [minX, minY, maxX, maxY] of the start points only, not of the full text boxes
var boundsOfStartPoints = [
  Math.min.apply(null, xs), Math.min.apply(null, ys),
  Math.max.apply(null, xs), Math.max.apply(null, ys)
];

But for your task you need to take into account all four points of the rectangle, and the tx calculation must be performed a little differently. It's unfortunate that width and height are scaled by fontHeight, but you can easily revert that and calculate the positions of [0,0], [width/fontHeight,0], [0,height/fontHeight], and [width/fontHeight, height/fontHeight] for the bounds calculation.
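
A minimal sketch of the four-corner calculation described above, written without pdf.js so it can run on its own. The helpers multiply and applyToPoint are hypothetical stand-ins for the matrix product that PDFJS.Util.transform performs in the snippet above, and itemBounds is a made-up name. It assumes that item.width and item.height are in PDF user units and that the font height can be taken from the item's own transform. The viewport matrix in the usage lines is an assumed value for an unrotated page 792 units tall at scale 1.

```javascript
// Product of two PDF-style matrices [a, b, c, d, e, f]: apply m2 first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5]
  ];
}

// Map the point [x, y] through the matrix m.
function applyToPoint(p, m) {
  return [p[0] * m[0] + p[1] * m[2] + m[4], p[0] * m[1] + p[1] * m[3] + m[5]];
}

// Bounds [minX, minY, maxX, maxY] of one text item in viewport coordinates.
function itemBounds(item, viewportTransform) {
  var t = item.transform;
  // Assumption: font height is the length of the transform's y axis in user space.
  var fontHeight = Math.sqrt(t[2] * t[2] + t[3] * t[3]);
  var w = item.width / fontHeight;   // undo the font-height scaling
  var h = item.height / fontHeight;
  var tx = multiply(viewportTransform, t);
  var corners = [[0, 0], [w, 0], [0, h], [w, h]].map(function (p) {
    return applyToPoint(p, tx);
  });
  var xs = corners.map(function (p) { return p[0]; });
  var ys = corners.map(function (p) { return p[1]; });
  return [Math.min.apply(null, xs), Math.min.apply(null, ys),
          Math.max.apply(null, xs), Math.max.apply(null, ys)];
}

// Usage with made-up numbers; the viewport matrix flips the y axis.
var exampleItem = { transform: [9.9626, 0, 0, 9.9626, 74.558, 673.034], width: 100, height: 9.9626 };
var exampleBounds = itemBounds(exampleItem, [1, 0, 0, -1, 0, 792]);
```

Taking the minimum and maximum over the bounds of every item would then give a bounding box for all text on the page.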

@nschloe
Contributor Author

nschloe commented Jan 14, 2015

It's unfortunate width and height are scaled by fontHeight,

Aha, I didn't know that. How do I retrieve the font height of an object? Does that coincide with content.items[i].height?

@yurydelendik
Contributor

From https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164: var fontHeight = Math.sqrt((tx[2] * tx[2]) + (tx[3] * tx[3]));
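
As a quick check of that formula, here is a small sketch (the function name fontHeightOf is made up) applied to the transform quoted earlier in the thread and to a hypothetical rotated one. For an unrotated, unskewed matrix the result is just the absolute value of the fourth entry, which would be consistent with 9.9626 matching the height property.

```javascript
// Length of the transformed y axis, i.e. the scale applied to glyph height.
function fontHeightOf(tx) {
  return Math.sqrt((tx[2] * tx[2]) + (tx[3] * tx[3]));
}

var upright = fontHeightOf([9.9626, 0, 0, 9.9626, 74.558, 673.034]);
var rotated = fontHeightOf([0, 10, -10, 0, 50, 50]); // hypothetical text rotated by 90 degrees
```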

@nschloe
Contributor Author

nschloe commented Jan 15, 2015

When inspecting tx[4] of an element on the PDF, I notice that the actual x-y pixel coordinates of the top-left point do not coincide with tx[4] as computed from

tx = PDFJS.Util.transform(
  viewport.transform,
  content.items[i].transform
);

See: [screenshot "prec", image not preserved]

tx[4] = 76.6...

while the actual translation in x-direction exceeds 100px.

Any idea what might be causing this?

@nschloe
Contributor Author

nschloe commented Mar 24, 2015

I created a little plunk at http://plnkr.co/edit/m9oPJg80XHeeQ1nquxkz which confirms exactly what you're saying, @yurydelendik. However, in my full application, I still get the wrong offsets. I'm investigating why, and it might be due to the pdf_viewer I'm using. I'd like to create a plunk with it, too, but I can't find pdf_viewer.js served from github.io like http://mozilla.github.io/pdf.js/build/pdf.js or http://mozilla.github.io/pdf.js/build/pdf.worker.js. Any pointer?

@TuningGuide

@nschloe it seems like you did some work for text extraction. How far did you come?

@jmlsf

jmlsf commented Aug 16, 2016

Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in textContent.items[x].transform doesn't make any sense to me because it sets a scale in both the x and y directions equal to the font height. I don't know why you'd ever want to do canvas operations in those units. But that's beside the point.)

The numbers that matter are:

const item = textContent.items[0];
const transform = item.transform;
const x = transform[4];
const y = transform[5];
const width = item.width;
const height = item.height;

In order to do any operations on the canvas using these values, you have to (1) fix the y coordinate from the PDF origin to the canvas origin and (2) scale the whole thing by whatever you've scaled the viewport (i.e. by whatever you passed getViewport). So I do this:

convertToCanvasCoords([x, y, width, height]) {
  const { scale } = this;
  return [x * scale, this.canvas.height - ((y + height) * scale), width * scale, height * scale];
}

Where this.scale is the same number I passed to getViewport. Then the following draws an accurate box around the text:

ctx.strokeRect(...this.convertToCanvasCoords([x, y, width, height]));

Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There's probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what's going on. :) Hope that helps.
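
For readers who want to try this outside a class, here is a standalone sketch of the same conversion. Passing scale and canvasHeight as arguments is a change made for illustration, not part of the comment above. It assumes an unrotated page, unrotated text, and a canvas whose height equals the page height times the scale. The numbers are made up.

```javascript
// Convert a PDF-space box (origin bottom-left) to canvas space (origin top-left).
function convertToCanvasCoords(box, scale, canvasHeight) {
  var x = box[0], y = box[1], width = box[2], height = box[3];
  return [
    x * scale,
    canvasHeight - (y + height) * scale, // top edge of the box, measured from the canvas top
    width * scale,
    height * scale
  ];
}

// Made-up numbers: a 792-unit-tall page rendered at scale 1.5 gives a 1188 px canvas.
var rect = convertToCanvasCoords([74.558, 673.034, 100, 9.9626], 1.5, 1188);
// With a real 2D context: ctx.strokeRect(rect[0], rect[1], rect[2], rect[3]);
```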

@knowtheory

For others digging around for what pdf.js is actually doing with transformation vectors, the PDF Reference includes a definition of how transformation vectors are laid out and how they relate to mapping into a two dimensional coordinate space.

Specifically, the components of a transformation matrix are described on page 142:

  • Translations are specified as [ 1 0 0 1 tx ty ], where tx and ty are the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively.
  • Scaling is obtained by [sx 0 0 sy 0 0]. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as sx and sy units, respectively, in the previous coordinate system.
  • Rotations are produced by [cos θ sin θ −sin θ cos θ 0 0], which has the effect of rotating the coordinate system axes by an angle θ counterclockwise.
  • Skew is specified by [1 tan α tan β 1 0 0], which skews the x axis by an angle α and the y axis by an angle β.

(there's an accompanying chart in the reference as well)

And the vector itself is defined thus:

PDF represents coordinates in a two-dimensional space. The point (x, y) in such a space can be expressed in vector form as [x y 1]. The constant third element of this vector (1) is needed so that the vector can be used with 3-by-3 matrices in the calculations described below.
The transformation between two coordinate systems is represented by a 3-by-3 transformation matrix written as

[ed: pretend this is a matrix]

a b 0
c d 0
e f 1

Because a transformation matrix has only six elements that can be changed, it is usually specified in PDF as the six-element array [a b c d e f].
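
To make the layout concrete, here is a small sketch that applies a six-element array to a point, following the row-vector convention quoted above: [x y 1] times the matrix gives [a*x + c*y + e, b*x + d*y + f, 1]. The function names are made up for illustration and are not part of pdf.js.

```javascript
// Apply a six-element PDF matrix [a, b, c, d, e, f] to the point (x, y).
// This is the row vector [x y 1] times the 3-by-3 matrix shown above.
function applyMatrix(m, x, y) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

function translation(tx, ty) { return [1, 0, 0, 1, tx, ty]; }
function scaling(sx, sy) { return [sx, 0, 0, sy, 0, 0]; }
function rotation(theta) {
  return [Math.cos(theta), Math.sin(theta), -Math.sin(theta), Math.cos(theta), 0, 0];
}

var moved = applyMatrix(translation(5, 7), 1, 2);       // [6, 9]
var scaled = applyMatrix(scaling(2, 3), 1, 2);          // [2, 6]
var turned = applyMatrix(rotation(Math.PI / 2), 1, 0);  // approximately [0, 1]
```

In the transform quoted earlier in the thread, [9.9626, 0, 0, 9.9626, 74.558, 673.034], the first four entries are a uniform scale and the last two are the translation.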

@timvandermeij
Contributor

Closing since this is answered now.

@haijun-ucsd

I have a use case where I would like to get the bounding box of each word within an item. For example, the str of an item can be "hello world!", but the transformation only gives the coordinates of the entire string. In my use case, I would like to get the coordinates of each of the "hello", "world", and "!". These words are not selected or highlighted.

@BernhardBehrendt

@haijun-ucsd have you found a way to achieve this?

@snowfluke

@haijun-ucsd have you found a way to achieve this?
@BernhardBehrendt You could do something like this:

  1. Get the width of the item, for example x1-x0
  2. Divide the width by the total number of characters in that item, including spaces and symbols (this gives the average width of each character)
  3. Multiply the average width by the character's index in the item to get its x offset
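
The steps above can be sketched roughly as follows. The function name estimateTokenBoxes and the tokenizing regular expression are assumptions made for illustration. Because it assumes every character has the same width, the result is only approximate for proportional fonts, and it only covers horizontal, unrotated text.

```javascript
// Estimate per-token boxes by assuming every character has the same width.
// x and width describe the whole item, e.g. x = transform[4] and width = item.width.
function estimateTokenBoxes(str, x, width) {
  var avg = width / str.length; // average width of one character
  var boxes = [];
  var re = /\w+|[^\w\s]+/g; // runs of ASCII word characters, or runs of punctuation
  var match;
  while ((match = re.exec(str)) !== null) {
    boxes.push({
      text: match[0],
      x: x + match.index * avg,        // offset of the token's first character
      width: match[0].length * avg
    });
  }
  return boxes;
}

// Made-up numbers: a 12-character item that starts at x = 100 and is 120 units wide.
var boxes = estimateTokenBoxes("hello world!", 100, 120);
```

For the example string this yields three entries, for "hello", "world", and "!", each with an estimated x and width.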
