Retrieve bounding box of text on a page #5643

nschloe · 2015-01-13T22:54:48Z

I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of their div children (which represent rows of text).
That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

yurydelendik · 2015-01-13T23:05:08Z

You would want to use getTextContent() https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L812 instead -- it is used by TextLayerBuilder. TextItem has transform and width/height https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L567. Does this help?

> That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

That's not a use case that is needed for the PDF viewer (yet). You have to implement it on your side.

nschloe · 2015-01-14T12:50:03Z

@yurydelendik Thanks, that'll do the trick.
Is there documentation on transform (an array of length 6)? I suppose those are entries in a transformation matrix, but I'm not sure which.

nschloe · 2015-01-14T14:34:29Z

I just found CSS3's transform and assume that the same logic is applied.
http://www.w3schools.com/cssref/css3_pr_transform.asp
https://dev.opera.com/articles/understanding-the-css-transforms-matrix/

nschloe · 2015-01-14T15:33:52Z

Is the unit of the translations px or some PDF-related unit (e.g., in)?

yurydelendik · 2015-01-14T16:30:42Z

PDF user units. You have to use PageViewport to map them to the screen presentation; see e.g. https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159

nschloe · 2015-01-14T18:02:00Z

Hm, I can't seem to get it right. The transform property on the individual element reads

element transform: [9.9626, 0, 0, 9.9626, 74.558, 673.034]

where I don't know how to interpret it. I'm certainly not supposed to stretch the element by a factor of almost 10, right? Curiously, the 9.9626 always coincides with the height property. What to make of this?

yurydelendik · 2015-01-14T18:52:15Z

Sorry, we did not extend the documentation of getTextContent that far (we are hoping to improve the API). Currently we operate under the assumption that users of the advanced features will have some knowledge of computer graphics. Let me help, based on the code at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159, and let's start with a simple example:

pdfPage.getTextContent().then(function (text) {
  var viewport = pdfPage.getViewport(1.0 /* scale */);
  // Find the start point of every text item in viewport coordinates.
  var xs = [], ys = [];
  text.items.forEach(function (item) {
    var tx = PDFJS.Util.transform(viewport.transform, item.transform);
    // Rather than computing tx * [0,0,1], take x and y directly from the transform.
    xs.push(tx[4]);
    ys.push(tx[5]);
  });
  var boundsOfStartPoints = [
    Math.min.apply(null, xs), Math.min.apply(null, ys),
    Math.max.apply(null, xs), Math.max.apply(null, ys)
  ];
});

But for your task you need to take into account all four corners of the rectangle, and the tx calculation must be performed a little differently. It's unfortunate that width and height are scaled by fontHeight, but you can easily revert that and calculate the positions of [0,0], [width/fontHeight,0], [0,height/fontHeight], and [width/fontHeight, height/fontHeight] for the bounds calculation.

nschloe · 2015-01-14T19:38:46Z

> It's unfortunate width and height are scaled by fontHeight,

Aha, I didn't know that. How do I retrieve the font height of an object? Does that coincide with content.items[i].height?

yurydelendik · 2015-01-14T19:41:13Z

From https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164: var fontHeight = Math.sqrt((tx[2] * tx[2]) + (tx[3] * tx[3]));
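
Combining the hints above (undo the fontHeight scaling, then transform all four corners), a bounding-box computation might look like the following minimal sketch. It is not code from pdf.js: multiply and applyTransform are local stand-ins written to mirror what PDFJS.Util.transform does, itemBounds and pageTextBounds are hypothetical helper names, and taking fontHeight from item.transform (rather than from the combined tx) is an assumption. The viewport transform in the usage lines assumes an unrotated page 800 units high at scale 1, and the two items are made up.

```javascript
// Product of two 6-element affine matrices [a, b, c, d, e, f]
// (local stand-in mirroring PDFJS.Util.transform(m1, m2)).
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5]
  ];
}

// Map the point p = [x, y] through the affine matrix m.
function applyTransform(p, m) {
  return [
    p[0] * m[0] + p[1] * m[2] + m[4],
    p[0] * m[1] + p[1] * m[3] + m[5]
  ];
}

// Bounds [minX, minY, maxX, maxY] of one text item in viewport coordinates.
function itemBounds(item, viewportTransform) {
  var t = item.transform;
  var tx = multiply(viewportTransform, t);
  // Assumption: width and height are divided by the font height taken from
  // the item's own transform to get back into the item's local space.
  var fontHeight = Math.sqrt(t[2] * t[2] + t[3] * t[3]);
  var w = item.width / fontHeight;
  var h = item.height / fontHeight;
  var corners = [[0, 0], [w, 0], [0, h], [w, h]].map(function (p) {
    return applyTransform(p, tx);
  });
  var xs = corners.map(function (c) { return c[0]; });
  var ys = corners.map(function (c) { return c[1]; });
  return [
    Math.min.apply(null, xs), Math.min.apply(null, ys),
    Math.max.apply(null, xs), Math.max.apply(null, ys)
  ];
}

// Union of the bounds of all items; returns null when there are no items.
function pageTextBounds(items, viewportTransform) {
  return items.reduce(function (acc, item) {
    var b = itemBounds(item, viewportTransform);
    if (acc === null) { return b; }
    return [
      Math.min(acc[0], b[0]), Math.min(acc[1], b[1]),
      Math.max(acc[2], b[2]), Math.max(acc[3], b[3])
    ];
  }, null);
}

// Usage with made-up data (unrotated page, 800 units high, scale 1).
var viewportTransform = [1, 0, 0, -1, 0, 800];
var items = [
  { transform: [10, 0, 0, 10, 50, 700], width: 100, height: 10 },
  { transform: [10, 0, 0, 10, 50, 680], width: 200, height: 10 }
];
var bounds = pageTextBounds(items, viewportTransform);
```

With real data, viewportTransform would be viewport.transform and items would be the items array of the resolved getTextContent() result. Items whose transform has zero font height would need a guard before the division.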

nschloe · 2015-01-15T12:27:30Z

When inspecting tx[4] of an element on the PDF, I notice that the actual x-y pixel coordinates of the top-left point do not coincide with tx[4] as computed from

tx = PDFJS.Util.transform(
  viewport.transform,
  content.items[i].transform
);

See:

tx[4] = 76.6...

while the actual translation in x-direction exceeds 100px.

Any idea what might be causing this?

nschloe · 2015-03-24T20:20:30Z

I created a little plunk at http://plnkr.co/edit/m9oPJg80XHeeQ1nquxkz which confirms exactly what you're saying, @yurydelendik. However, in my full application, I still get the wrong offsets. I'm investigating why, and it might be due to the pdf_viewer I'm using. I'd like to create a plunk with it, too, but I can't find pdf_viewer.js served from github.io like http://mozilla.github.io/pdf.js/build/pdf.js or http://mozilla.github.io/pdf.js/build/pdf.worker.js. Any pointer?

TuningGuide · 2016-05-29T10:33:22Z

@nschloe it seems like you did some work for text extraction. How far did you come?

jmlsf · 2016-08-16T03:38:58Z

Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in textContent.items[x].transform doesn't make any sense to me because it sets a scale in both the x and y directions equal to the font height. I don't know why you'd ever want to do canvas operations in those units. But that's beside the point.)

The numbers that matter are:

const item = textContent.items[0];
const transform = item.transform;
const x = transform[4];
const y = transform[5];
const width = item.width;
const height = item.height;

In order to do any operations on the canvas using these values, you have to (1) convert the y coordinate from the PDF origin to the canvas origin and (2) scale the whole thing by whatever you've scaled the viewport (i.e. by whatever you passed to getViewport). So I do this:

convertToCanvasCoords([x, y, width, height]) {
  const { scale } = this;
  return [x * scale, this.canvas.height - ((y + height) * scale), width * scale, height * scale];
}

Where this.scale is the same number I passed to getViewport. Then the following draws an accurate box around the text:

ctx.strokeRect(...this.convertToCanvasCoords([x, y, width, height]));

Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There's probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what's going on. :) Hope that helps.
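
For readers who want to try this outside a class, the same conversion can be written as a standalone function. This is only a sketch under the same assumptions as the method above: the page is not rotated, the PDF origin is at the bottom-left corner of the page, and canvasHeight is the height of the canvas the scaled viewport was rendered into. The parameter names scale and canvasHeight and the sample numbers are illustrative.

```javascript
// Convert [x, y, width, height] in PDF user units (origin bottom-left)
// to canvas coordinates (origin top-left), as expected by strokeRect.
function convertToCanvasCoords(rect, scale, canvasHeight) {
  var x = rect[0], y = rect[1], width = rect[2], height = rect[3];
  return [
    x * scale,
    // Move from the bottom edge to the top edge, scale, then flip the y axis.
    canvasHeight - (y + height) * scale,
    width * scale,
    height * scale
  ];
}

// Made-up example: an 800-unit-high page rendered at scale 2 (canvas height 1600).
var box = convertToCanvasCoords([50, 700, 100, 10], 2, 1600);
// box is [100, 180, 200, 20]
```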

knowtheory · 2019-05-28T19:08:58Z

For others digging around for what pdf.js is actually doing with transformation vectors, the PDF Reference includes a definition of how transformation vectors are laid out and how they relate to mapping into a two dimensional coordinate space.

Specifically, the components of a transformation matrix are described on page 142:

Translations are specified as [ 1 0 0 1 tx ty ], where tx and ty are the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively.

Scaling is obtained by [sx 0 0 sy 0 0]. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as sx and sy units, respectively, in the previous coordinate system.

Rotations are produced by [cos θ sin θ −sin θ cos θ 0 0], which has the effect of rotating the coordinate system axes by an angle θ counterclockwise.

Skew is specified by [1 tan α tan β 1 0 0], which skews the x axis by an angle α and the y axis by an angle β.

(there's an accompanying chart in the reference as well)

And the vector itself is defined thus:

PDF represents coordinates in a two-dimensional space. The point (x, y) in such a space can be expressed in vector form as [x y 1]. The constant third element of this vector (1) is needed so that the vector can be used with 3-by-3 matrices in the calculations described below.
The transformation between two coordinate systems is represented by a 3-by-3 transformation matrix written as

[ed: pretend this is a matrix]

a b 0
c d 0
e f 1

Because a transformation matrix has only six elements that can be changed, it is usually specified in PDF as the six-element array [a b c d e f].
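
As a small illustration of the definitions quoted above: expanding the product of [x y 1] with that matrix gives x' = a*x + c*y + e and y' = b*x + d*y + f. The sketch below applies this to a point and builds the translation, scaling, and rotation arrays from the quoted forms. The helper names (applyMatrix, translation, scaling, rotation) are made up for this example and are not part of pdf.js.

```javascript
// Apply the six-element matrix m = [a, b, c, d, e, f] to the point p = [x, y].
function applyMatrix(p, m) {
  return [
    m[0] * p[0] + m[2] * p[1] + m[4],
    m[1] * p[0] + m[3] * p[1] + m[5]
  ];
}

// The special forms quoted from the PDF Reference.
function translation(tx, ty) { return [1, 0, 0, 1, tx, ty]; }
function scaling(sx, sy) { return [sx, 0, 0, sy, 0, 0]; }
function rotation(theta) {
  return [Math.cos(theta), Math.sin(theta), -Math.sin(theta), Math.cos(theta), 0, 0];
}

// The transform reported earlier in this thread combines a uniform scale
// of 9.9626 with a translation to (74.558, 673.034).
var m = [9.9626, 0, 0, 9.9626, 74.558, 673.034];
var origin = applyMatrix([0, 0], m); // the translation part: [74.558, 673.034]
var corner = applyMatrix([1, 1], m); // origin shifted by 9.9626 in x and in y

// A quarter turn counterclockwise maps the x axis onto the y axis
// (up to floating-point rounding).
var rotated = applyMatrix([1, 0], rotation(Math.PI / 2));
```

Read this way, the array from earlier in the thread appears to map a one-unit text space onto the font size at the item's position, which would be consistent with 9.9626 matching the height property.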

timvandermeij · 2019-05-28T21:33:00Z

Closing since this is answered now.

haijun-ucsd · 2021-11-09T05:46:32Z

I have a use case where I would like to get the bounding box of each word within an item. For example, the str of an item can be "hello world!", but the transformation only gives the coordinates of the entire string. In my use case, I would like to get the coordinates of each of the "hello", "world", and "!". These words are not selected or highlighted.

BernhardBehrendt · 2023-08-21T14:47:41Z

@haijun-ucsd have you found a way to achieve this?

snowfluke · 2023-10-30T03:26:08Z

> @haijun-ucsd have you found a way to achieve this?
@BernhardBehrendt You could do something like this:

1. Get the width of the item, for example x1-x0.
2. Divide the width by the total number of characters in that item, including spaces and symbols (this gives the average width of a character).
3. Multiply the average width by the character's index in the item to get its x offset.
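
The steps above could be sketched as follows. This is only an approximation: it assumes horizontal, unrotated text and treats every character as equally wide, which is not true for proportional fonts. The function name estimateWordBoxes, the shape of the returned objects, and the regular expression used to split words from punctuation (ASCII word characters only) are choices made for this sketch; the item fields (str, width, height, transform) are those discussed earlier in the thread, and the sample item is made up.

```javascript
// Estimate a box for each word in a text item, in PDF user units,
// using the average character width of the whole item.
function estimateWordBoxes(item) {
  var str = item.str;
  var x0 = item.transform[4];
  var y0 = item.transform[5];
  var avgCharWidth = str.length > 0 ? item.width / str.length : 0;
  var boxes = [];
  // Runs of word characters, or runs of non-space punctuation.
  var re = /\w+|[^\s\w]+/g;
  var match;
  while ((match = re.exec(str)) !== null) {
    boxes.push({
      word: match[0],
      x: x0 + match.index * avgCharWidth,
      y: y0,
      width: match[0].length * avgCharWidth,
      height: item.height
    });
  }
  return boxes;
}

// Made-up item: 12 characters across a width of 120 units.
var boxes = estimateWordBoxes({
  str: "hello world!",
  width: 120,
  height: 10,
  transform: [10, 0, 0, 10, 50, 700]
});
// boxes holds entries for "hello", "world" and "!"
```

The resulting boxes are still in PDF user units and would need the viewport mapping discussed earlier before being drawn on a canvas.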

