How to search and replace text within a document? #71

distracteddev · 2016-05-22T01:23:38Z

First off, thanks for writing and maintaining Hummus!

From reading the documentation and perusing the issues, I've gathered so far that this is not supported at a high level by Hummus. I also found your explanation suggesting that it wouldn't necessarily be too difficult, provided you understand the structure/anatomy of a PDF document.

Was just curious if this was actually simpler to do than I've discovered? Or perhaps an example exists but I've just missed it?

Either way, I've started to read the PDF spec, focusing on the text portions and am starting to understand some of the low level API calls now. Any tips to set me down the right path would be appreciated.

galkahana · 2016-05-22T06:11:44Z

Hi,
I've never tried implementing search & replace, but yeah, I think that should be possible. I'm not sure whether it would be easy. I'll provide some notes on how I would approach it, but it might be a good idea to consult someone who has done this, or to read some library code that actually implements it.
I'd allocate a few weeks for it, just as an offhand estimate.

you need to tackle these problems:

how to read the text in the pdf
how to correctly replace and display the new text

How to parse the text

Content (text, graphics) is placed in the content streams of pages, so you need to look into each page's content streams (a page may have more than one).

Text is placed in content streams inside blocks marked by the "BT" and "ET" commands.
Within these blocks you should watch for text placement commands like "Tj". Tj takes a single parameter (the string that precedes it), which is the encoded text. It is encoded per the current font, so you need to track that font. You can decode the text using the font's encoding or its unicode map. The PDF spec explains how Acrobat decodes text, so you can use that for the implementation.
You also need to work out the words in the text, meaning where the spaces fall. Hopefully you can break words by relying on Tj commands being separate per word... but I'm not sure about that.

You need something to tokenize the content stream. I have a good class for it in the C++ implementation called PDFParserTokenizer, which I didn't expose via the hummusjs module. If it makes sense we may want to expose it, or reimplement it. Its definition is here. This one here shows basic tokenization of a content stream. Hope it's OK that it's in C++.
By tokenizing the streams you can get to the commands and then track back to the relevant parameters (or rather, save them in advance).
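
As a rough illustration of the tokenizing step described above, here is a minimal JavaScript sketch. It is not a port of PDFParserTokenizer: the names tokenize, isOperand and collectTextOperations are made up for this example, it expects an already-decompressed content stream as a string, and it ignores inline images and other corner cases. The returned strings are still font-encoded.

```javascript
const DELIMITERS = '()<>[]/%';

// Split a decompressed content stream into strings, hex strings and symbols.
function tokenize(content) {
  const tokens = [];
  let i = 0;
  while (i < content.length) {
    const ch = content[i];
    if (/\s/.test(ch)) {
      i++;
    } else if (ch === '%') {
      // Comment: skip to the end of the line
      while (i < content.length && content[i] !== '\n' && content[i] !== '\r') i++;
    } else if (ch === '(') {
      // Literal string: honour backslash escapes and nested parentheses
      let depth = 1;
      let j = i + 1;
      while (j < content.length && depth > 0) {
        if (content[j] === '\\') j++;
        else if (content[j] === '(') depth++;
        else if (content[j] === ')') depth--;
        j++;
      }
      tokens.push({ type: 'string', value: content.slice(i + 1, j - 1) });
      i = j;
    } else if (ch === '<' && content[i + 1] !== '<') {
      // Hex string such as <0040>
      const end = content.indexOf('>', i);
      const stop = end === -1 ? content.length : end;
      tokens.push({ type: 'hex', value: content.slice(i + 1, stop) });
      i = stop + 1;
    } else if (ch === '[' || ch === ']') {
      tokens.push({ type: 'symbol', value: ch });
      i++;
    } else if (ch === '<' || ch === '>') {
      // "<<" or ">>" dictionary delimiters
      tokens.push({ type: 'symbol', value: content.slice(i, i + 2) });
      i += 2;
    } else {
      // Name (/F1), number or operator: read up to whitespace or a delimiter
      let j = i + 1;
      while (j < content.length && !/\s/.test(content[j]) && !DELIMITERS.includes(content[j])) j++;
      tokens.push({ type: 'symbol', value: content.slice(i, j) });
      i = j;
    }
  }
  return tokens;
}

// Names, numbers, strings and brackets are operands; anything else is an operator.
function isOperand(token) {
  return token.type !== 'symbol' || /^(\/|\[|\]|<<|>>|[+-]?\.?\d)/.test(token.value);
}

// Save operands as they arrive, then act when a text operator shows up.
function collectTextOperations(content) {
  const shown = [];
  let operands = [];
  let inText = false;
  let font = null;
  for (const token of tokenize(content)) {
    if (isOperand(token)) {
      operands.push(token);
      continue;
    }
    if (token.value === 'BT') inText = true;
    else if (token.value === 'ET') inText = false;
    else if (token.value === 'Tf' && operands.length >= 2) font = operands[operands.length - 2].value;
    else if (inText && ['Tj', 'TJ', "'", '"'].includes(token.value)) {
      const raw = operands.filter((o) => o.type !== 'symbol').map((o) => o.value);
      shown.push({ font, operator: token.value, raw });
    }
    operands = [];
  }
  return shown;
}
```

Calling collectTextOperations on a page's content string gives one entry per text-showing operator, with the font resource name that was current at that point, which is the information needed for the decoding step.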

Note that you may get form xobjects placed. These are pieces of reusable graphics that function like pages within pages. You need to track their content streams too if they are placed in a page.

Get this up and running, and if you're happy with getting the text in a document/page you can move on to replacing the text.

How to replace the text

If you want to replace the text, you should track the original placement commands responsible for it and replace them with a new command placing the new text. You'll probably have to replace the whole paragraph (you have to figure out what counts as a paragraph), because the text length will change, placements will change, and you don't want the replaced text to look funny. By funny I mean that it runs over the text following it or leaves too much space. So replacing the whole paragraph text is probably the better approach: figure out the new paragraph text and place it. Hopefully this will work.
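
To make the idea of swapping a placement command concrete, here is a small sketch that works on a content stream held as a string. The function names escapePdfLiteral and replaceShownText are hypothetical, and the sketch assumes the text sits in one simple literal-string Tj operand (no nested parentheses, no hex strings) whose bytes read as plain ASCII. It does not reflow a paragraph or touch the font.

```javascript
// Escape the characters that are special inside a PDF literal string.
function escapePdfLiteral(text) {
  return text.replace(/[\\()]/g, '\\$&');
}

// Replace the operand of every "(...) Tj" whose text equals `from`.
function replaceShownText(content, from, to) {
  const target = escapePdfLiteral(from);
  // One literal string (escapes allowed, no nesting) followed by the Tj operator
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  return content.replace(pattern, (whole, operand) =>
    operand === target ? `(${escapePdfLiteral(to)}) Tj` : whole
  );
}
```

Because only whole operands are compared, text that merely contains the search string is left untouched; handling that case means deciding how to re-place the rest of the line, as discussed above.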

You can use hummus commands to place new text or use lower-level commands.
The tricky part is adding any new characters to the font definition. Assuming the PDF has only the characters it needs for rendering the text it already has, you probably need to know which original font was used. Working that out from the PDF is not very easy, but it can be done. As for the actual embedding, you are probably better off creating a new font using hummus, with the same name, and writing all the text using that font. Simply replace the Tf command that sets the old font with one that sets the new font, and use Tjs to place the new text (sticking to that avoids having to know the size and color of the original text).

Good luck,
Gal.

galkahana · 2017-01-29T21:18:32Z

how to parse text with hummus - http://pdfhummus.com/post/156548561656/extracting-text-from-pdf-files

filmerjarred · 2017-01-29T23:46:20Z

I managed to implement this the following way

var hummus = require('hummus');

//write our example pdf
var pdfWriter = hummus.createWriter('./source.pdf', {compress:false});
//any TrueType font file will do here
var font = pdfWriter.getFontForFile('./LucidaBrightDemiBold.ttf');
var page = pdfWriter.createPage(0,0,600,800);
var cxt = pdfWriter.startPageContentContext(page);

var textOptions = {font:font, size:14, color:0x222222};
cxt.writeText('Example text',75,75,textOptions);
pdfWriter.writePage(page);
pdfWriter.end();


//init modification writer
var modPdfWriter = hummus.createWriterToModify('./source.pdf', {modifiedFilePath:'./output.pdf', compress:false});

//get references to the contents stream on the relevant page (first, in this instance)
var sourceParser = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
var pageObject = sourceParser.parsePage(0);
var textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();
var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');

//read the original block of text data
var data = [];
var readStream = sourceParser.startReadingFromStream(textStream);
while(readStream.notEnded()){
  var readData = readStream.read(10000);
  data = data.concat(readData);
}

//create new string
var string = Buffer.from(data).toString(); //new Buffer() is deprecated
string = string.replace(/Example text/g, 'Exmpl txt');

//Create and write our new text object
var objectsContext = modPdfWriter.getObjectsContext();
objectsContext.startModifiedIndirectObject(textObjectID);

var stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(string));
objectsContext.endPDFStream(stream);

objectsContext.endIndirectObject();

modPdfWriter.end();

//removes old objects no longer in use
hummus.recrypt('./output.pdf', './outputClean.pdf');

function strToByteArray(str) {
  //convert the string to a plain array of byte values for the write stream
  return Array.from(Buffer.from(str));
}

Note this will only work if the characters of the new text are already used in the PDF (I think it's because the font info embedded in the document doesn't include characters that aren't already used in it). Also, to make the code work you need to supply a font file for writing the example text.
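
A quick way to spot that limitation before writing anything is to compare the characters of the replacement with the characters of the text being replaced. This is only a rough approximation under a stated assumption: the embedded font subset really covers every character drawn with that font anywhere in the document, not just the ones in this one string. The helper name missingCharacters is made up for the example.

```javascript
// Return the characters of `replacement` that never occur in `existingText`.
function missingCharacters(existingText, replacement) {
  const available = new Set(existingText);
  return [...new Set(replacement)].filter((ch) => !available.has(ch));
}

// Every character of 'Exmpl txt' already occurs in 'Example text'
const safe = missingCharacters('Example text', 'Exmpl txt');
// 'S' does not occur in 'Example text', so it would likely render as a missing glyph
const risky = missingCharacters('Example text', 'Sample');
```

An empty result suggests the replacement can be drawn with the glyphs already embedded; a non-empty one lists the characters that would probably need a newly embedded font.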

alexey-sh · 2018-06-02T15:52:16Z

@BrighTide it seems like your code doesn't work. The outputClean.pdf and output.pdf files are empty.

filmerjarred · 2018-06-04T00:05:04Z

Revisited the old code and this is what shook out in the end; this is working for us to this day.


const hummus = require('hummus')

module.exports = function redactPDF ({filePath, patterns}) {
	const modPdfWriter = hummus.createWriterToModify(filePath, {modifiedFilePath: `${filePath}-modified`, compress: false})
	const numPages = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser().getPagesCount()

	for (let page = 0; page < numPages; page++) {
		const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile()
		const objectsContext = modPdfWriter.getObjectsContext()

		const pageObject = copyingContext.getSourceDocumentParser().parsePage(page)
		const textStream = copyingContext.getSourceDocumentParser().queryDictionaryObject(pageObject.getDictionary(), 'Contents')
		const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID()

		let data = []
		const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream)
		while (readStream.notEnded()) {
			const readData = readStream.read(10000)
			data = data.concat(readData)
		}

		const pdfPageAsString = Buffer.from(data).toString()

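		// findInText returns false when nothing matches; leave such pages untouched
		const toRedactString = findInText({patterns, string: pdfPageAsString})
		if (!toRedactString) continue

		// Escape regex metacharacters so the found text is matched literally,
		// and use repeat() so the dashes have the same length as the match
		const escaped = toRedactString.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
		const redactedPdfPageAsString = pdfPageAsString.replace(new RegExp(escaped, 'g'), '-'.repeat(toRedactString.length))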

		// Create what will become our new text object
		objectsContext.startModifiedIndirectObject(textObjectID)

		const stream = objectsContext.startUnfilteredPDFStream()
		stream.getWriteStream().write(strToByteArray(redactedPdfPageAsString))
		objectsContext.endPDFStream(stream)

		objectsContext.endIndirectObject()
	}

	modPdfWriter.end()

	hummus.recrypt(`${filePath}-modified`, filePath)
}

function findInText ({patterns, string}) {
	for (let pattern of patterns) {
		const match = new RegExp(pattern, 'g').exec(string)
		if (match) {
			if (match[1]) {
				return match[1]
			}
			else {
				return match[0]
			}
		}
	}

	return false
}

function strToByteArray (str) {
	let myBuffer = []
	let buffer = Buffer.from(str)
	for (let i = 0; i < buffer.length; i++) {
		myBuffer.push(buffer[i])
	}
	return myBuffer
}

dongnthut19 · 2018-06-15T09:55:39Z

Please explain: 'let toRedactString = findInText({patterns, string: pdfPageAsString})'. I don't understand that code.

filmerjarred · 2018-06-17T23:48:54Z

findInText is defined further down; it simply tries each regex in an array in turn.
findInText({patterns: [/abc/], string: pdfPageAsString})
would try to find 'abc' somewhere in the page's content stream, after which it would be redacted.
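
For illustration, here is the same findInText logic in a standalone form, with a made-up content string, showing what comes back with and without a capture group. The sample string and the constant names are assumptions for the example only.

```javascript
// Same logic as findInText in the snippet above: the first pattern that matches wins.
function findInText({ patterns, string }) {
  for (const pattern of patterns) {
    const match = new RegExp(pattern, 'g').exec(string);
    if (match) {
      // Prefer the first capture group when the pattern has one
      return match[1] ? match[1] : match[0];
    }
  }
  return false;
}

const pageString = 'BT /F1 12 Tf (Account: 12345) Tj ET';

const wholeMatch = findInText({ patterns: [/Account/], string: pageString });
const captured = findInText({ patterns: [/Account: (\d+)/], string: pageString });
const nothing = findInText({ patterns: [/missing/], string: pageString });
```

Note that the function takes a single object argument, so it has to be called as findInText({patterns, string: ...}) and not as findInText(patterns, string), and that the false result for "no match" needs to be checked before it is used.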

You might also be confused about the es6 feature that's being used? http://www.benmvp.com/learning-es6-enhanced-object-literals/#property-value-shorthand

dongnthut19 · 2018-06-18T03:46:26Z

Thanks @BrighTide. I copied your code and ran it, but I could not find the text that I need: 'toRedactString = undefined'. Please see my code:

function replaceText1(sourceFile: string, targetFile: string, patterns: any) {
const modPdfWriter = hummus.createWriterToModify(sourceFile, { modifiedFilePath: targetFile, compress: false });
const numPages = modPdfWriter.createPDFCopyingContextForModifiedFile().getSourceDocumentParser().getPagesCount();

for (let page = 0; page < numPages; page += 1) {
const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
const objectsContext = modPdfWriter.getObjectsContext();

const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
const textStream = copyingContext.getSourceDocumentParser().queryDictionaryObject(pageObject.getDictionary(), 'Contents');
const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();

let data: any = [];
const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
while (readStream.notEnded()) {
  const readData = readStream.read(10000);
  data = data.concat(readData);
}

const pdfPageAsString = Buffer.from(data).toString();
console.log('pdfPageAsString = ', pdfPageAsString);

const toRedactString = findInText(patterns, pdfPageAsString);

console.log('toRedactString = ', toRedactString);

let redactedPdfPageAsString: string = '';
if (toRedactString !== undefined) {
  redactedPdfPageAsString = pdfPageAsString.replace(new RegExp(toRedactString, 'g'), new Array(toRedactString.length).join('-'));
}

// Create what will become our new text object
objectsContext.startModifiedIndirectObject(textObjectID);

const stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(redactedPdfPageAsString));
objectsContext.endPDFStream(stream);

objectsContext.endIndirectObject();

}

modPdfWriter.end();

return;
}

replaceText1(sourcePDF, destinationPDF, [/amount/]);

DNikolic-Paycor · 2018-09-20T08:44:32Z

Hey!
I really need help with the part of your function that matches the pattern.
I think there is a problem in:
function findInText ({patterns, string}) {}
when you pass more than one char in the pattern, because console.log('pdfPageAsString = ', pdfPageAsString) returns a string that looks like
5.27877 0 Td (c)Tj 5.2798 0 Td (t)Tj 6.35974 0 Td (t)Tj 3.35985 0 Td (h)Tj 6 0 Td (e)Tj 8.27865 0 Td (g)Tj 5.87977 0 Td (r)Tj 3.95982 0 Td (a)Tj 5.27977 0 Td (p)Tj 6 0 Td
So how is it possible for a regex like /abc/g to find something in this string?
The function works when you pass one char to the regex, so could you help me with this?
Best regards!
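
One possible way around the one-character-per-Tj layout shown above is to join the Tj operands, search the joined text, and then overwrite only the operands that produced the match. The sketch below is hedged accordingly: collectShownStrings and redactAcrossOperators are made-up names, and it assumes simple literal strings without escapes in the matched region and a single-byte encoding whose bytes read as ASCII. It also says nothing about where spaces between words fall.

```javascript
// Collect every "(...)Tj" operand together with its position in the stream.
function collectShownStrings(content) {
  const pieces = [];
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  let match;
  while ((match = pattern.exec(content)) !== null) {
    const start = match.index + 1; // first character after "("
    pieces.push({ text: match[1], start, end: start + match[1].length });
  }
  return pieces;
}

// Search the joined operand text, then mask the operands that produced the match.
function redactAcrossOperators(content, regex, mask = '-') {
  const pieces = collectShownStrings(content);
  const joined = pieces.map((p) => p.text).join('');
  const found = regex.exec(joined);
  if (!found) return content;
  const from = found.index;
  const to = found.index + found[0].length;

  let offset = 0; // where the current piece starts inside `joined`
  let cursor = 0; // how much of `content` has been copied so far
  let result = '';
  for (const piece of pieces) {
    const pieceFrom = Math.max(from - offset, 0);
    const pieceTo = Math.min(to - offset, piece.text.length);
    if (pieceFrom < pieceTo) {
      const masked = piece.text.slice(0, pieceFrom) +
        mask.repeat(pieceTo - pieceFrom) +
        piece.text.slice(pieceTo);
      result += content.slice(cursor, piece.start) + masked;
      cursor = piece.end;
    }
    offset += piece.text.length;
  }
  return result + content.slice(cursor);
}
```

The positioning operators (Td) between the pieces are left as they are, so the masked characters keep the original spacing.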

nithinkashyapn · 2018-11-28T08:08:14Z

Hey,

Thanks for the snippet

But when running i am getting the following error

var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();                                                            

TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

I tried downgrading the package but it's not downgrading as well.
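
One plausible cause of this TypeError (an assumption, not something confirmed in this thread) is that the page's Contents entry is not a single indirect reference. The PDF format also allows Contents to be an array of stream references, or to be absent, and in those cases there is no getObjectID on the value. A defensive helper might look like the sketch below. The name getContentStreamIds is made up; getType, toJSArray, getObjectID and the ePDFObject... constants are used as I understand them from the HummusJS parsing documentation, so treat them as assumptions and check them against your installed version.

```javascript
// Return the object IDs of all content streams referenced by a page's Contents entry.
// `contents` is pageObject.getDictionary().toJSObject().Contents and
// `hummus` is the module itself, passed in for its type constants.
function getContentStreamIds(contents, hummus) {
  if (!contents) return []; // page without a Contents entry

  const isReference = (obj) =>
    obj.getType() === hummus.ePDFObjectIndirectObjectReference;

  // Single indirect reference: the case the original snippet assumes
  if (isReference(contents)) return [contents.getObjectID()];

  // Array of indirect references: one ID per content stream
  if (contents.getType() === hummus.ePDFObjectArray) {
    return contents.toJSArray().filter(isReference).map((ref) => ref.getObjectID());
  }

  return []; // anything else (for example a direct object) is not handled here
}
```

With several IDs per page, each stream would have to be read and rewritten separately, and a search string may be split across streams.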

tinwinaung · 2018-12-27T21:18:57Z

I also get the same error


… or nice but seems to work for now. We need to revisit this to re-integrate token support - maybe with this code example: galkahana/HummusJS#71 (comment)

bingomvm · 2019-04-30T07:55:07Z

@kicaUBUNTU I have the same question. Did you solve this problem? Can you tell me how you solved it?

venkatarajeshm · 2019-08-23T17:25:59Z

Hi, thank you for the code. With the help of this snippet I could extract the text in TJ and replace it. However, the text from all TJs in the output PDF disappeared. I guess it has something to do with the font? How can I embed a font via the CopyingContext? Please help.
Best Regards.

mohammedabualsoud · 2019-09-30T09:59:52Z

@venkatarajeshm I got this error
TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

Cali93 · 2019-10-08T21:03:24Z

I'm also having the same problem, and the issue arises even before findInText: the byte array coming from the readStream is already formatted like that. I have no clue how to solve it, as I'm new to Hummus and PDF manipulation. I'm not sure, but it might be because the text is vectorised, since in some cases the text comes through as a normal string.

But it might be related to what @galkahana said above:

How to parse the text
Content (text, graphics) is placed in content streams of pages. So you need to look into the pages content streams. (each page may have more than one).
Text is placed in content streams inside blocks marked by "BT" and "ET" commands.
In these blocks you should track for text placement commands like "Tj". Tj has a single parameter (as string that precedes it) that is an encoded string of the text. it is encoded per the font that's current. you need to track this font then. You can decode the text using the font encoding or unicode map. The pdf specs has an explanation on how acrobat decodes the text, so you can use it for the implementation.
You need to somehow figure out words out of the text. meaning, when spaces come in. hopefully you can break words by relying on Tj commands being separate per word...but i'm not sure about it.

apic-apps · 2020-03-16T16:34:02Z

Did anyone find a solution for this error?

TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

duchiep123 · 2021-03-19T17:50:11Z

@venkatarajeshm Hi venkatarajeshm, have you successfully replaced the text?
Currently I have identified the text in BT and ET

BT
/F6 16 Tf 1 0 0 -1 0 0 Tm
230 -286 Td <0012> Tj
10.6718750 0 Td <0003> Tj
6.21875000 0 Td <0004> Tj
8.89062500 0 Td <0015> Tj
8.89062500 0 Td <0040> Tj
8.89062500 0 Td <0010> Tj
9.76562500 0 Td <000A> Tj
ET

Are <0012> and <0003> characters?
I don't know how they are encoded; I only know the encoding depends on the current font in the file.
I want to find the email in the PDF file, so I have to locate the @ character. But each font has a different encoding:
in the above example <0040> is the @ character, but when I tested with a different font it was not <0040>.
So is there a way to find out how the @ character is encoded in a specific PDF file?

I really need it
Thank you so much
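
One place to look, when the font has one, is the font's ToUnicode CMap, which maps character codes to Unicode text in beginbfchar and beginbfrange sections. The sketch below parses such a CMap once it has been read from the PDF as text and looks up which codes produce a given character. The names parseToUnicode and findCodesFor are made up for the example, and the sketch assumes each destination in a bfrange is a single UTF-16 code unit and skips the array form of bfrange. Fonts without a ToUnicode entry need a different approach, such as the font's encoding.

```javascript
// Parse the bfchar and bfrange sections of a ToUnicode CMap into code -> text.
function parseToUnicode(cmap) {
  const map = new Map();
  // Destinations are UTF-16BE, four hex digits per code unit
  const hexToText = (hex) => {
    let text = '';
    for (let i = 0; i < hex.length; i += 4) {
      text += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
    }
    return text;
  };

  for (const section of cmap.matchAll(/beginbfchar([\s\S]*?)endbfchar/g)) {
    const pairs = section[1].matchAll(/<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g);
    for (const [, code, dest] of pairs) {
      map.set(code.toUpperCase(), hexToText(dest));
    }
  }

  for (const section of cmap.matchAll(/beginbfrange([\s\S]*?)endbfrange/g)) {
    const triples = section[1].matchAll(/<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g);
    for (const [, lo, hi, dest] of triples) {
      const first = parseInt(lo, 16);
      const last = parseInt(hi, 16);
      const base = parseInt(dest, 16);
      for (let code = first; code <= last; code++) {
        const key = code.toString(16).toUpperCase().padStart(lo.length, '0');
        map.set(key, String.fromCodePoint(base + (code - first)));
      }
    }
  }
  return map;
}

// List every character code that maps to the given text.
function findCodesFor(map, character) {
  return [...map].filter(([, text]) => text === character).map(([code]) => code);
}
```

The codes returned by findCodesFor(map, '@') are the hex strings to look for in that font's Tj operands; they have to be looked up again for every font, since each font carries its own map.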

creativebull · 2022-10-25T05:24:51Z

I got the error which the others mentioned before. Did anyone find a solution for this error?

TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function

Suraj0704 · 2023-01-12T10:15:46Z

@galkahana @filmerjarred
I want to modify a PDF in this way:
first search for a string in the PDF and then make that string bold.
I have a project that needs this; please suggest any solution.
Thanks in Advance

galkahana closed this as completed Jul 29, 2017

nschnierer mentioned this issue Oct 5, 2017

Possibility for page numbers in PDF? puppeteer/puppeteer#373

Closed

miqmago mentioned this issue Oct 11, 2018

Fetch all hyperlinks in document and replace them #334

Open

chunyenHuang mentioned this issue May 25, 2020

Erase text in existing pdf. chunyenHuang/hummusRecipe#183

Open

This was referenced Feb 15, 2021

Text replace #454

Open

Text replace chunyenHuang/hummusRecipe#207

Open

duchiep123 mentioned this issue Mar 17, 2021

Remove or replace some text in PDF file #456

Open

Suraj0704 mentioned this issue Jan 13, 2023

@galkahana @filmerjarred #474

Open
