How to search and replace text within a document? #71
Comments
Hi, you need to tackle these problems:
How to parse the text

Content (text, graphics) is placed in the content streams of pages, so you need to look into each page's content streams (a page may have more than one). Text is placed in content streams inside blocks delimited by the "BT" and "ET" commands. You need something to tokenize the content stream. I have a good class for this in the C++ implementation called PDFParserTokenizer, which I didn't expose via the hummusjs module. If it makes sense we may want to expose it, or reimplement it. Its definition is here. This one here shows basic tokenization of a content stream. Hope it's OK that it's in C++.

Note that you may also get form XObjects placed. These are pieces of reusable graphics that function like pages within pages. You need to track their content streams too in case they are placed in a page. Get this up and running, and once you're happy with getting the text in a document/page you can move on to replacing the text.

How to replace the text

If you want to replace the text, you should track the original placement commands in charge of it and replace them with a new command placing the new text. You'll probably have to replace the whole paragraph (you have to figure out whether something is a paragraph), because the text length will change, placements will change, and you don't want your replaced text to look funny. By funny I mean that it will run over the text following it or leave too much space. So you are really looking to replace the whole paragraph's text; that's probably a better approach. Figure out the new paragraph text and place it. Hopefully this will work. You can use hummus commands to place the new text, or use lower-level commands.

Good luck,
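To make the parsing step above concrete, here is a minimal sketch of tokenizing a content stream and collecting the strings shown between "BT" and "ET". It is not PDFParserTokenizer and does not use hummus: `StreamToken`, `tokenize` and `extractText` are hypothetical names, and the input is assumed to be an already decompressed content stream passed in as a plain string. It ignores comments, inline images and font encodings, so the strings it returns are raw operands, not necessarily readable text.

```typescript
// Simplified content-stream tokenizer (illustrative only, not the hummus API).
type StreamToken = { kind: "string" | "hex" | "other"; value: string };

const DELIMS = "()<>[]/";

function tokenize(stream: string): StreamToken[] {
  const tokens: StreamToken[] = [];
  let i = 0;
  while (i < stream.length) {
    const c = stream.charAt(i);
    const pair = stream.slice(i, i + 2);
    if (/\s/.test(c)) {
      i++;
    } else if (c === "(") {
      // Literal string: track nested parentheses and backslash escapes.
      let depth = 1;
      let value = "";
      i++;
      while (i < stream.length) {
        const ch = stream.charAt(i++);
        if (ch === "\\") {
          value += stream.charAt(i++); // simplified: keep the escaped character as-is
          continue;
        }
        if (ch === "(") depth++;
        if (ch === ")" && --depth === 0) break;
        value += ch;
      }
      tokens.push({ kind: "string", value });
    } else if (pair === "<<" || pair === ">>") {
      // Dictionary delimiters, kept as plain tokens.
      tokens.push({ kind: "other", value: pair });
      i += 2;
    } else if (c === "<") {
      // Hex string such as <00120003>.
      const end = stream.indexOf(">", i);
      const stop = end === -1 ? stream.length : end;
      tokens.push({ kind: "hex", value: stream.slice(i + 1, stop) });
      i = stop + 1;
    } else if (c === "[" || c === "]") {
      tokens.push({ kind: "other", value: c });
      i++;
    } else {
      // Number, name (/F1) or operator (Tj, Td, ...): read up to whitespace or a delimiter.
      let j = i + 1;
      while (
        j < stream.length &&
        !/\s/.test(stream.charAt(j)) &&
        DELIMS.indexOf(stream.charAt(j)) === -1
      ) {
        j++;
      }
      tokens.push({ kind: "other", value: stream.slice(i, j) });
      i = j;
    }
  }
  return tokens;
}

// Collect the string operands of text-showing operators (Tj, TJ, ', ") inside BT ... ET.
function extractText(stream: string): string[] {
  const runs: string[] = [];
  const showOps = ["Tj", "TJ", "'", "\""];
  let inText = false;
  let pending: string[] = []; // string operands seen since the last operator
  for (const t of tokenize(stream)) {
    if (t.kind === "string") { pending.push(t.value); continue; }
    if (t.kind === "hex") { pending.push("<" + t.value + ">"); continue; }
    if (t.value === "BT") inText = true;
    if (t.value === "ET") inText = false;
    // Numbers, names and array brackets are operands; anything else ends the operand list.
    if (/^[A-Za-z'"][A-Za-z0-9*'"]*$/.test(t.value)) {
      if (inText && showOps.indexOf(t.value) !== -1 && pending.length > 0) {
        runs.push(pending.join(""));
      }
      pending = [];
    }
  }
  return runs;
}

const sample = "q BT /F1 12 Tf 72 720 Td (Hello \\(PDF\\)) Tj [(Wor) -20 (ld)] TJ ET Q";
console.log(extractText(sample));
```

Keeping hex strings in their `<...>` form is deliberate: those are character codes that only map to readable text through the font, so a plain regex search will not match them.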
How to parse text with hummus: http://pdfhummus.com/post/156548561656/extracting-text-from-pdf-files
I managed to implement this the following way
Note this will only work if the new text being written is already in the PDF (I think it has something to do with the font info for characters not already in the PDF not being included in the document). To make the code work you also need to provide a font file for writing the example text.
@BrighTide it seems like your code doesn't work. The outputClean.pdf and output.pdf files are empty.
Revisited the old code and this is what shook out in the end. This is working for us to this day.
Please explain this line: 'let toRedactString = findInText({patterns, string: pdfPageAsString})'. I don't understand that code.
findInText is defined further down; it simply runs an array of regexes against the string. You might also be confused about the ES6 feature that's being used? http://www.benmvp.com/learning-es6-enhanced-object-literals/#property-value-shorthand
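Since the original findInText is not shown in this thread, here is a hypothetical reconstruction of what such a helper could look like, mainly to illustrate the call shape and the property-value shorthand. The body is an assumption: it returns the first substring matched by any pattern, or undefined when nothing matches.

```typescript
// Hypothetical reconstruction, not the original helper from this thread.
// Runs each regex against the string and returns the first match found.
function findInText(
  { patterns, string }: { patterns: RegExp[]; string: string }
): string | undefined {
  for (const pattern of patterns) {
    const match = pattern.exec(string);
    if (match) return match[0];
  }
  return undefined; // no pattern matched
}

const patterns = [/amount/];
const pdfPageAsString = "BT (Total amount due) Tj ET";

// Property-value shorthand: `{ patterns }` means the same as `{ patterns: patterns }`.
const toRedactString = findInText({ patterns, string: pdfPageAsString });
console.log(toRedactString);
```

With a helper like this, an undefined result simply means no regex matched the page string. That can happen when the text is split across several string operands or stored as hex codes in the content stream, so the word never appears contiguously.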
Thanks @BrighTide. I copied your code and ran it, but I could not find the text that I need: 'toRedactString = undefined'. Please see my code: function findInText( function replaceText1(sourceFile: string, targetFile: string, patterns: any) { for (let page = 0; page < numPages; page += 1) {
} modPdfWriter.end(); return; replaceText1(sourcePDF, destinationPDF, [/amount/]);
Hey!
Hey, thanks for the snippet. But when running it I am getting the following error:
I tried downgrading the package, but it won't downgrade either.
I also get the same error.
… or nice but seems to work for now. We need to revisit this to re-integrate token support - maybe with this code example: galkahana/HummusJS#71 (comment)
@kicaUBUNTU I have the same question. Did you solve this problem? Can you tell me how you solved it?
Hi, thank you for the code. With the help of this snippet, I could extract the text in the TJ operators and replace it. However, the text from all TJs in the output PDF disappeared. I guess it has something to do with the font? How can I embed a font in the CopyingContext? Please help.
@venkatarajeshm I got this error
I'm also having the same problem and the issue is even before the But it might be related to what @galkahana said above: How to parse the text
Did anyone find a solution for this error?
@venkatarajeshm Hi venkatarajeshm, have you successfully replaced the text? BT Are <0012>, <0003> characters, right? I really need it
I got the error which the others mentioned before. Did anyone find a solution for this error? TypeError: pageObject.getDictionary(...).toJSObject(...).Contents.getObjectID is not a function |
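One possible explanation for this TypeError, not confirmed anywhere in this thread: as noted earlier, a page may have more than one content stream, in which case the Contents entry is an array of references rather than a single reference, and an array has no getObjectID. The sketch below normalizes both cases. It is duck-typed so it runs without hummus installed: `PdfRefLike`, `PdfArrayLike` and `contentStreamIds` are hypothetical names, and the assumption that the array object offers `toJSArray()` should be checked against the hummus version in use.

```typescript
// Hypothetical minimal shapes standing in for the parsed PDF objects.
interface PdfRefLike { getObjectID(): number }
interface PdfArrayLike { toJSArray(): unknown[] }

// Returns the object IDs of all content streams, whether Contents is
// missing, a single reference, or an array of references.
function contentStreamIds(contents: unknown): number[] {
  if (contents === null || contents === undefined) return []; // page without content
  const c = contents as Partial<PdfRefLike & PdfArrayLike>;
  if (typeof c.getObjectID === "function") return [c.getObjectID()];
  if (typeof c.toJSArray === "function") {
    return c
      .toJSArray()
      .reduce<number[]>((ids, item) => ids.concat(contentStreamIds(item)), []);
  }
  throw new TypeError("Unsupported Contents entry");
}

// Stand-in objects for illustration; real code would pass the page's Contents entry.
const fakeRef = (id: number): PdfRefLike => ({ getObjectID: () => id });
const fakeArray: PdfArrayLike = { toJSArray: () => [fakeRef(7), fakeRef(9)] };

console.log(contentStreamIds(fakeRef(5)));
console.log(contentStreamIds(fakeArray));
```

Each returned ID would then be read and rewritten separately, since the text of one page can be spread over all of its streams.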
@galkahana @filmerjarred
First off, thanks for writing and maintaining Hummus!
From reading the documentation and perusing the issues, I've gathered so far that this is not supported at a high level by Hummus. I also found your explanation suggesting that it wouldn't necessarily be too difficult, you just had to understand the structure/anatomy of a pdf document.
Was just curious if this was actually simpler to do than I've discovered? Or perhaps an example exists but I've just missed it?
Either way, I've started to read the PDF spec, focusing on the text portions and am starting to understand some of the low level API calls now. Any tips to set me down the right path would be appreciated.