Convert illegal html unicode without glyph to space or zero-width space. #493

duanyao · 2015-03-21T07:27:14Z

This patch converts space-like characters in the PDF (usually illegal in HTML) to a space or ZWSP in the HTML output, rather than to private-use Unicode code points. With it, the sample in #477 converts perfectly.

This patch depends on CairoFontEngine, so ENABLE_SVG must be on; if it is off, no conversion is performed at all.

There are minor modifications to CairoFontEngine.h|cc, which should be merged in a future update.

coolwanglu · 2015-03-21T10:57:28Z

src/HTMLTextLine.cc

-                                << ' ' << CSS::WHITESPACE_CN << wid << "\">" << (target > (threshold - EPS) ? " " : "") << "</span>";
+                                << ' ' << CSS::WHITESPACE_CN << wid << "\">";
+                            if (target > (threshold - EPS))
+                                dump_unicode(out, ' ');


I think that the space character will bring extra width after the span, which could be unintended?

The space is in that span, not after it. This change doesn't change previous behavior, just updates last_output_unicode.

Ah I see, sorry for the mistake.

coolwanglu · 2015-03-21T11:08:12Z

Thanks for the patch! Maybe we can also enable it for --space-as-offset, or just wait until PDF files that need it emerge.

One thing I don't quite like is that touching Cairo* is not recommended, would make it harder to maintain.

I feel that this patch adds too many states/variables, making the renderer even more messy. Of course this is due to the current ugly design of HTMLRenderer. I wonder if we could delay this until the overhaul is finished, so that this feature can be implemented as a separate, self-contained pass.

Finally, could you also add a test case for this (maybe after we have no issues left for the code).

Other possible methods I was thinking of:

- Detect empty chars using FontForge, but Type 3 fonts might not be supported, and the FontForge API could be messy.
- Detect empty chars by painting them (e.g. with SplashOutputDev) and checking whether anything was updated, which could be done in the preprocessor.
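A minimal sketch of the painting idea, for illustration only. `Canvas`, `GlyphPainter` and `glyph_is_empty` are hypothetical names that do not exist in pdf2htmlEX or poppler; a real version would render through SplashOutputDev instead of a plain byte buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for an offscreen bitmap (assumption: a real
// implementation would inspect the bitmap produced by SplashOutputDev).
struct Canvas {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels; // one coverage byte per pixel, 0 = untouched

    Canvas(std::size_t w, std::size_t h) : width(w), height(h), pixels(w * h, 0) {}
};

// Hypothetical callback that draws one character code onto the canvas.
using GlyphPainter = std::function<void(Canvas &, unsigned code)>;

// Paint the glyph onto a blank canvas and report whether any pixel changed.
bool glyph_is_empty(const GlyphPainter & paint, unsigned code,
                    std::size_t w = 32, std::size_t h = 32)
{
    Canvas canvas(w, h);
    paint(canvas, code);
    for (std::uint8_t p : canvas.pixels)
        if (p != 0)
            return false; // something was drawn, so the glyph is visible
    return true;
}
```

The check costs one render per glyph, which is why doing it once per glyph in the preprocessor would matter.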

duanyao · 2015-03-21T11:30:36Z

> One thing I don't quite like is that touching Cairo* is not recommended, would make it harder to maintain.

Then we can copy the necessary code from CairoFontEngine.h|cc and make a dedicated font engine for this task. Actually this engine doesn't have to depend on cairo; it only depends on poppler and freetype. Detecting empty glyphs with freetype seems simple enough and should be much faster than actually drawing chars.

> I feel that this patch adds too many states/variables, making the renderer even more messy.

Yeah, I also plan to simplify it a little in the following days, especially in drawString().

> Finally, could you also add a test case for this (maybe after we have no issues left for the code).

The sample file in #477 was used to test this patch.

Unfortunately I found a problem in browsers: FF and WebKit disagree on whether to add letter-spacing around ZWSP:
<div style="font-size:30px;letter-spacing:10px">MM</div>
This means we can't output ZWSP if letter-spacing is not zero. Can we just omit ZWSP in this case? It should be rare.
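A rough sketch of that rule, not the code in this patch. The function name and the constants are hypothetical, and it is an assumption here that an empty glyph with a positive advance width becomes a space while a zero-width one becomes ZWSP.

```cpp
#include <cmath>

constexpr char32_t SPACE     = 0x0020;
constexpr char32_t ZWSP      = 0x200B;
constexpr char32_t NO_OUTPUT = 0; // sentinel: emit nothing for this glyph

// Hypothetical helper: pick the replacement for a glyph known to be empty.
// Assumption: glyphs that advance the pen become a space, zero-width glyphs
// become ZWSP, and ZWSP is dropped when letter-spacing is non-zero because
// browsers disagree on how to space it.
char32_t replacement_for_empty_glyph(double advance_width,
                                     double letter_spacing,
                                     double eps = 1e-6)
{
    if (advance_width > eps)
        return SPACE;
    if (std::fabs(letter_spacing) > eps)
        return NO_OUTPUT;
    return ZWSP;
}
```

With this shape the caller only has to skip output when the sentinel comes back, so the letter-spacing special case stays in one place.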

coolwanglu · 2015-03-22T05:04:19Z

Further, it seems that this is better done in the Preprocessor. Since checking whether a glyph is empty might be expensive, we could check each glyph used in the document first and store the results along with the mapping info.
But again, it would be better to do so after the overhaul.
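One way the "check once, store the result" idea might look, as a sketch only. `EmptyGlyphCache` and its `Checker` callback are hypothetical and not part of the Preprocessor; the font id and character code types are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical cache: runs the (possibly expensive) emptiness check at most
// once per (font id, character code) pair and remembers the answer.
class EmptyGlyphCache {
public:
    using Checker = std::function<bool(long long font_id, unsigned code)>;

    explicit EmptyGlyphCache(Checker checker) : checker_(std::move(checker)) {}

    bool is_empty(long long font_id, unsigned code)
    {
        const auto key = std::make_pair(font_id, code);
        const auto it = results_.find(key);
        if (it != results_.end())
            return it->second;           // already checked
        const bool empty = checker_(font_id, code);
        results_.emplace(key, empty);    // store for later lookups
        return empty;
    }

    std::size_t size() const { return results_.size(); }

private:
    Checker checker_;
    std::map<std::pair<long long, unsigned>, bool> results_;
};
```

The renderer would then only do map lookups while drawing text, and the stored results could live next to the existing mapping info.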

duanyao · 2015-03-22T08:04:39Z

OK, I agree.

duanyao · 2015-03-23T08:35:22Z

After some thought, I think most (if not all) "illegal html unicode without glyph" should be converted to a <span class="_ _N"> </span>, because:

Browsers implement letter-spacing for ZWSP etc. differently (it should not apply; FF and IE have bugs), while they implement letter-spacing for inline-block consistently (not applied at all). So converting an illegal unicode to a ZWSP + offset pair is not ideal.
In order to largely maintain the semantics, most illegal html unicodes should be converted to delimiters (a space or something similar), not just empty spans. P.S. I think if --space-as-offset is on, at least one space should also be retained for each converted offset.

In order to accomplish this, HTMLTextLine::append_offset(double width) may be extended to HTMLTextLine::append_offset(double width, Unicode ch), meaning a char (in most cases a space) can be associated with the offset. This is called a "mixed offset".

HTMLTextLine::text field can also be extended to type vector<HTMLTextUnit>, and HTMLTextUnit is defined as:

struct HTMLTextUnit
{
  HTMLTextState * state; // or int state, index of HTMLTextLine::states
  int char_count; // corresponding to how many CharCode
  Unicode unicode;
  float width; // width of offset
};

The type of an HTMLTextUnit is implied by its fields:

Normal text: char_count == 1, unicode > 0, width is NaN (means defined by font/state)
Pure offset: char_count == 0, unicode == 0, width != 0
Mixed offset: char_count > 0, unicode == 0x20 (may be extended), width is not NaN
Repeating spaces: char_count > 1, unicode == 0x20, width is NaN
Decomposed ligature
- First: same as normal text
- Following units: char_count == 0, unicode > 0
Not in use: char_count == 0, unicode == 0, width == 0
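The implied types above could be decoded roughly like this. This is a sketch of the proposal, not existing code: `UnitKind`, `classify` and `kFontDefinedWidth` are hypothetical names, `Unicode` is aliased locally as a stand-in for poppler's typedef, and `state` uses the index variant.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

using Unicode = std::uint32_t; // assumption: stand-in for poppler's Unicode

struct HTMLTextUnit {
    int state;       // index into HTMLTextLine::states
    int char_count;  // how many CharCodes this unit corresponds to
    Unicode unicode;
    float width;     // width of the offset, NaN = defined by font/state
};

constexpr float kFontDefinedWidth = std::numeric_limits<float>::quiet_NaN();

enum class UnitKind {
    NormalText, PureOffset, MixedOffset, RepeatingSpaces,
    LigatureTail, NotInUse, Unknown
};

// Decode the kind of a unit from its fields, following the list above.
// Combinations the list does not mention are reported as Unknown.
UnitKind classify(const HTMLTextUnit & u)
{
    const bool has_width = !std::isnan(u.width);
    if (u.char_count < 0)
        return UnitKind::Unknown;
    if (u.char_count == 0) {
        if (u.unicode > 0)
            return UnitKind::LigatureTail;   // later part of a decomposed ligature
        if (!has_width)
            return UnitKind::Unknown;
        return u.width != 0 ? UnitKind::PureOffset : UnitKind::NotInUse;
    }
    if (has_width)
        return u.unicode == 0x20 ? UnitKind::MixedOffset : UnitKind::Unknown;
    if (u.unicode == 0x20 && u.char_count > 1)
        return UnitKind::RepeatingSpaces;
    if (u.char_count == 1 && u.unicode > 0)
        return UnitKind::NormalText;         // also the first part of a ligature
    return UnitKind::Unknown;
}
```

Returning Unknown for unlisted combinations makes it easy to assert on them while the field rules are still being settled.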

Some notes:

char_count can be used to sync with CoveredTextDetector
Pure/mixed offsets may be merged during text optimization, with char_count and width added up. After merging, the removed HTMLTextUnit can be marked as "not in use" (the vector doesn't have to be compacted).
~~Large offsets may be converted to "repeating spaces" during text optimization~~ No, this would invalidate char_count. Large offsets may be converted to multiple normal text units.
HTMLTextLine::offsets and HTMLTextLine::decomposed_text are not needed anymore.
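The merging note could be sketched as below. Again this only illustrates the proposal: `is_offset` and `merge_adjacent_offsets` are hypothetical, `Unicode` is a local stand-in for poppler's typedef, and merging only units that share the same state is an added assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Unicode = std::uint32_t; // assumption: stand-in for poppler's Unicode

struct HTMLTextUnit {
    int state;       // index into HTMLTextLine::states
    int char_count;
    Unicode unicode;
    float width;     // NaN = defined by font/state
};

// True for pure offsets (no char) and mixed offsets (space plus width).
static bool is_offset(const HTMLTextUnit & u)
{
    if (std::isnan(u.width))
        return false;
    if (u.unicode == 0)
        return u.char_count == 0 && u.width != 0;  // pure offset
    return u.unicode == 0x20 && u.char_count > 0;  // mixed offset
}

// Merge each run of adjacent offsets into its first unit, adding up
// char_count and width. Absorbed units are marked "not in use" in place,
// so the vector is not compacted and indices stay valid.
void merge_adjacent_offsets(std::vector<HTMLTextUnit> & units)
{
    const std::size_t none = units.size();
    std::size_t head = none; // first unit of the current run, if any
    for (std::size_t i = 0; i < units.size(); ++i) {
        if (!is_offset(units[i])) {
            head = none;     // any other unit ends the run
            continue;
        }
        if (head == none || units[head].state != units[i].state) {
            head = i;        // start a new run
            continue;
        }
        units[head].char_count += units[i].char_count;
        units[head].width += units[i].width;
        if (units[i].unicode == 0x20)
            units[head].unicode = 0x20; // the run now carries a space
        units[i] = HTMLTextUnit{units[i].state, 0, 0, 0.0f}; // not in use
    }
}
```

Because char_count is summed rather than dropped, the total number of CharCodes per line is unchanged by the merge.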

Convert illegal html unicode without glyph to space or zero-width space.

9dbd504

coolwanglu reviewed Mar 21, 2015
coolwanglu self-assigned this Jun 23, 2015

jwuttke added a commit to jwuttke/pdf2htmlEX that referenced this pull request Sep 29, 2016

Merge branch pull request coolwanglu#493 into merge_all

551f1b9
