-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SRT HTML Entities #122
SRT HTML Entities #122
Conversation
webvtt.go
Outdated
@@ -36,8 +36,8 @@ var ( | |||
bytesWebVTTTimeBoundariesSeparator = []byte(webvttTimeBoundariesSeparator) | |||
webVTTRegexpInlineTimestamp = regexp.MustCompile(`<((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})>`) | |||
webVTTRegexpTag = regexp.MustCompile(`(</*\s*([^\.\s]+)(\.[^\s/]*)*\s*([^/]*)\s*/*>)`) | |||
webVTTEscaper = strings.NewReplacer("&", "&", "<", "<") | |||
webVTTUnescaper = strings.NewReplacer("&", "&", "<", "<") | |||
webVTTEscaper = strings.NewReplacer("astisub.amp", "&") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like the idea of using this astisub.amp
string to mess with html escape/unescape 🤔
Wouldn't adding "\u00A0", " "
to htmlEscaper
and " ", "\u00A0""
to htmlUnescaper
do the trick? 🤔 (\u00A0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solves for all html entities (e.g. €
"
...) rather than keeping a short list of acceptable entities. I tried just using go's html.EscapeString
, however that leads to converting everything in the output. So, things like single quotes are forced to '
double quotes go to "
. Not sure if that is acceptable for this library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that when converting to ssa
, stl
or teletext
there will be astisub.amp
as well in the output. What is parsed from any subtitle format and written in the Subtitles
struct should not be format-specific and should be usable by all formats. Therefore we need to transform the html entities into "neutral" characters that express themselves properly and are usable by all formats: &
can become &
, <
<
, €
€
, "
"
, etc. Unfortunately, I don't see another way than having a short list of acceptable entities 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I updated my changes back to using a short list of acceptable entities with nbsp added. I might look into a larger solution for this later.
5b3397c
to
dbd8b1b
Compare
Thanks for the PR! ❤️ |
This PR solves an issue where srt files output to webvtt are double encoded. This also updates the current webvtt escaping to be generalized with all HTML entities. For example, currently an srt file containing an
will output to&nbsp;
when being written to vtt.