Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SRT HTML Entities #122

Merged
merged 3 commits into from
Dec 17, 2024
Merged

Conversation

justin-taylor
Copy link
Contributor

This PR solves an issue where srt files output to webvtt are double encoded. This also updates the current webvtt escaping to be generalized with all HTML entities. For example, currently an srt file containing an   will output to   when being written to vtt.

srt.go Outdated Show resolved Hide resolved
srt.go Outdated Show resolved Hide resolved
webvtt.go Outdated
@@ -36,8 +36,8 @@ var (
bytesWebVTTTimeBoundariesSeparator = []byte(webvttTimeBoundariesSeparator)
webVTTRegexpInlineTimestamp = regexp.MustCompile(`<((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})>`)
webVTTRegexpTag = regexp.MustCompile(`(</*\s*([^\.\s]+)(\.[^\s/]*)*\s*([^/]*)\s*/*>)`)
webVTTEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;")
webVTTUnescaper = strings.NewReplacer("&amp;", "&", "&lt;", "<")
webVTTEscaper = strings.NewReplacer("astisub.amp", "&")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like the idea of using this astisub.amp string to mess with html escape/unescape 🤔

Wouldn't adding "\u00A0", "&nbsp;" to htmlEscaper and "&nbsp;", "\u00A0"" to htmlUnescaper do the trick? 🤔 (\u00A0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solves for all html entities (e.g. &euro; &quot; ...) rather than keeping a short list of acceptable entities. I tried just using go's html.EscapeString, however that leads to converting everything in the output. So, things like single quotes are forced to &#39; double quotes go to &#34;. Not sure if that is acceptable for this library.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that when converting to ssa, stl or teletext there will be astisub.amp as well in the output. What is parsed from any subtitle format and written in the Subtitles struct should not be format-specific and should be usable by all formats. Therefore we need to transform the html entities into "neutral" characters that express themselves properly and are usable by all formats: &amp; can become &, &lt; <, &euro; , &quot; ", etc. Unfortunately, I don't see another way than having a short list of acceptable entities 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I updated my changes back to using a short list of acceptable entities with nbsp added. I might look into a larger solution for this later.

webvtt_test.go Outdated Show resolved Hide resolved
srt_test.go Outdated Show resolved Hide resolved
@asticode asticode merged commit f285923 into asticode:master Dec 17, 2024
1 check passed
@asticode
Copy link
Owner

Thanks for the PR! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants