Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Unicode characters in slugs #640

Merged
merged 19 commits into from
Oct 4, 2017
Merged

Support Unicode characters in slugs #640

merged 19 commits into from
Oct 4, 2017

Conversation

tech4him1
Copy link
Contributor

@tech4him1 tech4him1 commented Sep 30, 2017

- Summary

Right now, we only support Latin chars in slugs/filenames (we strip accents from chars if we can, and drop everything else). This isn't very user-friendly, and latest browsers (including IE11) support Unicode, so we might as well allow users to use it. We still have to sanitize the slugs to make sure that the filenames are safe , but other than that, we don't have to do anything else.

This PR upgrades the slug formatter to use the new IRI standards (RFC 3987), which support other than latin-1 chars in slugs, instead of the URI one (RFC 3986).

One other thing I did here was to sanitize all types of slugs, not just the input titles. This should make it much easier to do slug customization later.

As a side effect, this closes #607.

- Test plan

Lots of testing needs done on this, and we should probably announce it as well (for users that were previously having accents stripped).

Tests were also added.

Example Test Cases:

Title Old Slug New Slug
ěščřžý escrzy ěščřžý
日本語のタイトル (empty) 日本語のタイトル
duck\\goose.elephant duckgooseelephant duck-goose-elephant
Netlify CMS Planning: Working in Sprints netlify-cms-planning-working-in-sprints netlify-cms-planning-working-in-sprints
This, that, and the other! this-that-and-the-other this-that-and-the-other

- Description for the changelog

Support Unicode characters in slugs.

- A picture of a cute animal (not mandatory but encouraged)
wynand-van-poortvliet-364366

@tech4him1
Copy link
Contributor Author

tech4him1 commented Sep 30, 2017

Should we still convert all characters to (localized) lower-case here? Done
Also, should we convert spaces to dashes? Done
What about symbols (commas, exclamation points)? Done

Or should we just do the bare minimum and let any SSG do the rest? But then why even have slug customization -- the CMS wouldn't even know what the slug was going to be, then.

@tech4him1 tech4him1 changed the title WIP: Support UTF-8 characters in slugs. Support UTF-8 characters in slugs. Sep 30, 2017
@tech4him1 tech4him1 changed the title Support UTF-8 characters in slugs. WIP: Support UTF-8 characters in slugs. Sep 30, 2017
@tech4him1 tech4him1 changed the title WIP: Support UTF-8 characters in slugs. Support UTF-8 characters in slugs. Sep 30, 2017
@tech4him1 tech4him1 changed the title Support UTF-8 characters in slugs. Support UTF-8 characters in slugs Sep 30, 2017
@tech4him1 tech4him1 requested a review from biilmann September 30, 2017 22:25
@tech4him1
Copy link
Contributor Author

tech4him1 commented Sep 30, 2017

OK, everybody, this is ready for review. Please throw some test cases at it, and let me know if you find a unique one.

@tech4him1 tech4him1 force-pushed the utf8-slugs branch 2 times, most recently from 80690e4 to dcc7a73 Compare October 1, 2017 02:12
@tech4him1 tech4him1 changed the title Support UTF-8 characters in slugs Support Unicode characters in slugs Oct 2, 2017
Copy link
Contributor

@erquhart erquhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @tech4him1, thank you! Just a couple of small change requests, after which this should be good to go.

@@ -42,7 +43,7 @@ const slugFormatter = (template = "{{slug}}", entryData) => {
return identifier;
};

return template.replace(/\{\{([^\}]+)\}\}/g, (_, field) => {
let slug = template.replace(/\{\{([^\}]+)\}\}/g, (_, field) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we allow some mutable ops for dense logic or areas where performance is a concern, we generally want to avoid them. I'd recommend something more like:

const slug = template.replace(...)
  .method()
  .method();

I'd also move sanitizeFilename and the trailing character replacement to a new sanitizeSlug export from urlHelper, so you can just return sanitizeSlug(slug).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I'm not sure if I have it the best way. Could you please review sanitizeSlug?

@@ -12,6 +12,22 @@ export function getNewEntryUrl(collectionName, direct) {
return getUrl(`/collections/${ collectionName }/entries/new`, direct);
}

// Unreserved chars from RFC3987.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be a little bit more verbose in this comment, maybe even add a reference link? You did a lot of research, just want to capture some of it here.

Copy link
Contributor Author

@tech4him1 tech4him1 Oct 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds good, because I sure don't want to have to do it over again :).

@erquhart erquhart removed the request for review from biilmann October 2, 2017 23:53
@tech4him1
Copy link
Contributor Author

@erquhart Can you please review the last commit on this and see if you have any improvements on the sanitizeSlug function?

export function sanitizeSlug(str, { replacement = '-' }) {
if (!isString(str)) throw "`sanitizeSlug` only accepts strings as input.";
if (!isString(replacement)) throw "the `sanitizeSlug` replacement character must be a string.";
let slug = str;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to avoid mutations - we can use a couple of Lodash functions (flow and partialRight) to go immutable without the cruft:

const sanitize = flow([
  partialRight(sanitizeIRI, { replacement }),
  partialRight(sanitizeFilename, { replacement }),
]);

const sanitizedSlug = sanitize(slug);

// After declaring replacement expressions:

const normalizedSlug = slug
  .replace(...)
  .replace(...);

Be sure to keep your source comments, those are helpful.

}
});
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you bring back your source comment explaining this regex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@tech4him1
Copy link
Contributor Author

tech4him1 commented Oct 3, 2017

@erquhart OK, done. Are the comments clear enough? Also, do you have any suggestions on making the sanitizeIRI immutable, or is that a case where we don't need to?

return result;
}

export function sanitizeSlug(str, { replacement = '-' } = {}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use an options object here? Do we expect more complex arguments in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dopry No, I was just keeping the interface the same as sanitizeFilename and sanitizeIRI. Also, I think sanitizeSlug(slug, {replacement: '-'}) is clearer than sanitizeSlug(slug, '-').

Is there any particular reason that you don't think it should be there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use Intellisense so I get hinted variable names and types usually. If you were to write a jsdoc block for this you'd have to create a typedef for each options object. When there is only one option there isn't much function purpose to use it.

I personally think it's much easier to understand...

function sanitizeSlug(slug, replacement = '-') {} but this is a matter of preference and I'd approve with or without... just curious at this point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dopry I'm talking about calling the function -- where sanitizeSlug(slug, {replacement: '-'}) is clearer than sanitizeSlug(slug, '-').

const doubleReplacement = new RegExp('(?:' + escapeRegExp(replacement) + ')+', 'g');
const trailingReplacment = new RegExp(escapeRegExp(replacement) + '$');
const normalizedSlug = sanitizedSlug
.replace(doubleReplacement, '-')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use replacement instead of '-'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


return normalizedSlug;
}

export function urlize(string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urlize is no longer used, it should be removed so we're not dragging unused code around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, but there are a couple reasons I don't want to remove it yet, plus it gets trimmed from the bundle anyway. Thanks for noting this, though, I'll look into it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What reasons? It's in the git history, we can always dig it up if we need it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@biilmann added it. Also, we weren't using it before, either, so I think it should be a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@tech4him1
Copy link
Contributor Author

@erquhart @Benaiah Do you care at all whether the interface is sanitizeSlug(str, { replacement: '-' }) or sanitizeSlug(str, '-')?

@tech4him1 tech4him1 requested review from dopry and Benaiah October 4, 2017 13:37
@erquhart erquhart merged commit 876cb2c into master Oct 4, 2017
@erquhart erquhart deleted the utf8-slugs branch October 4, 2017 14:34
@tech4him1 tech4him1 removed request for dopry and Benaiah October 4, 2017 14:35
@tech4him1 tech4him1 mentioned this pull request Jan 17, 2018
tech4him1 added a commit that referenced this pull request Jan 17, 2018
originally removed in #640, accidently added as part of #638
tech4him1 added a commit that referenced this pull request Jan 17, 2018
originally removed in #640, accidently added as part of #638
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch to a smaller slugifier.
4 participants