@xnetcat

xnetcat commented Jan 11, 2023

As in the title, this option replaces ambiguous (look-alike) characters with their closest ASCII equivalent.

for example:
𝙎 -> s
𝘽 -> b
Drake - WizKid Kyla - One Dance (𝙎𝙇𝙊𝙒𝙀𝘿 + 𝙍𝙀𝙑𝙀𝙍𝘽) -> drake-wizkid-kyla-one-dance-slowed-reverb

before:
𝙎 ->
𝘽 ->
Drake - WizKid Kyla - One Dance (𝙎𝙇𝙊𝙒𝙀𝘿 + 𝙍𝙀𝙑𝙀𝙍𝘽) -> drake-wizkid-kyla-one-dance

full list: https://pastebin.com/yY9UmByb

based on: https://github.com/hediet/vscode-unicode-data/
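
The before/after behaviour above can be sketched roughly as follows. This is only an illustration under stated assumptions: `AMBIGUOUS_TO_ASCII`, `replace_ambiguous` and `naive_slug` are hypothetical names, the table is a tiny excerpt rather than the full list, and `naive_slug` is a crude stand-in, not python-slugify's actual implementation.

```python
import re

# Hypothetical excerpt of the replacement table; the full list is far longer.
AMBIGUOUS_TO_ASCII = dict(zip("𝙎𝙇𝙊𝙒𝙀𝘿𝙍𝙑𝘽", "SLOWEDRVB"))


def replace_ambiguous(text: str) -> str:
    """Swap each look-alike character for its ASCII counterpart."""
    return "".join(AMBIGUOUS_TO_ASCII.get(ch, ch) for ch in text)


def naive_slug(text: str) -> str:
    """Crude stand-in for a slugifier: lowercase, keep ASCII runs, join with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


title = "Drake - WizKid Kyla - One Dance (𝙎𝙇𝙊𝙒𝙀𝘿 + 𝙍𝙀𝙑𝙀𝙍𝘽)"

# Without the replacement step the styled letters are simply dropped.
print(naive_slug(title))                     # drake-wizkid-kyla-one-dance

# With the replacement step they survive as plain ASCII.
print(naive_slug(replace_ambiguous(title)))  # drake-wizkid-kyla-one-dance-slowed-reverb
```

Note that in this sketch the replacement has to run before the slug step: once the styled letters have been stripped, there is nothing left to replace.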

@un33k
Owner

un33k commented Jan 11, 2023

Looks good, please rename, prune and re-push. Thx

@jdevera
Collaborator

jdevera commented Jan 11, 2023

Hi there. Thanks a lot for taking the time to put a PR together with this contribution!

Something that would be awesome to see in PRs, in general, is the rationale behind the proposed changes. It helps make the subsequent discussion much more fruitful.

That would be my first question here. What problem took you to this change?

I'll add another:

  1. Is there any advantage of this approach over adding the additional replacement as an after-step of the slugification? I mean, what if you take the list of ambiguous characters and run the same replacement after the slug is created with this library?

@jdevera
Collaborator

jdevera commented Jan 11, 2023

Also, the linked repo quoted as the source does not have a readme, which leaves me with more questions about this list of replacements, and since you are also the owner there, please allow me:

  1. What is the source of the list?
  2. If it is manually curated, what is your process to decide on the replacements that get added or not?
  3. How might we ensure the copy you are bringing in the PR is in sync?
  4. Have you thought of publishing those replacements, if you are curating them, as a small package?

@xnetcat
Author

xnetcat commented Jan 11, 2023

0 Something that would be awesome to see in PRs, in general, is the rationale behind the proposed changes. It helps make the subsequent discussion much more fruitful.

I felt this would be a good change, since the current version just removes ambiguous characters.
In my opinion, the intended behaviour should be to replace them with the closest ASCII equivalent, the same way Cyrillic, Asian, Greek and other characters are processed.

1.1 Also, the linked repo quoted as the source does not have a readme, which leaves me with more questions about this list of replacements, and since you are also the owner there, please allow me:

I am not the owner of that repo; I found it by accident. I noticed that VS Code suggests replacements for ambiguous characters, so I checked the source code, and that led me to that repo.

2.1 What is the source of the list?

From what I've checked, it's based on the official confusable characters dataset from unicode.org, plus some manually added ones: https://github.com/hediet/vscode-unicode-data/blob/main/data/README.md

2.2 If it is manually curated, what is your process to decide on the replacements that get added or not?

See 2.1.

2.3 How might we ensure the copy you are bringing in the PR is in sync?

This list shouldn't change that often. The Unicode Standard is updated once a year, but the updates mostly bring new emojis and scripts (e.g. Nag Mundari), which shouldn't affect this list.

Unicode standard 8 released in 2015: 6069 characters
Unicode standard 12 released in 2019: 6296 characters
Unicode standard 13 released in 2020: 6311 characters <- only 15 new characters added in 2020, no changes since
Unicode standard 14 released in 2021: 6311 characters
Unicode standard 15 released in 2022: 6311 characters

2.4 Have you thought of publishing those replacements, if you are curating them, as a small package?

See 1.1.

1.2 Is there any advantage of this approach over adding the additional replacement as an after-step of the slugification. I mean, what if you use the list of ambiguous characters and run the same replacement after the slug is created with this library?

I've just noticed that the Unidecode package works as expected with ambiguous characters, so this PR is obsolete now 🤦‍♂️. I will close it. Sorry for the trouble.
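
A quick way to check this locally might look like the sketch below. It assumes the third-party Unidecode package is installed; `to_ascii` is a hypothetical helper, and the standard-library NFKD fallback is there only so the snippet still runs without Unidecode. The fallback happens to cover these styled letters because they carry Unicode compatibility decompositions, but it is not equivalent to Unidecode in general.

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party "Unidecode" package, if installed
except ImportError:
    unidecode = None


def to_ascii(text: str) -> str:
    """Transliterate to ASCII with Unidecode when available (hypothetical helper)."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: compatibility-decompose, then drop whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("𝙎𝙇𝙊𝙒𝙀𝘿 + 𝙍𝙀𝙑𝙀𝙍𝘽"))
```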

xnetcat closed this Jan 11, 2023

@un33k
Owner

un33k commented Feb 24, 2023

@xnetcat thank you for the info on this PR. I have added it to the wiki page:

https://github.com/un33k/python-slugify/wiki/Python-Slugify-Wiki#notes-on-unidecode
