ubase.js is a JavaScript library for removing accents, diacritics (and more) from utf8 strings.
Many utf8 characters are "based" on latin letters. That is clear for accented letters such as "é", which is based on "e", but it also holds for rarer symbols like "🅴" or "Ǝ". The idea of this simple library is to give you back the base letter of these characters.
npm install ubase.js
or simply copy the ubase.js file where you need it.
You just need the ubase.js file. Usage is straightforward. The main function is basify:
> const ubase = require ("ubase.js");
undefined
> ubase.basify ('Bøǹĵöůɍ');
'Bonjour'
If you just copied the ubase.js file to your current directory, replace the first line above with:
> const ubase = require ("./ubase.js");
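For comparison, here is a sketch that uses only the built-in String.prototype.normalize, not ubase.js. The helper name stripCombiningMarks is hypothetical. It drops combining marks after canonical decomposition, which handles letters like 'ö' but leaves characters that have no decomposition, such as 'ø' or 'ɍ', untouched; those are cases where the basify example above still returns the base letter.

```javascript
// Built-in approach (no ubase.js): decompose, then drop combining marks.
function stripCombiningMarks(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

// Same word as in the basify example, written with escapes:
// B, ø (U+00F8), ǹ (U+01F9), ĵ (U+0135), ö (U+00F6), ů (U+016F), ɍ (U+024D)
const word = 'B\u00f8\u01f9\u0135\u00f6\u016f\u024d';

// ø and ɍ have no canonical decomposition, so they survive unchanged.
console.log(stripCombiningMarks(word)); // 'Bønjouɍ'
```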
You may control the behaviour of basify in case of malformed utf8, or non-latin characters:
set_malformed ( s ): the given string s will be used to replace any malformed utf8 char. Default is '?'.
set_strip ( s ): s can be either a string, or undefined. If s is a string, it will replace any non-ASCII utf8 char that is not based on a latin char, like '→'. It is allowed for s to be the empty string (hence the name "strip"). If s is undefined, no replacement takes place (this is the default).
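The behaviour described for set_strip can be pictured with a small self-contained sketch. It does not use ubase.js: the table BASE and the function basifySketch are hypothetical, cover only a handful of characters, and leave out the handling of malformed utf8.

```javascript
// Illustrative only: a tiny hand-written table, NOT the ubase.js data.
const BASE = new Map([
  ['\u00e9', 'e'], // é
  ['\u00f8', 'o'], // ø
  ['\u01f9', 'n'], // ǹ
  ['\u0135', 'j'], // ĵ
  ['\u00f6', 'o'], // ö
  ['\u016f', 'u'], // ů
  ['\u024d', 'r'], // ɍ
]);

// Hypothetical stand-in for basify with a strip setting:
// strip === undefined keeps unknown non-ASCII characters as they are,
// a string (possibly empty) replaces each of them.
function basifySketch(input, strip) {
  let out = '';
  for (const ch of input.normalize('NFC')) {       // iterate by code point
    if (ch.codePointAt(0) < 0x80) out += ch;       // ASCII is kept
    else if (BASE.has(ch)) out += BASE.get(ch);    // known base letter
    else out += strip === undefined ? ch : strip;  // strip setting applies
  }
  return out;
}

const sample = 'caf\u00e9 \u2192 th\u00e9';  // 'café → thé'
console.log(basifySketch(sample));           // 'cafe → the'
console.log(basifySketch(sample, ''));       // 'cafe  the'
console.log(basifySketch(sample, '?'));      // 'cafe ? the'
```

With the empty string the arrow simply disappears, which is where the name "strip" comes from.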
If both malformed and strip contain only ASCII characters, then the result of basify is guaranteed to contain only ASCII characters.
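A result can be checked against this guarantee with a one-line test. The helper isAscii below is a hypothetical name, not part of ubase.js; it only uses a standard regular expression.

```javascript
// Hypothetical helper (not part of ubase.js):
// true if every character of s is in the ASCII range 0x00-0x7F.
function isAscii(s) {
  return /^[\x00-\x7F]*$/.test(s);
}

console.log(isAscii('Bonjour'));        // true
console.log(isAscii('B\u00f8njour'));   // false, 'ø' is not ASCII
```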
Other helper functions:
isolatin_to_utf8 ( s ): convert the isolatin-encoded string s to utf8.
cp1252_to_utf8 ( s ): convert the cp1252-encoded (aka Windows encoding) string s to utf8.
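As an illustration of what such a conversion involves, the sketch below uses Node's built-in Buffer rather than the ubase.js helpers. It assumes the isolatin (ISO-8859-1) input is available as raw bytes; how ubase.js itself expects its input to be represented is not covered here, and the function name latin1BytesToUtf8Bytes is hypothetical.

```javascript
// Sketch using Node's built-in Buffer, not the ubase.js helpers.
// In ISO-8859-1 each byte value equals the Unicode code point it encodes.
function latin1BytesToUtf8Bytes(bytes) {
  const text = Buffer.from(bytes).toString('latin1'); // decode ISO-8859-1
  return Buffer.from(text, 'utf8');                   // re-encode as UTF-8
}

// "café" in ISO-8859-1: the é is the single byte 0xE9.
const utf8 = latin1BytesToUtf8Bytes([0x63, 0x61, 0x66, 0xe9]);
console.log(utf8);                  // <Buffer 63 61 66 c3 a9>
console.log(utf8.toString('utf8')); // café
```

cp1252 differs from ISO-8859-1 in the 0x80 to 0x9F byte range, so this sketch would not be a correct cp1252 decoder for those bytes.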
ubase.js can also be loaded directly in a web page:
<!DOCTYPE html>
<html>
<body>
<script src="./ubase.js"></script>
<h1>Ubase</h1>
<p>
<script>
document.write(basify('Ŧħïŝ ịṣ Ĝóôđ!'));
</script>
</p>
</body>
</html>
The standalone executable version of ubase is ubasex.js. You can test it with node:
$ node ubasex.js Bøǹĵöůɍ
Bonjour
This library is automatically generated from the OCaml ubase version using js-of-ocaml.
ubase.js covers more than 2000 utf8 chars, so it should be quite complete. File an issue if some character is not properly basified.