Probability normalization #89

Open
fabiospampinato opened this issue Sep 17, 2020 · 3 comments

@fabiospampinato

Currently franc often returns a probability close to 1 for many languages. IMO all these probabilities should be normalized to add up to 1.

Also, there always seems to be a language at the top with probability 1. This makes it difficult to judge how sure the "model" is about the detection, which would be another interesting data point to have.

@wooorm
Owner

wooorm commented Sep 18, 2020

> IMO all these probabilities should be normalized to add up to 1.

That difference is exactly what you can use to check how sure franc is. For example, if you’re checking whether documents are probably in English, you could see if the score for it is > 0.85 or so.
Or, to check if franc is sure about one language, do topLanguage > secondLanguage + 0.1 or so.
And you can round them yourself if you want too?
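
A minimal sketch of those two checks, assuming the results are an array of `[code, score]` pairs sorted by descending score. The helper name `checkConfidence`, its option names, and the sample scores are hypothetical and only for illustration; the `0.85` and `0.1` defaults are the "or so" values mentioned above, not tuned constants.

```javascript
// Hypothetical helper, not part of franc.
// `results` is assumed to be [[code, score], ...] sorted by descending score.
function checkConfidence(results, { minScore = 0.85, margin = 0.1 } = {}) {
  if (results.length === 0) {
    return { language: undefined, aboveMinScore: false, clearWinner: false };
  }
  const [topCode, topScore] = results[0];
  // With a single candidate there is no runner-up to compare against.
  const secondScore = results.length > 1 ? results[1][1] : 0;
  return {
    language: topCode,
    aboveMinScore: topScore > minScore,
    clearWinner: topScore > secondScore + margin,
  };
}

// Illustrative scores only (not real franc output).
const clearCase = checkConfidence([['eng', 1], ['deu', 0.7]]);
const closeCase = checkConfidence([['eng', 1], ['sco', 0.95]]);
console.log(clearCase.clearWinner, closeCase.clearWinner);
```

In this sketch the first case counts as a clear winner and the second does not, because the runner-up is within the margin.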

> this makes it difficult to judge how sure the "model" is about the detection

The raw score depends on how long the passed-in value is, so it’s not very interesting.
I feel sureness can be better detected by comparing the normalized scores to each other?

@fabiospampinato
Author

> For example, if you’re checking whether documents are probably in English, you could see if the score for it is > 0.85 or so.

That doesn't really work though. For example, for this 2048-character text:

Sample text
# 00 - A War and Peace

BOOK ONE: 1805

CHAPTER I

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna. With these words she greeted Prince Vasíli Kurágin, a man of high rank and importance, who was the first to arrive at her reception. Anna Pávlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.

All her invitations without exception, written in French, and delivered by a scarlet-liveried footman that morning, ran as follows:

“If you have nothing better to do, Count (or Prince), and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10—Annette Schérer.”

“Heavens! what a virulent attack!” replied the prince, not in the least disconcerted by this reception. He had just entered, wearing an embroidered court uniform, knee breeches, and shoes, and had stars on his breast and a serene expression on his flat face. He spoke in that refined French in which our grandfathers not only spoke but thought, and with the gentle, patronizing intonation natural to a man of importance who had grown old in society and at court. He went up to Anna Pávlovna, kissed her hand, presenting to her his bald, scented, and shining head, and complacently seated himself on the sofa.

“First of all, dear friend, tell me how you are. Set your friend’s mind at rest,” said he without altering his tone, beneath the pol

I get these languages with probability > 0.85:

Detected probabilities
[
  ["eng", 1],
  ["sco", 0.9884040272815849],
  ["dan", 0.9701409548554726],
  ["nld", 0.9696602793114648],
  ["nob", 0.9695199740175382],
  ["spa", 0.9692211757063982],
  ["fra", 0.9688704124715817],
  ["nds", 0.9686521597921403],
  ["deu", 0.968223449171809],
  ["ita", 0.9680571614160441],
  ["src", 0.9679506333225073],
  ["afr", 0.9664306593049692],
  ["glg", 0.9657914907437479],
  ["nno", 0.9654121468009094],
  ["cat", 0.9638246183825917],
  ["por", 0.9636609288730107],
  ["swe", 0.9631308866515103],
  ["tiv", 0.9577031503734978],
  ["vec", 0.9570016239038649],
  ["ron", 0.9563208834037025],
  ["hat", 0.9549568041571939],
  ["rmn", 0.9538317635595973],
  ["epo", 0.952776875608964],
  ["snk", 0.9511737577135434],
  ["pam", 0.9503371224423514],
  ["bcl", 0.9500850925625203],
  ["fuf", 0.9482741149723937],
  ["tpi", 0.9480168886001948],
  ["tzm", 0.9479233517375771],
  ["bin", 0.9472504059759662],
  ["sot", 0.9470373497888925],
  ["jav", 0.946686586554076],
  ["rmy", 0.9465254952906788],
  ["bum", 0.9464085742124067],
  ["plt", 0.9461513478402078],
  ["est", 0.9459902565768107],
  ["fin", 0.9441221175706398],
  ["bug", 0.9431139980513153],
  ["nso", 0.9430958103280286],
  ["tur", 0.9429918804806755],
  ["fuv", 0.9428983436180578],
  ["mad", 0.9428723611562195],
  ["tgl", 0.9426229295225723],
  ["hau", 0.9425293926599545],
  ["ind", 0.9423267294576161],
  ["war", 0.9421864241636895],
  ["mos", 0.9421500487171159],
  ["bci", 0.9420227346541085],
  ["zlm", 0.9419369925300423],
  ["tsn", 0.9418486521597922],
  ["ckb", 0.9416018187723287],
  ["slv", 0.9411237414745047],
  ["ban", 0.9409262747645339],
  ["ilo", 0.9407495940240338],
  ["ceb", 0.9405469308216954],
  ["nya", 0.9404871711594673],
  ["kde", 0.9404507957128938],
  ["swh", 0.9403468658655407],
  ["hms", 0.9399883078921728],
  ["hil", 0.9395595972718415],
  ["sun", 0.9394426761935694],
  ["ndo", 0.9392763884378045],
  ["hun", 0.9391802533290029],
  ["als", 0.9390165638194219],
  ["vmw", 0.9385852549529068],
  ["ace", 0.9383696005196492],
  ["wol", 0.9378889249756415],
  ["ssw", 0.9377875933744723],
  ["ibb", 0.9376291003572589],
  ["emk", 0.9375017862942514],
  ["quz", 0.9373770704774278],
  ["slk", 0.9373069178304645],
  ["lun", 0.9371406300746996],
  ["hnj", 0.9370341019811628],
  ["lin", 0.9370081195193245],
  ["tso", 0.9365560246833388],
  ["uig", 0.9365456316986034],
  ["qug", 0.9363715492042871],
  ["srp", 0.9361896719714193],
  ["bem", 0.9357843455667425],
  ["quy", 0.935641442026632],
  ["aka", 0.9354257875933745],
  ["lit", 0.9353816174082494],
  ["min", 0.9351867489444625],
  ["som", 0.9351321857746021],
  ["ven", 0.9350074699577785],
  ["run", 0.9349217278337123],
  ["dip", 0.9348723611562195],
  ["yao", 0.9347996102630725],
  ["ces", 0.9345683663527119],
  ["hrv", 0.9342435855797336],
  ["bos", 0.9342176031178955],
  ["lua", 0.9341604417018512],
  ["uzn", 0.9341110750243585],
  ["nyn", 0.934014939915557],
  ["suk", 0.9339240012991231],
  ["cjk", 0.9338018837284833],
  ["zul", 0.933747320558623],
  ["ada", 0.9336589801883728],
  ["bam", 0.9336459889574538],
  ["xho", 0.9335758363104905],
  ["ewe", 0.9334952906787918],
  ["gax", 0.9334874959402404],
  ["kin", 0.9331860993829165],
  ["sna", 0.933014615134784],
  ["lav", 0.9329730431958428],
  ["tuk", 0.9328717115946736],
  ["kng", 0.9327703799935044],
  ["ibo", 0.9322143553101656],
  ["knc", 0.9321935693406951],
  ["nhn", 0.9317700552127314],
  ["men", 0.9316167586878856],
  ["snn", 0.9314530691783046],
  ["ayr", 0.9296628775576485],
  ["umb", 0.9292029879831114],
  ["lug", 0.9291380318285157],
  ["kmb", 0.9287586878856772],
  ["toi", 0.9286391685612212],
  ["pol", 0.9275816823644041],
  ["fon", 0.9258824293601818],
  ["gaa", 0.9254095485547256],
  ["zyb", 0.9240324780772978],
  ["tem", 0.9233621305618708],
  ["vie", 0.9229827866190321],
  ["azj", 0.9229697953881131],
  ["sag", 0.9223955829814875],
  ["dyu", 0.9208184475479052],
  ["kbp", 0.9189503085417343],
  ["yor", 0.9112257226372199]
] 

I mean, if 129 out of 180 supported languages are considered probable, that's not very useful.

> Or, to check if franc is sure about one language, do topLanguage > secondLanguage + 0.1 or so.

Is 0.1 a big enough threshold though? It's a bit hard for me to gauge how big a difference 0.1 makes when 120 languages are above 0.85 anyway 🤔

> And you can round them yourself if you want too?

I could normalize them myself, but then the top language would end up at something like 1%, which doesn't tell me much. The top percentage should be much higher for long, not-very-ambiguous documents like the one I'm feeding it in the example above.
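
To illustrate, here is a minimal sketch of that kind of sum-to-1 normalization, assuming `[code, score]` pairs as input. `normalizeToSum` is a hypothetical helper, not a franc API, and the data is just the first five entries of the list above.

```javascript
// Hypothetical helper: rescale scores so that they add up to 1.
function normalizeToSum(results) {
  const total = results.reduce((sum, [, score]) => sum + score, 0);
  // Avoid dividing by zero when every score is 0.
  if (total === 0) return results.map(([code]) => [code, 0]);
  return results.map(([code, score]) => [code, score / total]);
}

// First five entries from the detected probabilities above.
const topFive = [
  ['eng', 1],
  ['sco', 0.9884040272815849],
  ['dan', 0.9701409548554726],
  ['nld', 0.9696602793114648],
  ['nob', 0.9695199740175382],
];

for (const [code, share] of normalizeToSum(topFive)) {
  console.log(code, share.toFixed(4));
}
```

With only five candidates the top share is already just a little over 20%. Since the raw scores are all so close together, each share ends up near 1/n, so over the full list of 129 the top language lands at roughly 1%.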

> I feel sureness can be better detected by comparing the normalized scores to each other?

Can it? I don't know. Looking at the probabilities, there are like 30 languages within a 0.005 range of each other, so I'm not sure how I'm supposed to gauge the sureness of the model on those languages. Even the difference between English and Spanish is only like 0.031.
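
A small sketch of how those gaps can be computed, again assuming `[code, score]` pairs sorted by descending score. `gapsFromTop` is a hypothetical helper, and the data is a handful of entries copied from the list above.

```javascript
// Hypothetical helper: distance of every candidate from the top score.
function gapsFromTop(results) {
  if (results.length === 0) return [];
  const topScore = results[0][1];
  return results.map(([code, score]) => [code, topScore - score]);
}

// A few entries from the detected probabilities above.
const excerpt = [
  ['eng', 1],
  ['sco', 0.9884040272815849],
  ['dan', 0.9701409548554726],
  ['spa', 0.9692211757063982],
  ['fra', 0.9688704124715817],
];

for (const [code, gap] of gapsFromTop(excerpt)) {
  console.log(code, gap.toFixed(4));
}
```

The Spanish gap comes out at about 0.031, which is the figure quoted above.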

@wooorm
Owner

wooorm commented Feb 22, 2021

I believe whatlang-rs, which is inspired by franc, does some smart things here: https://github.com/greyblake/whatlang-rs#how-does-it-work
