Using RapidFuzz with 2 Dataframes #347

jsimo22 · 2023-09-08T14:46:14Z


I am looking for example code that uses RapidFuzz with two distinct dataframes. Has anyone tried to use it in this case?

abubelinha · 2024-03-11T12:06:49Z


Me too.
I found this one, but I'd be happy to have some explained examples available in the repository documentation:
https://mlexplained.blog/2023/08/02/fuzzy-match-dataframes-using-rapidfuzz-and-pandas/


maxbachmann · 2024-03-11T16:42:42Z

Maintainer

This is largely not part of the documentation because:

- writing documentation takes time
- I am not really great at writing docs
- I do not know a lot of the use cases where tutorial-like documentation would help

They would be a welcome addition though. So if anyone is willing to write tutorial-like explanations (or really any improvements to the documentation), this would be amazing.


abubelinha · 2024-03-11T17:45:34Z


Thanks.
If I get a reproducible working example I'd be happy to share it.
First I need a basic understanding of how RapidFuzz works in relation to my problem.
See related questions in issue #366


abubelinha · 2024-03-11T19:05:59Z


Coming from #366 with my example data, which is closely related to this discussion.

I want to match the fuzzy names column of one dataframe against another dataframe of name choices that comes from a curated scientific online database. That way I can extract the closest name in the database plus its associated values in other columns, and merge that curated info into my original fuzzy dataframe.

I'd better show a small example of source data and expected output:

import pandas as pd

dfchoices = pd.DataFrame(data = [ # ACTUALLY, about 170K rows and 20 columns (some of them, needed to merge into dfuzdata)
		[1001,'Soliva sessilis auct., non Ruiz & Pav.',1002,'Asteraceae'],
		[1002,'Soliva sessilis Ruiz & Pav.',1002,'Asteraceae'],
		[1004,'Soliva pterosperma (Juss.) Samp.',1004,'Asteraceae'],
		[1005,'Soliva sp.',1006,'Asteraceae'],
		[1006,'Soliva',1006,'Asteraceae'],
		[1007,'Solanum L.',1007,'Solanaceae'],
		[1009,'Solanum tuberosum L.',1009,'Solanaceae'],
	],
	columns=['taxonKey','scientificName','acceptedTaxonKey','family']
)
dfuzdata = pd.DataFrame(data = [ # USUALLY, between 100 and 1000 rows, with about 30 columns to keep
		['Soliva sessilis','USA'],
		['Saliva sesilis','Brazil'],
		['Solanun tuberosun','Perú-Bolivia'],
	],
	columns = ['fuzname','origin']
)

For each row in dfuzdata, I want to get the best-matching row in dfchoices (the one whose scientificName is most similar to fuzname) and merge some of the other dfchoices columns into dfuzdata. The result should be a dataframe dfoutput with the same number of rows as dfuzdata, plus the columns extracted from dfchoices.

For the data above, something like this would be the expected data output:

dfoutput = pd.DataFrame(data = [ 
		['Soliva sessilis','USA',1002,'Soliva sessilis Ruiz & Pav.',1002,'Asteraceae'],
		['Saliva sesilis','Brazil',1002,'Soliva sessilis Ruiz & Pav.',1002,'Asteraceae'],
		['Solanun tuberosun','Perú-Bolivia',1009,'Solanum tuberosum L.',1009,'Solanaceae'],
	],
	columns = ['fuzname','origin','taxonKey','scientificName','acceptedTaxonKey','family']
)

I guess this is a pretty common use case but I am not sure what is the best RapidFuzz approach here in terms of speed and accuracy:
using extractOne() or perhaps extract(limit=1)?

I can provide big sample datasets in order to test it with real data.
I'd also be happy to share my code if I get it working and you find it useful to document as a common usage example.

Thanks
@abubelinha


maxbachmann · 2024-03-12T13:56:04Z

Maintainer

> I guess this is a pretty common use case but I am not sure what is the best RapidFuzz approach here in terms of speed and accuracy:
> using extractOne() or perhaps extract(limit=1)?

I would prefer extractOne, since it makes the intent clearer. However, the two should perform pretty much the same, since extract will internally call extractOne if the limit is 1.

Generally you have two options right now, and both of them have their upsides and downsides:

using extractOne

for fuzzElem in fuzzData:
   extractOne(fuzzElem, choices, ...)

Upsides:
- requires essentially no extra memory
- can use the best score found so far as the score_cutoff for further matching

Downsides:
- doesn't directly support multithreading, however you should be able to parallelize it yourself using something like the multiprocessing module
- doesn't support SIMD so far, which could help to match multiple short sequences in parallel

using cdist

cdist(fuzzData, choices, ...)

Upsides:
- directly supports multithreading using the workers argument
- can make use of SIMD to compare multiple short sequences at the same time

Downsides:
- requires len(queries) * len(choices) * sizeof(dtype) memory, so depending on the dataset size this could require matching in chunks. For your case this would be around 0.6 GB of memory if you match them in one batch
- can't make use of the previous best score, although it can still make use of score_cutoff
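The memory estimate above can be checked with a few lines of arithmetic. The sketch assumes a 4-byte score type such as float32 and the dataset sizes mentioned earlier in the thread (up to about 1000 queries against about 170K choices); the helper names and the 100 MiB budget are hypothetical.

```python
def cdist_matrix_bytes(n_queries, n_choices, itemsize=4):
    """Size of the full score matrix: len(queries) * len(choices) * sizeof(dtype)."""
    return n_queries * n_choices * itemsize


def rows_per_chunk(n_choices, budget_bytes, itemsize=4):
    """How many query rows fit in a given memory budget (at least one)."""
    return max(1, budget_bytes // (n_choices * itemsize))


n_bytes = cdist_matrix_bytes(1_000, 170_000)
print(f"full matrix: {n_bytes / 1024**3:.2f} GiB")

chunk = rows_per_chunk(170_000, budget_bytes=100 * 1024**2)
print(f"queries per chunk for a 100 MiB budget: {chunk}")
```

With a budget like this, the queries could be passed to cdist in slices of `chunk` rows instead of all at once.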

The optimal approach would be some kind of many x many match which returns only the best entry. I still want to add this to the library at some point (see #188), but I haven't gotten around to it so far.


abubelinha · 2024-03-12T19:51:20Z


I think I'll go with extractOne, since that is how I was already planning to do it.
(plus I don't know a word of SIMD or multithreading, so why bother with that ... LOL)

One thing that makes me consider using extract instead is that, for some of my data, the best match might not be the most similar one.
In most of those cases, "my preferred match" will probably be among the top 3 matches (so it could appear when using extract with a higher limit).

But this is another problem (not related to dataframes), so I'd better open a new discussion about it (#367).

Will come back here when I get a reproducible 2-dataframe fuzzy matching script, since you will probably have ideas to improve my awful coding.
Thanks a lot for your help!


maxbachmann · 2024-03-13T14:00:38Z

Maintainer

> (plus I don't know a word of SIMD or multithreading, so why bother with that ... LOL)

That is precisely the advantage of cdist: it automatically takes advantage of SIMD instructions behind the scenes, and multithreading can be as easy as telling it to run with a certain number of threads using something like workers=2, or even to run on all cores using workers=-1.

I will have a look over your code and give it a try with cdist then to see whether it's any faster in this case :)
I do think this would make for a reasonable tutorial.

