Replies: 7 comments
-
Me too.
-
This is largely not part of the documentation because:
They would be a welcome addition though. So if anyone is willing to write tutorial-like explanations (or really any improvements to the documentation), that would be amazing.
-
Thanks.
-
Coming from #366 with my example data, which is closely related to this discussion. I want to match the fuzzy-names column of one dataframe against another dataframe of name choices that comes from a scientifically curated online database, so that I can extract the closest name in the DB plus its associated values in other columns, and merge that curated info into my original fuzzy dataframe. Here is a small example of the source data and the expected output:

import pandas as pd

dfchoices = pd.DataFrame(data = [ # actually about 170K rows and 20 columns (some of them needed to merge into dfuzdata)
[1001,'Soliva sessilis auct., non Ruiz & Pav.',1002,'Asteraceae'],
[1002,'Soliva sessilis Ruiz & Pav.',1002,'Asteraceae'],
[1004,'Soliva pterosperma (Juss.) Samp.',1004,'Asteraceae'],
[1005,'Soliva sp.',1006,'Asteraceae'],
[1006,'Soliva',1006,'Asteraceae'],
[1007,'Solanum L.',1007,'Solanaceae'],
[1009,'Solanum tuberosum L.',1009,'Solanaceae'],
],
columns=['taxonKey','scientificName','acceptedTaxonKey','family']
)
dfuzdata = pd.DataFrame(data = [ # USUALLY, between 100 and 1000 rows, with about 30 columns to keep
['Soliva sessilis','USA'],
['Saliva sesilis','Brazil'],
['Solanun tuberosun','Perú-Bolivia'],
],
columns = ['fuzname','origin']
)

For each row in

For the data above, something like this would be the expected output:

dfoutput = pd.DataFrame(data = [
['Soliva sessilis','USA',1002,'Soliva sessilis Ruiz & Pav.',1002,'Asteraceae'],
['Saliva sesilis','Brazil',1002,'Soliva sessilis Ruiz & Pav.',1002,'Asteraceae'],
['Solanun tuberosun','Perú-Bolivia',1009,'Solanum tuberosum L.',1009,'Solanaceae'],
],
columns = ['fuzname','origin','taxonKey','scientificName','acceptedTaxonKey','family']
)

I guess this is a pretty common use case, but I am not sure what the best RapidFuzz approach is here in terms of speed and accuracy:

I can provide big sample datasets in order to test it with real data. Thanks
-
I would prefer
Generally, you have two options right now, and both of them have their upsides and downsides:
for fuzzElem in fuzzData:
    extractOne(fuzzElem, choices, ...)
The optimal approach would be some kind of many x many match which returns only the best entry. I still want to add this to the library at some point (see #188), but I haven't gotten around to it so far.
-
I think I'll go with
One thing that makes me doubt about using
But this is another problem (not related to dataframes), so I'd better open a new discussion about it (#367). I will come back here when I get a reproducible 2-dataframe fuzzy matching script, since you will probably have ideas to improve my awful coding.
-
That is precisely the advantage of
I will have a look over your code and give it a try with cdist then to see whether it's any faster in this case :)
-
I am looking for example code using RapidFuzz with two distinct data frames. Has anyone tried to use it in this case?