Download of wrong species list #705
I need to download data for more than 2500 species. I use a script similar to this one:
https://docs.ropensci.org/rgbif/articles/downloading_a_long_species_list.html
However, the query made with my species list downloads far too many species compared to what I requested, and moreover, the data I got for many species were missing in many areas.
I also tried splitting the species list into shorter lists (50 lists of 50 species each); this time the occurrences download better, but in some cases I still get data for species which I did not put in the list.
Could you please help with that?
@francescorota93 Could you upload here the list of species which you are using? It is likely that synonyms are being included, or the names you are using are matching to a different taxon than you expect.
Hi, thanks for your quick response. Here is my species list (the zip file should contain an rds list). This is my script:

```r
# match the names
gbif_taxon_keys <- species_list %>% ...

# check the names
check_names <- list()
```

I tried several ways: one big request with the full list downloaded species which I did not request, so I also tried splitting the list into chunks and made a specific request per chunk. Here is the full-list query:

```r
# check how many species are kept
length(gbif_taxon_keys) ## 2519 species

# make unique
gbif_taxon_keys1 <- unique(gbif_taxon_keys) ## 2492 species

# remove 7707728
match(gbif_taxon_keys1, 7707728) ## this was downloading all Tracheophyta
gbif_taxon_keys1 <- gbif_taxon_keys1[-780]

occ_data <- occ_download(...)
occ_data
```

This is the DOI of one of the last downloads: https://doi.org/10.15468/dl.t8am54

I guess there are some issues, probably both with the species names and with the polygon. Let me know, thanks :)

Best,
Francesco
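The per-chunk requests looked roughly like this (a sketch, not the exact code; `occ_download_queue()` stays within GBIF's limit on simultaneous download requests):

```r
library(rgbif)

# split the keys into chunks of 50 and queue one download request per chunk
chunks <- split(gbif_taxon_keys1, ceiling(seq_along(gbif_taxon_keys1) / 50))

queued <- occ_download_queue(
  .list = lapply(chunks, function(keys) {
    occ_download_prep(
      pred_in("taxonKey", keys),
      pred_within("POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"),
      format = "SIMPLE_CSV"
    )
  })
)
```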
I think your main problem is that some of your names are matching with type "HIGHERRANK", which basically means the names aren't present in the GBIF backbone at species level. I would remove any names that match to HIGHERRANK, and this will probably reduce your download size a lot.
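Something along these lines should do it (a minimal sketch, assuming your names are in a character vector called `species_names`):

```r
library(rgbif)
library(dplyr)

# match the names against the GBIF backbone
matched <- name_backbone_checklist(species_names)

# keep only names that matched below HIGHERRANK and got a usage key
gbif_taxon_keys <- matched %>%
  filter(matchType != "HIGHERRANK", !is.na(usageKey)) %>%
  pull(usageKey)
```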
Ok thanks, I will try to remove the HIGHERRANK matches and keep only the accepted names. I will let you know.
I used this now:

```r
gbif_taxon_keys1 <- unique(gbif_taxon_keys) ## 2161 species

occ_data <- occ_download(...)

d <- occ_download_get('0007908-240216155721649') %>% ...
```

The number of species is fine now; however, for many of them the occurrences are biased (check e.g. Abies alba or Picea abies, which come back with fewer occurrences than I would expect). Should I also keep SYNONYM in status and FUZZY in matchType?
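In other words, should the filter look something like this (a sketch, assuming `matched` is the table returned by the name matching step)?

```r
# keep exact and fuzzy matches, accepted names and synonyms,
# while still dropping HIGHERRANK matches
gbif_taxon_keys <- matched %>%
  dplyr::filter(matchType %in% c("EXACT", "FUZZY"),
                status %in% c("ACCEPTED", "SYNONYM")) %>%
  dplyr::pull(usageKey)
```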
There might be some issues with your polygon being the wrong way still.

[screenshot: polygon from your code]
[screenshot: polygon from download]
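For context, the GBIF API expects WKT polygons with anticlockwise point ordering; a clockwise ring represents the opposite area, i.e. the whole world except the polygon. A minimal sketch of the difference, using the bounding box from the queries in this thread:

```r
library(rgbif)

# anticlockwise ring: selects the area inside the box
ccw <- "POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"

# clockwise ring: same corners, opposite order, selects everything outside
cw <- "POLYGON((-31.3 32.4,-31.3 81.8,69.1 81.8,69.1 32.4,-31.3 32.4))"

pred_within(ccw)
```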
Thanks, I fixed the polygon and made another query:

```r
occ_data <- occ_download(...)
occ_data

d <- occ_download_get('0008830-240216155721649') %>% ...
```

However, the number of occurrences is still lower than expected:

```r
down_occ <- table(as.factor(d$species))
##   Acer platanoides   Acer pseudoplatanus   ...
```

I will try again with the proper request and polygon, but splitting the list into chunks with fewer species.
Splitting into chunks of fewer species works fine now, at least for the first chunks:

```r
##   Achnatherum calamagrostis   Aconitum lycoctonum   Aconitum napellus   ...
```

I will let you know when the process is finished whether it worked for all the species.
I made this download with just Abies alba, and it seems fine. I honestly don't think splitting into chunks is going to make a difference. https://www.gbif.org/occurrence/download/0010747-240216155721649
I don't know either why it downloads fewer occurrences than expected with a long list. See my comments above: with the full list Abies alba got 3052 occurrences, while when I used chunks it got 201352 occurrences.
https://www.gbif.org/occurrence/download/0010766-240216155721649

And it appears Abies alba gets the right count there.
The download you were checking comes out like this for me: it downloads data for 2313 species:

```r
length(table(as.factor(d$species)))
```

but for Abies alba I have only 4346 observations:

```r
table(as.factor(d$species))["Abies alba"]
```
This is what I did on our GBIF servers. There could be some problem with importing the really large file.

```scala
val df = spark.read.
  option("header", "true").
  option("delimiter", "\t").
  csv("0010766-240216155721649.csv")

df.count()
df.printSchema()
df.filter($"taxonKey" === "2685484").count()
```

Then there could be something deeper going on. Maybe occ_download_import isn't importing the entire download. So that is why your chunks method is working.
Just a thought, but since you are working with a really large download, you might want to try another file format. It isn't well documented, but you can also download parquet files directly: https://data-blog.gbif.org/post/apache-arrow-and-parquet/

```r
occ_download(
  pred_in("taxonKey", long_list_of_keys),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred_within("POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"),
  pred_not(pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))),
  format = "SIMPLE_PARQUET"
)
```

https://api.gbif.org/v1/occurrence/download/request/0013314-240216155721649.zip
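Following that blog post, the parquet download can then be read with the arrow package. A sketch, assuming the zip has been extracted so that the `occurrence.parquet` folder is in the working directory:

```r
library(arrow)
library(dplyr)

# open the parquet files as an arrow dataset without loading them into memory
ds <- open_dataset("occurrence.parquet")

# count records per species, collecting only the summary into R
ds %>%
  count(species) %>%
  collect()
```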
Ok great, thanks for your response and help. I will also check this parquet format.