Download of wrong species list #705
I need to download data for more than 2500 species. I use a script similar to this one:
https://docs.ropensci.org/rgbif/articles/downloading_a_long_species_list.html
However, the query made with my species list downloads far too many species compared to what I requested, and moreover, the data I got for many species were missing in many areas.
I also tried splitting the species list into shorter lists (50 lists of 50 species each); this time the occurrences download better, but in some cases I still get data for species which I did not put in the list.
Could you please help with that?
@francescorota93 Could you upload here the list of species which you are using? It is likely that synonyms are being included, or the names you are using are matching to a different taxon than you expect.
Hi, thanks for your quick response. Here is my species list (the zip file should contain an rds list). This is my script:

```r
# match the names
gbif_taxon_keys <- species_list %>% ...

# check the names
check_names <- list()
```

I tried several ways: one big request with the full list downloaded species which I did not request, so I also tried splitting the list into chunks and made a specific request per chunk. Here is the full-list query:

```r
# check how many species are kept
length(gbif_taxon_keys) ## 2519 species

# make unique
gbif_taxon_keys1 <- unique(gbif_taxon_keys) ## 2492 species

# remove 7707728
match(gbif_taxon_keys1, 7707728) ## this was downloading all Tracheophyta
gbif_taxon_keys1 <- gbif_taxon_keys1[-780]

occ_data <- occ_download(...)
occ_data
```

This is the DOI of one of the last downloads: https://doi.org/10.15468/dl.t8am54

I guess there are some issues, probably both with the species names and with the polygon. Let me know, thanks :)

Best,
Francesco
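The per-chunk requests looked roughly like this (a sketch, not the exact code; `occ_download_queue()` stays within GBIF's limit on simultaneous download requests):

```r
library(rgbif)

# split the keys into chunks of 50 and queue one download request per chunk
chunks <- split(gbif_taxon_keys1, ceiling(seq_along(gbif_taxon_keys1) / 50))

queued <- occ_download_queue(
  .list = lapply(chunks, function(keys) {
    occ_download_prep(
      pred_in("taxonKey", keys),
      pred_within("POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"),
      format = "SIMPLE_CSV"
    )
  })
)
```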
I think your main problem is that some of your names are matching with type "HIGHERRANK", which basically means the names aren't present in the GBIF backbone at species level. I would remove any names that match to HIGHERRANK, and this will probably reduce your download size a lot.
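Something along these lines should do it (a minimal sketch, assuming your names are in a character vector called `species_names`):

```r
library(rgbif)
library(dplyr)

# match the names against the GBIF backbone
matched <- name_backbone_checklist(species_names)

# keep only names that matched below HIGHERRANK and got a usage key
gbif_taxon_keys <- matched %>%
  filter(matchType != "HIGHERRANK", !is.na(usageKey)) %>%
  pull(usageKey)
```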
Ok thanks, I will try to remove the HIGHERRANK matches and keep only the accepted names. I will let you know.
I used this now:

```r
gbif_taxon_keys1 <- unique(gbif_taxon_keys) ## 2161 species

occ_data <- occ_download(...)

d <- occ_download_get('0007908-240216155721649') %>% ...
```

The number of species is fine now; however, for many of them the occurrences are biased (check e.g. Abies alba or Picea abies, which come back with fewer occurrences than I would expect). Should I also keep SYNONYM in status and FUZZY in matchType?
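In other words, should the filter look something like this (a sketch, assuming `matched` is the table returned by the name matching step)?

```r
# keep exact and fuzzy matches, accepted names and synonyms,
# while still dropping HIGHERRANK matches
gbif_taxon_keys <- matched %>%
  dplyr::filter(matchType %in% c("EXACT", "FUZZY"),
                status %in% c("ACCEPTED", "SYNONYM")) %>%
  dplyr::pull(usageKey)
```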
There might be some issues with your polygon being the wrong way still.

[screenshot: polygon from your code]
[screenshot: polygon from download]
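For context, the GBIF API expects WKT polygons with anticlockwise point ordering; a clockwise ring represents the opposite area, i.e. the whole world except the polygon. A minimal sketch of the difference, using the bounding box from the queries in this thread:

```r
library(rgbif)

# anticlockwise ring: selects the area inside the box
ccw <- "POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"

# clockwise ring: same corners, opposite order, selects everything outside
cw <- "POLYGON((-31.3 32.4,-31.3 81.8,69.1 81.8,69.1 32.4,-31.3 32.4))"

pred_within(ccw)
```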
Thanks, I fixed the polygon and made another query:

```r
occ_data <- occ_download(...)
occ_data

d <- occ_download_get('0008830-240216155721649') %>% ...
```

However, the number of occurrences is still lower than expected:

```r
down_occ <- table(as.factor(d$species))
##   Acer platanoides   Acer pseudoplatanus   ...
```

I will try again with the proper request and polygon, but splitting the list into chunks with fewer species.
Splitting into chunks of fewer species works fine now, at least for the first chunks:

```r
##   Achnatherum calamagrostis   Aconitum lycoctonum   Aconitum napellus   ...
```

I will let you know when the process is finished whether it worked for all the species.
I made this download with just Abies alba, and it seems fine. I honestly don't think splitting into chunks is going to make a difference. https://www.gbif.org/occurrence/download/0010747-240216155721649
I don't know either why it downloads fewer occurrences than expected with a long list. See my comments above: with the full list Abies alba got 3052 occurrences, while when I used chunks it got 201352 occurrences.
https://www.gbif.org/occurrence/download/0010766-240216155721649

And it appears Abies alba gets the right count there.
The download you were checking comes out like this for me: it downloads data for 2313 species:

```r
length(table(as.factor(d$species)))
```

but for Abies alba I have only 4346 observations:

```r
table(as.factor(d$species))["Abies alba"]
```
This is what I did on our GBIF servers. There could be some problem with importing the really large file.

```scala
val df = spark.read.
  option("header", "true").
  option("delimiter", "\t").
  csv("0010766-240216155721649.csv")

df.count()
df.printSchema()
df.filter($"taxonKey" === "2685484").count()
```

Then there could be something deeper going on. Maybe occ_download_import isn't importing the entire download. So that is why your chunks method is working.
Just a thought, but since you are working with a really large download, you might want to try another file format. It isn't well documented, but you can also download parquet files directly: https://data-blog.gbif.org/post/apache-arrow-and-parquet/

```r
occ_download(
  pred_in("taxonKey", long_list_of_keys),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  pred_within("POLYGON((-31.3 32.4,69.1 32.4,69.1 81.8,-31.3 81.8,-31.3 32.4))"),
  pred_not(pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))),
  format = "SIMPLE_PARQUET"
)
```

https://api.gbif.org/v1/occurrence/download/request/0013314-240216155721649.zip
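Following that blog post, the parquet download can then be read with the arrow package. A sketch, assuming the zip has been extracted so that the `occurrence.parquet` folder is in the working directory:

```r
library(arrow)
library(dplyr)

# open the parquet files as an arrow dataset without loading them into memory
ds <- open_dataset("occurrence.parquet")

# count records per species, collecting only the summary into R
ds %>%
  count(species) %>%
  collect()
```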
Ok great, thanks for your response and help. I will also check this parquet format.