This is probably a pipelines process that will compare the taxon for a given record with a list of all known taxa for Australia.
The known list of Australian taxa should be derived from the ALA Biocache data, using a filter for country:Australia (uses AUS EEC layer).
CSV download:
https://biocache.ala.org.au/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&qualityProfile=AVH&facets=taxonConceptID&count=true&file=AU_all_taxa_tc_counts.csv
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&fq=taxonRankID:[6000 TO 7000]&qualityProfile=AVH&facets=scientificName,taxonConceptID&count=true&file=AU_all_taxa_counts
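The last URL above contains spaces and square brackets in the fq value, so it may need encoding before a client will accept it. A minimal sketch of building the same request in Python follows. The endpoint and parameter names are taken from the URLs above; the helper name facet_download_url is made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint copied from the download URLs listed above.
BASE = "https://biocache-ws.ala.org.au/ws/occurrences/facets/download"


def facet_download_url(q, facets, quality_profile, fq=None, count=True, file=None):
    """Build a percent-encoded facet download URL (hypothetical helper)."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))
    params.append(("qualityProfile", quality_profile))
    params.append(("facets", ",".join(facets)))
    if count:
        params.append(("count", "true"))
    if file:
        params.append(("file", file))
    # urlencode escapes spaces, colons, commas and brackets in the values.
    return BASE + "?" + urlencode(params)


url = facet_download_url(
    q="country:Australia",
    fq="taxonRankID:[6000 TO 7000]",
    quality_profile="AVH",
    facets=["scientificName", "taxonConceptID"],
    file="AU_all_taxa_counts",
)
print(url)
```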
Generating a list of taxa for that query using SOLR or biocache-service is difficult: the result set is huge and the API times out.
One option is to use SOLR with deep pagination using cursors.
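A rough sketch of the cursor approach, in Python for illustration only. It relies on standard Solr cursor behaviour: the sort must include the uniqueKey field, the first request sends cursorMark=*, and paging stops when nextCursorMark comes back unchanged. The uniqueKey name (id), the page size, and the Solr select URL passed to solr_fetch are assumptions, not details from this ticket.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_fetch(solr_select_url, params):
    """Fetch one page from a Solr /select handler (URL supplied by the caller)."""
    with urlopen(solr_select_url + "?" + urlencode(params)) as resp:
        return json.load(resp)


def distinct_taxa(fetch, query="country:Australia", field="taxonConceptID",
                  unique_key="id", rows=1000):
    """Walk the full result set with a cursor and collect distinct field values.

    `fetch` is any callable that takes the request params and returns the
    parsed Solr JSON response, e.g. lambda p: solr_fetch(url, p).
    """
    cursor = "*"  # Solr's starting cursor value
    seen = set()
    while True:
        params = {
            "q": query,
            "fl": field,
            "rows": rows,
            "sort": f"{unique_key} asc",  # cursors require a sort on the uniqueKey
            "cursorMark": cursor,
            "wt": "json",
        }
        page = fetch(params)
        for doc in page["response"]["docs"]:
            value = doc.get(field)
            if value:
                seen.add(value)
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
    return seen
```

This still reads every matching document once, so it avoids the timeout from a single huge request but is not cheap.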
Another is to run the query on Pipelines via Spark and save the result in S3. This seems to be the safest and most reliable option. Use the CSV download (above) to get data into Pipelines. The existing species-list pipeline would be a good starting point in the code. This pipeline accesses the ALA list API to pull down KV data and populate avro files using the taxon as a primary key.
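As a sketch of the first step (getting the CSV download into a key-value form keyed by taxon), something like the following could work. It is illustrative Python, not the species-list pipeline's own code. The CSV layout is an assumption: one header row, then the taxon identifier in the first column and the record count in the second. The sample identifiers are made up.

```python
import csv
import io


def load_taxon_counts(csv_text):
    """Parse a facet CSV into {taxon identifier: record count}.

    Assumes a header row followed by (identifier, count) rows; rows that
    do not fit that shape are skipped rather than raising.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    counts = {}
    for row in reader:
        if len(row) < 2 or not row[0].strip():
            continue
        try:
            counts[row[0].strip()] = int(row[1])
        except ValueError:
            continue  # count column was not an integer
    return counts


# Made-up sample data in the assumed layout.
sample = (
    "taxonConceptID,count\n"
    "urn:example:taxon:1,120\n"
    "urn:example:taxon:2,7\n"
)
taxon_counts = load_taxon_counts(sample)
```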
The data needs a field name, something like presentInCountry:Australia. There might be an existing term for this, so it needs some research.
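The comparison itself could look roughly like this, again as illustrative Python. presentInCountry is the placeholder field name suggested above, not a confirmed term. The record is assumed to be a dict with a taxonConceptID key, and the identifiers in the example are made up.

```python
def annotate_record(record, known_taxa, country="Australia"):
    """Return a copy of the record, flagged if its taxon is in the known list.

    `known_taxa` is a set of taxon identifiers for the country. Records whose
    taxon is missing or not in the set are returned without the field.
    """
    annotated = dict(record)  # do not mutate the caller's record
    taxon_id = record.get("taxonConceptID")
    if taxon_id is not None and taxon_id in known_taxa:
        annotated["presentInCountry"] = country  # placeholder field name
    return annotated


# Made-up identifiers for illustration.
known = {"urn:example:taxon:1", "urn:example:taxon:2"}
hit = annotate_record({"id": "r1", "taxonConceptID": "urn:example:taxon:1"}, known)
miss = annotate_record({"id": "r2", "taxonConceptID": "urn:example:taxon:9"}, known)
```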