Description
Challenge 24 - Using transformer models to develop a search engine for datasets, charts, and documentation
Stream 2 - Machine Learning for Earth Science
Goal
Develop a natural language search engine that improves the discoverability of ECMWF datasets, graphical products, and documentation.
Mentors and skills
- Mentors: Sylvie Lamy-Thepaut, Baudouin Raoult, Helen Setchell, Myranda Uselton Shirk
- Skills required:
- Python
- Machine Learning
- Data Science
- Possible extra: Confluence plugins and macros (Java, Velocity)
Note: Only nationals or residents from the ECMWF Member States and Co-operating States are eligible to participate (see Terms and Conditions).
Challenge description
It is difficult for users to find ECMWF data, whether they use external or internal search. This remains true even though we have added Google structured data to our dataset pages, because the content and metadata available for our datasets are limited.
We have a lot of documentation, and it is editorially inconsistent. The data and charts content has recently been reviewed and rewritten, but it can still be difficult for users to find the documentation they need.
Data/System to use
- Transformer machine learning model in Python
- HuggingFace Python library
- ECMWF chart discovery API (can be extended if needed)
- ECMWF dataset API (in development)
- ECMWF dataset DOIs (to investigate)
- Confluence content API
- A city database
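As a rough illustration of how a city database might feed the search, the sketch below pulls known place names and four-digit years out of a free-text query so they could later be used as filters. The `CITY_DB` contents, the `extract_filters` helper, and the population and coordinate figures are all hypothetical placeholders, not part of any ECMWF API.

```python
import re

# Hypothetical miniature city database: name -> (latitude, longitude, population).
# The figures are rough illustrative values, not authoritative data.
CITY_DB = {
    "oslo": (59.91, 10.75, 700_000),
    "reading": (51.45, -0.97, 170_000),
    "bologna": (44.49, 11.34, 390_000),
}

# Four-digit years from 1800 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def extract_filters(query):
    """Return the known cities and four-digit years mentioned in a query."""
    tokens = re.findall(r"[a-zA-Z]+", query.lower())
    cities = [token for token in tokens if token in CITY_DB]
    years = [int(year) for year in YEAR_PATTERN.findall(query)]
    return {"cities": cities, "years": years}


filters = extract_filters("what data do you have for Oslo rainfall in 1963?")
```

This simple token lookup only handles single-word ASCII names; a real city database would need multi-word and accented names, which is one reason a learned model may be preferable.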
Solution
An ML-based search engine presents users with a simple free-text search box into which they can type natural language search terms and questions. The engine then shows a list of matching results, selected by the ML search system.
An example user search might be "what data do you have for Oslo rainfall in 1963?"
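The ranking step of such a search could be sketched as follows: embed the query and each catalogue entry as vectors, then order entries by cosine similarity. To keep the example self-contained, a toy bag-of-words `embed` function stands in for a transformer sentence embedding; the catalogue entries and identifiers are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts.

    A transformer model would return a dense vector here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, top_k=3):
    """Return the top_k (doc_id, score) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id) for doc_id, text in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]


# Hypothetical catalogue entries, invented for illustration only.
DOCUMENTS = {
    "reanalysis-precip": "historical reanalysis rainfall and precipitation data for European cities",
    "ocean-waves": "ocean wave height forecast charts",
    "user-guide": "documentation on how to download data with the API",
}

results = search("what data do you have for Oslo rainfall in 1963?", DOCUMENTS)
```

The word-count stand-in only matches exact terms, so it would miss an entry that says "precipitation" but not "rainfall". Closing that gap is the motivation for using transformer embeddings in place of `embed`.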
Consideration should be given to users using other languages to search and read results.
It should be possible to weight results by, for example, population or proximity to ECMWF.
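One possible way to apply such weighting is to add small boosts to the relevance score, for example to disambiguate places that share a name. The sketch below assumes ECMWF's Reading site as the reference point (approximate coordinates) and uses hypothetical weights and a hypothetical `weighted_score` helper; the actual scheme would need tuning against real queries.

```python
import math

# Assumed reference point: approximate coordinates of ECMWF's Reading site.
ECMWF_LAT, ECMWF_LON = 51.42, -0.95


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def weighted_score(relevance, population, lat, lon, pop_weight=0.1, dist_weight=0.1):
    """Combine a relevance score with population and proximity boosts.

    The weights and the shape of both boosts are illustrative assumptions.
    """
    # Log scale so that very large cities do not dominate; about 1.0 at 10 million.
    pop_boost = math.log10(max(population, 1)) / 7
    distance = haversine_km(lat, lon, ECMWF_LAT, ECMWF_LON)
    # Decays from 1.0 at the reference point to 0.5 at 1000 km.
    proximity_boost = 1.0 / (1.0 + distance / 1000.0)
    return relevance + pop_weight * pop_boost + dist_weight * proximity_boost


# Two places named "Reading" with the same text relevance (approximate figures).
reading_uk = weighted_score(0.5, 170_000, 51.45, -0.97)
reading_us = weighted_score(0.5, 95_000, 40.34, -75.93)
```

With equal text relevance, the larger and nearer Reading ranks first; setting either weight to zero switches that factor off.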
A possible extra for this project - time permitting - could be to write a Confluence plugin or macro, with parameters for search scope.
Implementation. Possible milestones
- Explore the functionalities of HuggingFace Transformers.
- Explore ECMWF's datasets and charts and find their metadata.
- Set up a test system.
We will devise a reference set of questions, based on popular real user enquiries, to test search results before and after implementation.
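A before-and-after comparison on the reference questions could be scored with mean reciprocal rank (MRR), sketched below. MRR is one common choice rather than a metric named by this proposal, and the reference questions, document identifiers, and both search functions are hypothetical stand-ins.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(search_fn, reference_set):
    """Average the reciprocal rank of search_fn over all reference questions."""
    scores = [reciprocal_rank(search_fn(query), relevant) for query, relevant in reference_set]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical reference questions with the ids considered correct answers.
REFERENCE_SET = [
    ("rainfall data for Oslo", {"reanalysis-precip"}),
    ("how do I download data", {"user-guide"}),
]


def baseline_search(query):
    """Stand-in for the existing search: returns the same order for every query."""
    return ["ocean-waves", "reanalysis-precip", "user-guide"]


def candidate_search(query):
    """Stand-in for the new search: puts the expected answer first."""
    if "download" in query:
        return ["user-guide", "reanalysis-precip", "ocean-waves"]
    return ["reanalysis-precip", "user-guide", "ocean-waves"]


baseline_mrr = mean_reciprocal_rank(baseline_search, REFERENCE_SET)
candidate_mrr = mean_reciprocal_rank(candidate_search, REFERENCE_SET)
```

Running the same reference set through both functions gives a single number per system, which makes regressions easy to spot as the search evolves.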
The search box could be used on the datasets search page, the chart search page, the chart browser search, and the support portal, and possibly as a replacement for the Confluence search features.
Additional comment
We hope to mentor this project in cooperation with Myranda Uselton Shirk at NOAA, who provided the following presentation from AMS that greatly inspired this proposal.