Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store geolocations of TES instances #130

Open
uniqueg opened this issue Feb 6, 2023 · 7 comments
Open

Store geolocations of TES instances #130

uniqueg opened this issue Feb 6, 2023 · 7 comments
Labels
flag: good 1st issue Good for newcomers priority: medium Medium priority type: feature New feature or request workload: hours Likely takes hours to resolve

Comments

@uniqueg
Copy link
Member

uniqueg commented Feb 6, 2023

Is your feature request related to a problem? Please describe.

The distance-based task distribution is quite slow. A major reason is that the geolocations for each TES instance are retrieved from a remote service, one by one, for each individual task, via a costly HTTP call.

Describe the solution you'd like

On the execution of the first task with the distance-based task distribution logic, fetch the geolocations of all TES instances from the remote service and store them in a dedicated database collection, using the host (e.g,. csc-tesk-noauth.rahtiapp.fi) as a key. Then, for any successive task, retrieve the geolocations of all TES instances in the list from the database collection with a single database call. Should there be any TES instances missing in the database (because they were added in the meantime; probably a rare occurence), fetch the missing geolocations from the remote service and add them to the database collection.

In this way, for the vast majority of calls, only input URI geolocations will need to be fetched from the remote service.

Describe alternatives you've considered

In addition to storing the geolocations of all encountered TES instances, it may be worthwhile to check if the geolocations for multiple input files can be fetched with a single request. This may further reduce the cost and limit the number of remote calls to one, even with multiple input files.

@uniqueg uniqueg added flag: good 1st issue Good for newcomers priority: medium Medium priority type: feature New feature or request workload: hours Likely takes hours to resolve labels Mar 8, 2023
@uniqueg
Copy link
Member Author

uniqueg commented Mar 8, 2023

Note that it is easy to set up another database collection via the FOCA-based app config used in proTES (it becomes basically a declarative issue).

@limarkdcunha
Copy link

Hi @uniqueg,I would to like to work on this issue but I have some doubts I would like to clear.
As I understood, after the first call we need to store the  geolocations to the database and afterwards simply fetch them from the db.
I am confused as to what is meant by first call here ? Is it first call of session initiated by a new user or the first call of day or of a particular timeframe. Knowing what a first call is will help simplify the process
Also for "fetching the missing geolocations from the remote service and adding them to the database collection" when should one check if we have missing geolocations.

@uniqueg
Copy link
Member Author

uniqueg commented Mar 22, 2023

Great, @limarkdcunha, please go ahead and assign yourself :)

As for your question: Just check for every domain if it's in the database collection. If it is, then use the stored geolocations. If it isn't, follow the current procedure and then store them in the database collection. So for now let's assume all domain-coordinate pairs are stored in the database forever and are never updated once created (unless the database is wiped). This is of course not suitable for a production setting, but it's pretty much the same configuration we have for tasks, and we can worry about life cycle management for these documents later on.

So you can just follow the way we are dealing with the tasks collection, including its registration via the app configuration in config.yaml that takes care of the boilerplate of setting up and configuring the collection.

@limarkdcunha
Copy link

Hi @uniqueg,thanks for the clarification.
From what I understand from the codebase there are two task_distribution functions (random and distance).If the input urls are present in request ,distance based task_distribution is called or else random one is called.

Random TD does nothing but reshuffles 5 domains present in the app context from config.yaml.Now here I am not able to figure out where are the geolocations fetched for a particular domain? As I am able to infer from create_task function of task_runs.py, url itself is used to create a task using tes.HTTPClient.

Also in your first comment you said "In this way, for the vast majority of calls, only input URI geolocations will need to be fetched from the remote service". I find this statement a little confusing as I previously stated that without input_urls distance based task_distribution is not called.Can you please shed some light on this as well.
A final thing I am not able to assign the issue to myself

@uniqueg
Copy link
Member Author

uniqueg commented Mar 25, 2023

I have assigned you @limarkdcunha.

Geolocations are only used for the distance-based distribution logic. The point is to choose a TES instance that is closest to the inputs. So if there are no inputs, it doesn't make sense to use it, and we just distribute tasks randomly. In that case, geolocations don't play a role.

So geolocations are ONLY used in the distance-based task distribution logic, meaning that for addressing this issue, you will not need to look at the random distribution logic at all.

Does that make it clear?

@limarkdcunha
Copy link

Yeah got it. Now I have good idea of what should be done.

@limarkdcunha limarkdcunha removed their assignment Jul 28, 2023
@uniqueg
Copy link
Member Author

uniqueg commented Oct 19, 2023

Mental note: Now that we provide the distance-based task distribution plugin (inside proTES, yet still apart), we need to think carefully how to set up the database so that it is not too tightly coupled with the main app db. Ideally, a collection for storing IP geolocations should be created automatically (and only) when the plugin is used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flag: good 1st issue Good for newcomers priority: medium Medium priority type: feature New feature or request workload: hours Likely takes hours to resolve
Projects
None yet
Development

No branches or pull requests

2 participants