
Only global tasks are started and necessary ES indexes are missing #585

Closed
ncsibra-lab49 opened this issue Mar 21, 2023 · 5 comments · Fixed by chaoss/grimoirelab-sirmordred#592

@ncsibra-lab49

We deployed the GrimoireLab components to AWS ECS, mostly based on the docker-compose example, using the latest image versions.
Sortinghat, sortinghat-worker, sirmordred and nginx run as separate services.
They use managed AWS resources: MariaDB 10.6.5, Elasticsearch 6.8 and Redis 7.0.

Config files (${...} parts are replaced from env at container start):

setup.cfg
[general]
short_name = GrimoireLab
update = true
min_update_delay = 60
debug = true
logs_dir = /home/grimoire/logs
bulk_size = 100
scroll_size = 100
aliases_file = /home/grimoire/aliases.json

[projects]
projects_file = /home/grimoire/conf/projects.json

[es_collection]
url = ${ELASTICSEARCH_HOST}

[es_enrichment]
url = ${ELASTICSEARCH_HOST}
autorefresh = true

[sortinghat]
host = ${NGINX_HOST}
user = root
password = ${SORTINGHAT_PASSWORD}
port = ${NGINX_PORT}
path = /identities/api/
ssl = false
database = sortinghat_db
autoprofile = [github, git]
matching = [email,name,username]
sleep_for = 100
unaffiliated_group = Unknown
affiliate = true

[panels]
kibiter_time_from = now-90d
kibiter_default_index = git
kibiter_url = ${KIBANA_ENDPOINT}
kibiter_version = 6.8.13

[phases]
collection = true
identities = true
enrichment = true
panels = true

[git]
raw_index = git_demo_raw
enriched_index = git_demo_enriched
latest-items = true
studies = [enrich_demography:git, enrich_areas_of_code:git, enrich_onion:git]

[github]
api-token = [${GH_TOKEN0},${GH_TOKEN1},${GH_TOKEN2},${GH_TOKEN3},${GH_TOKEN4},${GH_TOKEN5},${GH_TOKEN6},${GH_TOKEN7},${GH_TOKEN8},${GH_TOKEN9}]

[github:issue]
raw_index = github_raw
enriched_index = github_enriched
category = issue
sleep-for-rate = true
no-archive = true
studies = [enrich_onion:github,enrich_geolocation:user,enrich_geolocation:assignee,enrich_backlog_analysis,enrich_demography:github]

[github:pull]
raw_index = github-pull_raw
enriched_index = github-pull_enriched
category = pull_request
sleep-for-rate = true
no-archive = true
studies = [enrich_geolocation:user,enrich_geolocation:assignee,enrich_demography:github]
           
[github:repo]
raw_index = github-repo_raw
enriched_index = github-repo_enriched
category = repository
sleep-for-rate = true
no-archive = true
studies = [enrich_demography:github]

[githubql]
raw_index = github_event_raw
enriched_index = github_event_enriched
api-token = [${GH_TOKEN0},${GH_TOKEN1},${GH_TOKEN2},${GH_TOKEN3},${GH_TOKEN4},${GH_TOKEN5},${GH_TOKEN6},${GH_TOKEN7},${GH_TOKEN8},${GH_TOKEN9}]
sleep-for-rate = true
sleep-time = "300"
no-archive = true 
studies = [enrich_duration_analysis:kanban, enrich_reference_analysis]

[github2]
api-token = [${GH_TOKEN0},${GH_TOKEN1},${GH_TOKEN2},${GH_TOKEN3},${GH_TOKEN4},${GH_TOKEN5},${GH_TOKEN6},${GH_TOKEN7},${GH_TOKEN8},${GH_TOKEN9}]

[github2:issue]
raw_index = github2-issues_raw
enriched_index = github2-issues_enriched
sleep-for-rate = true
category = issue
no-archive = true
studies = [enrich_geolocation:user, enrich_geolocation:assignee, enrich_feelings]

[github2:pull]
raw_index = github2-pull_raw
enriched_index = github2-pull_enriched
sleep-for-rate = true
category = pull_request
no-archive = true
studies = [enrich_geolocation:user, enrich_geolocation:assignee, enrich_feelings]

[enrich_demography:git]

[enrich_areas_of_code:git]
in_index = git_demo_raw
out_index = git-aoc_demo_enriched

[enrich_onion:git]
in_index = git
out_index = git-onion_demo_enriched
contribs_field = hash

[enrich_onion:github]
in_index_iss = github_issues_onion-src
in_index_prs = github_prs_onion-src
out_index_iss = github-issues-onion_enriched
out_index_prs = github-prs-onion_enriched

[enrich_geolocation:user]
location_field = user_location
geolocation_field = user_geolocation

[enrich_geolocation:assignee]
location_field = assignee_location
geolocation_field = assignee_geolocation

[enrich_feelings]
attributes = [title, body]
nlp_rest_url = http://localhost:2901

[enrich_backlog_analysis]
out_index = github_enrich_backlog
interval_days = 7
reduced_labels = [bug,enhancement]
map_label = [others, bugs, enhancements]

[enrich_demography:github]

[enrich_duration_analysis:kanban]
start_event_type = MovedColumnsInProjectEvent
fltr_attr = board_name
target_attr = board_column
fltr_event_types = [MovedColumnsInProjectEvent, AddedToProjectEvent]

[enrich_duration_analysis:label]
start_event_type = UnlabeledEvent
target_attr = label
fltr_attr = label
fltr_event_types = [LabeledEvent]

[enrich_reference_analysis]
projects.json structure
{
  "Project1": {
    "git": [
      ...
    ],
    "github2:issue": [
      ...
    ],
    "github2:pull": [
      ...
    ],
    "github:issue": [
      ...
    ],
    "github:pull": [
      ...
    ],
    "github:repo": [
      ...
    ],
    "githubql": [
      ...
    ]
  },
  "Project2": {
    "git": [
      ...
    ],
    "github2:issue": [
      ...
    ],
    "github2:pull": [
      ...
    ],
    "github:issue": [
      ...
    ],
    "github:pull": [
      ...
    ],
    "github:repo": [
      ...
    ],
    "githubql": [
      ...
    ]
  }
}

The real projects.json contains around 30,000 lines, with roughly 4,000 repos in each category.
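With a projects.json this large, it can be handy to sanity-check how many repositories each data source actually contains. A small generic sketch (the file path is whatever your deployment uses; this is not part of GrimoireLab itself):

```python
import json
from collections import Counter

def count_repos(path):
    """Count repositories per data source across all projects in a
    SirMordred-style projects.json ({project: {source: [repo, ...]}})."""
    with open(path) as f:
        projects = json.load(f)
    counts = Counter()
    for sources in projects.values():
        for source, repos in sources.items():
            counts[source] += len(repos)
    return counts
```

Running `count_repos("/home/grimoire/conf/projects.json")` then prints totals like `{'git': 4000, 'github:issue': 4000, ...}` for a layout like the one above.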

At first, it worked correctly: it collected 8,533,696 documents in Elasticsearch and created these indexes/aliases:

indices
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .grimoirelab-sigils     OOP63k6SS_iMoBPbnKA6WA   1   1         39            0     16.5kb         16.5kb
green  open   .kibana_1               5e1J35IPTtm7RRE9MY02CQ   1   0        245           21    412.1kb        412.1kb
yellow open   git-aoc_demo_enriched   m8AoYdcET1Cv3FSyBNO1dw   1   1    6421491            0      3.4gb          3.4gb
yellow open   git-onion_demo_enriched H_VOYXZNT4aw5G7Zfjr2Mw   1   1       3543            0    848.1kb        848.1kb
yellow open   git_demo_enriched       BjK5KJ6FTNOsm_gG4iuing   1   1     853195        20562      1.7gb          1.7gb
yellow open   git_demo_raw            Qovgzh_SRB-uflQaVL-Wfw   1   1     854108        12815        1gb            1gb
yellow open   github-pull_raw         BPfXonH4RQuIlradBFTcGA   1   1      15291          729    172.6mb        172.6mb
yellow open   github-repo_enriched    urkhbFoQSkCFX4a2V1ukYg   1   1      34724          219       12mb           12mb
yellow open   github-repo_raw         x5nbW7BARNWFtTCmo2WNdg   1   1      39289            0     99.1mb         99.1mb
yellow open   github2-issues_raw      Tc5QIBzjQ7aFN95RYcCWYQ   1   1      95354         1128    512.3mb        512.3mb
yellow open   github2-pull_raw        VVX-NqiEQoemv8nbYveImg   1   1      15191          729    171.1mb        171.1mb
yellow open   github_event_raw        AEuBFKoURwi4Xc0R0h6ZBA   1   1     105872            0    307.2mb        307.2mb
yellow open   github_raw              _FK1r5amQYKz9KtNIRhu8A   1   1      95354         1134    514.7mb        514.7mb
aliases
alias                     index                 filter routing.index routing.search
.kibana                   .kibana_1             -      -             -
affiliations              git_demo_enriched     -      -             -
all_enriched              git_demo_enriched     -      -             -
demographics              git_demo_enriched     -      -             -
git                       git_demo_enriched     -      -             -
git-raw                   git_demo_raw          -      -             -
git_areas_of_code         git-aoc_demo_enriched -      -             -
git_author                git_demo_enriched     -      -             -
git_enrich                git_demo_enriched     -      -             -
github-raw                github_raw            -      -             -
github2_issues-raw        github2-issues_raw    -      -             -
github2_pull_requests-raw github2-pull_raw      -      -             -
github_events-raw         github_event_raw      -      -             -
github_pull_requests-raw  github-pull_raw       -      -             -
github_repositories       github-repo_enriched  -      -             -
github_repositories-raw   github-repo_raw       -      -             -

The first problem is that not all of the indexes defined in setup.cfg were created, e.g. github_enriched, github-pull_enriched, github_event_enriched, etc.
This could be because the enrichment process never got a chance to finish.

The bigger issue is that it now does not start any of the git/github tasks, only the global tasks:

all.log

2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO -
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO - ----------------------------
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO - Starting SirMordred engine ...
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO - - - - - - - - - - - - - - -
2023-03-20 12:35:22,009 - sirmordred.sirmordred - INFO - Loading projects
2023-03-20 12:35:28,504 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-20 12:35:28,514 - sirmordred.sirmordred - INFO - TaskProjects TaskIdentitiesMerge will be executed on Mon, 20 Mar 2023 12:37:08
2023-03-20 12:35:30,201 - sirmordred.task_projects - INFO - Reading projects data from  /home/grimoire/conf/projects.json
2023-03-20 12:35:31,202 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
2023-03-20 12:35:42,354 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: da40b6aa-e173-469a-b473-28bcd8f223c1
2023-03-20 12:36:53,094 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm name
2023-03-20 12:37:06,286 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 70a4a8f1-077f-4b54-aeaa-ec5b449175eb
2023-03-20 12:38:17,005 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm username
2023-03-20 12:38:28,969 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 040cd3d7-f77c-49e1-a666-f9da6932fd9f
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - ERROR - http://aslive-nginx.github-metrics-dashboard:8000/identities/api/: HTTP Error 503: Service Unavailable
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: connection: close
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: content-length: 168
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: content-type: text/plain
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: date: Mon, 20 Mar 2023 12:39:09 GMT
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: server: envoy
2023-03-20 12:39:09,513 - sgqlc.endpoint.http - INFO - Response [text/plain]:
upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 113
2023-03-20 12:39:09,513 - grimoire_elk.enriched.sortinghat_gelk - ERROR - [sortinghat] Error unify criteria: ['username']
{'message': 'HTTP Error 503: Service Unavailable', 'exception': <HTTPError 503: 'Service Unavailable'>, 'status': 503, 'headers': <http.client.HTTPMessage object at 0x7fb32c13e2e0>, 'body': 'upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 113'}
2023-03-20 12:39:09,513 - sirmordred.task_identities - INFO - [sortinghat] Executing affiliate
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - ERROR - http://aslive-nginx.github-metrics-dashboard:8000/identities/api/: HTTP Error 503: Service Unavailable
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: connection: close
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: content-length: 168
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: content-type: text/plain
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: date: Mon, 20 Mar 2023 12:39:18 GMT
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: server: envoy
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response [text/plain]:
upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 113
2023-03-20 12:39:18,729 - sirmordred.task_identities - INFO - [sortinghat] Autogender not configured. Skipping.
2023-03-20 12:39:18,729 - sirmordred.task_manager - INFO - [Global tasks] sleeping for 100 seconds
2023-03-20 12:40:59,826 - sirmordred.task_projects - INFO - Reading projects data from  /home/grimoire/conf/projects.json
2023-03-20 12:41:00,828 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
2023-03-20 12:41:28,267 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 98d3c123-61e6-4f87-ae70-889df9411aee
2023-03-20 12:42:39,088 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm name
2023-03-20 12:42:50,132 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 5901a732-f894-4d9e-a832-24aca1d0573f
2023-03-20 12:44:00,846 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm username
2023-03-20 12:44:13,938 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 5aa046f1-b8a9-49f7-bf2d-36aa51f16b12
2023-03-20 12:45:19,764 - sirmordred.task_identities - INFO - [sortinghat] Executing affiliate
2023-03-20 12:45:31,444 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate job id: 817d5976-5b4e-401d-9def-193a88fce607
2023-03-20 12:46:42,323 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate finished job id: 817d5976-5b4e-401d-9def-193a88fce607
2023-03-20 12:46:42,325 - sirmordred.task_identities - INFO - [sortinghat] Autogender not configured. Skipping.
2023-03-20 12:46:42,325 - sirmordred.task_manager - INFO - [Global tasks] sleeping for 100 seconds
2023-03-20 12:48:23,423 - sirmordred.task_projects - INFO - Reading projects data from  /home/grimoire/conf/projects.json
2023-03-20 12:48:24,425 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
2023-03-20 12:48:51,317 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 30233057-6762-4afa-9158-4d9a4ee3c58a
2023-03-20 12:50:02,183 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm name
2023-03-20 12:50:15,415 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 14b0af80-fe71-4252-9ede-3e74d0b15b3c
2023-03-20 12:51:26,311 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm username
2023-03-20 12:51:37,014 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 5db0fe50-3ebb-4559-8896-99f9c70b4fd7
2023-03-20 12:52:42,722 - sirmordred.task_identities - INFO - [sortinghat] Executing affiliate
2023-03-20 12:52:54,807 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate job id: db8c28c9-9a68-476a-a1c5-f60b7b2430d3
2023-03-20 12:54:05,702 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate finished job id: db8c28c9-9a68-476a-a1c5-f60b7b2430d3
2023-03-20 12:54:05,705 - sirmordred.task_identities - INFO - [sortinghat] Autogender not configured. Skipping.
2023-03-20 12:54:05,705 - sirmordred.task_manager - INFO - [Global tasks] sleeping for 100 seconds

I enabled debug logging too, but it does not show any error or other reason why the git/github tasks are not running.
It looks like SirMordred knows that some backend tasks should be started, but it only starts the global tasks.

debug log snippet

2023-03-21 08:38:03,600 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-21 08:38:03,602 - sirmordred.sirmordred - DEBUG - backend_tasks = [<class 'sirmordred.task_collection.TaskRawDataCollection'>, <class 'sirmordred.task_enrich.TaskEnrich'>]
2023-03-21 08:38:03,602 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>, <class 'sirmordred.task_identities.TaskIdentitiesMerge'>]
2023-03-21 08:38:03,603 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-21 08:38:03,603 - sirmordred.sirmordred - INFO - TaskProjects TaskIdentitiesMerge will be executed on Tue, 21 Mar 2023 08:39:43
2023-03-21 08:38:03,603 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>, <class 'sirmordred.task_identities.TaskIdentitiesMerge'>]

I tried restarting the container multiple times, but it does not help.
Any idea what the issue is here?

@atomheartmother

I had a similar issue with sirmordred whereby it would complete one cycle of collection and enrichment but then not start another; it would only perform the sortinghat tasks in a loop. I would be interested in what the solution is here.

@ncsibra-lab49
Author

ncsibra-lab49 commented Mar 28, 2023

I figured out the problem: it is basically a synchronization issue in sirmordred.
I added a few extra debug logs to track it down.

debug log

2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO -
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO - ----------------------------
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO - Starting SirMordred engine ...
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO - - - - - - - - - - - - - - -
2023-03-28 11:22:24,983 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): vpc-***.es.amazonaws.com:443
2023-03-28 11:22:25,004 - urllib3.connectionpool - DEBUG - https://vpc-***.es.amazonaws.com:443 "GET / HTTP/1.1" 200 515
2023-03-28 11:22:25,014 - urllib3.connectionpool - DEBUG - https://vpc-***.es.amazonaws.com:443 "GET / HTTP/1.1" 200 515
2023-03-28 11:22:25,015 - sirmordred.sirmordred - DEBUG - backend_tasks = []
2023-03-28 11:22:25,015 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_panels.TaskPanels'>, <class 'sirmordred.task_panels.TaskPanelsMenu'>]
2023-03-28 11:22:25,015 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:25,015 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_panels.TaskPanels'>, <class 'sirmordred.task_panels.TaskPanelsMenu'>]
2023-03-28 11:22:25,018 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:26,017 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:28,126 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:28,130 - sgqlc.endpoint.http - DEBUG - Query:
mutation {
tokenAuth(username: "***", password: "***") {
token
}
}
2023-03-28 11:22:31,448 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:34,540 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:34,542 - sgqlc.endpoint.http - DEBUG - Query:
mutation {
tokenAuth(username: "***", password: "***") {
token
}
}
2023-03-28 11:22:38,346 - sirmordred.task_panels - DEBUG - Active data sources for menu: ['git', 'github', 'github', 'community']
2023-03-28 11:22:38,347 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_panels.TaskPanels object at 0x7fd6847977c0>, <sirmordred.task_panels.TaskPanelsMenu object at 0x7fd684797c40>]
2023-03-28 11:22:38,347 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - [thread:main] No exceptions in threads queue. Let's continue ..
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - [thread:main] All threads (and their tasks) are finished
2023-03-28 11:22:38,347 - sirmordred.sirmordred - INFO - Loading projects
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - backend_tasks = []
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:38,350 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:39,349 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:41,456 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:41,457 - sgqlc.endpoint.http - DEBUG - Query:
mutation {
tokenAuth(username: "***", password: "***") {
token
}
}
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_projects.TaskProjects object at 0x7fd6847d9070>]
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] No exceptions in threads queue. Let's continue ..
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] All threads (and their tasks) are finished
2023-03-28 11:22:41,633 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - backend_tasks = [<class 'sirmordred.task_collection.TaskRawDataCollection'>, <class 'sirmordred.task_enrich.TaskEnrich'>]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>, <class 'sirmordred.task_identities.TaskIdentitiesMerge'>]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - TaskProjects get_projects: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos found: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos to be retrieved: {}
2023-03-28 11:22:41,635 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:41,635 - sirmordred.sirmordred - INFO - TaskProjects TaskIdentitiesMerge will be executed on Tue, 28 Mar 2023 11:24:21
2023-03-28 11:22:41,635 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>, <class 'sirmordred.task_identities.TaskIdentitiesMerge'>]
2023-03-28 11:22:41,636 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:41,642 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:41,643 - sgqlc.endpoint.http - DEBUG - Query:
mutation {
tokenAuth(username: "***", password: "***") {
token
}
}
2023-03-28 11:22:41,842 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:41,847 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:41,848 - sgqlc.endpoint.http - DEBUG - Query:
mutation {
tokenAuth(username: "***", password: "***") {
token
}
}
2023-03-28 11:22:42,023 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_projects.TaskProjects object at 0x7fd6847d9100>, <sirmordred.task_identities.TaskIdentitiesMerge object at 0x7fd684ad2430>]
2023-03-28 11:22:43,024 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks started: <sirmordred.task_projects.TaskProjects object at 0x7fd6847d9100>
2023-03-28 11:22:43,025 - sirmordred.task_projects - INFO - Reading projects data from  /home/grimoire/conf/projects.json
2023-03-28 11:22:43,050 - sirmordred.task_projects - DEBUG - Projects file has changed
2023-03-28 11:22:43,050 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks finished: <sirmordred.task_projects.TaskProjects object at 0x7fd6847d9100>
2023-03-28 11:22:43,050 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks started: <sirmordred.task_identities.TaskIdentitiesMerge object at 0x7fd684ad2430>
2023-03-28 11:22:44,050 - sirmordred.task_identities - DEBUG - [unify] Enrich tasks active: 0
2023-03-28 11:22:44,051 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email

The interesting parts are:

2023-03-28 11:22:38,347 - sirmordred.sirmordred - INFO - Loading projects
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - backend_tasks = []
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:39,349 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_projects.TaskProjects object at 0x7fd6847d9070>]
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] No exceptions in threads queue. Let's continue ..
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] All threads (and their tasks) are finished
2023-03-28 11:22:41,633 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - backend_tasks = [<class 'sirmordred.task_collection.TaskRawDataCollection'>, <class 'sirmordred.task_enrich.TaskEnrich'>]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>, <class 'sirmordred.task_identities.TaskIdentitiesMerge'>]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - TaskProjects get_projects: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos found: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos to be retrieved: {}
2023-03-28 11:22:41,635 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts

At first, the __execute_initial_load function tries to collect the projects from the configured projects_file or projects_url.
In this case, the wait_for_threads flag is set to True in the execute_batch_tasks function, which sets the stopper after one second, and that is the cause of the issue.

if wait_for_threads:
    time.sleep(1)  # Give enough time create and run all threads
    stopper.set()  # All threads must stop in the next iteration
    logger.debug('stopper set')

I added an extra log there.

In task_manager.py, the loop only runs while the stopper is not set.

As you can see from the logs, stopper.set() runs before TaskManager even had a chance to enter the loop:

2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:39,349 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_projects.TaskProjects object at 0x7fd6847d9070>]
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting

There is a 3-second delay between "Task starts" and "Tasks will be executed in this order", which is logged just before the loop.
So the projects are not collected on the first run, and when execute_batch_tasks is called again from execute_nonstop_tasks with wait_for_threads=False, it will not find any projects and won't create the backend threads. It never gets another chance either, because the global tasks thread never finishes, so execute_batch_tasks won't run again.
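The race is easy to reproduce outside SirMordred. A self-contained sketch (run_manager and execute_batch_tasks below only mimic the real functions, with the delays scaled down): when the child thread's startup takes longer than the parent's fixed sleep, the stopper is already set by the time the loop condition is first checked, so zero task passes run.

```python
import threading
import time

def run_manager(stopper, startup_delay, iterations):
    """Mimics TasksManager.run: slow thread startup, then a stopper-guarded loop."""
    time.sleep(startup_delay)       # stands in for auth calls, setup work, etc.
    while not stopper.is_set():
        iterations.append(1)        # stands in for "execute every task once"
        time.sleep(0.01)

def execute_batch_tasks(startup_delay, stopper_delay=0.1):
    """Mimics execute_batch_tasks with wait_for_threads=True."""
    stopper = threading.Event()
    iterations = []
    worker = threading.Thread(target=run_manager,
                              args=(stopper, startup_delay, iterations))
    worker.start()
    time.sleep(stopper_delay)  # "give enough time" -- a guess, not a guarantee
    stopper.set()              # may fire before the loop was ever entered
    worker.join()
    return len(iterations)
```

With a fast startup (`startup_delay=0.0`) the loop runs several passes before the stopper fires; with a slow startup (`startup_delay=0.5`, like the 3-second gap in the logs above) it returns 0 and the tasks never execute.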

In a resource-limited environment like the one we created in ECS, this time-based synchronization will not work.
Currently, in the execute_batch_tasks function, the child threads stop when the parent tells them to, but the real requirement in the wait_for_threads=True case seems to be that the TaskManager loop runs at least once before the thread stops.
So the child thread should notify the parent thread when it is finished.
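One way to implement that notification (a sketch with hypothetical names, not the actual SirMordred code) is to have the child set a second threading.Event after its first full pass, and have the parent wait on that event instead of sleeping a fixed second:

```python
import threading

def run_manager(tasks, first_pass_done, stopper):
    """Loop-and-a-half: always run the tasks at least once, then honor the stopper."""
    while True:
        for task in tasks:
            task()
        first_pass_done.set()   # tell the parent one full pass has finished
        if stopper.is_set():
            break

def execute_initial_load(tasks):
    """Parent waits for the child's signal instead of guessing with time.sleep."""
    first_pass_done = threading.Event()
    stopper = threading.Event()
    worker = threading.Thread(target=run_manager,
                              args=(tasks, first_pass_done, stopper))
    worker.start()
    first_pass_done.wait()  # blocks until one pass has run, however slow startup is
    stopper.set()
    worker.join()
```

This removes the timing assumption entirely: however long thread creation or the SortingHat calls take, the parent cannot set the stopper before at least one full pass has executed.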

My quick and dirty solution was to change the loop in TaskManager to a loop-and-a-half pattern.

So from this:

while not self.stopper.is_set():
    # we give 1 extra second to the stopper, so this loop does
    # not finish before it is set.
    time.sleep(1)

    for task in self.tasks:
        logger.debug('[%s] Tasks started: %s', self.backend_section, task)
        try:
            task.execute()
        except Exception as ex:
            logger.error("[%s] Exception in Task Manager %s", self.backend_section, ex, exc_info=True)
            TasksManager.COMM_QUEUE.put(sys.exc_info())
            raise
        logger.debug('[%s] Tasks finished: %s', self.backend_section, task)

    timer = self.__get_timer(self.backend_section)
    if timer > 0 and self.config.get_conf()['general']['update']:
        logger.info("[%s] sleeping for %s seconds ", self.backend_section, timer)
        time.sleep(timer)

To this:

while True:
    # we give 1 extra second to the stopper, so this loop does
    # not finish before it is set.
    time.sleep(1)

    for task in self.tasks:
        logger.debug('[%s] Tasks started: %s', self.backend_section, task)
        try:
            task.execute()
        except Exception as ex:
            logger.error("[%s] Exception in Task Manager %s", self.backend_section, ex, exc_info=True)
            TasksManager.COMM_QUEUE.put(sys.exc_info())
            raise
        logger.debug('[%s] Tasks finished: %s', self.backend_section, task)

    timer = self.__get_timer(self.backend_section)
    if timer > 0 and self.config.get_conf()['general']['update']:
        logger.info("[%s] sleeping for %s seconds ", self.backend_section, timer)
        time.sleep(timer)

    if self.stopper.is_set():
        break

Now all the backend tasks are started too, because the projects are collected correctly.

@sduenas
Member

sduenas commented Mar 29, 2023

I never liked how this was implemented. The way threads and tasks are handled is a bit messy, so it's prone to bugs like the one you are mentioning.

@ncsibra-lab49 I think your code fixes this problem but I'm not sure if it could lead to other race conditions. Any insight about it?

@ncsibra-lab49
Author

ncsibra-lab49 commented Mar 30, 2023

I don't think it leads to a race condition because the shared objects (backend_tasks, global_tasks, stopper, etc.) are not modified in the child or parent thread.
But as I said, my change is just a quick and dirty solution (more like a workaround); I wouldn't put it in the actual codebase.

The real solution is to avoid time-based synchronization.
I think one option is to remove the stopper and simply pass a parameter to TaskManager telling it whether to run indefinitely, then join on the threads.
When run from execute_nonstop_tasks the threads never stop anyway, so it would be the same as the current behavior.

The other option is to control the number of runs in the parent process: remove the loop from TaskManager and move it to SirMordred.
The advantage of this is that we can do anything between two runs, like reloading the backend list with _get_repos_by_backend.
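That second option could look something like this (all names hypothetical, a sketch rather than a patch): the parent owns the loop, each pass rebuilds the task list, and joins replace the timed sleeps:

```python
import threading

def run_tasks_once(tasks):
    """One TaskManager pass: run each task in its own thread, then wait for all."""
    threads = [threading.Thread(target=task) for task in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # deterministic: wait for completion, don't guess with sleep()

def sirmordred_main_loop(get_tasks, update=True, max_runs=None):
    """The loop lives in the parent, so the task list can be rebuilt between runs
    (e.g. re-reading projects.json or the backend list)."""
    runs = 0
    while True:
        run_tasks_once(get_tasks())
        runs += 1
        if not update or (max_runs is not None and runs >= max_runs):
            return runs
```

Because the parent calls get_tasks() before every pass, newly discovered repos or backends would be picked up naturally, which is exactly the "do anything between two runs" advantage mentioned above.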

sduenas added a commit to sduenas/grimoirelab-sirmordred that referenced this issue May 12, 2023
When mordred starts, it loads the dashboards to Kibana
and next, it reads the list of projects to analyze.
Due to a race condition, the thread that reads the list
of projects could not even start, so data backends won't
have any input to work with.

This commit fixes the problem adding isolated code that
will run the initial tasks. Once these tasks are finished,
the main loop will start the task manager.

Fixes chaoss/grimoirelab#585

Signed-off-by: Santiago Dueñas <[email protected]>
@sduenas
Member

sduenas commented May 17, 2023

This issue is fixed now. The new release of GrimoireLab will include the fix.
