
[ML] Writing Results Retries Continue After Analytics Job Stopped #53687

@blaklaybul

This was found on a recent 7.7 build. I created a classification analysis that became stuck in the writing_results phase.

[2020-03-17T12:58:57,528][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [openml-kr-vs-kp-classifier-0] [data_frame_analyzer/8755] [CBoostedTreeImpl.cc@241] Training finished after 18 iterations. Time per iteration in ms mean: 1287.84 std. dev:  2697.19
[2020-03-17T12:58:57,626][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [reba.lan] [openml-kr-vs-kp-classifier-0] Started writing results
[2020-03-17T12:58:57,882][INFO ][o.e.c.m.MetaDataMappingService] [reba.lan] [openml-kr-vs-kp-classified-0/h_cT5mm3QSWfTv6b3ZBlTQ] update_mapping [_doc]
[2020-03-17T12:58:58,149][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [1] attempts. Will attempt again in [50ms].
[2020-03-17T12:58:58,361][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [2] attempts. Will attempt again in [75ms].
[2020-03-17T12:58:58,570][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [3] attempts. Will attempt again in [276ms].
...lots more retries here...
[2020-03-17T13:16:19,080][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [15] attempts. Will attempt again in [846433ms]

Stopping the job via the UI removed it from the jobs list, but the job remains in a stopping state. The retries continue even after the job has been stopped:

[2020-03-17T13:44:24,643][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [reba.lan] [openml-kr-vs-kp-classifier-0] Stopping task with force [true]
[2020-03-17T13:44:24,668][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [reba.lan] [openml-kr-vs-kp-classifier-0] Stopping task with force [true]
[2020-03-17T13:44:24,669][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [controller/6507] [CDetachedProcessSpawner.cc@177] Child process with PID 8755 was terminated by signal 15
[2020-03-17T13:44:24,670][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [controller/6507] [CDetachedProcessSpawner.cc@99] Will not attempt to kill process 8755: not a child process
[2020-03-17T13:44:24,670][ERROR][o.e.x.m.p.l.CppLogMessageHandler] [reba.lan] [controller/6507] [CCommandProcessor.cc@96] Failed to kill process with PID 8755
[2020-03-17T13:45:13,300][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [17] attempts. Will attempt again in [884782ms]
.
.
.
[2020-03-17T14:14:48,971][WARN ][o.e.x.m.u.p.ResultsPersisterService] [reba.lan] [openml-kr-vs-kp-classifier-0] failed to index after [19] attempts. Will attempt again in [850734ms]
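
For illustration, here is a minimal sketch of the behaviour I would expect: a retry loop that gives up as soon as the job is stopped, including while it is waiting between attempts. This is not the `ResultsPersisterService` code. The names `index_with_retries` and `JobStoppedError` are hypothetical, and the simple doubling backoff is an assumption that does not try to match the intervals in the logs above.

```python
import threading


class JobStoppedError(Exception):
    """Raised when retrying is abandoned because the job was stopped."""


def index_with_retries(index_once, stop_event, max_attempts=20,
                       initial_wait_s=0.05, max_wait_s=900.0):
    """Call index_once() until it succeeds, the job is stopped, or
    max_attempts is reached. Returns the number of attempts made."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    wait_s = initial_wait_s
    for attempt in range(1, max_attempts + 1):
        # Check for a stop request before every attempt.
        if stop_event.is_set():
            raise JobStoppedError(f"stopped before attempt {attempt}")
        try:
            index_once()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
        # Event.wait() returns True as soon as the event is set, so a stop
        # request interrupts even a very long backoff interval.
        if stop_event.wait(timeout=wait_s):
            raise JobStoppedError(f"stopped after attempt {attempt}")
        # Assumed backoff policy: double the wait, capped at max_wait_s.
        wait_s = min(wait_s * 2, max_wait_s)
```

The important part is waiting on the stop signal instead of sleeping unconditionally. With retry intervals of around 14 minutes, as in the logs, a check made only between attempts would still leave the task alive long after the stop request.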

job config:

{
  "id" : "openml-kr-vs-kp-classifier-0",
  "source" : {
    "index" : [
      "openml-kr-vs-kp"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "openml-kr-vs-kp-classified-0",
    "results_field" : "ml"
  },
  "analysis" : {
    "classification" : {
      "dependent_variable" : "class",
      "class_assignment_objective" : "maximize_accuracy",
      "num_top_classes" : 2,
      "prediction_field_name" : "class_prediction",
      "training_percent" : 90.0,
      "randomize_seed" : 7077816937788972687
    }
  },
  "model_memory_limit" : "512mb",
  "create_time" : 1584464253331,
  "version" : "7.7.0",
  "allow_lazy_start" : false
}
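
To check the stuck state outside the UI, something like the sketch below could be used to issue the force stop and then read the job state back from the `_stats` endpoint. The base URL `http://localhost:9200` and the helper names are assumptions for illustration. A secured cluster would also need credentials, which are left out here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured node


def stop_request(job_id, force=True, base_url=BASE_URL):
    """Build the POST request that stops a data frame analytics job."""
    query = urllib.parse.urlencode({"force": str(force).lower()})
    job = urllib.parse.quote(job_id)
    url = f"{base_url}/_ml/data_frame/analytics/{job}/_stop?{query}"
    return urllib.request.Request(url, method="POST")


def stats_request(job_id, base_url=BASE_URL):
    """Build the GET request that fetches the job's stats."""
    job = urllib.parse.quote(job_id)
    url = f"{base_url}/_ml/data_frame/analytics/{job}/_stats"
    return urllib.request.Request(url, method="GET")


def job_state(stats_body):
    """Extract the state of the first job from a _stats response body."""
    stats = json.loads(stats_body)["data_frame_analytics"]
    return stats[0]["state"] if stats else None
```

Sending the requests would then look like `urllib.request.urlopen(stop_request("openml-kr-vs-kp-classifier-0"))`, followed by passing the body of the `stats_request` response to `job_state`. In the situation described above, the job would be expected to keep reporting `stopping`.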

Labels: :ml (Machine learning), >bug
