
Fix the concurrent map access crashing problem #162

Merged: 9 commits, Jul 7, 2022
Conversation

@haoel (Contributor) commented Jul 5, 2022

Background

We have a map that collects all of the probers' probe results. Each prober writes its result into this map concurrently under its own key (which by itself is not a problem). However, another goroutine persists this map periodically, which causes the "concurrent map iteration and map write" error and makes EaseProbe crash.

The stack trace of the crash is as follows:

fatal error: concurrent map iteration and map write

goroutine 50 [running]:
runtime.throw({0xfd17f1?, 0x0?})
        /usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc0172745f0 sp=0xc0172745c0 pc=0x438b31
runtime.mapiternext(0x0?)
        /usr/local/go/src/runtime/map.go:871 +0x4eb fp=0xc017274660 sp=0xc0172745f0 pc=0x41058b
runtime.mapiterinit(0xc0172747e8?, 0xc0265008d0?, 0xc017274710?)
        /usr/local/go/src/runtime/map.go:861 +0x228 fp=0xc017274680 sp=0xc017274660 pc=0x410048
reflect.mapiterinit(0xc0172746c8?, 0x8c24b3?, 0xc000303400?)
....
....
....
....
gopkg.in/yaml%2ev3.(*encoder).marshalDoc(0xc000303400, {0x0, 0x0}, {0xe7b4e0?, 0xc02a155740?, 0x0?})
        /home/ubuntu/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/encode.go:105 +0x185 fp=0xc017275ab8 sp=0xc017275838 pc=0x8c2765
gopkg.in/yaml%2ev3.Marshal({0xe7b4e0?, 0xc02a155740})
        /home/ubuntu/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/yaml.go:222 +0x370 fp=0xc017275de8 sp=0xc017275ab8 pc=0x8e2e70
github.com/megaease/easeprobe/probe.SaveDataToFile({0xc0015858c0, 0x18})
        /home/ubuntu/hchen/easeprobe/probe/data.go:102 +0x85 fp=0xc017275e50 sp=0xc017275de8 pc=0x8e6ba5
main.saveData.func1()
        /home/ubuntu/hchen/easeprobe/cmd/easeprobe/report.go:34 +0x3d fp=0xc017275ee8 sp=0xc017275e50 pc=0xda975d
main.saveData(0xc000090780)
        /home/ubuntu/hchen/easeprobe/cmd/easeprobe/report.go:59 +0x16f fp=0xc017275fc8 sp=0xc017275ee8 pc=0xda958f
main.main.func3()
        /home/ubuntu/hchen/easeprobe/cmd/easeprobe/main.go:159 +0x26 fp=0xc017275fe0 sp=0xc017275fc8 pc=0xda7f66
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc017275fe8 sp=0xc017275fe0 pc=0x46b6a1
created by main.main
        /home/ubuntu/hchen/easeprobe/cmd/easeprobe/main.go:159 +0x5fd

goroutine 1 [chan receive, 328 minutes]:
main.main()
        /home/ubuntu/hchen/easeprobe/cmd/easeprobe/main.go:205 +0x82b

The crashing code is shown below:

// SaveDataToFile save the results to file
func SaveDataToFile(filename string) error {
        metaData.file = filename
        if strings.TrimSpace(filename) == "-" {
                return nil
        }

        dataBuf, err := yaml.Marshal(resultData)   //<========= Panic at this line
        if err != nil {
                return err
        }

        genMetaBuf()
        buf := append(metaBuf, dataBuf...)

        if err := ioutil.WriteFile(filename, []byte(buf), 0644); err != nil {
                return err
        }
        return nil
}

Investigation

The problem is that each prober's result (statistics data) is consumed by the following modules:

  • Data Saving
  • SLA Report
  • Web Server

All of the above modules only read the statistics data, while the probers keep updating it.

So, this is a concurrent read-write problem.
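
To see why this crashes, here is a minimal standalone program (a sketch, not EaseProbe code) that reproduces the same fatal error: one goroutine writes a plain map while another iterates it, just like the probers vs. the periodic save goroutine (yaml.Marshal iterates the map via reflection). Note this is a runtime fatal error, not a panic, so it cannot be recovered.

package main

import "time"

func main() {
        m := map[string]int{"a": 0}

        // Writer: keeps updating the map, like a prober setting its result.
        go func() {
                for i := 0; ; i++ {
                        m["a"] = i
                }
        }()

        // Reader: keeps iterating the map, like the periodic save goroutine.
        // The runtime usually aborts within milliseconds with:
        //   fatal error: concurrent map iteration and map write
        go func() {
                for {
                        for range m {
                        }
                }
        }()

        time.Sleep(time.Second)
}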

Solution

We transfer all of the probers' statistics data into the Save Manager (via a channel), and all of the consumers (Saving, SLA Report, Web Server) retrieve the statistics data from the Save Manager instead of from the probers directly.

  1. Each Prober retrieves the persistent data from the Save Manager as a cloned object.
  2. Each Prober clones & reports its Result to the Save Manager via a Go channel.
  3. The Web Server and SLA Report query the result data from the Save Manager.
  4. A read-write lock is added to the Save Manager's Get/Set methods (see the sketch after the diagram below).
                                                              ┌────────────┐
                                                   Query      │            │
  Result┌────────┐ retrieve                 ┌────────────────►│ Web Server │
┌───────┤ Prober │◄────────┐                │                 │            │
│       └────────┘         │                │                 └────────────┘
│                          │ Result. ┌──────┴──────┐
│ Result┌────────┐ retrieve│ Clone() │             │          ┌────────────┐
├───────┤ Prober │◄────────┼─────────┤ Save Manager│  Query   │            │
│       └────────┘         │         │   data.go   ├─────────►│ SLA Report │
│                          │    ┌────►             │          │            │
│ Result┌────────┐ retrieve│    │    └─────▲─┬─────┘          └────────────┘
├───────┤ Prober │◄────────┘    │          │ │
│       └────────┘              │          │ │  map[name]*Result
│                               │          │ │
│                               │     Load │ │ Save
│    Result.                    │      ┌───┴─▼────┐
│    Clone()  ┌──────────────┐  │      │          │
└────────────►│    Channel   ├──┘      │   File   │
              └──────────────┘         │          │
                                       └──────────┘
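
A minimal sketch of this design (hypothetical names and buffer size, not the actual probe/data.go API): probers send cloned Results over a channel, one goroutine applies them through a locked setter, and the consumers go through a locked getter.

package probe

import "sync"

// Result stands in for the prober statistics (hypothetical fields).
type Result struct {
        Name string
}

var (
        resultData = make(map[string]*Result)
        lock       sync.RWMutex
        resultCh   = make(chan *Result, 256) // hypothetical buffer size
)

// SetResult stores a result under the write lock.
func SetResult(r *Result) {
        lock.Lock()
        defer lock.Unlock()
        resultData[r.Name] = r
}

// GetResult reads a result under the read lock; the Web Server and
// SLA Report would query through accessors like this rather than
// touching the map directly.
func GetResult(name string) *Result {
        lock.RLock()
        defer lock.RUnlock()
        return resultData[name]
}

// run drains the channel; probers send a clone (resultCh <- clone)
// instead of writing into resultData themselves.
func run() {
        for r := range resultCh {
                SetResult(r)
        }
}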

@codecov-commenter commented Jul 5, 2022

Codecov Report

Merging #162 (97fddcd) into main (00d5519) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #162      +/-   ##
==========================================
+ Coverage   92.17%   92.21%   +0.03%     
==========================================
  Files          46       45       -1     
  Lines        3323     3339      +16     
==========================================
+ Hits         3063     3079      +16     
  Misses        191      191              
  Partials       69       69              
Impacted Files          Coverage Δ
probe/result.go         91.66% <ø> (ø)
probe/base/base.go      100.00% <100.00%> (ø)
probe/data.go           65.18% <100.00%> (+1.90%) ⬆️
report/filter.go        94.05% <100.00%> (+0.12%) ⬆️
report/sla.go           94.26% <100.00%> (+0.16%) ⬆️
daemon/daemon_unix.go

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@haoel (Contributor, Author) commented Jul 5, 2022

Removed the unnecessary lock, and commented on the functions that don't need the lock.

@haoel (Contributor, Author) commented Jul 5, 2022

Thanks, and I feel there still might be a problem, because the result data is still updated by every prober...

@haoel (Contributor, Author) commented Jul 5, 2022

New Solution

  1. Each Prober retrieves the persistent data from the Save Manager as a cloned object.
  2. Each Prober clones & reports its Result to the Save Manager via a channel.
  3. The Save Manager needs no lock, because no data is shared (see the sketch after the diagram below).
          Result
  report┌────────┐ retrieve
┌───────┤ Prober │◄────────┐
│       └────────┘         │
│         Result           │ Result. ┌─────────────┐
│ report┌────────┐ retrieve│ Clone() │             │
├───────┤ Prober │◄────────┼─────────┤ Save Manager│
│       └────────┘         │         │   data.go   │
│         Result           │    ┌────►             │
│ report┌────────┐ retrieve│    │    └─────▲─┬─────┘
├───────┤ Prober │◄────────┘    │          │ │
│       └────────┘              │          │ │  map[name]*Result
│                               │          │ │
│                               │          │ │
│    Result.                    │      ┌───┴─▼────┐
│    Clone()  ┌──────────────┐  │      │          │
└────────────►│  go  channel ├──┘      │   File   │
              └──────────────┘         │          │
                                       └──────────┘
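
A minimal sketch of this lock-free variant (again with hypothetical names): the map is confined to a single goroutine, so every write (arriving as a clone over the channel) and every iteration (for saving) happens on the owner, and no mutex is needed.

package main

import (
        "fmt"
        "time"
)

// Result stands in for the prober statistics (hypothetical).
type Result struct{ Name string }

// saveToFile stands in for the periodic persistence step.
func saveToFile(data map[string]*Result) {
        fmt.Printf("saving %d results\n", len(data))
}

// runSaveManager owns the map exclusively: both the mutation and the
// iteration happen on this goroutine, so no data is shared.
func runSaveManager(resultCh <-chan *Result, saveTick <-chan time.Time) {
        data := make(map[string]*Result)
        for {
                select {
                case r := <-resultCh:
                        data[r.Name] = r
                case <-saveTick:
                        saveToFile(data)
                }
        }
}

func main() {
        ch := make(chan *Result)
        go runSaveManager(ch, time.Tick(500*time.Millisecond))
        ch <- &Result{Name: "probe-1"}
        time.Sleep(time.Second)
}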

@xxx7xxxx commented Jul 5, 2022

LGTM

@proditis (Collaborator) left a comment


LGTM

@haoel (Contributor, Author) commented Jul 6, 2022

After carefully checking the code, I found the SLA Report also has the same problem, but with a very low probability.

I will do the refactoring for the SLA report after #158 is merged; otherwise, there will be many conflicts.

So, this PR is on hold for the moment.

@haoel (Contributor, Author) commented Jul 6, 2022

Note: force-pushed to sync up with the main branch.

@haoel (Contributor, Author) commented Jul 6, 2022

Updated Solution

  1. Each Prober retrieves the persistent data from the Save Manager as a cloned object.
  2. Each Prober clones & reports its Result to the Save Manager via a channel.
  3. The Web Server and SLA Report query the result data from the Save Manager.
  4. A read-write lock is added to the Save Manager's Get/Set methods.
                                                              ┌────────────┐
                                                   Query      │            │
  Result┌────────┐ retrieve                 ┌────────────────►│ Web Server │
┌───────┤ Prober │◄────────┐                │                 │            │
│       └────────┘         │                │                 └────────────┘
│                          │ Result. ┌──────┴──────┐
│ Result┌────────┐ retrieve│ Clone() │             │          ┌────────────┐
├───────┤ Prober │◄────────┼─────────┤ Save Manager│  Query   │            │
│       └────────┘         │         │   data.go   ├─────────►│ SLA Report │
│                          │    ┌────►             │          │            │
│ Result┌────────┐ retrieve│    │    └─────▲─┬─────┘          └────────────┘
├───────┤ Prober │◄────────┘    │          │ │
│       └────────┘              │          │ │  map[name]*Result
│                               │          │ │
│                               │     Load │ │ Save
│    Result.                    │      ┌───┴─▼────┐
│    Clone()  ┌──────────────┐  │      │          │
└────────────►│    Channel   ├──┘      │   File   │
              └──────────────┘         │          │
                                       └──────────┘

@xxx7xxxx commented Jul 6, 2022

LGTM, the deep copy is a great way to fix the race condition.
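
For context, a minimal sketch of the deep-copy idea (hypothetical fields, not the actual Result struct): copying the struct value duplicates the scalar fields, but reference-typed fields such as maps must be rebuilt explicitly, otherwise the clone still shares memory with the original.

// Result is hypothetical; only the shape matters here.
type Result struct {
        Name  string
        Stats map[string]int64
}

// Clone returns a deep copy: the struct assignment copies the value
// fields, and the map is rebuilt so nothing is shared afterwards.
func (r *Result) Clone() Result {
        c := *r
        c.Stats = make(map[string]int64, len(r.Stats))
        for k, v := range r.Stats {
                c.Stats[k] = v
        }
        return c
}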

@localvar (Collaborator) commented Jul 6, 2022

/LGTM

@haoel (Contributor, Author) commented Jul 7, 2022

I've tested this PR for 20 hours with the following setup:

  • 20K TCP probes
  • 5s probe interval
  • 10 queries per second against the SLA Report page

Everything has been fine so far, so I am going to merge this PR.

@zhao-kun zhao-kun merged commit bc50547 into megaease:main Jul 7, 2022
@samanhappy (Collaborator) commented

Is this problem resolved? I got the same error yesterday; the stack trace of the crash is as follows:

fatal error: concurrent map iteration and map write

goroutine 29 [running]:
runtime.throw({0xfc89c8?, 0x0?})
	/opt/hostedtoolcache/go/1.18.3/x64/src/runtime/panic.go:992 +0x71 fp=0xc0003925f0 sp=0xc0003925c0 pc=0x438b31
runtime.mapiternext(0x0?)
	/opt/hostedtoolcache/go/1.18.3/x64/src/runtime/map.go:871 +0x4eb fp=0xc000392660 sp=0xc0003925f0 pc=0x41058b
runtime.mapiterinit(0xc0003927e8?, 0xc000377b60?, 0xc000392710?)
	/opt/hostedtoolcache/go/1.18.3/x64/src/runtime/map.go:861 +0x228 fp=0xc000392680 sp=0xc000392660 pc=0x410048
reflect.mapiterinit(0xc0003926c8?, 0x8c05b3?, 0xc000535000?)
	/opt/hostedtoolcache/go/1.18.3/x64/src/runtime/map.go:1373 +0x19 fp=0xc0003926a8 sp=0xc000392680 pc=0x4652d9
reflect.Value.MapKeys({0xe70440?, 0xc000307f88?, 0x0?})
	/opt/hostedtoolcache/go/1.18.3/x64/src/reflect/value.go:1644 +0xfd fp=0xc000392780 sp=0xc0003926a8 pc=0x4d2a1d
......

@haoel (Contributor, Author) commented Jul 26, 2022

Can I have the full stack trace?

@samanhappy (Collaborator) commented

The full stack trace is here: easeprobe.txt

@haoel (Contributor, Author) commented Jul 26, 2022

@samanhappy Thanks for reporting this issue.

I can see the following stack:

github.com/megaease/easeprobe/probe.SaveDataToFile({0xfa8978, 0xe})
	/home/runner/work/easeprobe/easeprobe/probe/data.go:102 +0x85 fp=0xc000393e50 sp=0xc000393de8 pc=0x8e4ca5
main.saveData.func1()
	/home/runner/work/easeprobe/easeprobe/cmd/easeprobe/report.go:34 +0x3d fp=0xc000393ee8 sp=0xc000393e50 pc=0xda221d
main.saveData(0xc000078540)
	/home/runner/work/easeprobe/easeprobe/cmd/easeprobe/report.go:59 +0x16f fp=0xc000393fc8 sp=0xc000393ee8 pc=0xda204f
main.main.func3()
	/home/runner/work/easeprobe/easeprobe/cmd/easeprobe/main.go:159 +0x26 fp=0xc000393fe0 sp=0xc000393fc8 pc=0xda0cc6
runtime.goexit()
	/opt/hostedtoolcache/go/1.18.3/x64/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000393fe8 sp=0xc000393fe0 pc=0x46b6a1
created by main.main
	/home/runner/work/easeprobe/easeprobe/cmd/easeprobe/main.go:159 +0x5fd

However, in the current source code, probe/data.go:102 is not inside the probe.SaveDataToFile() function.

So it looks like you are running old code; could you please try the latest?

@samanhappy (Collaborator) commented

Got it, thanks. I was using Docker, and the image is not as new as the GitHub source.

@haoel (Contributor, Author) commented Jul 26, 2022

> Got it, thanks. I was using Docker, and the image is not as new as the GitHub source.

Oh, this patch hasn't been officially released yet. You can build the Docker image yourself:

make docker
