bundler being unresponsive after first userOp #181

Closed
stefanodecillis opened this issue Aug 4, 2023 · 11 comments · Fixed by #201

@stefanodecillis

Hi,

I hope this problem is only on my side!
I deployed the Docker image and it becomes unresponsive after handling one or two userOps. The log output stops without any error, even though the process keeps running.
Trying on different machines, the behavior is the same.

A typical UserOp I sent to the bundler is:

{"method":"eth_sendUserOperation","params":[{"sender":"0x5481E5F531702E8fb0bCB7c98cBdb22e814F6Acd","nonce":"0x21","initCode":"0x","callData":"0x8d44ad620000000000000000000000002b87f2390e4aef1fc961027982804650fadfaf4000000000000000000000000000000000000000000000000000000000000000a0b08d656e32fc45f4c913e44936538785704ee65d06c3a86ef0b44238cca4b1db0000000000000000000000000000000000000000000000000000000064cccf6900000000000000000000000000000000000000000000000000000000000001e00000000000000000000000000000000000000000000000000000000000000104a5efb2350000000000000000000000005481e5f531702e8fb0bcb7c98cbdb22e814f6acd0000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000200000000000000000000000007a8a552e1305a631e2e1b44ba67b5d45a1a30497000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000600000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000","callGasLimit":"0x0222c0","verificationGasLimit":"0x0249f0","preVerificationGas":"0xea60","maxFeePerGas":"0x3b9aca10","maxPriorityFeePerGas":"0x3b9aca00","paymasterAndData":"0x","signature":"0xbcaf5133b0ad85ada7295b966071b624df0f69782ac6e0729197ffdcd079220a010e04e85ef22b045c5e1b2f0eb863667af297e3c91dd5e91a7ab4710a9a67ca1b"},"0x5FF137D4b0FDCD49DcA30c7CF57E578a026d2789"],"id":46,"jsonrpc":"2.0"}

The bundler handles the userOp correctly and sends the transaction on chain. After the first one or two, the logs stop without any error and the bundler becomes unresponsive.

Did you ever experience the same?

The only changes I made to the docker file are:

# copy the binaries needed to run the uopool, the bundler RPC, and the bundler
COPY --from=builder /aa-bundler/target/release/bundler /usr/local/bin/bundler
COPY --from=builder /aa-bundler/target/release/bundler-uopool /usr/local/bin/bundler-uopool
COPY --from=builder /aa-bundler/target/release/bundler-rpc /usr/local/bin/bundler-rpc
COPY --from=builder /aa-bundler/target/release/create-wallet /usr/local/bin/create-wallet

EXPOSE 3000
COPY ./mnemonic.txt mnemonic.txt
CMD ["usr/local/bin/bundler", "--eth-client-address" ,"https://polygon-mumbai.g.alchemy.com/v2/XXXXX", "--mnemonic-file", "./mnemonic.txt", "--beneficiary", "XXXXXX" ,"--gas-factor", "600", "--min-balance", "1", "--entry-points", "0x5FF137D4b0FDCD49DcA30c7CF57E578a026d2789", "--min-stake", "1", "--min-unstake-delay", "0", "--min-priority-fee-per-gas", "0", "--max-verification-gas", "1500000", "--rpc-listen-address", "0.0.0.0:8080"]

Thanks in advance for the help!

@zsluedem
Collaborator

zsluedem commented Aug 4, 2023

@stefanodecillis Which version are you using here?

Could you please try

curl http://host:port \
  -X POST \
  -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}'

and send me the output? Remember to replace the host and port.

@zsluedem zsluedem added the T-bug Type: Something isn't working label Aug 4, 2023
@stefanodecillis
Author

Thanks for the help @zsluedem!
The call never returns. I tried both locally and with a remote host. The logs show no errors and, tracing the network activity, it looks like the bundler is still exchanging packets with Alchemy (I'm using it as my RPC node provider).

Also, I'm running the bundler in unsafe mode.

@Vid201
Member

Vid201 commented Aug 7, 2023

Hello @stefanodecillis, thanks for reporting that. Does this happen on every run or just randomly?

@stefanodecillis
Author

Hi @Vid201, 100% of the time the bundler stops answering after handling one user op. I need to restart it to process another one.

As for the curl call, that was the first time I tried it and it never worked either; it just keeps waiting for a response. My guess is that it could be a problem with the Docker image. Have you ever deployed the bundler remotely with Docker? If so, did you ever experience the same?

@zsluedem
Collaborator

zsluedem commented Aug 8, 2023

@stefanodecillis Which Docker image are you using? Could you run docker inspect on your container and share more information about the version?

@stefanodecillis
Author

stefanodecillis commented Aug 11, 2023

Sorry for the delayed response @zsluedem! Here you can see the inspection:

[
    {
        "Id": "4608d0e80f2fec1c662ef299861de2c88ccd17874b5ad7a22123b73f336a589c",
        "Created": "2023-08-11T14:05:48.293998339Z",
        "Path": "usr/local/bin/bundler",
        "Args": [
            "--eth-client-address",
            "https://polygon-mumbai.g.alchemy.com/v2/XXXX",
            "--mnemonic-file",
            "./mnemonic.txt",
            "--beneficiary",
            "0xXXX",
            "--gas-factor",
            "600",
            "--min-balance",
            "1",
            "--entry-points",
            "0x5FF137D4b0FDCD49DcA30c7CF57E578a026d2789",
            "--min-stake",
            "1",
            "--min-unstake-delay",
            "0",
            "--min-priority-fee-per-gas",
            "0",
            "--max-verification-gas",
            "1500000",
            "--rpc-listen-address",
            "0.0.0.0:8080"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 8625,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2023-08-11T14:05:49.295045412Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:671bc4e14e9ce1d392c4e9cd1bf4f3b9ad9a1fc57ca53d1d89e0e778cc1bed68",
        "ResolvConfPath": "/var/lib/docker/containers/4608d0e80f2fec1c662ef299861de2c88ccd17874b5ad7a22123b73f336a589c/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/4608d0e80f2fec1c662ef299861de2c88ccd17874b5ad7a22123b73f336a589c/hostname",
        "HostsPath": "/var/lib/docker/containers/4608d0e80f2fec1c662ef299861de2c88ccd17874b5ad7a22123b73f336a589c/hosts",
        "LogPath": "/var/lib/docker/containers/4608d0e80f2fec1c662ef299861de2c88ccd17874b5ad7a22123b73f336a589c/4608d0e80f2fec1c662ef299861de2c88ccd17874b5ad7a22123b73f336a589c-json.log",
        "Name": "/elegant_khayyam",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": null,
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "default",
            "PortBindings": {
                "8080/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": "8081"
                    }
                ]
            },
            "RestartPolicy": {
                "Name": "no",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "ConsoleSize": [
                10,
                172
            ],
            "CapAdd": null,
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": [],
            "BlkioDeviceWriteBps": [],
            "BlkioDeviceReadIOps": [],
            "BlkioDeviceWriteIOps": [],
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": null,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/df95d5a868446fa97f787518eb7703796173128c01fb9132d6167e4a9d45fd1d-init/diff:/var/lib/docker/overlay2/l5rbgjpome4doqxbtrxoy0pco/diff:/var/lib/docker/overlay2/wdtt4kp3dmjebk9hzwhzoe0ye/diff:/var/lib/docker/overlay2/dkijiybhvhphic8cnjszgo4nm/diff:/var/lib/docker/overlay2/43a6wxv8c85mcspzfbovmxgl6/diff:/var/lib/docker/overlay2/zaja3jh3a4hglifs67w6dau5x/diff:/var/lib/docker/overlay2/22a5f7835031fd6920d5e581fb65a73c478ba49af71c465cf84016733c7d7576/diff:/var/lib/docker/overlay2/13ee62e09cb8a93dc352904177511672b83fe24b4ab83814fadb4f2bf3d1d713/diff",
                "MergedDir": "/var/lib/docker/overlay2/df95d5a868446fa97f787518eb7703796173128c01fb9132d6167e4a9d45fd1d/merged",
                "UpperDir": "/var/lib/docker/overlay2/df95d5a868446fa97f787518eb7703796173128c01fb9132d6167e4a9d45fd1d/diff",
                "WorkDir": "/var/lib/docker/overlay2/df95d5a868446fa97f787518eb7703796173128c01fb9132d6167e4a9d45fd1d/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [],
        "Config": {
            "Hostname": "4608d0e80f2f",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": true,
            "AttachStderr": true,
            "ExposedPorts": {
                "3000/tcp": {},
                "8080/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "RUSTUP_HOME=/usr/local/rustup",
                "CARGO_HOME=/usr/local/cargo",
                "RUST_VERSION=1.65.0",
                "RUST_LOG=debug"
            ],
            "Cmd": [
                "usr/local/bin/bundler",
                "--eth-client-address",
                "https://polygon-mumbai.g.alchemy.com/v2/K7sz4mnuRdqMJIx-5p76Fa3iWJYCFF1W",
                "--mnemonic-file",
                "./mnemonic.txt",
                "--beneficiary",
                "0x83f1515f0a0CF1503377Cc6906c697aeE4d95Ec7",
                "--gas-factor",
                "600",
                "--min-balance",
                "1",
                "--entry-points",
                "0x5FF137D4b0FDCD49DcA30c7CF57E578a026d2789",
                "--min-stake",
                "1",
                "--min-unstake-delay",
                "0",
                "--min-priority-fee-per-gas",
                "0",
                "--max-verification-gas",
                "1500000",
                "--rpc-listen-address",
                "0.0.0.0:8080"
            ],
            "Image": "registry.digitalocean.com/knobs-registry/indid-bundler:0.6.0",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": {}
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "c08ccc8c318864fa1a558a84957b01aa900a27939d99b966b993ce74568c0253",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {
                "3000/tcp": null,
                "8080/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8081"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "8081"
                    }
                ]
            },
            "SandboxKey": "/var/run/docker/netns/c08ccc8c3188",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "85dfff515a9c3aef6da44fe891e8f8488fdf645c81d23f50dff2c383a9c3e14e",
            "Gateway": "172.17.0.1",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "172.17.0.2",
            "IPPrefixLen": 16,
            "IPv6Gateway": "",
            "MacAddress": "02:42:ac:11:00:02",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "7ced148ae4bc9ca96ee4a1038d36f5b17f577a30f9f9e2e3710a0c501cb73062",
                    "EndpointID": "85dfff515a9c3aef6da44fe891e8f8488fdf645c81d23f50dff2c383a9c3e14e",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:11:00:02",
                    "DriverOpts": null
                }
            }
        }
    }
]

I'm using the default Dockerfile, built via the Makefile. The only change I made was to the command that runs the binaries with their arguments. I will also add that, when it gets stuck, a GET request on the root path (which should return the message "Used HTTP Method is not allowed. POST or OPTIONS is required") hangs until it times out.

The image I'm using: rust:1.65-slim

@zsluedem
Collaborator

Found the reason the bundler becomes unresponsive: DashMap is causing a deadlock.
https://github.com/Vid201/silius/blob/33dc6e686fd2fff4b26e0c5ceadce20e3cf5e739/crates/grpc/src/uopool.rs#L80-L81

https://github.com/Vid201/silius/blob/33dc6e686fd2fff4b26e0c5ceadce20e3cf5e739/crates/grpc/src/uopool.rs#L94-L96

In both of these places a DashMap reference is held across an await point, which causes the deadlock. A fix is coming in a PR now.

Related resources:
xacrimon/dashmap#79
xacrimon/dashmap#243
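
For readers unfamiliar with this class of bug, here is a minimal sketch of the pattern. It is not the silius code and not the fix from the PR. It assumes a plain std RwLock around a HashMap as a stand-in for one DashMap shard (a DashMap reference holds a lock on the shard it points into), and the names Shard, writer_blocked_while_guard_held, and writer_blocked_after_clone are made up for illustration. try_write is used so the sketch reports contention instead of actually hanging.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for one shard of a concurrent map:
/// a plain HashMap behind an RwLock. Illustrative only.
type Shard = RwLock<HashMap<u64, String>>;

/// Problematic pattern: the read guard is still alive while the
/// "slow work" happens, so no writer can take the lock.
/// Returns true if a writer would have been blocked.
fn writer_blocked_while_guard_held(shard: &Shard, key: u64) -> bool {
    let guard = shard.read().unwrap();
    let _entry = guard.get(&key);
    // Imagine an `.await` here: the guard stays alive across it.
    // try_write does not block; it fails because `guard` holds the lock.
    shard.try_write().is_err()
}

/// Safer pattern: clone the value out, let the guard drop,
/// and only then do the slow work.
/// Returns true if a writer would have been blocked.
fn writer_blocked_after_clone(shard: &Shard, key: u64) -> bool {
    let value: Option<String> = {
        let guard = shard.read().unwrap();
        guard.get(&key).cloned()
    }; // the guard is dropped at the end of this block
    // Imagine the `.await` here: only the owned clone is alive.
    let _ = value;
    shard.try_write().is_err()
}
```

Called on a shard containing the key, the first function returns true (a writer is locked out) and the second returns false. In async code the first shape is worse than it looks here, because the task holding the guard may not be polled again until the blocked writer makes progress.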

@zsluedem
Collaborator

@stefanodecillis A fix has been merged. Would you like to give it a try? Feel free to reopen the issue if you run into any problems.

@stefanodecillis
Author

@zsluedem thanks for your help! It actually works :)
I'm still reopening this issue, though, because the bundler now gets stuck and then unstuck. Let me explain.

If you send 3-4 user operations in a short time, it will get stuck and you will see the same behavior described in this issue. However, thanks to your fix, it will get unstuck by itself within a few seconds and you can continue sending user operations.
During that window, neither CPU nor memory is maxed out.

Given that, I suspect the core problem still exists and could cause other issues. For instance, while it is unresponsive, you cannot send user operations to the mempool since the entire stack is frozen. In this case, even if there are many bundlers connected to the same altpool, we will just shift the problem.

Now this is just my guess but I should actually read the code: could it be possible that, while bundling, it is waiting on the block creation without spawning any task with tokio? I'm wondering if there is a situation that led to the problem of deadlock you solved since in both cases the rpc (http) stack level was blocked

@zsluedem
Collaborator

zsluedem commented Sep 5, 2023

@stefanodecillis Thanks for your report! You are really helping here!
I have several questions about your comments.

> However, thanks to your fix, it will get unstuck by itself within a few seconds and you can continue sending user operations.

Is the stall clearly noticeable between user operations? Something like 1, 2, or more seconds?

> In this case, even if there are many bundlers connected to the same altpool, we will just shift the problem.

Are you running the silius components separately? That is, running silius-uopools, silius-rpc, and silius independently and having them communicate with each other over the gRPC endpoint?

> Now this is just my guess but I should actually read the code: could it be possible that, while bundling, it is waiting on the block creation without spawning any task with tokio? I'm wondering if there is a situation that led to the problem of deadlock you solved since in both cases the rpc (http) stack level was blocked

Based on your description above, I don't think bundling is the cause. I think it is more likely that the mempool has a data race when it receives several user operations at the same time. I will take some time to dig in and find the real problem.

@zsluedem zsluedem reopened this Sep 5, 2023
@stefanodecillis
Author

@zsluedem I see your point!
I will just say that I'm using the silius binary. This is my Docker command:

CMD ["usr/local/bin/silius", "--eth-client-address" ,"https://polygon-mumbai.g.alchemy.com/v2/XXXXXXXX", "--mnemonic-file", "./mnemonic.txt", "--beneficiary", "0xXXXXXXXXXXX" , "--min-balance", "1", "--entry-points", "0x5FF137D4b0FDCD49DcA30c7CF57E578a026d2789", "--min-stake", "1", "--min-unstake-delay", "0", "--min-priority-fee-per-gas", "0", "--max-verification-gas", "1500000", "--rpc-listen-address", "0.0.0.0:8080", "--http", "--ws"]

Besides this change, I'm using the Dockerfile in the root folder (I also exposed the port for RPC).

To give you the real scenario: a colleague sent 3-4 transactions in a short timeframe and then the bundler became unresponsive. When he told me, I had time to watch the logs and ping it myself, and it was not replying.
After a few minutes it became responsive again.

That's why I think some job or handler running on the same thread that serves the RPC methods is getting stuck.
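
To make the hypothesis concrete, here is a small sketch of the effect I have in mind. It is only an illustration of the guess, not the silius code, and it has not been confirmed as the cause. It uses std threads and channels instead of tokio to stay self-contained, and Request, serve, and ping_answered_while_busy are hypothetical names. A single-threaded request loop either runs a slow job inline or hands it to another thread; a ping queued behind the slow job only gets answered promptly in the second case.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical request types, for illustration only.
enum Request {
    /// Work that waits on something external (say, a new block).
    /// It finishes once a value arrives on the receiver.
    Slow(Receiver<()>),
    /// A cheap call such as web3_clientVersion.
    Ping(Sender<&'static str>),
}

/// Single-threaded request loop. With `offload == false` slow jobs run
/// inline and block everything queued behind them; with `offload == true`
/// each slow job is moved to its own thread.
fn serve(requests: Receiver<Request>, offload: bool) {
    for request in requests {
        match request {
            Request::Slow(done) => {
                if offload {
                    thread::spawn(move || {
                        let _ = done.recv();
                    });
                } else {
                    let _ = done.recv();
                }
            }
            Request::Ping(reply) => {
                let _ = reply.send("pong");
            }
        }
    }
}

/// Queues one slow job followed by a ping, and reports whether the ping
/// was answered while the slow job was still pending.
fn ping_answered_while_busy(offload: bool) -> bool {
    let (req_tx, req_rx) = channel();
    let server = thread::spawn(move || serve(req_rx, offload));

    let (release_tx, release_rx) = channel();
    let (pong_tx, pong_rx) = channel();
    req_tx.send(Request::Slow(release_rx)).unwrap();
    req_tx.send(Request::Ping(pong_tx)).unwrap();

    // The slow job cannot finish before `release_tx` is used below,
    // so an inline server can never answer within this window.
    let answered = pong_rx.recv_timeout(Duration::from_secs(1)).is_ok();

    release_tx.send(()).unwrap(); // let the slow job finish
    drop(req_tx); // close the queue so the loop exits
    server.join().unwrap();
    answered
}
```

ping_answered_while_busy(false) returns false (the ping waits behind the slow job), while ping_answered_while_busy(true) returns true. In a tokio-based server the analogous move would be to spawn the long-running work as its own task rather than awaiting it inside the request handler.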

This was referenced Oct 11, 2023
@Vid201 Vid201 closed this as completed Dec 11, 2023