
need help / tcmu-runner performance very slow #668

Open
lightmans2 opened this issue Aug 30, 2021 · 30 comments

@lightmans2

Hello everyone,

we have been testing Ceph and iSCSI for a few weeks now ...
I have a big iSCSI/VMware performance problem and need some help with tcmu-runner.

If I use a Linux VM and map an image via krbd, I get 350-550 MB/s.
If I use iSCSI with tcmu-runner, I get a maximum of 50-100 MB/s.

We use Ubuntu 20.04 as OS and Ceph v16.2.5 with kernel 5.4.0-81-generic,
tcmu-runner v1.5.2

Some info about the hardware and setup:

  • we have 3 physical OSD nodes; each node has 20 HDDs (5.5 TB), 4 SSDs (1 TB) and 1 NVMe (1 TB), built into LVM/BlueStore drive groups for high IO (block / block.db / block.wal etc.)
  • another 3 physical nodes as monitors and two physical nodes as iSCSI gateways
  • everything is connected to a 10 Gbit network with bond0 in balance-rr mode
  • all physical nodes have at least 16-32 cores and 64-128 GB RAM

At the moment we have connected a VMware ESXi host (HP DL385 Gen10, ESXi 6.7) to this storage for testing LUNs and performance.

Can somebody tell me what else I can do to tweak tcmu-runner to get the same performance as krbd?

I already experimented with:

1.)

/iscsi-target...ceph.iscsi-gw> info
Target Iqn     .. iqn.2003-01.com.ceph.iscsi-gw
Auth
- mutual_password ..
- mutual_password_encryption_enabled .. False
- mutual_username ..
- password ..
- password_encryption_enabled .. False
- username ..
Control Values
- cmdsn_depth .. 512 (override)
- dataout_timeout .. 20
- first_burst_length .. 262144
- immediate_data .. yes
- initial_r2t .. yes
- max_burst_length .. 524288
- max_outstanding_r2t .. 1
- max_recv_data_segment_length .. 262144
- max_xmit_data_segment_length .. 262144
- nopin_response_timeout .. 5
- nopin_timeout .. 5

2.)

hw_max_sectors 1024, 2048 or 4096

max_data_area_mb 8 or 32 or 128

but I gained only about 10% more speed :-(

Thanks a lot in advance for any hints and tips.

@lxbsz
Collaborator

lxbsz commented Aug 31, 2021

There is an old issue tracking a similar problem, #543. Please check whether it helps.

@lxbsz
Collaborator

lxbsz commented Aug 31, 2021

And also this one #359.

@lightmans2
Author

Hello lxbsz,

yes, thank you. I already read those posts and tried those settings, like the command depth on the client host and cmdsn_depth on the target.

We use VMware ESXi and I can see that the max SCSI commands are set to 128 by default.
I already set the target to 128 but the speed got worse,
see screenshots:
[screenshots: vSphere storage adapter settings on esx-088-3-1]

I am also trying to find the optimal values for max_data_area_mb and hw_max_sectors,
and image object sizes from 4 MB down to 1 MB,
but the speed is not getting better,
see screenshots:

[screenshots: Ceph dashboard performance graphs]

Is there something else I could do?

@lxbsz
Collaborator

lxbsz commented Aug 31, 2021

Can somebody tell me what else I can do to tweak tcmu-runner to get the same performance as krbd?

In theory tcmu-runner will always be slower than krbd, because its IO path is much longer than krbd's.

BTW, could you check the tcmu-runner logs for frequent lock switching messages for the same image? For example when using the Round Robin path selection policy; someone also reported that even with the Most Recently Used policy they could see ESXi switching paths frequently. Frequent active path switching hurts performance a lot as well.

@lxbsz
Collaborator

lxbsz commented Aug 31, 2021

BTW, what's your test script? Are you using fio? If so, what are your parameters? Have you ever compared tcmu-runner vs krbd with:
4k read, 4k write, 4k random read, 4k random write, etc.?
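For example, a like-for-like comparison could be done by pointing the same fio job at the iSCSI LUN and at a krbd mapping of the same image; the device paths and image name below are only placeholders, so treat this as a rough sketch:

# map the image via krbd for the baseline run (pool/image names are examples)
$ rbd map rbd/disk_1        # shows up as e.g. /dev/rbd0

# WARNING: randwrite is destructive, only run it against a scratch LUN/image
$ fio --name=iscsi-test --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting
$ fio --name=krbd-test --filename=/dev/rbd0 --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting

Repeating the two runs with --rw=read/write/randread and --bs=4k/64k/1M would give the comparison asked for above.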

@lightmans2
Author

Hi lxbsz,

In theory tcmu-runner will always be slower than krbd, because its IO path is much longer than krbd's.

Yes, I know that... but we get only about 50 MB/s read and 40 MB/s write...
krbd does an average of at least 450 MB/s,
and recovery after one OSD node goes down and comes back up runs at nearly 500 to 600 MB/s.
So I guess iSCSI should be much faster than 50 MB/s read and 40 MB/s write,
or not?

BTW, could you check the tcmu-runner logs for frequent lock switching messages for the same image?

I see this sometimes, sporadically... don't know why:

2021-08-31 15:58:58.850 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_3: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 15:58:58.851 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_3: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 15:58:58.901 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 15:58:58.902 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 16:03:58.843 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_3: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 16:03:58.845 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_3: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 16:03:58.871 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-08-31 16:03:58.873 1105 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.

For example when using the Round Robin path selection policy; someone also reported that even with the Most Recently Used policy they could see ESXi switching paths frequently. Frequent active path switching hurts performance a lot as well.

I already tested that and I know... but it's the same problem if I set "Most Recently Used".

BTW, what's your test script? Are you using fio? If so, what are your parameters? Have you ever compared tcmu-runner vs krbd with:
4k read, 4k write, 4k random read, 4k random write, etc.?

I move VMs or images from fast Fibre Channel storage to the Ceph LUN and look at the logs and the Ceph dashboard.
I calculate the throughput myself or see it in the Ceph dashboard:
[screenshot: Ceph dashboard]

@deng-ruixuan

@lxbsz @lightmans2
My test results are as follows (3 NVMe OSDs):
test case 1:
fio -filename=/dev/sdam -direct=1 -iodepth=32 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=10g -numjobs=1 -runtime=300 -time_based -group_reporting -name=test2
result: IOPS is 22.1k
test case 2:
fio -filename=/dev/sdam -direct=1 -iodepth=32 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=10g -numjobs=4 -runtime=300 -time_based -group_reporting -name=test2
result: IOPS is 3208

So: with multi-threaded testing, tcmu has performance problems. I now know this is not caused by the rbd lock; I guess it may be caused by locking inside the tcmu process itself.

@lightmans2
You can test your Ceph iSCSI setup with test case 1 and test case 2 to confirm whether the problem is caused by multiple jobs.

@deng-ruixuan

I analyzed the tcmu code: when tcmu gets messages from uio there is no lock, so it does not distinguish whether the workload is multi-threaded or not. I guess the performance difference between tcmu and krbd is about 10%.
The problem I am currently experiencing is that the performance of ceph-iscsi under multi-threading is very poor. The cause lies in iSCSI, not tcmu; I used krbd as the backstore to test iSCSI and this has been confirmed.

@lxbsz
Collaborator

lxbsz commented Sep 2, 2021

I analyzed the tcmu code: when tcmu gets messages from uio there is no lock, so it does not distinguish whether the workload is multi-threaded or not. I guess the performance difference between tcmu and krbd is about 10%.
The problem I am currently experiencing is that the performance of ceph-iscsi under multi-threading is very poor. The cause lies in iSCSI, not tcmu; I used krbd as the backstore to test iSCSI and this has been confirmed.

Yeah, TCMU and tcmu-runner don't know and also don't care about that: TCMU just queues the SCSI cmds in its buffer and tcmu-runner handles them one by one. There are no locks in tcmu-runner when handling this.

@lightmans2
Author

lightmans2 commented Sep 2, 2021

Hello lxbsz,

test case 1:
fio -filename=/dev/sdam -direct=1 -iodepth=32 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=10g -numjobs=1 -runtime=300 -time_based -group_reporting -name=test2

test case 2:
fio -filename=/dev/sdam -direct=1 -iodepth=32 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=10g -numjobs=4 -runtime=300 -time_based -group_reporting -name=test2

Yes, I will test this later or tonight.

Last night I found a solution, or at least the right settings for performance, and it was a long night :-)

[screenshot]

I can live with that speed.

iSCSI now runs at 150 MB/s and up to 250 MB/s
with settings:
cmdsn_depth 512
hw_max_sectors 8192
max_data_area_mb 32
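For reference, these attributes are applied through gwcli; the syntax below is only a sketch based on the ceph-iscsi documentation, the image name is an example, so double-check it with gwcli's built-in help before running it:

# per-image backstore attributes (rbd/disk_1 is an example image)
/> cd /disks
/disks> reconfigure rbd/disk_1 max_data_area_mb 32
/disks> reconfigure rbd/disk_1 hw_max_sectors 8192
# cmdsn_depth is a target-level control value and is reconfigured on the
# target node in a similar way (exact path/syntax assumed, see gwcli help)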

And I found out that the performance drops hard if one iSCSI target, e.g. iqn.2003-01.com.ceph.iscsi-gw, has two or more LUNs; then the performance goes down to max 50 MB/s.

Now I have a different problem... if I test the speed while some VMs are running on this LUN, the IOPS goes up to >3k and the speed to >150 MB/s... then one of the gateways gets stuck and the process freezes and crashes.

It looks like the gateway that maps the RBD image gets stuck. Unfortunately I couldn't find a usable log last night;
this makes the gateways unstable :-(

In which log could I see it?

On the machine console I found this:

[screenshots: console output and iLO remote console of cd88-ceph-rgw-01]

@lxbsz
Collaborator

lxbsz commented Sep 2, 2021

[...]

What are your tcmu-runner/ceph-iscsi versions?

Could you upload the tcmu-runner.log and rbd-target-api.log files? And at the same time, can you see any crash errors in the journal log on that gateway node?

There is one known crash issue, please see #667. In that PR I have fixed several other bugs that could also cause crashes.

If your test causes frequent active path switching between the gateways, the above issue can be hit easily.

@lightmans2
Author

Hi,

tcmu-runner v1.5.2

root@cd133-ceph-rgw-01:/var/log/rbd-target-api# apt info ceph-iscsi
Package: ceph-iscsi
Version: 3.5-1
Status: install ok installed
Priority: optional
Section: python
Maintainer: Freexian Packaging Team <[email protected]>
Installed-Size: 575 kB
Pre-Depends: init-system-helpers (>= 1.54~)
Depends: python3:any, tcmu-runner (>= 1.4.0), python3-configshell-fb, python3-cryptography, python3-flask, python3-netifaces, python3-openssl, python3-rados, python3-rbd, python3-rtslib-fb, python3-distutils, python3-rpm, python3-requests
Homepage: https://github.com/ceph/ceph-iscsi
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: common logic and CLI tools for creating and managing LIO gateways for Ceph
 It includes the rbd-target-api daemon which is responsible for
 restoring the state of LIO following a gateway reboot/outage and
 exporting a REST API to configure the system using tools like
 gwcli. It replaces the existing 'target' service.
 .
 There is also a second daemon rbd-target-gw which exports a REST API
 to gather statistics.
 .
 It also includes the CLI tool gwcli which can be used to configure
 and manage the Ceph iSCSI gateway, which replaces the existing
 targetcli CLI tool. This CLI tool utilizes the rbd-target-api server
 daemon to configure multiple gateways concurrently.

Here are the logs.
The crashes were last night, between 01/09 and 02/09, from 1am to 3am.

cd-88tcmu-runner.log
cd-133rbd-target-api.log
cd-133rbd-target-api.log.1.log
cd-133tcmu-runner.log
cd-88rbd-target-api.log.1.log
cd-88rbd-target-api.log.2.log

@lxbsz
Collaborator

lxbsz commented Sep 2, 2021

tcmu-runner v1.5.2, ceph-iscsi 3.5-1

[...]

Could you also upload the journal logs from around the crash?

$ journalctl -r > journal.log

From the above logs I cannot tell which service crashed.
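If the full journal is too big to attach, narrowing it to the crash window and the involved units might help; the timestamps below are examples:

$ journalctl -r --since "2021-09-02 01:00" --until "2021-09-02 03:00" > journal-crash.log
$ journalctl -u tcmu-runner -u rbd-target-api --since "2021-09-02 01:00" > journal-units.log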

@lxbsz
Collaborator

lxbsz commented Sep 2, 2021

From one of the tcmu-runner.log files:

2021-09-02 01:20:29.923 1236 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.
2021-09-02 01:20:32.683 1236 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-09-02 01:20:32.686 1236 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-09-02 01:22:08.320 1236 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-09-02 01:22:08.322 1236 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-09-02 01:43:56.224 1231 [INFO] load_our_module:547: Inserted module 'target_core_user'

This should be the same issue which has been fixed by b48eeeb in #667.

Is it possible to test that PR in your setup? Thanks.

@lightmans2
Author

Yes, we can test it.
The only problem is that I can only do it in the afternoons, because the storage is meanwhile in production :-) our other datastores are full and we urgently need additional space fast.

If you can tell me or explain how to upgrade or patch, then no problem,
and of course I need a way back to the prior version as a fallback.

@lightmans2
Author

[...]

root@kvm:~# fio -filename=/dev/rdb0 -direct=0 -iodepth=32 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=10g -numjobs=1 -runtime=300 -time_based -group_reporting -name=test2
test2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 thread
fio: io_u error on file /dev/rdb0: No space left on device: write offset=7948804096, buflen=4096
fio: pid=2597, err=28/file:io_u.c:1787, func=io_u error, error=No space left on device

test2: (groupid=0, jobs=1): err=28 (file:io_u.c:1787, func=io_u error, error=No space left on device): pid=2597: Thu Sep  2 14:38:32 2021
  write: IOPS=33.0k, BW=4000KiB/s (4096kB/s)(4096B/1msec); 0 zone resets
    slat (nsec): min=1373, max=34785, avg=3668.42, stdev=5975.05
    clat (nsec): min=100228, max=100228, avg=100228.00, stdev= 0.00
     lat (nsec): min=136286, max=136286, avg=136286.00, stdev= 0.00
    clat percentiles (nsec):
     |  1.00th=[99840],  5.00th=[99840], 10.00th=[99840], 20.00th=[99840],
     | 30.00th=[99840], 40.00th=[99840], 50.00th=[99840], 60.00th=[99840],
     | 70.00th=[99840], 80.00th=[99840], 90.00th=[99840], 95.00th=[99840],
     | 99.00th=[99840], 99.50th=[99840], 99.90th=[99840], 99.95th=[99840],
     | 99.99th=[99840]
  lat (usec)   : 250=3.03%
  cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=1
  IO depths    : 1=3.0%, 2=6.1%, 4=12.1%, 8=24.2%, 16=48.5%, 32=6.1%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,33,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=4000KiB/s (4096kB/s), 4000KiB/s-4000KiB/s (4096kB/s-4096kB/s), io=4096B (4096B), run=1-1msec

root@kvm:~# fio -filename=/dev/rdb0 -direct=0 -iodepth=32 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=10g -numjobs=4 -runtime=300 -time_based -group_reporting -name=test2
test2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.16
Starting 4 threads
fio: io_u error on file /dev/rdb0: No space left on device: write offset=7005753344, buflen=4096
fio: pid=2625, err=28/file:io_u.c:1787, func=io_u error, error=No space left on device
fio: io_u error on file /dev/rdb0: No space left on device: write offset=6449979392, buflen=4096
fio: pid=2623, err=28/file:io_u.c:1787, func=io_u error, error=No space left on device
fio: io_u error on file /dev/rdb0: No space left on device: write offset=7948804096, buflen=4096
fio: io_u error on file /dev/rdb0: No space left on device: write offset=8794103808, buflen=4096
fio: pid=2622, err=28/file:io_u.c:1787, func=io_u error, error=No space left on device
fio: pid=2624, err=28/file:io_u.c:1787, func=io_u error, error=No space left on device

test2: (groupid=0, jobs=4): err=28 (file:io_u.c:1787, func=io_u error, error=No space left on device): pid=2622: Thu Sep  2 14:38:51 2021
  write: IOPS=132k, BW=15.6MiB/s (16.4MB/s)(16.0KiB/1msec); 0 zone resets
    slat (nsec): min=1343, max=130153, avg=14340.23, stdev=14503.69
    clat (usec): min=442, max=450, avg=446.52, stdev= 4.08
     lat (usec): min=455, max=501, avg=479.62, stdev=19.70
    clat percentiles (usec):
     |  1.00th=[  445],  5.00th=[  445], 10.00th=[  445], 20.00th=[  445],
     | 30.00th=[  445], 40.00th=[  445], 50.00th=[  445], 60.00th=[  449],
     | 70.00th=[  449], 80.00th=[  449], 90.00th=[  449], 95.00th=[  449],
     | 99.00th=[  449], 99.50th=[  449], 99.90th=[  449], 99.95th=[  449],
     | 99.99th=[  449]
  lat (usec)   : 500=3.03%
  cpu          : usr=0.00%, sys=0.00%, ctx=12, majf=0, minf=4
  IO depths    : 1=3.0%, 2=6.1%, 4=12.1%, 8=24.2%, 16=48.5%, 32=6.1%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,132,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=15.6MiB/s (16.4MB/s), 15.6MiB/s-15.6MiB/s (16.4MB/s-16.4MB/s), io=16.0KiB (16.4kB), run=1-1msec

@lxbsz
Collaborator

lxbsz commented Sep 3, 2021

[...]

Are you using cephadm/containers for the ceph-iscsi/tcmu-runner service? If so, I am afraid you need to update the package or install tcmu-runner from source to override it on the node.

@lightmans2
Author

Hi,

it's a physical server with Ubuntu 20.04 and ceph-iscsi installed as a package from the Ceph Debian repo, so no containers...
I read your documentation about building from the git source... but it's for RPM systems...
I need help then to compile the packages as .deb, or can you build it for me and put the .deb somewhere for download?

@lxbsz
Collaborator

lxbsz commented Sep 3, 2021

Currently I don't have a .deb setup to build the package. I usually build from source:

$ cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .
$ make && make install

You can try this.
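If a removable package is preferred over a bare make install (e.g. to be able to roll back to the distro version), one option on Ubuntu may be checkinstall, which wraps the install step into a .deb; this is only a sketch and the flags should be verified locally:

$ cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .
$ make
# builds and installs a tcmu-runner 1.5.4 .deb instead of a plain "make install"
$ checkinstall --pkgname=tcmu-runner --pkgversion=1.5.4 make install
# roll back later with: dpkg -r tcmu-runner, then reinstall the distro package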

@lightmans2
Author

I get this error.
What do I need?

root@cd88-ceph-rgw-01:~/test/tcmu-runner# cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
TCMALLOC_LIB
    linked by target "handler_file_zbc" in directory /root/test/tcmu-runner
    linked by target "tcmu_static" in directory /root/test/tcmu-runner
    linked by target "tcmu-runner" in directory /root/test/tcmu-runner
    linked by target "tcmu-synthesizer" in directory /root/test/tcmu-runner
    linked by target "handler_file_optical" in directory /root/test/tcmu-runner
    linked by target "handler_rbd" in directory /root/test/tcmu-runner
    linked by target "tcmu" in directory /root/test/tcmu-runner
    linked by target "handler_qcow" in directory /root/test/tcmu-runner

-- Configuring incomplete, errors occurred!
See also "/root/test/tcmu-runner/CMakeFiles/CMakeOutput.log".
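The NOTFOUND variable is the tcmalloc library from gperftools; on Ubuntu 20.04 the development package that should provide it is libgoogle-perftools-dev (the package list below is an assumption based on typical tcmu-runner build requirements, so verify it against the project README):

# tcmalloc development files, then re-run cmake
$ apt install libgoogle-perftools-dev

# other -dev packages tcmu-runner typically needs on Debian/Ubuntu
$ apt install cmake make gcc libnl-3-dev libnl-genl-3-dev libglib2.0-dev libkmod-dev librbd-dev zlib1g-dev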

@lightmans2
Author

Okay, I compiled and installed it; I found the missing -dev package...
Version 1.5.4 is installed.
Is the bugfix in this version?

root@cd88-ceph-rgw-01:~/test/tcmu-runner# cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .
-- Configuring done
-- Generating done
-- Build files have been written to: /root/test/tcmu-runner
root@cd88-ceph-rgw-01:~/test/tcmu-runner# make
Scanning dependencies of target handler_file_zbc
[  1%] Building C object CMakeFiles/handler_file_zbc.dir/scsi.c.o
[  3%] Building C object CMakeFiles/handler_file_zbc.dir/file_zbc.c.o
[  5%] Linking C shared library handler_file_zbc.so
[  5%] Built target handler_file_zbc
[  7%] Generating tcmuhandler-generated.c, tcmuhandler-generated.h
Scanning dependencies of target tcmu_static
[  9%] Building C object CMakeFiles/tcmu_static.dir/strlcpy.c.o
[ 11%] Building C object CMakeFiles/tcmu_static.dir/configfs.c.o
[ 13%] Building C object CMakeFiles/tcmu_static.dir/api.c.o
[ 15%] Building C object CMakeFiles/tcmu_static.dir/libtcmu.c.o
[ 17%] Building C object CMakeFiles/tcmu_static.dir/libtcmu-register.c.o
[ 19%] Building C object CMakeFiles/tcmu_static.dir/tcmuhandler-generated.c.o
[ 21%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_log.c.o
[ 23%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_config.c.o
[ 25%] Building C object CMakeFiles/tcmu_static.dir/libtcmu_time.c.o
[ 27%] Linking C static library libtcmu_static.a
[ 27%] Built target tcmu_static
Scanning dependencies of target tcmu
[ 29%] Building C object CMakeFiles/tcmu.dir/strlcpy.c.o
[ 31%] Building C object CMakeFiles/tcmu.dir/configfs.c.o
[ 33%] Building C object CMakeFiles/tcmu.dir/api.c.o
[ 35%] Building C object CMakeFiles/tcmu.dir/libtcmu.c.o
[ 37%] Building C object CMakeFiles/tcmu.dir/libtcmu-register.c.o
[ 39%] Building C object CMakeFiles/tcmu.dir/tcmuhandler-generated.c.o
[ 41%] Building C object CMakeFiles/tcmu.dir/libtcmu_log.c.o
[ 43%] Building C object CMakeFiles/tcmu.dir/libtcmu_config.c.o
[ 45%] Building C object CMakeFiles/tcmu.dir/libtcmu_time.c.o
[ 47%] Linking C shared library libtcmu.so
[ 49%] Built target tcmu
Scanning dependencies of target tcmu-runner
[ 50%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_work.c.o
[ 52%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_cmd_handler.c.o
[ 54%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_aio.c.o
[ 56%] Building C object CMakeFiles/tcmu-runner.dir/tcmur_device.c.o
[ 58%] Building C object CMakeFiles/tcmu-runner.dir/target.c.o
[ 60%] Building C object CMakeFiles/tcmu-runner.dir/alua.c.o
[ 62%] Building C object CMakeFiles/tcmu-runner.dir/scsi.c.o
[ 64%] Building C object CMakeFiles/tcmu-runner.dir/main.c.o
[ 66%] Building C object CMakeFiles/tcmu-runner.dir/tcmuhandler-generated.c.o
[ 68%] Linking C executable tcmu-runner
[ 70%] Built target tcmu-runner
Scanning dependencies of target tcmu-synthesizer
[ 72%] Building C object CMakeFiles/tcmu-synthesizer.dir/scsi.c.o
[ 74%] Building C object CMakeFiles/tcmu-synthesizer.dir/tcmu-synthesizer.c.o
[ 76%] Linking C executable tcmu-synthesizer
[ 76%] Built target tcmu-synthesizer
Scanning dependencies of target handler_file
[ 78%] Building C object CMakeFiles/handler_file.dir/file_example.c.o
[ 80%] Linking C shared library handler_file.so
[ 80%] Built target handler_file
Scanning dependencies of target consumer
[ 82%] Building C object CMakeFiles/consumer.dir/scsi.c.o
[ 84%] Building C object CMakeFiles/consumer.dir/consumer.c.o
[ 86%] Linking C executable consumer
[ 86%] Built target consumer
Scanning dependencies of target handler_file_optical
[ 88%] Building C object CMakeFiles/handler_file_optical.dir/scsi.c.o
[ 90%] Building C object CMakeFiles/handler_file_optical.dir/file_optical.c.o
[ 92%] Linking C shared library handler_file_optical.so
[ 92%] Built target handler_file_optical
Scanning dependencies of target handler_rbd
[ 94%] Building C object CMakeFiles/handler_rbd.dir/rbd.c.o
[ 96%] Linking C shared library handler_rbd.so
[ 96%] Built target handler_rbd
Scanning dependencies of target handler_qcow
[ 98%] Building C object CMakeFiles/handler_qcow.dir/qcow.c.o
[100%] Linking C shared library handler_qcow.so
[100%] Built target handler_qcow


root@cd88-ceph-rgw-01:~/test/tcmu-runner# ./tcmu-runner --version
tcmu-runner 1.5.4
root@cd88-ceph-rgw-01:~/test/tcmu-runner# make install
[  5%] Built target handler_file_zbc
[ 27%] Built target tcmu_static
[ 49%] Built target tcmu
[ 70%] Built target tcmu-runner
[ 76%] Built target tcmu-synthesizer
[ 80%] Built target handler_file
[ 86%] Built target consumer
[ 92%] Built target handler_file_optical
[ 96%] Built target handler_rbd
[100%] Built target handler_qcow
Install the project...
-- Install configuration: ""
-- Installing: /usr/lib/x86_64-linux-gnu/libtcmu.so.2.2
-- Up-to-date: /usr/lib/x86_64-linux-gnu/libtcmu.so.2
-- Installing: /usr/lib/x86_64-linux-gnu/libtcmu.so
-- Installing: /usr/bin/tcmu-runner
-- Set runtime path of "/usr/bin/tcmu-runner" to ""
-- Installing: /usr/lib/x86_64-linux-gnu/tcmu-runner/handler_file_optical.so
-- Installing: /usr/lib/x86_64-linux-gnu/tcmu-runner/handler_file_zbc.so
-- Installing: /usr/lib/x86_64-linux-gnu/tcmu-runner/handler_rbd.so
-- Installing: /usr/lib/x86_64-linux-gnu/tcmu-runner/handler_qcow.so
-- Installing: /etc/tcmu//tcmu.conf.old
-- Installing: /etc/tcmu//tcmu.conf
-- Installing: /etc/logrotate.d/tcmu-runner.bak/tcmu-runner
-- Installing: /etc/logrotate.d/tcmu-runner
-- Installing: /usr/share/dbus-1/system-services/org.kernel.TCMUService1.service
-- Installing: /etc/dbus-1/system.d/tcmu-runner.conf
-- Installing: /usr/lib/systemd/system/tcmu-runner.service
-- Installing: /usr/share/man/man8/tcmu-runner.8


root@cd88-ceph-rgw-01:~# systemctl daemon-reload
root@cd88-ceph-rgw-01:~# systemctl restart tcmu-runner.service
root@cd88-ceph-rgw-01:~# systemctl status tcmu-runner.service
● tcmu-runner.service - LIO Userspace-passthrough daemon
     Loaded: loaded (/lib/systemd/system/tcmu-runner.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-09-03 12:49:42 CEST; 6s ago
       Docs: man:tcmu-runner(8)
   Main PID: 12172 (tcmu-runner)
      Tasks: 22 (limit: 256201)
     Memory: 28.3M
     CGroup: /system.slice/tcmu-runner.service
             └─12172 /usr/bin/tcmu-runner

@lightmans2
Author

Looks better now.

iSCSI gateway 1:

root@cd88-ceph-rgw-01:~# tail -f /var/log/tcmu-runner.log
2021-09-03 13:01:02.722 12172 [WARN] tcmu_rbd_lock:947 rbd/rbd.disk_1: Acquired exclusive lock.
2021-09-03 13:01:02.722 12172 [INFO] tcmu_acquire_dev_lock:489 rbd/rbd.disk_1: Write lock acquisition successful
2021-09-03 13:01:06.868 12172 [WARN] tcmu_notify_lock_lost:271 rbd/rbd.disk_1: Async lock drop. Old state 5
2021-09-03 13:01:06.882 12172 [INFO] alua_implicit_transition:581 rbd/rbd.disk_1: Starting read lock acquisition operation.
2021-09-03 13:01:06.965 12172 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:340 rbd/rbd.disk_1: removing addrs: {10.50.50.30:0/1398731605}
2021-09-03 13:01:07.919 12172 [INFO] tcmu_rbd_open:1162 rbd/rbd.disk_1: address: {10.50.50.30:0/583028032}
2021-09-03 13:01:07.919 12172 [INFO] tcmu_acquire_dev_lock:486 rbd/rbd.disk_1: Read lock acquisition successful
2021-09-03 13:01:07.923 12172 [INFO] alua_implicit_transition:592 rbd/rbd.disk_1: Starting write lock acquisition operation.
2021-09-03 13:01:09.073 12172 [WARN] tcmu_rbd_lock:947 rbd/rbd.disk_1: Acquired exclusive lock.
2021-09-03 13:01:09.073 12172 [INFO] tcmu_acquire_dev_lock:489 rbd/rbd.disk_1: Write lock acquisition successful

iSCSI gateway 2:

root@cd133-ceph-rgw-01:~/test/tcmu-runner# tail -f /var/log/tcmu-runner.log
2021-09-03 13:01:05.854 1230 [INFO] alua_implicit_transition:570 rbd/rbd.disk_1: Starting lock acquisition operation.
2021-09-03 13:01:06.941 1230 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.
2021-09-03 13:01:24.890 1230 [ERROR] tcmu_rbd_has_lock:515 rbd/rbd.disk_1: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
2021-09-03 13:01:51.297 1230 [CRIT] main:1390: Exiting...
2021-09-03 13:01:51.363 830507 [CRIT] main:1302: Starting...
2021-09-03 13:01:51.363 830507 [INFO] tcmur_register_handler:92: Handler fbo is registered
2021-09-03 13:01:51.363 830507 [INFO] tcmur_register_handler:92: Handler zbc is registered
2021-09-03 13:01:51.364 830507 [INFO] tcmur_register_handler:92: Handler qcow is registered
2021-09-03 13:01:51.403 830507 [INFO] tcmur_register_handler:92: Handler rbd is registered
2021-09-03 13:01:51.512 830507 [INFO] tcmu_rbd_open:1162 rbd/rbd.disk_1: address: {10.50.50.31:0/1876153824}
2021-09-03 13:03:00.842 830507 [INFO] alua_implicit_transition:581 rbd/rbd.disk_1: Starting read lock acquisition operation.
2021-09-03 13:03:00.842 830507 [INFO] tcmu_acquire_dev_lock:486 rbd/rbd.disk_1: Read lock acquisition successful


@lxbsz
Collaborator

lxbsz commented Sep 6, 2021

Hey @lightmans2

Looks cool. Thanks very much for your tests.

For the tcmalloc issue we may need to fix this in extra/install_dep.sh.

BTW, have you hit any other issue with this PR? Thanks.

@lxbsz
Collaborator

lxbsz commented Sep 6, 2021

I am afraid you were still using 1.5.2 and not 1.5.4: we can see tcmu_rbd_has_lock:515, and in version 1.5.4 that message is not at line 515 any more.

This will only happen in version 1.5.2:

 508 static int tcmu_rbd_has_lock(struct tcmu_device *dev)
 509 { 
 510         struct tcmu_rbd_state *state = tcmur_dev_get_private(dev);
 511         int ret, is_owner;
 512   
 513         ret = rbd_is_exclusive_lock_owner(state->image, &is_owner);
 514         if (ret < 0) {
 515                 tcmu_dev_err(dev, "Could not check lock ownership. Error: %s.\n",                                                                 
 516                              strerror(-ret));
 517                 if (ret == -ESHUTDOWN || ret == -ETIMEDOUT) 
 518                         return ret;
 519   
 520                 /* let initiator figure things out */
 521                 return -EIO;
 522         } else if (is_owner) {
 523                 tcmu_dev_dbg(dev, "Is owner\n");
 524                 return 1;
 525         }
 526         tcmu_dev_dbg(dev, "Not owner\n");
 527   
 528         return 0;
 529 } 

I am afraid your install from source didn't take effect, or the install path you chose is not the same one the package used, so you didn't override the package's binaries.
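A quick way to check which binary is actually being run (standard commands; the paths are the ones already shown in this thread):

# what systemd starts and which version it reports
$ systemctl cat tcmu-runner | grep ExecStart
$ /usr/bin/tcmu-runner --version

# compare with the files owned by the distro package
$ dpkg -L tcmu-runner | grep -E 'bin|\.so'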

@lightmans2
Author

lightmans2 commented Sep 6, 2021

Hi lxbsz,

hmm, okay... can you help me fix that? I am not a developer, so I don't know how to fix or correct it.

The gateways are now stable and the performance is now between 100-200 MB/s.

@lxbsz
Collaborator

lxbsz commented Sep 6, 2021

Hi lxbsz,

hmm, okay... can you help me fix that? I am not a developer, so I don't know how to fix or correct it.

Do you mean the install_deps.sh, right? Yeah, I will fix this later.

The gateways are now stable and the performance is now between 100-200 MB/s.

For the building-from-source issue, you can check what the install paths are for all the binaries in the tcmu-runner 1.5.2 deb package, and then specify the matching install prefix in -DCMAKE_INSTALL_PREFIX=:

# cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .

@lxbsz
Collaborator

lxbsz commented Sep 6, 2021

If #667 works well for you I will merge it and do a new release soon.

@lightmans2
Author

lightmans2 commented Sep 16, 2021

[...]

Hi lxbsz,

you said that the install prefix might be wrong...
I analyzed the dpkg files/package of ceph-iscsi in Debian/Ubuntu.

What should the path look like?
cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .

The original looks like this:

root@cd133-ceph-mon-01:~# apt-file list ceph-iscsi
ceph-iscsi: /etc/init.d/rbd-target-api
ceph-iscsi: /etc/init.d/rbd-target-gw
ceph-iscsi: /lib/systemd/system/rbd-target-api.service
ceph-iscsi: /lib/systemd/system/rbd-target-gw.service
ceph-iscsi: /usr/bin/gwcli
ceph-iscsi: /usr/bin/rbd-target-api
ceph-iscsi: /usr/bin/rbd-target-gw
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi-3.4.egg-info/PKG-INFO
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi-3.4.egg-info/dependency_links.txt
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi-3.4.egg-info/top_level.txt
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/__init__.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/alua.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/backstore.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/client.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/common.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/discovery.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/gateway.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/gateway_object.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/gateway_setting.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/group.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/lio.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/lun.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/metrics.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/settings.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/target.py
ceph-iscsi: /usr/lib/python3/dist-packages/ceph_iscsi_config/utils.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/__init__.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/ceph.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/client.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/gateway.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/hostgroup.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/node.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/storage.py
ceph-iscsi: /usr/lib/python3/dist-packages/gwcli/utils.py
ceph-iscsi: /usr/share/doc/ceph-iscsi/changelog.Debian.gz
ceph-iscsi: /usr/share/doc/ceph-iscsi/copyright
ceph-iscsi: /usr/share/man/man8/gwcli.8.gz

@lxbsz
Collaborator

lxbsz commented Sep 17, 2021

[...]
What should the path look like?
cmake -Dwith-glfs=false -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr .

The original looks like this:

I haven't tried this on Debian yet; the above will always be true for CentOS/Fedora/RHEL.

root@cd133-ceph-mon-01:~# apt-file list ceph-iscsi

That should be the tcmu-runner package here, not ceph-iscsi.

But it seems the above prefix is correct, so I have no idea why you were still running the old tcmu-runner version.

After you installed it, did you reload and restart the service?

@lightmans2
Author

Hi lxbsz,

Yes, I rebooted both gateway servers and checked the tcmu-runner version; it's 1.5.4.

  • we have built 3 OSD nodes with 60 BlueStore OSDs: 60x 6 TB spinning disks, 12 SSDs and 3 NVMes
  • the OSD nodes have 32 cores and 256 GB RAM
  • the OSD disks are connected to a SCSI RAID controller... each disk is configured as RAID0 with write-back enabled to use the RAID controller cache etc.
  • we have 3 MONs and 2 iSCSI gateways
  • all servers are connected to a 10 Gbit network (switches)
  • all servers have two 10 Gbit network adapters configured as bond-rr
  • we created one RBD pool with autoscaling and 128 PGs (at the moment)
  • in the pool there are at the moment 5 RBD images... 2x 10 TB and 3x 500 GB, with the exclusive-lock feature and striping v2 (4 MB object / 1 MB stripe / count 4)
  • all the images are attached to the two iSCSI gateways running tcmu-runner 1.5.4 and exposed as iSCSI targets
  • we have 6 ESXi 6.7u3 servers as compute nodes connected to the Ceph iSCSI target

ESXi iSCSI config:
esxcli system settings advanced set -o /ISCSI/MaxIoSizeKB -i 512
esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=64
esxcli system module parameters set -m iscsi_vmk -p iscsivmk_HostQDepth=64
esxcli system settings advanced set --int-value 1 --option /DataMover/HardwareAcceleratedMove
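(To confirm these values took effect they can be read back on the host; these are standard esxcli queries, shown only as a sanity check:)

esxcli system settings advanced list -o /ISCSI/MaxIoSizeKB
esxcli system module parameters list -m iscsi_vmk | grep -i QDepth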

The OSD nodes, MONs, RGW/iSCSI gateways and ESXi nodes are all connected to the 10 Gbit network with bond-rr.

RADOS bench test on the rbd pool:

root@cd133-ceph-osdh-01:~# rados bench -p rbd 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cd133-ceph-osdh-01_87894
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        69        53   211.987       212    0.250578    0.249261
    2      16       129       113   225.976       240    0.296519    0.266439
    3      16       183       167   222.641       216    0.219422    0.273838
    4      16       237       221   220.974       216    0.469045     0.28091
    5      16       292       276   220.773       220    0.249321     0.27565
    6      16       339       323   215.307       188    0.205553     0.28624
    7      16       390       374   213.688       204    0.188404    0.290426
    8      16       457       441   220.472       268    0.181254    0.286525
    9      16       509       493   219.083       208    0.250538    0.286832
   10      16       568       552   220.772       236    0.307829    0.286076
Total time run:         10.2833
Total writes made:      568
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     220.941
Stddev Bandwidth:       22.295
Max bandwidth (MB/sec): 268
Min bandwidth (MB/sec): 188
Average IOPS:           55
Stddev IOPS:            5.57375
Max IOPS:               67
Min IOPS:               47
Average Latency(s):     0.285903
Stddev Latency(s):      0.115162
Max latency(s):         0.88187
Min latency(s):         0.119276
Cleaning up (deleting benchmark objects)
Removed 568 objects
Clean up completed and total clean up time :3.18627

The benchmark says that a minimum of 250 MB/s is possible... but I have really seen much more... up to 550 MB/s.
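An image-level baseline can also be taken with rbd bench, which goes through librbd much like tcmu-runner does (image name and sizes below are examples; note that a write bench writes into the image, so only run it against a scratch image):

# large sequential writes, comparable to the rados bench above
rbd bench --io-type write --io-size 4M --io-threads 16 --io-total 10G rbd/disk_test

# small random writes for an IOPS-oriented view
rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 1G --io-pattern rand rbd/disk_test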

If I start iftop on one OSD node I can see the Ceph iSCSI gateways (named rgw) and the traffic is nearly 80 MB/s:
[screenshot: iftop output]

The Ceph dashboard shows that the iSCSI write performance is only 40 MB/s;
the max value I saw was between 40 and 60 MB/s... very poor:
[screenshot: Ceph dashboard]

If I look at the vCenter and ESXi datastore performance I see very high storage device latencies between 50 and 100 ms... very bad:
[screenshot: vCenter datastore performance]

root@cd133-ceph-mon-01:/home/cephadm# ceph config dump
WHO                                               MASK       LEVEL     OPTION                                       VALUE                                                                                        RO
global                                                       basic     container_image                              docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb  *
global                                                       advanced  journal_max_write_bytes                      1073714824
global                                                       advanced  journal_max_write_entries                    10000
global                                                       advanced  mon_osd_cache_size                           1024
global                                                       dev       osd_client_watch_timeout                     15
global                                                       dev       osd_heartbeat_interval                       5
global                                                       advanced  osd_map_cache_size                           128
global                                                       advanced  osd_max_write_size                           512
global                                                       advanced  rados_osd_op_timeout                         5
global                                                       advanced  rbd_cache_max_dirty                          134217728
global                                                       advanced  rbd_cache_max_dirty_age                      5.000000
global                                                       advanced  rbd_cache_size                               268435456
global                                                       advanced  rbd_op_threads                               2
  mon                                                        advanced  auth_allow_insecure_global_id_reclaim        false
  mon                                                        advanced  cluster_network                              10.50.50.0/24                                                                                *
  mon                                                        advanced  public_network                               10.50.50.0/24                                                                                *
  mgr                                                        advanced  mgr/cephadm/container_init                   True                                                                                         *
  mgr                                                        advanced  mgr/cephadm/device_enhanced_scan             true                                                                                         *
  mgr                                                        advanced  mgr/cephadm/migration_current                2                                                                                            *
  mgr                                                        advanced  mgr/cephadm/warn_on_stray_daemons            false                                                                                        *
  mgr                                                        advanced  mgr/cephadm/warn_on_stray_hosts              false                                                                                        *
  mgr                                                        advanced  mgr/dashboard/10.50.50.21/server_addr                                                                                                     *
  mgr                                                        advanced  mgr/dashboard/ALERTMANAGER_API_HOST          http://10.221.133.161:9093                                                                   *
  mgr                                                        advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY         false                                                                                        *
  mgr                                                        advanced  mgr/dashboard/GRAFANA_API_URL                https://10.221.133.161:3000                                                                  *
  mgr                                                        advanced  mgr/dashboard/ISCSI_API_SSL_VERIFICATION     true                                                                                         *
  mgr                                                        advanced  mgr/dashboard/NAME/server_port               80                                                                                           *
  mgr                                                        advanced  mgr/dashboard/PROMETHEUS_API_HOST            http://10.221.133.161:9095                                                                   *
  mgr                                                        advanced  mgr/dashboard/PROMETHEUS_API_SSL_VERIFY      false                                                                                        *
  mgr                                                        advanced  mgr/dashboard/RGW_API_ACCESS_KEY             W8VEKVFDK1RH5IH2Q3GN                                                                         *
  mgr                                                        advanced  mgr/dashboard/RGW_API_SECRET_KEY             IkIjmjfh3bMLrPOlAFbMfpigSIALAQoKGEHzZgxv                                                     *
  mgr                                                        advanced  mgr/dashboard/camdatadash/server_addr        10.251.133.161                                                                               *
  mgr                                                        advanced  mgr/dashboard/camdatadash/ssl_server_port    8443                                                                                         *
  mgr                                                        advanced  mgr/dashboard/cd133-ceph-mon-01/server_addr                                                                                               *
  mgr                                                        advanced  mgr/dashboard/dasboard/server_port           80                                                                                           *
  mgr                                                        advanced  mgr/dashboard/dashboard/server_addr          10.251.133.161                                                                               *
  mgr                                                        advanced  mgr/dashboard/dashboard/ssl_server_port      8443                                                                                         *
  mgr                                                        advanced  mgr/dashboard/server_addr                    0.0.0.0                                                                                      *
  mgr                                                        advanced  mgr/dashboard/server_port                    8080                                                                                         *
  mgr                                                        advanced  mgr/dashboard/ssl                            false                                                                                        *
  mgr                                                        advanced  mgr/dashboard/ssl_server_port                8443                                                                                         *
  mgr                                                        advanced  mgr/orchestrator/orchestrator                cephadm
  mgr                                                        advanced  mgr/prometheus/server_addr                   0.0.0.0                                                                                      *
  mgr                                                        advanced  mgr/telemetry/channel_ident                  true                                                                                         *
  mgr                                                        advanced  mgr/telemetry/contact                        [email protected]                                                                                *
  mgr                                                        advanced  mgr/telemetry/description                    ceph cluster                                                                         *
  mgr                                                        advanced  mgr/telemetry/enabled                        true                                                                                         *
  mgr                                                        advanced  mgr/telemetry/last_opt_revision              3                                                                                            *
  osd                                                        dev       bluestore_cache_autotune                     false
  osd                                             class:ssd  dev       bluestore_cache_autotune                     false
  osd                                                        dev       bluestore_cache_size                         4000000000
  osd                                             class:ssd  dev       bluestore_cache_size                         4000000000
  osd                                                        dev       bluestore_cache_size_hdd                     4000000000
  osd                                                        dev       bluestore_cache_size_ssd                     4000000000
  osd                                             class:ssd  dev       bluestore_cache_size_ssd                     4000000000
  osd                                                        advanced  bluestore_default_buffered_write             true
  osd                                             class:ssd  advanced  bluestore_default_buffered_write             true
  osd                                                        advanced  osd_max_backfills                            1
  osd                                             class:ssd  dev       osd_memory_cache_min                         4000000000
  osd                                             class:hdd  basic     osd_memory_target                            6000000000
  osd                                             class:ssd  basic     osd_memory_target                            6000000000
  osd                                                        advanced  osd_recovery_max_active                      3
  osd                                                        advanced  osd_recovery_max_single_start                1
  osd                                                        advanced  osd_recovery_sleep                           0.000000
    client.rgw.ceph-rgw.cd133-ceph-rgw-01.klvrwk             basic     rgw_frontends                                beast port=8000                                                                              *
    client.rgw.ceph-rgw.cd133-ceph-rgw-01.ptmqcm             basic     rgw_frontends                                beast port=8001                                                                              *
    client.rgw.ceph-rgw.cd88-ceph-rgw-01.czajah              basic     rgw_frontends                                beast port=8000                                                                              *
    client.rgw.ceph-rgw.cd88-ceph-rgw-01.pdknfg              basic     rgw_frontends                                beast port=8000                                                                              *
    client.rgw.ceph-rgw.cd88-ceph-rgw-01.qkdlfl              basic     rgw_frontends                                beast port=8001                                                                              *
    client.rgw.ceph-rgw.cd88-ceph-rgw-01.tdsxpb              basic     rgw_frontends                                beast port=8001                                                                              *
    client.rgw.ceph-rgw.cd88-ceph-rgw-01.xnadfr              basic     rgw_frontends                                beast port=8001                                                                              *

Can somebody explain what I am doing wrong, or what I can do to get better performance with ceph-iscsi?
No matter what I do or tweak, the write performance does not get better.

I already experimented with gwcli and the iSCSI queue and other settings.
Currently I have set:
hw_max_sectors 8192
max_data_area_mb 32
cmdsn_depth 64 (the ESXi nodes are already fixed at 64 max iSCSI commands)

Everything is fine, multipathing is working and recovery is fast... but iSCSI is very slow and I don't know why.
Can somebody help me, maybe?
