Improve Performance for SRS #1673

Closed
winlinvip opened this issue Mar 26, 2020 · 4 comments

winlinvip commented Mar 26, 2020

Performance optimization is an endless topic that requires continuous improvement. SRS2 went through a major round of optimization, raising concurrency from 3k to 7k clients. Further optimization is needed; the process and data will be posted in this issue.

SRS2 has already been through several rounds of optimization; the earlier benchmark data is listed below for reference.

Play RTMP benchmark

The RTMP playback data was benchmarked with srs-bench (SB):

| Update     | SRS     | Clients    | Type    | CPU | Memory | Commit |
| ---------- | ------- | ---------- | ------- | --- | ------ | ------ |
| 2014-12-07 | 2.0.67  | 10k(10000) | players | 95% | 656MB  | code   |
| 2014-12-05 | 2.0.57  | 9.0k(9000) | players | 90% | 468MB  | code   |
| 2014-12-05 | 2.0.55  | 8.0k(8000) | players | 89% | 360MB  | code   |
| 2014-11-22 | 2.0.30  | 7.5k(7500) | players | 87% | 320MB  | code   |
| 2014-11-13 | 2.0.15  | 6.0k(6000) | players | 82% | 203MB  | code   |
| 2014-11-12 | 2.0.14  | 3.5k(3500) | players | 95% | 78MB   | code   |
| 2014-11-12 | 2.0.14  | 2.7k(2700) | players | 69% | 59MB   | -      |
| 2014-11-11 | 2.0.12  | 2.7k(2700) | players | 85% | 66MB   | -      |
| 2014-11-11 | 1.0.5   | 2.7k(2700) | players | 85% | 66MB   | -      |
| 2014-07-12 | 0.9.156 | 2.7k(2700) | players | 89% | 61MB   | code   |
| 2014-07-12 | 0.9.156 | 1.8k(1800) | players | 68% | 38MB   | -      |
| 2013-11-28 | 0.5.0   | 1.8k(1800) | players | 90% | 41MB   | -      |

Publish RTMP benchmark

The RTMP publishing data was benchmarked with srs-bench (SB):

| Update     | SRS    | Clients    | Type       | CPU | Memory | Commit |
| ---------- | ------ | ---------- | ---------- | --- | ------ | ------ |
| 2014-12-04 | 2.0.52 | 4.0k(4000) | publishers | 80% | 331MB  | code   |
| 2014-12-04 | 2.0.51 | 2.5k(2500) | publishers | 91% | 259MB  | code   |
| 2014-12-04 | 2.0.49 | 2.5k(2500) | publishers | 95% | 404MB  | code   |
| 2014-12-04 | 2.0.49 | 1.4k(1400) | publishers | 68% | 144MB  | -      |
| 2014-12-03 | 2.0.48 | 1.4k(1400) | publishers | 95% | 140MB  | code   |
| 2014-12-03 | 2.0.47 | 1.4k(1400) | publishers | 95% | 140MB  | -      |
| 2014-12-03 | 2.0.47 | 1.2k(1200) | publishers | 84% | 76MB   | code   |
| 2014-12-03 | 2.0.12 | 1.2k(1200) | publishers | 96% | 43MB   | -      |
| 2014-12-03 | 1.0.10 | 1.2k(1200) | publishers | 96% | 43MB   | -      |

Play HTTP FLV benchmark

The HTTP-FLV playback data was benchmarked with srs-bench (SB):

| Update     | SRS     | Clients    | Type    | CPU | Memory | Commit |
| ---------- | ------- | ---------- | ------- | --- | ------ | ------ |
| 2015-05-25 | 2.0.171 | 6.0k(6000) | players | 84% | 297MB  | code   |
| 2015-05-24 | 2.0.170 | 3.0k(3000) | players | 89% | 96MB   | code   |
| 2015-05-24 | 2.0.169 | 3.0k(3000) | players | 94% | 188MB  | code   |
| 2015-05-24 | 2.0.168 | 2.3k(2300) | players | 92% | 276MB  | code   |
| 2015-05-24 | 2.0.167 | 1.0k(1000) | players | 82% | 86MB   | -      |

Latency benchmark

The latency between encoder and player with the realtime config ([CN][v3_CN_LowLatency], [EN][v3_EN_LowLatency]):

| Update     | SRS    | VP6  | H.264 | VP6+MP3 | H.264+MP3 |
| ---------- | ------ | ---- | ----- | ------- | --------- |
| 2014-12-16 | 2.0.72 | 0.1s | 0.4s  | 0.8s    | 0.6s      |
| 2014-12-12 | 2.0.70 | 0.1s | 0.4s  | 1.0s    | 0.9s      |
| 2014-12-03 | 1.0.10 | 0.4s | 0.4s  | 0.9s    | 1.2s      |
winlinvip added the Enhancement label on Mar 26, 2020
winlinvip added this to the SRS 3.0 release milestone on Mar 26, 2020
winlinvip commented Mar 26, 2020

SRS4: Refine ST Iterate Coroutines Performance

There is an optimization in ST that could improve performance by 5% to 10%; it mainly addresses the cost of iterating over coroutines during event dispatch. Data reference: ossrs/state-threads#5 (comment)

This optimization involves significant changes, so it will not be implemented in SRS3, but is expected to be in SRS4.
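
To make the coroutine-iteration cost concrete, here is a simplified sketch, not the actual patch in ossrs/state-threads#5: if every dispatch walks the whole queue of coroutines blocked on I/O to match epoll events against their fds, the cost is proportional to all waiters per wakeup, even when only a few connections are ready; letting epoll hand back a pointer to the waiter makes dispatch proportional to the ready events only. The Waiter type below is a hypothetical stand-in for ST's internal poll-queue entry.

```cpp
// Simplified illustration only -- not the ossrs/state-threads#5 change itself.
#include <sys/epoll.h>
#include <vector>

struct Waiter {      // hypothetical stand-in for ST's per-coroutine pollq entry
    int fd;
    bool ready;
};

// Baseline-style dispatch: scan every waiting coroutine per wakeup, O(waiters).
void dispatch_scan_all(const std::vector<epoll_event>& evs, std::vector<Waiter>& io_q) {
    for (Waiter& w : io_q) {
        for (const epoll_event& ev : evs) {
            if (ev.data.fd == w.fd) { w.ready = true; break; }
        }
    }
}

// Refined-style dispatch: data.ptr was set to the Waiter when the fd was added
// to epoll, so only the coroutines that actually woke up are touched.
void dispatch_direct(const std::vector<epoll_event>& evs) {
    for (const epoll_event& ev : evs) {
        static_cast<Waiter*>(ev.data.ptr)->ready = true;  // O(ready events)
    }
}
```

This matches the profiles below: _st_epoll_dispatch tops the baseline gprof output and disappears from the hotspots once the refinement is merged.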

MacPro information:

  • macOS Mojave
  • Version 10.14.6 (18G3020)
  • MacBook Pro (Retina, 15-inch, Mid 2015)
  • Processor: 2.2 GHz Intel Core i7
  • Memory: 16 GB 1600 MHz DDR3

Docker information:

  • Docker Desktop 2.2.0.3(42716)
  • Engine: 19.03.5
  • Resources: CPUs 4, Memory 2GB, Swap 1GB

Note: SRS is bound to CPU0, and SB is bound to CPU2-3.
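
The issue does not say how the pinning is done (taskset or Docker's --cpuset-cpus are the usual options); for reference, the equivalent binding expressed in code looks roughly like this Linux-only sketch:

```cpp
// Linux-only sketch: pin the calling process to CPU0, as SRS is pinned in this
// benchmark. Build with g++, which defines _GNU_SOURCE (required by
// sched_setaffinity).
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  // CPU0 only

    // pid 0 means "the calling process".
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("pinned to CPU0\n");
    return 0;
}
```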

SRS3 for Playing Baseline

SRS3 without this optimization serves as the performance baseline, so we can see how much this PR improves over it.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 03:44:38 up 14:03,  0 users,  load average: 1.72, 1.71, 1.74
Tasks:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu0  : 44.7 us, 14.9 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  7.8 si,  0.0 st
%Cpu1  :  1.5 us,  2.9 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 21.2 us, 11.2 sy,  0.0 ni, 67.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.0 us,  8.4 sy,  0.0 ni, 75.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  2037260 total,   490352 free,  1188940 used,   357968 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   704796 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6654 root      20   0  463540 331388   2960 S  24.6 16.3  21:00.42 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6606 root      20   0  449600 317332   2824 S  20.6 15.6  20:56.26 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11191 root      20   0 1339072 194020   5440 S  64.1  9.5   1:43.16 ./gprof.srs_3_baseline -c console.conf 

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4002

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 19   9  70   0   0   2|   0     0 | 134M  134M|   0     0 |4500  6374 
 24  14  58   0   0   4|   0     0 | 184M  184M|   0     0 |4829  5833 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_baseline gmon.out |more
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.71      8.35     8.35                             _st_epoll_dispatch
 16.91     15.52     7.17 45118865     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 10.29     19.88     4.36  1857259     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  9.33     23.83     3.96 45118865     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
  4.65     25.80     1.97     4000     0.49     3.17  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  3.54     27.30     1.50     7295     0.21     1.47  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  3.42     28.75     1.45  1857259     0.00     0.00  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.16     30.09     1.34 45086840     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.36     31.09     1.00 45118865     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)

Interpretation:

  • CPU usage is 64%, with 44% in user space and 14% in system space.
  • The hot functions in user space are _st_epoll_dispatch and the RTMP message processing path (see the sketch below).
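
To make those hotspots concrete: SrsConsumer::enqueue and SrsFastVector::push_back are each called about 45 million times here, once per message per consumer. The sketch below shows the shape of an array-backed queue in that spirit; it is a simplified illustration, not the actual SrsFastVector code. The point is that the per-call work has to stay a pointer store plus a rare geometric grow, because anything heavier is multiplied by tens of millions of calls.

```cpp
// Simplified sketch of an array-backed message queue in the spirit of
// SrsFastVector; the real SRS class differs in details.
#include <cstring>

class SrsSharedPtrMessage;  // opaque here; SRS reference-counts the payload

class FastMsgQueue {
private:
    SrsSharedPtrMessage** msgs_;
    int capacity_;
    int count_;
public:
    FastMsgQueue() : capacity_(8192), count_(0) {
        msgs_ = new SrsSharedPtrMessage*[capacity_];
    }
    ~FastMsgQueue() { delete[] msgs_; }

    // Hot path: one call per message per consumer, so it must remain a pointer
    // store plus an occasional geometric grow, with no per-message allocation.
    void push_back(SrsSharedPtrMessage* msg) {
        if (count_ >= capacity_) {
            int nc = capacity_ * 2;
            SrsSharedPtrMessage** nm = new SrsSharedPtrMessage*[nc];
            std::memcpy(nm, msgs_, sizeof(SrsSharedPtrMessage*) * count_);
            delete[] msgs_;
            msgs_ = nm;
            capacity_ = nc;
        }
        msgs_[count_++] = msg;
    }

    SrsSharedPtrMessage** data() { return msgs_; }
    int size() const { return count_; }
    void clear() { count_ = 0; }  // message ownership is handled by the caller
};
```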

SRS3 for Playing with ST Refined

SRS3, with this PR merged, optimizes the ST iteration logic.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 04:00:43 up 14:19,  0 users,  load average: 1.47, 1.57, 1.62
Tasks:  13 total,   3 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu0  : 40.6 us, 10.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  5.8 si,  0.0 st
%Cpu1  :  1.0 us,  2.1 sy,  0.0 ni, 96.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 17.7 us, 11.8 sy,  0.0 ni, 70.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.8 us,  9.5 sy,  0.0 ni, 73.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   429264 free,  1226620 used,   381376 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   667064 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6606 root      20   0  449356 317088   2824 S  19.3 15.6  24:59.70 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6654 root      20   0  448304 316176   2960 R  19.9 15.5  25:11.48 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11352 root      20   0 1357608 241384   5344 R  54.8 11.8   2:25.22 ./gprof.srs_3_st -c console.conf

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4003

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 21  10  67   0   0   2|   0     0 | 111M  111M|   0     0 |4563  6364 
 23   9  66   0   0   2|   0     0 | 121M  121M|   0     0 |4505  6306 
 20   9  69   0   0   2|   0     0 | 130M  130M|   0     0 |4812  6843 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_st gmon.out |more
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 22.33     14.96    14.96 82024549     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 13.08     23.73     8.77 82024549     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
 12.30     31.97     8.24  3312993     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  5.25     35.49     3.52     4001     0.88     5.96  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  5.07     38.89     3.40    13188     0.26     1.73  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  4.54     41.93     3.04  3312993     0.00     0.01  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.49     44.27     2.34 82013595     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.63     46.03     1.76 82024549     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)
  2.28     47.56     1.53     7656     0.20     1.68  SrsSource::on_video_imp(SrsSharedPtrMessage*)
  2.13     48.99     1.43                             st_writev

Interpretation:

  • CPU usage is 54%, with 40% in user space and 10% in system space.
  • Functions in user space are now mainly the RTMP message processing logic.

Note: after the ST optimization there is a measurable performance improvement, and _st_epoll_dispatch no longer shows up as a hotspot function.

winlinvip commented Mar 26, 2020

SRS3: Use Compiler O2 To Improve Performance

SRS 1, 2, and 3 have always been built with O0 by default, that is, with compiler optimization disabled. The data below compares performance after enabling optimization.

MacPro information:

  • macOS Mojave
  • Version 10.14.6 (18G3020)
  • MacBook Pro (Retina, 15-inch, Mid 2015)
  • Processor: 2.2 GHz Intel Core i7
  • Memory: 16 GB 1600 MHz DDR3

Docker information:

  • Docker Desktop 2.2.0.3(42716)
  • Engine: 19.03.5
  • Resources: CPUs 4, Memory 2GB, Swap 1GB

Note: SRS is bound to CPU0, and SB is bound to CPU2-3.

SRS3 Play Baseline

First, let's look at the baseline data, with an average CPU usage of 66%, 39% in user space, and 22% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:03:30 up 1 day, 14 min,  0 users,  load average: 1.53, 1.39, 1.12
Tasks:   5 total,   3 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu0  : 39.6 us, 22.9 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi,  8.9 si,  0.0 st
%Cpu1  :  0.3 us,  1.7 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 21.3 us, 11.8 sy,  0.0 ni, 66.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 26.7 us, 15.2 sy,  0.0 ni, 58.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   412404 free,  1260192 used,   364664 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   640028 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  555112 393012   3056 S  26.7 19.3   4:58.08 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  555004 392828   3000 R  35.3 19.3   5:34.46 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88034 root      20   0 1651656 218748   5484 R  66.3 10.7  12:38.10 ./srs_3_baseline -c console.conf                                          
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.46 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.51 bash 

SRS3 Play with Compiler O2

After enabling the O2 compiler option, SRS3 performance improves by about 10%: CPU usage is around 52%, with 26% in user space and 17% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:09:24 up 1 day, 20 min,  0 users,  load average: 1.23, 1.38, 1.20
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  : 26.7 us, 17.8 sy,  0.0 ni, 46.2 id,  0.0 wa,  0.0 hi,  9.2 si,  0.0 st
%Cpu1  :  1.8 us,  4.8 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 24.3 us, 11.4 sy,  0.0 ni, 64.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 20.6 us, 10.7 sy,  0.0 ni, 68.4 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  2037260 total,   375336 free,  1307788 used,   354136 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   594752 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  550440 388408   3056 S  31.2 19.1   6:55.76 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  545716 383624   3000 S  24.6 18.8   7:27.84 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88085 root      20   0 1713060 290732   5040 S  52.5 14.3   2:38.46 ./srs_3_o2 -c console.conf                                                
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.60 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.54 bash 

c47b9e46

winlinvip commented Mar 27, 2020

The Docker environment turned out to give an unstable baseline: the same test sometimes measures high and sometimes low, with significant variance, as shown in the screenshots below:

[screenshots showing the baseline varying significantly between identical runs]

Some optimizations have been made; a few of them, such as enabling O2, are expected to help, but because of the unstable baseline they are on hold for now and will be re-tested on a physical machine later. The optimization branches are:

  • compiler O2: enable O2 optimization during compilation.
  • inline: enable inlining of hotspot functions.
  • tcmalloc: use tcmalloc for memory allocation.
  • st: merge ST improvement #5 to optimize busy coroutine scheduling.
  • large iovs: increase the number of mw_msgs merged per write (see the sketch after this list).
  • perf stat: count the number of merged-write (mw) messages.
  • fast vector: optimize the per-consumer queue.
  • mr always: always enable merged-read (mr) waiting.
  • mr buffer: always read a fixed length of data.
  • small buffer: a smaller buffer may give better performance.
  • vector queue: using std::vector directly is also an option.
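
Several of these branches (large iovs, perf stat, mr buffer) revolve around merged I/O: batching many messages into a single system call. The following is a hedged sketch of the merged-write idea only, with illustrative names rather than the actual SrsProtocol::do_send_messages() code:

```cpp
// Sketch of merged write: pack many message payloads into one iovec array and
// flush them with a single writev(), amortizing the per-message syscall cost.
#include <sys/uio.h>
#include <unistd.h>
#include <vector>

struct Msg { const char* payload; size_t size; };  // hypothetical message view

ssize_t send_merged(int fd, const std::vector<Msg>& msgs, size_t max_iovs = 1024) {
    std::vector<iovec> iovs;
    iovs.reserve(max_iovs);
    ssize_t sent = 0;

    for (size_t i = 0; i < msgs.size(); ++i) {
        iovec iov;
        iov.iov_base = const_cast<char*>(msgs[i].payload);
        iov.iov_len = msgs[i].size;
        iovs.push_back(iov);

        // Flush when the batch is full or this is the last message.
        if (iovs.size() == max_iovs || i + 1 == msgs.size()) {
            ssize_t n = writev(fd, iovs.data(), (int)iovs.size());
            if (n < 0) return -1;  // real code must also handle partial writes
            sent += n;
            iovs.clear();
        }
    }
    return sent;
}
```

Larger batches amortize the writev() cost but increase per-send latency and buffering, which is the trade-off the large iovs and perf stat branches are probing.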

winlinvip commented Apr 19, 2020

Regarding ST optimization, the points that can be optimized are:

  1. The use of timer and cond; refer to "Refine SRS timer and cond for performance issue" #1711 (see the sketch after this list).
  2. IO event processing requires traversing io_q; refer to "Support MSG_ZEROCOPY for streaming server" state-threads#13 (comment).
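
On point 1, a common way to cut timer/cond overhead is to signal the consumer only on the empty-to-non-empty transition and drain the whole queue per wakeup, instead of signaling per message and waking on fine-grained timers. The sketch below illustrates that pattern with the public ST cond API; it is not the actual change in #1711, and the queue type, names, and 300ms timeout are illustrative.

```cpp
// Wakeup-batching sketch using the public ST cond API (st.h). Assumes
// st_init() has already been called; all coroutines share one OS thread,
// so no lock is needed around the vector.
#include <st.h>
#include <vector>

class SrsSharedPtrMessage;

struct ConsumerQueue {
    st_cond_t wait = st_cond_new();
    std::vector<SrsSharedPtrMessage*> msgs;

    ~ConsumerQueue() { st_cond_destroy(wait); }

    // Producer side: one wakeup per burst, not one per message.
    void enqueue(SrsSharedPtrMessage* msg) {
        bool was_empty = msgs.empty();
        msgs.push_back(msg);
        if (was_empty) {
            st_cond_signal(wait);
        }
    }

    // Consumer side: drain everything available per wakeup. The 300ms timeout
    // is an arbitrary illustrative value, not the SRS default.
    void dump(std::vector<SrsSharedPtrMessage*>& out) {
        if (msgs.empty()) {
            st_cond_timedwait(wait, 300 * 1000);  // st_utime_t is microseconds
        }
        out.insert(out.end(), msgs.begin(), msgs.end());
        msgs.clear();
    }
};
```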

For analysis on ST, refer to: https://github.com/ossrs/state-threads/tree/srs#analysis

  1. About setjmp and longjmp, read setjmp (a minimal sketch follows this list).
  2. About the stack structure, read stack.
  3. About the asm code comments, read #91d530e.
  4. About the scheduler, read #13-scheduler.
  5. About the IO event system, read #13-IO.
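
On the setjmp/longjmp point, the core mechanism is small enough to show inline: setjmp() saves a jump context and returns 0, and a later longjmp() to that context makes the same setjmp() appear to return again with a non-zero value. ST builds coroutine switching on this idea, with its own assembly save/restore routines and a separate stack per coroutine; the minimal sketch below shows only the save/jump part, on a single stack.

```cpp
// Minimal single-stack illustration of setjmp/longjmp; ST's real context switch
// additionally swaps the stack pointer so each coroutine runs on its own stack.
#include <csetjmp>
#include <cstdio>

static jmp_buf ctx;

static void worker() {
    std::printf("worker: jumping back to the saved context\n");
    std::longjmp(ctx, 1);  // the setjmp() below now "returns" 1
}

int main() {
    if (setjmp(ctx) == 0) {
        // First return: the context has just been saved.
        std::printf("main: context saved, entering worker\n");
        worker();
    } else {
        // Second return: reached via longjmp().
        std::printf("main: resumed after longjmp\n");
    }
    return 0;
}
```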

ossrs locked and limited conversation to collaborators on Jul 18, 2023
winlinvip converted this issue into discussion #3666 on Jul 18, 2023
winlinvip added the TransByAI (Translated by AI/GPT) label on Jul 28, 2023
