Ubuntu 12.04, gcc-4.7.3, 32-bit, with fftw 3.3.3 (built with --enable-neon), on a 1.2GHz ARM Cortex A9 (Tegra 3)
Built with:
gcc-4.7 -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f
input len | real FFTPack | real FFTW | real PFFFT | cplx FFTPack | cplx FFTW | cplx PFFFT |
---|---|---|---|---|---|---|
64 | 549 | 452 | 731 | 512 | 602 | 640 |
96 | 421 | 272 | 702 | 496 | 571 | 602 |
128 | 498 | 512 | 815 | 597 | 618 | 652 |
160 | 521 | 536 | 815 | 586 | 669 | 625 |
192 | 539 | 571 | 883 | 485 | 597 | 626 |
256 | 640 | 539 | 975 | 569 | 611 | 671 |
384 | 499 | 610 | 879 | 499 | 602 | 637 |
480 | 518 | 507 | 877 | 496 | 661 | 616 |
512 | 524 | 591 | 1002 | 549 | 678 | 668 |
640 | 542 | 612 | 955 | 568 | 663 | 645 |
768 | 557 | 613 | 981 | 491 | 663 | 598 |
800 | 514 | 353 | 882 | 514 | 360 | 574 |
1024 | 640 | 640 | 1067 | 492 | 683 | 602 |
2048 | 587 | 640 | 908 | 486 | 640 | 552 |
2400 | 479 | 368 | 777 | 422 | 376 | 518 |
4096 | 511 | 614 | 853 | 426 | 640 | 534 |
8192 | 415 | 584 | 708 | 386 | 622 | 516 |
9216 | 419 | 571 | 687 | 364 | 586 | 506 |
16384 | 426 | 577 | 716 | 398 | 606 | 530 |
32768 | 417 | 572 | 673 | 399 | 572 | 468 |
262144 | 219 | 380 | 293 | 255 | 431 | 343 |
1048576 | 202 | 274 | 237 | 265 | 282 | 355 |
Same platform as above, but this time pffft and fftpack are built with clang 3.2:
clang -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f
input len | real FFTPack | real FFTW | real PFFFT | cplx FFTPack | cplx FFTW | cplx PFFFT |
---|---|---|---|---|---|---|
64 | 427 | 452 | 853 | 427 | 602 | 1024 |
96 | 351 | 276 | 843 | 337 | 571 | 963 |
128 | 373 | 512 | 996 | 390 | 618 | 1054 |
160 | 426 | 536 | 987 | 375 | 669 | 914 |
192 | 404 | 571 | 1079 | 388 | 588 | 1079 |
256 | 465 | 539 | 1205 | 445 | 602 | 1170 |
384 | 366 | 610 | 1099 | 343 | 594 | 1099 |
480 | 356 | 507 | 1140 | 335 | 651 | 931 |
512 | 411 | 591 | 1213 | 384 | 649 | 1124 |
640 | 398 | 612 | 1193 | 373 | 654 | 901 |
768 | 409 | 613 | 1227 | 383 | 663 | 1044 |
800 | 411 | 348 | 1073 | 353 | 358 | 809 |
1024 | 427 | 640 | 1280 | 413 | 692 | 1004 |
2048 | 414 | 626 | 1126 | 371 | 640 | 853 |
2400 | 399 | 373 | 898 | 319 | 368 | 653 |
4096 | 404 | 602 | 1059 | 357 | 633 | 778 |
8192 | 332 | 584 | 792 | 308 | 616 | 716 |
9216 | 322 | 561 | 783 | 299 | 586 | 687 |
16384 | 344 | 568 | 778 | 314 | 617 | 745 |
32768 | 342 | 564 | 737 | 314 | 552 | 629 |
262144 | 201 | 383 | 313 | 227 | 435 | 413 |
1048576 | 187 | 262 | 251 | 228 | 281 | 409 |
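So it looks like, on ARM, gcc 4.7 is the best at scalar floating point (the fftpack numbers are higher with gcc, e.g. 640 vs 427 for the real transform at length 1024), while clang is the best with NEON intrinsics (pffft improves with clang 3.2, e.g. from 1067 to 1280 for the real transform and from 602 to 1004 for the complex one at the same length).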