Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: runtime detection of CRC32 and ARM64 support. #82

Open
wants to merge 7 commits into
base: v2.1-agentzh
Choose a base branch
from

Conversation

thibaultcha
Copy link
Member

@thibaultcha thibaultcha commented Oct 10, 2019

This PR refines the implementation of the CRC32 runtime detection implemented by @isage in #20, and previously refined by @siddhesh in #75. It also brings in refined ARM64 support originally implemented by @debayang in #21.

The refined implementation compares itself with the original ones as such:

  • It does not rely on new global variables.
  • It does not rely on the __attribute__((constructor)) attribute to detect/initialize CRC32 support.
  • It has a smaller footprint on the upstream codebase and less potential for future merge conflicts.
  • It can be detected and enabled even if LuaJIT is compiled without JIT support (-DLUAJIT_DISABLE_JIT).
  • It implements new jit.* APIs allowing users some introspection into this optimization.
  • It adds a minimum amount of new macros, and allows for granularly disabling this optimization via a new LJ_OR_* macros (as proposed in Move openresty-specific language extensions under a special configuration? #63).
  • It reorganizes the core to encapsulate string hashing implementations in lj_str_hash.c and leaves the room open to new string hashing implementations in the future.

In addition, this PR offers to greatly improve the CI tests matrix by running all 3 test suites against many different builds, on both x64 and arm64 architectures (thanks to the recent Travis-CI updates on arm64 support).

CC for reviews: @yangshuxin, @agentzh, @siddhesh, @dndx.


language: c

arch:
- amd64
#- arm64
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The luajit2-test-suite is disabled for now on arm64 since it currently has a few failing test cases.

.travis.yml Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@siddhesh
Copy link
Collaborator

I skimmed through the patches and overall they seem to be in the direction I wanted to take, thanks! I can do a more thorough review in some time if you like. One thing I noticed for example is the _OR_ in the name for the feature check macro. Please keep it generic luajit since I am working on merging it into my fork of the default luajit as well.

@dndx
Copy link
Member

dndx commented Oct 10, 2019

Ping @javierguerragiraldez.

Copy link
Collaborator

@siddhesh siddhesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a closer look and it looks like you've excluded the actual dynamic bits from my patch because of which this is not really runtime detection at all. You still have to build the whole binary with crc32 support and that kinda defeats the purpose, at least from the distribution point of view.

In summary, you have to make sure that lj_str_hash.c only has the optimised hashing code and build that with sse4.2/crc32 and unconditionally build it into the binary. Then at load time you look at CPU capabilities and decide to either call into those functions or not.

We do this a lot in glibc for dozens of microarchitectures in x86, ppc, s390x and arm64 and it works very well in practice.

src/lj_arch.h Outdated Show resolved Hide resolved
src/lj_str.c Show resolved Hide resolved
src/lj_str.c Show resolved Hide resolved
@siddhesh
Copy link
Collaborator

Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project. It has been almost a year and it is about time that we move on. Only thing I need from Fedora's perspective is regular release tarballs and I'll work on merging all of my patches here :)

@thibaultcha
Copy link
Member Author

Hi all, I have been away from the computer for some time.

@siddhesh

I had a closer look and it looks like you've excluded the actual dynamic bits from my patch because of which this is not really runtime detection at all. You still have to build the whole binary with crc32 support and that kinda defeats the purpose, at least from the distribution point of view.

That is one thing that I forgot to mention in this PR: this patch voluntarily leaves out the bits forcibly asking the binary to be built with SSE4.2/CRC32 support:

  1. @agentzh's comment questioning the compatibility with older ARM architectures is a concern that I share, and I was not able to get my hands on said architectures to properly test this change (I only was able to test this PR on armv8 for now).
  2. In absolute, such a change should be introduced in a subsequent commit (in this PR or a subsequent one) to guarantee the atomicity of the git history; thus ensuring we separate: a) runtime CRC32 detection, b) ARM64 support, and c) auto-enablement of the string hashing optimization at runtime depending on a).

For the time being, this PR is safer as-is by being fully backwards-compatible (i.e. OpenResty already specifies -msse4.2 in its LuaJIT build system on x86).
Besides, it still needs to be decided how exactly this repository should balance OpenResty's tailored use-cases vs. providing a more un-opinionated LuaJIT fork; this is tangential to other on-going discussions, some of which you have been a part of: preserving merge-compatibility, naming of the build macros, release scheduling, and of course, whether or not to enable OpenResty optimizations by default (including this one).

Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project.

This is a great segway now that the topic has been mentioned in my above paragraph, and my take on it is that for the time being (and probably for the foreseeable future), we should, when possible, ensure that merge-compatibility be preserved as the upstream v2.1 branch may still receive bugfixes for as long as the project is not officially declared unmaintained.

@siddhesh
Copy link
Collaborator

Hi all, I have been away from the computer for some time.

My turn to apologize; I'm still kinda away but stole some time for this :)

That is one thing that I forgot to mention in this PR: this patch voluntarily leaves out the bits forcibly asking the binary to be built with SSE4.2/CRC32 support:

Got it. Would it be clearer if the PR title is reworded to remove the 'runtime' bit?

1. [@agentzh's comment](https://github.com/openresty/luajit2/pull/21#discussion_r186247541) questioning the compatibility with older ARM architectures is a concern that I share, and I was not able to get my hands on said architectures to properly test this change (I only was able to test this PR on armv8 for now).

I can help answer that question. Other than the very old xgene1 processor (the APM Mustang processors), all other cores that gcc knows about has CRC support. So while from a theoretical point of view it is unsafe, from a practical standpoint I think the assumption is safe because I don't think anybody actually would think of using xgene to run openresty on production. It is definitely safer than assuming SSE4.2 since there are far more machines in active operation that don't have SSE4.2.

2. In absolute, such a change should be introduced in a subsequent commit (in this PR or a subsequent one) to guarantee the atomicity of the git history; thus ensuring we separate: a) runtime CRC32 detection, b) ARM64 support, and c) auto-enablement of the string hashing optimization at runtime depending on a).

Fair point, if that is the intention I think the change is fine with an updated PR title.

Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project.

This is a great segway now that the topic has been mentioned in my above paragraph, and my take on it is that for the time being (and probably for the foreseeable future), we should, when possible, ensure that merge-compatibility be preserved as the upstream v2.1 branch may still receive bugfixes for as long as the project is not officially declared unmaintained.

I'm not holding my breath anymore on an official announcement :)

@agentzh
Copy link
Member

agentzh commented Jan 13, 2020

This PR needs to rebase to the latest master.

@agentzh
Copy link
Member

agentzh commented Jan 13, 2020

@siddhesh Thanks for looking at this one. You definitely know about the actual arm hardware usage better than me. We can always enable it for the aarch64 world then.

We still need to be careful about merging conflicts. I just spent a lot of time a few days ago to solve the merge conflicts from upstream LuaJIT repo :)

@agentzh
Copy link
Member

agentzh commented Jan 13, 2020

We'd better hurry on this PR, we're about to cut a new release. Hopefully we can have it in the next OpenResty version :)

@agentzh
Copy link
Member

agentzh commented Jan 13, 2020

So with this PR the same x64 openresty binary can work on modern intel chips and amd chips at the same time now?

@siddhesh
Copy link
Collaborator

My understanding is that it's not the goal (hence my request to remove the dynamic word from the PR title) of this PR to have the same binary work on all chips. This PR only takes the lj_str_hash cleanup stuff from my patchset and leaves dynamic detection for a later date.

As for the rebase, the only thing to do here is to drop the _original_hash function since Mike has rewritten those bits.

@agentzh
Copy link
Member

agentzh commented Jan 13, 2020

@siddhesh I see. Thanks for the clarification. I'm really looking forward to binary-level auto-scheduling for different microarchitectures.

@siddhesh
Copy link
Collaborator

That's what #75 does; you can test it in action in moonjit. I think @thibaultcha has a plan to stage the changes in though (I don't fully understand how things fit into OpenResty, etc) so y'all might want to discuss that.

@thibaultcha
Copy link
Member Author

@siddhesh Please have a look at the latest commit which adds the bits we previously mentioned to unconditionally build the binary with SSE4.2/CRC32.

* Added missing targets to .PHONY
* Made the unit_test.sh script executable
* Fixed minor styling issues
@siddhesh
Copy link
Collaborator

@thibaultcha I'm afraid that still does not help. I just skimmed through the patchset and building the entire binary with -msse4.2 is incorrect because it will not run on CPUs that don't have SSE4.2; the compiler is free to use SSE4.2 instructions wherever it sees fit throughout the code base whereas we want to use it only for the string hash function. That is why we need to build only the SSE4.2 bits with that flag. Likewise for the aarc64 crc32 function, although that's much less of a problem since crc32 is common on aarch64 servers.

src/Makefile Outdated Show resolved Hide resolved
@thibaultcha thibaultcha force-pushed the dynamic-sse42 branch 4 times, most recently from 5d3d960 to f8a7ea0 Compare January 14, 2020 20:27
@siddhesh
Copy link
Collaborator

@thibaultcha I haven't had a chance to do a thorough review, but a quick skim gives me the impression that this should work correctly. I'll sync this into moonjit this weekend (since the patchset in moonjit is slightly different, only in some minor details and tests) and let you know if I find issues.

src/lj_str.h Outdated Show resolved Hide resolved
thibaultcha and others added 4 commits January 15, 2020 11:32
…'-msse4.2'.

Co-authored-by: iSage <[email protected]>
Co-authored-by: Siddhesh Poyarekar <[email protected]>
Only, available in ARMv8, the CRC32 instructions are enabled when LuaJIT
is compiled with `-march=armv8-a+crc`.

Co-authored-by: Debayan Ghosh <[email protected]>
@yangshuxin
Copy link

As a general question, in order to be "dynamic", we have to replace direct, possibly inlinable, function calls to be indirectly function calls. Unfortunately, call-sites involved are pretty frequent. Do you see any performance penalty in your stress testing?

{
return (const uint32_t*)(void*)str;
}

/* hash string with len in [1, 4) */
static LJ_AINLINE uint32_t lj_str_hash_1_4(const char* str, uint32_t len)
static LJ_NOINLINE uint32_t lj_str_hash_1_4(const char* str, uint32_t len)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous comment somehow is gone. I try to understand this change again. How come you don't want to have this function inlined? As far as I can tell, it's perfect candidate for inlining.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because previous versions of this patch have a particular treatment for LJ_AINLINE:

#undef LJ_AINLINE
#define LJ_AINLINE

I agree that it seems like a good candidate for inlining, but my reasoning was to wait until we have some micro benchmarks to run before making this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants