implement lzcnt and tzcnt #145

Ekleog-NEAR · 2022-10-06T14:12:40Z

Performance changes are:

For i64.clz, compile-time gets 40% faster and runtime gets 5x faster
For i64.ctz, compile-time gets 35-40% faster and runtime gets 5-6x faster

(The difference between results probably comes from variance in measurement)

We should add BMI1 and LZCNT to at least our recommended feature list on https://near-nodes.io/validator/hardware ; and probably our required feature list.

Performance changes are: - For i32.clz, compile-time gets 40% faster and runtime gets 5x faster - For i32.ctz, compile-time gets 35-40% faster and runtime gets 5-6x faster (The difference between results probably comes from variance in measurement)

nagisa · 2022-10-10T11:23:49Z

lib/compiler-singlepass/src/emitter_x64.rs

+    fn arch_has_xzcnt(&self) -> bool {
+        std::is_x86_feature_detected!("bmi1") // tzcnt
+            && std::is_x86_feature_detected!("lzcnt")
+    }


We cannot/should not do this – we don’t always run the code we generate on the same machine. Instead should consult with compiler::target::CpuFeatures

I originally tried to use it, but I can’t figure out how to access CpuFeature from this impl, which is on top of VecAssembler<X64Relocation>.

I can add a CpuFeature argument to arch_has_xzcnt, and add such a field to compiler_singlepass::config::Singlepass so that we add support for it; but then it’d just be CpuFeature::for_host() everywhere anyway. Maybe it’s better than using ambient authority like is_x86_feature_detected! I guess

However, I’m curious, when do we ever share artifacts between machines? Overall it would seem like a quite bad idea, especially as aarch64 is picking up steam

However, I’m curious, when do we ever share artifacts between machines? Overall it would seem like a quite bad idea, especially as aarch64 is picking up steam

We store compiled artifacts into the RocksDB store. This database can be moved or shared, I believe. Interestingly our cache key does not depend on the CPU features, but I’d imagine we have enough foresight to change the seed here when we change this.

A most straightforward example of a database move would be a GCP instance restarting with a different CPU.

As for accessing CpuFeature from within VecAssembler, I imagine it should be sufficient to thread Target up to the FuncGen which could then check for the features stored within Target before deciding whether to use the fallback or the simple implementation.

Hmm do we actually use no_cpu_compatibility_check in production? I’d expect us to actually use CPU compatibility checks.

I guess my main issue is that if people compile with lzcnt/tzcnt enabled and then migrate to a machine without that (which seems unlikely but still), then it’ll result in mis-execution, as lzcnt is rep bsr and that’d execute as bsr there, thus giving out wrong answers if the value under count was 0. And this could lead to slashing.

I guess one solution would be to assert when creating wasmer2_runner, in new_for_target, that bmi1/lzcnt are enabled; and add bmi1/lzcnt as required features in our docs. This way we could make sure that no misuse happens. That said, I’m not sure we should support use cases involving moving a rocksdb store to an older CPU… going to start a zulip thread on it I guess!

Either way, this change makes sense, I’ve just pushed it :)

nagisa · 2022-10-10T11:28:10Z

Overall, it seems reasonable to me for us to emit lzcnt/tzcnt. These instructions are quite commonplace nowadays.

3302: Some Refactor of Singlepass compiler to have better error and cpu features handling r=ptitSeb a=ptitSeb # Description * Some refactor of Singlepass compiler to have a better handling of Target CPU Feature and Error propagation * Added support for TZCNT and LZCNT opcode on x86_64 backend (based on near/wasmer#145) For #3305 Co-authored-by: ptitSeb <[email protected]>

implement lzcnt and tzcnt

7a569bc

Performance changes are: - For i32.clz, compile-time gets 40% faster and runtime gets 5x faster - For i32.ctz, compile-time gets 35-40% faster and runtime gets 5-6x faster (The difference between results probably comes from variance in measurement)

Ekleog-NEAR requested review from matklad and nagisa October 6, 2022 14:20

nagisa reviewed Oct 10, 2022

View reviewed changes

match CPU features from declared target

73d798b

nagisa approved these changes Oct 14, 2022

View reviewed changes

Ekleog-NEAR merged commit bd153a6 into near:near-main Oct 14, 2022

Ekleog-NEAR mentioned this pull request Oct 14, 2022

Also require BMI1 and LZCNT instructions near/node-docs#55

Open

This was referenced Nov 14, 2022

Some Refactor of Singlepass compiler to have better error and cpu features handling wasmerio/wasmer#3302

Merged

Backport Singlepass optimisations from Near wasmerio/wasmer#3305

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement lzcnt and tzcnt #145

implement lzcnt and tzcnt #145

Ekleog-NEAR commented Oct 6, 2022 •

edited

Loading

nagisa Oct 10, 2022

Ekleog-NEAR Oct 10, 2022

nagisa Oct 11, 2022

nagisa Oct 11, 2022

nagisa Oct 11, 2022

Ekleog-NEAR Oct 12, 2022

nagisa commented Oct 10, 2022

implement lzcnt and tzcnt #145

implement lzcnt and tzcnt #145

Conversation

Ekleog-NEAR commented Oct 6, 2022 • edited Loading

nagisa Oct 10, 2022

Choose a reason for hiding this comment

Ekleog-NEAR Oct 10, 2022

Choose a reason for hiding this comment

nagisa Oct 11, 2022

Choose a reason for hiding this comment

nagisa Oct 11, 2022

Choose a reason for hiding this comment

nagisa Oct 11, 2022

Choose a reason for hiding this comment

Ekleog-NEAR Oct 12, 2022

Choose a reason for hiding this comment

nagisa commented Oct 10, 2022

Ekleog-NEAR commented Oct 6, 2022 •

edited

Loading