
Conversation

@mdedetrich
Contributor

The generated DNS id should be random, see https://datatracker.ietf.org/doc/html/rfc5452#section-4.3. One additional thing that needs to be confirmed is whether we need to handle collisions.

@He-Pin
Member

He-Pin commented Jun 7, 2023

  1. Collisions should be handled too.
  2. The call to nextId is not thread safe; does this PR handle this?

@mdedetrich
Contributor Author

@Claudenw Pinging

@mdedetrich
Contributor Author

Collisions should be handled too.

So this isn't entirely clear, because I have seen other DNS resolver implementations (e.g. systemd-resolved) that just use a standard random generator, which is also prone to collisions.

The call to nextId is not thread safe; does this PR handle this?

I don't think this is an issue: the entire implementation is private and I don't think it can be called from multiple threads at once. Even if it can, it's not a problem; you might get a slightly "wrong" result, but that doesn't matter because we just want random numbers.

@pjfanning
Member

The old code wasn't thread safe either - why do we need to worry about this code being thread safe?

To make the old code thread safe, we'd need to use an AtomicInteger or something like that.
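For illustration, a minimal sketch of what an AtomicInteger-based counter could look like — this is an assumption for discussion, not the code in this PR:

```scala
import java.util.concurrent.atomic.AtomicInteger

object AtomicIdSketch {
  private val counter = new AtomicInteger(0)

  // getAndIncrement is atomic, and masking to 16 bits makes the id wrap
  // across the full unsigned range instead of growing without bound.
  def nextId(): Short = (counter.getAndIncrement() & 0xFFFF).toShort
}
```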

@pjfanning
Member

I'm not an expert, but any idea why the value is a short? If we used a type like int or long, the risk of collisions would be reduced.

@mdedetrich
Contributor Author

I'm not an expert, but any idea why the value is a short? If we used a type like int or long, the risk of collisions would be reduced.

It has to be according to the standard, see https://datatracker.ietf.org/doc/html/rfc5452#section-4.3. The generated IDs fill a 2-octet (i.e. 2-byte, 16-bit) range, which in Java/Scala is a Short.

Actually this new implementation is strictly better than the current one, because the current one effectively only used the non-negative half of the Short range, since it just incremented from 0 and Short.MaxValue + 1 doesn't overflow into Short.MinValue.
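A small illustration of the point about the range (not code from the PR):

```scala
// Illustration only: a DNS message id is a 16-bit field, so all 65536 values
// map onto Scala's signed Short once masked and cast.
val idFromLowHalf: Short = (0x0001 & 0xFFFF).toShort  // 1
val idFromHighHalf: Short = (0xFFFF & 0xFFFF).toShort // -1, still a valid id on the wire
// A counter that only ever counts from 0 up to Short.MaxValue never produces
// the "negative" Shorts above, i.e. it only uses half of the id space.
```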

@He-Pin
Member

He-Pin commented Jun 7, 2023

A collision can happen in DnsClient.scala if there are two in-flight queries with the same id.

I submitted a PR in Akka for this, but it can also be done with an FSM-like approach.

@mdedetrich
Contributor Author

A collision can happen in DnsClient.scala if there are two in-flight queries with the same id.

I submitted a PR in Akka for this, but it can also be done with an FSM-like approach.

If this is regarding it not being thread safe, we can solve that separately. Let's fix one problem at a time; the RC is on our doorstep.

On another note, CI is failing because the tests need to be updated. I'll do this in a bit.

@pjfanning
Member

pjfanning commented Jun 7, 2023

Do we know what happens if we change pekko.io.dns.resolver to use inet-address instead of async-dns?

@mdedetrich
Contributor Author

Do we know what happens if we change pekko.io.dns.resolver to use inet-address instead of async-dns?

Off the top of my head, this uses a completely separate implementation from the Java stdlib, so it shouldn't be an issue.

@He-Pin
Member

He-Pin commented Jun 7, 2023

Do we know what happens if we change pekko.io.dns.resolver to use inet-address instead of async-dns?

The Java lib based one may be slower and blocking?

@mdedetrich
Contributor Author

mdedetrich commented Jun 7, 2023

Do we know what happens if we change pekko.io.dns.resolver to use inet-address instead of async-dns?

The Java lib based one may be slower and blocking?

From what @raboof said, the reason why async-dns was implemented in the first place is that there are caching issues with the Java implementation.

It being slower due to blocking is likely a secondary reason.

@mdedetrich force-pushed the generate-random-dns-ids branch from 9efc07b to 8c346e2 on June 7, 2023 12:06
@pjfanning
Member

I raised #373 for the lower priority issue of tracking ids

@mdedetrich
Contributor Author

So I have just updated the tests and rebased/force-pushed the changes. The reason the tests were failing is that they assumed nextId was incrementally increasing and hence checked against hardcoded ids. I have updated the tests so that we don't test the raw ids but rather their properties, i.e. the test first expects an initial id from dnsClient1:

```scala
val firstId = dnsClient1.expectMsgPF() {
  case q4: Question4 if q4.name == "cats.com" =>
    q4.id
}
```

and then captures a second id later on and checks that it differs, such as

```scala
val secondId = dnsClient2.expectMsgPF() {
  case q4: Question4 if q4.name == "cats.com" && q4.id != firstId =>
    q4.id
}
```

i.e. in this specific case we can't test for hardcoded ids, but we can still make sure that secondId is not the same as firstId.

@mdedetrich
Contributor Author

I raised #373 for the lower priority issue of tracking ids

I am going to confirm whether collisions are an issue. Note that one advantage of the algorithm that @Claudenw provided is that it's deliberately designed to minimize the number of collisions; from https://en.wikipedia.org/wiki/Double_hashing#Enhanced_double_hashing:

the interval depends on the data, so that values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.
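In rough terms, the generator advances the id by an increment that itself changes on every call. A minimal sketch of that recurrence, assuming illustrative variable names rather than the ones in this PR:

```scala
final class EnhancedDoubleHashSketch(seed: Long, seed2: Long) {
  private var index: Long = seed
  private var increment: Long = seed2
  private var count: Long = 1L

  def nextId(): Short = synchronized {
    val result = (index & 0xFFFF).toShort
    // Enhanced double hashing: the step itself changes on every call, which
    // spreads successive ids and reduces repeated collisions/clustering.
    index -= increment
    increment -= count
    count += 1
    result
  }
}
```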

@mdedetrich
Contributor Author

So I just checked systemd-resolved and it does appear to handle id collisions; it does this by tracking the state of current transactions (see https://github.com/systemd/systemd/blob/main/src/resolve/resolved-dns-transaction.c#L195-L196). Will work on this now.
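Conceptually, that "track in-flight transactions" approach amounts to something like the following sketch (illustrative only, names are hypothetical, not code from systemd-resolved or this PR):

```scala
import scala.collection.mutable
import scala.util.Random

final class InFlightAwareIdGenerator(random: Random = new Random()) {
  private val inFlight = mutable.Set.empty[Short]

  // Keep drawing random 16-bit ids until one is found that is not already used
  // by an in-flight query, then reserve it (assumes far fewer than 65536
  // queries are in flight at once).
  def acquireId(): Short = synchronized {
    var id = random.nextInt(0x10000).toShort
    while (inFlight.contains(id)) id = random.nextInt(0x10000).toShort
    inFlight += id
    id
  }

  // Release the id once the query completes or times out.
  def releaseId(id: Short): Unit = synchronized { inFlight -= id }
}
```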

```scala
idIncrement -= {
  idCount += 1; idCount - 1
}
result
```
Member

@He-Pin Jun 7, 2023


nextId can be called concurrently, so I think two threads can get the same result.
idIncrement and idIndex should be protected with an AtomicLong.

Member


I was using #288

@pjfanning previously approved these changes Jun 7, 2023
Member

@pjfanning left a comment


lgtm - we can continue to look at fully collision-proof implementations but this PR is probably enough to unblock an RC

@pjfanning added the bug ("Something isn't working") label Jun 7, 2023
@mdedetrich
Contributor Author

Let's wait a bit before merging this; I think I was wrong in what I said before about collisions and we do need to handle this.

@pjfanning self-requested a review June 7, 2023 13:00
@mdedetrich
Contributor Author

mdedetrich commented Jun 7, 2023

nextId can be called concurrently, so I think two threads can get the same result. idIncrement and idIndex should be protected with an AtomicLong.

So having gone through the codebase, nextId() is only called (indirectly) inside a receive block, which means it's only going to be called in a single-threaded context (this is something the actor model guarantees).

Did I miss something?
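For context, a minimal sketch of why a call made from inside receive is effectively single-threaded (an illustrative actor, not the real DnsClient):

```scala
import org.apache.pekko.actor.Actor

class DnsClientLikeActor(nextId: () => Short) extends Actor {
  // An actor processes one message at a time, so anything called from inside
  // receive never runs concurrently with itself for this actor instance.
  def receive: Receive = {
    case name: String =>
      val id = nextId() // no extra locking needed from this call site
      sender() ! (id -> name)
  }
}
```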

@pjfanning
Member

I tried a solution with an AtomicReference around the state but its performance was slower than the synchronization and Semaphore locking variants. Approximately in SecureRandom territory in the 8-thread microbenchmark.

This isn't too surprising; I suspect that if you wanted to get any tangible benefit from AtomicLong/volatile you would need to use it on the individual variables (i.e. index/counter etc.) where it's needed, which is hard to do correctly.

Unfortunately, the 3 numbers in the state are all important. And they should be updated together, so I find it hard to see any solution based on AtomicLongs or AtomicReferences.

Using the Semaphore locking has the best result in my testing. Could we just go with that for RC1? We are very likely to need an RC2 so there is time for someone to find a better solution.
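For reference, a whole-state CAS approach like the one tried above could look roughly like the sketch below (illustrative names; not the variant that was actually benchmarked). All three numbers advance together in one compare-and-set, at the cost of retries and allocation under contention:

```scala
import java.util.concurrent.atomic.AtomicReference

final case class IdState(index: Long, increment: Long, count: Long)

final class CasIdGenerator(initial: IdState) {
  private val state = new AtomicReference(initial)

  @annotation.tailrec
  def nextId(): Short = {
    val current = state.get()
    val next = IdState(
      index = current.index - current.increment,
      increment = current.increment - current.count,
      count = current.count + 1)
    if (state.compareAndSet(current, next)) (current.index & 0xFFFF).toShort
    else nextId() // retry if another thread won the CAS race
  }
}
```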

@mdedetrich
Contributor Author

mdedetrich commented Jun 15, 2023

Unfortunately, the 3 numbers in the state are all important. And they should be updated together, so I find it hard to see any solution based on AtomicLongs or AtomicReferences.

Ordinarily yes, but we aren't looking for complete correctness here, and I can't think of any way to abuse this atm.

Using the Semaphore locking has the best result in my testing. Could we just go with that for RC1? We are very likely to need an RC2 so there is time for someone to find a better solution.

Fine with me unless there are some other real objections. Will wait till EOD.

@mdedetrich force-pushed the generate-random-dns-ids branch 2 times, most recently from 307bd25 to 6ae1d29 on June 15, 2023 19:21
@mdedetrich
Contributor Author

So I have just pushed the branch using the simple version of enhanced double hashing with the synchronized keyword. Here are the benchmarks:

```
[info] IdGeneratorBanchmark.measureEnhancedDoubleHash                 thrpt    3  0,377 ±  0,032  ops/ns
[info] IdGeneratorBanchmark.measureSecureRandom                       thrpt    3  0,001 ±  0,001  ops/ns
[info] IdGeneratorBanchmark.measureThreadLocalRandom                  thrpt    3  0,308 ±  0,057  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureEnhancedDoubleHash  thrpt    3  0,025 ±  0,013  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureSecureRandom        thrpt    3  0,001 ±  0,001  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureThreadLocalRandom   thrpt    3  0,609 ±  0,014  ops/ns
```

Note that I have modified the benchmarks to be more indicative of what we are actually testing in reality: the original benchmarks used 8 threads, however it's unlikely in reality that we will have that many threads hitting the nextId() function at once with such high contention, so instead I set the default to just a single thread but added another set of benchmarks for multiple threads.

@pjfanning @IainHull Let me know if I may have missed something

@mdedetrich force-pushed the generate-random-dns-ids branch from 6ae1d29 to b0bc719 on June 15, 2023 19:28
@mdedetrich force-pushed the generate-random-dns-ids branch from b0bc719 to b0e3ac1 on June 15, 2023 19:37
```scala
private var count = 1L

def nextId(): Short = synchronized {
  val result = (0xFFFFFFFF & index).asInstanceOf[Short]
```
Member

@pjfanning Jun 15, 2023


Would it be possible to consider https://github.com/mdedetrich/incubator-pekko/blob/514ac493c3d530aea16f86d702b7d2ed358229e1/actor/src/main/scala/org/apache/pekko/util/UniqueRandomShortProvider.scala ? @alexandru suggested that 'synchronized' is a bad solution and the semaphore can time out (while synchronized can just hang).

Contributor Author


Sure, I'll look at it tomorrow; will be busy tonight.

Contributor Author

@mdedetrich Jun 16, 2023


So I have just updated the version to use a semaphore as well as making the lock timeout configurable, plus I did some other improvements to make the config reading/parsing consistent with the rest of Pekko.

I used the exact same version that you suggested (with a minor performance improvement to avoid creating a new class in the failure case of acquiring a lock) and the performance nosedived.

```
[info] IdGeneratorBanchmark.measureEnhancedDoubleHash                 thrpt    3  ≈ 10⁻¹⁰            ops/ns
[info] IdGeneratorBanchmark.measureSecureRandom                       thrpt    3    0,001 ±   0,001  ops/ns
[info] IdGeneratorBanchmark.measureThreadLocalRandom                  thrpt    3    0,311 ±   0,031  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureEnhancedDoubleHash  thrpt    3  ≈ 10⁻¹⁰            ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureSecureRandom        thrpt    3    0,001 ±   0,001  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureThreadLocalRandom   thrpt    3    0,608 ±   0,036  ops/ns
```

Contributor Author


So using a much shorter timeout the results are better (i.e. these are the results with 10 nanos):

```
[info] IdGeneratorBanchmark.measureEnhancedDoubleHash                 thrpt    3  0,001 ±  0,001  ops/ns
[info] IdGeneratorBanchmark.measureSecureRandom                       thrpt    3  0,001 ±  0,001  ops/ns
[info] IdGeneratorBanchmark.measureThreadLocalRandom                  thrpt    3  0,312 ±  0,074  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureEnhancedDoubleHash  thrpt    3  0,001 ±  0,001  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureSecureRandom        thrpt    3  0,001 ±  0,001  ops/ns
[info] IdGeneratorBanchmark.multipleThreadsMeasureThreadLocalRandom   thrpt    3  0,622 ±  0,018  ops/ns
```

I am not sure what problem we are trying to solve with a semaphore though; in such a hot loop with only a single permit it's actually slower than synchronized, which the JVM heavily optimizes for this effectively single-threaded (i.e. permit of 1) case.

Contributor Author


@pjfanning You just approved the PR, did you read this?

Member

@He-Pin left a comment


lgtm, and it's CPU bound, so it's ok to use synchronized.

There is already a util for this; let's make use of that one.

And to be virtual-thread friendly, use a lock instead, which is fast too.
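A sketch of the lock-based variant being suggested (illustrative only; the step shown is a placeholder, not the enhanced double hashing used in the PR). ReentrantLock avoids the virtual-thread pinning that synchronized blocks can cause on current JDKs:

```scala
import java.util.concurrent.locks.ReentrantLock

final class LockBasedIdGenerator(initial: Long) {
  private val lock = new ReentrantLock()
  private var index: Long = initial

  def nextId(): Short = {
    lock.lock()
    try {
      val result = (index & 0xFFFF).toShort
      index += 1 // placeholder step; the real generator advances differently
      result
    } finally lock.unlock()
  }
}
```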

@mdedetrich force-pushed the generate-random-dns-ids branch 2 times, most recently from e015bbc to 2500db1 on June 16, 2023 07:38
Member

@pjfanning left a comment


lgtm

```scala
private val onAcquireFailure = IdGenerator.random(seed)

override final def nextId(): Short = {
  if (lock.tryAcquire(semaphoreTimeout.length, semaphoreTimeout.unit)) {
```
Member


revert to synchronized if you find it is working better for you

Contributor Author


Okay I will do so

Member


By the way, I just merged a PR fixing a small issue in the negative ID fix. I have resolved the conflict in your PR/branch, so you might want to pull the latest code.

@mdedetrich force-pushed the generate-random-dns-ids branch 4 times, most recently from 8513519 to 045df1b on June 16, 2023 09:07
@pjfanning
Member

@mdedetrich this PR is showing a Merge Conflict in GitHub

@mdedetrich force-pushed the generate-random-dns-ids branch from 045df1b to e9be00d on June 16, 2023 09:09
@mdedetrich
Contributor Author

@mdedetrich this PR is showing a Merge Conflict in GitHub

Fixed

@IainHull
Contributor

@mdedetrich Sorry for the delay getting back, this looks good.

@mdedetrich
Contributor Author

@IainHull Note that I made a ticket for anyone who wants to further improve the performance of the enhanced double hasher: #411
