Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jul 28, 2020

What changes were proposed in this pull request?

When https is enabled for Spark UI, an HTTP request will be redirected as an encoded HTTPS URL: https://github.com/apache/spark/pull/10238/files#diff-f79a5ead735b3d0b34b6b94486918e1cR312

When we create the redirect url, we will call getRequestURI and getQueryString. Both two methods may return an encoded string. However, we pass them directly to the following URI constructor

URI(String scheme, String authority, String path, String query, String fragment)

As this URI constructor assumes both path and query parameters are decoded strings, it will encode them again. This makes the redirect URL encoded twice.

This problem is on stage page with HTTPS enabled. The URL of "/taskTable" contains query parameter order%5B0%5D%5Bcolumn%5D. After encoded it becomes order%255B0%255D%255Bcolumn%255D and it will be decoded as order%5B0%5D%5Bcolumn%5D instead of order[0][dir]. When the parameter order[0][dir] is missing, there will be an excetpion from:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
and the stage page fail to load.

To fix the problem, we can try decoding the query parameters before encoding it. This is to make sure we encode the URL

Why are the changes needed?

Fix a UI issue when HTTPS is enabled

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new Unit test + manually test on a cluster

@SparkQA
Copy link

SparkQA commented Jul 28, 2020

Test build #126720 has finished for PR 29271 at commit 3a4c844.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

Copy link
Member

@zsxwing zsxwing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the original HTTP URL could be encoded already, so that the HTTPS result will be encoded twice.

It would be great if we can explain the issue more explicitly. Maybe something like this:

When we create the redirect url, we will call getRequestURI and getQueryString. These two methods return an encoded string. However, we pass them directly to the following URI constructor

URI(String scheme,
               String authority,
               String path, String query, String fragment)

As this URI constructor assumes both path and query parameters are decoded strings, it will encode them again. This makes the redirect url to be encoded twice.

}
}

test("Avoid encoding URL twice on https redirect") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: miss jira number

}
val authority = s"$redirectServer:$port"
new URI(scheme, authority, path, query, null).toString
val queryEncoding = if (request.getQueryEncoding != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to use the same approach in createProxyURI? Looks like we can try to create a method to share in createRedirectURI and createProxyURI.

    val rewrittenURI = URI.create(uri.toString())
    if (query != null) {
      return new URI(
          rewrittenURI.getScheme(),
          rewrittenURI.getAuthority(),
          rewrittenURI.getPath(),
          query,
          rewrittenURI.getFragment()
        ).normalize()
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried but it doesn't do the decoding with URI.toString

scala> new URI("https://g.com/ctx%281%29?a%5B0%5D=b").normalize().toString
res28: String = https://g.com/ctx%281%29?a%5B0%5D=b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can build a URI and call get getPath and getQuery, and then build a new URI string from the decode result. ```
scala> new URI("https://g.com/ctx%281%29?a%5B0%5D=b").getPath
res31: String = /ctx(1)

scala> new URI("https://g.com/ctx%281%29?a%5B0%5D=b").getQuery
res33: String = a[0]=b

But it seems equivalent of the current code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried but it doesn't do the decoding with URI.toString

URI.toString doesn't do the decoding. It will encode characters if necessary. For example,

scala> val uri = new java.net.URI(null, null, "foo bar", "foo bar", null).toString
uri: String = foo%20bar?foo%20bar

Anyway, this doesn't matter. I tried to consolidate these two method but it didn't look better.

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126879 has finished for PR 29271 at commit ef6cb87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
val authority = s"$redirectServer:$port"
new URI(scheme, authority, path, query, null).toString
val queryEncoding = if (request.getQueryEncoding != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried but it doesn't do the decoding with URI.toString

URI.toString doesn't do the decoding. It will encode characters if necessary. For example,

scala> val uri = new java.net.URI(null, null, "foo bar", "foo bar", null).toString
uri: String = foo%20bar?foo%20bar

Anyway, this doesn't matter. I tried to consolidate these two method but it didn't look better.

@gengliangwang
Copy link
Member Author

@cloud-fan @srowen @zsxwing Thanks for the review!

@gengliangwang
Copy link
Member Author

gengliangwang commented Aug 1, 2020

Merging to master and 3.0

gengliangwang added a commit that referenced this pull request Aug 1, 2020
### What changes were proposed in this pull request?

When https is enabled for Spark UI, an HTTP request will be redirected as an encoded HTTPS URL: https://github.com/apache/spark/pull/10238/files#diff-f79a5ead735b3d0b34b6b94486918e1cR312

When we create the redirect url, we will call getRequestURI and getQueryString. Both two methods may return an encoded string. However, we pass them directly to the following URI constructor
```
URI(String scheme, String authority, String path, String query, String fragment)
```
As this URI constructor assumes both path and query parameters are decoded strings, it will encode them again. This makes the redirect URL encoded twice.

This problem is on stage page with HTTPS enabled. The URL of "/taskTable" contains query parameter `order%5B0%5D%5Bcolumn%5D`. After encoded it becomes  `order%255B0%255D%255Bcolumn%255D` and it will be decoded as `order%5B0%5D%5Bcolumn%5D` instead of `order[0][dir]`.  When the parameter `order[0][dir]` is missing, there will be an excetpion from:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
and the stage page fail to load.

To fix the problem, we can try decoding the query parameters before encoding it. This is to make sure we encode the URL

### Why are the changes needed?

Fix a UI issue when HTTPS is enabled

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new Unit test + manually test on a cluster

Closes #29271 from gengliangwang/urlEncode.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 71aea02)
Signed-off-by: Gengliang Wang <[email protected]>
gengliangwang added a commit that referenced this pull request Dec 1, 2020
…itten proxy URL

### What changes were proposed in this pull request?

When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
> val rewrittenURI = URI.create(uri.toString())

> new URI(rewrittenURI.getScheme(),
      rewrittenURI.getAuthority(),
      rewrittenURI.getPath(),
      query,
      rewrittenURI.getFragment()).toString
result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
```

In Spark's stage page, the URL of "/taskTable" contains query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
Other than that, the other parameter may not work as expected after encoded twice.

This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"
> URI.create(s"$uri?$query").toString
result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0

> URI.create(s"$uri?$query").getQuery
result: order[0][column]=0
```

### Why are the changes needed?

Fix a potential bug when Spark's reverse proxy is enabled.
The bug itself is similar to #29271.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new unit test.
Also, Manual UI testing for master, worker and app UI with an nginx proxy

Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```

Closes #30552 from gengliangwang/decodeProxyRedirect.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
gengliangwang added a commit that referenced this pull request Dec 1, 2020
…itten proxy URL

### What changes were proposed in this pull request?

When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
> val rewrittenURI = URI.create(uri.toString())

> new URI(rewrittenURI.getScheme(),
      rewrittenURI.getAuthority(),
      rewrittenURI.getPath(),
      query,
      rewrittenURI.getFragment()).toString
result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
```

In Spark's stage page, the URL of "/taskTable" contains query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
Other than that, the other parameter may not work as expected after encoded twice.

This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"
> URI.create(s"$uri?$query").toString
result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0

> URI.create(s"$uri?$query").getQuery
result: order[0][column]=0
```

### Why are the changes needed?

Fix a potential bug when Spark's reverse proxy is enabled.
The bug itself is similar to #29271.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new unit test.
Also, Manual UI testing for master, worker and app UI with an nginx proxy

Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```

Closes #30552 from gengliangwang/decodeProxyRedirect.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 5d0045e)
Signed-off-by: Gengliang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants