Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Nov 30, 2020

What changes were proposed in this pull request?

When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of java.net.URI:

> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
> val rewrittenURI = URI.create(uri.toString())

> new URI(rewrittenURI.getScheme(),
      rewrittenURI.getAuthority(), 
      rewrittenURI.getPath(),
      query, 
      rewrittenURI.getFragment()).toString
result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0

In Spark's stage page, the URL of "/taskTable" contains query parameter order[0][dir]. After encoding twice, the query parameter becomes order%255B0%255D%255Bdir%255D and it will be decoded as order%5B0%5D%5Bdir%5D instead of order[0][dir]. As a result, there will be NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
Other than that, the other parameter may not work as expected after encoded twice.

This PR is to fix the bug by calling the method URI.create(String URL) directly. This convenience method can avoid encoding twice on the query parameter.

> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"
> URI.create(s"$uri?$query").toString
result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0

> URI.create(s"$uri?$query").getQuery
result: order[0][column]=0

Why are the changes needed?

Fix a potential bug when Spark's reverse proxy is enabled.
The bug itself is similar to #29271.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add a new unit test.
Also, Manual UI testing for master, worker and app UI with an nginx proxy

Spark config:

spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/

nginx config:

server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = @handle_redirects;
    }
    location @handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also try decoding twice here. It works as well.

Copy link
Member Author

@gengliangwang gengliangwang Nov 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this one, I am not very confident to decode the URI, which may break the authority part of the URI.
To be safe, let's decode the parameter only in this PR.

@SparkQA
Copy link

SparkQA commented Nov 30, 2020

Test build #131995 has finished for PR 30552 at commit 0b84816.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-33611][UI] Decode Query parameters of the redirect URL for reverse proxy [SPARK-33611][UI] Decode the redirect URL for reverse proxy if percent-encoded twice Dec 1, 2020
@gengliangwang gengliangwang changed the title [SPARK-33611][UI] Decode the redirect URL for reverse proxy if percent-encoded twice [WIP][SPARK-33611][UI] Decode the redirect URL for reverse proxy if percent-encoded twice Dec 1, 2020
@gengliangwang gengliangwang changed the title [WIP][SPARK-33611][UI] Decode the redirect URL for reverse proxy if percent-encoded twice [SPARK-33611][UI] Avoid encoding twice on the query parameter of rewritten proxy URL Dec 1, 2020
}
rewrittenURI.normalize()
// SPARK-33611: use method `URI.create` to avoid percent-encoding twice on the query string.
URI.create(uri.toString() + queryString).normalize()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toString() is unnecesary here BTW.
So this relies on query already being escaped. That may be fine for the logic of this code, although in another world it would have been better to let this method take unescaped query strings and manage it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, Sean.

toString() is unnecesary here BTW.

uri is StringBuilder here. The original code also use toString().

So this relies on query already being escaped. That may be fine for the logic of this code, although in another world it would have been better to let this method take unescaped query strings and manage it here.

The query string here should be encoded already. I am not very familiar with this topic. Is there any better solution to handle both escaped/unescaped query string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, OK, ignore the toString() comment.
I just mean that semantically, encoding is a URI detail. The program should probably not worry about it except when going to and from URIs. But that's not how this code is structured. It clearly already expects and escaped string as argument, which then means callers have to deal with it. But, I don't think it's necessary to change. If it is escaped then yes this is a fix.

@gengliangwang
Copy link
Member Author

@cloud-fan @srowen Thanks for the review!

@gengliangwang
Copy link
Member Author

Merging to master and 3.0

gengliangwang added a commit that referenced this pull request Dec 1, 2020
…itten proxy URL

### What changes were proposed in this pull request?

When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
> val rewrittenURI = URI.create(uri.toString())

> new URI(rewrittenURI.getScheme(),
      rewrittenURI.getAuthority(),
      rewrittenURI.getPath(),
      query,
      rewrittenURI.getFragment()).toString
result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
```

In Spark's stage page, the URL of "/taskTable" contains query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
Other than that, the other parameter may not work as expected after encoded twice.

This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"
> URI.create(s"$uri?$query").toString
result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0

> URI.create(s"$uri?$query").getQuery
result: order[0][column]=0
```

### Why are the changes needed?

Fix a potential bug when Spark's reverse proxy is enabled.
The bug itself is similar to #29271.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new unit test.
Also, Manual UI testing for master, worker and app UI with an nginx proxy

Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```

Closes #30552 from gengliangwang/decodeProxyRedirect.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 5d0045e)
Signed-off-by: Gengliang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants