Per-request proxies in HttpClient #35992

Open
LeaFrock opened this issue May 22, 2019 · 35 comments
Labels
area-System.Net.Http, design-discussion (Ongoing discussion about design without consensus)
Milestone

Comments

@LeaFrock
Contributor

I'm developing a web crawler framework. As usual, the crawler needs a pool of proxies to send its HTTP requests through.
For example:

public async Task<string> GetRspTextByProxyAsync(IWebProxy proxy)
{
    var handler = new SocketsHttpHandler()
    {
        UseProxy = true,
        Proxy = proxy
    };
    using (var client = new HttpClient(handler))
    {
        return await client.GetStringAsync("https://github.com"); // just for example
    }
}

As we all know, 'new HttpClient()' is not a recommended way of creating clients; it causes socket-exhaustion exceptions when too many clients are created.
Therefore, I want to use IHttpClientFactory instead. But there are still some problems. As far as I know, DefaultHttpClientFactory reuses HttpMessageHandler instances, but one handler must be created per proxy. If I have thousands of proxies, I have to create thousands of handlers.
Also, I don't know how to use DI in this situation. I've searched Stack Overflow for solutions such as the one below:

services.AddHttpClient("proxy1").ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
{
    UseProxy = true,
    Proxy = new WebProxy("http://localhost", 8888)
});
services.AddHttpClient("proxy2").ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
{
    UseProxy = true,
    Proxy = new WebProxy("http://localhost", 8889)
});
services.AddHttpClient("proxy3").ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
{
    UseProxy = true,
    Proxy = new WebProxy("http://localhost", 8890)
});
services.AddHostedService<MyHostedService>();

That's too ugly, and I need to create clients dynamically because the number of proxies is unknown before the app runs. I can't hard-code them.
What should we do to get the benefits of HttpClient (with thousands of web proxies) while avoiding the side effects? I'm looking forward to suggestions and guidance.
Thank you in advance.

@myFirstway

What should I do about this?

@huoshan12345

Same issue here. Consider this case:
I have a web proxy pool and a thread that continuously adds new proxies to it.
Now, if I want to use those proxies, I have to create HttpClients dynamically (take a proxy from the pool and create an HttpClient with it).
How can I achieve this with HttpClientFactory?

@myFirstway

Same issue here. Consider this case:
I have a web proxy pool and a thread that continuously adds new proxies to it.
Now, if I want to use those proxies, I have to create HttpClients dynamically (take a proxy from the pool and create an HttpClient with it).
How can I achieve this with HttpClientFactory?

Now how do you solve this problem?

@rynowak
Member

rynowak commented Nov 25, 2019

See discussion here: dotnet/extensions#521 (comment)

This is a fundamental limitation of HttpClient's design. We're going to look at a solution for this during 5.0

@agertenbach

agertenbach commented Apr 21, 2020

I ran into a similar problem related to contextual application of primary handler properties at runtime. I had a robust typed HttpClient with Polly and delegating handlers chained, but some property of the primary handler needed to differ based on a runtime attribute (e.g. a different certificate based on URL, endpoint, user, etc.). Registering half a dozen versions of the same typed client pipeline did not make sense, even if I did want to set up all the handler configurations and certs within the composition root.

I wrote a library to extend the DefaultHttpClientFactory to resolve this:
https://github.com/agertenbach/Ringleader
https://www.nuget.org/packages/Ringleader/

It takes advantage of the fact that the primary handler management and pooling uses the name of the upstream named/typed client. By wrapping the options and tacking in an IHttpMessageHandlerBuilderFilter, we can parse out the original client name to resolve the client's pipeline and options, but then create/use a more granular primary handler in the handler pool based on some provided context when the typed client is requested. The library has two interfaces that have to be implemented, one to define the typed client and the context that you'll pass in (and translate it to a string), and then a second that can use that translated string to return a well-formed primary handler when a new one is requested by the DefaultHttpClientFactory.

Hope this helps -- feedback or comments welcome.

@analogrelay transferred this issue from dotnet/extensions May 7, 2020
@Dotnet-GitSync-Bot added the area-System.Net.Http and untriaged (New issue has not been triaged by the area owner) labels May 7, 2020
@ghost

ghost commented May 7, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

@ericstj added the design-discussion (Ongoing discussion about design without consensus) label and removed the untriaged (New issue has not been triaged by the area owner) label Jul 6, 2020
@ericstj added this to the 6.0.0 milestone Jul 6, 2020
@ericstj
Member

ericstj commented Jul 6, 2020

I still think this is important but it doesn't seem like we have the time to address this in 5.0. What do you think @dotnet/ncl? cc @davidfowl

@scalablecory
Contributor

You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?

@scalablecory modified the milestones: 6.0.0, Future Jul 6, 2020
@LeaFrock
Contributor Author

LeaFrock commented Jul 7, 2020

You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?

Using IWebProxy is far from enough. For example, I want to remove unavailable proxies and pick out better proxies to form a spare pool. That requires 'explicit' control over each proxy and flexibility in the strategy developers choose.

@LeaFrock
Contributor Author

LeaFrock commented Jul 7, 2020

I still think this is important but it doesn't seem like we have the time to address this in 5.0. What do you think @dotnet/ncl? cc @davidfowl

Sorry to see that, but 'later' usually brings 'better'. The design does need careful thought.

See discussion here: dotnet/extensions#521 (comment)

This is a fundamental limitation of HttpClient's design. We're going to look at a solution for this during 5.0

@scalablecory
Contributor

You can implement a custom IWebProxy today that does round-robin proxy selection. Would this work for you, or do you need very explicit control over which proxy each request gets?

Using IWebProxy is far from enough. For example, I want to remove unavailable proxies and pick out better proxies to form a spare pool. That requires 'explicit' control over each proxy and flexibility in the strategy developers choose.

Thanks for the example, that makes sense. We won't have anything for this in .NET 5, but can consider it for .NET 6. I'd love to see some more upvotes on the top comment to see interest.

@scalablecory changed the title from "[Question]Use HttpClient/HttpClientFactory with different proxies" to "Per-request proxies in HttpClient" Jul 7, 2020

@GF-Huang

GF-Huang commented May 2, 2021

This is why more people use Python instead of C# to write crawlers: .NET's HttpClient is really lame.

@Aleksej-Shherbak

Aleksej-Shherbak commented Dec 17, 2021

@scalablecory please, please add this feature! I once built a website parser, and it was very confusing to realize that there was no simple, straightforward way to do this. One of the requirements was to use a new proxy for each request. There was also a demand for adding proxies dynamically, which is obvious when we are talking about web parsers. If the proxies stop working, does it mean that I have to update the application config (which contains the list of proxies) and redeploy the app? My customer just asked me: "Give me an admin panel where I can add or remove proxies".

Dotnet is perfect, but web parsing with proxies is still convoluted.

@wfurt
Member

wfurt commented Dec 17, 2021

Doing anything per-request is going to be expensive. This may force the creation of a new connection, and that is basically similar to creating a new HttpClient.
Going back to the initial comment from @LeaFrock: "If I have thousands of proxies" -> that is going to create thousands of connections regardless.

Would people mostly use multiple proxies for load balancing, or is it functional, e.g. different subdomains/destinations need different exit points?

@LeaFrock
Contributor Author

that is going to create thousands of connections regardless.

Excuse me, that's not the point. The point is that when you have many proxies, you want exactly one connection per proxy, and you want that connection to be reusable.

For now, if I have a proxy pool, I need to create an HttpClient instance before sending a request through a given proxy. Then, if the proxy fails, I have to dispose the instance and drop all references to it, right?

If the proxy works, I hope that the next time my app uses it, it can reuse the same connection as far as possible. So now the problem becomes that I have to maintain an HttpClient pool, which may waste even more resources.

If I just create a transient instance for one proxy every time, will HttpClient try to use the same connection? As far as I know, no, so you may open a connection multiple times for requests through the same proxy. As I said above, this can cause socket-exhaustion exceptions after your app runs for a while. Fortunately, we can now use SocketsHttpHandler and set PooledConnectionIdleTimeout and PooledConnectionLifetime to control the lifetime of pooled connections, which reduces the frequency of the problem.
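
A minimal sketch of the two pool settings named above, assuming one long-lived client is kept per proxy. The class name `ProxyClientFactory` and the timeout values are illustrative assumptions, not something taken from this thread.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class ProxyClientFactory
{
    // Builds one client for one proxy. The caller is expected to keep and
    // reuse the returned instance instead of creating one per request.
    public static HttpClient Create(IWebProxy proxy)
    {
        var handler = new SocketsHttpHandler
        {
            UseProxy = true,
            Proxy = proxy,
            // Close connections that have sat unused in the pool this long (assumed value).
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(30),
            // Stop reusing a connection once it is this old (assumed value).
            PooledConnectionLifetime = TimeSpan.FromMinutes(2),
        };

        // disposeHandler: true means disposing the client also disposes the handler.
        return new HttpClient(handler, disposeHandler: true);
    }
}
```

Disposing the returned client when its proxy is retired releases that proxy's pooled connections along with the handler.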

In the past, I didn't know much about HttpClient and thought IHttpClientFactory could help. Now I'm still wondering whether there is a better solution than the one I describe above.

Suppose HttpClient (or something else) provided a built-in way of handling proxies. Since it is closer to the underlying connections, developers would get a simpler way to build apps with better performance.

Would people mostly use multiple proxies for...

.NET works well in almost all areas, and web crawling is a relatively niche one. What I report is the most common dev scenario for web crawlers, but I think .NET does not offer enough best-practice guidance for it.

@scalablecory
Contributor

I appreciate the problem, @LeaFrock.

One possibility is that we update HttpRequestMessage with a Proxy member. As is common with abstractions, this would mean any existing implementations would silently ignore it. Not horrible, but also not ideal.

@Thomas-GH-CA

A potential workaround for my use case is to make a new named client for each different set of credentials.
Does anyone know what a reasonable limit for named clients would be?
My use case could require around 50, and I am worried that would cause problems.

@wfurt
Member

wfurt commented Jun 6, 2022

I don't think 50 clients would create a problem. This is far below common OS limits and GC capabilities. It should be fairly easy IMHO to create a test setup and try it, @Thomas-GH-CA.

@Thomas-GH-CA

Thanks for the response, @wfurt. Looking back, my comment was vague and short on detail, so apologies, but I just wanted to get an idea of whether 50 seemed wacky. It seems it isn't, so I will go ahead and try it out.

@davidfowl
Member

davidfowl commented Jun 13, 2022

Generally, setting the proxy, cookie container or client certs requires managing handler instances on your own. You can no longer follow the guidance of "use a single handler for your application" as you're now storing "per user" state on the handler instance itself.
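
One way to read "managing handler instances on your own" is sketched here: a hypothetical `ProxyRoutingHandler` keeps one `SocketsHttpHandler` per proxy URI and picks one per request from a value stored in `HttpRequestMessage.Options` (available since .NET 5). The class name, the option key, and the idle timeout are assumptions made for illustration; this is not an API proposed or endorsed in this thread.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: forwards each request to an inner handler that is
// dedicated to the proxy named in the request's options.
public sealed class ProxyRoutingHandler : HttpMessageHandler
{
    // Hypothetical option key; any unique name works.
    public static readonly HttpRequestOptionsKey<Uri> ProxyKey = new("ProxyUri");

    private readonly ConcurrentDictionary<Uri, Lazy<HttpMessageInvoker>> _invokers = new();

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (!request.Options.TryGetValue(ProxyKey, out var proxyUri) || proxyUri is null)
            throw new InvalidOperationException("Set ProxyRoutingHandler.ProxyKey on the request.");

        // Lazy ensures only one handler is built per proxy, even when requests race.
        var invoker = _invokers.GetOrAdd(proxyUri, uri => new Lazy<HttpMessageInvoker>(() =>
            new HttpMessageInvoker(new SocketsHttpHandler
            {
                UseProxy = true,
                Proxy = new WebProxy(uri),
                PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1), // assumed value
            }))).Value;

        return invoker.SendAsync(request, cancellationToken);
    }

    // Call when a proxy is judged dead: drops its handler and pooled connections.
    public bool Remove(Uri proxyUri)
    {
        if (!_invokers.TryRemove(proxyUri, out var lazy)) return false;
        if (lazy.IsValueCreated) lazy.Value.Dispose();
        return true;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            foreach (var lazy in _invokers.Values)
                if (lazy.IsValueCreated) lazy.Value.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

To use it, construct a single `new HttpClient(new ProxyRoutingHandler())`, call `request.Options.Set(ProxyRoutingHandler.ProxyKey, proxyUri)` on each `HttpRequestMessage` before sending, and call `Remove(proxyUri)` when a proxy is retired. Requests sent through the same proxy then share that proxy's connection pool, at the cost of one handler instance per live proxy.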

@LeaFrock
Contributor Author

LeaFrock commented Jun 16, 2022

You can no longer follow the guidance of "use a single handler for your application" as you're now storing "per user" state on the handler instance itself.

This is true in essence. But I hope there will be some improvements here, to take some load off application developers and steer them toward best practices.

@bugproof

bugproof commented Jun 26, 2022

Does any workaround for this issue exist? It seems the options are either to re-create the client before making a new request or to change the WebProxy Address property before the next request. Is there anything wrong with the second approach?

I think if I add a client per proxy to IHttpClientFactory and switch the HttpClient instance when needed, it will be fine.

@Vijay-Nirmal

@bugproof I did a POC on using a single HttpClient instance with multiple proxy configurations in a round-robin fashion, and it works perfectly. But I haven't tested it in my actual application yet.

@bugproof

@Vijay-Nirmal By changing Address property or implementing IWebProxy?

@Vijay-Nirmal

Vijay-Nirmal commented Jun 26, 2022

@bugproof Implementing IWebProxy

The code below is just a quick POC, so there might be issues in it. Let me know if anyone knows how to improve it.

using System;
using System.Collections.Generic;
using System.Net;

// Interface assumed from usage; it was not shown in the original comment.
public interface IProxyProvider
{
    Uri? GetProxy();
}

public class ManagedProxy : IWebProxy
{
    private readonly IProxyProvider _proxyProvider;

    public ManagedProxy(IProxyProvider proxyProvider)
    {
        _proxyProvider = proxyProvider;
    }

    public ICredentials? Credentials { get; set; }

    // The destination is ignored: every request gets the next proxy in rotation.
    public Uri? GetProxy(Uri destination)
    {
        return _proxyProvider.GetProxy();
    }

    public bool IsBypassed(Uri host)
    {
        return false;
    }
}

public class ProxyProvider : IProxyProvider
{
    private readonly object _proxyLock = new object();
    private int _currentProxyIndex = 0;
    private readonly List<Uri> _proxies = new List<Uri>();

    public Uri? GetProxy()
    {
        lock (_proxyLock) // May affect the performance
        {
            if (_proxies.Count == 0) return null;
            // Clamp first: the list may have shrunk since the last call.
            if (_currentProxyIndex >= _proxies.Count) _currentProxyIndex = 0;
            var proxy = _proxies[_currentProxyIndex];
            _currentProxyIndex = (_currentProxyIndex + 1) % _proxies.Count;
            return proxy;
        }
    }

    // Codes to add proxies to the list (these must take the same lock). I had a code to web scrape proxies from a website and add it to the list
}

@davidfowl
Member

@Vijay-Nirmal what does this accomplish? How does one set the proxy?

@Vijay-Nirmal

@davidfowl I had a background service that periodically added new proxies to the list, checked the list for bad proxies, and removed them.

what does this accomplish? How does one set the proxy?

I need to scrape a website that has an IP-based rate limiter. Using this method, I could get around that :)
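
For readers wondering how a rotating `IWebProxy` like the ManagedProxy posted above gets attached to a client, here is a small self-contained sketch. `RoundRobinProxy` and `RotatingClient` are hypothetical names, and the fixed array stands in for the dynamic list from the POC. It relies on the handler asking the `IWebProxy` for a proxy on each request, which is what the POC above depends on as well.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;

// Hypothetical stand-in for ManagedProxy: rotates over a fixed array.
public sealed class RoundRobinProxy : IWebProxy
{
    private readonly Uri[] _proxies;
    private int _next = -1;

    public RoundRobinProxy(params Uri[] proxies) =>
        _proxies = proxies ?? throw new ArgumentNullException(nameof(proxies));

    public ICredentials? Credentials { get; set; }

    public Uri? GetProxy(Uri destination)
    {
        if (_proxies.Length == 0) return null;
        // Interlocked avoids a lock; the uint cast keeps the index valid after overflow.
        uint n = unchecked((uint)Interlocked.Increment(ref _next));
        return _proxies[(int)(n % (uint)_proxies.Length)];
    }

    public bool IsBypassed(Uri host) => false;
}

public static class RotatingClient
{
    // One shared client; the proxy is chosen inside the handler, not by the caller.
    public static HttpClient Create(params Uri[] proxies) =>
        new HttpClient(new SocketsHttpHandler
        {
            UseProxy = true,
            Proxy = new RoundRobinProxy(proxies),
        });
}
```

The trade-off raised earlier in the thread still applies: the caller cannot choose which proxy a particular request uses, only the rotation policy.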

@gitlsl

gitlsl commented Sep 17, 2022

I appreciate the problem. Can we add proxy and cookie properties to HttpRequestMessage?
Python's network request library (requests httpx) is very convenient. Maybe we can learn from it

@davidfowl
Member

@gitlsl Do you have a concrete API proposal we can work through?

Python's network request library (requests httpx) is very convenient. Maybe we can learn from it

Do you have a direct comparison for this scenario?

@gitlsl

gitlsl commented Sep 18, 2022

Doing anything per-request is going to be expensive. This may force the creation of a new connection, and that is basically similar to creating a new HttpClient.
Going back to the initial comment from @LeaFrock: "If I have thousands of proxies" -> that is going to create thousands of connections regardless.

@davidfowl
As others mentioned above, achieving this means paying the cost of creating a lot of connections. In fact, I have not tested it. When I have this requirement, I usually just use Python.

Here is an interesting phenomenon. You can observe that many people in this post are from the same eastern country (including me). If I remember correctly, the people who contributed the SOCKS5 proxy support PR are also from this country.

For me, I wish the API looked like https://requests.readthedocs.io/en/latest/api/

  1. Independent headers (including cookies) and proxy that can be set for each request: headers and proxy are properties of HttpRequestMessage, and HttpClient just sends the request and constructs the response.
  2. Create a session from HttpClient to maintain cookies across multiple requests.

Maybe we can keep HttpClient compatible and add the new feature in https://github.com/dotnet/runtimelab/tree/feature/LLHTTP2

@pebezo

pebezo commented Mar 3, 2023

@gitlsl Do you have a concrete API proposal we can work through?

Python's network request library (requests httpx) is very convenient. Maybe we can learn from it

Do you have a direct comparison for this scenario?

With Python one can set/configure a proxy per-request at the time of request, for example:

import requests
url = "https://example.com"
proxy = "http://some-proxy.example.com"
proxies = {"http": proxy, "https": proxy}
response = requests.get(url, proxies=proxies, verify=False)

The issue I'm hitting is that the proxy information is NOT available at app startup. It may come from the database or some other way or it should be dynamically selected. As a general rule, if something cannot be configured dynamically (i.e. via a database query later) then it is too restrictive (i.e. not usable for a class of solutions).

The other issue is an architectural one: If I have to configure the proxy in one project while the actual usage is in another one then I'm spreading the logic into multiple places. It would be a lot cleaner (for some use cases) to isolate that logic to a single place.

@NCLnclNCL

@gitlsl Do you have a concrete API proposal we can work through?

Python's network request library (requests httpx) is very convenient. Maybe we can learn from it

Do you have a direct comparison for this scenario?

With Python one can set/configure a proxy per-request at the time of request, for example:

import requests
url = "https://example.com"
proxy = "http://some-proxy.example.com"
proxies = {"http": proxy, "https": proxy}
response = requests.get(url, proxies=proxies, verify=False)

The issue I'm hitting is that the proxy information is NOT available at app startup. It may come from the database or some other way or it should be dynamically selected. As a general rule, if something cannot be configured dynamically (i.e. via a database query later) then it is too restrictive (i.e. not usable for a class of solutions).

The other issue is an architectural one: If I have to configure the proxy in one project while the actual usage is in another one then I'm spreading the logic into multiple places. It would be a lot cleaner (for some use cases) to isolate that logic to a single place.

yes

@vinaghost

Hi, .NET is about to reach version 9.
Any update on this issue?

dotnet/extensions#521 (comment)

@wfurt
Member

wfurt commented Aug 26, 2024

This is not going to happen in 9.
