Difference in (url encoding) behaviour compared with HTTPoison #270
Comments
Thank you for the report. Req should have crashed sooner and not given an invalid URL to Mint. It will when we switch from URI.parse to URI.new!, #245. In this particular case we'd have received an unhelpful error message:

```elixir
iex> URI.new!("http://localhost/A|B")
** (URI.Error) cannot parse due to reason invalid_uri: ":"
```

I have reported that upstream, erlang/otp#7862. Hackney probably has a more relaxed URI parser. I'll survey how other clients deal with this, and whether they require escaping, because it might be more pragmatic for Req to let it slide.
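For contrast, a quick sketch of why that switch matters (current behaviour as I understand it): URI.parse/1 accepts the raw pipe without complaint, which is why the invalid URL only surfaces further down in Mint, while URI.new/1 rejects it, albeit with the unhelpful ":" part mentioned above:

```elixir
# URI.parse/1 is lenient and keeps the raw pipe in the path:
iex> URI.parse("http://localhost/A|B").path
"/A|B"

# URI.new/1 rejects it, but (at the time) reported an unhelpful invalid part:
iex> URI.new("http://localhost/A|B")
{:error, ":"}
```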
Ok, thanks for the confirmation! I have implemented a quick fix and we will generalise it. Apparently we need other adjustments even for HTTPoison anyway (https://github.com/etalab/transport-site/blob/02c9c8c213758019cfb6525aac8cb9bfde09d383/apps/transport/lib/transport/import_data.ex#L334-L359). I do wonder how good an idea it would be to allow this, though. It could catch people migrating to Req off guard.
I need to find URL examples where this would be undesirable, but it seems URL-encoding the whole thing might work too, i.e. assume they are not encoded:
I seem to recall I stumbled on cases where this led to trouble. I will try to report back with precise URLs.

On Tue, Nov 14, 2023 at 21:22, Wojtek Mach ***@***.***> wrote:
> I need to find URL examples where this would be undesirable but seems URL-encoding the whole thing might work too, i.e. assume they are not encoded:
>
> ```elixir
> iex> URI.encode("https://example.com/A|B")
> "https://example.com/A%7CB"
> ```
I have extracted a list of URLs from our pool of resources, filtering where "encoded url" != "the url itself" (and filtering out the "pipe" case, which we have already discussed). Sorry, this is very raw, but it at least provides a large set of URLs in the wild, served by various server implementations. Maybe it can be useful during testing, unsure yet! I have not yet tested which URLs are correctly handled by HTTPoison vs Req (I did that a few weeks back, and would have to run it again).

```elixir
[
{true, 973, "7d162b4d-db84-49e9-b4ab-bd4f9f87b004",
"https://data.ampmetropole.fr/api/explore/v2.1/catalog/datasets/voies-exceptionnelles-zfe-m-marseille/exports/geojson?lang=fr&timezone=Europe%2FBerlin",
"https://data.ampmetropole.fr/api/explore/v2.1/catalog/datasets/voies-exceptionnelles-zfe-m-marseille/exports/geojson?lang=fr&timezone=Europe%252FBerlin"},
{true, 688, "3349a69a-bc9a-4de4-a2c2-9f9d32e4923c",
"https://notify.ratpdev.com/api/networks/RD%20TPM/alerts/gtfsrt",
"https://notify.ratpdev.com/api/networks/RD%2520TPM/alerts/gtfsrt"},
{true, 948, "7f0f6bcc-77a1-4f52-957a-95060a1c6ac7",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"},
{true, 566, "1d919c39-41d4-4453-998c-9417c45b2f28",
"https://notify.ratpdev.com/api/networks/VIENNE%20MOBI/alerts/gtfsrt",
"https://notify.ratpdev.com/api/networks/VIENNE%2520MOBI/alerts/gtfsrt"},
{true, 956, "7a6bc31b-1507-46cb-8281-9946afbffdeb",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%7CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%257CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
{true, 952, "2de1872e-64dd-4234-af73-1eb4cb48b116",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%7CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%257CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
{true, 959, "4b3466a1-b586-408b-a262-e9694dbd72fb",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"},
{true, 825, "afbe14f0-aef5-45f1-8e92-9757d2cf770a",
"https://thapaas-prd-storage.thalys.com/datagouv/gtfs-realtime.bin?sv=2021-10-04&st=2023-03-09T14%3A40%3A38Z&se=2050-03-10T14%3A40%3A00Z&sr=b&sp=r&sig=2xTfLbvsxTzlLavx%2BI1TJ2capp085ArXJYDA7i4IT04%3D",
"https://thapaas-prd-storage.thalys.com/datagouv/gtfs-realtime.bin?sv=2021-10-04&st=2023-03-09T14%253A40%253A38Z&se=2050-03-10T14%253A40%253A00Z&sr=b&sp=r&sig=2xTfLbvsxTzlLavx%252BI1TJ2capp085ArXJYDA7i4IT04%253D"},
{true, 825, "cf7adb62-bbfe-4f1f-93f7-dbbad9fd60e4",
"https://thapaasblobsprod.blob.core.windows.net/datagouv/gtfs_static.zip?sv=2021-08-06&st=2023-01-11T11%3A12%3A00Z&se=2050-01-13T11%3A12%3A00Z&sr=b&sp=r&sig=1CLeDy4QoLgKwRx63BMp%2BDFSnqH1IUi14k8qg1auk%2FU%3D",
"https://thapaasblobsprod.blob.core.windows.net/datagouv/gtfs_static.zip?sv=2021-08-06&st=2023-01-11T11%253A12%253A00Z&se=2050-01-13T11%253A12%253A00Z&sr=b&sp=r&sig=1CLeDy4QoLgKwRx63BMp%252BDFSnqH1IUi14k8qg1auk%252FU%253D"},
{true, 957, "4e91e313-b1ef-444d-a303-898d3e148c67",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
"https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"}
]
```

For my own reference (since this cannot easily be run in its current form, as it refers to parts of our application), the code used to generate it is:

```elixir
#! mix run

Code.require_file(__DIR__ <> "/irve/req_custom_cache.exs")

import Ecto.Query

defmodule Downloader do
  def cache_dir, do: Path.join(__ENV__.file, "../cache-dir") |> Path.expand()

  def get!(url) do
    req = Req.new() |> CustomCache.attach()
    %{body: body, status: 200} = Req.get!(req, url: url, custom_cache_dir: cache_dir())
    body
  end
end

base_url = "https://www.data.gouv.fr"

task = fn dataset ->
  url = base_url <> "/api/1/datasets/#{dataset.datagouv_id}/"
  # IO.puts url
  {dataset.id, Downloader.get!(url)}
end

DB.Dataset
|> where([d], d.is_active == true)
|> DB.Repo.all()
# |> Enum.take(2)
|> Task.async_stream(task, max_concurrency: 20, timeout: :infinity)
|> Enum.map(fn {:ok, {dataset_id, response}} ->
  response["resources"]
  |> Enum.map(fn r ->
    url = r["url"]
    url = String.replace(url, "|", URI.encode("|"))
    encoded_url = URI.encode(url)
    {encoded_url != url, dataset_id, r["id"], url, encoded_url}
  end)
end)
|> List.flatten()
|> Enum.filter(fn {diff, _, _, _, _} -> diff end)
|> IO.inspect(IEx.inspect_opts())
```
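As the pairs above illustrate, URI.encode/1 is not idempotent: a "%" that is already part of an escape sequence gets re-encoded as "%25" (hence %20 turning into %2520), which is what the `encoded_url != url` filter in the script flags. A minimal check, using one of the URLs from the list above:

```elixir
url = "https://notify.ratpdev.com/api/networks/RD%20TPM/alerts/gtfsrt"

# Encoding an already-encoded URL escapes the "%" itself:
URI.encode(url)
#=> "https://notify.ratpdev.com/api/networks/RD%2520TPM/alerts/gtfsrt"

# So "changes under URI.encode" flags URLs that were already percent-encoded:
URI.encode(url) != url
#=> true
```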
* Fix broken config
* Save useful script
* Create a secondary path
* Implement req downloader
* Extract headers & implement req version
* Switch to req
* Stop passing the full HTTP body around. Optimize for memory consumption, since this is what we want, and the file is already stored on disk.
* Run mix format
* Run mix format
* Fix broken code
* Backport S3 streaming from #3560
* Enable streaming
* Update go.exs
* Rollback a part of the changes
* Start adding a Req wrapper (save my research)
* Add req to shared (where the wrapper will reside)
* Rework behaviour
* Setup Mox for testing
* Mix format
* Call the wrapper to allow testing
* Adapt one test
* Remove outdated TODO
* Extract method before reuse
* Extract method before reuse
* Adapt other test from http poison to req
* Adapt another test
* Support non-200 return codes & adapt test
* Add moduledoc (credo)
* Fix credo warning
* Remove TODO (I verified the merge was OK)
* Add moduledoc
* Remove TODO - there is no problem here. The :line_or_bytes option is only relevant when reading a file, after doc (https://hexdocs.pm/elixir/File.html#stream!/3) check. We're only doing a write here.
* Change method visibility to help me test without S3
* Start iterating on script to detect url regressions in more detail
* You had one job
* Add note
* Use a state file to properly analyse status codes
* Update go.exs
* Load & display
* Avoid code eval
* Parallelize the downloads a bit
* Save WIP
* Validate that all "OK on both sides" are actually equivalent
* Split not all ok and all ok
* Finalyse analysis
* Update go.exs
* Add fix for wojtekmach/req#270
* Update req_httpoison_testing.exs
* Apply feedback
* Add note
* Improve Req headers extraction:
  - support multiple values, by concatenating them
  - filter them first (reduce unecessary processing)
  - add new tests
  - adapt broken existing tests
* Remove dead code
* Add space after comma

Co-authored-by: Antoine Augusti <[email protected]>
@thbar could you double-check this? Maybe Mint changed in the meantime? I can't reproduce the failure anymore:
@wojtekmach I did a bit of investigation (first in our app & with the code at the top of this issue), but ultimately I bisected it here (see below). The change is not related to Mint but, I believe, to Req.

Script:

```elixir
version_specifier = System.fetch_env!("REQ_VERSION")
IO.puts("Starting with req #{version_specifier}")

Mix.install([
  {:req, version_specifier},
  # added because 0.4.10 cannot be installed without plug
  {:plug, "~> 1.15"}
])

Req.get!("https://httpbin.org/anything?a[b]=1|2").body["args"]
IO.puts("OK")
```

Behaviour on Req:
@wojtekmach maybe it would be worth just adding a test to the codebase with "https://httpbin.org/anything?a[b]=1|2", so that we ensure this does not change unexpectedly in the future? Just an idea. All in all, it is an interesting thing to note for us; I will do more testing while upgrading.
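A minimal sketch of what such a regression test could look like, hitting httpbin directly (Req's own suite would more likely use a local test server, so treat the module and tag names as hypothetical):

```elixir
defmodule ReqPipeRegressionTest do
  use ExUnit.Case, async: true

  @moduletag :integration

  test "a URL with an unescaped pipe in the query does not crash" do
    # httpbin echoes the query string back as JSON under "args".
    response = Req.get!("https://httpbin.org/anything?a[b]=1|2")

    assert response.status == 200
    assert is_map(response.body["args"])
  end
end
```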
Sure, I can add the regression test. Do you think we can close this issue then, or would you rather do some more testing in this area? Btw, I cannot reproduce:

Perhaps you had a cache with a prior finch or mint version? Could you run the following? Remember that in your script you had a dependency on :plug, so include it too; it's part of the cache key:
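(The exact snippet isn't preserved here; presumably it pinned the versions and bypassed the install cache, something along these lines, with the version requirements being my guess:)

```elixir
# Hypothetical reconstruction: pin req and plug (plug is part of the
# Mix.install cache key here) and force a clean build of the cache.
Mix.install(
  [
    {:req, "~> 0.4.14"},
    {:plug, "~> 1.15"}
  ],
  force: true
)

Req.get!("https://httpbin.org/anything?a[b]=1|2").body["args"]
|> IO.inspect()
```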
@wojtekmach This is not the same URL; the issue happens when there is a pipe.
Yes, the change of behaviour is indeed linked to the pipe, as @AntoineAugusti pointed out (thank you!).
Right, my bad. I'm able to reproduce this locally:
FWIW, it seems you're good on the latest version, but yeah, I'm really curious what the root cause is. I'll post here when I find out.
No problem - the URL includes too many weird characters for a URL anyway :-) Happy to help out if we can, btw! Thanks for tracking this.
OK, this is pretty wild. It turns out this is a difference between HTTP/1.1 and HTTP/2. Since 0.4.11, Req starts on HTTP/1.1 and tries to switch to HTTP/2 using ALPN, and on HTTP/2 this is not a problem. If we force HTTP/1 we see the crash. So it's something in Mint after all.
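For anyone who wants to reproduce the HTTP/1-only behaviour, Req's connect_options can restrict the negotiated protocols; a sketch, reusing the httpbin URL from the bisection script above:

```elixir
# Restrict the connection to HTTP/1 so ALPN cannot end up on HTTP/2;
# on the affected versions this is the path where the pipe URL crashed in Mint.
Req.get!("https://httpbin.org/anything?a[b]=1|2",
  connect_options: [protocols: [:http1]]
)
```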
@wojtekmach thanks for diving into this! Pretty wild indeed, and I would be happy to ensure we have some form of non-regression testing on that. FYI, we've upgraded on our side (see above); if anything problematic arises (given that we crawl a decently large number of URLs), I will report back here! Happy to help out in any case.
While migrating a part of our snapshot crawler (etalab/transport-site#3585), I did some largish scale testing, comparing the behaviour of HTTPoison and Req in detail.
One thing that came out is that URLs with pipes (|) will result in errors, while HTTPoison, for some reason (probably linked to how hackney works underneath), downloads them just fine. Here is a reproduction on production URLs:
Typically Req without the replacement will return:
While it will download the file just fine after a replacement.
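For reference, the replacement in question is the same escaping used in the survey script earlier in this thread; a minimal sketch, assuming the un-escaped form of one of the production URLs listed above:

```elixir
# A production URL with literal pipes in the query string (un-escaped form).
url =
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"

# On the affected Req versions, passing the raw pipe blew up inside Mint.
# Escaping just the pipe first (URI.encode("|") == "%7C") made the request work:
escaped = String.replace(url, "|", URI.encode("|"))
Req.get!(escaped)
```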
I'm not sure who is in the right, but this could catch HTTPoison users off-guard!