Difference in (url encoding) behaviour compared with HTTPoison #270

Open · thbar opened this issue Nov 14, 2023 · 15 comments

@thbar
Contributor

thbar commented Nov 14, 2023

While migrating a part of our snapshot crawler (etalab/transport-site#3585), I did some fairly large-scale testing, comparing the behaviour of HTTPoison and Req in detail.

One thing that came out is that URLs containing pipes (|) result in errors with Req, while HTTPoison for some reason (probably linked to how Hackney works underneath) downloads them just fine.

Here is a reproduction using production URLs:

data = [
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"
]

data
|> Enum.each(fn x ->
  IO.puts("==========================")
  IO.puts(x)
  IO.inspect(Req.get(x, compressed: false, decode_body: false))
  IO.inspect(Req.get(x |> String.replace("|", URI.encode("|")), compressed: false, decode_body: false))
end)

Typically Req without the replacement will return:

{:error,
 %Mint.HTTPError{
   reason: {:invalid_request_target,
    "/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
   module: Mint.HTTP1
 }}

With the replacement applied, it downloads the file just fine.

I'm not sure which behaviour is correct, but this could catch HTTPoison users off-guard!
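
For anyone hitting the same thing, here is a minimal sketch of the workaround (the SafeGet module name is just illustrative): pre-escape the pipe before handing the URL to Req.

defmodule SafeGet do
  # Percent-encode the pipe character (URI.encode("|") => "%7C") so the URL
  # passes Mint's strict HTTP/1 request-target validation.
  def get(url, opts \\ []) do
    url
    |> String.replace("|", URI.encode("|"))
    |> Req.get(opts)
  end
end

SafeGet.get(
  "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB|TIC_INT|ALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
  compressed: false,
  decode_body: false
)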

@wojtekmach
Owner

Thank you for the report. | is not allowed verbatim in URLs and needs to be escaped.

Req should have crashed sooner rather than hand an invalid URL to Mint. It will once we switch from URI.parse to URI.new! (#245). In this particular case we'd have received an unhelpful error message:

iex> URI.new!("http://localhost/A|B")
** (URI.Error) cannot parse due to reason invalid_uri: ":"

and I have reported it upstream, erlang/otp#7862.
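
A minimal illustration: once the pipe is percent-encoded, the strict parser accepts the URL.

URI.encode("|")
# => "%7C"

URI.new!("http://localhost/A%7CB").path
# => "/A%7CB"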

Hackney probably has a more relaxed URI parser.

I'll survey how other clients deal with this (whether they require escaping), because it might be more pragmatic for Req to let it slide.

@thbar
Contributor Author

thbar commented Nov 14, 2023

OK, thanks for the confirmation! I have implemented a quick fix and we will generalise it.

Apparently we need some extra handling even for HTTPoison anyway (https://github.com/etalab/transport-site/blob/02c9c8c213758019cfb6525aac8cb9bfde09d383/apps/transport/lib/transport/import_data.ex#L334-L359).

Indeed, I wonder how good an idea it would be to allow this. curl and browsers will happily download the files.

It could catch people migrating to Req a bit off guard!

thbar added a commit to etalab/transport-site that referenced this issue Nov 14, 2023
@wojtekmach
Owner

I need to find URL examples where this would be undesirable, but it seems URL-encoding the whole thing might work too, i.e. assuming the URLs are not already encoded:

iex> URI.encode("https://example.com/A|B")
"https://example.com/A%7CB"

@thbar
Contributor Author

thbar commented Nov 14, 2023 via email

@thbar
Contributor Author

thbar commented Nov 19, 2023

I have extracted a list of URLs from our pool of resources, keeping those where "encoded URL" != "the URL itself" (and filtering out the "pipe" case which we have already discussed).

Sorry, this is very raw, but it at least provides a set of in-the-wild URLs served by various server implementations. Maybe it can be useful during testing; unsure yet!

I have not yet tested which URLs are correctly handled by HTTPoison vs Req (I did that a few weeks back and would have to run it again).

[
  {true, 973, "7d162b4d-db84-49e9-b4ab-bd4f9f87b004",
   "https://data.ampmetropole.fr/api/explore/v2.1/catalog/datasets/voies-exceptionnelles-zfe-m-marseille/exports/geojson?lang=fr&timezone=Europe%2FBerlin",
   "https://data.ampmetropole.fr/api/explore/v2.1/catalog/datasets/voies-exceptionnelles-zfe-m-marseille/exports/geojson?lang=fr&timezone=Europe%252FBerlin"},
  {true, 688, "3349a69a-bc9a-4de4-a2c2-9f9d32e4923c",
   "https://notify.ratpdev.com/api/networks/RD%20TPM/alerts/gtfsrt",
   "https://notify.ratpdev.com/api/networks/RD%2520TPM/alerts/gtfsrt"},
  {true, 948, "7f0f6bcc-77a1-4f52-957a-95060a1c6ac7",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"},
  {true, 566, "1d919c39-41d4-4453-998c-9417c45b2f28",
   "https://notify.ratpdev.com/api/networks/VIENNE%20MOBI/alerts/gtfsrt",
   "https://notify.ratpdev.com/api/networks/VIENNE%2520MOBI/alerts/gtfsrt"},
  {true, 956, "7a6bc31b-1507-46cb-8281-9946afbffdeb",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%7CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%257CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
  {true, 952, "2de1872e-64dd-4234-af73-1eb4cb48b116",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%7CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=COROLIS_URB%257CCOROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
  {true, 959, "4b3466a1-b586-408b-a262-e9694dbd72fb",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"},
  {true, 825, "afbe14f0-aef5-45f1-8e92-9757d2cf770a",
   "https://thapaas-prd-storage.thalys.com/datagouv/gtfs-realtime.bin?sv=2021-10-04&st=2023-03-09T14%3A40%3A38Z&se=2050-03-10T14%3A40%3A00Z&sr=b&sp=r&sig=2xTfLbvsxTzlLavx%2BI1TJ2capp085ArXJYDA7i4IT04%3D",
   "https://thapaas-prd-storage.thalys.com/datagouv/gtfs-realtime.bin?sv=2021-10-04&st=2023-03-09T14%253A40%253A38Z&se=2050-03-10T14%253A40%253A00Z&sr=b&sp=r&sig=2xTfLbvsxTzlLavx%252BI1TJ2capp085ArXJYDA7i4IT04%253D"},
  {true, 825, "cf7adb62-bbfe-4f1f-93f7-dbbad9fd60e4",
   "https://thapaasblobsprod.blob.core.windows.net/datagouv/gtfs_static.zip?sv=2021-08-06&st=2023-01-11T11%3A12%3A00Z&se=2050-01-13T11%3A12%3A00Z&sr=b&sp=r&sig=1CLeDy4QoLgKwRx63BMp%2BDFSnqH1IUi14k8qg1auk%2FU%3D",
   "https://thapaasblobsprod.blob.core.windows.net/datagouv/gtfs_static.zip?sv=2021-08-06&st=2023-01-11T11%253A12%253A00Z&se=2050-01-13T11%253A12%253A00Z&sr=b&sp=r&sig=1CLeDy4QoLgKwRx63BMp%252BDFSnqH1IUi14k8qg1auk%252FU%253D"},
  {true, 957, "4e91e313-b1ef-444d-a303-898d3e148c67",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%7CTIC_INT%7CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA",
   "https://api.oisemob.cityway.fr/dataflow/offre-tc/download?provider=TIC_URB%257CTIC_INT%257CALLOTIC&dataFormat=Netex&dataProfil=OPENDATA"}
]
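
As the first entry shows, the catch with blanket URI.encode/1 is double encoding: % itself is not an unescaped character, so already-encoded URLs get mangled. A minimal illustration (URL shortened for readability):

URI.encode("https://example.com/exports/geojson?timezone=Europe%2FBerlin")
# => "https://example.com/exports/geojson?timezone=Europe%252FBerlin"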

For my own reference (since this cannot easily be run in its current form, as it refers to parts of our application), the code used to generate this is:

#! mix run
Code.require_file(__DIR__ <> "/irve/req_custom_cache.exs")

import Ecto.Query

defmodule Downloader do
  def cache_dir, do: Path.join(__ENV__.file, "../cache-dir") |> Path.expand()

  def get!(url) do
    req = Req.new() |> CustomCache.attach()
    %{body: body, status: 200} = Req.get!(req, url: url, custom_cache_dir: cache_dir())
    body
  end
end

base_url = "https://www.data.gouv.fr"

task = fn dataset ->
  url = base_url <> "/api/1/datasets/#{dataset.datagouv_id}/"
  # IO.puts url
  {dataset.id, Downloader.get!(url)}
end

DB.Dataset
|> where([d], d.is_active == true)
|> DB.Repo.all()
# |> Enum.take(2)
|> Task.async_stream(task, max_concurrency: 20, timeout: :infinity)
|> Enum.map(fn {:ok, {dataset_id, response}} ->
  response["resources"]
  |> Enum.map(fn r ->
    url = r["url"]
    url = String.replace(url, "|", URI.encode("|"))
    encoded_url = URI.encode(url)
    {encoded_url != url, dataset_id, r["id"], url, encoded_url}
  end)
end)
|> List.flatten()
|> Enum.filter(fn {diff, _, _, _, _} -> diff end)
|> IO.inspect(IEx.inspect_opts())

github-merge-queue bot pushed a commit to etalab/transport-site that referenced this issue Nov 20, 2023
* Fix broken config

* Save useful script

* Create a secondary path

* Implement req downloader

* Extract headers & implement req version

* Switch to req

* Stop passing the full HTTP body around

Optimize for memory consumption, since this is what we want, and the file is already stored on disk.

* Run mix format

* Run mix format

* Fix broken code

* Backport S3 streaming from #3560

* Enable streaming

* Update go.exs

* Rollback a part of the changes

* Start adding a Req wrapper (save my research)

* Add req to shared (where the wrapper will reside)

* Rework behaviour

* Setup Mox for testing

* Mix format

* Call the wrapper to allow testing

* Adapt one test

* Remove outdated TODO

* Extract method before reuse

* Extract method before reuse

* Adapt other test from http poison to req

* Adapt another test

* Support non-200 return codes & adapt test

* Add moduledoc (credo)

* Fix credo warning

* Remove TODO (I verified the merge was OK)

* Add moduledoc

* Remove TODO - there is no problem here

The :line_or_bytes option is only relevant when reading a file, per the docs (https://hexdocs.pm/elixir/File.html#stream!/3). We're only doing a write here.

* Change method visibility to help me test without S3

* Start iterating on script to detect url regressions in more detail

* You had one job

* Add note

* Use a state file to properly analyse status codes

* Update go.exs

* Load & display

* Avoid code eval

* Parallelize the downloads a bit

* Save WIP

* Validate that all "OK on both sides" are actually equivalent

* Split not all ok and all ok

* Finalise analysis

* Update go.exs

* Add fix for wojtekmach/req#270

* Update req_httpoison_testing.exs

* Apply feedback

* Add note

* Improve Req headers extraction

- support multiple values, by concatenating them
- filter them first (reduce unnecessary processing)
- add new tests
- adapt broken existing tests

* Remove dead code

* Add space after comma

---------

Co-authored-by: Antoine Augusti <[email protected]>
@wojtekmach
Owner

@thbar could you double-check this? Maybe Mint changed in the meantime? I can't reproduce the failure anymore:

iex> Req.get!("https://httpbin.org/anything?a[b]=1|2").body["args"]
%{"a[b]" => "1|2"}

thbar changed the title from "Difference in (encoding) behaviour compared with HTTPoison" to "Difference in (url encoding) behaviour compared with HTTPoison" on Feb 19, 2024
@thbar
Contributor Author

thbar commented Feb 19, 2024

@wojtekmach I did a bit of investigation (first in our app & with the code at the top of this issue), but ultimately I bisected it here (see below).

The change is not related to Mint but, I believe, to Req 0.4.10's switch to URI.parse.

Script

version_specifier = System.fetch_env!("REQ_VERSION")

IO.puts("Starting with req #{version_specifier}")

Mix.install([
  {:req, version_specifier},
  # added because 0.4.10 cannot be installed without plug
  {:plug, "~> 1.15"}
])

Req.get!("https://httpbin.org/anything?a[b]=1|2").body["args"]

IO.puts("OK")

Behaviour on Req 0.4.9

❯ REQ_VERSION=0.4.9 elixir scripts/req.exs
Starting with req 0.4.9
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.9) lib/req.ex:978: Req.request!/2
    scripts/req.exs:10: (file)

Behaviour on Req 0.4.10

❯ REQ_VERSION=0.4.10 elixir scripts/req.exs
Starting with req 0.4.10
OK

@thbar
Contributor Author

thbar commented Feb 19, 2024

@wojtekmach maybe it would be worth just adding a test in the codebase with "https://httpbin.org/anything?a[b]=1|2", so that we ensure the behaviour does not change unexpectedly in the future? Just an idea.

All in all, it is an interesting thing to note for us; I will do more testing while upgrading.

@wojtekmach
Owner

Sure, I can add the regression test. Do you think we can close this issue then, or would you rather do some more testing in this area?

Btw I cannot reproduce:

$ elixir -e 'Mix.install([{:req, "0.4.9"}]) ; %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1") ; IO.puts("ok")'
ok

Perhaps you had a cache with a prior Finch or Mint version? Could you run the following? Remember that in your script you had a dependency on :plug, so include it too; it's part of the cache key:

$ elixir -e 'Mix.install([{:req, "0.4.9"}]) ; IO.inspect(finch: Application.spec(:finch, :vsn), mint: Application.spec(:mint, :vsn))'
[finch: ~c"0.17.0", mint: ~c"1.5.2"]

@AntoineAugusti

@wojtekmach This is not the same URL; the issue happens when there is a pipe.

elixir -e 'Mix.install([{:req, "0.4.9"}]); %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2"); IO.puts("ok")'
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.9) lib/req.ex:978: Req.request!/2
    (stdlib 3.17.2.4) erl_eval.erl:685: :erl_eval.do_apply/6
    (stdlib 3.17.2.4) erl_eval.erl:446: :erl_eval.expr/5
    (stdlib 3.17.2.4) erl_eval.erl:123: :erl_eval.exprs/5
    (elixir 1.15.5) lib/code.ex:543: Code.validated_eval_string/3

@thbar
Contributor Author

thbar commented Feb 19, 2024

Yes, the change of behaviour is indeed linked to the pipe, as @AntoineAugusti pointed out (thank you!).

@wojtekmach
Owner

Right, my bad. I'm able to reproduce this locally:

~% elixir -e 'Mix.install([{:req, "0.4.9"}]); %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2"); IO.puts("ok")'
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.9) lib/req.ex:978: Req.request!/2
    (stdlib 5.2) erl_eval.erl:750: :erl_eval.do_apply/7
    (stdlib 5.2) erl_eval.erl:494: :erl_eval.expr/6
    (stdlib 5.2) erl_eval.erl:136: :erl_eval.exprs/6
~% elixir -e 'Mix.install([{:req, "0.4.11"}]); %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2"); IO.puts("ok")'
ok

FWIW it seems you're good on the latest version, but I'm really curious what the root cause is. I'll post here when I find out.

@thbar
Contributor Author

thbar commented Feb 19, 2024

Right, my bad. I'm able to reproduce this locally:

No problem - the URL includes too many weird characters anyway :-)

Happy to help out if we can, btw!

Thanks for tracking this.

@wojtekmach
Owner

wojtekmach commented Feb 19, 2024

OK, this is pretty wild. It turns out this is a difference between HTTP/1.1 and HTTP/2. Since 0.4.11, Req starts on HTTP/1.1 and tries to switch to HTTP/2 using ALPN, and on HTTP/2 this is not a problem. If we force HTTP/1, we see the crash. So it's something in Mint after all.

$ elixir -e 'Mix.install([{:req, "0.4.11"}]) ; %{status: 200} = Req.get!("https://httpbin.org/anything?a[b]=1|2", connect_options: [protocols: [:http1]])'
** (Mint.HTTPError) invalid request target: "/anything?a[b]=1|2"
    (req 0.4.11) lib/req.ex:978: Req.request!/2
    (stdlib 5.2) erl_eval.erl:750: :erl_eval.do_apply/7
    (stdlib 5.2) erl_eval.erl:494: :erl_eval.expr/6
    (elixir 1.17.0-dev) lib/code.ex:568: Code.validated_eval_string/3
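
Conversely (a quick sketch, assuming httpbin.org negotiates HTTP/2), forcing HTTP/2 on the same URL should go through, matching the ALPN explanation above:

Mix.install([{:req, "0.4.11"}])

# Same URL, but forcing HTTP/2; per the explanation above, the pipe is not a
# problem on this protocol.
%{status: 200} =
  Req.get!("https://httpbin.org/anything?a[b]=1|2",
    connect_options: [protocols: [:http2]]
  )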

@thbar
Contributor Author

thbar commented Feb 26, 2024

@wojtekmach thanks for diving into this! Pretty wild indeed, and I would be happy to ensure we have some form of non-regression testing on that.

FYI we've upgraded on our side (see above); if anything problematic arises (given that we crawl a decently large number of URLs), I will mention it back here!

Happy to help out in any case.
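
For reference, a hedged sketch of what such a non-regression test could look like (the module name and the use of httpbin.org are illustrative; Req's own suite would likely hit a local test server instead):

defmodule ReqPipeRegressionTest do
  use ExUnit.Case, async: true

  test "URL with an unescaped pipe in the query still works end to end" do
    resp = Req.get!("https://httpbin.org/anything?a[b]=1|2")

    # Mirrors the behaviour observed earlier in this thread.
    assert resp.status == 200
    assert resp.body["args"] == %{"a[b]" => "1|2"}
  end
end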
