Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GoogleAdsIO for reading from Google Ads #27681

Merged
merged 16 commits into from
Aug 9, 2023

Conversation

sjvanrossum
Copy link
Contributor

@sjvanrossum sjvanrossum commented Jul 26, 2023

Adds a new IO connector to read from Google Ads.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@codecov
Copy link

Codecov bot commented Jul 26, 2023

Codecov Report

Merging #27681 (1035655) into master (3d02b3b) will increase coverage by 0.00%.
Report is 13 commits behind head on master.
The diff coverage is n/a.

❗ Current head 1035655 differs from pull request most recent head 5cd70c4. Consider uploading reports for the commit 5cd70c4 to get more accurate results

@@           Coverage Diff           @@
##           master   #27681   +/-   ##
=======================================
  Coverage   70.59%   70.59%           
=======================================
  Files         857      857           
  Lines      103944   103947    +3     
=======================================
+ Hits        73375    73378    +3     
  Misses      29004    29004           
  Partials     1565     1565           
Flag Coverage Δ
python 79.55% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 11 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @Abacn for label build.
R: @pabloem for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@bvolpato
Copy link
Contributor

Run Java PreCommit

@bvolpato
Copy link
Contributor

Run Python PreCommit

@bvolpato
Copy link
Contributor

Linter failure

./sdks/java/io/google-ads/build.gradle:34:48: W201 Trailing whitespace

I believe ./gradlew spotlessApply should take care of this.

@sjvanrossum
Copy link
Contributor Author

I believe ./gradlew spotlessApply should take care of this.

To my surprise it actually didn't catch that, but I did a search and replace to remove trailing whitespace.
Do we not run spotlessGroovyApply in spotlessApply?

Comment on lines +74 to +82
* <pre>
* --googleAdsClientId=your-client-id
* --googleAdsClientSecret=your-client-secret
* --googleAdsRefreshToken=your-refresh-token
* --googleAdsDeveloperToken=your-developer-token
* </pre>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably link https://developers.google.com/google-ads/api/docs/oauth/overview around here too instead of the options only (for more visibility)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +457 to +446
private static final int MAX_RETRIES = 5;
private static final FluentBackoff BACKOFF =
FluentBackoff.DEFAULT
.withExponent(2.0)
.withInitialBackoff(Duration.standardSeconds(30))
.withMaxRetries(MAX_RETRIES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Are those arbitrary or recommended somewhere? If recommended, probably a reference would make sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry configuration is an adaption of the BQ DTS configuration. I've added a comment to reflect that these aren't entirely arbitrary.

@Override
public PCollection<GoogleAdsRow> expand(PBegin input) {
String query = getQuery();
List<Long> customerIds = getCustomerIds();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is limiting. Because the customer IDs are specified at pipeline construction time, they cannot be dynamic.

Its likely worth reworking this IO to instead consume a PCollection of those IDs, so that they can be dynamically generated by some upstream source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the transform to take a PCollection<String> as input. Read can now be used to issue a single query for many customer IDs and ReadAll still provides the flexibility to alter the query per element (e.g. to support queries of the same output shape with differing WHERE clauses per customer ID).

Customer IDs are typically expressed as long, including in the Google Ads client library example code, but the protobuf schema for API requests defines customer IDs as String (oddly enough, linkedCustomerId/loginCustomerId are both long in the client...). To avoid baking assumptions about the identifier's format into the IO I think it makes sense to use String as the input type. The documentation example reflects this as well.

String customerId = request.getCustomerId();

do {
rateLimitPolicy.onBeforeRequest(developerToken, customerId, request);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the consequences for breaking the rate limit? I ask, because the default rate limiter won't scale with parallel reads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A whole lot of trouble and non-compliance fees, see https://developers.google.com/google-ads/api/docs/rate-sheet.
I've given this a bit more thought and I believe that the policy configuration needs to be owned entirely by the user without giving the illusion of safe defaults, since there simply are none that we can safely recommend due to the complexity of the APIs quotas and limits and their global applicability to developer tokens or customer IDs.

I will rename DefaultRateLimitPolicy to FixedRateLimitPolicy, mark it public, and leave rateLimitPolicyFactory set to null on construction. I'll mark FixedRateLimitPolicy as clearly not intended for advanced users who may exceed their global quota across multiple running applications.

Advanced users should implement their own shared token bucket service (e.g. using https://github.com/mailgun/gubernator) as per the advice provided at https://developers.google.com/google-ads/api/docs/best-practices/rate-limits#throttle.

Moving DatastoreIO's AdaptiveThrottler to a util package would offer a helpful building block for users to apply an additional layer of client side protection against exceeding limits. I'm planning a follow up PR to do that.

Upon reviewing DatastoreIO and FirestoreIO it occurred to me that neither of them apply a pipeline global rate limit across all workers, but divide the desired pipeline global QPS by maxNumWorkers, with their default being 1 QPS local to the DoFn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, no policy is set by default but required nonetheless and DefaultRateLimitPolicy has been renamed to SimpleRateLimitPolicy and simply wraps the Guava RateLimiter.
Adequate warnings have been added to the documentation to inform users they should not rely on SimpleRateLimitPolicy for anything but low volume and development use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add those links to the javadoc, to make it easier for users to implement their own rate limiter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@bvolpato bvolpato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Steven!

@bvolpato
Copy link
Contributor

bvolpato commented Aug 8, 2023

Run Java PreCommit

@bvolpato
Copy link
Contributor

bvolpato commented Aug 8, 2023

Run SQL_Java11 PreCommit

Copy link
Contributor

@johnjcasey johnjcasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Lets add that additional documentation, and then I think this is good to go

@johnjcasey johnjcasey added this to the 2.50.0 Release milestone Aug 8, 2023
@sjvanrossum
Copy link
Contributor Author

I just noticed the vendored Guava package version was updated in #27895. I'll add a commit to update Guava imports after #27907 is merged.

@Abacn
Copy link
Contributor

Abacn commented Aug 8, 2023

sorry for inconvenience, could also pick the change of #27907 into this PR to get gcp io test compiles. Whichever (this or #27907) get in first won't cause merge conflict.

@sjvanrossum
Copy link
Contributor Author

Run Java PreCommit

@sjvanrossum
Copy link
Contributor Author

Run SQL_Java17 PreCommit

@johnjcasey johnjcasey merged commit aee0eb7 into apache:master Aug 9, 2023
30 checks passed
@sjvanrossum sjvanrossum deleted the google-ads-io branch August 9, 2023 18:12
bullet03 pushed a commit to akvelon/beam that referenced this pull request Aug 11, 2023
* Add GoogleAdsIO for reading from Google Ads

* Remove trailing whitespace

* Add link to OAuth2 documentation in the Google Ads API

* Document retry configuration

* Capitalize long literal suffix in documentation

* Make GoogleAdsV14.Read consume customer IDs from a PCollection

* Change non-empty to non-null in test assertions

* Use Long.toString in transform construction example

* Use EnsuresNonNull, RequiresNonNull instead of checkStateNotNull

* Remove default rate limit policy

* Refactor DefaultRateLimitPolicy to SimpleRateLimitPolicy

* Synchronize on ReadAllFn to safely modify static variable sleeper

* Add additional links to clarify API limits, quotas, and policies

* Fix Guava imports

* Update CHANGES.md

* Move change to upcoming release
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants