-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: enable feature store batch serve to BigQuery and GCS for csv and tfrecord #919
feat: enable feature store batch serve to BigQuery and GCS for csv and tfrecord #919
Conversation
… for csv and tfrecord files in Featurestore class
self, | ||
bq_destination_output_uri: str, | ||
entity_type_ids: List[str], | ||
entity_type_destination_fields: Optional[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument name is a bit confusing. It is to specify what features to read, and (optional) the output column name, right? May be change it to smt related to feature than entity type, e.g feature_to_read
or some sort.
Alternatively, we can have a separate struct to define the entity_type and what feature to read within that entity type, and provide a list of that to this function, instead of the separate entity_type_ids and this entity_type_destination_field.
Another option is, we can have a flat list of feature to read, using the full-resource path name. Internally, we can group them by entity type and construct the read request. This could be a little bit repetitive on the calling side, but I think the interface is more intuitive. We'll need to use different struct for 'read_instance' though (e.g Dict instead of List)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't read_instances
the bq uri or gcs uri(s)? How will a Dict work here?
Regarding renaming the entity_type_destination_fields
, how about feature_destination_fields
to match the feature_source_fields
in EntityType
class's ingestion methods?
Regarding the data type and usage of the entity_type_destination_fields
:
- I think the current implementation is trying to implement the first alternative suggested here. The purpose for having a separate
entity_type_ids
andentity_type_destination_fields
is that if for some entity_types that user want to batch serve all features (without a different output feature name), user won't need to include that entity_type id in theentity_type_destination_fields
. However, we still can achieve this and remove the separatedentity_type_ids
field, by requiring having all entity_type ids specified in theentity_type_destination_fields
as keys, with ["*"] value for those all features are to be served without a different output name. For example,
{
"entity_type_1": ["*"],
"entity_type_2": {"feature_1": "feature_1_col"},
"entity_type_3": {"feature_1", "feature_2"},
}
- the second option is a bit redundant but it is indeed very clear and straight forward
{"...path/to/fs/fs1/et/e1/f/f1": "f11_col", "...path/to/fs/fs1/et/e2/f/f1": "f21_col"}
Will the modification of the current implementation in 1) be less confusing and more appealing for usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding [1], you also have the option to support mapping the output column name to something different than feature name. I feel this overloaded argument is hard for users to understand (though I may agree it could be convenient once user get a good grasp of it)
I think it is easier to understand to split into 2 arguments: one to define the feature to read, the other one to define the (overridden) output column name. This also allow user to define the output column only for the feature they want to (i.e they want to read */all features from entity_type_1 and output to the column with the same name, except feature foo
will be output to column bar
)
With the flat-list, the entity type IDs can be extracted from the feature path name. However, unless we put some constraints that the list have to group by entity type, the order of the entity type in the read_instance
could be ambiguous. Thus I think using Dict (e.g {'entity_type_1' : 'user_1234', 'entity_type_2': 'product_xyz' }
is clearer than using array (['user_1234', 'product_xyz']
). The flat list option also requires all to be explicitly listed, i.e "*" is not supported. This is the biggest cons for that option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed entity_type_ids
and entity_type_destination_fields
to avoid confusion, added two dict serving_feature_ids
for an entity_type / features mapping and feature_destination_fields
for feature_id / feature_destination_field_name mapping.
|
||
entity_type_ids = ['my_entity_type_id_1', 'my_entity_type_id_2', 'my_entity_type_id_3'] | ||
|
||
feature_source_fields = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean entity_type_destination_fields
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work!
Co-authored-by: sasha-gitg <[email protected]>
Co-authored-by: sasha-gitg <[email protected]>
Co-authored-by: sasha-gitg <[email protected]>
Co-authored-by: sasha-gitg <[email protected]>
Co-authored-by: sasha-gitg <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…d tfrecord (googleapis#919) * feat: add batch_serve_to_bq for bigquery table and batch_serve_to_gcs for csv and tfrecord files in Featurestore class * fix: change entity_type_ids and entity_type_destination_fields to serving_feature_ids and feature_destination_fields * fix: remove white space * Update google/cloud/aiplatform/featurestore/featurestore.py Co-authored-by: sasha-gitg <[email protected]> * Update google/cloud/aiplatform/featurestore/featurestore.py Co-authored-by: sasha-gitg <[email protected]> * Update google/cloud/aiplatform/featurestore/featurestore.py Co-authored-by: sasha-gitg <[email protected]> * Update google/cloud/aiplatform/featurestore/featurestore.py Co-authored-by: sasha-gitg <[email protected]> * Update google/cloud/aiplatform/featurestore/featurestore.py Co-authored-by: sasha-gitg <[email protected]> * fix: Featurestore create method example usage * fix: get_timestamp_proto for millisecond precision cap * fix: unit tests for get_timestamp_proto Co-authored-by: sasha-gitg <[email protected]>
batch_serve_to_bq
andbatch_serve_to_gcs