-
Notifications
You must be signed in to change notification settings - Fork 86
[ODM] Creating final view with labels #380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3275943 to
c3dcb4f
Compare
46a1376 to
96c843e
Compare
hzding621
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM w/ a few minor suggestions!
Love the unit tests, great job!
| assertEquals(2, latest.count()) | ||
| assertEquals(0, latest.filter(latest("listing_id") === "3").count()) | ||
| assertEquals("2022-11-22", latest.where(latest("ds") === "2022-10-07"). | ||
| select("label_ds").first().get(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider other things to test:
- label_ds is unique per ds.
- if no label_ds existed for a ds from the label_table, we still keep the feature data but leave label_ds as null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still keep the feature data but leave label_ds as null
Based on how we compute the label table & latest view, think this case would only exist in label final labeled_view but not the latest_view since we computed partition info based on label_table, and label_table would always have label_ds even the label columns are null (Here is the fix we had to make sure null label_ds would not happen)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we computed partition info based on label_table
label_table may not have all feature_ds in it. for example, if label_offset is set to 90 and 30 for start and end respectively, and we only run label job for the most recent ds, but the feature table can contain many years of data, then for the majority of feature_ds, they don't have corresponding label yet. but in the latest_view we would want to keep them. this is basically the ELSE TRUE case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right. Just to clarify if no label_ds existing, it would be (0 rows) showing up in query.
----+----------+-------------+------------------------------------
(0 rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Label join is a Left outer join between features and labels. Therefore, when labels are not present (because they were never computed for those feature DS), we should just keep all the rows from features and leave the label columns as NULL. So it shouldn't be 0 rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right. I was mislead by a bad example the ds actually does not exist on the left and result and end up with 0 rows.
| // creating final join view with feature join output table | ||
| println(s"Joining label table : ${outputLabelTable} with joined output table : ${joinConf.metaData.outputTable}") | ||
| val joinKeys: Array[String] = if (joinConf.rowIds != null && !joinConf.rowIds.isEmpty) | ||
| joinConf.rowIds.asScala.toArray else labelJoinConf.rowIdentifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can combine this entire logic into labelJoinConf.rowIdentifier.
there is also a place in unit test where you used labelJoinConf.rowIdentifier to compute the expected df but technically it's better to use the combined logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Updated.
Side question for rowId - The whole label table dedup logic and join logic is depending on this row id. The rowId must be accurate so the label join is doing right thing otherwise could be pretty messy. If we are depending user to provide this info, do we have validation somewhere making sure users passed in the correct list?
e.g. what should happen if user's rowId != rowIdentifier we computed here?
spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala
Outdated
Show resolved
Hide resolved
spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala
Outdated
Show resolved
Hide resolved
8df8c2b to
f0da575
Compare
pengyu-hou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great unit testings!
|
Merging this change since it's been opening a while, will address comments in a follow-up change if any. cc @nikhilsimha |
This is the change for label join step 2 - generating final view with features.
Label table is deduped based on the join key to avoid join duplication
Label table only contains keys + labels to avoid dup columns on the left
label columns in final view would have a prefix "label" for easy identification
Final view generation will run in same "label-join" job but will not be a materialized table
Generate final view which joins features with labels
Generate another "latest label" view to show the latest available labels given a ds
Add underlying table to view properties for source tracking purposes
Next improvement[DONE]:
Use viewProperties to store metadata of underlying feature & label table.
Rebased on #370
End to end tested on real user data. Views generated as expected.
@hzding621 @yunfeng-hao @nikhilsimha