[ODM] Creating final view with labels #380

SophieYu41 · 2023-01-26T02:45:27Z

This change implements step 2 of label join: generating the final view that joins features with labels.

Label table is deduped on the join key to avoid duplicate rows in the join
Label table contains only keys + labels to avoid duplicating columns already on the left (feature) side
Label columns in the final view have a "label" prefix for easy identification
Final view generation runs in the same "label-join" job, but the view is not a materialized table
Generate the final view, which joins features with labels
Generate another "latest label" view to show the latest available labels for a given ds
Add the underlying table to the view properties for source tracking purposes
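
To make the dedup, prefix, and join behavior listed above concrete, here is a minimal sketch using plain Scala collections rather than Spark. The row types (`FeatureRow`, `LabelRow`), the `labeledView` helper, and the `label_` prefix format are assumptions made for illustration only; they are not the code in this PR.

```scala
// Hypothetical row shapes for illustration; not Chronon's actual schema.
case class FeatureRow(listingId: String, ds: String, features: Map[String, Any])
case class LabelRow(listingId: String, ds: String, labelDs: String, labels: Map[String, Any])

object LabeledViewSketch {
  // Keep one label row per join key so the join cannot fan out.
  def dedup(labels: Seq[LabelRow]): Map[(String, String), LabelRow] =
    labels.groupBy(l => (l.listingId, l.ds)).map { case (key, rows) => key -> rows.head }

  // Prefix label columns so they are easy to tell apart from feature columns.
  def prefixed(labels: Map[String, Any]): Map[String, Any] =
    labels.map { case (name, value) => s"label_$name" -> value }

  // Left outer join: every feature row survives; missing labels become null.
  def labeledView(features: Seq[FeatureRow],
                  labels: Seq[LabelRow],
                  labelCols: Seq[String]): Seq[Map[String, Any]] = {
    val byKey = dedup(labels)
    features.map { f =>
      val labelPart: Map[String, Any] = byKey.get((f.listingId, f.ds)) match {
        case Some(l) => prefixed(l.labels) + ("label_ds" -> l.labelDs)
        case None    => labelCols.map(c => s"label_$c" -> (null: Any)).toMap + ("label_ds" -> null)
      }
      Map[String, Any]("listing_id" -> f.listingId, "ds" -> f.ds) ++ f.features ++ labelPart
    }
  }
}
```

Because the label side carries only keys and labels, the joined row takes every feature column from the left side exactly once.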

Next improvement [DONE]:
Use viewProperties to store metadata of the underlying feature & label tables.

Rebased on #370

End to end tested on real user data. Views generated as expected.

zipline_test_test_label_join_v3 // feature backfill table
 zipline_test_test_label_join_v3_labeled // final joined view
 zipline_test_test_label_join_v3_labeled_latest // latest label view
 zipline_test_test_label_join_v3_labels // label table

@hzding621 @yunfeng-hao @nikhilsimha

api/src/main/scala/ai/chronon/api/Extensions.scala

spark/src/test/scala/ai/chronon/spark/test/LabelJoinTest.scala

api/src/main/scala/ai/chronon/api/Extensions.scala

hzding621

LGTM w/ a few minor suggestions!

Love the unit tests, great job!

api/src/main/scala/ai/chronon/api/Constants.scala

spark/src/main/scala/ai/chronon/spark/JoinUtils.scala

hzding621 · 2023-02-03T23:11:16Z

spark/src/test/scala/ai/chronon/spark/test/JoinUtilsTest.scala

+    assertEquals(2, latest.count())
+    assertEquals(0, latest.filter(latest("listing_id") === "3").count())
+    assertEquals("2022-11-22", latest.where(latest("ds") === "2022-10-07").
+      select("label_ds").first().get(0))


Consider other things to test:

label_ds is unique per ds.

If no label_ds exists for a ds in the label_table, we still keep the feature data but leave label_ds as null.

Re "we still keep the feature data but leave label_ds as null": based on how we compute the label table and the latest view, I think this case would only exist in the final labeled view, not in the latest view, since we compute partition info from label_table, and label_table always has a label_ds even when the label columns are null. (Here is the fix we made to ensure a null label_ds does not happen.)

Re "since we computed partition info based on label_table": label_table may not contain every feature_ds. For example, if label_offset is set to 90 and 30 for start and end respectively, and we only run the label job for the most recent ds, while the feature table can contain many years of data, then the majority of feature_ds values do not have a corresponding label yet. But in the latest_view we would want to keep them. This is basically the ELSE TRUE case.

Ah right. Just to clarify: if no label_ds exists, the query shows (0 rows).
----+----------+-------------+------------------------------------
(0 rows)

Label join is a left outer join between features and labels. Therefore, when labels are not present (because they were never computed for those feature ds values), we should keep all the rows from features and leave the label columns as NULL. So it shouldn't be 0 rows.

Ah right. I was misled by a bad example: the ds did not actually exist on the left side, so the result ended up with 0 rows.
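
The behavior agreed on in this thread (one label_ds per ds, and feature ds values kept even when no label exists yet) can be sketched with plain Scala collections. The `latestLabelDs` helper and its input shape are hypothetical and exist only to illustrate the semantics; the PR itself builds the latest view in SQL over the label table.

```scala
object LatestLabelSketch {
  // Hypothetical input: (ds, label_ds) pairs taken from the label table.
  // For every feature ds, pick the most recent label_ds, or None if that
  // ds has never been labeled. yyyy-MM-dd strings sort chronologically.
  def latestLabelDs(featureDs: Seq[String],
                    labelPartitions: Seq[(String, String)]): Map[String, Option[String]] = {
    val latest: Map[String, String] =
      labelPartitions.groupBy(_._1).map { case (ds, pairs) => ds -> pairs.map(_._2).max }
    // Iterate over the feature side so unlabeled ds values are kept.
    featureDs.map(ds => ds -> latest.get(ds)).toMap
  }
}
```

Driving the result from the feature side, rather than from the label partitions, is what keeps a ds with no labels in the output instead of dropping it.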

hzding621 · 2023-02-03T23:15:30Z

spark/src/main/scala/ai/chronon/spark/LabelJoin.scala

+      // creating final join view with feature join output table
+      println(s"Joining label table : ${outputLabelTable} with joined output table : ${joinConf.metaData.outputTable}")
+      val joinKeys: Array[String] = if (joinConf.rowIds != null && !joinConf.rowIds.isEmpty)
+        joinConf.rowIds.asScala.toArray else labelJoinConf.rowIdentifier


We can combine this entire logic into labelJoinConf.rowIdentifier.

There is also a place in the unit test where you used labelJoinConf.rowIdentifier to compute the expected df, but technically it's better to use the combined logic.

Agree. Updated.

Side question on rowId: the whole label table dedup and join logic depends on this row id. The rowId must be accurate for the label join to do the right thing; otherwise the result could be pretty messy. If we depend on the user to provide this info, do we have validation somewhere to make sure users pass in the correct list?
E.g., what should happen if the user's rowId != the rowIdentifier we computed here?
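
For reference, the combined row-identifier logic suggested in this thread could look roughly like the sketch below. The standalone `rowIdentifier` function and its parameters are hypothetical; in the PR the logic lives on `labelJoinConf`, and the actual signature may differ.

```scala
import scala.collection.JavaConverters._

object RowIdentifierSketch {
  // Hypothetical helper: prefer user-supplied row ids when present,
  // otherwise fall back to the keys derived from the label join config.
  def rowIdentifier(userRowIds: java.util.List[String],
                    derivedKeys: Array[String]): Array[String] =
    Option(userRowIds)          // a null list becomes None
      .filter(!_.isEmpty)       // an empty list also falls back
      .map(_.asScala.toArray)
      .getOrElse(derivedKeys)
}
```

Keeping the fallback in one place means the job and the unit test compute the join keys the same way.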

spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala

pengyu-hou

Great unit tests!

SophieYu41 · 2023-02-10T19:45:57Z

Merging this change since it's been open for a while; will address any further comments in a follow-up change. cc @nikhilsimha

SophieYu41 force-pushed the sophie--label-view branch 3 times, most recently from 3275943 to c3dcb4f, January 27, 2023 21:53

SophieYu41 commented Jan 30, 2023

hzding621 reviewed Jan 31, 2023

SophieYu41 force-pushed the sophie--label-view branch from 46a1376 to 96c843e, February 2, 2023 01:06

SophieYu41 requested review from hzding621 and yunfeng-hao February 2, 2023 01:08

hzding621 approved these changes Feb 3, 2023

Sophie Wang added 7 commits February 6, 2023 14:02

dedup label table

7d43618

Add final view sql & refactor

d0506e6

rebase

bd0fb65

add latest label utils

92302bc

create latest label view

ce5243b

Add view meatadata in properties

cc4139c

comments

f0da575

SophieYu41 force-pushed the sophie--label-view branch from 8df8c2b to f0da575, February 6, 2023 22:03

SophieYu41 requested review from nikhilsimha and pengyu-hou February 6, 2023 23:28

pengyu-hou approved these changes Feb 9, 2023

SophieYu41 merged commit 7476ba9 into master Feb 10, 2023

SophieYu41 deleted the sophie--label-view branch February 10, 2023 19:46