Skip to content

Conversation

@SophieYu41
Copy link
Collaborator

@SophieYu41 SophieYu41 commented Jan 26, 2023

This is the change for label join step 2 - generating final view with features.

  • Label table is deduped based on the join key to avoid join duplication

  • Label table only contains keys + labels to avoid dup columns on the left

  • label columns in final view would have a prefix "label" for easy identification

  • Final view generation will run in same "label-join" job but will not be a materialized table

  • Generate final view which joins features with labels

  • Generate another "latest label" view to show the latest available labels given a ds

  • Add underlying table to view properties for source tracking purposes

Next improvement[DONE]:
Use viewProperties to store metadata of underlying feature & label table.

Rebased on #370

End to end tested on real user data. Views generated as expected.

zipline_test_test_label_join_v3 // feature backfill table
 zipline_test_test_label_join_v3_labeled // final joined view
 zipline_test_test_label_join_v3_labeled_latest // latest label view
 zipline_test_test_label_join_v3_labels // label table

@hzding621 @yunfeng-hao @nikhilsimha

@SophieYu41 SophieYu41 force-pushed the sophie--label-view branch 3 times, most recently from 3275943 to c3dcb4f Compare January 27, 2023 21:53
Copy link
Collaborator

@hzding621 hzding621 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM w/ a few minor suggestions!

Love the unit tests, great job!

assertEquals(2, latest.count())
assertEquals(0, latest.filter(latest("listing_id") === "3").count())
assertEquals("2022-11-22", latest.where(latest("ds") === "2022-10-07").
select("label_ds").first().get(0))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider other things to test:

  • label_ds is unique per ds.
  • if no label_ds existed for a ds from the label_table, we still keep the feature data but leave label_ds as null

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still keep the feature data but leave label_ds as null

Based on how we compute the label table & latest view, think this case would only exist in label final labeled_view but not the latest_view since we computed partition info based on label_table, and label_table would always have label_ds even the label columns are null (Here is the fix we had to make sure null label_ds would not happen)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we computed partition info based on label_table

label_table may not have all feature_ds in it. for example, if label_offset is set to 90 and 30 for start and end respectively, and we only run label job for the most recent ds, but the feature table can contain many years of data, then for the majority of feature_ds, they don't have corresponding label yet. but in the latest_view we would want to keep them. this is basically the ELSE TRUE case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. Just to clarify if no label_ds existing, it would be (0 rows) showing up in query.
----+----------+-------------+------------------------------------
(0 rows)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Label join is a Left outer join between features and labels. Therefore, when labels are not present (because they were never computed for those feature DS), we should just keep all the rows from features and leave the label columns as NULL. So it shouldn't be 0 rows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. I was mislead by a bad example the ds actually does not exist on the left and result and end up with 0 rows.

// creating final join view with feature join output table
println(s"Joining label table : ${outputLabelTable} with joined output table : ${joinConf.metaData.outputTable}")
val joinKeys: Array[String] = if (joinConf.rowIds != null && !joinConf.rowIds.isEmpty)
joinConf.rowIds.asScala.toArray else labelJoinConf.rowIdentifier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can combine this entire logic into labelJoinConf.rowIdentifier.

there is also a place in unit test where you used labelJoinConf.rowIdentifier to compute the expected df but technically it's better to use the combined logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Updated.

Side question for rowId - The whole label table dedup logic and join logic is depending on this row id. The rowId must be accurate so the label join is doing right thing otherwise could be pretty messy. If we are depending user to provide this info, do we have validation somewhere making sure users passed in the correct list?
e.g. what should happen if user's rowId != rowIdentifier we computed here?

Copy link
Collaborator

@pengyu-hou pengyu-hou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great unit testings!

@SophieYu41
Copy link
Collaborator Author

Merging this change since it's been opening a while, will address comments in a follow-up change if any. cc @nikhilsimha

@SophieYu41 SophieYu41 merged commit 7476ba9 into master Feb 10, 2023
@SophieYu41 SophieYu41 deleted the sophie--label-view branch February 10, 2023 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants