Add source suggestions for Brave News #25563

Closed
petemill opened this issue Sep 22, 2022 · 14 comments · Fixed by brave/brave-core#15447 or brave/brave-core#15522

Comments

@petemill
Member

URL format: https://[hostname]/source-suggestions/source_similarity_t10.[region].json
where region is, e.g., en_US

The format is:

{
  [key: PublisherID]: {
    source: PublisherID
    score: number
  }[]
}

There is also a human readable file at https://[hostname]/source-suggestions/source_similarity_t10_hr.[region].json, the only purpose of which is to more easily check expected results, where the format is:

{
  [key: PublisherName]: {
    source: PublisherName
    score: number
  }[]
}

Each file provides a lookup for a given PublisherID to a list of similar PublisherIDs with a score ranking for how similar they are to each other (higher score means more similar).
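As a minimal sketch of the lookup described above: the helper names (`similarityUrl`, `similarSources`), the `example.invalid` host, and the sample publisher IDs are all made up for illustration. Only the URL pattern and the `{ source, score }` shape come from this issue.

```javascript
// Hypothetical helper: builds the similarity file URL from the pattern above.
// `hostname` stays a parameter because the issue does not name the host.
const similarityUrl = (hostname, region, humanReadable = false) =>
  `https://${hostname}/source-suggestions/source_similarity_t10${humanReadable ? '_hr' : ''}.${region}.json`;

// Returns the similar sources for `publisherId`, highest score first.
// Publishers missing from the file yield an empty list.
const similarSources = (matrix, publisherId) =>
  [...(matrix[publisherId] ?? [])].sort((a, b) => b.score - a.score);

// Made-up sample data in the documented shape.
const sampleMatrix = {
  'pub-a': [
    { source: 'pub-b', score: 0.2 },
    { source: 'pub-c', score: 0.7 },
  ],
};

console.log(similarityUrl('example.invalid', 'en_US'));
console.log(similarSources(sampleMatrix, 'pub-a').map((s) => s.source));
```

Sorting on read is only a precaution here; the issue does not say whether the lists in the file are already ordered by score.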

Sources we should compare from, in priority order:

  • Sources the user has directly subscribed to
  • Sources the user has indirectly subscribed to (i.e. as part of a channel) and the user has visited the site recently
  • Sources the user has indirectly subscribed to (i.e. as part of a channel) and we have no interest signal

We will take that source list and use the similarity matrix map to produce a list of "suggested sources".

List we should show, in priority order:

  • Sources that the user is not directly or indirectly subscribed to
  • Sources that the user is indirectly subscribed to (i.e. as part of a channel)
    (We should not show sources that the user is already directly subscribed to)

Note: when talking about "direct" subscriptions above, we refer to any mode of subscription: combined sources or rss feed.
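The display ordering proposed above could be sketched roughly as follows. The state constants, `orderSuggestions`, and the candidate shape are hypothetical names, and breaking ties within a tier by score is an assumption the proposal does not spell out.

```javascript
// Hypothetical subscription states for a publisher.
const DIRECT = 'direct';     // followed explicitly (combined source or RSS feed)
const INDIRECT = 'indirect'; // only followed as part of a channel
const NONE = 'none';         // not followed in any way

// `candidates`: [{ id, score }]; `stateOf`: id -> DIRECT | INDIRECT | NONE.
// Drops directly subscribed sources, then lists unsubscribed sources
// before indirectly subscribed ones, highest score first within each tier.
const orderSuggestions = (candidates, stateOf) => {
  const tier = (id) => (stateOf(id) === NONE ? 0 : 1);
  return candidates
    .filter((c) => stateOf(c.id) !== DIRECT)
    .sort((a, b) => tier(a.id) - tier(b.id) || b.score - a.score);
};

const states = { a: DIRECT, b: INDIRECT, c: NONE, d: NONE };
const ordered = orderSuggestions(
  [
    { id: 'a', score: 0.9 },
    { id: 'b', score: 0.8 },
    { id: 'c', score: 0.1 },
    { id: 'd', score: 0.5 },
  ],
  (id) => states[id] ?? NONE,
);
console.log(ordered.map((c) => c.id)); // [ 'd', 'c', 'b' ]
```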

Which similarity region files to download? Any regions in which the user has channel or feed subscriptions, i.e. the same regions we download feed.json files for.

When should we download the similarity files? An appropriate time seems to be when downloading feed subscriptions, since that occurs when the user modifies their subscriptions and is also when we calculate which regions to download from. There may be some benefit to doing it when downloading sources instead, since that is when we search through history, but we can search history for publisher matches again at this new "source similarity comparison" time.

@LorenzoMinto
Member

I was thinking something like this. For the comparing priority, I would treat indirectly subscribed sources (via Channels) as simple unsubscribed sources, and only consider the following signals:

  • Sources the user has directly subscribed to
  • Sources the user has recently visited at least a threshold number of times (independently of subscription status)

As for showing:

  • Sources that the user is not directly subscribed to and that have strong interest signal (history) (not coming via suggestions)
  • Sources that the user is not directly subscribed to (coming from suggestions)

I wouldn't consider the indirect subscription signal unless it's supported by a stronger interest signal (history), because some categories/channels contain sources that the user might entirely ignore, and we should not prioritise those (e.g. a user subscribed to Entertainment who is not interested in Music [Pitchfork, NME] at all).
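A minimal sketch of picking the sources to compare from under this proposal. `kMinVisits`, `seedSources`, and the publisher fields are hypothetical; the thread does not give a threshold value.

```javascript
// Hypothetical threshold; no concrete number is given in this thread.
const kMinVisits = 3;

// `publishers`: [{ id, directlySubscribed, viaChannel, visits }]
// A source is a comparison seed if it is directly subscribed or has been
// visited often enough. Channel membership alone is deliberately ignored.
const seedSources = (publishers, minVisits = kMinVisits) =>
  publishers
    .filter((p) => p.directlySubscribed || p.visits >= minVisits)
    .map((p) => p.id);

const samplePublishers = [
  { id: 'followed', directlySubscribed: true, viaChannel: false, visits: 0 },
  { id: 'channel-only', directlySubscribed: false, viaChannel: true, visits: 1 },
  { id: 'channel-and-visited', directlySubscribed: false, viaChannel: true, visits: 5 },
  { id: 'visited-only', directlySubscribed: false, viaChannel: false, visits: 3 },
];
console.log(seedSources(samplePublishers));
```

Here `channel-only` is left out because it has no supporting history signal, which is the case the Entertainment/Music example is about.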

@mattmcalister

The Source Suggestions spec doc has been updated to reflect these details. It says we want to reflect direct subscriptions and history, not channel membership.

@fallaciousreasoning

Okay, I'm working on implementing this at the moment, and it would be good to formalize this a bit more (say with some weights we give to everything). Apart from that, I have a few questions:

  1. To confirm, we SHOULD suggest sources similar to ones the user has visited, even if they aren't subscribed to that source?
  2. Should visits to a source that isn't subscribed mean we should suggest subscribing to that source? How do we calculate the rank here?
  3. How do we handle different locales here? For example, if a user is subscribed to a feed in en_US and es_MX we should show suggestions for both of these locales. Simplest for me, I think, would be to download both similarity matrices and merge them into one big similarity matrix. However, I'm not sure what to do if a publisher is in multiple locales with different weights. Just take the highest/lowest score? Average them?
// 0 - 1, depending on whether the publisher is enabled.
const getEnabledWeighting = (publisher) => {
  // Maybe we want to do some extra weighting here for being in channels the user is subscribed to?
  return publisher.subscribed ? 1 : 0;
};

// Completely arbitrary, but 0.4 - 1, based on how often the user has visited this publisher in the past.
// |normalizedVisitWeights| are the visits to each publisher, divided by the visits to the most visited
// publisher in the last 200 days.
const getVisitRating = (publisher, normalizedVisitWeights) => {
  const kMinWeight = 0.4;
  const kMaxWeight = 1;
  return normalizedVisitWeights[publisher] * (kMaxWeight - kMinWeight) + kMinWeight;
};
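One way the normalised visit weights described in the comment above might be computed. `normalizeVisits` and `visitRating` are hypothetical names; keying by publisher ID and scoring never-visited publishers as 0 are assumptions of this sketch, not something the snippet above specifies.

```javascript
// `visitCounts`: { [publisherId]: visits within the history window }.
// Divides each count by the largest count; publishers with no visits are dropped.
const normalizeVisits = (visitCounts) => {
  const max = Math.max(0, ...Object.values(visitCounts));
  if (max === 0) return {};
  return Object.fromEntries(
    Object.entries(visitCounts)
      .filter(([, count]) => count > 0)
      .map(([id, count]) => [id, count / max]),
  );
};

// Same 0.4 - 1 projection as getVisitRating, keyed by publisher ID.
// Assumption: a publisher that was never visited scores 0 rather than 0.4.
const visitRating = (publisherId, weights, min = 0.4, max = 1) =>
  publisherId in weights ? weights[publisherId] * (max - min) + min : 0;

const weights = normalizeVisits({ 'pub-a': 10, 'pub-b': 5, 'pub-c': 0 });
console.log(weights);
console.log(visitRating('pub-a', weights), visitRating('pub-b', weights), visitRating('pub-c', weights));
```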

@mattmcalister

cc @LorenzoMinto and @aurangzaib048

@LorenzoMinto
Member

LorenzoMinto commented Oct 17, 2022

hey @fallaciousreasoning are you talking about suggestions ranking or feed ranking?

If referring to suggestions ranking, my thought is that we should prioritise visited sources over sources similar to visited ones, because visits would be the strongest signal there. As for similar-to-subscribed vs similar-to-visited, I would prioritise the former. Wdyt?

PS: Just noticed the function in the spec is way outdated. Working now on coming up with a new one that reflects the above priorities.

@LorenzoMinto
Member

We could have three different regularised contributors:

visited: [0.4, 1], similar_sub: [0, 0.4], similar_visited: [0, 0.3] 

(the score ranges are pretty much arbitrary, we can discuss). The final score for each source would then be the sum of the three scores above

s(i) = visited[i] + similar_sub[i] + similar_visited[i]  # min: 0, max: 1.7

and finally we would sample from the score distribution s to create the actual suggestion list.
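The comment does not say how the sampling would work. One possible reading is weighted sampling without replacement, sketched below; `sampleSuggestions` and its injectable `rng` parameter are hypothetical.

```javascript
// Draws up to `count` distinct ids from a { id: score } map. Each pick is
// proportional to the scores of the ids still in the pool.
// `rng` must return a number in [0, 1); it is injectable so a draw can be replayed.
const sampleSuggestions = (scores, count, rng = Math.random) => {
  const pool = Object.entries(scores).filter(([, score]) => score > 0);
  const picked = [];
  while (picked.length < count && pool.length > 0) {
    const total = pool.reduce((sum, [, score]) => sum + score, 0);
    let remaining = rng() * total;
    let index = 0;
    // Walk the pool until the running remainder falls inside an entry's share.
    while (index < pool.length - 1 && remaining >= pool[index][1]) {
      remaining -= pool[index][1];
      index += 1;
    }
    picked.push(pool[index][0]);
    pool.splice(index, 1);
  }
  return picked;
};

console.log(sampleSuggestions({ a: 1.2, b: 0.4, c: 0.1 }, 2));
```

Sources with a score of 0 are never drawn, and a high score makes an early pick likely without guaranteeing it.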

Each contributor is normalised independently over the scores from all other sources and projected to that specific range. I would suggest we only look at the top-10 similar sources to compute the similar_sub and similar_visited scores. For example, in pseudo code:

similar_sub[i] = sum([sim(i,j)*getEnabledWeighting(j) for j in top_similar(i, 10)])

the score would then be normalised over the entire similar_sub vector and projected over [0, 0.4] in the case of this predictor, like @fallaciousreasoning did for the getVisitRating.

Let me know what you think, maybe there's a simpler solution to achieve those priorities. This solution (with a decent tuning) would allow a gradual blending of them.
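A rough sketch of the scoring described above. `project`, `rawSimilarSub`, and `totalScore` are hypothetical names, and reading "normalised" as "divided by the maximum raw score" is an assumption, since the comment does not pin the normalisation down.

```javascript
// Rescales a { id: rawScore } map into [lo, hi].
// Assumption: the largest raw score maps to `hi` and a raw score of 0 maps to `lo`.
const project = (raw, lo, hi) => {
  const max = Math.max(0, ...Object.values(raw));
  return Object.fromEntries(
    Object.entries(raw).map(([id, v]) => [id, max > 0 ? lo + (v / max) * (hi - lo) : lo]),
  );
};

// similar_sub[i] = sum of sim(i, j) * enabled(j) over the top `k` entries j of i.
const rawSimilarSub = (matrix, isSubscribed, k = 10) =>
  Object.fromEntries(
    Object.entries(matrix).map(([id, entries]) => [
      id,
      [...entries]
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .reduce((sum, e) => sum + e.score * (isSubscribed(e.source) ? 1 : 0), 0),
    ]),
  );

// s(i) = visited[i] + similar_sub[i] + similar_visited[i]; missing entries count as 0.
const totalScore = (id, visited, similarSub, similarVisited) =>
  (visited[id] ?? 0) + (similarSub[id] ?? 0) + (similarVisited[id] ?? 0);

// Made-up similarity data: the user is subscribed to 'b' only.
const matrix = {
  a: [{ source: 'b', score: 0.5 }, { source: 'c', score: 0.5 }],
  d: [{ source: 'b', score: 0.25 }],
};
const similarSub = project(rawSimilarSub(matrix, (id) => id === 'b'), 0, 0.4);
console.log(similarSub);
console.log(totalScore('a', { a: 0.7 }, similarSub, {}));
```

The `visited` and `similar_visited` contributors would go through the same `project` step with their own ranges.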

@petemill
Member Author

For the questions around which locales to pull suggestions from, if we consider the same sources can appear in multiple locales, then I think we have to consider a user's "locale list" to decide which similarity files to download and pull from.

Here is a scenario I'm thinking about: If the user is subscribed to "XYZ News" and it is a source that is "in" both EN_US and EN_CA, we do not want to suggest other EN_CA news sources to users who have the EN_US locale.
However, if the user has purposefully subscribed to "ZYX News" which is only in EN_CA then, even if the user has the EN_US locale, we should suggest other EN_CA news sources to the user. This becomes more important where the user has a locale which we don't have a direct list of sources for, e.g. EN_FR.

So perhaps the list of similarity locales to consider for a source is:

If the source has a single locale:

  • The source's locale

If the source has multiple locales:

  • The locale the user's OS is set to, if there's a match
  • OR any of those locales which the user also has channel subscriptions to
  • OR any of those locales which the user also has single-locale source subscriptions to

If there's still no match, which is possible especially if the user has no channel subscriptions, then perhaps we scan the list of subscriptions for the most common locale, or maybe just combine all the relevant locale similarity matrices.
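The rules above might be sketched like this. `similarityLocalesFor` and the `user` field names are hypothetical. Treating the three "OR" rules as a priority chain and falling back to all of the source's locales when nothing matches are assumptions; the comment leaves both open.

```javascript
// `source.locales`: every locale the source appears in.
// `user`: { osLocale, channelLocales, singleLocaleSourceLocales } (hypothetical fields).
const similarityLocalesFor = (source, user) => {
  // Single-locale source: always use that locale.
  if (source.locales.length === 1) return [...source.locales];
  // Multi-locale source: prefer the OS locale when the source is in it.
  if (source.locales.includes(user.osLocale)) return [user.osLocale];
  // Otherwise, locales the user has channel subscriptions in.
  const viaChannels = source.locales.filter((l) => user.channelLocales.includes(l));
  if (viaChannels.length > 0) return viaChannels;
  // Otherwise, locales of single-locale sources the user follows.
  const viaSources = source.locales.filter((l) => user.singleLocaleSourceLocales.includes(l));
  if (viaSources.length > 0) return viaSources;
  // Assumed fallback: combine every locale the source is in.
  return [...source.locales];
};

const usUser = { osLocale: 'en_US', channelLocales: [], singleLocaleSourceLocales: [] };
console.log(similarityLocalesFor({ locales: ['en_US', 'en_CA'] }, usUser)); // XYZ News case
console.log(similarityLocalesFor({ locales: ['en_CA'] }, usUser)); // ZYX News case
```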

@mattmcalister

If this creates a lot of work then it might be best to identify a single rule that likely covers the most common cases. I think we would cover a lot of ground with "The locale the user's OS is set to".

@petemill
Member Author

> If this creates a lot of work then it might be best to identify a single rule that likely covers the most common cases. I think we would cover a lot of ground with "The locale the user's OS is set to".

Absolutely fine to at least start with that then build incrementally if needed, since that's contained within the suggestion above.

@fallaciousreasoning

Okay, I have a first pass implementation based on our discussion (brave/brave-core#15447). While writing it I came up with a few more questions:

  1. Should sources the user has disabled ever be suggested? (in the PR, they are).
  2. For a source which is similar to one the user has visited before, should that similarity be multiplied by the visit score? (i.e. if I visit theatlantic.com a lot, should sources similar to it be recommended more strongly than sources similar to fox.com, which I only visited once?)

@LorenzoMinto, I agree, visits should probably be our strongest signal, then similar to subscribed then similar to visits.

@mattmcalister

  1. We definitely shouldn't suggest a source that a user has chosen to "Hide". And also if they "Unfollow" a source then it would probably be expected that it is not suggested but that's a weaker signal.
  2. Good point. Since visits are the signal we value most, maybe they should be used to weight the suggestions.

@LorenzoMinto what do you think?

@LorenzoMinto
Member

Yes. Fully agree on using the visit score to weight visited domains contributions 👌
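A minimal sketch of what was agreed here: similarity to a visited source is weighted by that source's visit score, and hidden sources are never suggested. `rawSimilarVisited` is a hypothetical name, and it iterates over each visited source's similarity list, which is one possible direction for the lookup and not necessarily what the PR does.

```javascript
// `matrix`: { [publisherId]: [{ source, score }] } as in the similarity files.
// `visitWeights`: { [publisherId]: visit score }, e.g. the 0.4 - 1 rating.
// `hidden`: Set of publisher IDs the user has chosen to hide.
const rawSimilarVisited = (matrix, visitWeights, hidden = new Set()) => {
  const scores = {};
  for (const [visitedId, weight] of Object.entries(visitWeights)) {
    for (const { source, score } of matrix[visitedId] ?? []) {
      if (hidden.has(source)) continue; // hidden sources are never suggested
      scores[source] = (scores[source] ?? 0) + score * weight;
    }
  }
  return scores;
};

// Made-up data: 'often' is visited a lot, 'once' only a single time.
const similarity = {
  often: [{ source: 'x', score: 0.5 }],
  once: [{ source: 'y', score: 0.5 }, { source: 'x', score: 0.2 }],
};
const visits = { often: 1, once: 0.1 };
console.log(rawSimilarVisited(similarity, visits));
console.log(rawSimilarVisited(similarity, visits, new Set(['y'])));
```

With equal similarity (0.5), `x` ends up well ahead of `y` because it is similar to the heavily visited source.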

@stephendonner

stephendonner commented Oct 25, 2022

Verification PASSED using

Brave 1.46.79 Chromium: 107.0.5304.62 (Official Build) dev (x86_64)
Revision 1eec40d3a5764881c92085aaee66d25075c159aa-refs/branch-heads/5304@{#942}
OS macOS Version 11.7.1 (Build 20G918)

Steps:

  1. installed 1.46.79
  2. launched Brave
  3. opened brave://flags/
  4. set brave://flags/#brave-news-v2 to Enabled
  5. clicked Relaunch
  6. opened a new-tab page
  7. scrolled down
  8. clicked on Show Brave News
  9. clicked on Customize
  10. searched for the drive
  11. clicked on the Follow button
  12. clicked on the x to close the Customize dialog
  13. reloaded the Brave News tab
  14. clicked Customize again
  15. examined the Suggestions list

Confirmed I got Car & Driver, PopularMechanics (sic), and Ars Technica suggestions

[Screenshots: step 5, step 8, step 9, step 10, step 11, step 15]

@srirambv
Contributor

srirambv commented Nov 3, 2022

Removing OS/Android label as the front-end work is not yet done for Android. Logged follow up issue #26497. More info here

@srirambv removed the OS/Android label Nov 3, 2022
@rebron changed the title from "Brave News Source suggestions" to "Add source suggestions for Brave News" Nov 30, 2022