TMDB search result similarity check #335

Open
2 tasks done
rraymondgh opened this issue Oct 24, 2024 · 0 comments
Labels
enhancement New feature or request


  • I have checked the existing issues to avoid duplicates
  • I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue

Is your feature request related to a problem? Please describe

I have found the following with TMDB search:

  • query attribute works better if pre-cleaned, for both getting search results and performing similarity checks
  • a composite of similarity measures works better than distance.levenshtein

Describe the solution you'd like

Cleaning

  • remove encoded web addresses from the start
    regexp.MustCompile(`^[a-zA-Z\d ]* ([a-z]{2,3}|Com|COM|TO|NZ|Org) [- ]{1}`)
  • remove non-ASCII characters
    regexp.MustCompile("(?i)[^\x00-\x7F]+")
  • remove season references from the end
    regexp.MustCompile(`(?i) (s|season|season ){1}\d{1,2}e?\d{0,2}$`)
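A minimal sketch of how the three cleaning steps could be chained, using only the standard library. The function name `cleanQuery`, the variable names, the order of the steps, and the final whitespace trim are assumptions for illustration; only the patterns come from the list above.

```go
package main

import (
	"regexp"
	"strings"
)

var (
	// Leading "encoded web address" prefix, e.g. "www Example Com - ".
	sitePrefix = regexp.MustCompile(`^[a-zA-Z\d ]* ([a-z]{2,3}|Com|COM|TO|NZ|Org) [- ]{1}`)
	// Any run of characters outside the 7-bit ASCII range.
	nonASCII = regexp.MustCompile(`[^\x00-\x7F]+`)
	// Trailing season reference, e.g. " S01E02" or " season 2".
	seasonSuffix = regexp.MustCompile(`(?i) (s|season|season ){1}\d{1,2}e?\d{0,2}$`)
)

// cleanQuery applies the three cleaning steps in the order listed in the
// issue, then trims surrounding whitespace (the trim is an assumption).
func cleanQuery(q string) string {
	q = sitePrefix.ReplaceAllString(q, "")
	q = nonASCII.ReplaceAllString(q, "")
	q = seasonSuffix.ReplaceAllString(q, "")
	return strings.TrimSpace(q)
}
```

With these patterns, `cleanQuery("www Example Com - The Show S01E02")` returns `"The Show"`. Note that non-ASCII characters are dropped, not transliterated, so accented titles lose letters.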

Similarity

Use github.com/hbollon/go-edlib

  • OSADamerauLevenshtein
  • Lcs
  • Cosine
  • Jaccard
  • SorensenDice
  • Qgram

Apply similarity targets to min, median and max of these measures.

This reduces the false positives and false negatives that come from using a fixed Levenshtein distance threshold of 5.
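One possible reading of "targets on min, median and max" is sketched below. It assumes each measure has already produced a similarity score in the range 0 to 1 (for example via go-edlib's `StringsSimilarity`, which is not called here so that the sketch stays standard-library only). The type `scoreSummary`, the functions `summarize` and `passes`, and the rule that all three statistics must meet their target are hypothetical, not taken from the issue.

```go
package main

import "sort"

// scoreSummary holds the min, median, and max of a set of similarity scores.
// It is also used to express the targets for those three statistics.
type scoreSummary struct {
	Min, Median, Max float64
}

// summarize computes min, median, and max of the given scores.
// The boolean is false when no scores were supplied.
func summarize(scores []float64) (scoreSummary, bool) {
	if len(scores) == 0 {
		return scoreSummary{}, false
	}
	s := append([]float64(nil), scores...) // copy so the caller's slice is not reordered
	sort.Float64s(s)
	mid := len(s) / 2
	median := s[mid]
	if len(s)%2 == 0 {
		median = (s[mid-1] + s[mid]) / 2
	}
	return scoreSummary{Min: s[0], Median: median, Max: s[len(s)-1]}, true
}

// passes reports whether every statistic meets its target
// (an assumed acceptance rule).
func passes(sum, target scoreSummary) bool {
	return sum.Min >= target.Min &&
		sum.Median >= target.Median &&
		sum.Max >= target.Max
}
```

For scores `{0.5, 0.75, 1.0, 0.25}` the summary is min 0.25, median 0.625, max 1.0. Checking three statistics instead of one distance lets a single outlying measure be tolerated or rejected depending on how the targets are set.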

false negatives distance > 5

image

false positives distance < 5

image

interaction of measures

image

proposed solution

Change bitmagnet so that its configuration can mark a proxy as trusted. The proxy must implement the similarity checks outlined above and return a single result in the array only if that result passes them.

type Config struct {
	Enabled         bool
	BaseUrl         string
	ApiKey          string
	RateLimit       time.Duration
	RateLimitBurst  int
	SimilarityCheck bool
}

func NewDefaultConfig() Config {
	return Config{
		Enabled:         true,
		BaseUrl:         "https://api.themoviedb.org/3",
		ApiKey:          defaultTmdbApiKey,
		RateLimit:       defaultRateLimit,
		RateLimitBurst:  defaultRateLimitBurst,
		SimilarityCheck: true,
	}
}

levenshteinCheck() is applied only if SimilarityCheck is true.
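The gating could look roughly like the sketch below. The trimmed-down `Config`, the `matchFunc` type, and `acceptResult` are hypothetical stand-ins; bitmagnet's actual `levenshteinCheck()` signature is not shown in this issue, so the check is passed in as a function.

```go
package main

// Config is reduced to the one field relevant here (see the full struct above).
type Config struct {
	SimilarityCheck bool
}

// matchFunc is a hypothetical stand-in for a local similarity check
// such as levenshteinCheck().
type matchFunc func(query, title string) bool

// acceptResult runs the local check only when SimilarityCheck is enabled.
// When it is disabled, the upstream (for example a trusted proxy) is assumed
// to have filtered the result already, so the result is accepted as-is.
func acceptResult(cfg Config, query, title string, check matchFunc) bool {
	if !cfg.SimilarityCheck {
		return true
	}
	return check(query, title)
}
```

Keeping the default at `true` preserves the current behaviour for direct TMDB access, and only users who point `BaseUrl` at a trusted proxy would turn the check off.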

rraymondgh added the enhancement label on Oct 24, 2024