Bug: Couple of bugs notices with the usage, listing them down here. #36

Open
arao opened this issue Dec 14, 2024 · 6 comments

@arao

arao commented Dec 14, 2024

** Currently the parsed name is used as-is for the directory structure. Torrent names being inconsistent, a simple letter-case difference causes multiple folders to be created for the same content, for example:

'In the Heights (2021)/In the Heights (2021) - [H265][HDR].mkv' -> '/mnt/sv1/volumes/library/zurg/__all__/In the Heights (2021) UHD BDRemux 4K 2160p H.265 HEVC HDR 10+ Dolby Vision Ukr Eng Sub Eng Multilang [Hurtom]/In the Heights (2021) UHD BDRemux 4K 2160p H.265 HEVC HDR 10+ Dolby Vision Ukr Eng Sub Eng Multilang [Hurtom].mkv'

In The Heights (2021)/'In The Heights (2021) - [HEVC][DV][7.1].mkv' -> /mnt/sv1/volumes/library/zurg/__all__/In.The.Heights.2021.UHD.BluRay.2160p.TrueHD.Atmos.7.1.DV.HEVC.REMUX-FraMeSToR/In.The.Heights.2021.UHD.BluRay.2160p.TrueHD.Atmos.7.1.DV.HEVC.REMUX-FraMeSToR.mkv
'500 Days of Summer (2009)'
'500 Days Of Summer (2009)'
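
One way to avoid the duplicate folders above is to compare folder names through a case-insensitive key and reuse whichever spelling was seen first. This is only a minimal sketch: `folder_key` and `resolve_folder` are hypothetical helper names, not functions from the project.

```python
import re


def folder_key(name: str) -> str:
    """Return a key that ignores letter case and repeated whitespace."""
    # casefold() is a more aggressive lower(), intended for caseless matching.
    return re.sub(r"\s+", " ", name).strip().casefold()


def resolve_folder(name: str, existing: dict[str, str]) -> str:
    """Reuse the first-seen spelling for names that differ only by case."""
    return existing.setdefault(folder_key(name), name)


seen: dict[str, str] = {}
first = resolve_folder("500 Days of Summer (2009)", seen)
second = resolve_folder("500 Days Of Summer (2009)", seen)
print(first == second)
```

In a real run the `existing` mapping would probably be seeded from the folders already present in the destination directory.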

** Invalid keyword picked up from the directory instead of the file name

'sample Mission Impossible  Multi UHD 10bit TrueHD   DTOne (1996)/sample Mission Impossible  Multi UHD 10bit TrueHD   DTOne (1996) - [x265][HDR][5.1].mkv' -> /mnt/sv1/volumes/library/zurg/__all__/Mission.Impossible.Pentalogy1996-2015.Multi.2160p.UHD.BluRay.x265.HDR-DTOne/sample-Mission.Impossible.1996.Multi.2160p.UHD.BluRay.x265.10bit.HDR.TrueHD.5.1-DTOne.mkv

A possible solution is to skip unnecessary keywords such as "sample", but this requires further investigation.
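
A rough sketch of that idea, assuming the noise word only needs removing when it leads the name. `NOISE_WORDS` and `strip_noise_prefix` are hypothetical names, and the word list is only an illustration.

```python
import re

# Hypothetical list of leading keywords that should never end up in a title.
NOISE_WORDS = ("sample",)

_NOISE_RE = re.compile(
    r"^(?:(?:" + "|".join(re.escape(w) for w in NOISE_WORDS) + r")[\s._-]+)+",
    re.IGNORECASE,
)


def strip_noise_prefix(name: str) -> tuple[str, bool]:
    """Drop leading noise words; also report whether anything was removed."""
    match = _NOISE_RE.match(name)
    if match is None:
        return name, False
    return name[match.end():], True


title, was_sample = strip_noise_prefix("sample-Mission.Impossible.1996.Multi.2160p")
print(title, was_sample)
```

Returning the flag keeps the "this is a sample" information available without letting it leak into the title. A film whose real title starts with "Sample" would be mangled, which is part of why this needs further investigation.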


wrong parsing

'xXx  Multi TF HDL (2002)/xXx  Multi TF HDL (2002) [Resolution].mkv' -> '/mnt/sv1/volumes/library/zurg/__all__/[ OxTorrent.com ] xXx.2002.Multi TF 1080p HDL/[ OxTorrent.com ] xXx.2002.Multi TF 1080p HDL.mkv'

Could remove non-ASCII (probably Mandarin) characters from the file name.

'失踪宝贝 Gone Baby Gone  DDP5  2Audio 4K世界 (2007) - [DDP5.1].mkv' -> '/mnt/sv1/volumes/library/zurg/__all__/[4ksj.com]失踪宝贝.Gone.Baby.Gone.2007.2160p.WEB-DL.H265.DDP5.1.2Audio[国英双音轨_中英双语字幕]-4K世界/[4ksj.com]失踪宝贝.Gone.Baby.Gone.2007.2160p.WEB-DL.H265.DDP5.1.2Audio[国英双音轨_中英双语字幕]-4K世界.mkv'
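
A minimal sketch of that clean-up, assuming the unwanted characters are CJK ideographs and that the title should be left alone when no Latin text would remain. `drop_cjk_if_latin_remains` is a hypothetical helper, and the character ranges cover only the common CJK blocks.

```python
import re

# CJK Unified Ideographs plus Extension A; other scripts are left untouched.
_CJK_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]+")


def drop_cjk_if_latin_remains(title: str) -> str:
    """Remove CJK runs, but only if some Latin text is left afterwards."""
    stripped = _CJK_RE.sub(" ", title)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    # A title written only in Chinese is returned unchanged.
    return stripped if re.search(r"[A-Za-z]", stripped) else title


print(drop_cjk_if_latin_remains("失踪宝贝 Gone Baby Gone (2007)"))
```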

'[ToishY] Koyomimonogatari - NCED [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCED1a [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCED1b [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCED1c [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCOP1 [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCOP2 [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCOP3 [BDRip'

Possible solution: anything in "[]" at the start of the title is probably not worth picking up.
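
A small sketch of that rule, assuming leading tags are always enclosed in square brackets. `strip_leading_tags` is a hypothetical name, not an existing function in the project.

```python
import re

# One or more "[...]" groups at the very start, with optional whitespace.
_LEADING_TAG_RE = re.compile(r"^\s*(?:\[[^\]]*\]\s*)+")


def strip_leading_tags(name: str) -> str:
    """Remove bracketed release-group or site tags from the start of a name."""
    cleaned = _LEADING_TAG_RE.sub("", name)
    # If the whole name was a tag, keep the original rather than return "".
    return cleaned or name


print(strip_leading_tags("[ToishY] Owarimonogatari 2nd Season - NCOP1 [BDRip"))
print(strip_leading_tags("[ OxTorrent.com ] xXx.2002.Multi TF 1080p HDL"))
```

Brackets later in the name, such as a trailing "[BDRip", are left as they are and would still need separate handling.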


Shows/1x01/Season 1/1x01 - S1E01 .mkv -> '/mnt/sv1/volumes/library/zurg/__all__/PSYCHO-PASS S01 [BDrip 1080 DTS Multi][vostfr-french-breton]/1x01.mkv'
Shows/[A&C] Dr  Stone S1 [01] [BDRip 1080p] [/Season 1/[A&C] Dr  Stone S1 [01] [BDRip 1080p] [ - S01E01 .mkv/-> '/mnt/sv1/volumes/library/zurg/__all__/[A&C] Dr Stone S01 & S02 [BDRip 1080p]/[A&C] Dr. Stone S1 [01] [BDRip 1080p] [415E7369].mkv'

Let me know if the directory map is required; I can share it. BTW, great initiative to map the media directory. I've been looking for something similar for some time.

@sureshfizzy
Owner

sureshfizzy commented Dec 14, 2024

This seems interesting, I've never encountered these patterns before. I'll work on enhancing the parser.

edit: Was 'sample Mission Impossible Multi UHD 10bit TrueHD DTOne (1996)/sample Mission Impossible Multi UHD 10bit TrueHD DTOne (1996) - [x265][HDR][5.1].mkv' an actual sample file with a small size, or is it just a naming convention?

@arao
Author

arao commented Dec 14, 2024

The file is a sample, but the title should not be affected.

There are a couple of libraries as well; you can check their implementations for ideas. Listing them here:

They are in JS, but can easily be ported to Python.

@sureshfizzy
Owner

[ToishY] Owarimonogatari 2nd Season - NCED1a [BDRip

This seems good, but even with this parser we would still need a second filtering pass:

'A&C] Dr. Stone S1 [01] [BDRip 1080p] [415E7369].mkv'

output:

{
  season: 1,
  resolution: '1080p',
  quality: 'BDRip',
  container: 'mkv',
  title: 'A&C] Dr. Stone',
  excess: [ '[01]', '[', ']', '[415E7369]' ]
}

My current approach is to refine our existing parser and explore the possibility of integrating the JavaScript parser for fallback options. Do you have any alternative suggestions?

@arao
Author

arao commented Dec 14, 2024

I was trying the following approach.

  • Keep a centralised place for extracting all the metadata and the title for a given path. "Path" here means the directory chain down to the actual file. I've seen multiple occurrences where most of the information resides in the parent folder and the file name itself carries very little.
  • Once information such as title, year, resolution, season, episode, file type, and tags is extracted, use it to find the associated IMDb/TMDB/TVDB data. This is crucial: once we can associate a file with a metadata service, we no longer have to rely on the limited information present in the file name.
  • Once the ID is resolved, we are pretty much done with organising.
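
The first step could be sketched roughly as below, assuming only the immediate parent folder and the file name are inspected and that the file name wins on conflicts. `FIELD_PATTERNS`, `extract_fields`, and `extract_from_path` are hypothetical names and the patterns are deliberately simplistic. The metadata-service lookup is not shown.

```python
import re
from pathlib import PurePosixPath

# Hypothetical, deliberately small set of field patterns (named groups only).
FIELD_PATTERNS = {
    "year": re.compile(r"(?<!\d)(?P<year>(?:19|20)\d{2})(?!\d)"),
    "resolution": re.compile(r"(?P<resolution>\d{3,4}p)", re.IGNORECASE),
    "season": re.compile(r"\bS(?P<season>\d{1,2})", re.IGNORECASE),
}


def extract_fields(text: str) -> dict[str, str]:
    """Collect every field whose pattern matches somewhere in the text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(name)
    return fields


def extract_from_path(path: str) -> dict[str, str]:
    """Merge parent-folder fields with file-name fields; the file name wins."""
    p = PurePosixPath(path)
    merged = extract_fields(p.parent.name)
    merged.update(extract_fields(p.stem))
    return merged


print(extract_from_path(
    "PSYCHO-PASS S01 [BDrip 1080 DTS Multi][vostfr-french-breton]/1x01.mkv"
))
```

In this example the season can only be recovered from the parent folder, which is the situation the first bullet describes.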

I've been trying multiple filename parsers and overriding them with custom regexes, but none gives 100%. Even if we can reach above 95%, it's a major win.

Another possibility I've tried is to leverage generative AI prompts. This is the most promising so far, but it comes with the following limitations.

  • Cost and speed are the primary factors.
  • Models are also prone to hallucinations: after a couple of executions they start to deviate from correct parsing, so they require either some sort of guardrail or constant monitoring.

I've also tried to build one from scratch, taking a pre-trained model and fine-tuning the last couple of layers, but it requires a lot of data, and clean data is very hard to get.

If possible, we could probably do the following:

  • Split resources into two categories: those correctly associated with a metadata service ID, and those that are not.
  • For resources that are not correctly associated, expose a handler or other mechanism for supplying additional regexes that extract the required information and help associate them with the metadata service.
  • In the worst case, allow the user to provide the ID.

Doing this iteratively builds a large, tested regex repo for our use case. The only thing to be cautious of is organising these regexes so they can be debugged easily, probably by enforcing named capture groups in each regex.
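
A minimal sketch of such a regex repo, assuming patterns are tried in registration order and the first match wins. `PatternRegistry` is a hypothetical class, not part of the project. It rejects any pattern that has an unnamed capturing group.

```python
import re


class PatternRegistry:
    """Ordered collection of labelled regexes that only allows named groups."""

    def __init__(self):
        self._patterns = []

    def register(self, label: str, regex: str) -> None:
        compiled = re.compile(regex, re.IGNORECASE)
        # .groups counts all capturing groups; .groupindex holds only named ones.
        if compiled.groups == 0 or compiled.groups != len(compiled.groupindex):
            raise ValueError(f"{label}: every capturing group must be named")
        self._patterns.append((label, compiled))

    def match(self, text: str):
        """Return (label, fields) for the first matching pattern, else None."""
        for label, compiled in self._patterns:
            found = compiled.search(text)
            if found:
                return label, found.groupdict()
        return None


registry = PatternRegistry()
registry.register("season_episode", r"S(?P<season>\d{1,2})E(?P<episode>\d{1,3})")
print(registry.match("Dr Stone - S01E01"))
```

Returning the label together with the fields makes it easy to see which regex produced a given result when debugging a bad match.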

Let me know if you need assistance planning the execution and splitting the work; I can help on weekends.

I'm also sharing a dump of file paths gathered from my repo and from a couple of friends. It is very diverse and should help improve regex accuracy and stability.
directorylist.txt

@sureshfizzy
Owner

Thanks! I'll check them out and we can discuss how to proceed with this further. If you have any additional suggestions or updates, reach out on Discord as well.

sureshfizzy added a commit that referenced this issue Dec 15, 2024
- Added logic to handle cases where the result is a tuple.
- fixes imdb/tmdb formatting issue by directly appending ID's to movie name

Addresses #36
sureshfizzy added a commit that referenced this issue Dec 21, 2024
- Add advanced query cleaning function with:
  * Configurable max word limit
  * Better handling of TV/movie title variations
- Expand show episode pattern matching to support "series.X.YofZ" format
- Enhance movie title cleaning with better technical term filtering
- Fix proper name propagation in movie processing results

Addresses #36
@sureshfizzy
Owner

@arao, I've updated some of the logic to improve accuracy. Please give these changes a try and let me know if they make a difference. The code is available in the anime-fix branch. If you need a Docker image, you can use sureshfizzy/cinesync:anime-fix.
