Bug: Couple of bugs notices with the usage, listing them down here. #36

arao · 2024-12-14T10:07:50Z

** The parsed name is currently used as-is for the directory structure. Torrent names being what they are, a simple difference in letter case creates multiple folders for the same content, for example:

'In the Heights (2021)/In the Heights (2021) - [H265][HDR].mkv' -> '/mnt/sv1/volumes/library/zurg/__all__/In the Heights (2021) UHD BDRemux 4K 2160p H.265 HEVC HDR 10+ Dolby Vision Ukr Eng Sub Eng Multilang [Hurtom]/In the Heights (2021) UHD BDRemux 4K 2160p H.265 HEVC HDR 10+ Dolby Vision Ukr Eng Sub Eng Multilang [Hurtom].mkv'

In The Heights (2021)/'In The Heights (2021) - [HEVC][DV][7.1].mkv' -> /mnt/sv1/volumes/library/zurg/__all__/In.The.Heights.2021.UHD.BluRay.2160p.TrueHD.Atmos.7.1.DV.HEVC.REMUX-FraMeSToR/In.The.Heights.2021.UHD.BluRay.2160p.TrueHD.Atmos.7.1.DV.HEVC.REMUX-FraMeSToR.mkv

'500 Days of Summer (2009)'
'500 Days Of Summer (2009)'
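
A minimal sketch of one way to avoid the duplicate folders above, assuming a case-insensitive lookup key is acceptable. The helper names `folder_key` and `canonical_folder` are hypothetical and not part of the project:

```python
from typing import Dict


def folder_key(name: str) -> str:
    """Collapse whitespace and letter case so near-identical names compare equal."""
    return " ".join(name.split()).casefold()


def canonical_folder(name: str, seen: Dict[str, str]) -> str:
    """Return the first-seen spelling for any name that shares the same key."""
    return seen.setdefault(folder_key(name), name)


seen: Dict[str, str] = {}
first = canonical_folder("500 Days of Summer (2009)", seen)
second = canonical_folder("500 Days Of Summer (2009)", seen)
print(first == second)  # both resolve to the first spelling
```

The first spelling encountered wins here; picking the spelling returned by the metadata service would be an equally reasonable choice.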

** Invalid keyword picked from directory, instead of file name

'sample Mission Impossible  Multi UHD 10bit TrueHD   DTOne (1996)/sample Mission Impossible  Multi UHD 10bit TrueHD   DTOne (1996) - [x265][HDR][5.1].mkv' -> /mnt/sv1/volumes/library/zurg/__all__/Mission.Impossible.Pentalogy1996-2015.Multi.2160p.UHD.BluRay.x265.HDR-DTOne/sample-Mission.Impossible.1996.Multi.2160p.UHD.BluRay.x265.10bit.HDR.TrueHD.5.1-DTOne.mkv

A possible solution could be to skip unnecessary keywords such as "sample", but this requires further investigation.
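
As a rough illustration of the keyword-skipping idea, the sketch below strips a junk word only when it leads the name. The `JUNK_WORDS` list and the function name are assumptions made for this example, not existing project code:

```python
import re

# Hypothetical junk-word list; extend it as new cases turn up.
JUNK_WORDS = ("sample",)

# Match one or more junk words at the very start, each followed by a separator.
_JUNK_RE = re.compile(
    r"^(?:(?:%s)[\s._-]+)+" % "|".join(map(re.escape, JUNK_WORDS)),
    re.IGNORECASE,
)


def strip_junk_prefix(name: str) -> str:
    """Remove leading junk words; names that merely contain the word are untouched."""
    return _JUNK_RE.sub("", name)


print(strip_junk_prefix("sample-Mission.Impossible.1996.Multi.2160p.UHD.BluRay"))
```

Anchoring at the start keeps a title such as "Free Sample" intact, at the cost of missing junk words that appear later in the name.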

** Wrong parsing

'xXx  Multi TF HDL (2002)/xXx  Multi TF HDL (2002) [Resolution].mkv' -> '/mnt/sv1/volumes/library/zurg/__all__/[ OxTorrent.com ] xXx.2002.Multi TF 1080p HDL/[ OxTorrent.com ] xXx.2002.Multi TF 1080p HDL.mkv'

Non-ASCII characters (probably Mandarin here) could be removed from the file name.

'失踪宝贝 Gone Baby Gone  DDP5  2Audio 4K世界 (2007) - [DDP5.1].mkv' -> '/mnt/sv1/volumes/library/zurg/__all__/[4ksj.com]失踪宝贝.Gone.Baby.Gone.2007.2160p.WEB-DL.H265.DDP5.1.2Audio[国英双音轨_中英双语字幕]-4K世界/[4ksj.com]失踪宝贝.Gone.Baby.Gone.2007.2160p.WEB-DL.H265.DDP5.1.2Audio[国英双音轨_中英双语字幕]-4K世界.mkv'
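
A hedged sketch of that clean-up, assuming only CJK characters should go so that accented Latin titles survive. The chosen Unicode ranges and the name `strip_cjk` are illustrative assumptions:

```python
import re

# CJK punctuation, Extension A, Unified Ideographs, and full-width forms.
_CJK_RE = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]+")


def strip_cjk(name: str) -> str:
    """Drop runs of CJK characters, then tidy the separators left behind."""
    cleaned = _CJK_RE.sub(" ", name)
    return re.sub(r"\s{2,}", " ", cleaned).strip(" .-_")


print(strip_cjk("失踪宝贝.Gone.Baby.Gone.2007"))
```

Stripping every non-ASCII character would be simpler, but it would also damage titles such as "Amélie", which is why this sketch targets CJK ranges only.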

'[ToishY] Koyomimonogatari - NCED [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCED1a [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCED1b [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCED1c [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCOP1 [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCOP2 [BDRip'
'[ToishY] Owarimonogatari 2nd Season - NCOP3 [BDRip'

Possible solution: anything in "[]" at the start of the title is probably not worth picking.
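
A minimal sketch of that rule, assuming leading bracketed blocks are release-group or site tags. The function name is hypothetical:

```python
import re

# One or more "[...]" blocks at the very start, with optional surrounding spaces.
_LEADING_TAG_RE = re.compile(r"^(?:\s*\[[^\]]*\]\s*)+")


def strip_leading_tags(name: str) -> str:
    """Drop leading bracketed tags, but keep the name if nothing else remains."""
    stripped = _LEADING_TAG_RE.sub("", name)
    return stripped or name


print(strip_leading_tags("[ToishY] Owarimonogatari 2nd Season - NCED1a [BDRip"))
print(strip_leading_tags("[ OxTorrent.com ] xXx.2002.Multi TF 1080p HDL"))
```

The fallback to the original name covers titles that consist solely of a bracketed word.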

Shows/1x01/Season 1/1x01 - S1E01 .mkv -> '/mnt/sv1/volumes/library/zurg/__all__/PSYCHO-PASS S01 [BDrip 1080 DTS Multi][vostfr-french-breton]/1x01.mkv'

Shows/[A&C] Dr  Stone S1 [01] [BDRip 1080p] [/Season 1/[A&C] Dr  Stone S1 [01] [BDRip 1080p] [ - S01E01 .mkv/-> '/mnt/sv1/volumes/library/zurg/__all__/[A&C] Dr Stone S01 & S02 [BDRip 1080p]/[A&C] Dr. Stone S1 [01] [BDRip 1080p] [415E7369].mkv'

Let me know if the directory map is required and I can share it. BTW, great initiative to map the media directory; I've been looking for something similar for some time.

sureshfizzy · 2024-12-14T10:22:51Z

This seems interesting, I've never encountered these patterns before. I'll work on enhancing the parser.

edit: Was 'sample Mission Impossible Multi UHD 10bit TrueHD DTOne (1996)/sample Mission Impossible Multi UHD 10bit TrueHD DTOne (1996) - [x265][HDR][5.1].mkv' an actual sample file with a small size, or is it just a naming convention?

arao · 2024-12-14T10:33:26Z

The file is a sample, but the title should not be affected.

There are a couple of libraries as well; you can check their implementations for ideas. Listing them here.

They are in JS, but can easily be ported to Python.

sureshfizzy · 2024-12-14T10:53:56Z

[ToishY] Owarimonogatari 2nd Season - NCED1a [BDRip

This seems good, but even with this parser we still need to double-filter it:

'A&C] Dr. Stone S1 [01] [BDRip 1080p] [415E7369].mkv'

output:

{
season: 1,
resolution: '1080p',
quality: 'BDRip',
container: 'mkv',
title: 'A&C] Dr. Stone',
excess: [ '[01]', '[', ']', '[415E7369]' ]
}

My current approach is to refine our existing parser and explore the possibility of integrating the JavaScript parser for fallback options. Do you have any alternative suggestions?

arao · 2024-12-14T11:26:18Z

I was trying the following approach.

Keep a centralised place for extracting all the metadata and the title for a given path. Path here refers to the directories down to the actual file. I've seen multiple occurrences where most of the information resides in the parent folder and the actual file name has very little.
Once information such as title, year, resolution, season, episode, file type, and tags is extracted, use it to find the associated IMDb/TMDb/TVDB data. This is crucial: once we can associate a file with a metadata service, we no longer have to rely on the limited information present in the file name.
Once the ID is resolved, we are pretty much through with organising.
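
The first step could be sketched roughly as below: fields are read from the file name first, and parent folders only fill the gaps. The patterns are deliberately tiny and every name here (`FIELD_PATTERNS`, `extract_fields`, `extract_from_path`) is a hypothetical stand-in, not the project's parser:

```python
import re
from pathlib import PurePosixPath
from typing import Dict

# Illustrative patterns only; a real parser needs many more cases.
FIELD_PATTERNS = {
    "year": re.compile(r"(?<!\d)(19\d{2}|20\d{2})(?!\d)"),
    "resolution": re.compile(r"(?<!\d)(\d{3,4}p)\b", re.IGNORECASE),
    "season": re.compile(r"\bS(\d{1,2})(?=E\d|\b)", re.IGNORECASE),
    "episode": re.compile(r"\bS\d{1,2}E(\d{1,3})\b", re.IGNORECASE),
}


def extract_fields(text: str) -> Dict[str, str]:
    """Return every field whose pattern matches somewhere in `text`."""
    found = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


def extract_from_path(path: str, depth: int = 2) -> Dict[str, str]:
    """Merge fields from the file name and up to `depth` parent folders.

    The file name wins; parents only supply fields it is missing.
    """
    parts = PurePosixPath(path).parts[::-1][: depth + 1]
    merged: Dict[str, str] = {}
    for part in parts:
        for field, value in extract_fields(part).items():
            merged.setdefault(field, value)
    return merged


print(extract_from_path(
    "/mnt/sv1/volumes/library/zurg/__all__/"
    "PSYCHO-PASS S01 [BDrip 1080 DTS Multi][vostfr-french-breton]/1x01.mkv"
))
```

In the PSYCHO-PASS path the file name `1x01.mkv` yields nothing under these patterns, so the season comes entirely from the parent folder.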

I've been trying multiple filename parsers, overriding them with custom regexes, but none gives 100%. Even if we can reach above 95%, it's a major win.

Another possibility I've tried is to leverage GenAI prompts. This is the most promising so far, but it comes with the following limitations.

Cost and speed are the primary factors.
Also, models are prone to hallucinations; after a couple of executions they start to deviate from correct parsing. So they require either some sort of guard-rail or constant monitoring.
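
One possible guard-rail, sketched under the assumption that the model returns a flat dict of strings: flag any field whose words do not literally occur in the file name. No model is called here, and `ungrounded_fields` is a hypothetical helper:

```python
import re
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    """Lower-cased alphanumeric tokens; dots, dashes and underscores split them."""
    return re.findall(r"[^\W_]+", text.casefold())


def ungrounded_fields(filename: str, parsed: Dict[str, str]) -> List[str]:
    """Return the fields whose tokens are empty or missing from the file name."""
    available = set(_tokens(filename))
    flagged = []
    for field, value in parsed.items():
        tokens = set(_tokens(str(value)))
        if not tokens or not tokens <= available:
            flagged.append(field)
    return flagged


name = "In.The.Heights.2021.UHD.BluRay.2160p.TrueHD.Atmos.7.1.DV.HEVC.REMUX-FraMeSToR.mkv"
print(ungrounded_fields(name, {"title": "In the Heights", "year": "2021"}))
print(ungrounded_fields(name, {"title": "In the Heights 2", "year": "2020"}))
```

This check suits textual fields such as title and year; normalised fields (for example a season "1" taken from "S01") would need their own comparison.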

I've also tried to build one from scratch, leveraging a pre-trained model and fine-tuning the last couple of layers, but it requires a lot of data, and it's very hard to get clean data.

If possible, we could probably do the following:

Make two categories: resources that are correctly associated with a metadata service ID, and resources that are not.
Then, for resources that are not correctly associated, expose a handler or other mechanism to provide additional regexes that extract the required information and help associate them with the metadata service.
In the worst case, allow the user to provide the ID.

Doing this iteratively builds a large, tested regex repo for our use case. The only thing to be cautious of is organising these regexes so they can be easily debugged, probably by enforcing named-group capturing in the regexes!
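
A small sketch of how such a regex repo might be organised, assuming each rule carries a name and must use named groups so a match can be traced back to the rule that produced it. The two built-in rules and all identifiers are illustrative, not taken from the project:

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    name: str
    pattern: re.Pattern


RULES: List[Rule] = [
    Rule("show_sxxeyy", re.compile(
        r"^(?P<title>.+?)[ ._-]+S(?P<season>\d{1,2})E(?P<episode>\d{1,3})\b",
        re.IGNORECASE)),
    Rule("movie_title_year", re.compile(
        r"^(?P<title>.+?)[ ._(]+(?P<year>19\d{2}|20\d{2})(?!\d)")),
]


def register_rule(name: str, pattern: str) -> None:
    """Add a user-supplied rule ahead of the built-ins; named groups are mandatory."""
    compiled = re.compile(pattern, re.IGNORECASE)
    if not compiled.groupindex:
        raise ValueError(f"rule {name!r} must use named groups")
    RULES.insert(0, Rule(name, compiled))


def match_rules(text: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (rule name, captured fields) for the first rule that matches."""
    for rule in RULES:
        match = rule.pattern.search(text)
        if match:
            return rule.name, match.groupdict()
    return None


print(match_rules("xXx.2002.Multi TF 1080p HDL"))
register_rule("show_nxnn", r"^(?P<season>\d{1,2})x(?P<episode>\d{2,3})\b")
print(match_rules("1x01.mkv"))
```

Returning the rule name alongside the fields makes it straightforward to log which regex handled a file, and anything that matches no rule can fall through to a user-provided ID.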

Let me know if you need help planning the execution and splitting the work; I can help on weekends.

Also sharing a dump of file paths I gathered from my repo and from a couple of my friends. It's very diverse and will help with boosting regex accuracy and stability.
directorylist.txt

sureshfizzy · 2024-12-14T11:45:34Z

Thanks! I'll check them out and we can discuss how to proceed with this further. If you have any additional suggestions or updates, reach out on Discord as well.

- Add advanced query cleaning function with:
  * Configurable max word limit
  * Better handling of TV/movie title variations
- Expand show episode pattern matching to support "series.X.YofZ" format
- Enhance movie title cleaning with better technical term filtering
- Fix proper name propagation in movie processing results

Addresses #36

sureshfizzy · 2024-12-21T12:08:28Z

@arao, I've updated some of the logic to improve accuracy. Please give these changes a try and let me know if they make a difference. The code is available in the anime-fix branch. If you need a Docker image, you can use sureshfizzy/cinesync:anime-fix.

sureshfizzy added a commit that referenced this issue Dec 15, 2024

fix: Ensure TMDb and IMDb IDs are attached when result is a tuple

44b92ec

- Added logic to handle cases where the result is a tuple.
- Fixes IMDb/TMDb formatting issue by directly appending IDs to the movie name.

Addresses #36
