Fetched lyrics from Genius are incomplete #4815

Closed
calm3285 opened this issue Jun 6, 2023 · 9 comments · Fixed by #5352
Labels
bitesize bug bugs that are confirmed and actionable

Comments

@calm3285

calm3285 commented Jun 6, 2023

Problem

This is an example of the fetched lyrics:

[Verse 1: Killer Mike]
Hear what I say, we are the business today
Fuck shit is finished today (What)
RT and J—we the new PB & J
We dropped a classic today (What)
We did a tablet of acid today
Lit joints with the matches and ashes away
SKRRRT! We dash away
Donner and Dixon, the pistol is blasting away

[Verse 2: El-P]
Doctors of death
Curing our patients of breath
We are the pain you can trust
Crooked at work
Cookin' up curses and slurs
Smokin' my brain into mush
I became famous for flamin' you fucks
Maimin' my way through the brush
There is no training or taming of me and my bruh
Look like a man, but I'm animal raw

[Verse 3: Killer Mike]
We are the murderous pair
That went to jail and we murdered the murderers there
Then went to Hell and discovered the devil
Delivered some hurt and despair
Used to have powder to push
Now I smoke pounds of the kush
Holy, I'm burnin' a bush
Now I give a fuck about none of this shit
Jewel runner over and out of this bitch

The full lyrics are at this link:
https://genius.com/Run-the-jewels-legend-has-it-lyrics

Setup

  • OS: arch linux
  • Python version: 3.11.3
  • beets version: 1.6.1
  • Turning off plugins made problem go away (yes/no): no

My configuration (output of beet config) is:

lyrics:
    bing_lang_from: []
    force: yes
    sources: genius
    auto: yes
    bing_client_secret: REDACTED
    bing_lang_to:
    google_API_key: REDACTED
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    local: no
    dist_thresh: 0.1
library: ~/.config/beets/library.db
directory: /data/media/audio

plugins: zero lyrics

import:
    copy: no
    from_scratch: yes
    incremental: yes
    log: /data/media/audio/beetlogs.txt
    move: no
    quiet: no
    quiet_fallback: skip
    resume: ask
    timid: no
    write: yes
zero:
    auto: yes
    update_database: yes
    fields: images
    keep_fields: []

Also, there is nothing in the documentation on how to configure the genius_api_key parameter.

@wisp3rwind
Member

Looks like it's only picking up the lyrics from the first div. Presumably, something about the site's structure changed, leading to this problem?

Fixing this should be possible by adapting the scraper at

def _scrape_lyrics_from_html(self, html):

@wisp3rwind wisp3rwind added bug bugs that are confirmed and actionable bitesize labels Jun 10, 2023
@mojolo

mojolo commented Jun 18, 2023

Yeah, same problem here: https://genius.com/Gnarls-barkley-go-go-gadget-gospel-lyrics

Lyrics plugin is only picking up the first occurrence of the div:

<div data-lyrics-container="true" class="Lyrics__Container-sc-1ynbvzw-5 Dzxov">

[Intro]
Pump up the peculiar
While I yell unique
F your wondering what you look like, look at me
Ah, let me show you right here
Hey, Ahaha
Ooooh, yeah, yeah, yeah

[Verse 1]
I'm well on my way
I'm almost everything
And this is my day
You make me want to say

[Chorus]
I'm free! Look at me!
Behold everything I'm allowed to see
Free! Come and see
Na, na, na, na, na na na

[Verse 2]
The shapeless, formless, heart is enormous
Bore this, I've worn this, no never what the norm is
Come hear this, it's fearless
Contrast, colour, prisms, so warmin'
Listen and love it

The second occurrence, with the same div class/name, is ignored:

<div data-lyrics-container="true" class="Lyrics__Container-sc-1ynbvzw-5 Dzxov">

[Chorus]
I'm freeee! Look at me!
Freedom in hi-fidelity
Free! come and see
Na, na, na, na, na na na

[Verse 3]
What you waitin' on?
I won't ask your, passion, smilin', laughin'
Yieldin', feelin', helpin', healin'
Introduce your neighbour to your saviour

[Chorus]
I'm free! Look at me!
Freedom in hi-fidelity
Free!
Na, na, na, na, na na na
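
The behaviour described above can be reproduced outside beets. What follows is only a minimal sketch, assuming the page splits the lyrics across several div elements carrying the data-lyrics-container attribute. It uses the standard library's html.parser rather than the plugin's own scraper, and the class name LyricsContainerParser and the SAMPLE_HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class LyricsContainerParser(HTMLParser):
    """Collect the text of every <div data-lyrics-container="true">.

    Hypothetical helper for illustration; this is not beets code.
    """

    def __init__(self):
        super().__init__()
        self.containers = []  # one text string per container div
        self._depth = 0       # div nesting depth inside the current container
        self._chunks = []     # text pieces of the container being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "div":
                self._depth += 1
            elif tag == "br":
                # Line breaks inside a container become newlines.
                self._chunks.append("\n")
        elif tag == "div" and "data-lyrics-container" in dict(attrs):
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.containers.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Made-up page fragment: two lyrics containers separated by an unrelated div.
SAMPLE_HTML = (
    '<div data-lyrics-container="true">[Verse 1]<br/>first part</div>'
    '<div class="ad">advert</div>'
    '<div data-lyrics-container="true">[Verse 2]<br/>second part</div>'
)

parser = LyricsContainerParser()
parser.feed(SAMPLE_HTML)

first_only = parser.containers[0]            # what a "first div only" scraper sees
all_parts = "\n\n".join(parser.containers)   # what the full lyrics should be
```

Here first_only stops after the first verse, which matches the truncated output reported in this thread, while all_parts contains both containers.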

@Daredevil09m

Daredevil09m commented Aug 19, 2023

Hey guys, did any of you find a solution for this problem? I have been struggling with it for a week already. For example, I want to fetch the lyrics to the song All Eyez On Me by 2Pac:

[Intro: 2Pac]
Big Syke, 'Nook, Paint, Bogart, Big Serge (yeah)
Y'all know how this shit go (you know)
All eyes on me
Motherfuckin' OG
Roll up in the club and shit, is that right?
All eyes on me
All eyes on me
But you know what?

[Verse 1: 2Pac]
I bet you got it twisted, you don't know who to trust
So many player-hatin' niggas tryna sound like us
Say they ready for the funk, but I don't think they knowin'
Straight to the depths of Hell is where those cowards goin'
Well, are you still down? Nigga, holla when you see me
And let these devils be sorry for the day they finally freed me
I got a caravan of niggas every time we ride
Hittin' motherfuckers up when we pass by
Until I die, live the life of a boss player
'Cause even when I'm high, fuck with me and get crossed later
The futures in my eyes, 'cause all I want is cash and thangs
A five-double-0 Benz, flauntin' flashy rings, uhh
Bitches pursue me like a dream
Been known to disappear before your eyes just like a dope fiend
It seems, my main thing was to be major paid
The game sharper than a motherfuckin' razor blade
Say money bring bitches, bitches bring lies
One nigga's gettin' jealous and motherfuckers died
Depend on me like the first and fifteenth
They might hold me for a second, but these punks won't get me
We got foe niggas and low riders in ski masks
Screamin', "Thug Life" every time they pass, all eyes on me

I am missing nearly all the lyrics of the song. I have tried everything already. Does anyone have a solution? PS: I am on Windows and used Python to install beets. I am currently on beets version 1.6.1.

@michaeldiazh

Hi hi hi! Sorry, I'm new to this repo, but I think I can help. It seems like we are only searching for one data-lyrics-container div. But if you run the fetch method on Ice Cube's It Was A Good Day, the page contains 3 data-lyrics-container divs (run this unit test in the test_lyrics.py file):

    def test_fetch_with_real_api(self):
        lyrics = genius.fetch('ice-cube', 'it was a good day')
        print(lyrics)

If you set a breakpoint on _scrape_lyrics_from_html and look at the soup var, you can see that there are 3 data-lyrics-container divs. One way to fix this is to change the line:

lyrics_div = soup.find("div", {"data-lyrics-container": True})

To:

lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})

Once done, iterate through the results and append each div's text to a lyrics var like so:

        lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
        lyrics = ''
        for lyrics_div in lyrics_divs:
            if lyrics_div:
                self.replace_br(lyrics_div)
                lyrics += lyrics_div.get_text()
        .....
        return lyrics

Let me know if I can make this change! It's my first time making changes in an open source project haha

@Daredevil09m

@michaeldiazh it does not work for me. I just copied and pasted it, but it gives me an error. When I modify it, it still doesn't give me the full lyrics.

@michaeldiaz0315

@Daredevil09m Mhhh let me take a look again when I get home (:

@michaeldiazh

@Daredevil09m

So I reran the test and got the lyrics for Run The Jewels' Legend Has It. I refactored a bit of the code, so check it out:

Here is the test (I am just printing out the lyrics):

    def test_fetch_with_real_api(self):
        lyrics = genius.fetch('Run The Jewels', 'Legend Has It')
        print(lyrics)

Here is the refactored code. Try replacing the _scrape_lyrics_from_html method of the Genius class in the lyrics.py module. Also add the helper method _try_extracting_lyrics_from_non_data_lyrics_container and check if that works!

    def _scrape_lyrics_from_html(self, html):
        """Scrape lyrics from a given genius.com html"""

        soup = try_parse_html(html)
        if not soup:
            return

        # Remove script tags that they put in the middle of the lyrics.
        for script in soup('script'):
            script.extract()

        # Most of the time, the page contains a div with class="lyrics" where
        # all of the lyrics can be found already correctly formatted
        # Sometimes, though, it packages the lyrics into separate divs, most
        # likely for easier ad placement

        lyrics_divs = soup.find_all("div", {"data-lyrics-container": True})
        if not lyrics_divs:
            self._log.debug('Received unusual song page html')
            return self._try_extracting_lyrics_from_non_data_lyrics_container(soup)
        lyrics = ''
        for lyrics_div in lyrics_divs:
            self.replace_br(lyrics_div)
            lyrics += lyrics_div.get_text() + '\n\n'
        return lyrics

    def _try_extracting_lyrics_from_non_data_lyrics_container(self, soup):
        """Extract lyrics from a div without attribute data-lyrics-container
        This is the second most common layout on genius.com
        """
        verse_div = soup.find("div", class_=re.compile("Lyrics__Container"))
        if not verse_div:
            if soup.find("div",
                         class_=re.compile("LyricsPlaceholder__Message"),
                         string="This song is an instrumental"):
                self._log.debug('Detected instrumental')
                return "[Instrumental]"
            else:
                self._log.debug("Couldn't scrape page using known layouts")
                return None

        lyrics_div = verse_div.parent
        self.replace_br(lyrics_div)

        ads = lyrics_div.find_all("div",
                                  class_=re.compile("InreadAd__Container"))
        for ad in ads:
            ad.replace_with("\n")

        footers = lyrics_div.find_all("div",
                                      class_=re.compile("Lyrics__Footer"))
        for footer in footers:
            footer.replace_with("")
        return lyrics_div.get_text()

You should get these print statements from the test:
[Screenshots of the printed lyrics: Screen Shot 2023-09-04 at 6 51 19 PM, Screen Shot 2023-09-04 at 6 51 37 PM]
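
A side note on the loop in the refactored method above: appending '\n\n' after every container also leaves two newlines after the last one. A join-based variant avoids that. This is only a sketch of the idea, assuming the text of each container has already been extracted; the function name join_lyrics_parts is hypothetical and not part of beets.

```python
def join_lyrics_parts(parts):
    """Join per-container lyric texts with one blank line between them.

    Hypothetical helper for illustration; empty or whitespace-only parts
    are skipped so they do not produce extra blank lines.
    """
    cleaned = [part.strip() for part in parts if part and part.strip()]
    return "\n\n".join(cleaned)


# Made-up container texts, including an empty one.
parts = ["[Verse 1]\nfirst part\n", "", "[Verse 2]\nsecond part"]
combined = join_lyrics_parts(parts)
```

With this approach the result has exactly one blank line between containers and no trailing whitespace, whatever the number of containers on the page.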

michaeldiazh pushed a commit to michaeldiazh/beets that referenced this issue Sep 5, 2023
… Integration test was also made in this PR
michaeldiazh pushed a commit to michaeldiazh/beets that referenced this issue Feb 19, 2024
… Integration test was also made in this PR
@michaeldiazh

I am recreating this branch. I'll have an MR up soon (:

@HomerHaddock
Contributor

This issue is still present in v2.0.0. Could somebody make a pull request? The one michaeldiazh made is not up to date and is hundreds of commits behind.

HomerHaddock added a commit to HomerHaddock/beets-lyrics-fix that referenced this issue Jul 7, 2024
@HomerHaddock HomerHaddock mentioned this issue Jul 7, 2024
3 tasks
HomerHaddock added a commit to HomerHaddock/beets-lyrics-fix that referenced this issue Jul 8, 2024
Serene-Arc added a commit that referenced this issue Jul 11, 2024
7 participants