[metascraper-twitter] Add specific connector #260
Twitter rewrote their client website using React + Webpack. metascraper reveals that the new Twitter website has very poor HTML markup, which is why it can extract almost nothing useful from it. The solution: we need to add a new specific connector. It will be very similar to metascraper-soundcloud or metascraper-youtube. |
OK, thanks for the feedback. Twitter is a pain to work with, and they now require you to answer an extensive set of questions simply to get developer API access. I am currently using their widgets.js code, which does not integrate well with React, so it's slightly surprising to learn that their new web platform is React-based. Cheers :) |
I hope to write the Twitter connector next week. In the meantime, consider using metascraper-iframe, especially if you are consuming the data on the frontend side; the iframe is the standard way that providers offer to embed their content. If you want a zero-pain solution, consider using Microlink SDK 🙂 |
Looks like there is a way to force get the old Twitter interface. Need to set the following request user-agent:
I confirm it works! https://api.microlink.io/?url=https://twitter.com/realDonaldTrump/status/1222907250383245320 Although I'm still interested in adding a specific Twitter package, this should work in the meantime 🙂 |
Thanks! I have now been approved for the Twitter API, so I think I will use that for now, but your solution looks viable also. I prefer to load the data on the server and store it in MongoDB because that way I can preload the React component and avoid scroll issues due to delayed content rendering, which is my main concern with Twitter's widgets.js client-side solution. |
@Kikobeats The user agent hack doesn't seem to work anymore. It looks like Twitter dropped support for the browser, and none of the data is in the returned HTML. Perhaps we need to use the API like in metascraper-media-provider? |
@JakeCoxon you're right; there's a new workaround to access the old version, described in taspinar/twitterscraper#296 (comment). It isn't very affordable, though. Ideally we want to access the current version with no tradeoff; we need to investigate how to land a better solution there. |
@Kikobeats Thanks for maintaining this helpful library 🙌 Twitter seems to be one of the most important sites to be able to scrape previews from. Would you be open to gauging the community's interest in sponsoring your development of the Twitter connector? Perhaps noting that if the community can sponsor your time and effort for this feature at $n, we take it on? I would be willing to contribute sponsorship towards this effort 👍 |
Hey @watlandc, I'd prefer that someone take the initiative. My time is limited, and I will be happy to add new contributors to the project. Note that the Twitter detection is not as bad as when the original issue was reported, but it definitely could be better: https://api.microlink.io/?url=https://twitter.com/BytesAndHumans/status/1532772903523065858 Are you missing something specific? 🙂 |
@Kikobeats are you strictly using I'm asking this because the |
What's I didn't use any custom header. It looks like you are experiencing issues related to getting the content, which isn't in metascraper's scope. Metascraper just applies rules over the content, so getting the content is a precondition. As long as the content is right, the metascraper output will be accurate. In any case, I recommend you take a look at html-get! |
Woops, definitely meant But yea, using |
As you noted here, we were having issues getting the content from Twitter. We ended up finding a workaround to get it working again.
Simply getting link preview content for an app that helps you stay organized https://sprout.io.
Looks like we're good for now (we're getting the same result as the screenshot above). Although, returning the Thanks! |
I wrote a full example using
const createBrowserless = require('browserless')
const getHTML = require('html-get')

// Spawn Chromium process once
const browserlessFactory = createBrowserless()

// Kill the process when Node.js exits
process.on('exit', () => {
  console.log('closing resources!')
  browserlessFactory.close()
})

const getContent = async url => {
  // create a browser context inside Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
  // close the browser context after it's used;
  // `getBrowserless` ignores its arguments, so destroy the context directly
  await (await browserContext).destroyContext()
  return result
}

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

getContent('https://twitter.com/BytesAndHumans/status/1532772903523065858')
  .then(async ({ html, url }) => {
    const metadata = await metascraper({ html, url })
    console.log(metadata)
    process.exit()
  })
  .catch(error => {
    console.error(error)
    process.exit(1)
})
output:
{
author: null,
date: '2022-06-07T21:42:24.000Z',
description: '“What a week 🐣❤️📈”',
image: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large',
logo: 'https://logo.clearbit.com/twitter.com',
publisher: 'Twitter',
title: 'Elena on Twitter',
url: 'https://twitter.com/BytesAndHumans/status/1532772903523065858'
}
so it is possible. Is the problem still getting the content on your side? |
If you need to detect the author of the tweet, then it is still worth writing a specific Twitter package! |
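As a rough illustration of where such a package could start: the tweet URL itself carries the author's handle, so a rule can read it without depending on the page markup. This is only a sketch under assumptions, not the code of any published package. `authorFromTweetUrl` and `twitterRules` are hypothetical names, and the bundle shape (a function returning an object whose fields hold arrays of rule functions receiving `{ htmlDom, url }`) is assumed to follow the convention used by metascraper rule bundles.

```javascript
// Hypothetical helper: extract the author's handle from a tweet URL.
// Returns null when the input is not a recognizable tweet URL.
const authorFromTweetUrl = input => {
  let parsed
  try {
    parsed = new URL(input)
  } catch (_) {
    return null // not a valid absolute URL
  }
  // Accept twitter.com with an optional www. or mobile. prefix
  const host = parsed.hostname.replace(/^(www|mobile)\./, '')
  if (host !== 'twitter.com') return null
  // Tweet paths look like /<handle>/status/<id>, optionally with a trailing slash
  const match = parsed.pathname.match(/^\/([A-Za-z0-9_]{1,15})\/status\/\d+\/?$/)
  return match ? match[1] : null
}

// Hypothetical rules bundle (assumed shape): one `author` rule that only looks at the URL
const twitterRules = () => ({
  author: [({ url }) => authorFromTweetUrl(url)]
})
```

Keeping the URL parsing in a plain function makes it testable without fetching anything from Twitter; a real package would presumably add rules for the other fields on top.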
Very nice!
We're able to get the content now. |
@Kikobeats I was testing the above solution, and it works for many scenarios. However, I am still having issues getting the image for some URLs such as https://twitter.com/Twitter/status/1483427748500717573 Checking the Microlink API, it also appears to have an issue getting the image from this domain.
Looking at metascraper-image package, I'm confused as to why this line isn't picking up the
|
Has there been any updates on this? |
Hello everyone, the PR is ready: #608. Feedback is appreciated; otherwise it will be merged shortly 🙂 |
Thanks for taking a look at it @Kikobeats. Testing it out, I'm noticing improvements for things like avatars on profiles, but I'm still having issues with many links such as the one above (https://twitter.com/Twitter/status/1483427748500717573/). Can you confirm that this link is parsed correctly for you? It also happens for some images. I saved the output HTML to a file to debug, and it appears that the page fails to load correctly the majority of the time: it's missing at minimum some essential metadata like the og:image tags, though sometimes they do appear. Can you confirm whether you can replicate this? Regardless, I don't believe this issue is directly with Metascraper, but I'm wondering whether you could offer your opinion on whether it could relate to html-get, or perhaps Browserless. |
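For anyone debugging the same symptom, a small check over the saved HTML can show whether the tags are missing from the fetched content (a fetching problem) or present but not picked up (a rule problem). The sketch below is an assumption-laden aid, not part of metascraper or html-get: `collectMeta` and `missingMeta` are hypothetical helpers, and the regex-based parsing is only good enough for a quick look, not a replacement for a real HTML parser.

```javascript
// Hypothetical debugging helper: collect <meta property="..."> / <meta name="..."> values
const collectMeta = html => {
  const found = {}
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const key = /\b(?:property|name)\s*=\s*["']([^"']+)["']/i.exec(tag)
    const value = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag)
    if (key && value) {
      // The content may have been captured by either the double- or single-quote branch
      found[key[1]] = value[1] !== undefined ? value[1] : value[2]
    }
  }
  return found
}

// Return the required tags that are absent (or empty) in the given HTML
const missingMeta = (html, required = ['og:title', 'og:image', 'og:description']) => {
  const meta = collectMeta(html)
  return required.filter(name => !meta[name])
}
```

Running `missingMeta(require('fs').readFileSync('saved.html', 'utf8'))` on a saved page (the file name is just an example) returns the list of absent tags; an empty array suggests the content was fetched correctly.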
@Kikobeats, I have got more links working - including the one above and some others that I was having issues with. I've found that when turning off pre-rendering for Twitter and setting a UserAgent of One thing to note about the |
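A minimal sketch of the setup described in this comment, assuming html-get accepts the `prerender` and `headers` options listed in its README. `buildGetHTMLOptions` is a hypothetical helper, and the user agent value is deliberately left to the caller because the one used here is not shown in the thread.

```javascript
// Hypothetical helper: build html-get options that skip prerendering
// and send a custom user agent with the request.
const buildGetHTMLOptions = ({ getBrowserless, userAgent }) => {
  if (typeof getBrowserless !== 'function') {
    throw new TypeError('getBrowserless must be a function')
  }
  if (typeof userAgent !== 'string' || userAgent.length === 0) {
    throw new TypeError('userAgent must be a non-empty string')
  }
  return {
    getBrowserless,
    prerender: false, // assumed html-get option: fetch the HTML without a headless browser
    headers: { 'user-agent': userAgent } // assumed html-get option: request headers
  }
}

// Usage (requires html-get to be installed; not executed here):
// const getHTML = require('html-get')
// const { html, url } = await getHTML(tweetUrl, buildGetHTMLOptions({ getBrowserless, userAgent }))
```

Building the options in one place makes it easy to apply them only to Twitter URLs while leaving the defaults for every other site.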
"metascraper": "^5.10.6",
"metascraper-author": "^5.10.6",
"metascraper-clearbit": "^5.10.6",
"metascraper-date": "^5.10.6",
"metascraper-description": "^5.10.6",
"metascraper-image": "^5.10.6",
"metascraper-logo": "^5.10.6",
"metascraper-publisher": "^5.10.6",
"metascraper-title": "^5.10.6",
"metascraper-url": "^5.10.6",
"metascraper-video": "^5.10.6",
"metascraper-youtube": "^5.10.6",
Twitter URLs return NULL for all fields except logo
Example URL: https://twitter.com/realDonaldTrump/status/1222907250383245320
Expected behaviour
meta data returned
Actual behaviour
no data returned except for logo URL and publisher