[metascraper-twitter] Add specific connector #260

Closed
Hereward opened this issue Jan 31, 2020 · 22 comments · Fixed by #608

Comments

@Hereward

Hereward commented Jan 31, 2020

"metascraper": "^5.10.6",
"metascraper-author": "^5.10.6",
"metascraper-clearbit": "^5.10.6",
"metascraper-date": "^5.10.6",
"metascraper-description": "^5.10.6",
"metascraper-image": "^5.10.6",
"metascraper-logo": "^5.10.6",
"metascraper-publisher": "^5.10.6",
"metascraper-title": "^5.10.6",
"metascraper-url": "^5.10.6",
"metascraper-video": "^5.10.6",
"metascraper-youtube": "^5.10.6",

Twitter URLs return NULL for all fields except logo

Example URL: https://twitter.com/realDonaldTrump/status/1222907250383245320

Expected behaviour

Metadata is returned

Actual behaviour

no data returned except for logo URL and publisher

[Image: twitter_meta_20100131]

@Kikobeats
Member

Twitter rewrote their client website using React + Webpack.

[Screenshot: 2020-02-01 at 00 03 14]

metascraper is revealing that the new Twitter website has very poor HTML markup, which is why it can get almost nothing useful from it.

The solution: add a new metascraper-twitter package that defines HTML markup rules specific to Twitter for getting the right data.

It will be very similar to metascraper-soundcloud or metascraper-youtube.

@Kikobeats changed the title from "Twitter URLs return NULL for all fields except logo" to "Add a new package for connecting with Twitter" on Jan 31, 2020
@Kikobeats changed the title from "Add a new package for connecting with Twitter" to "[metascraper-twitter] Add specific connector" on Jan 31, 2020
@Hereward
Author

Hereward commented Feb 1, 2020

OK thanks for the feedback, Twitter is a pain to work with and they now require you to answer an extensive set of questions simply to get a developer API. I am currently using their widgets.js code which does not integrate well with React, so it's slightly surprising to learn that their new web platform is react-based. Cheers :)

@Kikobeats
Member

Kikobeats commented Feb 1, 2020

I hope to write the Twitter connector next week.

In the meantime, consider using metascraper-iframe, especially if you are consuming the data on the frontend; an iframe is the standard way providers offer to embed their content.

If you want a zero pain solution, consider using Microlink SDK 🙂

@Kikobeats
Member

Kikobeats commented Feb 2, 2020

Looks like there is a way to force the old Twitter interface.

You need to set the following request user-agent:

Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

I confirm it works!
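
For readers who want to try the workaround, here is a minimal sketch of a request that sends that user-agent. It assumes Node.js 18+ (global `fetch`), and the helper names `legacyRequestOptions` and `getLegacyHtml` are hypothetical. Later comments in this thread report that the trick stopped working, so treat it as an illustration only.

```javascript
// User-agent string quoted in the comment above (Internet Explorer 11).
const LEGACY_USER_AGENT =
  'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko'

// Hypothetical helper: build request options carrying the legacy user-agent,
// keeping any extra headers the caller passes in.
const legacyRequestOptions = (extraHeaders = {}) => ({
  headers: { ...extraHeaders, 'user-agent': LEGACY_USER_AGENT }
})

// Hypothetical helper: fetch the raw HTML with the global fetch (Node.js 18+).
const getLegacyHtml = async url => {
  const response = await fetch(url, legacyRequestOptions())
  return response.text()
}

// Example (performs a network request, so it is left commented out):
// getLegacyHtml('https://twitter.com/realDonaldTrump/status/1222907250383245320')
//   .then(html => console.log(html.length))
```

The resulting HTML can then be passed to metascraper together with the URL.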

https://api.microlink.io/?url=https://twitter.com/realDonaldTrump/status/1222907250383245320

Although I'm still interested in adding a specific Twitter package, this should work in the meantime 🙂

@Hereward
Author

Hereward commented Feb 3, 2020

Thanks! I have now been approved for the Twitter API, so I think I will use that for now, but your solution looks viable also. I prefer to load the data on the server and store it in MongoDB, because that way I can preload the React component and avoid scroll issues due to delayed content rendering, which is my main concern with Twitter's client-side widgets.js solution.

@Kikobeats
Member

@Hereward In that case you may be interested in the Microlink SDK, since you can load external data using setData.

@JakeCoxon

@Kikobeats The user-agent hack doesn't seem to work anymore. It looks like Twitter dropped support for that browser, and none of the data is in the returned HTML. Perhaps we need to use the API, like in metascraper-media-provider?

@Kikobeats
Member

@JakeCoxon you're right; there's a new workaround to access the old version: passing the X-Requested-With header.

taspinar/twitterscraper#296 (comment)

It isn't a great long-term option, though. Ideally we want to access the current version with no tradeoff; we need to investigate how to land a better solution there.

@watlandc

watlandc commented Jun 3, 2022

@Kikobeats Thanks for maintaining this helpful library 🙌

Twitter seems to be one of the most important sites to be able to scrape previews from.

Would you be open to gauging the community's interest in sponsoring your development of the Twitter connector?

Perhaps noting that if the community can sponsor your time and effort for this feature at $n, we take it on?

I would be willing to contribute sponsorship towards this effort 👍

@Kikobeats
Member

Kikobeats commented Jun 3, 2022

Hey, @watlandc

I'd prefer someone to take the initiative. My time is limited, and I will be happy to add new contributors to the project.

Note that Twitter detection is not as bad as when the issue was originally reported, but it could definitely be better:

[Screenshot: CleanShot 2022-06-03 at 20 26 43@2x]

https://api.microlink.io/?url=https://twitter.com/BytesAndHumans/status/1532772903523065858

Do you miss something specific?
How are you using metascraper?
What do you need?

🙂

@cusxio

cusxio commented Jun 4, 2022

@Kikobeats are you strictly using metascraper with the headers User-Agent and X-Requested-With to get the result shown in the image above?

I'm asking this because the X-Requested-With workaround doesn't seem to work.

@Kikobeats
Member

What's microscraper? This project is metascraper, and the hosted version is microlink.io 😛

I didn't use any custom header. It looks like you are experiencing issues related to getting the content, which is outside metascraper's scope. Metascraper just applies rules over the content; getting the content is a precondition. As long as the content is right, the metascraper output will be accurate.

In any case, I recommend you take a look at html-get!

@cusxio

cusxio commented Jun 4, 2022

Whoops, definitely meant metascraper; I must have mixed up microlinkhq and metascraper. My bad.

But yeah, using html-get definitely gets the right HTML! I will need to dig into why that's the case, i.e. what html-get is doing differently from a vanilla fetch.

@watlandc

watlandc commented Jun 7, 2022

> Do you miss something specific?

As you noted here, we were having issues getting the content from Twitter. We ended up finding a workaround to get it working again.

> How are you using metascraper?

Simply getting link preview content for an app that helps you stay organized https://sprout.io.

> What do you need?

Looks like we're good for now (we're getting the same result as the screenshot above).

Although, returning the media inside the tweet would be a huge bonus.

Thanks!

@Kikobeats
Member

Kikobeats commented Jun 7, 2022

@watlandc

I wrote a full example using metascraper + html-get, targeting a tweet with media:

const createBrowserless = require('browserless')
const getHTML = require('html-get')

// Spawn Chromium process once
const browserlessFactory = createBrowserless()

// Release the browser resources when Node.js exits
process.on('exit', () => {
  console.log('closing resources!')
  browserlessFactory.close()
})

const getContent = async url => {
  // create a browser context inside the Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
  // close the browser context after it's used; getBrowserless() ignores
  // its arguments, so resolve the context first and destroy it explicitly
  const browser = await browserContext
  await browser.destroyContext()
  return result
}

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

getContent('https://twitter.com/BytesAndHumans/status/1532772903523065858')
  .then(async ({ html, url }) => {
    const metadata = await metascraper({ html, url })
    console.log(metadata)
    process.exit()
  })
  .catch(error => {
    console.error(error)
    process.exit(1)
  })

output:

{
  author: null,
  date: '2022-06-07T21:42:24.000Z',
  description: '“What a week 🐣❤️📈”',
  image: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large',
  logo: 'https://logo.clearbit.com/twitter.com',
  publisher: 'Twitter',
  title: 'Elena on Twitter',
  url: 'https://twitter.com/BytesAndHumans/status/1532772903523065858'
}

So it is possible. Is the problem still getting the content on your side?

@Kikobeats
Member

If you need to detect the author, the tweet URL pattern https://twitter.com/:username/status/:id can be used to get the username as the author.

Still worth writing a specific Twitter package!
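
A minimal sketch of that idea, using only the built-in `URL` class. The helper name `authorFromTweetUrl` is hypothetical, and the accepted hostnames (`twitter.com`, `www.twitter.com`, `mobile.twitter.com`) are an assumption; it is not part of any metascraper package.

```javascript
// Hypothetical helper: derive the author from a tweet URL of the form
// https://twitter.com/:username/status/:id
const authorFromTweetUrl = input => {
  let parsed
  try {
    parsed = new URL(input)
  } catch (_) {
    // Not a valid absolute URL.
    return null
  }
  // Assumption: only these Twitter hostnames are accepted.
  if (!/^(www\.|mobile\.)?twitter\.com$/.test(parsed.hostname)) return null
  // Capture the username segment that precedes /status/:id
  const match = parsed.pathname.match(/^\/([^/]+)\/status\/(\d+)\/?$/)
  return match ? match[1] : null
}

console.log(
  authorFromTweetUrl('https://twitter.com/BytesAndHumans/status/1532772903523065858')
) // BytesAndHumans
```

The value could be used as a fallback whenever the `author` field comes back as `null`, as it does in the output above.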

@watlandc

> I wrote a full example using metascraper + html-get, targeting a tweet with media:

Very nice!

> So it is possible. Is the problem still getting the content on your side?

We're able to get the content now.

@bhayward93

@Kikobeats I was testing the above solution, and it works for many scenarios; however, I am still having issues getting the image for some URLs such as https://twitter.com/Twitter/status/1483427748500717573

Checking the Microlink API, it appears to also have an issue getting the image from this domain.

$ curl -sL 'https://api.microlink.io?url=https://twitter.com/Twitter/status/1483427748500717573' | jq

{
  "status": "success",
  "data": {
    "title": "Tweet / Twitter",
    "description": "Don’t miss what’s happening",
    "lang": "en",
    "author": null,
    "publisher": "Twitter",
    "image": null,
    "date": "2022-06-30T13:00:05.000Z",
    "url": "https://twitter.com/twitter/status/1483427748500717573",
    "logo": {
      "url": "https://abs.twimg.com/responsive-web/client-web/icon-ios.b1fc7278.png",
      "type": "png",
      "size": 8582,
      "height": 1024,
      "width": 1024,
      "size_pretty": "8.58 kB"
    }
  },
  "statusCode": 200,
  "headers": {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "content-encoding": "gzip",
    "content-security-policy": "connect-src 'self' blob: https://*.pscp.tv https://*.video.pscp.tv https://*.twimg.com https://api.twitter.com https://api-stream.twitter.com https://ads-api.twitter.com https://aa.twitter.com https://caps.twitter.com https://pay.twitter.com https://sentry.io https://ton.twitter.com https://twitter.com https://upload.twitter.com https://www.google-analytics.com https://accounts.google.com/gsi/status https://accounts.google.com/gsi/log https://app.link https://api2.branch.io https://bnc.lt wss://*.pscp.tv https://vmap.snappytv.com https://vmapstage.snappytv.com https://vmaprel.snappytv.com https://vmap.grabyo.com https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net ; default-src 'self'; form-action 'self' https://twitter.com https://*.twitter.com; font-src 'self' https://*.twimg.com; frame-src 'self' https://twitter.com https://mobile.twitter.com https://pay.twitter.com https://cards-frame.twitter.com https://accounts.google.com/ https://client-api.arkoselabs.com/ https://iframe.arkoselabs.com/  https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/; img-src 'self' blob: data: https://*.cdn.twitter.com https://ton.twitter.com https://*.twimg.com https://analytics.twitter.com https://cm.g.doubleclick.net https://www.google-analytics.com https://www.periscope.tv https://www.pscp.tv https://media.riffsy.com https://*.giphy.com https://media.tenor.com https://c.tenor.com https://*.pscp.tv https://*.periscope.tv https://prod-periscope-profile.s3-us-west-2.amazonaws.com https://platform-lookaside.fbsbx.com https://scontent.xx.fbcdn.net https://scontent-sea1-1.xx.fbcdn.net 
https://*.googleusercontent.com https://imgix.revue.co; manifest-src 'self'; media-src 'self' blob: https://twitter.com https://*.twimg.com https://*.vine.co https://*.pscp.tv https://*.video.pscp.tv https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net; object-src 'none'; script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://client-api.arkoselabs.com/ https://www.google-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js  'nonce-NDhlMTBkYWYtY2JjYy00OTlmLWFiZDUtOTgxYmMyODZiMWZi'; style-src 'self' 'unsafe-inline' https://accounts.google.com/gsi/style https://*.twimg.com; worker-src 'self' blob:; report-uri https://twitter.com/i/csp_report?a=O5RXE%3D%3D%3D&ro=false",
    "content-type": "text/html; charset=utf-8",
    "cross-origin-embedder-policy": "unsafe-none",
    "cross-origin-opener-policy": "same-origin-allow-popups",
    "date": "Thu, 30 Jun 2022 13:00:05 GMT",
    "expiry": "Tue, 31 Mar 1981 05:00:00 GMT",
    "last-modified": "Thu, 30 Jun 2022 13:00:05 GMT",
    "pragma": "no-cache",
    "server": "tsa_b",
    "set-cookie": "guest_id_marketing=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id_ads=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\npersonalization_id=\"v1_8EK5HpiZcmdp/iJUPUtSKA==\"; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None",
    "strict-transport-security": "max-age=631138519",
    "x-connection-hash": "f744deb62fd346d8e4ea08b01e182bbb049039ceaffeca5a3f8472612fa1bdc1",
    "x-content-type-options": "nosniff",
    "x-frame-options": "DENY",
    "x-powered-by": "Express",
    "x-response-time": "35",
    "x-xss-protection": "0"
  }
}

Looking at the metascraper-image package, I'm confused as to why this line isn't picking up the og:image tag.

toImage(($: any) => $('meta[property="og:image"]').attr('content')),


@bhayward93

Have there been any updates on this?

Kikobeats added a commit that referenced this issue Dec 31, 2022
Kikobeats added a commit that referenced this issue Dec 31, 2022
@Kikobeats
Member

Kikobeats commented Dec 31, 2022

Hello everyone,

The PR is ready: #608

Feedback is appreciated; otherwise it will be merged shortly 🙂

Kikobeats added a commit that referenced this issue Jan 1, 2023
Kikobeats added a commit that referenced this issue Jan 1, 2023
@bhayward93

> Hello everyone,
>
> The PR is ready: #608
>
> Feedback is appreciated; otherwise it will be merged shortly 🙂

Thanks for taking a look at it @Kikobeats - testing it out I'm noticing improvements for things like avatars on profiles, but am still having issues with many links such as the one above (https://twitter.com/Twitter/status/1483427748500717573/). Can you confirm that for you, this link is parsed correctly?

This also happens for some images. I saved the outputted HTML to a file to debug, and it appears that the page is not loading correctly the majority of the time: it is missing at minimum some essential metadata such as the og:image tags, though sometimes they do appear. Can you confirm that this is not an issue you can replicate? Regardless, I don't believe this issue is directly with Metascraper, but I am wondering whether, in your opinion, it could relate to html-get or perhaps Browserless.
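
A quick way to run this kind of check on saved HTML is sketched below. The helper name `findOgImages` and both sample strings are hypothetical (the image URL is a placeholder), and the regular expression assumes the `property` attribute is written before `content`; a real HTML parser would be more robust.

```javascript
// Hypothetical helper: return the og:image URLs found in an HTML string.
// A regular expression is enough for a quick debugging check; it assumes
// `property` appears before `content` inside the <meta> tag.
const findOgImages = html => {
  const pattern =
    /<meta[^>]+property=["']og:image["'][^>]*content=["']([^"']+)["']/gi
  return Array.from(html.matchAll(pattern), match => match[1])
}

// Placeholder documents standing in for a complete and an incomplete response.
const complete =
  '<head><meta property="og:image" content="https://pbs.twimg.com/media/example.jpg"></head>'
const incomplete = '<head><title>Tweet / Twitter</title></head>'

console.log(findOgImages(complete)) // [ 'https://pbs.twimg.com/media/example.jpg' ]
console.log(findOgImages(incomplete)) // []
```

An empty result for a saved page suggests the problem lies in how the content was fetched rather than in the metascraper rules.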

@bhayward93

@Kikobeats, I have got more links working - including the one above and some others that I was having issues with.

I've found that when turning off prerendering for Twitter and setting a Googlebot user-agent, it now works. I'd be happy to put forward a PR to add Twitter to the auto-domains file in your html-get package if it's something you would want; for now, I'm just going to fork it. Note, though, that it does not work without setting the user-agent, which is something I don't think you'd want to set at a package level.

One thing to note about the metascraper-twitter package is that I had to place it below metascraper-image; otherwise, for some tweets where metascraper-image WAS correctly grabbing the image, it would grab a generic Twitter bird.
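
A minimal sketch of that configuration, assuming html-get's `prerender` and `headers` options. The exact user-agent value is not given in this comment, so the Googlebot string below is an assumption, and `twitterHtmlOptions` is a hypothetical helper name.

```javascript
// Assumption: a commonly published Googlebot user-agent string; the comment
// does not say which exact value was used.
const GOOGLEBOT_USER_AGENT =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

// Hypothetical helper: options for html-get that skip prerendering and
// send the Googlebot user-agent with the request.
const twitterHtmlOptions = getBrowserless => ({
  getBrowserless,
  prerender: false,
  headers: { 'user-agent': GOOGLEBOT_USER_AGENT }
})

// Usage with html-get (needs the package and a browserless context):
// const getHTML = require('html-get')
// const { html, url } = await getHTML(tweetUrl, twitterHtmlOptions(getBrowserless))
```

Keeping the user-agent in the caller's options, rather than in the package, matches the concern about not setting it at a package level.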
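
The ordering matters because metascraper applies rule bundles in the order they are loaded, and the first rule that resolves a value wins. The sketch below is a simplified, dependency-free model of that behaviour, not metascraper's actual code; `resolveImage`, `genericImage`, `twitterImage`, and the file names are hypothetical stand-ins.

```javascript
// Simplified model of rule resolution: bundles are tried in order and the
// first one that returns a value wins. This is not metascraper's code.
const resolveImage = (bundles, page) => {
  for (const bundle of bundles) {
    const value = bundle.image(page)
    if (value) return value
  }
  return null
}

// Hypothetical stand-ins for metascraper-image and metascraper-twitter.
const genericImage = { image: page => page.ogImage || null }
const twitterImage = { image: page => page.avatar || 'twitter-bird.png' }

// A tweet whose og:image is present but which has no avatar to pick up.
const tweet = { ogImage: 'tweet-photo.jpg', avatar: null }

console.log(resolveImage([twitterImage, genericImage], tweet)) // twitter-bird.png
console.log(resolveImage([genericImage, twitterImage], tweet)) // tweet-photo.jpg
```

With the real packages, the equivalent is to list `require('metascraper-image')()` before `require('metascraper-twitter')()` in the array passed to metascraper.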


6 participants