[metascraper-twitter] Add specific connector #260

Hereward · 2020-01-31T13:40:55Z

"metascraper": "^5.10.6",
"metascraper-author": "^5.10.6",
"metascraper-clearbit": "^5.10.6",
"metascraper-date": "^5.10.6",
"metascraper-description": "^5.10.6",
"metascraper-image": "^5.10.6",
"metascraper-logo": "^5.10.6",
"metascraper-publisher": "^5.10.6",
"metascraper-title": "^5.10.6",
"metascraper-url": "^5.10.6",
"metascraper-video": "^5.10.6",
"metascraper-youtube": "^5.10.6",

Twitter URLs return NULL for all fields except logo

Example URL: https://twitter.com/realDonaldTrump/status/1222907250383245320

Expected behaviour

meta data returned

Actual behaviour

no data returned except for logo URL and publisher

Kikobeats · 2020-01-31T23:08:46Z

Twitter rewrote their client website using React + Webpack.

metascraper is revealing that the new Twitter website has very poor HTML markup, which is why it can get almost nothing useful there.

The solution: add a new metascraper-twitter package with specific HTML markup rules for getting the right data.

It will be very similar to metascraper-soundcloud or metascraper-youtube

Hereward · 2020-02-01T06:20:27Z

OK thanks for the feedback, Twitter is a pain to work with and they now require you to answer an extensive set of questions simply to get a developer API. I am currently using their widgets.js code which does not integrate well with React, so it's slightly surprising to learn that their new web platform is react-based. Cheers :)

Kikobeats · 2020-02-01T09:18:23Z

I hope to write the Twitter connector next week.

In the meantime, consider using metascraper-iframe, especially if you are consuming the data on the frontend; the iframe is the standard way that providers offer to embed their content.

If you want a zero pain solution, consider using Microlink SDK 🙂

Kikobeats · 2020-02-02T14:33:03Z

Looks like there is a way to force Twitter to serve the old interface.

You need to set the following request user-agent:

Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

I confirm it works!

https://api.microlink.io/?url=https://twitter.com/realDonaldTrump/status/1222907250383245320

Although I'm still interested in adding a specific Twitter package, this should work in the meantime 🙂
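As a rough sketch of the workaround above, the request below sends the quoted user agent with a plain `fetch`. The helper names `legacyRequestOptions` and `fetchLegacyHtml` are made up for illustration, and a Node.js version with a global `fetch` (18 or later) is assumed. Later comments in this thread report that Twitter stopped serving the old interface to this user agent, so treat it as historical.

```javascript
// User-agent string quoted in the comment above (Internet Explorer 11).
const LEGACY_USER_AGENT =
  'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko'

// Build fetch options that send the legacy user agent.
const legacyRequestOptions = () => ({
  headers: { 'user-agent': LEGACY_USER_AGENT },
  redirect: 'follow'
})

// Hypothetical helper: download the HTML of a tweet URL.
// Assumes a global `fetch` (Node.js 18+).
const fetchLegacyHtml = async url => {
  const response = await fetch(url, legacyRequestOptions())
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`)
  }
  return response.text()
}
```

The returned HTML string can then be passed to metascraper together with the URL, as in `metascraper({ html, url })`.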

Hereward · 2020-02-03T04:50:38Z

Thanks! I have now been approved for the Twitter API, so I think I will use that for now, but your solution looks viable also. I prefer to load the data on the server and store it in MongoDB because that way I can preload the React component and avoid scroll issues due to delayed content rendering, which is my main concern with Twitter's widgets.js client-side solution.

Kikobeats · 2020-02-03T09:51:23Z

@Hereward In that case, maybe you are interested in Microlink SDK, since you can load external data using setData

JakeCoxon · 2020-06-04T13:39:05Z

@Kikobeats The user agent hack doesn't seem to work anymore. It looks like Twitter dropped support for that browser and none of the data is in the returned HTML. Perhaps we need to use the API, like in metascraper-media-provider?

Kikobeats · 2020-06-04T13:53:13Z

@JakeCoxon you're right; there's a new workaround to access the old version, passing X-Requested-With

taspinar/twitterscraper#296 (comment)

It isn't too affordable, though. Ideally we want to access the current version with no tradeoffs; we need to investigate how to land a better solution there.

watlandc · 2022-06-03T16:59:34Z

@Kikobeats Thanks for maintaining this helpful library 🙌

Twitter seems to be one of the most important sites to be able to scrape previews from.

Would you be open to gauging the community's interest in sponsoring your development of the Twitter connector?

Perhaps noting that if the community can sponsor your time and effort for this feature at $n, we take it on?

I would be willing to contribute sponsorship towards this effort 👍

Kikobeats · 2022-06-03T18:28:14Z

Hey, @watlandc

I'd prefer it if someone took the initiative. My time is limited, and I will be happy to add new contributors to the project.

Note that the Twitter detection is not as bad as when the issue was originally reported, but it could definitely be better:

https://api.microlink.io/?url=https://twitter.com/BytesAndHumans/status/1532772903523065858

Do you miss something specific?
How are you using metascraper?
What do you need?

🙂

cusxio · 2022-06-04T02:48:47Z

@Kikobeats are you strictly using metascraper with the headers User-Agent and X-Requested-With to get the result shown in the image above?

I'm asking because the X-Requested-With workaround doesn't seem to work.

Kikobeats · 2022-06-04T10:20:38Z

What's microscraper? This project is metascraper, and the hosted version is microlink.io 😛

I didn't use any custom headers. It looks like you are experiencing issues with getting the content, which is outside metascraper's scope. Metascraper just applies rules over the content, so getting the content is a precondition. As long as the content is right, the metascraper output will be accurate.

In any case, I recommend you take a look at html-get!

cusxio · 2022-06-04T11:10:51Z

Whoops, definitely meant metascraper; I must have mixed up microlinkhq and metascraper. My bad.

But yeah, using html-get definitely gets the right HTML! I will need to dig into why that's the case, i.e. what html-get is doing differently from a vanilla fetch.

watlandc · 2022-06-07T17:30:22Z

> Do you miss something specific?

As you noted here, we were having issues getting the content from Twitter. We ended up finding a workaround to get it working again.

> How are you using metascraper?

Simply getting link preview content for an app that helps you stay organized: https://sprout.io.

> What do you need?

Looks like we're good for now (we're getting the same result as the screenshot above).

Although, returning the media inside the tweet would be a huge bonus.

Thanks!

Kikobeats · 2022-06-07T21:46:00Z

@watlandc

I wrote a full example using metascraper + html-get, targeting a tweet with media:

const createBrowserless = require('browserless')
const getHTML = require('html-get')

// Spawn Chromium process once
const browserlessFactory = createBrowserless()

// Kill the process when Node.js exit
process.on('exit', () => {
  console.log('closing resources!')
  browserlessFactory.close()
})

const getContent = async url => {
  // create a browser context inside Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
  // close the browser context after it's used; getBrowserless ignores
  // its arguments, so resolve the context and destroy it directly
  const browser = await browserContext
  await browser.destroyContext()
  return result
}

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

getContent('https://twitter.com/BytesAndHumans/status/1532772903523065858')
  .then(async ({ html, url }) => {
    const metadata = await metascraper({ html, url })
    console.log(metadata)
    process.exit()
  })
  .catch(error => {
    console.error(error)
    process.exit(1)
  })

output:

{
  author: null,
  date: '2022-06-07T21:42:24.000Z',
  description: '“What a week 🐣❤️📈”',
  image: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large',
  logo: 'https://logo.clearbit.com/twitter.com',
  publisher: 'Twitter',
  title: 'Elena on Twitter',
  url: 'https://twitter.com/BytesAndHumans/status/1532772903523065858'
}

So it is possible. Is the problem still getting the content on your side?

Kikobeats · 2022-06-07T22:21:19Z

In case you need to detect the author, then the tweet URL https://twitter.com/:username/status/:id can be considered a pattern for getting the username as author.
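A minimal sketch of that idea, assuming the tweet URL follows the `https://twitter.com/:username/status/:id` pattern mentioned above. The function name `parseTweetUrl` and the regular expression are illustrative and not part of any metascraper package; the 15-character limit reflects Twitter's username rules.

```javascript
// Matches https://twitter.com/:username/status/:id, optionally with a
// www. or mobile. subdomain. Usernames are letters, digits and
// underscores, up to 15 characters.
const TWEET_URL =
  /^https?:\/\/(?:www\.|mobile\.)?twitter\.com\/([A-Za-z0-9_]{1,15})\/status\/(\d+)/

// Returns { username, id } for a tweet URL, or null for anything else.
const parseTweetUrl = url => {
  const match = TWEET_URL.exec(url)
  return match ? { username: match[1], id: match[2] } : null
}

const tweet = parseTweetUrl(
  'https://twitter.com/BytesAndHumans/status/1532772903523065858'
)
console.log(tweet && tweet.username) // BytesAndHumans
```

The username could then be used as a fallback when `author` comes back as `null`, as it does in the output above.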

still worth it to write a specific twitter package!

watlandc · 2022-06-10T16:08:42Z

> I wrote a full example using metascraper + html-get, targeting a tweet with media:

Very nice!

> So it is possible. Is the problem still getting the content on your side?

We're able to get the content now.

bhayward93 · 2022-06-30T13:04:35Z

@Kikobeats I was testing the above solution, and it works for many scenarios; however, I am still having issues getting the image for some URLs, such as https://twitter.com/Twitter/status/1483427748500717573

Checking the Microlink API, it appears to also have an issue getting the image from this domain.

$ curl -sL 'https://api.microlink.io?url=https://twitter.com/Twitter/status/1483427748500717573' | jq

{
  "status": "success",
  "data": {
    "title": "Tweet / Twitter",
    "description": "Don’t miss what’s happening",
    "lang": "en",
    "author": null,
    "publisher": "Twitter",
    "image": null,
    "date": "2022-06-30T13:00:05.000Z",
    "url": "https://twitter.com/twitter/status/1483427748500717573",
    "logo": {
      "url": "https://abs.twimg.com/responsive-web/client-web/icon-ios.b1fc7278.png",
      "type": "png",
      "size": 8582,
      "height": 1024,
      "width": 1024,
      "size_pretty": "8.58 kB"
    }
  },
  "statusCode": 200,
  "headers": {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "content-encoding": "gzip",
    "content-security-policy": "connect-src 'self' blob: https://*.pscp.tv https://*.video.pscp.tv https://*.twimg.com https://api.twitter.com https://api-stream.twitter.com https://ads-api.twitter.com https://aa.twitter.com https://caps.twitter.com https://pay.twitter.com https://sentry.io https://ton.twitter.com https://twitter.com https://upload.twitter.com https://www.google-analytics.com https://accounts.google.com/gsi/status https://accounts.google.com/gsi/log https://app.link https://api2.branch.io https://bnc.lt wss://*.pscp.tv https://vmap.snappytv.com https://vmapstage.snappytv.com https://vmaprel.snappytv.com https://vmap.grabyo.com https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net ; default-src 'self'; form-action 'self' https://twitter.com https://*.twitter.com; font-src 'self' https://*.twimg.com; frame-src 'self' https://twitter.com https://mobile.twitter.com https://pay.twitter.com https://cards-frame.twitter.com https://accounts.google.com/ https://client-api.arkoselabs.com/ https://iframe.arkoselabs.com/  https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/; img-src 'self' blob: data: https://*.cdn.twitter.com https://ton.twitter.com https://*.twimg.com https://analytics.twitter.com https://cm.g.doubleclick.net https://www.google-analytics.com https://www.periscope.tv https://www.pscp.tv https://media.riffsy.com https://*.giphy.com https://media.tenor.com https://c.tenor.com https://*.pscp.tv https://*.periscope.tv https://prod-periscope-profile.s3-us-west-2.amazonaws.com https://platform-lookaside.fbsbx.com https://scontent.xx.fbcdn.net https://scontent-sea1-1.xx.fbcdn.net 
https://*.googleusercontent.com https://imgix.revue.co; manifest-src 'self'; media-src 'self' blob: https://twitter.com https://*.twimg.com https://*.vine.co https://*.pscp.tv https://*.video.pscp.tv https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net; object-src 'none'; script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://client-api.arkoselabs.com/ https://www.google-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js  'nonce-NDhlMTBkYWYtY2JjYy00OTlmLWFiZDUtOTgxYmMyODZiMWZi'; style-src 'self' 'unsafe-inline' https://accounts.google.com/gsi/style https://*.twimg.com; worker-src 'self' blob:; report-uri https://twitter.com/i/csp_report?a=O5RXE%3D%3D%3D&ro=false",
    "content-type": "text/html; charset=utf-8",
    "cross-origin-embedder-policy": "unsafe-none",
    "cross-origin-opener-policy": "same-origin-allow-popups",
    "date": "Thu, 30 Jun 2022 13:00:05 GMT",
    "expiry": "Tue, 31 Mar 1981 05:00:00 GMT",
    "last-modified": "Thu, 30 Jun 2022 13:00:05 GMT",
    "pragma": "no-cache",
    "server": "tsa_b",
    "set-cookie": "guest_id_marketing=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id_ads=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\npersonalization_id=\"v1_8EK5HpiZcmdp/iJUPUtSKA==\"; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None",
    "strict-transport-security": "max-age=631138519",
    "x-connection-hash": "f744deb62fd346d8e4ea08b01e182bbb049039ceaffeca5a3f8472612fa1bdc1",
    "x-content-type-options": "nosniff",
    "x-frame-options": "DENY",
    "x-powered-by": "Express",
    "x-response-time": "35",
    "x-xss-protection": "0"
  }
}
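For repeating this check from code rather than curl, something like the sketch below could be used. It builds the same Microlink API URL and lists which fields of the `data` object came back empty. The helper names `microlinkUrl` and `missingFields` are hypothetical, and the field list is simply the keys visible in the response above.

```javascript
// Build the Microlink API URL for a target page, with the target
// URL passed as a percent-encoded query parameter.
const microlinkUrl = targetUrl => {
  const apiUrl = new URL('https://api.microlink.io')
  apiUrl.searchParams.set('url', targetUrl)
  return apiUrl.toString()
}

// Fields seen in the `data` object of the response above.
const DATA_FIELDS = [
  'title', 'description', 'author', 'publisher', 'image', 'date', 'url', 'logo'
]

// Return the names of the fields that are null or absent.
const missingFields = (data, fields = DATA_FIELDS) =>
  fields.filter(field => data[field] === null || data[field] === undefined)

// Trimmed copy of the response data shown above.
const data = {
  title: 'Tweet / Twitter',
  description: 'Don’t miss what’s happening',
  author: null,
  publisher: 'Twitter',
  image: null,
  date: '2022-06-30T13:00:05.000Z',
  url: 'https://twitter.com/twitter/status/1483427748500717573',
  logo: { url: 'https://abs.twimg.com/responsive-web/client-web/icon-ios.b1fc7278.png' }
}

console.log(missingFields(data)) // [ 'author', 'image' ]
```

The generic title and description in that response suggest the HTML that was fetched did not contain the tweet itself, which is consistent with `image` being `null`.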

Looking at metascraper-image package, I'm confused as to why this line isn't picking up the og:image tag.

toImage(($: any) => $('meta[property="og:image"]').attr('content')),
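One way to narrow this down is to check whether the fetched HTML contains an `og:image` tag at all before suspecting the rule. The sketch below is a dependency-free debugging aid, not how metascraper-image works internally (that package uses the selector quoted above); `findOgImage` is a hypothetical name, and the regular expressions only cover simply quoted attributes.

```javascript
// Return the content of the first <meta property="og:image"> tag found
// in an HTML string, or null when the tag is absent.
const findOgImage = html => {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || []
  for (const tag of metaTags) {
    // Skip meta tags that are not the og:image property.
    if (!/\bproperty\s*=\s*["']og:image["']/i.test(tag)) continue
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag)
    if (content) return content[1]
  }
  return null
}

const withTag =
  '<head><meta content="https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large" property="og:image"></head>'
const withoutTag = '<head><title>Tweet / Twitter</title></head>'

console.log(findOgImage(withTag)) // https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large
console.log(findOgImage(withoutTag)) // null
```

If this returns `null` for the HTML that was actually fetched, the selector has nothing to match and the problem lies in getting the content rather than in the rule.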

bhayward93 · 2022-12-08T14:20:10Z

Have there been any updates on this?

Kikobeats · 2022-12-31T13:22:26Z

Hello everyone,

The PR is ready: #608

I'd appreciate your feedback; otherwise it will be merged shortly 🙂

bhayward93 · 2023-01-20T15:59:52Z

> Hello everyone,
>
> The PR is ready: #608
>
> I'd appreciate your feedback; otherwise it will be merged shortly 🙂

Thanks for taking a look at it @Kikobeats - testing it out I'm noticing improvements for things like avatars on profiles, but am still having issues with many links such as the one above (https://twitter.com/Twitter/status/1483427748500717573/). Can you confirm that for you, this link is parsed correctly?

This also happens for some images. I saved the output HTML to a file to debug, and it appears that the page does not load correctly the majority of the time: it is missing at minimum some essential metadata such as the og:image tags, though sometimes they do appear. Can you confirm whether you can replicate this? Regardless, I don't believe the issue lies directly with metascraper, but I wonder whether you think it could relate to html-get, or perhaps Browserless.

bhayward93 · 2023-01-20T19:50:38Z

@Kikobeats, I have got more links working - including the one above and some others that I was having issues with.

I've found that it works when pre-rendering is turned off for Twitter and the user agent is set to Googlebot. I'd be happy to put forward a PR to add Twitter to the auto-domains file in your getHTML package if that's something you would want; for now, I'm just going to fork it. Note, though, that it does not work without setting the user agent, which is something I don't think you'd want to set at a package level.
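A sketch of what that configuration might look like when calling html-get directly, assuming its `prerender` and `headers` options. The exact Googlebot user-agent string is not given in this comment, so the one below (Google's commonly published crawler string) is an assumption, and `fetchTweetHtml` is a hypothetical wrapper that takes `getHTML` as a parameter so it can be exercised without the real package.

```javascript
// Assumed Googlebot user-agent string; the comment above does not say
// which exact value was used.
const GOOGLEBOT_USER_AGENT =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

// Hypothetical wrapper around html-get: skip pre-rendering and send the
// Googlebot user agent. `getHTML` is expected to be require('html-get'),
// and `getBrowserless` the same function passed in the earlier example.
const fetchTweetHtml = (getHTML, url, getBrowserless) =>
  getHTML(url, {
    getBrowserless,
    prerender: false,
    headers: { 'user-agent': GOOGLEBOT_USER_AGENT }
  })
```

With the real package this would be called as `fetchTweetHtml(require('html-get'), tweetUrl, getBrowserless)`, and the resolved `{ html, url }` passed on to metascraper.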

One thing to note about the metascraper-twitter package is that I had to place it below metascraper-image; otherwise, for some tweets where metascraper-image WAS correctly grabbing the image, it grabs a generic Twitter bird.
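The ordering effect can be illustrated with a simplified model. metascraper's documentation states that rule order matters and that the first rule resolving a value is the one applied; the code below imitates that with plain functions. It is not metascraper's real implementation, and both rule functions are hypothetical stand-ins for the behavior described in this comment.

```javascript
// Simplified model of rule resolution: rules are tried in order and the
// first one that yields a non-empty value wins.
const resolveFirst = (rules, context) => {
  for (const rule of rules) {
    const value = rule(context)
    if (value !== undefined && value !== null && value !== '') return value
  }
  return null
}

// Hypothetical stand-ins for the two image rules discussed above.
const genericImageRule = ({ ogImage }) => ogImage || null
const twitterImageRule = () => 'https://example.com/generic-twitter-bird.png'

const tweetWithMedia = {
  ogImage: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large'
}

// Generic rule first: the tweet's own image is kept.
console.log(resolveFirst([genericImageRule, twitterImageRule], tweetWithMedia))
// Twitter rule first: the generic picture shadows the tweet's image.
console.log(resolveFirst([twitterImageRule, genericImageRule], tweetWithMedia))
```

In a real setup this corresponds to listing `require('metascraper-image')()` before `require('metascraper-twitter')()` in the array passed to metascraper.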

Kikobeats changed the title ~~Twitter URLs return NULL for all fields except logo~~ Add a new package for connecting with Twitter Jan 31, 2020

Kikobeats added the enhancement label Jan 31, 2020

Kikobeats changed the title ~~Add a new package for connecting with Twitter~~ [metascraper-twitter] Add specific connector Jan 31, 2020

Kikobeats mentioned this issue Apr 13, 2021

[metascraper-twitter] Correct Images are Not Getting Scraped from Twitter #397

Closed

2 tasks

Kikobeats added a commit that referenced this issue Dec 31, 2022

feat: add metascraper-twitter

1c238fe

Closes #260

Kikobeats mentioned this issue Dec 31, 2022

feat: add metascraper-twitter #608

Merged

Kikobeats added a commit that referenced this issue Dec 31, 2022

feat: add metascraper-twitter

b1e592a

Closes #260

Kikobeats added a commit that referenced this issue Jan 1, 2023

feat: add metascraper-twitter

9179486

Closes #260

Kikobeats closed this as completed in #608 Jan 1, 2023

Kikobeats added a commit that referenced this issue Jan 1, 2023

feat: add metascraper-twitter (#608)

075c0ab

Closes #260
