Tracking issue for archive bugs #16

Open
tiffany352 opened this issue Oct 9, 2019 · 0 comments

There are a lot of issues with the archive's format, including missing information, ambiguously structured information, and things that are just plain difficult to work with. This ticket will just track these for future reference.

  • DM conversations include the account ID of the other person, but not their screen name. In some cases it can be inferred from other interactions (you replied to them, you retweeted them, you @'ed them), but most of the time it cannot.
  • There is nothing in the data indicating whether a tweet is an RT other than to check for the text RT at the start of the tweet. In such cases, the first user mention will be the person who is being RT'd.
  • There is no data in the dump for how many replies a tweet got, despite having counts for the number of retweets and likes.
  • Replies to you and tweets you replied to are not included in the dump, which is understandable, but still frustrating.
  • There is no way to distinguish whether someone has been @'ed or has been tagged as being replied to. The dump only counts one person as being "replied to".
  • RTs get truncated at 140 characters, with the last 3 characters being replaced by an ellipsis (...).
  • When an RT gets truncated, anything that gets chopped off will lose its entity information. So if an RT had a video, and the video's t.co link gets truncated, then the entity information will not contain the video and the .mp4 will not be in the dump.
  • The truncated field in the dump is always false, even when an RT gets truncated like above.
  • If someone you replied to has changed username since you made the tweet, then the user_mentions entity information will be missing.
  • The source field contains html, which is clunky. Why can't it be a URL and a summary field?
  • Liked tweets have extremely little information contained in the dump.
  • The only way to know who the other person in a DM is, is by parsing the conversation ID. That ID is the two participants' IDs, sorted numerically(??), then concatenated together with a -.
  • Nitpick: I have to parse Twitter CDN URLs in order to map them to filenames in the media zip. For example, https://video.twimg.com/ext_tw_video/{tweet id}/pu/vid/624x1230/{filename}.mp4?tag=10.
  • Profile pictures of people you've RT'd are not included in the dump.
  • Tweet full_text is for some reason HTML escaped, so I have to undo escapes such as &gt;.
  • Many bits of the Tweet full_text do not show up in the Twitter Web client, but suitable metadata for doing the same is not included in the dump.
  • No information is included about Quote-RT's. There's no way to even know whether a Tweet references a QRT - there isn't even a link inline.
  • The status ID attached to an RT points to a weird "retweet status" rather than to the original tweet. For example, https://twitter.com/Tiffnixen/status/1158114079409504256 .
  • The .js files in the archive are like 40 characters away from being json. If you just removed this weird bit.. they would be json...
    window.YTD.follower.part0 = [ {
  • There's no version field attached to the archives, and they sometimes change format in ways that would break tools people have written. That said, the last format change I'm aware of made the images in the dump actually referenceable using data in the archive, so don't stop breaking it I guess.
  • In DMs, links only exist as inline t.co URLs that have to be parsed out and resolved. @-mentions have to be parsed out as well.
  • There are image media in the dump for DM conversations, but there is no way to correlate them with the t.co URLs in the actual DM conversations.
    • imageUrls turns out not to always be empty; it's used solely for this.
  • This list is a work in progress!
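
For the item about the .js files being almost JSON, here is a minimal Python sketch of stripping the assignment prefix before parsing. The regular expression is an assumption generalized from the single example line above (window.YTD.follower.part0 = ...), the function name parse_archive_js is hypothetical, and the sample payload is made up for illustration.

```python
import json
import re

# Assumed prefix shape, generalized from "window.YTD.follower.part0 = ".
_PREFIX = re.compile(r"^\s*window\.YTD\.[A-Za-z0-9_]+\.part\d+\s*=\s*")


def parse_archive_js(text: str):
    """Strip the leading assignment and parse the remainder as JSON."""
    stripped, count = _PREFIX.subn("", text, count=1)
    if count == 0:
        raise ValueError("expected a 'window.YTD.<name>.part<N> = ' prefix")
    return json.loads(stripped)


# Made-up sample content; real archive entries carry more fields.
sample = 'window.YTD.follower.part0 = [ {"follower": {"accountId": "123"}} ]'
followers = parse_archive_js(sample)
print(followers)
```

Raising on a missing prefix, rather than falling back to plain JSON parsing, is a deliberate choice here: with no version field in the archive, a loud failure is the only hint that the format changed.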
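
For the DM conversation ID item, the parsing could look roughly like the sketch below. It assumes the ID has exactly the shape described above (two numeric account IDs joined by a -) and that you already know your own account ID; other_participant is a hypothetical helper name. It does not rely on the sort order, since that part is uncertain.

```python
def other_participant(conversation_id: str, my_account_id: str) -> str:
    """Return the account ID in conversation_id that is not my_account_id."""
    first, sep, second = conversation_id.partition("-")
    if not sep or not first.isdigit() or not second.isdigit():
        raise ValueError(f"unexpected conversation ID: {conversation_id!r}")
    if first == my_account_id:
        return second
    if second == my_account_id:
        return first
    raise ValueError("own account ID does not appear in the conversation ID")


# Made-up IDs for illustration.
print(other_participant("123-456", "456"))
```

This only yields the other account's ID. As noted in the list, the screen name still has to be inferred from other interactions, if it can be found at all.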
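
For the RT detection item, a rough sketch of the heuristic: check the start of full_text, then take the first user mention. The dict layout (an entities object holding user_mentions, each with a screen_name) is an assumption about the dump's shape, and retweeted_screen_name is a hypothetical name. The check uses "RT @" rather than just "RT", which is slightly stricter than what the list describes.

```python
from typing import Optional


def retweeted_screen_name(tweet: dict) -> Optional[str]:
    """Return the screen name of the RT'd user, or None if this is not an RT.

    Heuristic only: a tweet manually typed as "RT @someone ..." will also match.
    """
    if not tweet.get("full_text", "").startswith("RT @"):
        return None
    # Assumed layout: tweet["entities"]["user_mentions"][i]["screen_name"].
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if not mentions:
        return None
    return mentions[0].get("screen_name")


# Made-up tweet for illustration.
example = {
    "full_text": "RT @someone: hello world",
    "entities": {"user_mentions": [{"screen_name": "someone"}]},
}
print(retweeted_screen_name(example))
```

If the RT'd user has changed their username, user_mentions may be missing (as with replies in the list above), in which case this returns None even though the tweet is an RT.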
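
For the HTML-escaped full_text item, Python's standard library can undo the escaping. This is a small sketch; the sample string is made up, and whether entity offsets in the dump refer to the escaped or unescaped text is not something this example addresses.

```python
import html


def unescape_full_text(full_text: str) -> str:
    """Undo HTML escaping such as &gt;, &lt; and &amp; in tweet text."""
    return html.unescape(full_text)


print(unescape_full_text("a &gt; b &amp;&amp; b &lt; c"))
```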
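
For the CDN URL nitpick, the filename part can be pulled out with the standard URL parser instead of string slicing. This sketch only extracts the last path segment and drops the query string; how that name maps onto entries in the media zip is not covered here. media_filename is a hypothetical helper, and the URL below follows the pattern in the list with made-up values.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def media_filename(url: str) -> str:
    """Return the last path segment of a CDN URL, without the query string."""
    return PurePosixPath(urlsplit(url).path).name


url = "https://video.twimg.com/ext_tw_video/111/pu/vid/624x1230/abc.mp4?tag=10"
print(media_filename(url))
```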