There are a lot of issues with the archive's format, including missing information, ambiguously structured information, and things that are just plain difficult to work with. This ticket will just track these for future reference.
DM conversations include the account ID of the other person, but not their screen name. In some cases it can be inferred from other interactions (you replied to them, you retweeted them, you @'ed them), but most of the time it cannot.
There is nothing in the data indicating whether a tweet is an RT; the only option is to check for the text `RT` at the start of the tweet. In such cases, the first user mention will be the person who is being RT'd.
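A minimal sketch of that heuristic in Python. It assumes a tweet is a dict with `full_text` and `entities.user_mentions[].screen_name` keys; the function name `retweeted_user` and the regex fallback for missing mentions are illustrative choices, not anything the archive provides.

```python
import re
from typing import Optional

# Fallback pattern for when the entity data is missing (assumption: RTs look like "RT @name: ...").
_RT_PREFIX = re.compile(r"^RT @([A-Za-z0-9_]+)")


def retweeted_user(tweet: dict) -> Optional[str]:
    """Return the screen name of the account being retweeted, or None if not an RT."""
    text = tweet.get("full_text", "")
    match = _RT_PREFIX.match(text)
    if match is None:
        # No "RT @..." prefix, so treat it as an ordinary tweet.
        return None
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if mentions and mentions[0].get("screen_name"):
        # The first user mention is the person being RT'd.
        return mentions[0]["screen_name"]
    # Entity data is absent; fall back to the name parsed from the text.
    return match.group(1)
```

A tweet that was typed by hand and happens to start with "RT @" is indistinguishable from a real retweet under this check.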
There is no data in the dump for how many replies a tweet got, despite having counts for the number of retweets and likes.
Replies to you and tweets you replied to are not included in the dump, which is understandable, but still frustrating.
There is no way to distinguish whether someone has been @'ed or has been tagged as being replied to. The dump only counts one person as being "replied to".
RTs get truncated at 140 characters, with the last 3 characters being replaced by an ellipsis (`...`).
When an RT gets truncated, anything that gets chopped off will lose its entity information. So if an RT had a video, and the video's t.co link gets truncated, then the entity information will not contain the video and the .mp4 will not be in the dump.
The `truncated` field in the dump is always false, even when an RT gets truncated as described above.
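Since the `truncated` flag cannot be trusted, a tool has to guess. The sketch below is one possible heuristic, not a reliable detector: it assumes a tweet dict with a `full_text` key, accepts either a single `…` character or three dots as the ellipsis (which of the two the dump uses is an assumption here), and will misfire on retweets of text that genuinely ends in an ellipsis. The name `looks_truncated_rt` is hypothetical.

```python
def looks_truncated_rt(tweet: dict) -> bool:
    """Guess whether a retweet's text was cut off, ignoring the `truncated` field."""
    text = tweet.get("full_text", "")
    # Only RTs are affected, and they are only recognisable by their prefix.
    is_rt = text.startswith("RT @")
    # Accept both the Unicode ellipsis and three ASCII dots.
    ends_with_ellipsis = text.endswith("\u2026") or text.endswith("...")
    return is_rt and ends_with_ellipsis
```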
If someone you replied to has changed username since you made the tweet, then the `user_mentions` entity information will be missing.
The `source` field contains HTML, which is clunky. Why can't it be a URL and a summary field?
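Until that changes, the HTML has to be picked apart by hand. The following sketch assumes the field holds a single `<a href="...">name</a>` anchor; `parse_source` and `_AnchorParser` are hypothetical names, and only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser
from typing import Optional, Tuple


class _AnchorParser(HTMLParser):
    """Collects the first <a> tag's href and all text content."""

    def __init__(self) -> None:
        super().__init__()
        self.url: Optional[str] = None
        self.text_parts: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.url is None:
            self.url = dict(attrs).get("href")

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_source(source_html: str) -> Tuple[Optional[str], str]:
    """Split a tweet's `source` HTML into (url, client name)."""
    parser = _AnchorParser()
    parser.feed(source_html)
    parser.close()  # flush any buffered text
    return parser.url, "".join(parser.text_parts).strip()
```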
Liked tweets have extremely little information contained in the dump.
The only way to know who the other person in a DM is, is to parse the conversation's ID. That ID is the two participants' IDs, sorted numerically(??), then joined with a `-`.
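A sketch of that parsing step. It relies only on the `{id}-{id}` shape described above and sidesteps the question of sort order by comparing against your own account ID, which the caller has to supply; `other_participant` is a hypothetical helper name.

```python
def other_participant(conversation_id: str, my_id: str) -> str:
    """Return the account ID of the other person in a one-to-one DM conversation."""
    first, sep, second = conversation_id.partition("-")
    if not sep:
        raise ValueError(f"not a two-participant conversation ID: {conversation_id!r}")
    # Whichever half is not mine belongs to the other person.
    if first == my_id:
        return second
    if second == my_id:
        return first
    raise ValueError(f"{my_id!r} is not a participant in {conversation_id!r}")
```

This still only yields an account ID; the screen name has to come from somewhere else.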
Nitpick: I have to parse Twitter CDN URLs in order to map them to filenames in the media zip. For example, `https://video.twimg.com/ext_tw_video/{tweet id}/pu/vid/624x1230/{filename}.mp4?tag=10`.
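For the video URL shape shown above, the parsing could look roughly like this. The sketch only extracts the ID segment and the filename; how those two pieces combine into a name inside the media zip is left to the caller, since that is not spelled out here. `split_video_url` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlsplit


def split_video_url(url: str) -> Tuple[str, str]:
    """Return (id segment, filename) from an ext_tw_video CDN URL."""
    # urlsplit drops the query string (?tag=10) from the path for us.
    parts = PurePosixPath(urlsplit(url).path).parts
    # Expected: ('/', 'ext_tw_video', '{tweet id}', 'pu', 'vid', '624x1230', '{filename}.mp4')
    if len(parts) < 4 or parts[1] != "ext_tw_video":
        raise ValueError(f"unrecognised video URL: {url!r}")
    return parts[2], parts[-1]
```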
Profile pictures of people you've RT'd are not included in the dump.
Tweet `full_text` is for some reason HTML escaped, so I have to undo the `&gt;`.
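In Python the standard library's `html.unescape` handles this. A minimal sketch, assuming a tweet dict with a `full_text` key (the wrapper name `unescaped_text` is made up for illustration):

```python
import html


def unescaped_text(tweet: dict) -> str:
    """Return the tweet text with HTML entities such as &gt; and &amp; decoded."""
    return html.unescape(tweet.get("full_text", ""))
```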
Many parts of a tweet's `full_text` are hidden by the Twitter web client, but the metadata needed to hide them the same way is not included in the dump.
No information is included about Quote-RTs. There's no way to even know whether a tweet references a QRT: there isn't even a link inline.
The `.js` files in the archive are about 40 characters away from being JSON. If you just removed this weird prefix, they would be JSON:
window.YTD.follower.part0=[{
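A workaround sketch: drop everything up to the first `=` and hand the rest to a JSON parser. This assumes the assignment target never contains an `=` itself and that the remainder is valid JSON; `parse_archive_js` is a hypothetical helper name.

```python
import json


def parse_archive_js(contents: str):
    """Parse the text of an archive .js file by stripping the `window.YTD...=` prefix."""
    _prefix, sep, payload = contents.partition("=")
    if not sep:
        raise ValueError("no assignment prefix found")
    return json.loads(payload)
```

For a file on disk, read it as UTF-8 text first and pass the resulting string in.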
There's no version field attached to the archives, and they sometimes change format in ways that would break tools people have written. That said, the last time I'm aware of that the archive changed, it was making the images in the dump actually able to be referenced using data in the archive, so don't stop breaking it I guess.
In DMs, links only exist as inline `t.co` URLs that have to be parsed out and resolved. @-mentions have to be parsed out as well.
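A rough sketch of the parsing half. Both regular expressions are assumptions about what `t.co` links and screen names look like in message text, not patterns taken from the archive, and resolving each short link to its target would still need a separate network request.

```python
import re
from typing import List, Tuple

# Assumed shape of a t.co short link.
TCO_URL = re.compile(r"https?://t\.co/[A-Za-z0-9]+")
# "@name" not preceded by a word character, so e-mail addresses are skipped.
MENTION = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]+)")


def extract_dm_entities(text: str) -> Tuple[List[str], List[str]]:
    """Return (t.co URLs, mentioned screen names) found in a DM's text."""
    return TCO_URL.findall(text), MENTION.findall(text)
```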
There are image media in the dump for DM conversations, but there is no way to correlate them with the `t.co` URLs in the actual DM conversations.
It turns out `imageUrls` is not always empty: it is used solely for these DM images.
This list is a work in progress!