Skip to content

Latest commit

 

History

History
254 lines (186 loc) · 7.1 KB

File metadata and controls

254 lines (186 loc) · 7.1 KB

Retweet Graphs

Constructing retweet graphs:

BIGQUERY_DATASET_NAME="impeachment_production" BATCH_SIZE=10000 TOPIC="impeach" python -m app.retweet_graphs.bq_retweet_grapher

BIGQUERY_DATASET_NAME="impeachment_production" BATCH_SIZE=10000 TOPIC="#MAGA" python -m app.retweet_graphs.bq_retweet_grapher

BIGQUERY_DATASET_NAME="impeachment_production" BATCH_SIZE=10000 "#ImpeachAndConvict" python -m app.retweet_graphs.bq_retweet_grapher

Observe the resulting job identifier (JOB_ID), and verify the graph and other artifacts are saved to local storage and/or Google Cloud Storage.

Once you have created a retweet graph, note its JOB_ID, and see how much memory it takes to load a given graph:

# right-leaning conversation graph
JOB_ID="2020-06-07-2049" STORAGE_MODE="local" python -m app.friend_graphs.graph_analyzer
JOB_ID="2020-06-07-2049" STORAGE_MODE="remote" python -m app.friend_graphs.graph_analyzer

# left-leaning conversation graph
JOB_ID="2020-06-07-2056" STORAGE_MODE="local" python -m app.friend_graphs.graph_analyzer
JOB_ID="2020-06-07-2056" STORAGE_MODE="remote" python -m app.friend_graphs.graph_analyzer

# neutral conversation retweet graph
JOB_ID="2020-06-15-2141" STORAGE_MODE="local" python -m app.friend_graphs.graph_analyzer

Weekly Retweet Graphs

Constructing retweet graphs for a given week in the dataset:

BIGQUERY_DATASET_NAME="impeachment_production" BATCH_SIZE=2500 python -m app.retweet_graphs.bq_weekly_grapher
# BIGQUERY_DATASET_NAME="impeachment_production" BATCH_SIZE=2500 WEEK_ID="2019-52" python -m app.retweet_graphs.bq_weekly_grapher

Load weekly retweet graphs to see how much memory it takes:

WEEK_ID="2019-52" python -m app.retweet_graphs.bq_weekly_graph_loader

Bot Classification

Once you have created a retweet graph, note its JOB_ID or WEEK_ID, then compute bot probabilities for each node:

# JOB_ID="2020-06-15-2141" python -m app.botcode_v2.classifier
JOB_ID="2020-06-15-2141" DRY_RUN="false" python -m app.botcode_v2.classifier
# WEEK_ID="2019-52" python -m app.retweet_graphs.bq_weekly_graph_bot_classifier
WEEK_ID="2019-52" DRY_RUN="false" python -m app.retweet_graphs.bq_weekly_graph_bot_classifier

KS Tests

Retweeter Age Distribution By Topic

Compare the distribution of user creation dates for those retweeting about a given topic, vs those not retweeting about that topic:

BIGQUERY_DATASET_NAME="impeachment_production" TOPIC="#ImpeachAndConvict" python -m app.ks_test.topic_analyzer

Retweeter Age Distribution By Topic Pair

Compare the distribution of user creation dates for those retweeting exclusively about one of two different topics:

BIGQUERY_DATASET_NAME="impeachment_production" X_TOPIC="#ImpeachAndConvict" Y_TOPIC="#MAGA" python -m app.ks_test.topic_pair_analyzer

Retweet Graph Notes

Migrate and populate a few tables which we'll use to construct mention graphs.

First, for each tweet that is a retweet, determine the screen names of the user (i.e. retweeter) and the retweet user (i.e. retweeted) in a new table called "retweets":

DROP TABLE IF EXISTS impeachment_production.retweets;
CREATE TABLE IF NOT EXISTS impeachment_production.retweets as (
  SELECT
    user_id
    ,user_created_at
    ,user_screen_name
    ,split(SPLIT(status_text, "@")[OFFSET(1)], ":")[OFFSET(0)] as retweet_user_screen_name
    ,status_id
    ,status_text
    ,created_at
  FROM impeachment_production.tweets
  WHERE retweet_status_id is not null
);
SELECT count(DISTINCT status_id)
FROM impeachment_production.retweets;

Of the 67M tweets, 55.9M are retweets.

Let's count how many times each user retweeted another, in a new table called "retweet_counts":

DROP TABLE IF EXISTS impeachment_production.retweet_counts;
CREATE TABLE IF NOT EXISTS impeachment_production.retweet_counts as (
  SELECT
    user_id
    ,user_screen_name
    ,retweet_user_screen_name
    ,count(distinct status_id) as retweet_count
  FROM impeachment_production.retweets
  GROUP BY 1,2,3
);

For the 55M retweets, there are 2.7M users who did the retweeting. This is compared to the 3.6M users total.

SELECT count(DISTINCT user_id)
FROM impeachment_production.retweet_counts;

It seems there are many users who retweeted themselves hundreds or thousands of times. Perhaps we want to filter them out of a table that will construct the rt graphs. Although it is possible they could indicate bot behavior.

SELECT user_id, user_screen_name, retweet_user_screen_name, retweet_count
FROM impeachment_production.retweet_counts
order by retweet_count desc
limit 100

Which users were retweeted the most?

select retweet_user_screen_name, count(distinct user_id) as rt_by_users_count, count(distinct status_id) as rt_count
from impeachment_production.retweets
group by 1
order by 2 desc
limit 25
retweet_user_screen_name rt_by_users_count rt_count
realDonaldTrump 330211 2572421
charliekirk11 261486 1235210
SethAbramson 137606 536125
gtconway3d 133292 536097
tribelaw 121937 639320
RepAdamSchiff 117465 293892
SpeakerPelosi 114274 226393
funder 112302 394151
dbongino 110514 456759
DonaldJTrumpJr 104977 334537
JoyceWhiteVance 102185 472908
kylegriffin1 98250 389961
RepMarkMeadows 98027 434223
RudyGiuliani 89722 266291
marklevinshow 87526 331255
GOPLeader 87211 276222
mmpadellan 80404 200454
RealJamesWoods 79354 157840
joncoopertweets 79340 212499
justinamash 79183 166078
w_terrence 78612 185069
TomFitton 78148 405667
RyanAFournier 78090 179753
TheRickWilson 77116 172818
senatemajldr 75212 191118

Identify which users were retweeted the most, about a given topic:

SELECT
  retweet_user_screen_name
  ,count(distinct user_id) as retweet_user_count
  ,sum(retweet_count) as retweeted_total
FROM (
  SELECT
    rt.user_id
    ,rt.user_screen_name
    ,rt.retweet_user_screen_name
    ,count(distinct status_id) as retweet_count
  FROM impeachment_production.retweets rt
  WHERE upper(rt.status_text) LIKE '%#IMPEACHANDCONVICT%' -- AND (created_at BETWEEN  AND '{end_at}')
    AND rt.user_screen_name <> rt.retweet_user_screen_name -- exlude people retweeting themselves, right?
  GROUP BY 1,2,3
  -- ORDER BY 4 desc
)
GROUP BY retweet_user_screen_name
ORDER BY 3 DESC

Constructing Retweet Graphs:

{
  "app_env": "development",
  "job_id": "2020-06-15-2141",
  "dry_run": false,
  "batch_size": 1000,
  "dataset_address": "tweet-collector-py.impeachment_production",
  "destructive": false,
  "verbose": false,
  "retweeters": true,
  "conversation":

  {"users_limit": 50000, "topic": "impeach", "start_at": "2020-01-01", "end_at": "2020-01-30"}
}

Updated Migrations

DROP TABLE IF EXISTS impeachment_production.retweets;
CREATE TABLE IF NOT EXISTS impeachment_production.retweets as (
  SELECT
    user_id
    ,user_created_at
    ,user_screen_name
    ,split(SPLIT(status_text, "@")[OFFSET(1)], ":")[OFFSET(0)] as retweet_user_screen_name
    ,status_id
    ,status_text
    ,created_at
  FROM impeachment_production.tweets
  WHERE retweet_status_id is not null
);