This repository has been archived by the owner on Nov 10, 2024. It is now read-only.

Provide cleaning function. #721

Closed
llrs opened this issue Aug 8, 2022 · 5 comments

Comments

@llrs
Collaborator

llrs commented Aug 8, 2022

Many text analyses remove URLs, hashtags, cashtags, and user mentions.
It would be nice to have a function that removes these from the text returned by the API.

@meier-flo

Great idea! I feel like working with the new data structure is quite tricky.
For example, with the entities list column I can easily get at the hashtags, but I can't figure out how to unnest user_mentions:

search_object_result %>%
  select(id_str, entities) %>%
  unnest_auto(entities) %>%
  unnest(hashtags)

However, this breaks:

search_object_result %>%
  select(id_str, entities) %>%
  unnest_auto(entities) %>%
  unnest(user_mentions)

Using unnest_wider(entities); elements have 5 names in common
Error: ! Can't combine ..1$indices <data.frame> and ..22$indices .
Run rlang::last_error() to see where the error occurred.

A cleaning function for both text and maybe the tricky list columns would be nice!

@llrs
Collaborator Author

llrs commented Sep 23, 2022

entities is itself a list: search_object_result$entities[[1]]$user_mentions, search_object_result$entities[[2]]$user_mentions, ...
So to extract each user_mentions element you would need something like lapply(search_object_result$entities, function(x){x$user_mentions}).

I suppose unnest_auto is from tidyr or a similar package. I'm not sure how it handles lists, but I hope this helps.
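
The lapply approach above can be sketched in base R as follows. This is only an illustration on mock data: the structure of search_object_result here is a simplified assumption (real rtweet output has many more fields), and the screen_name column and the mentions_long name are made up for the example.

```r
# Mock stand-in for search_object_result (assumption: one entities element per
# tweet, each holding a user_mentions data frame; NA means "no mention").
search_object_result <- list(
  id_str = c("1", "2", "3"),
  entities = list(
    list(user_mentions = data.frame(screen_name = c("alice", "bob"),
                                    stringsAsFactors = FALSE)),
    list(user_mentions = data.frame(screen_name = "carol",
                                    stringsAsFactors = FALSE)),
    list(user_mentions = data.frame(screen_name = NA_character_,
                                    stringsAsFactors = FALSE))
  )
)

# Pull the user_mentions element out of every entities entry.
mentions <- lapply(search_object_result$entities, function(x) x$user_mentions)

# Attach the tweet id to each set of mentions and stack them into one data frame.
mentions_long <- do.call(rbind, Map(function(id, m) {
  data.frame(id_str = id, screen_name = m$screen_name, stringsAsFactors = FALSE)
}, search_object_result$id_str, mentions))

# Drop the placeholder rows for tweets without mentions and reset row names.
mentions_long <- mentions_long[!is.na(mentions_long$screen_name), ]
rownames(mentions_long) <- NULL

print(mentions_long)
```

Extracting the element first and binding afterwards avoids asking a generic unnesting helper to combine columns whose types differ between tweets, which may be what the error above is complaining about.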

@llrs
Collaborator Author

llrs commented Sep 27, 2022

In the devel branch I implemented a function that removes URLs, media links, mentions, and hashtags from the text and returns the remaining text. You can test it in the devel branch (I recommend activating dev mode for this testing):

devtools::dev_mode()
remotes::install_github("ropensci/rtweet@devel")
library("rtweet")
clean_tweets(search_object_result)

Let me know if this works well or if you think a different approach would be better.
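
For anyone who only wants a rough idea of what this kind of cleaning involves, here is a minimal base R sketch using regular expressions. It is not the implementation of clean_tweets (the thread does not show it), and strip_tweet_text is a hypothetical name used only for this example. It works on plain character vectors and also strips cashtags, which is an extra assumption.

```r
# Hypothetical helper, NOT rtweet's clean_tweets(): a regex-only approximation.
strip_tweet_text <- function(text) {
  # Remove URLs (this also covers media links, which are URLs in the text).
  text <- gsub("https?://\\S+", "", text, perl = TRUE)
  # Remove @mentions, #hashtags and $CASHTAGS that start a word,
  # keeping the whitespace (or start of string) that preceded them.
  text <- gsub("(^|\\s)(?:[@#]\\w+|\\$[A-Za-z]+)", "\\1", text, perl = TRUE)
  # Collapse the leftover runs of whitespace and trim both ends.
  text <- gsub("\\s+", " ", text, perl = TRUE)
  trimws(text)
}

example <- "Great talk by @llrs about #rtweet and $TWTR https://example.com/slides"
print(strip_tweet_text(example))
```

A regex-only approach like this ignores the entity indices the API provides, so it can misfire on unusual text (for example, punctuation glued to a mention is left behind). Using the API's entities is likely more reliable.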

@llrs
Collaborator Author

llrs commented Oct 4, 2022

I'm closing the issue, but any feedback is still appreciated.

@llrs llrs closed this as completed Oct 4, 2022
@llrs
Collaborator Author

llrs commented Dec 9, 2022

In the latest version in devel (1.0.2.9014) there are some helpers to extract these. After installing it, check ?helpers. @meier-flo, please let me know whether they help or whether I should better document the output of each helper.
