index.json
[{"authors":["admin"],"categories":null,"content":"My name is François de Ryckel. I have grown up in Belgium then I emigrated from there to finish up my study and start working. I lived in several places over the last 20 years. I\u0026rsquo;m a math / philosophy teacher with some stunts at business. I created 2 companies in Zambia: a fruits farm and a fresh produce trading company.\n Lived in Paris, France, for 2 years to complete my undergrad and start my master Lived in Freiburg \u0026amp; Leipzig, Germany, for 3 years to complete my master and do some teaching gigs at Alliance Française, Leipzig Universitat and Leipzig International School Lived in Dhaka, Bangladesh to teach at the International School Dhaka Lived in Zambia for 9 years to teach math, stats and philosophy at the American international School of Lusaka. I also started a citrus \u0026amp; mangoes farm (over 9,000 trees) and a produce (fresh fish, fruits, and meat) trading company I am currently living in Saudi Arabia as math teacher on the KAUST university Campus As a teacher, I\u0026rsquo;m always looking for good examples to incorporate in my practices.\nLately I am especially interested in machine learning applications in finance and education.\n","date":1554595200,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":1567641600,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"/author/francois-de-ryckel/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/author/francois-de-ryckel/","section":"authors","summary":"My name is François de Ryckel. I have grown up in Belgium then I emigrated from there to finish up my study and start working. I lived in several places over the last 20 years.","tags":null,"title":"François de Ryckel","type":"authors"},{"authors":["吳恩達"],"categories":null,"content":"吳恩達 is a professor of artificial intelligence at the Stanford AI Lab. His research interests include distributed robotics, mobile computing and programmable matter. He leads the Robotic Neurobiology group, which develops self-reconfiguring robots, systems of self-organizing robots, and mobile sensor networks.\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Sed neque elit, tristique placerat feugiat ac, facilisis vitae arcu. Proin eget egestas augue. Praesent ut sem nec arcu pellentesque aliquet. Duis dapibus diam vel metus tempus vulputate.\n","date":1461110400,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":1555459200,"objectID":"da99cb196019cc5857b9b3e950397ca9","permalink":"/author/%E5%90%B3%E6%81%A9%E9%81%94/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/author/%E5%90%B3%E6%81%A9%E9%81%94/","section":"authors","summary":"吳恩達 is a professor of artificial intelligence at the Stanford AI Lab. His research interests include distributed robotics, mobile computing and programmable matter. He leads the Robotic Neurobiology group, which develops self-reconfiguring robots, systems of self-organizing robots, and mobile sensor networks.","tags":null,"title":"吳恩達","type":"authors"},{"authors":null,"categories":null,"content":"Flexibility This feature can be used for publishing content such as:\n Online courses Project or software documentation Tutorials The courses folder may be renamed. 
For example, we can rename it to docs for software/project documentation or tutorials for creating an online course.\nDelete tutorials To remove these pages, delete the courses folder and see below to delete the associated menu link.\nUpdate site menu After renaming or deleting the courses folder, you may wish to update any [[main]] menu links to it by editing your menu configuration at config/_default/menus.toml.\nFor example, if you delete this folder, you can remove the following from your menu configuration:\n[[main]] name = \u0026quot;Courses\u0026quot; url = \u0026quot;courses/\u0026quot; weight = 50 Or, if you are creating a software documentation site, you can rename the courses folder to docs and update the associated Courses menu configuration to:\n[[main]] name = \u0026quot;Docs\u0026quot; url = \u0026quot;docs/\u0026quot; weight = 50 Update the docs menu If you use the docs layout, note that the name of the menu in the front matter should be in the form [menu.X] where X is the folder name. Hence, if you rename the courses/example/ folder, you should also rename the menu definitions in the front matter of files within courses/example/ from [menu.example] to [menu.\u0026lt;NewFolderName\u0026gt;].\n","date":1536451200,"expirydate":-62135596800,"kind":"section","lang":"en","lastmod":1536451200,"objectID":"59c3ce8e202293146a8a934d37a4070b","permalink":"/courses/example/","publishdate":"2018-09-09T00:00:00Z","relpermalink":"/courses/example/","section":"courses","summary":"Learn how to use Academic's docs layout for publishing online courses, software documentation, and tutorials.","tags":null,"title":"Overview","type":"docs"},{"authors":null,"categories":null,"content":"In this tutorial, I\u0026rsquo;ll share my top 10 tips for getting started with Academic:\nTip 1 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.\nNullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.\nCras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. 
Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.\nSuspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.\nAliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.\nTip 2 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.\nNullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.\nCras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.\nSuspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.\nAliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.\n","date":1557010800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1557010800,"objectID":"74533bae41439377bd30f645c4677a27","permalink":"/courses/example/example1/","publishdate":"2019-05-05T00:00:00+01:00","relpermalink":"/courses/example/example1/","section":"courses","summary":"In this tutorial, I\u0026rsquo;ll share my top 10 tips for getting started with Academic:\nTip 1 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum.","tags":null,"title":"Example Page 1","type":"docs"},{"authors":null,"categories":null,"content":"Here are some more tips for getting started with Academic:\nTip 3 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.\nNullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.\nCras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. 
Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.\nSuspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.\nAliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.\nTip 4 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.\nNullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.\nCras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.\nSuspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.\nAliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.\n","date":1557010800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1557010800,"objectID":"1c2b5a11257c768c90d5050637d77d6a","permalink":"/courses/example/example2/","publishdate":"2019-05-05T00:00:00+01:00","relpermalink":"/courses/example/example2/","section":"courses","summary":"Here are some more tips for getting started with Academic:\nTip 3 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum.","tags":null,"title":"Example Page 2","type":"docs"},{"authors":[],"categories":["R"],"content":" Introduction library(readr) # to read and write (import / export) any type into our R console. library(dplyr) # for pretty much all our data wrangling library(ggplot2) library(stringr) library(forcats) library(purrr) library(janitor) # to clear variable names with clean_names() Using glove embedding GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.1\nGloVe encodes the ratios of word-word co-occurrence probabilities, which is thought to represent some crude form of meaning associated with the abstract concept of the word, as vector difference. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.\nThe simple workflow for vectorizing tweet text into glove embeddings is as follows - ^/[https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/]\nTokenize incoming tweet texts in the training data. Download and parse glove embeddings into an embedding matrix for the tokenized words. Generate embeddings vector for tweets text in training data. Generate embeddings vector for tweets text in test data. Append to given tweets features and export. 
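For reference only (nothing below depends on it), the objective that GloVe minimizes in the original Pennington, Socher and Manning paper can be written, in the paper’s notation, as
$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$
where $X_{ij}$ counts how often word $j$ appears in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that damps very frequent co-occurrences. This is the formal version of the dot-product-equals-log-co-occurrence statement above; in this post we simply reuse the pre-trained vectors.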
We will not stem or lemmatize the tweets at first; this will keep most of the meaning in the word used.\nclean_tweets \u0026lt;- function(df){ df \u0026lt;- df %\u0026gt;% mutate(number_hashtag = str_count(string = text, pattern = \u0026quot;#\u0026quot;), number_number = str_count(string = text, pattern = \u0026quot;[0-9]\u0026quot;) %\u0026gt;% as.numeric(), number_http = str_count(string = text, pattern = \u0026quot;http\u0026quot;) %\u0026gt;% as.numeric(), number_mention = str_count(string = text, pattern = \u0026quot;@\u0026quot;) %\u0026gt;% as.numeric(), number_location = if_else(!is.na(location), 1, 0), number_keyword = if_else(!is.na(keyword), 1, 0), number_repeated_char = str_count(string = text, pattern = \u0026quot;([a-z])\\\\1{2}\u0026quot;) %\u0026gt;% as.numeric(), text = str_replace_all(string = text, pattern = \u0026quot;http[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;@[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), number_char = nchar(text), #add the length of the tweet in character. number_word = str_count(string = text, pattern = \u0026quot;\\\\w+\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;[0-9]\u0026quot;, replacement = \u0026quot;\u0026quot;), text = future_map(text, function(.x) stringi::stri_trans_general(.x, \u0026quot;Latin-ASCII\u0026quot;)) %\u0026gt;% unlist(.), text = str_replace_all(string = text, pattern = \u0026quot;\\u0089\u0026quot;, replacement = \u0026quot;\u0026quot;)) %\u0026gt;% select(-keyword, -location) return(df) } library(furrr) plan(\u0026quot;multicore\u0026quot;) df_train \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/train.csv\u0026quot;) %\u0026gt;% clean_tweets() # sorting out the same tweets, different target issues temp \u0026lt;- df_train %\u0026gt;% group_by(text) %\u0026gt;% mutate(mean_target = mean(target), new_target = if_else(mean_target \u0026gt; 0.5, 1, 0)) %\u0026gt;% ungroup() %\u0026gt;% mutate(target = new_target, target_bin = factor(if_else(target == 1, \u0026quot;a_truth\u0026quot;, \u0026quot;b_false\u0026quot;))) %\u0026gt;% select(-new_target, -mean_target, -target) df_train \u0026lt;- temp Using keras’ text_tokenizer to tokenize the text in tweets dataset.\nlibrary(keras) # we assign each word in the whole tweets df corpus an ID tokenizer \u0026lt;- text_tokenizer() %\u0026gt;% fit_text_tokenizer(df_train$text) # if we want to check how many different words were in the corpus. # we do +1 because we\u0026#39;re dealing with Python. num_words \u0026lt;- length(tokenizer$word_index) + 1 # Using the above fit tokenizer, one now convert all the text to an actual sequences of indices. sequences \u0026lt;- texts_to_sequences(tokenizer, df_train$text) ## how long is the longest tweet? 33 words! We can use that as the base for padding. summary(map_int(sequences, length)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.00 9.00 13.00 13.64 18.00 32.00 max_tweet_length \u0026lt;- max(map_int(sequences, length)) # now, we need to pad all other tweet to a length of 33. # by default we pad first, then put the text. padded_sequences \u0026lt;- pad_sequences(sequences = sequences, maxlen = max_tweet_length) # checking that we do have a 7613 tweets x 32 columns matrix. 
dim(padded_sequences) ## [1] 7613 32 Let’s have a look at the first 5 tweet were, their conversion into indices and their final padded form.\n# the first 5 tweets in words df_train$text[1:5] ## [1] \u0026quot;Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all\u0026quot; ## [2] \u0026quot;Forest fire near La Ronge Sask. Canada\u0026quot; ## [3] \u0026quot;All residents asked to \u0026#39;shelter in place\u0026#39; are being notified by officers. No other evacuation or shelter in place orders are expected\u0026quot; ## [4] \u0026quot;, people receive #wildfires evacuation orders in California\u0026quot; ## [5] \u0026quot;Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school\u0026quot; # the first 5 tweets in indices sequences[1:5] ## [[1]] ## [1] 113 4389 20 1 830 5 18 247 135 1562 4390 84 36 ## ## [[2]] ## [1] 184 42 215 764 6440 6441 1354 ## ## [[3]] ## [1] 36 1690 1563 4 6442 3 6443 20 128 6444 17 1691 35 419 241 ## [16] 53 2085 3 686 1355 20 1070 ## ## [[4]] ## [1] 58 4391 1447 241 1355 3 91 ## ## [[5]] ## [1] 30 92 1182 18 312 19 6445 2356 26 256 19 1447 6446 66 2 ## [16] 179 # And the first tweet with padding padded_sequences[1:5, ] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] ## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [3,] 0 0 0 0 0 0 0 0 0 0 36 1690 1563 4 ## [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] ## [1,] 0 0 0 0 0 113 4389 20 1 830 5 18 ## [2,] 0 0 0 0 0 0 0 0 0 0 0 184 ## [3,] 6442 3 6443 20 128 6444 17 1691 35 419 241 53 ## [4,] 0 0 0 0 0 0 0 0 0 0 0 58 ## [5,] 0 0 30 92 1182 18 312 19 6445 2356 26 256 ## [,27] [,28] [,29] [,30] [,31] [,32] ## [1,] 247 135 1562 4390 84 36 ## [2,] 42 215 764 6440 6441 1354 ## [3,] 2085 3 686 1355 20 1070 ## [4,] 4391 1447 241 1355 3 91 ## [5,] 19 1447 6446 66 2 179 ??????? A total of 22701 unique words were assigned an index in the tokenization.\nBorrowing the code from Aditya Mangal’s blog 2 for parsing and generating glove embedding matrix from my deepSentimentR package.\nparse_glove_embeddings \u0026lt;- function(file_path) { lines \u0026lt;- readLines(file_path) embeddings_index \u0026lt;- new.env(hash = TRUE, parent = emptyenv()) for (i in 1:length(lines)) { line \u0026lt;- lines[[i]] values \u0026lt;- strsplit(line, \u0026quot; \u0026quot;)[[1]] word \u0026lt;- values[[1]] embeddings_index[[word]] \u0026lt;- as.double(values[-1]) } cat(\u0026quot;Found\u0026quot;, length(embeddings_index), \u0026quot;word vectors.\\n\u0026quot;) return(embeddings_index) } generate_embedding_matrix \u0026lt;- function(word_index, embedding_dim, max_words, glove_file_path) { embeddings_index \u0026lt;- parse_glove_embeddings(glove_file_path) embedding_matrix \u0026lt;- array(0, c(max_words, embedding_dim)) for (word in names(word_index)) { index \u0026lt;- word_index[[word]] if (index \u0026lt; max_words) { embedding_vector \u0026lt;- embeddings_index[[word]] if (!is.null(embedding_vector)) { embedding_matrix[index+1,] \u0026lt;- embedding_vector } } } return(embedding_matrix) } The Glove project has a Twitter dataset trained on 2B tweets with 27B tokens. 
It comes with word vectors that are 25d, 50d, 100d or 200d.\nWe’ll try different variants and we’ll adjust depending on our results.\n# To pick the length of each word vector embedding_dim \u0026lt;- 25 #embedding_dim \u0026lt;- 50 # this operation is the crux of the whole numerization of our text. # we basically assign a word-vector for each word. We decided to go with a 50d dense vector. embedding_matrix \u0026lt;- generate_embedding_matrix(tokenizer$word_index, embedding_dim = 25, max_words = num_words, \u0026quot;~/glove/glove.twitter.27B.25d.txt\u0026quot;) ## Found 1193514 word vectors. #embedding_matrix \u0026lt;- generate_embedding_matrix(tokenizer$word_index, embedding_dim = 50, max_words = num_words, # \u0026quot;data_glove.twitter.27B/glove.twitter.27B.50d.txt\u0026quot;) #there were around 12,638 different words in all the tweets. We have changed all of these words into 50d vectors. # so now we should have a matrix of dimension 12638 by 50 dim(embedding_matrix) ## [1] 15093 25 #Let\u0026#39;s save that precious matrix for further use #write_rds(x = embedding_matrix, path = \u0026quot;data/embedding_matrix_50d.rds\u0026quot;) write_rds(x = embedding_matrix, path = \u0026quot;~/disaster_tweets/data/embedding_matrix_25d.rds\u0026quot;) Using the Keras modeling framework to generate embeddings for the given training data. We basically create a simple sequential model with one embedding layer, whose weights we will freeze based on our embedding matrix created above, and a flattening layer that will flatten the output into a 2D matrix of dimensions (7613, 32x25) for 25d word vectors and (7613, 32x50) for 50d word vectors.\nRemember the longest tweet had 32 words. Each word is a 50d vector, so in the end we want a 7613 x 1600 (that is, 32x50) matrix. For many tweets, that matrix is going to start with a bunch of zeros because of the padding. Remember the padding is at the start in our case.\nSo now we need to apply that embedding to each of the 7613 tweets. Keras will do that for us.\nembedding_matrix \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/embedding_matrix_25d.rds\u0026quot;) #embedding_matrix \u0026lt;- read_rds(\u0026quot;data/embedding_matrix_50d.rds\u0026quot;) model_embedding \u0026lt;- keras_model_sequential() %\u0026gt;% layer_embedding(input_dim = num_words, #number of total words in all of the tweets output_dim = embedding_dim, #the length of our embedding vectors (50d in this case) input_length = max_tweet_length, #the number of words of the longest tweet. All other tweets will be padded to have that length name = \u0026quot;embedding\u0026quot;) %\u0026gt;% layer_flatten(name = \u0026quot;flatten\u0026quot;) model_embedding %\u0026gt;% get_layer(name = \u0026quot;embedding\u0026quot;) %\u0026gt;% set_weights(list(embedding_matrix)) %\u0026gt;% freeze_weights() tweets_embedding \u0026lt;- model_embedding %\u0026gt;% predict(padded_sequences) So, let’s make sense of what is happening. Each tweet is now 800 variables long (32 words x 25d). The first tweet was: [1] “Our deed be the Reason of this # earthquake May ALLAH Forgive us all”. This tweet is 13 words long. So the last 325 variables should be filled, while the first 475 should be 0s. Let’s check that.\nstr(tweets_embedding) ## num [1:7613, 1:800] 0 0 0 0 0 0 0 0 0 0 ... # and part of the first tweet. 
tweets_embedding[1, 450:500] ## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ## [8] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ## [15] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ## [22] 0.000000 0.000000 0.000000 0.000000 0.000000 -0.420470 0.565260 ## [29] -0.033577 0.310190 0.189300 -0.645880 1.387600 -0.574840 -0.138960 ## [36] -0.390030 -0.169110 -0.073094 -5.702100 0.812640 -0.412840 -0.438670 ## [43] 0.361850 -0.344710 0.146530 0.076999 -1.275600 -0.631900 -0.635160 ## [50] -0.517290 -0.901670 We can now add this matrix to our initial df.\ndf_train_glove \u0026lt;- bind_cols(df_train, as_tibble(tweets_embedding, .name_repair = \u0026quot;unique\u0026quot;) %\u0026gt;% clean_names()) %\u0026gt;% clean_names() # and let\u0026#39;s save all this hard work! write_rds(x = df_train_glove, path = \u0026quot;~/disaster_tweets/data/train_glove_25d.rds\u0026quot;) #write_rds(x = df_train_glove, path = \u0026quot;data/train_glove_50d.rds\u0026quot;) Before we go on and model, we still need to process our test data.\n https://nlp.stanford.edu/projects/glove/↩︎\n https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/↩︎\n ","date":1591488000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1591504991,"objectID":"ec7d525205b3519ddb34a3b35465ddd7","permalink":"/post/disaster-tweets-part-iii/","publishdate":"2020-06-07T00:00:00Z","relpermalink":"/post/disaster-tweets-part-iii/","section":"post","summary":"Introduction library(readr) # to read and write (import / export) any type into our R console. library(dplyr) # for pretty much all our data wrangling library(ggplot2) library(stringr) library(forcats) library(purrr) library(janitor) # to clear variable names with clean_names() Using glove embedding GloVe is an unsupervised learning algorithm for obtaining vector representations for words.","tags":["Classification","kaggle","text2vec","NLP"],"title":"Disaster Tweets - Part iii","type":"post"},{"authors":[],"categories":["R"],"content":" In the second part of this NLP task, we will use Singular Value Decomposition to help us transform a sparse matrix (from the Document Term Matrix - dtm) into a dense matrix. Hence this is still very much a BOW approach. This approach, combined with xgboost, gave us the best results without using word-embedding (or word-vectors) techniques. That said, we are not sure how this approach would work in production, as it seems we would have to constantly regenerate the dense matrix (which is quite computationally intense). We would love to see / hear from others on how to use svd in this type of task.\nIn a sense, SVD can be seen as a dimensionality reduction technique: going from a very wide sparse matrix (as many columns as there are different words in all the tweets) to a dense one.\nSo let’s first build that sparse matrix: on the rows, the document number (in this case the tweet ID); on the columns, the words (1 word per column).\nBecause the dimensionality reduction is based on the words, we need to use the whole dataset for this task. Of course, this is not really feasible when new, unseen cases come in.\nAlso, since we have already developed a whole cleaning workflow, let’s re-use it on the whole df.\nSetting up library(readr) # to read and write (import / export) any type into our R console. 
library(dplyr) # for pretty much all our data wrangling library(ggplot2) library(stringr) library(forcats) library(purrr) library(kableExtra) library(rsample) # to use initial_split() and some other resampling techniques later on. library(recipes) # to use the recipe() and step_() functions library(parsnip) # the main engine that run the models library(workflows) # to use workflow() library(tune) # to fine tune the hyperparameters library(dials) # to use grid_regular(), tune_grid(), penalty() library(yardstick) # to create the measure of accuracy, f1 score and ROC-AUC library(doParallel) #to parallelize the work - useful in tune() library(tidytext) library(textrecipes) We’ll be reusing the same clean_tweets() function we have used on part I to clean the tweets. We just copy-paste it here and repurpose it.\ndf_train \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/train.csv\u0026quot;) %\u0026gt;% as_tibble() %\u0026gt;% select(id, text, keyword, location) df_test \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/test.csv\u0026quot;) %\u0026gt;% as_tibble() %\u0026gt;% select(id, text, keyword, location) df_all \u0026lt;- bind_rows(df_train, df_test) clean_tweets \u0026lt;- function(df){ df \u0026lt;- df %\u0026gt;% mutate(number_hashtag = str_count(string = text, pattern = \u0026quot;#\u0026quot;), number_number = str_count(string = text, pattern = \u0026quot;[0-9]\u0026quot;) %\u0026gt;% as.numeric(), number_http = str_count(string = text, pattern = \u0026quot;http\u0026quot;) %\u0026gt;% as.numeric(), number_mention = str_count(string = text, pattern = \u0026quot;@\u0026quot;) %\u0026gt;% as.numeric(), number_location = if_else(!is.na(location), 1, 0), number_keyword = if_else(!is.na(keyword), 1, 0), number_repeated_char = str_count(string = text, pattern = \u0026quot;([a-z])\\\\1{2}\u0026quot;) %\u0026gt;% as.numeric(), text = str_replace_all(string = text, pattern = \u0026quot;http[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;@[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), number_char = nchar(text), #add the length of the tweet in character. number_word = str_count(string = text, pattern = \u0026quot;\\\\w+\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;[0-9]\u0026quot;, replacement = \u0026quot;\u0026quot;), text = map(text, textstem::lemmatize_strings) %\u0026gt;% unlist(.), text = map(text, function(.x) stringi::stri_trans_general(.x, \u0026quot;Latin-ASCII\u0026quot;)) %\u0026gt;% unlist(.), text = str_replace_all(string = text, pattern = \u0026quot;\\u0089\u0026quot;, replacement = \u0026quot;\u0026quot;)) %\u0026gt;% select(-keyword, -location) return(df) } df_all \u0026lt;- clean_tweets(df_all) Finding the SVD matrix Let’s now works on our sparse matrix with the bind_tf_idf() functions. First, we’ll need to tokenize the tweets and remove stop-words. To be able to use the tf_idf, we’ll also need to count the occurrence of each word in each tweet.\ndf_all_tok \u0026lt;- df_all %\u0026gt;% unnest_tokens(word, text) %\u0026gt;% anti_join(stop_words %\u0026gt;% filter(lexicon == \u0026quot;snowball\u0026quot;)) %\u0026gt;% mutate(word_stem = textstem::stem_words(word)) %\u0026gt;% count(id, word_stem) df_all_tf_idf \u0026lt;- df_all_tok %\u0026gt;% bind_tf_idf(term = word_stem, document = id, n = n) # turning the tf_idf into a matrix. 
dtm_df_all \u0026lt;- cast_dtm(term = word_stem, document = id, value = tf_idf, data = df_all_tf_idf) mat_df_all \u0026lt;- as.matrix(dtm_df_all) dim(mat_df_all) ## [1] 10873 13802 length(unique(df_all$id)) ## [1] 10876 # I have a problem! Some tweets have not made it to our matrix. # That\u0026#39;s probably because there were just a link, or just a number or just stop words. # which one are those links. This is also why I have hanged the corpus of stop-words. # so 3 tweets have not made it at all if we consider both training and testing set. Let’s have a look at our sparse matrix to better understand what’s going on.\nmat_df_all[1:10, 1:20] ## Terms ## Docs car crash happen just terribl allah deed ## 0 0.8580183 0.7837519 0.9588457 0.6432791 1.330996 0.0000000 0.000000 ## 1 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.9851632 1.228699 ## 2 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## 3 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## 4 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## 5 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## 6 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## 7 0.0000000 0.0000000 0.0000000 0.3216396 0.000000 0.0000000 0.000000 ## 8 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## 9 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 ## Terms ## Docs earthquak forgiv mai reason u citi differ ## 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 1 0.7270493 0.9851632 0.6197444 0.7839108 0.4343156 0.0000000 0.000000 ## 2 0.7270493 0.0000000 0.0000000 0.0000000 0.0000000 0.6864859 0.873712 ## 3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 4 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 6 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 7 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 8 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## 9 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## Terms ## Docs everyon hear safe stai across fire ## 0 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 ## 1 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 ## 2 0.7207918 0.6628682 0.873712 0.7776986 0.0000000 0.0000000 ## 3 0.0000000 0.0000000 0.000000 0.0000000 0.6664668 0.3448581 ## 4 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.4433889 ## 5 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 ## 6 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 ## 7 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 ## 8 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.2586435 ## 9 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 The values in the matrix are not the frequency but their tf_idf.\nLet’s now fix the issues of the missing tweets or we will have some issues later on during the modeling workflow. We see that the matrix is ordered by ID\n# Let\u0026#39;s identify which tweets didn\u0026#39;t make it into our df3 and save them. df_mat_rowname \u0026lt;- tibble(id = as.numeric(rownames(mat_df_all))) df_rowname \u0026lt;- tibble(id = df_all$id) missing_id \u0026lt;- df_rowname %\u0026gt;% anti_join(df_mat_rowname) # Let\u0026#39;s add empty rows with the right id as rowname to our matrix. 
yo \u0026lt;- matrix(0.0, nrow = nrow(missing_id), ncol = ncol(mat_df_all)) rownames(yo) \u0026lt;- missing_id$id mat_df \u0026lt;- rbind(mat_df_all, yo) dim(mat_df) ## [1] 10876 13802 #mat_df3[7601:7613, 11290:11302] ### trying to keep track of the order of the matrix mat_df_id \u0026lt;- rownames(mat_df) head(mat_df_id, 20) ## [1] \u0026quot;0\u0026quot; \u0026quot;1\u0026quot; \u0026quot;2\u0026quot; \u0026quot;3\u0026quot; \u0026quot;4\u0026quot; \u0026quot;5\u0026quot; \u0026quot;6\u0026quot; \u0026quot;7\u0026quot; \u0026quot;8\u0026quot; \u0026quot;9\u0026quot; \u0026quot;10\u0026quot; \u0026quot;11\u0026quot; \u0026quot;12\u0026quot; \u0026quot;13\u0026quot; \u0026quot;14\u0026quot; ## [16] \u0026quot;15\u0026quot; \u0026quot;16\u0026quot; \u0026quot;17\u0026quot; \u0026quot;18\u0026quot; \u0026quot;19\u0026quot; tail(mat_df_id, 20) ## [1] \u0026quot;10859\u0026quot; \u0026quot;10860\u0026quot; \u0026quot;10861\u0026quot; \u0026quot;10862\u0026quot; \u0026quot;10863\u0026quot; \u0026quot;10864\u0026quot; \u0026quot;10865\u0026quot; \u0026quot;10866\u0026quot; \u0026quot;10867\u0026quot; ## [10] \u0026quot;10868\u0026quot; \u0026quot;10869\u0026quot; \u0026quot;10870\u0026quot; \u0026quot;10871\u0026quot; \u0026quot;10872\u0026quot; \u0026quot;10873\u0026quot; \u0026quot;10874\u0026quot; \u0026quot;10875\u0026quot; \u0026quot;6394\u0026quot; ## [19] \u0026quot;9697\u0026quot; \u0026quot;43\u0026quot; Now that we solved that issue of missing rows (which took almost a all day to figure out), we can move to finding the dense matrix. We will use the irlba library to help with the decomposition.\nincomplete.cases \u0026lt;- which(!complete.cases(mat_df)) mat_df[incomplete.cases,] \u0026lt;- rep(0.0, ncol(mat_df)) dim(mat_df) ## [1] 10876 13802 svd_mat \u0026lt;- irlba::irlba(t(mat_df), nv = 750, maxit = 2000) write_rds(x = svd_mat, path = \u0026quot;~/disaster_tweets/data/svd.rds\u0026quot;) # And then to save it the whole df with ID + svd svd_mat \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/svd.rds\u0026quot;) yo \u0026lt;- as_tibble(svd_mat$v) dim(yo) ## [1] 10876 750 df4 \u0026lt;- bind_cols(id = as.numeric(mat_df_id), yo) write_rds(x = df4, path = \u0026quot;~/disaster_tweets/data/svd_df_all750.rds\u0026quot;) It is worth mentioning that singular value decomposition didn’t parallelized on my machine and it took a bit over 3hrs to get the matrix. That’s why we have saved it for further used. 
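To spell out what the irlba call gives us (just a sketch of the linear algebra, written in standard SVD notation): irlba::irlba() computes a truncated SVD, so with A = t(mat_df), a 13802 x 10876 matrix (terms in rows, tweets in columns), and k = 750 we get
$$ A \approx U_k \Sigma_k V_k^{\top} $$
where $U_k$ (13802 x 750) holds the term-side factors, $\Sigma_k$ holds the 750 largest singular values (returned as svd_mat$d), and $V_k$ (10876 x 750) holds the document-side factors (returned as svd_mat$v). Each row of $V_k$ is therefore a dense 750-dimensional representation of one tweet, which is exactly what we bound back to the tweet ids in df4 above.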
[When I used irlba on our university computer (84 cores, over 750 Gb of RAM), it did parallelized very nicely on all core and it didn’t take more than 5 min.]\nNow that we have our dense matrix, we can start to fit back all the pieces together for our modelling process.\ndf_train \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/train.csv\u0026quot;) %\u0026gt;% clean_tweets() # sorting out the same tweets, different target issues temp \u0026lt;- df_train %\u0026gt;% group_by(text) %\u0026gt;% mutate(mean_target = mean(target), new_target = if_else(mean_target \u0026gt; 0.5, 1, 0)) %\u0026gt;% ungroup() %\u0026gt;% mutate(target = new_target, target_bin = factor(if_else(target == 1, \u0026quot;a_truth\u0026quot;, \u0026quot;b_false\u0026quot;))) %\u0026gt;% select(-new_target, -mean_target, -target) df_svd \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/svd_df_all750.rds\u0026quot;) df_train \u0026lt;- left_join(temp, df_svd, by = \u0026quot;id\u0026quot;) %\u0026gt;% select(-text) SVD with Lasso set.seed(0109) rsplit_df \u0026lt;- initial_split(df_train, strata = target_bin, prop = 0.85) df_train_tr \u0026lt;- training(rsplit_df) df_train_te \u0026lt;- testing(rsplit_df) # reusing the same df_train, df_train_tr, df_train_te from before. recipe_tweet \u0026lt;- recipe(formula = target_bin ~ ., data = df_train_tr) %\u0026gt;% update_role(id, new_role = \u0026quot;ID\u0026quot;) %\u0026gt;% step_zv(all_numeric(), -all_outcomes()) %\u0026gt;% step_normalize(all_numeric()) # we \u0026#39;ll assign 40 different values for our penalty. # we noticed earlier that best values are between penalties 0.001 and 0.005 grid_lambda \u0026lt;- expand.grid(penalty = seq(0.0014,0.005, length = 45)) # This time we\u0026#39;ll use 10 folds cross-validation set.seed(0109) folds_training \u0026lt;- vfold_cv(df_train, v = 10, repeats = 1) model_lasso \u0026lt;- logistic_reg(mode = \u0026quot;classification\u0026quot;, penalty = tune(), mixture = 1) %\u0026gt;% set_engine(\u0026quot;glmnet\u0026quot;) # starting our worflow wf_lasso \u0026lt;- workflow() %\u0026gt;% add_recipe(recipe_tweet) %\u0026gt;% add_model(model_lasso) library(doParallel) registerDoParallel(cores = 64) # run a lasso regression with cross-validation, on 40 different levels of penalty tune_lasso \u0026lt;- tune_grid( wf_lasso, resamples = folds_training, grid = grid_lambda, metrics = metric_set(roc_auc, f_meas, accuracy), control = control_grid(verbose = TRUE) ) tune_lasso %\u0026gt;% collect_metrics() %\u0026gt;% write_csv(\u0026quot;~/disaster_tweets/data/metrics_lasso_svd750.csv\u0026quot;) best_metric \u0026lt;- tune_lasso %\u0026gt;% select_best(\u0026quot;f_meas\u0026quot;) wf_lasso \u0026lt;- finalize_workflow(wf_lasso, best_metric) last_fit(wf_lasso, rsplit_df) %\u0026gt;% collect_metrics() ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## \u0026lt;chr\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; ## 1 accuracy binary 0.798 ## 2 roc_auc binary 0.860 #save the final lasso model model_lasso_svd \u0026lt;- fit(wf_lasso, df_train) write_rds(x = model_lasso_svd, path = \u0026quot;~/disaster_tweets/data/model_lasso_svd750.rds\u0026quot;) Note 1 Lasso: svd with 1000L, normalize all, penalty 0.001681, scores: f1=73.99, acc =79.3, roc=85.4\nAnalysis of grid results # we read the results of our sample to see the penalty values and their performances. 
metrics \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/metrics_lasso_svd750.csv\u0026quot;) metrics %\u0026gt;% ggplot(aes(x = penalty, y = mean, color = .metric)) + geom_line() + facet_wrap(~.metric) + scale_x_log10() Make predictions df_test \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/test.csv\u0026quot;) %\u0026gt;% clean_tweets() df_svd \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/svd_df_all750.rds\u0026quot;) df_test \u0026lt;- left_join(df_test, df_svd, by = \u0026quot;id\u0026quot;) library(glmnet) prediction_lasso_svd \u0026lt;- tibble(id = df_test$id, target = if_else(predict(model_lasso_svd, new_data = df_test) == \u0026quot;a_truth\u0026quot;, 1, 0)) prediction_lasso_svd %\u0026gt;% write_csv(path = \u0026quot;~/disaster_tweets/data/prediction_svd_lasso750.csv\u0026quot;) # clean everything rm(list = ls()) On the training set with cross-validation, this model with a penalty of 0.001681, gave us f1 = 73.99, accuracy = 79.3, roc = 85.4. On Kaggle, this model gave us a public score of 76.79. This is not really good considering we got much better results earlier with our enhanced approach\n SVD with Xgboost We can use the same idea with xgboost.\nclean_tweets \u0026lt;- function(df){ df \u0026lt;- df %\u0026gt;% mutate(number_hashtag = str_count(string = text, pattern = \u0026quot;#\u0026quot;), number_number = str_count(string = text, pattern = \u0026quot;[0-9]\u0026quot;) %\u0026gt;% as.numeric(), number_http = str_count(string = text, pattern = \u0026quot;http\u0026quot;) %\u0026gt;% as.numeric(), number_mention = str_count(string = text, pattern = \u0026quot;@\u0026quot;) %\u0026gt;% as.numeric(), number_location = if_else(!is.na(location), 1, 0), number_keyword = if_else(!is.na(keyword), 1, 0), number_repeated_char = str_count(string = text, pattern = \u0026quot;([a-z])\\\\1{2}\u0026quot;) %\u0026gt;% as.numeric(), text = str_replace_all(string = text, pattern = \u0026quot;http[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;@[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), number_char = nchar(text), #add the length of the tweet in character. 
number_word = str_count(string = text, pattern = \u0026quot;\\\\w+\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;[0-9]\u0026quot;, replacement = \u0026quot;\u0026quot;), text = map(text, textstem::lemmatize_strings) %\u0026gt;% unlist(.), text = map(text, function(.x) stringi::stri_trans_general(.x, \u0026quot;Latin-ASCII\u0026quot;)) %\u0026gt;% unlist(.), text = str_replace_all(string = text, pattern = \u0026quot;\\u0089\u0026quot;, replacement = \u0026quot;\u0026quot;)) %\u0026gt;% select(-keyword, -location) return(df) } df_train \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/train.csv\u0026quot;) %\u0026gt;% clean_tweets() # sorting out the same tweets, different target issues temp \u0026lt;- df_train %\u0026gt;% group_by(text) %\u0026gt;% mutate(mean_target = mean(target), new_target = if_else(mean_target \u0026gt; 0.5, 1, 0)) %\u0026gt;% ungroup() %\u0026gt;% mutate(target = new_target, target_bin = factor(if_else(target == 1, \u0026quot;a_truth\u0026quot;, \u0026quot;b_false\u0026quot;))) %\u0026gt;% select(-new_target, -mean_target, -target) df_svd \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/svd_df_all750.rds\u0026quot;) df_train \u0026lt;- left_join(temp, df_svd, by = \u0026quot;id\u0026quot;) %\u0026gt;% select(-text) recipe_tweet \u0026lt;- recipe(formula = target_bin ~ ., data = df_train) %\u0026gt;% update_role(id, new_role = \u0026quot;ID\u0026quot;) # xgboost classification, tuning on trees, tree-depth and mtry model_xgboost \u0026lt;- boost_tree(mode = \u0026quot;classification\u0026quot;, trees = tune(), learn_rate = 0.01, tree_depth = tune(), mtry = tune()) %\u0026gt;% set_engine(\u0026quot;xgboost\u0026quot;, nthread = 64) # starting our workflow wf_xgboost \u0026lt;- workflow() %\u0026gt;% add_recipe(recipe_tweet) %\u0026gt;% add_model(model_xgboost) # This time we use 5 folds cross-validation. # xgboost is extremely resource intensive on wide df. set.seed(0109) folds_training \u0026lt;- vfold_cv(df_train, v = 5, repeats = 1) grid_xgboost \u0026lt;- expand.grid(trees = c(2000), tree_depth = c(5, 6), mtry = c(150, 300)) library(doParallel) registerDoParallel(cores = 64) # run a xgboost classification with cross-validation tune_xgboost \u0026lt;- tune_grid( wf_xgboost, resamples = folds_training, grid = grid_xgboost, metrics = metric_set(roc_auc, f_meas, accuracy), control = control_grid(verbose = TRUE, save_pred = TRUE) ) tune_xgboost %\u0026gt;% collect_metrics() %\u0026gt;% write_csv(\u0026quot;~/disaster_tweets/data/metrics_xgboost_svd750.csv\u0026quot;) best_metric \u0026lt;- tune_xgboost %\u0026gt;% select_best(\u0026quot;f_meas\u0026quot;) wf_xgboost \u0026lt;- finalize_workflow(wf_xgboost, best_metric) last_fit(wf_xgboost, rsplit_df) %\u0026gt;% collect_metrics() ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## \u0026lt;chr\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; ## 1 accuracy binary 0.825 ## 2 roc_auc binary 0.883 #save the final lasso model model_xgboost_svd \u0026lt;- fit(wf_xgboost, df_train) write_rds(x = model_xgboost_svd, path = \u0026quot;~/disaster_tweets/data/model_xgboost_svd750.rds\u0026quot;) Using xgboost in combination with svd gives much better results. 
Here are a few things that we have tried with our training data:\n svd 1000 wide matrix and xgboost with 150 mtry, 2500 trees, 5 tree-depth, gave us f1 = 74.77, accuracy = 80.90, roc = 86.45\n svd 750 wide matrix and xgboost with 150 mtry, 2000 trees, 6 tree-depth, gave us f1 = 74.99, accuracy = 81.05, roc = 87 svd 500 wide matrix and xgboost with 200 mtry, 2000 trees, 6 tree-depth, gave us f1 = 75.11, accuracy = 81.02, roc = 86.87 svd 250 wide matrix and xgboost with 125 mtry, 1500 trees, 5 tree-depth, gave us f1 = 74.93, accuracy = 80.81, roc = 86.62 variable importance library(vip) model_xgboost_svd %\u0026gt;% pull_workflow_fit() %\u0026gt;% vip::vip(geom = \u0026quot;point\u0026quot;, num_features=20) #%\u0026gt;% arrange(desc(Importance)) %\u0026gt;% Clearly, we can no longer interpret our variables, as they are the result of a singular value decomposition of a tf-idf sparse matrix. However, we are happy to see that our extra variables have played a role in determining if a tweet was about a real disaster or not.\n Submission of results df_test \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/test.csv\u0026quot;) %\u0026gt;% clean_tweets() df_svd \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/svd_df_all750.rds\u0026quot;) df_test \u0026lt;- left_join(df_test, df_svd, by = \u0026quot;id\u0026quot;) library(xgboost) prediction_xgboost_svd \u0026lt;- tibble(id = df_test$id, target = if_else(predict(model_xgboost_svd, new_data = df_test) == \u0026quot;a_truth\u0026quot;, 1, 0)) prediction_xgboost_svd %\u0026gt;% write_csv(path = \u0026quot;~/disaster_tweets/data/prediction_svd_xgboost750.csv\u0026quot;) Note 1: majority voting, svd with 850 wide, using lasso, got a 77% public score.\nNote 2: majority voting, svd 500 wide, using xgboost with 200 mtry, 2000 trees, 6 tree-depth, got an 80.01 public score.\nNote 3: majority voting, svd with 750 wide, using xgboost with 200 mtry, 2000 trees, 6 tree-depth, got an 81.29% public score. Yeahhh!!!!!!!\nHere is a screenshot of our results:\n References To help with the use of irlba and check for the complete matrix ","date":1590451200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1591365898,"objectID":"aa92bdbd26ce329fe1408ffc3b724ff9","permalink":"/post/disaster-tweets-part-ii/","publishdate":"2020-05-26T00:00:00Z","relpermalink":"/post/disaster-tweets-part-ii/","section":"post","summary":"In the second part of this NLP task, we will use Singular Value Decomposition to help us transform a sparse matrix (from the Document Term Matrix - dtm) into a dense matrix.","tags":["Classification","SVD","tidymodels","kaggle"],"title":"Disaster Tweets - Part II","type":"post"},{"authors":[],"categories":["R"],"content":" Introduction Baseline model - Lasso model on just text Creating a model workflow Analysis of results Picking the best model variable importance Submission of results Baseline with some additional features Rebuilding the data frame and variables Creating and tuning a model Variable importances Submission of results Wonderings and lessons learned. References Introduction Real or Not? NLP with Disaster Tweets Predict which Tweets are about real disasters and which ones are not.\nThe task comes from a Kaggle competition: detect whether a tweet about an emergency disaster is real. Hence, this is an NLP classification problem.\nIt is kind of easy for a human to see if a tweet is real or not, but it is harder for a machine to detect it. For instance, take the tweet “look at the sky last night, it was ABLAZE”. 
Although it uses a disaster keyword like “ablaze”, the word in this context wasn’t meant to refer to an emergency disaster. This task is seen as “a getting started” problem by Kaggle.\nAs I have been a volunteer firefighter in my local community for the last 3 years, this Kaggle task struck a chord with me. And yes, that is me in the picture. Imagine this heavy, well insulated PPE, a super intense physical challenge, and then the Saudi heat with the humidity of the Red Sea ;-)\nI am planning a 3-part post.\n The first part is very much a BOW (bag of words) approach using Lasso. The second part is still a BOW approach, using SVD and modelling with Lasso and Xgboost. The third part is word embedding using GloVe. (Still trying to make it work with Bert pre-trained models. Maybe I’ll have that sorted out by the end. ) Throughout these posts, I will use packages from 3 main sets: the tidyverse for data wrangling, the tidymodels for modelling and the tidytext for dealing with text data. These sets of packages make a coherent whole and, in my opinion, make it easier to learn the data analysis \u0026amp; modelling workflow. It is, of course, not the only one; there are many other alternatives in R.\nLoading the libraries first.\nlibrary(readr) # to read and write (import / export) any type into our R console. library(dplyr) # for pretty much all our data wrangling library(stringr) # to deal with strings. this is an NLP task, so lots of it ;-) library(purrr) # to map functions over rows library(forcats) # to deal with categorical variables: the fct_reorder() function library(stringr) # to use str_remove() and many other regex functions later library(ggplot2) # to plot library(kableExtra) # for making pretty tables in html library(rsample) # to split df with initial_split() # to use resampling techniques with bootstrap() and vfold_cv() library(parsnip) # the main engine that runs the models library(recipes) # to use the recipe() functions library(textrecipes) # to use the step_tokenize() and step_tfidf() library(workflows) # to use workflow() library(tune) # to fine tune the hyper-parameters using tune() library(dials) # to create grid of parameters using grid_regular(), tune_grid(), penalty() library(yardstick) # to create the measure of accuracy, f1 score and ROC-AUC library(glmnet) # to use lasso, it is called automatically when calling set_engine() # but it isn\u0026#39;t called later on when using predict() library(vip) # tidy framework to check variable importance Without further ado, let’s get started by loading our training set and checking its structure.\n# loading our training data df_train \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/train.csv\u0026quot;) %\u0026gt;% as_tibble() # let\u0026#39;s have a look at it skimr::skim(df_train) Table 1: Data summary Name df_train Number of rows 7613 Number of columns 5 _______________________ Column type frequency: character 3 numeric 2 ________________________ Group variables None Variable type: character\n skim_variable n_missing complete_rate min max empty n_unique whitespace keyword 61 0.99 4 21 0 221 0 location 2534 0.67 1 49 0 3279 0 text 0 1.00 7 157 0 7503 0 Variable type: numeric\n skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist id 0 1 5441.93 3137.12 1 2734 5408 8146 10873 ▇▇▇▇▇ target 0 1 0.43 0.50 0 0 0 1 1 ▇▁▁▁▆ At first look: 7613 observations, 5 variables (one target + 4 predictors). Many missing values for the location variable and a few missing on the keyword variable as well. 
The Text variable (which are the tweets themselves) has 0 missing values. Notice, we have an ID variable (don’t think that it has any use).\nlet’s just have a look at 10 tweets and the target column will tell us if the tweet is being considered as one about a real emergency disaster.\n Table 2: 10 random tweets target text 0 First night with retainers in. It’s quite weird. Better get used to it; I have to wear them every single night for the next year at least. 1 Deputies: Man shot before Brighton home set ablaze http://t.co/gWNRhMSO8k 1 Man wife get six years jail for setting ablaze niece http://t.co/eV1ahOUCZA 0 SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintendent Lanford Salmon has r … - http://t.co/vplR5Hka2u http://t.co/SxHW2TNNLf 1 Police: Arsonist Deliberately Set Black Church In North CarolinaåÊAblaze http://t.co/pcXarbH9An 0 Noches El-Bestia ‘@Alexis_Sanchez: happy to see my teammates and training hard ?? goodnight gunners.?????? http://t.co/uc4j4jHvGR’ 1 #Kurds trampling on Turkmen flag later set it ablaze while others vandalized offices of Turkmen Front in #Diyala http://t.co/4IzFdYC3cg 1 TRUCK ABLAZE : R21. VOORTREKKER AVE. OUTSIDE OR TAMBO INTL. CARGO SECTION. http://t.co/8kscqKfKkF 0 Set our hearts ablaze and every city was a gift And every skyline was like a kiss upon the lips @Û_ https://t.co/cYoMPZ1A0Z 0 They sky was ablaze tonight in Los Angeles. I’m expecting IG and FB to be filled with sunset shots if I know my peeps!! 1 How the West was burned: Thousands of wildfires ablaze in #California alone http://t.co/iCSjGZ9tE1 #climate #energy http://t.co/9FxmN0l0Bd Because this is a classification problem, we need to make our target variable a factor.\nAlhtough, we will just use our train dataframe for modeling, we’ll still split it to get a testing set from it (which we will test our models on).\n# The target variable should be a factor as this is classification problem. df_train \u0026lt;- df_train %\u0026gt;% mutate(target_bin = factor(if_else(target == 1, \u0026quot;a_truth\u0026quot;, \u0026quot;b_false\u0026quot;))) %\u0026gt;% select(-target) # Just checking how balanced is our data. It seems well balanced. prop.table(table(df_train$target_bin)) ## ## a_truth b_false ## 0.4296598 0.5703402 # initial split with strata will keep the same proportion of target variable as in the original df. set.seed(0109) rsplit_df \u0026lt;- initial_split(df_train, strata = target_bin, prop = 0.85) # If we use cross-validation, we do not normally really need to do this. # we still check our accuracy on that set of data (our unseen data) . df_train_tr \u0026lt;- training(rsplit_df) # and just checking again about the ratio of target variable prop.table(table(df_train_tr$target_bin)) # same as original set. ## ## a_truth b_false ## 0.4296972 0.5703028 df_train_te \u0026lt;- testing(rsplit_df) The initial_split() function gives a rsplit object (rsplit_df in our case) that can be used with the training() and testing() functions to extract the data in each split. The strata argument “help ensure that the number of data points in the training data is equivalent to the proportions in the original data set.”\nA good thing to notice is a well-balanced data set with a 57% - 43% in the occurrence of the outcomes (0, 1). So we won’t need to add more/remove data in our set to over-compensate. That’s one less problem that we have to deal with.\n Baseline model - Lasso model on just text In this post, we skip some of the usual data exploration (wordcloud are pretty, but are there really useful?) 
and start straight into building a model. This very first model will be our base case. We will build a Lasso classification model based just on a cleaner version of the text of the tweets.\nFor a Lasso modelling task, we can only use numerical values, and we will need to normalize them. Also, we cannot include missing values, so we remove the columns that contain them. So basically, we just use the text data as the predictor. We’ll numerize that text column using tf_idf. For more on transforming text into tf_idf, you can check this section of the David Robinson \u0026amp; Julia Silge book on Tidy Text Mining. Most of the ideas here come from her book and blog.\nTo clean our tweets, we will use the recipes and textrecipes packages. That way, the exact same steps can later be done more easily on the testing set.\nIn order, we’ll tokenize the tweets (at the same time, that will remove punctuation and lowercase all text), remove stop words, and keep only the first 1250 tokens. In the last step, we’ll transform our words into numerical values by converting them to tf_idf values.\nlibrary(tidytext) library(textrecipes) recipe_tweet \u0026lt;- recipe(formula = target_bin ~ text + id, data = df_train_tr) %\u0026gt;% update_role(id, new_role = \u0026quot;ID\u0026quot;) %\u0026gt;% step_tokenize(text) %\u0026gt;% # Tokenize the tweets into words step_stopwords(text) %\u0026gt;% # Filter off stopwords from the tokenlist variable step_tokenfilter(text, max_tokens = 1250) %\u0026gt;% # Only keep the 1250 most important words step_tfidf(text) %\u0026gt;% # transform each word into its tf_idf value step_normalize(all_numeric()) # normalizing the tf_idf values Once a recipe is written, we can check what it does to the original data frame by prepping and then juicing it.\n# checking on the prep() function. df_train_tr_processed \u0026lt;- recipe_tweet %\u0026gt;% prep() %\u0026gt;% juice() dim(df_train_tr_processed) ## [1] 6472 1252 Notice that we now have 1252 columns: the id column + the target variable + the 1250 tokens.\nCreating a model workflow With the new tidymodels API, you can now create a workflow for a model that can be reused later on.\nTo find the most appropriate penalty for this data set, we’ll bootstrap 25 samples for each penalty. We’ll do that using the rsample library.\nlibrary(doParallel) registerDoParallel(cores = 16) # let\u0026#39;s work on 16 cores. # defining our model. It is a logistic regression, using glmnet. # notice that the penalty is set to tune()... We\u0026#39;ll create a grid for that. # mixture = 1 means we are dealing with LASSO. model_lasso \u0026lt;- logistic_reg(mode = \u0026quot;classification\u0026quot;, penalty = tune(), mixture = 1) %\u0026gt;% set_engine(\u0026quot;glmnet\u0026quot;) # since we are tuning the penalty parameter, let\u0026#39;s create a grid of possible values # we\u0026#39;ll assign 40 different values for our penalty. grid_lambda \u0026lt;- grid_regular(penalty(), levels = 40) # And we\u0026#39;ll test each penalty value on 25 bootstraps. ## So that\u0026#39;s about 1000 models to fit. folds_training \u0026lt;- bootstraps(df_train_tr, strata = target_bin, times = 25) # starting our workflow wf_lasso \u0026lt;- workflow() %\u0026gt;% add_recipe(recipe_tweet) %\u0026gt;% add_model(model_lasso) # the tune_grid() will fit our 1000 models using parallel processing # we are looking at 3 measures of validity: roc, f1 and accuracy. 
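# note: as far as I can tell, tune_grid() is the only step that picks up the doParallel backend registered above - the recipe prep and the later predictions run on a single core 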
tune_lasso \u0026lt;- tune_grid( wf_lasso, resamples = folds_training, grid = grid_lambda, metrics = metric_set(roc_auc, f_meas, accuracy), control = control_grid(verbose = TRUE) ) I believe that the tune_grid() function is the only place in our modelling workflow where parallel processing is being used. All the other tasks are single core processing.\nWhat have we done? First, we have created a model workflow, with workflow(), using a recipe for pre-processing and a model type, logistic_reg(). Second, we fine-tuned the parameters of our model. In this case, we only fine-tuned the penalty parameter.\n Analysis of results # we collect the results of our sample to see what penalty value we will pick. metrics \u0026lt;- tune_lasso %\u0026gt;% collect_metrics() # save the metric for later use metrics %\u0026gt;% write_csv(\u0026quot;~/disaster_tweets/data/metrics_lasso_base.csv\u0026quot;) metrics %\u0026gt;% ggplot(aes(x = penalty, y = mean, color = .metric)) + geom_line() + facet_wrap(~.metric) + scale_x_log10() Using the plots, we can see that there is only a small window of penalty values that will increase the performance of of the model. We keep that in mind for the next time we create our grid of penalties.\nBecause our dataset is somehow balance, we could choose accuracy as a measure of model validity. The best accuracy on this base model is 76.2 % for the training set. That said because Kaggle choose F1 as a performance metric, let’s choose the penalty with highest f_meas and fit it to our testing set.\n# let check the penalties values that give the best performances metrics %\u0026gt;% group_by(.metric) %\u0026gt;% top_n(4, mean) %\u0026gt;% arrange(.metric, desc(mean)) %\u0026gt;% kable(\u0026quot;html\u0026quot;, caption = \u0026quot;Penalties with best performances\u0026quot;) %\u0026gt;% kable_styling(bootstrap_options = c(\u0026quot;striped\u0026quot;, \u0026quot;hoover\u0026quot;), full_width = F, position = \u0026quot;center\u0026quot;) Table 3: Penalties with best performances penalty .metric .estimator mean n std_err 0.0049239 accuracy binary 0.7633420 25 0.0012918 0.0088862 accuracy binary 0.7591922 25 0.0012877 0.0027283 accuracy binary 0.7584935 25 0.0011643 0.0015118 accuracy binary 0.7507548 25 0.0013591 0.0027283 f_meas binary 0.7020865 25 0.0017578 0.0015118 f_meas binary 0.6994592 25 0.0017849 0.0049239 f_meas binary 0.6977248 25 0.0019471 0.0008377 f_meas binary 0.6921877 25 0.0022374 0.0049239 roc_auc binary 0.8183548 25 0.0014059 0.0088862 roc_auc binary 0.8160248 25 0.0015715 0.0027283 roc_auc binary 0.8128313 25 0.0013520 0.0015118 roc_auc binary 0.8045593 25 0.0013837 Picking the best model We’ll pick the penalty that gives us the best F1 score. The finalize_workflow() functions will take the existing workflow and add to it the chosen parameters. 
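As an aside, instead of picking the single best penalty, the tune package also offers select_by_one_std_err() to pick the most regularized penalty whose score is within one standard error of the best one; a quick sketch (not used for the submission below, and assuming desc(penalty) is the right sorting so that the sparsest model comes first):\ntune_lasso %\u0026gt;% select_by_one_std_err(desc(penalty), metric = \u0026quot;f_meas\u0026quot;) # picks a sparser model with a statistically similar F1 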
Finally, we will save our model for later.\nbest_metric \u0026lt;- tune_lasso %\u0026gt;% select_best(\u0026quot;f_meas\u0026quot;) wf_lasso \u0026lt;- finalize_workflow(wf_lasso, best_metric) # to summarize, this is how our workflow looked like wf_lasso ## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: logistic_reg() ## ## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## 5 Recipe Steps ## ## ● step_tokenize() ## ● step_stopwords() ## ● step_tokenfilter() ## ● step_tfidf() ## ● step_normalize() ## ## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## Logistic Regression Model Specification (classification) ## ## Main Arguments: ## penalty = 0.00272833337648676 ## mixture = 1 ## ## Computational engine: glmnet # to check our model performance on the unseen data set, we use the last_fit() ## Notice how the last_fit() works on the rsplit object last_fit(wf_lasso, rsplit_df) %\u0026gt;% collect_metrics() ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## \u0026lt;chr\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; ## 1 accuracy binary 0.777 ## 2 roc_auc binary 0.831 # got 78% accuracy on unseen data (the df_test) with 82.9% ROC. # To save our model for later use, we first need to fit it, model_lasso_base \u0026lt;- fit(wf_lasso, df_train) write_rds(x = model_lasso_base, path = \u0026quot;~/disaster_tweets/data/model_lasso_base.rds\u0026quot;) variable importance We can also check for the important variables that determine if a tweet is about a real emergency or not.\nread_rds(\u0026quot;~/disaster_tweets/data/model_lasso_base.rds\u0026quot;) %\u0026gt;% pull_workflow_fit() %\u0026gt;% vi(lamda = best_metric$penalty) %\u0026gt;% group_by(Sign) %\u0026gt;% top_n(20, wt = abs(Importance)) %\u0026gt;% ungroup() %\u0026gt;% mutate(Importance = abs(Importance), Variable = str_remove(Variable, \u0026quot;tfidf_text_\u0026quot;), Variable = fct_reorder(Variable, Importance)) %\u0026gt;% ggplot(aes(x = Importance, y = Variable, fill = Sign)) + geom_col(show.legend = FALSE) + facet_wrap(~Sign, scales = \u0026quot;free_y\u0026quot;) + labs(y = NULL) For some reasons, I need to change the label of each graph. The “POS” terms will most likely give place to not a real emergency tweet.\nAlthough I am glad that the “lmao” expression is more often related to tweets that are not about real emergency, I am confused as to why “https” is one side and “t.co” on the other sides. They are both about links. 🤔\n Submission of results Let’s now apply our base model to our test set to create the prediction (target variable).\ntest \u0026lt;- read_csv(\u0026quot;~/disaster_tweets/data/test.csv\u0026quot;) prediction_lasso_base \u0026lt;- tibble(id = test$id, prediction = predict(read_rds(\u0026quot;~/disaster_tweets/data/model_lasso_base.rds\u0026quot;), new_data = test)) %\u0026gt;% mutate(target = if_else(prediction == \u0026quot;a_truth\u0026quot;, 1, 0)) prediction_lasso_base %\u0026gt;% select(id, target) %\u0026gt;% write_csv(path = \u0026quot;~/disaster_tweets/data/prediction_lasso_base.csv\u0026quot;) # this submission gave 77.8 % on the Kaggle public score. 
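One more check that can be useful before moving on is the full confusion matrix on the held-out split, rather than just accuracy and ROC-AUC; a minimal sketch, reusing the last_fit() call from above:\nlast_fit(wf_lasso, rsplit_df) %\u0026gt;% collect_predictions() %\u0026gt;% conf_mat(truth = target_bin, estimate = .pred_class) # shows where the false positives and the false negatives fall 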
Baseline with some additional features In this new model, we will do some feature engineering at a basic level and see if that helps to increase the model performance and especially its accuracy. We continue to use the same lasso model. That means everything has to be converted back to numerical variables.\nWe create another version of the basic df. Here are the feature engineering steps we will take:\nRebuilding the data frame and variables In the following step, there was a lot of trial and error regarding the order in which we performed the changes on the tweets.\nWe add the following variables:\n add a variable for the number of hashtags in a tweet (like #xxx) add a variable for the number of http links in a tweet (like http://xxxx) add a variable for the number of mentions in a tweet (like @xxxx) add a variable if the tweet contains a location add a variable if the tweet contains a keyword remove all mentions and links add a variable for the number of digits in a tweet add a variable for the number of characters in a tweet add a variable for the number of words in a tweet remove all the numbers in tweets On the text itself, we perform the following steps:\n remove all digits (although I suspect that 4-digit dates like 2015 or 2017 could have an influence) remove all mentions (who is being mentioned might add little value, and we already have a variable recording whether someone is mentioned) remove all http links. The link itself is not discriminatory (it does not add any information). We just recorded at the previous step whether we have a link, so we can delete it lemmatize all words remove all non-Latin letters\n count if there are multiple repeated characters in a row (like omggggg). My thinking is that real emergency tweets might use fewer of these. Because of all this cleaning, it is better to make it a function that we can apply to both the training set and, later on, the testing set.\nclean_tweets \u0026lt;- function(file_path){ df \u0026lt;- read_csv(file_path) %\u0026gt;% as_tibble() %\u0026gt;% mutate(number_hashtag = str_count(string = text, pattern = \u0026quot;#\u0026quot;), number_number = str_count(string = text, pattern = \u0026quot;[0-9]\u0026quot;) %\u0026gt;% as.numeric(), number_http = str_count(string = text, pattern = \u0026quot;http\u0026quot;) %\u0026gt;% as.numeric(), number_mention = str_count(string = text, pattern = \u0026quot;@\u0026quot;) %\u0026gt;% as.numeric(), number_location = if_else(!is.na(location), 1, 0), number_keyword = if_else(!is.na(keyword), 1, 0), number_repeated_char = str_count(string = text, pattern = \u0026quot;([a-z])\\\\1{2}\u0026quot;) %\u0026gt;% as.numeric(), text = str_replace_all(string = text, pattern = \u0026quot;http[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;@[^[:space:]]*\u0026quot;, replacement = \u0026quot;\u0026quot;), number_char = nchar(text), # add the length of the tweet in characters. 
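# note: number_char and number_word are computed after the links and mentions have been stripped above, but before the digits are removed and the text is lemmatized below - the ordering of these mutate() steps matters 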
number_word = str_count(string = text, pattern = \u0026quot;\\\\w+\u0026quot;), text = str_replace_all(string = text, pattern = \u0026quot;[0-9]\u0026quot;, replacement = \u0026quot;\u0026quot;), text = map(text, textstem::lemmatize_strings) %\u0026gt;% unlist(.), text = map(text, function(.x) stringi::stri_trans_general(.x, \u0026quot;Latin-ASCII\u0026quot;)) %\u0026gt;% unlist(.), text = str_replace_all(string = text, pattern = \u0026quot;\\u0089\u0026quot;, replacement = \u0026quot;\u0026quot;)) %\u0026gt;% select(-keyword, -location) return(df) } df_train \u0026lt;- clean_tweets(\u0026quot;~/disaster_tweets/data/train.csv\u0026quot;) A little more checking.\n# to help me see what other changes still have to be made yo \u0026lt;- df_train %\u0026gt;% select(id, text) %\u0026gt;% unnest_tokens(word, text) %\u0026gt;% anti_join(stop_words %\u0026gt;% filter(lexicon == \u0026quot;snowball\u0026quot;)) %\u0026gt;% count(word) %\u0026gt;% arrange(desc(n)) # I wanted to check what are the different stop-words dictionary # and see if it could make a difference. yo \u0026lt;- stop_words %\u0026gt;% group_by(lexicon) %\u0026gt;% summarize(n = n()) # just checking if everything is as expected skimr::skim(df_train) (#tab:skimr_lasso)Data summary Name df_train Number of rows 7613 Number of columns 12 _______________________ Column type frequency: character 1 numeric 11 ________________________ Group variables None Variable type: character\n skim_variable n_missing complete_rate min max empty n_unique whitespace text 0 1 4 157 0 6890 0 Variable type: numeric\n skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist id 0 1 5441.93 3137.12 1 2734 5408 8146 10873 ▇▇▇▇▇ target 0 1 0.43 0.50 0 0 0 1 1 ▇▁▁▁▆ number_hashtag 0 1 0.45 1.10 0 0 0 0 13 ▇▁▁▁▁ number_number 0 1 2.04 3.01 0 0 1 3 39 ▇▁▁▁▁ number_http 0 1 0.62 0.66 0 0 1 1 4 ▇▇▂▁▁ number_mention 0 1 0.36 0.72 0 0 0 1 8 ▇▁▁▁▁ number_location 0 1 0.67 0.47 0 0 1 1 1 ▃▁▁▁▇ number_keyword 0 1 0.99 0.09 0 1 1 1 1 ▁▁▁▁▇ number_repeated_char 0 1 0.02 0.17 0 0 0 0 5 ▇▁▁▁▁ number_char 0 1 83.20 32.32 5 59 84 112 157 ▂▆▇▇▂ number_word 0 1 14.24 6.11 1 10 14 18 34 ▃▇▆▃▁ Checking on the tweets, I can see that some digits are left. I am not sure how that happened (any suggestion welcome!)\nAlso, I notice that there are around 2000 words with n \u0026gt;= 6.\nSome tweets are in the dataset more than once, but … they do not have the same target value. Yes, that’s weird and it is due to bad encoding. So we’ll use a voting sytem, to make them equal.\nyo \u0026lt;- df_train %\u0026gt;% group_by(text) %\u0026gt;% mutate(mean_target = mean(target), new_target = if_else(mean_target \u0026gt; 0.5, 1, 0)) df_train \u0026lt;- yo %\u0026gt;% mutate(target = new_target, target_bin = factor(if_else(target == 1, \u0026quot;a_truth\u0026quot;, \u0026quot;b_false\u0026quot;))) %\u0026gt;% select(-new_target, -mean_target, -target) Now, we start our modeling workflow.\n Creating and tuning a model There are 2 things we’d like to try in this step. Would lemmatize or stemming work better? We’ll try both, but keep the one with best result. Also we try to see if doing a dimensionality reductions with PCA would help.\nAlso because we use tf_idf, should we normalize? 
Does this step add anything on our model?\nrecipe_tweet \u0026lt;- recipe(formula = target_bin ~ ., data = df_train) %\u0026gt;% update_role(id, new_role = \u0026quot;ID\u0026quot;) %\u0026gt;% step_normalize(contains(\u0026quot;number\u0026quot;), -id) %\u0026gt;% step_tokenize(text) %\u0026gt;% step_stopwords(text, stopword_source = \u0026quot;snowball\u0026quot;) %\u0026gt;% step_tokenfilter(text, max_tokens = 2500) %\u0026gt;% step_tfidf(text) %\u0026gt;% step_pca(contains(\u0026quot;tfidf\u0026quot;), threshold = 0.95) # to check how our df is now looking as it has been pre-processed. #df_train_processed \u0026lt;- recipe_tweet %\u0026gt;% prep() %\u0026gt;% juice() #dim(df_train_processed) registerDoParallel(cores = 16) # we \u0026#39;ll assign 40 different values for our penalty. # we noticed earlier that best values are between penalties 0.001 and 0.005 grid_lambda \u0026lt;- expand.grid(penalty = seq(0.0017,0.005, length = 40)) # This time we\u0026#39;ll use 10 folds cross-validation set.seed(0109) folds_training \u0026lt;- vfold_cv(df_train, v = 10, repeats = 2) model_lasso \u0026lt;- logistic_reg(mode = \u0026quot;classification\u0026quot;, penalty = tune(), mixture = 1) %\u0026gt;% set_engine(\u0026quot;glmnet\u0026quot;) # starting our worflow wf_lasso \u0026lt;- workflow() %\u0026gt;% add_recipe(recipe_tweet) %\u0026gt;% add_model(model_lasso) # run a lasso regression with bootstrap, on 40 different levels of penalty tune_lasso \u0026lt;- tune_grid( wf_lasso, resamples = folds_training, grid = grid_lambda, metrics = metric_set(roc_auc, f_meas, accuracy), control = control_grid(verbose = TRUE) ) tune_lasso %\u0026gt;% collect_metrics() %\u0026gt;% write_csv(\u0026quot;~/disaster_tweets/data/metrics_lasso_enhanced.csv\u0026quot;) tune_lasso %\u0026gt;% collect_metrics() %\u0026gt;% group_by(.metric) %\u0026gt;% top_n(4, mean) %\u0026gt;% arrange(.metric, desc(mean)) %\u0026gt;% kable() %\u0026gt;% kable_styling(bootstrap_options = c(\u0026quot;striped\u0026quot;, \u0026quot;hoover\u0026quot;), full_width = F, position = \u0026quot;center\u0026quot;) penalty .metric .estimator mean n std_err 0.0028000 accuracy binary 0.8003415 20 0.0036401 0.0028846 accuracy binary 0.8001444 20 0.0038903 0.0032231 accuracy binary 0.8001444 20 0.0037499 0.0031385 accuracy binary 0.8000787 20 0.0037480 0.0048308 f_meas binary 0.8343883 20 0.0030775 0.0049154 f_meas binary 0.8340468 20 0.0030167 0.0050000 f_meas binary 0.8340169 20 0.0030807 0.0032231 f_meas binary 0.8338193 20 0.0031357 0.0035615 roc_auc binary 0.8588161 20 0.0030108 0.0036462 roc_auc binary 0.8588096 20 0.0030060 0.0037308 roc_auc binary 0.8588078 20 0.0029970 0.0038154 roc_auc binary 0.8588028 20 0.0029785 Note1: the transformed df after the recipes steps is 7613 x 1237, that is with 1750 maxtokens and pca on 95% variance. Also we only do PCA on the words (not the extra variable) accuracy = 79.8 and f1 = 74.6.\nNote2: And with 2000 maxtokens, 95% threshold on pca, no normalization we got accuracy = 79.8% and f1 = 74.9%. df is 7613 x 1369.\nNote3: And with 1750 maxtokens, no pca, we got accuracy = 79.9% ad f1 = 74.2%.\nNote4: And with 2500 maxtokens, 95% threshold on pca, no normalization we got accuracy = 80.0% and f1 = 74.98%. df is 7613 x 1630. And we get the exact same results if we use normalization.\nAll this fine tuning gave us a very small increase in both accuracy and F1 scores. There is also an increase in ROC-AUC. 
Considering these are the performances on the trained data, and considering the leakage due to normalization, pca and tf-idf, these might be optimist results. Let’s see.\nbest_metric \u0026lt;- tune_lasso %\u0026gt;% select_best(\u0026quot;f_meas\u0026quot;) wf_lasso \u0026lt;- finalize_workflow(wf_lasso, best_metric) wf_lasso ## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: logistic_reg() ## ## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## 6 Recipe Steps ## ## ● step_normalize() ## ● step_tokenize() ## ● step_stopwords() ## ● step_tokenfilter() ## ● step_tfidf() ## ● step_pca() ## ## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## Logistic Regression Model Specification (classification) ## ## Main Arguments: ## penalty = 0.00483076923076923 ## mixture = 1 ## ## Computational engine: glmnet #save the final lasso model fit(wf_lasso, df_train) %\u0026gt;% write_rds(path = \u0026quot;~/disaster_tweets/data/model_lasso_enhanced.rds\u0026quot;) Variable importances library(vip) read_rds(\u0026quot;~/disaster_tweets/data/model_lasso_enhanced.rds\u0026quot;) %\u0026gt;% pull_workflow_fit() %\u0026gt;% vi(lambda = best_metric$penalty) %\u0026gt;% group_by(Sign) %\u0026gt;% top_n(20, wt = abs(Importance)) %\u0026gt;% ungroup() %\u0026gt;% mutate(Importance = abs(Importance), #Variable = str_remove(Variable, \u0026quot;tfidf_text_\u0026quot;), Variable = fct_reorder(Variable, Importance)) %\u0026gt;% ggplot(aes(x = Importance, y = Variable, fill = Sign)) + geom_col(show.legend = FALSE) + facet_wrap(~Sign, scales = \u0026quot;free_y\u0026quot;) + labs(y = NULL, x = \u0026quot;Sign\u0026quot;) Figure 1: Most important variables. One thing worth noticing is that not many of our feature engineering made it to the top 20 of important variables.\nWhat about all our fancy extra variables like number of character? number of http link? Number of # and @?\nWell …\nread_rds(\u0026quot;~/disaster_tweets/data/model_lasso_enhanced.rds\u0026quot;) %\u0026gt;% pull_workflow_fit() %\u0026gt;% vi(lambda = best_metric$penalty) %\u0026gt;% filter(str_detect(Variable, pattern = \u0026quot;number\u0026quot;)) %\u0026gt;% arrange(desc(Importance)) ## # A tibble: 9 x 3 ## Variable Importance Sign ## \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;chr\u0026gt; ## 1 number_char 0.941 POS ## 2 number_http 0.376 POS ## 3 number_number 0.102 POS ## 4 number_mention 0.0639 POS ## 5 number_hashtag 0.0358 POS ## 6 number_location -0.0122 NEG ## 7 number_keyword -0.0584 NEG ## 8 number_repeated_char -0.0912 NEG ## 9 number_word -0.310 NEG Ugh! They are not looking that important! That’s pretty sad, I though I was becoming THE feature engineer guy of NLP. Nope! just humble pie instead…\n Submission of results Applying the model on the test data. 
We first need to reprocess the data.\ntest \u0026lt;- clean_tweets(\u0026quot;~/disaster_tweets/data/test.csv\u0026quot;) model_lasso_enhanced \u0026lt;- read_rds(\u0026quot;~/disaster_tweets/data/model_lasso_enhanced.rds\u0026quot;) library(glmnet) prediction_lasso_enhanced \u0026lt;- tibble(id = test$id, prediction = predict(model_lasso_enhanced, new_data = test)) %\u0026gt;% mutate(target = if_else(prediction == \u0026quot;a_truth\u0026quot;, 1, 0)) write_csv(prediction_lasso_enhanced %\u0026gt;% select(id, target), path = \u0026quot;~/disaster_tweets/data/prediction_lasso_enhanced.csv\u0026quot;) rm(list = ls()) Note 1: maxtokens = 1750, 95% threshold for PCA, I got a 76.8% public score.\nNote 2: with maxtokens = 2000 and a 95% threshold for PCA, I actually got a worse accuracy score.\nNote 3: majority voting, maxtokens = 1750 and NO PCA, resulted in 78.7%.\nNote 4: majority voting, maxtokens = 2500, pca at 95% (df is 1630 wide), resulted in 78.3% (exact same results with normalization).\nNote 5: majority voting, maxtokens = 4000, pca at 90%. Got a 78.3% public score.\nYep! That second attempt at feature engineering is only a small success. It added just 1% accuracy to the submission. It took a lot of work to make it happen but brought only a small increase in F1 score. It’s part of the game!\nWe need to use another method to numerize our tweets. This is what we’ll do in Parts II and III as we consider SVD and word embeddings.\nI am looking forward to comments / feedback.\n Wonderings and lessons learned. There are a few things I still wonder how to improve in the modelling workflow:\n the relationship between tf_idf and max-tokens in the recipe. It seemed that increasing max_tokens before the tf_idf step didn’t add much value. I am wondering up to what point that is the case can we parallelize the map functions from the purrr package? It takes quite a bit of time, for instance, to lemmatize each tweet using map(text, textstem::lemmatize_strings). I do think that might be possible. can we parallelize the recipe() %\u0026gt;% prep() %\u0026gt;% juice()? I also find that step very slow. I do not think that it is possible. The only reason I like to run that line is to check that the recipe steps are doing what I intended them to do Any feedback/advice on the 2 things above is really welcome ;-)\nLessons learned:\n the simplest model in the second lasso run was the best. All the other fancier, more computationally intensive attempts didn’t provide better results. normalization is recommended for Lasso models. In our case, normalizing after the tf_idf step or after the pca didn’t change our model much. So, in the interest of parsimony, I would say I can skip these steps in the future. References I have copied ideas from several Kaggle notebooks and blogs.\n Extensive use of the textrecipes library. Go check his posts on his blog To get ideas on how to use Lasso modeling on a NLP task using the tidymodels framework. 
Julia Silge blog: Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews To get ideas to feature engineer the original tweetsNLP with Disaster - EDA | DFM | SVD | Ensemble ","date":1590364800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1590435308,"objectID":"a9ea6167293a3369528efd809960f3c2","permalink":"/post/disaster-tweets-part-i/","publishdate":"2020-05-25T00:00:00Z","relpermalink":"/post/disaster-tweets-part-i/","section":"post","summary":"Introduction Baseline model - Lasso model on just text Creating a model workflow Analysis of results Picking the best model variable importance Submission of results Baseline with some additional features Rebuilding the data frame and variables Creating and tuning a model Variable importances Submission of results Wonderings and lessons learned.","tags":["Classification","Lasso","NLP","kaggle"],"title":"Disaster Tweets - Part I","type":"post"},{"authors":null,"categories":null,"content":"Academic is designed to give technical content creators a seamless experience. You can focus on the content and Academic handles the rest.\nHighlight your code snippets, take notes on math classes, and draw diagrams from textual representation.\nOn this page, you\u0026rsquo;ll find some examples of the types of technical content that can be rendered with Academic.\nExamples Code Academic supports a Markdown extension for highlighting code syntax. You can enable this feature by toggling the highlight option in your config/_default/params.toml file.\n```python import pandas as pd data = pd.read_csv(\u0026quot;data.csv\u0026quot;) data.head() ``` renders as\nimport pandas as pd data = pd.read_csv(\u0026quot;data.csv\u0026quot;) data.head() Math Academic supports a Markdown extension for $\\LaTeX$ math. You can enable this feature by toggling the math option in your config/_default/params.toml file.\nTo render inline or block math, wrap your LaTeX math with $...$ or $$...$$, respectively.\nExample math block:\n$$\\gamma_{n} = \\frac{ \\left | \\left (\\mathbf x_{n} - \\mathbf x_{n-1} \\right )^T \\left [\\nabla F (\\mathbf x_{n}) - \\nabla F (\\mathbf x_{n-1}) \\right ] \\right |} {\\left \\|\\nabla F(\\mathbf{x}_{n}) - \\nabla F(\\mathbf{x}_{n-1}) \\right \\|^2}$$ renders as\n$$\\gamma_{n} = \\frac{ \\left | \\left (\\mathbf x_{n} - \\mathbf x_{n-1} \\right )^T \\left [\\nabla F (\\mathbf x_{n}) - \\nabla F (\\mathbf x_{n-1}) \\right ] \\right |}{\\left |\\nabla F(\\mathbf{x}_{n}) - \\nabla F(\\mathbf{x}_{n-1}) \\right |^2}$$\nExample inline math $\\nabla F(\\mathbf{x}_{n})$ renders as $\\nabla F(\\mathbf{x}_{n})$.\nExample multi-line math using the \\\\\\\\ math linebreak:\n$$f(k;p_0^*) = \\begin{cases} p_0^* \u0026amp; \\text{if }k=1, \\\\\\\\ 1-p_0^* \u0026amp; \\text {if }k=0.\\end{cases}$$ renders as\n$$f(k;p_0^*) = \\begin{cases} p_0^* \u0026amp; \\text{if }k=1, \\\\\n1-p_0^* \u0026amp; \\text {if }k=0.\\end{cases}$$\nDiagrams Academic supports a Markdown extension for diagrams. 
You can enable this feature by toggling the diagram option in your config/_default/params.toml file or by adding diagram: true to your page front matter.\nAn example flowchart:\n```mermaid graph TD A[Hard] --\u0026gt;|Text| B(Round) B --\u0026gt; C{Decision} C --\u0026gt;|One| D[Result 1] C --\u0026gt;|Two| E[Result 2] ``` renders as\ngraph TD A[Hard] --\u0026gt;|Text| B(Round) B --\u0026gt; C{Decision} C --\u0026gt;|One| D[Result 1] C --\u0026gt;|Two| E[Result 2] An example sequence diagram:\n```mermaid sequenceDiagram Alice-\u0026gt;\u0026gt;John: Hello John, how are you? loop Healthcheck John-\u0026gt;\u0026gt;John: Fight against hypochondria end Note right of John: Rational thoughts! John--\u0026gt;\u0026gt;Alice: Great! John-\u0026gt;\u0026gt;Bob: How about you? Bob--\u0026gt;\u0026gt;John: Jolly good! ``` renders as\nsequenceDiagram Alice-\u0026gt;\u0026gt;John: Hello John, how are you? loop Healthcheck John-\u0026gt;\u0026gt;John: Fight against hypochondria end Note right of John: Rational thoughts! John--\u0026gt;\u0026gt;Alice: Great! John-\u0026gt;\u0026gt;Bob: How about you? Bob--\u0026gt;\u0026gt;John: Jolly good! An example Gantt diagram:\n```mermaid gantt section Section Completed :done, des1, 2014-01-06,2014-01-08 Active :active, des2, 2014-01-07, 3d Parallel 1 : des3, after des1, 1d Parallel 2 : des4, after des1, 1d Parallel 3 : des5, after des3, 1d Parallel 4 : des6, after des4, 1d ``` renders as\ngantt section Section Completed :done, des1, 2014-01-06,2014-01-08 Active :active, des2, 2014-01-07, 3d Parallel 1 : des3, after des1, 1d Parallel 2 : des4, after des1, 1d Parallel 3 : des5, after des3, 1d Parallel 4 : des6, after des4, 1d An example class diagram:\n```mermaid classDiagram Class01 \u0026lt;|-- AveryLongClass : Cool \u0026lt;\u0026lt;interface\u0026gt;\u0026gt; Class01 Class09 --\u0026gt; C2 : Where am i? Class09 --* C3 Class09 --|\u0026gt; Class07 Class07 : equals() Class07 : Object[] elementData Class01 : size() Class01 : int chimp Class01 : int gorilla class Class10 { \u0026lt;\u0026lt;service\u0026gt;\u0026gt; int id size() } ``` renders as\nclassDiagram Class01 \u0026lt;|-- AveryLongClass : Cool \u0026lt;\u0026lt;interface\u0026gt;\u0026gt; Class01 Class09 --\u0026gt; C2 : Where am i? Class09 --* C3 Class09 --|\u0026gt; Class07 Class07 : equals() Class07 : Object[] elementData Class01 : size() Class01 : int chimp Class01 : int gorilla class Class10 { \u0026lt;\u0026lt;service\u0026gt;\u0026gt; int id size() } An example state diagram:\n```mermaid stateDiagram [*] --\u0026gt; Still Still --\u0026gt; [*] Still --\u0026gt; Moving Moving --\u0026gt; Still Moving --\u0026gt; Crash Crash --\u0026gt; [*] ``` renders as\nstateDiagram [*] --\u0026gt; Still Still --\u0026gt; [*] Still --\u0026gt; Moving Moving --\u0026gt; Still Moving --\u0026gt; Crash Crash --\u0026gt; [*] Todo lists You can even write your todo lists in Academic too:\n- [x] Write math example - [x] Write diagram example - [ ] Do something else renders as\n Write math example Write diagram example Do something else Tables Represent your data in tables:\n| First Header | Second Header | | ------------- | ------------- | | Content Cell | Content Cell | | Content Cell | Content Cell | renders as\n First Header Second Header Content Cell Content Cell Content Cell Content Cell Asides Academic supports a shortcode for asides, also referred to as notices, hints, or alerts. By wrapping a paragraph in {{% alert note %}} ... 
{{% /alert %}}, it will render as an aside.\n{{% alert note %}} A Markdown aside is useful for displaying notices, hints, or definitions to your readers. {{% /alert %}} renders as\n A Markdown aside is useful for displaying notices, hints, or definitions to your readers. Spoilers Add a spoiler to a page to reveal text, such as an answer to a question, after a button is clicked.\n{{\u0026lt; spoiler text=\u0026quot;Click to view the spoiler\u0026quot; \u0026gt;}} You found me! {{\u0026lt; /spoiler \u0026gt;}} renders as\n Click to view the spoiler You found me! Icons Academic enables you to use a wide range of icons from Font Awesome and Academicons in addition to emojis.\nHere are some examples using the icon shortcode to render icons:\n{{\u0026lt; icon name=\u0026quot;terminal\u0026quot; pack=\u0026quot;fas\u0026quot; \u0026gt;}} Terminal {{\u0026lt; icon name=\u0026quot;python\u0026quot; pack=\u0026quot;fab\u0026quot; \u0026gt;}} Python {{\u0026lt; icon name=\u0026quot;r-project\u0026quot; pack=\u0026quot;fab\u0026quot; \u0026gt;}} R renders as\n Terminal\n Python\n R\nDid you find this page helpful? Consider sharing it 🙌 ","date":1562889600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1562889600,"objectID":"07e02bccc368a192a0c76c44918396c3","permalink":"/post/writing-technical-content/","publishdate":"2019-07-12T00:00:00Z","relpermalink":"/post/writing-technical-content/","section":"post","summary":"Academic is designed to give technical content creators a seamless experience. You can focus on the content and Academic handles the rest.\nHighlight your code snippets, take notes on math classes, and draw diagrams from textual representation.","tags":null,"title":"Writing technical content in Academic","type":"post"},{"authors":["François de Ryckel"],"categories":null,"content":" Click the Slides button above to demo Academic\u0026rsquo;s Markdown slides feature. Supplementary notes can be added here, including code and math.\n","date":1554595200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1554595200,"objectID":"557dc08fd4b672a0c08e0a8cf0c9ff7d","permalink":"/publication/preprint/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/preprint/","section":"publication","summary":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum.","tags":["Source Themes"],"title":"An example preprint / working paper","type":"publication"},{"authors":["François de Ryckel"],"categories":[],"content":"from IPython.core.display import Image Image('https://www.python.org/static/community_logos/python-logo-master-v3-TM-flattened.png') print(\u0026quot;Welcome to Academic!\u0026quot;) Welcome to Academic! 
Install Python and JupyterLab Install Anaconda which includes Python 3 and JupyterLab.\nAlternatively, install JupyterLab with pip3 install jupyterlab.\nCreate or upload a Jupyter notebook Run the following commands in your Terminal, substituting \u0026lt;MY-WEBSITE-FOLDER\u0026gt; and \u0026lt;SHORT-POST-TITLE\u0026gt; with the file path to your Academic website folder and a short title for your blog post (use hyphens instead of spaces), respectively:\nmkdir -p \u0026lt;MY-WEBSITE-FOLDER\u0026gt;/content/post/\u0026lt;SHORT-POST-TITLE\u0026gt;/ cd \u0026lt;MY-WEBSITE-FOLDER\u0026gt;/content/post/\u0026lt;SHORT-POST-TITLE\u0026gt;/ jupyter lab index.ipynb The jupyter command above will launch the JupyterLab editor, allowing us to add Academic metadata and write the content.\nEdit your post metadata The first cell of your Jupter notebook will contain your post metadata ( front matter).\nIn Jupter, choose Markdown as the type of the first cell and wrap your Academic metadata in three dashes, indicating that it is YAML front matter:\n--- title: My post's title date: 2019-09-01 # Put any other Academic metadata here... --- Edit the metadata of your post, using the documentation as a guide to the available options.\nTo set a featured image, place an image named featured into your post\u0026rsquo;s folder.\nFor other tips, such as using math, see the guide on writing content with Academic.\nConvert notebook to Markdown jupyter nbconvert index.ipynb --to markdown --NbConvertApp.output_files_dir=. Example This post was created with Jupyter. The orginal files can be found at https://github.com/gcushen/hugo-academic/tree/master/exampleSite/content/post/jupyter\n","date":1549324800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1567641600,"objectID":"6e929dc84ed3ef80467b02e64cd2ed64","permalink":"/post/jupyter/","publishdate":"2019-02-05T00:00:00Z","relpermalink":"/post/jupyter/","section":"post","summary":"Learn how to blog in Academic using Jupyter notebooks","tags":[],"title":"Display Jupyter Notebooks with Academic","type":"post"},{"authors":[],"categories":[],"content":"Create slides in Markdown with Academic Academic | Documentation\n Features Efficiently write slides in Markdown 3-in-1: Create, Present, and Publish your slides Supports speaker notes Mobile friendly slides Controls Next: Right Arrow or Space Previous: Left Arrow Start: Home Finish: End Overview: Esc Speaker notes: S Fullscreen: F Zoom: Alt + Click PDF Export: E Code Highlighting Inline code: variable\nCode block:\nporridge = \u0026quot;blueberry\u0026quot; if porridge == \u0026quot;blueberry\u0026quot;: print(\u0026quot;Eating...\u0026quot;) Math In-line math: $x + y = z$\nBlock math:\n$$ f\\left( x \\right) = ;\\frac{{2\\left( {x + 4} \\right)\\left( {x - 4} \\right)}}{{\\left( {x + 4} \\right)\\left( {x + 1} \\right)}} $$\n Fragments Make content appear incrementally\n{{% fragment %}} One {{% /fragment %}} {{% fragment %}} **Two** {{% /fragment %}} {{% fragment %}} Three {{% /fragment %}} Press Space to play!\nOne Two Three \n A fragment can accept two optional parameters:\n class: use a custom style (requires definition in custom CSS) weight: sets the order in which a fragment appears Speaker Notes Add speaker notes to your presentation\n{{% speaker_note %}} - Only the speaker can read these notes - Press `S` key to view {{% /speaker_note %}} Press the S key to view the speaker notes!\n Only the speaker can read these notes Press S key to view Themes black: Black background, white text, 
blue links (default) white: White background, black text, blue links league: Gray background, white text, blue links beige: Beige background, dark text, brown links sky: Blue background, thin dark text, blue links night: Black background, thick white text, orange links serif: Cappuccino background, gray text, brown links simple: White background, black text, blue links solarized: Cream-colored background, dark green text, blue links Custom Slide Customize the slide style and background\n{{\u0026lt; slide background-image=\u0026quot;/img/boards.jpg\u0026quot; \u0026gt;}} {{\u0026lt; slide background-color=\u0026quot;#0000FF\u0026quot; \u0026gt;}} {{\u0026lt; slide class=\u0026quot;my-style\u0026quot; \u0026gt;}} Custom CSS Example Let\u0026rsquo;s make headers navy colored.\nCreate assets/css/reveal_custom.css with:\n.reveal section h1, .reveal section h2, .reveal section h3 { color: navy; } Questions? Ask\n Documentation\n","date":1549324800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1549324800,"objectID":"0e6de1a61aa83269ff13324f3167c1a9","permalink":"/slides/example/","publishdate":"2019-02-05T00:00:00Z","relpermalink":"/slides/example/","section":"slides","summary":"An introduction to using Academic's Slides feature.","tags":[],"title":"Slides","type":"slides"},{"authors":[],"categories":null,"content":" ","date":1548241200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1548241200,"objectID":"942f5391d63c44cb485d1058ca29debe","permalink":"/talk/timeodyssey/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/talk/timeodyssey/","section":"talk","summary":"How philosophy has conceptualized time throughout history. The audience will travel in time to discover how Greek philosophers already devised linear and cyclical conceptions of times, the medieval times, the advent of Islam, and how thinkers are adding a transcendental connotation to the idea of time. To then discover how with the modernity, the concepts of times and techniques become linked together (humans become the \"masters and possessors of nature\"). With the eminence of technology, humans then use philosophy as a tool to prevent the sophistic misappropriation of the discourse (logos) on technic. Finally, this lecture will focus on the concept of speed and slowness and will evaluate the promises of modernity and technology in light of these ideas.","tags":[],"title":"A Philosophical Odyssey in The Concept of Time","type":"talk"},{"authors":null,"categories":null,"content":"","date":1461715200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1461715200,"objectID":"d1311ddf745551c9e117aa4bb7e28516","permalink":"/project/external-project/","publishdate":"2016-04-27T00:00:00Z","relpermalink":"/project/external-project/","section":"project","summary":"An example of linking directly to an external project website using `external_link`.","tags":["Demo"],"title":"External Project","type":"project"},{"authors":null,"categories":null,"content":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. 
Vestibulum sit amet erat at nulla eleifend gravida.\nNullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.\nCras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.\nSuspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.\nAliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.\n","date":1461715200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1461715200,"objectID":"8f66d660a9a2edc2d08e68cc30f701f7","permalink":"/project/internal-project/","publishdate":"2016-04-27T00:00:00Z","relpermalink":"/project/internal-project/","section":"project","summary":"An example of using the in-built project page.","tags":["Deep Learning"],"title":"Internal Project","type":"project"},{"authors":["François de Ryckel","吳恩達"],"categories":["Demo","教程"],"content":"Create a free website with Academic using Markdown, Jupyter, or RStudio. 
Choose a beautiful color theme and build anything with the Page Builder - over 40 widgets, themes, and language packs included!\n Check out the latest demo of what you\u0026rsquo;ll get in less than 10 minutes, or view the showcase of personal, project, and business sites.\n 👉 Get Started 📚 View the documentation 💬 Ask a question on the forum 👥 Chat with the community 🐦 Twitter: @source_themes @GeorgeCushen #MadeWithAcademic 💡 Request a feature or report a bug ⬆️ Updating? View the Update Guide and Release Notes ❤️ Support development of Academic: ☕️ Donate a coffee 💵 Become a backer on Patreon 🖼️ Decorate your laptop or journal with an Academic sticker 👕 Wear the T-shirt 👩💻 Contribute Academic is mobile first with a responsive design to ensure that your site looks stunning on every device. Key features:\n Page builder - Create anything with widgets and elements Edit any type of content - Blog posts, publications, talks, slides, projects, and more! Create content in Markdown, Jupyter, or RStudio Plugin System - Fully customizable color and font themes Display Code and Math - Code highlighting and LaTeX math supported Integrations - Google Analytics, Disqus commenting, Maps, Contact Forms, and more! Beautiful Site - Simple and refreshing one page design Industry-Leading SEO - Help get your website found on search engines and social media Media Galleries - Display your images and videos with captions in a customizable gallery Mobile Friendly - Look amazing on every screen with a mobile friendly version of your site Multi-language - 15+ language packs including English, 中文, and Português Multi-user - Each author gets their own profile page Privacy Pack - Assists with GDPR Stand Out - Bring your site to life with animation, parallax backgrounds, and scroll effects One-Click Deployment - No servers. No databases. Only files. Themes Academic comes with automatic day (light) and night (dark) mode built-in. Alternatively, visitors can choose their preferred mode - click the sun/moon icon in the top right of the Demo to see it in action! Day/night mode can also be disabled by the site admin in params.toml.\n Choose a stunning theme and font for your site. 
Themes are fully customizable.\nEcosystem Academic Admin: An admin tool to import publications from BibTeX or import assets for an offline site Academic Scripts: Scripts to help migrate content to new versions of Academic Install You can choose from one of the following four methods to install:\n one-click install using your web browser (recommended) install on your computer using Git with the Command Prompt/Terminal app install on your computer by downloading the ZIP files install on your computer with RStudio Then personalize and deploy your new site.\nUpdating View the Update Guide.\nFeel free to star the project on Github to help keep track of updates.\nLicense Copyright 2016-present George Cushen.\nReleased under the MIT license.\n","date":1461110400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1555459200,"objectID":"279b9966ca9cf3121ce924dca452bb1c","permalink":"/post/getting-started/","publishdate":"2016-04-20T00:00:00Z","relpermalink":"/post/getting-started/","section":"post","summary":"Create a beautifully simple website in under 10 minutes.","tags":["Academic","开源"],"title":"Academic: the website builder for Hugo","type":"post"},{"authors":["François de Ryckel","Robert Ford"],"categories":null,"content":" Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software. Click the Slides button above to demo Academic\u0026rsquo;s Markdown slides feature. Supplementary notes can be added here, including code and math.\n","date":1441065600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1441065600,"objectID":"966884cc0d8ac9e31fab966c4534e973","permalink":"/publication/journal-article/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/journal-article/","section":"publication","summary":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum.","tags":["Source Themes"],"title":"An example journal article","type":"publication"},{"authors":null,"categories":["R"],"content":" R Markdown This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.\nYou can embed an R code chunk like this:\nsummary(cars) ## speed dist ## Min. : 4.0 Min. : 2.00 ## 1st Qu.:12.0 1st Qu.: 26.00 ## Median :15.0 Median : 36.00 ## Mean :15.4 Mean : 42.98 ## 3rd Qu.:19.0 3rd Qu.: 56.00 ## Max. :25.0 Max. :120.00 fit \u0026lt;- lm(dist ~ speed, data = cars) fit ## ## Call: ## lm(formula = dist ~ speed, data = cars) ## ## Coefficients: ## (Intercept) speed ## -17.579 3.932 Including Plots You can also embed plots. See Figure 1 for example:\npar(mar = c(0, 1, 0, 1)) pie( c(280, 60, 20), c(\u0026#39;Sky\u0026#39;, \u0026#39;Sunny side of pyramid\u0026#39;, \u0026#39;Shady side of pyramid\u0026#39;), col = c(\u0026#39;#0292D8\u0026#39;, \u0026#39;#F7EA39\u0026#39;, \u0026#39;#C4B632\u0026#39;), init.angle = -50, border = NA ) Figure 1: A fancy pie chart. ","date":1437703994,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1437703994,"objectID":"10065deaa3098b0da91b78b48d0efc71","permalink":"/post/2015-07-23-r-rmarkdown/","publishdate":"2015-07-23T21:13:14-05:00","relpermalink":"/post/2015-07-23-r-rmarkdown/","section":"post","summary":"R Markdown This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. 
For more details on using R Markdown see http://rmarkdown.","tags":["R Markdown","plot","regression"],"title":"Hello R Markdown","type":"post"},{"authors":["François de Ryckel","Robert Ford"],"categories":null,"content":" Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software. Click the Slides button above to demo Academic\u0026rsquo;s Markdown slides feature. Supplementary notes can be added here, including code and math.\n","date":1372636800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1372636800,"objectID":"69425fb10d4db090cfbd46854715582c","permalink":"/publication/conference-paper/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/conference-paper/","section":"publication","summary":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum.","tags":["Source Themes"],"title":"An example conference paper","type":"publication"},{"authors":[],"categories":null,"content":" Click on the Slides button above to view the built-in slides feature. Slides can be added in a few ways:\n Create slides using Academic\u0026rsquo;s Slides feature and link using slides parameter in the front matter of the talk file Upload an existing slide deck to static/ and link using url_slides parameter in the front matter of the talk file Embed your slides (e.g. Google Slides) or presentation video on this page using shortcodes. Further talk details can easily be added to this page using Markdown and $\\rm \\LaTeX$ math code.\n","date":1054472400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1054472400,"objectID":"96344c08df50a1b693cc40432115cbe3","permalink":"/talk/example/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/talk/example/","section":"talk","summary":"An example talk using Academic's Markdown slides feature.","tags":[],"title":"Example Talk","type":"talk"}]