WEB SCRAPING

INTRODUCTION

This is the Task 6 submission by Team B of the HNG Internship 6.0, Machine Learning Track. We were assigned to use a web scraping tool such as Selenium to collect the H-Index, names, and other information of Computer Science professors on Google Scholar (pages 1 to 25).

GETTING STARTED

The Google Scholar page scraped is https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Computer+Science+professors&btnG=
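
The search URL above is a base address plus a set of query parameters. As a small sketch, it can be rebuilt with the standard library so that the search phrase is easy to change. The names BASE_URL and build_search_url are illustrative and are not taken from scraper.py.

```python
from urllib.parse import urlencode

# Base address of the Google Scholar author search (taken from the URL above).
BASE_URL = "https://scholar.google.com/citations"


def build_search_url(query: str) -> str:
    """Return the author-search URL for the given search phrase."""
    params = {
        "hl": "en",
        "view_op": "search_authors",
        "mauthors": query,  # spaces are encoded as "+"
        "btnG": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


SEARCH_URL = build_search_url("Computer Science professors")
print(SEARCH_URL)
```

Building the URL this way reproduces the address listed above for the phrase "Computer Science professors".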

Requirements

Google Chrome
ChromeDriver, which must be the same version as your installed Google Chrome
A compatible Integrated Development Environment (IDE) such as Visual Studio Code (VSCode)
Selenium
Pandas

INSTALLATIONS

https://www.google.com/chrome/ to install Google Chrome
https://sites.google.com/a/chromium.org/chromedriver/home to install the WebDriver for Chrome
pip install selenium to install the web scraping tool (make sure you have pip installed)
pip install pandas to install pandas
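
Before running the scraper, it can help to confirm that both packages are importable. The snippet below is a minimal sketch of such a check using only the standard library. The helper name missing_packages is hypothetical and is not part of this project.

```python
from importlib.util import find_spec

# Packages this project installs with pip (see the list above).
REQUIRED_PACKAGES = ("selenium", "pandas")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Please install: " + ", ".join(missing))
else:
    print("All required packages are installed.")
```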

CONFIGURATION

# Import the necessary libraries
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

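# Block the notification popup generated by the site
OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--disable-notifications")

# Start Chrome with these options (assumption: chromedriver is in this directory or on PATH)
DRIVER = webdriver.Chrome(options=OPTIONS)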

# Define empty lists in which to store the different results
LIST_OF_LINKS = []
TOTAL_NAME = []
TOTAL_BIO = []
TOTAL_H_INDEX = []
H_INDEX_2014 = []
TOTAL_I_10_INDEX = []
I_10_INDEX_2014 = []
TOTAL_CITATIONS = []

# After collecting the profile URLs into LIST_OF_LINKS, scrape each of them to extract the names and H-Index
for position, each in enumerate(LIST_OF_LINKS, start=1):
    print("collecting data of " + str(position))
    DRIVER.get(each)  # DRIVER goes to each link
    # Next use "try" and "except" to account for the various possible cases, such as:
    # a case where the DRIVER finds a name
    # a case where the DRIVER doesn't find a name and we need to prevent the code from crashing
    # a case where the DRIVER finds an H-Index
    # a case where the DRIVER doesn't find an H-Index and we need to prevent the code from crashing
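
The "try" and "except" cases listed above can be sketched with one small helper that runs a lookup and falls back to a placeholder when the lookup fails. This is only an illustration: safe_text, the "Not found" placeholder, and the stand-in lookups are hypothetical, and in the scraper the lookup would be a Selenium call on DRIVER instead.

```python
def safe_text(lookup, default="Not found"):
    """Run `lookup()` and return its result, or `default` if it raises.

    `lookup` is any zero-argument callable, for example a lambda that
    wraps a Selenium element lookup. Catching Exception is deliberately
    broad here so that one missing field cannot crash the whole run.
    """
    try:
        return lookup()
    except Exception:
        return default


# Stand-ins for a profile where the name is present but the H-Index is missing.
def find_name():
    return "Example Name"


def find_h_index():
    raise LookupError("element not found")


names = [safe_text(find_name)]
h_indexes = [safe_text(find_h_index)]
print(names, h_indexes)
```

Appending a placeholder instead of skipping the entry keeps every result list the same length, which matters when the lists are later combined into one table.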

# Print the results of each list

print(TOTAL_NAME)
print(TOTAL_BIO)
print(TOTAL_H_INDEX)
print(H_INDEX_2014)
print(TOTAL_I_10_INDEX)
print(I_10_INDEX_2014)
print(TOTAL_CITATIONS)

# Store these results in a pandas DataFrame

DF = pd.DataFrame({'Names': TOTAL_NAME,
                   'Bio Data': TOTAL_BIO,
                   'H Index': TOTAL_H_INDEX,
                   'H Index since 2014': H_INDEX_2014,
                   'I-10 Index': TOTAL_I_10_INDEX,
                   'I-10 Index since 2014': I_10_INDEX_2014,
                   'Citation': TOTAL_CITATIONS,
                   })
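
Building a DataFrame from a dictionary of lists requires every list to have the same length, otherwise pandas raises a ValueError. The check below is a minimal sketch, in plain Python, of how the lists could be verified first. The helper check_equal_lengths and the sample values are hypothetical.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    `columns` maps a column name to its list of values, the same shape
    as the dictionary passed to pd.DataFrame.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return True


# Made-up sample data for illustration only.
sample = {"Names": ["A", "B"], "H Index": ["10", "20"]}
print(check_equal_lengths(sample))
```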

OUTPUT

# Write the DataFrame to a CSV file

DF.to_csv('output.csv')

# A part of the result
    Names               H Index
0   David S. Johnson    132
1   Jiawei Han          169
2   Rob Knight          166
3   William H. Press    76
4   Stephen Boyd        112
5   Scott Shenker       154

How to Run this Project

After installing the necessary applications and packages, run the scraper.py file by entering python3 scraper.py in the terminal. Make sure scraper.py is in the same directory as the installed chromedriver.
To avoid overwriting or duplicating an existing file, rename the "output.csv" filename on the last line of scraper.py before each run.
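
Instead of renaming the output file by hand, a free filename could be chosen automatically. The sketch below is one possible approach and is not part of scraper.py. The helper unique_path is hypothetical, and it appends a counter to the name until the name is unused.

```python
from pathlib import Path


def unique_path(path):
    """Return `path` if it is free, else the first free 'stem_N.suffix' variant."""
    path = Path(path)
    candidate = path
    counter = 1
    while candidate.exists():
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return candidate


# Prints "output.csv" if that name is free, otherwise a numbered variant.
print(unique_path("output.csv"))
```

The result could then be passed to DF.to_csv in place of the fixed 'output.csv' string.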

Important Precautions to Take

Please wait patiently for the code to finish running; it may take a while. The delay comes from the time.sleep() calls, which slow the scraper down so that Google is less likely to flag the frequent activity on the site.
A fluctuating network can sometimes interrupt the run, so if it is stuck at one point for too long, please refresh the browser.
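
The two precautions above can be sketched in code: a pause of random length between page loads, and a few retries when a load fails. This is an illustrative sketch and not the code in scraper.py. The helpers polite_pause and load_with_retry and the 2 to 5 second range are assumptions.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random time in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def load_with_retry(load, url, attempts=3, pause=polite_pause):
    """Call `load(url)`, retrying after a pause if it raises.

    `load` is any callable that takes a URL, for example DRIVER.get.
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return load(url)
        except Exception:
            if attempt == attempts:
                raise
            pause()
```

In the scraping loop, a call such as load_with_retry(DRIVER.get, each) followed by polite_pause() would be one way to apply both ideas.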

CONCLUSION

Following the above instructions will produce the names of 250 Computer Science professors and their H-Index in CSV format.

Built with Visual Studio Code by members of TEAM B, Task 6.