forked from abudac/INKE_WIScker
-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.html
153 lines (153 loc) · 7.82 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
<!DOCTYPE HTML>
<!-- ============================================================================================== -->
<!-- Created by Andrea Budac at the University of Alberta -->
<!-- Created under the INKE project and Dr. Rockwell and Dr. Ruecker -->
<!-- Original idea by Robert Budac -->
<!-- Contributors: Andrea Budac, Geoffrey Rockwell, Zachary Palmer, Robert Budac, Stan Ruecker -->
<!-- ============================================================================================== -->
<!-- WIkcer (Wikipedia Idea Scraper) is a tool for scraping old revisions of Wikipedia articles. -->
<!-- The corpus resulting from the scrape is then available to the user to analyze as they see fit. -->
<!-- -->
<!-- Copyright (C) 2014-2015 Andrea Budac and INKE -->
<!-- -->
<!-- This program is free software: you can redistribute it and/or modify -->
<!-- it under the terms of the GNU General Public License as published by -->
<!-- the Free Software Foundation, either version 3 of the License, or -->
<!-- (at your option) any later version. -->
<!-- -->
<!-- This program is distributed in the hope that it will be useful, -->
<!-- but WITHOUT ANY WARRANTY; without even the implied warranty of -->
<!-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -->
<!-- GNU General Public License for more details. -->
<!-- -->
<!-- You should have received a copy of the GNU General Public License -->
<!-- along with this program. If not, see <http://www.gnu.org/licenses/>. -->
<!-- ============================================================================================== -->
<html>
<head>
<meta charset=utf-8>
<title>WIcker (Wikipedia Idea scraper)</title>
<meta name="description" content="Wikipedia Scraper">
<meta name="author" content="abudac">
<!-- CSS loaded from style.css -->
<link rel="stylesheet" type="text/css" href="style.css">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="css/bootstrap.min.css" media="screen">
<link rel="stylesheet" href="css/datepicker.css">
<!-- javascript starts here -->
<!-- Load Jquery and Bootstrap -->
<script src="http://code.jquery.com/jquery.js"></script>
<script src="js/bootstrap.min.js"></script>
<script src="js/bootstrap-datepicker.js"></script>
<script src="js/wiki-edit-compare.js"></script>
</head>
<body>
<!--- Header containing logo and name -->
<header class="mainTitle">WIcker
<!-- Info Panels Buttons -->
<span id="aboutInfo">About</span>
<span id="helpInfo">Instructions</span>
<!-- Images that link to their respective sites -->
<div class="iNKELogo"><a href="http://inke.ca/" target="_blank"><img src="img/inke_logo.png"> </a> </div>
<div class="uALogo"><a href="http://www.ualberta.ca/" target="_blank"> <img src="img/ua-color.png"> </a> </div>
</header>
<!-- Section for About and Instructions -->
<section class="infoPanel" id="aboutPanel">
WIkcer (Wikipedia Idea Scraper) is a tool for scraping old versions of Wikipedia articles.
The corpus resulting from the scrape is then available to you to analyze as you see fit.
WIcker supports all languages of Wikipedia.
<br>
WIcker was created at the University of Alberta under the INKE project.
</section>
<section class="infoPanel" id="helpPanel">
In order to scape an article, copy the URL of the Wikipedia article you would like to scrape directly into this form,
then enter the interval of time that you would like the article to be scraped for.
The date to start scraping must be before the date to end scraping.
WIcker will scrape the text of the article as it was on the day that is selected.
However, if the date to start scraping is prior to the article's creation,
WIcker will return the earliest version of the article for that time.<br>
How often the article will be scraped determines at what times the scraper will collect a version.
Note that a larger number of versions will take longer to scrape.<br>
You can request that WIcker print the resulting text as xml or in a human readable format.
Characters such as "&" and "<" are removed from the resulting xml text to prevent xml errors.
As well, you can choose between plain text formatting and Wikipedia style formatting.
Plain text format is more readable, but has data such as links and the infobox removed.<br>
Then press the "Scrape" button and WIcker will print the corpus below the Wikipedia window.<br>
To copy the resulting text, press the "Select Text" button and the copy it into whatever software you wish to use.<br>
The "Reset" button clears the form and the output section. It is suggested that the form be reset between scrapes to prevent errors.<br>
</section>
<section class="mainFormAndOutput">
<!-- Form for User on Left -->
<section class="wikiFormLeft">
<!-- Form for input -->
<form id="wikiForm">
<!-- Text box for address of Wikipedia article -->
<div class="form-group" id="wikiAddressBox">
<label>Enter the URL of the Wikipedia article you want to scrape:</label>
<input type="text" class="form-control" id="wikiAddress" placeholder="http://en.wikipedia.org/"
data-toggle="popover"
onchange=validateInfo(); oninput=validateInfo();>
</div>
<!-- Date to start searching -->
<div class="form-group" id="startDateBox">
<label>Date to start scraping:</label>
<br>
<input type="text" class="form-control" id="startDate" placeholder="yyyy/mm/dd" data-toggle="popover">
</div>
<!-- Date to end search -->
<div class="form-group" id="endDateBox">
<label>Date to end scraping:</label>
<br>
<input type="text" class="form-control" id="endDate" placeholder="yyyy/mm/dd" data-toggle="popover">
</div>
<!-- How often to grab revision -->
<div class="form-group">
<label>How often should an article be scraped?</label>
<select class="form-control" id="oftenDate" onchange=validateInfo(); oninput=validateInfo();>
<option value="7">every week</option>
<option value="14">every two weeks</option>
<option value="30">every month</option>
<option value="61">every two months</option>
<option value="183">every six months</option>
<option value="365">every year</option>
</select>
</div>
<!-- Whether to have XML output or human readable output and plain text or Wikipedia Formatting-->
<div class="form-group">
<label>How should the text be formatted?</label>
<select class="form-control" id="textFormat" onchange=validateInfo(); oninput=validateInfo();>
<option value="1">XML</option>
<option value="2">Human Readable</option>
</select>
<br>
<select class="form-control" id="bodyFormat" onchange=validateInfo(); oninput=validateInfo();>
<option value="3">Plain Text</option>
<option value="4">Wikipedia Formatting</option>
</select>
</div>
</form>
<div class="progress" id="progressBarWrapper">
<div class="progress-bar progress-bar-success progress-bar-striped active" id="progressBar">
<span class="sr-only"></span>
</div>
</div>
<!-- Buttons -->
<button type="button" class="btn btn-success" id="submitButton" onclick=wikiScrapeGo();>Scrape</button>
<button type="button" class="btn btn-warning" id="resetButton" onclick=resetForm();>Reset</button>
<button type="button" class="btn btn-info" id="selectTextButton" onclick=selectAllText('printOutHere');>Select Text</button>
</section>
<!-- Output Section on Right -->
<section class="wikiOutputRight">
<!-- Page Preview -->
<div id="previewBox">
<iframe src="http://en.wikipedia.org/" id="previewFrame" sandbox=""></iframe>
</div>
<!-- Print Output Here -->
<div class="alert alert-danger alert-dismissible fade in" id="errorOutputAlert">
</div>
<div id="printOutHere">
</div>
</section>
</section>
</body>
</html>