A web application which takes a website URL as an input and provides general information about the contents of the page
- HTML Version
- Page Title
- Headings count by level
- Amount of internal and external links
- Amount of inaccessible links
- If a page contains a login form (NB: Logic is not perfect)
$ git clone https://github.com/vkmrishad/go-webscraper-example.git
$ cd go-webscraper-example
$ go get
.
├── Dockerfile
├── README.md
├── controllers
│ ├── common.go
│ └── scraper.go
├── go-webscraper-example
├── go.mod
├── go.sum
├── main.go
├── models
│ ├── common.go
│ └── scraper.go
├── test
│ └── endpoint_test.go
└── utility
├── common.go
└── scraper.go
$ go run .
or
$ go run main.go
$ go build .
$ ./go-webscraper-example
$ docker build -t go-webscraper-example . --rm
$ docker run -p 8080:8080 --name go-webscraper-example --rm go-webscraper-example
$ go test ./test -v
Server URL: http://127.0.0.1:8080/
Endpoint: [POST]http://127.0.0.1:8080/scraper/
Request
POST /scraper/ HTTP/1.1
Host: 127.0.0.1:8080
Content-Type: application/json
Cookie: csrftoken=WhdzUpxwGJm5UBpiSb2fxavIQ8zFGcHKdTWJkUOWRG4993AevH1rYe0QS3b8QYzr
Content-Length: 43
{
"url": "https://mohammedrishad.com"
}
Response
{
"url": "https://mohammedrishad.com",
"html_version": "HTML 5",
"page_title": "Mohammed Rishad",
"heading_count": {
"h1": 1,
"h2": 0,
"h3": 1,
"h4": 0,
"h5": 0,
"h6": 0
},
"links": {
"internal": {
"urls": [
"https://www.mohammedrishad.com/#body",
"https://www.mohammedrishad.com/#skill_sec",
"https://www.mohammedrishad.com/#work_sec",
"https://www.mohammedrishad.com/#edu_sec",
"https://www.mohammedrishad.com/#exp_sec",
"https://www.mohammedrishad.com/#achivement_sec",
"https://www.mohammedrishad.com/#interest_sec",
"https://www.mohammedrishad.com/#contact_sec",
"https://www.mohammedrishad.com/#body",
"https://www.mohammedrishad.com/MohammedRishadResume.pdf",
"https://www.mohammedrishad.com/#body"
],
"count": 11
},
"external": {
"urls": [
"https://studysmarter.de",
"https://uoc.ac.in/",
"https://www.nuflights.com/",
"https://www.cliphire.co/",
"https://moonllight.com/",
"https://transportsimple.com/",
"https://www.etailpet.com/",
"https://www.feedmyfurbaby.co.nz/",
"mailto:[email protected]",
"https://www.linkedin.com/in/vkmrishad",
"https://github.com/vkmrishad",
"https://stackoverflow.com/users/6874947/mohammed-rishad",
"https://www.facebook.com/vkmrishad",
"https://www.twitter.com/vkmrishad",
"https://www.youtube.com/channel/UCvnIlVDRk46_xpnrrAiYrTg"
],
"count": 15
},
"failed": {
"urls": [
"mailto:[email protected]",
"https://moonllight.com/"
],
"count": 3
}
},
"page_contains_login_form": false
}