
Hits vs Pages #137

Open
Bllacky opened this issue Jul 26, 2019 · 27 comments

Comments

@Bllacky

Bllacky commented Jul 26, 2019

Hi,

This is a feature request.

Overall I love Awstats for its simplicity. It works very well for most of my needs.
However, it has a bit of trouble with bot detection.

I use several pieces of software to monitor my website's traffic, all using various methods. Awstats is one of them. Overall, Awstats is in agreement with the others, with one exception: there are lots of bots out there on the internet that Awstats doesn't detect but which are obvious if you look at the List of Hosts.

Why are these bots obvious? Because most modern websites, including mine, load multiple files per visit: a bunch of CSS files, JS files, and so on. So if you see a visit that is 1 page, 1 hit and 10 KB of traffic, you know that's not a real visit and probably a bot. If I eliminate these visits from those counted by Awstats, then Awstats statistics are in agreement with those of Google Analytics or Matomo.

So my request is to let me set, in the config file, some parameters based on which Awstats decides what is a real visit and what is a bot.

Example:

# Minimum number of hits required to consider a visit real (human) and not a bot.
MinHitsPerVisit=7          # default: 1

# Minimum ratio between hits and pages in a visit to consider it real (human) and not a bot.
MinHitsToPageRatio=1.5     # default: 1

# Minimum traffic of a visit to consider that visit real (human) and not a bot.
MinVisitTraffic=100KB      # default: 1KB

Implement these 3 parameters and I can bring my Awstats into agreement with other traffic-measuring software.
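For illustration, here is a minimal sketch of how the three proposed thresholds might combine. This is Python rather than the Perl AWStats is written in; the function name and the all-three-must-pass logic are assumptions drawn from the wording of the proposal, not AWStats code.

```python
# Hypothetical sketch of the proposed visit filter; the constants
# mirror the suggested config options. Not actual AWStats logic.

MIN_HITS_PER_VISIT = 7            # proposed MinHitsPerVisit
MIN_HITS_TO_PAGE_RATIO = 1.5      # proposed MinHitsToPageRatio
MIN_VISIT_TRAFFIC = 100 * 1024    # proposed MinVisitTraffic (100 KB in bytes)

def looks_human(pages: int, hits: int, traffic_bytes: int) -> bool:
    """Return True only if a visit passes all three thresholds."""
    if hits < MIN_HITS_PER_VISIT:
        return False
    if pages == 0 or hits / pages < MIN_HITS_TO_PAGE_RATIO:
        return False
    if traffic_bytes < MIN_VISIT_TRAFFIC:
        return False
    return True

# The 1-page / 1-hit / 10 KB visit described above is rejected:
print(looks_human(pages=1, hits=1, traffic_bytes=10 * 1024))    # False
# A typical real visit passes:
print(looks_human(pages=3, hits=45, traffic_bytes=900 * 1024))  # True
```

Whether the three checks should all have to pass, or apply independently, is a design point the proposal leaves open; the sketch requires all three.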

Thank you very much!

@visualperception
Contributor

I raised this in issue #59 but there has been no response yet.

@visualperception
Contributor

visualperception commented Oct 5, 2019

Also...
The more people who ask for this, the more likely it is to happen.
However, Eldy seems to have ceased significant development unless it is required to keep things running.
I think it will need a Perl developer from the community to implement this. Any offers? It would provide a significant improvement to the accuracy of awstats for the whole community.
I'm not a Perl programmer, or I would look at it myself.

@Bllacky
Author

Bllacky commented Oct 6, 2019

I will try to find someone. Hopefully @eldy will accept the commit if it ever happens.

@visualperception
Contributor

Since awstats is being used by several significant webhost companies with hundreds if not thousands of users, it is important that any changes to code such as this do not significantly affect the performance of awstats. I think that would be the major concern for eldy when considering whether or not to accept the commit, but it's not really for me to say.

@Bllacky
Author

Bllacky commented Oct 18, 2019

I don't think there will be a significant performance hit, and it's a relatively simple filter. I think we can start a code bounty for this modification.

Like https://bountify.co/, http://www.coderbounty.com/, https://www.bountysource.com/

I'm more than willing to pitch in.

@Bllacky
Author

Bllacky commented Nov 2, 2019

@visualperception I may have found someone to look into this issue and possibly implement this feature. Is there any way I can contact you?

@visualperception
Contributor

visualperception commented Mar 28, 2020

If you post an email address I will send you an email which has my email address in it.
You can set up a temporary email address for this purpose at:
https://www.gmx.com/ ( or https://www.gmx.co.uk, I'm in the UK )
and then delete it once we have exchanged private email addresses.
https://www.gmx.com is part of 1&1 based in Germany. A good email provider without the bloat of gmail.
Is your person still willing to do it?

Regards
VisualPerception

@Bllacky
Author

Bllacky commented Mar 28, 2020

That's very kind of you.
I've tried to make an account on gmx but it fails every time with
A technical error has occurred. Error Code: eb6a5658-ec9b-4246-b5e5-0d0bf8a01f86

What do you think of a temporary email address such as:
https://www.throwawaymail.com/en

@visualperception
Contributor

Have you enabled first-party cookies for gmx.com?

@Bllacky
Author

Bllacky commented Mar 29, 2020

Have you enabled first-party cookies for gmx.com?

Yes, and I have tried it with different browsers. Same result. Seems their website is broken.

@visualperception
Contributor

https://www.throwawaymail.com/en
or any mail provider, but make sure you can delete the email account from it.
My email will be temporary, so I'm not worried about spam as I can delete/change it.

@Bllacky
Author

Bllacky commented Mar 29, 2020

Send to [email protected]. Thank you!

@visualperception
Contributor

I sent email to your address several months back but I have heard nothing since. Did you find anyone willing to code this?

@Bllacky
Author

Bllacky commented Oct 27, 2020

I sent email to your address several months back but I have heard nothing since. Did you find anyone willing to code this?

I did check that email address for a while, but if I fail to check it for 48 hours, it gets deleted. So I reckon your email got lost at some point. Anyway, you can try and send it again here: [email protected]

I did find someone who said they were willing, but in the end nothing came out of it. So at this point there is no one to pick up this modification.

@visualperception
Contributor

OK I have sent my email address to your posted email address at 01:05 GMT 2020-10-28

@visualperception
Contributor

Ooops,
I have received a "Message undeliverable" notice from my mailserver as follows:

[email protected]
Error Type: SMTP
Connection to recipients server failed.
Error: Host name: 104.27.164.72, message: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

@visualperception
Contributor

OK.
I have tried again and so far I have had no "Message undeliverable" response.
It seems that email server has already been trying to log into my mailserver to send messages; I had several entries in my blacklist.

@visualperception
Contributor

Spoke too soon. I just got another Undeliverable message:

[email protected]
Error Type: SMTP
Connection to recipients server failed.
Error: Host name: 104.27.165.72, message: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

@Bllacky
Author

Bllacky commented Oct 28, 2020

I will try with a different service. Looks like throwaway mail doesn't work.
Try [email protected] ?
Thank you.

@visualperception
Contributor

visualperception commented Oct 28, 2020

email sent 12:20pmGMT Wed 2020-10-28

@Bllacky
Author

Bllacky commented Oct 28, 2020

email sent 12:20pmGMT Wed 2020-10-28

These free mail accounts... they don't seem to work as they should.
One last try: [email protected]

@visualperception
Contributor

sent

@Bllacky
Author

Bllacky commented Oct 29, 2020

Got it! Finally made contact.

@chuckhoupt
Contributor

How would this feature interact with cached resources (css, js, images, etc.) and return visits? It seems like it might miss returning visits if resources are set to be cached for long periods. I.e. a first visit triggers 3+ hits for a page, but a second visit later in the day may only be a single hit on the page's HTML file. If MinHitsPerVisit or MinHitsToPageRatio are greater than 1, then the second visit would be ignored?

@Bllacky
Author

Bllacky commented Nov 17, 2020

How would this feature interact with cached resources (CSS, js, images, etc.) and return visits? It seems like it might miss returning visits if resources are set to be cached for long periods. I.e. a first visit triggers 3+ hits for a page. Still, a second visit later in the day may only be a single hit on the page's HTML file. If MinHitsPerVisit or MinHitsToPageRatio are greater than 1, then the second visit would be ignored?

In theory, I think you are absolutely right.

But I am not sure if that is how it will work in practice, because I have never encountered such a situation. On my website, the first visit is usually 1 page - 50 hits, while second visits are usually 1 page - 20/30 hits. But I have never had 1 page - 1 hit from a valid visit. I often get 1 page - 1 hit from some IPs in Russia/China/Vietnam etc.

I suppose you could get 1 page - 1 hit from a valid visit if all your resources were static and cacheable, but I am not sure there are any modern websites which work like that. That is exactly why we proposed making these settings configurable. My website, for example, has many microservices and dynamic scripts, and it would be impossible to have 1 page - 1 hit.

@visualperception
Contributor

chuckhoupt wrote:

How would this feature interact with cached resources (css, js, images, etc.) and return visits? It seems like it might miss returning visits if resources are set to be cached for long periods. I.e. a first visit triggers 3+ hits for a page, but a second visit later in the day may only be a single hit on the page's HTML file. If MinHitsPerVisit or MinHitsToPageRatio are greater than 1, then the second visit would be ignored?

chuckhoupt, there are a couple or more ways you can check this. Firstly, you could look in the stored statistics file, e.g. awstats102020.domain.txt, at the section titled:
Host - Pages - Hits - Bandwidth - Last visit date - [Start date of last visit] - [Last page of last visit]
[Start date of last visit] and [Last page of last visit] are saved only if session is not finished
The 10 first Hits must be first (order not required for others)

This shows pages and hits for an IP. Where it is 0 or 1 pages and 1 hit, that is likely a robot. And if the user agent does not contain a bot ID, then it is also a potential bot. Also, if that IP has accessed robots.txt it can definitely be considered a bot. So there is plenty there to check against.
You only need to check after the robot detection has run and it says it wasn't a bot; those are the ones we would like checked. From the same section you could take everything between BEGIN_VISITOR and END_VISITOR, sort it on Pages - Hits ascending, and you will get a list with all the lowest pages and hits first, so you can stop processing once you reach the new conf-file parameters for pages and hits.
Just a few things to consider as suggestions. You may of course find a better way once you get into it, and you will need to modify all relevant reports/stats databases etc.
And consider that my investigations show something like 30% bad bots. Getting rid of those is unlikely to generate nearly as many wrong detections, so whilst it may not be 100% accurate, it is likely to leave only a small percentage error in detection and so will improve accuracy considerably. But this will come out in testing, I think.

visualperception
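The lookup described above could be sketched roughly as follows. This is Python rather than AWStats' Perl, the function name and thresholds are hypothetical, and the entry layout is assumed from the section header quoted in the comment (Host - Pages - Hits - Bandwidth - ...), so treat it only as an illustration of the idea.

```python
# Rough sketch (not AWStats code): scan the BEGIN_VISITOR/END_VISITOR
# section of a saved statistics file such as awstats102020.domain.txt
# and flag entries that fall below the proposed pages/hits thresholds.

def suspect_hosts(lines, min_pages=2, min_hits=7):
    """Yield (host, pages, hits) for visits below the thresholds."""
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("BEGIN_VISITOR"):
            inside = True
            continue
        if line.startswith("END_VISITOR"):
            break
        if not inside or not line:
            continue
        fields = line.split()
        host, pages, hits = fields[0], int(fields[1]), int(fields[2])
        if pages < min_pages or hits < min_hits:
            yield (host, pages, hits)

# Example with made-up entries (RFC 5737 documentation addresses):
sample = """\
BEGIN_VISITOR 3
203.0.113.7 1 1 10240 20201028001122
198.51.100.9 3 45 921600 20201028001500
192.0.2.44 0 1 512 20201028002000
END_VISITOR""".splitlines()

for host, pages, hits in suspect_hosts(sample):
    print(host, pages, hits)   # flags only the low-page, low-hit entries
```

As suggested, a real implementation would run this only on visitors that AWStats' existing robot detection has already passed as human, and would then need to correct the affected reports.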

@visualperception
Contributor

visualperception commented Feb 18, 2021

@chuckhoupt

Any more thoughts on this?
