[feature] Change `instance-stats-randomize` to `instance-stats-mode` with multiple options #3723
Comments
hi, sorry about reaching out about this feature post-merge, but i've only been made aware of it recently. i can't stop feeling like this is a deliberate action to poison data: nodeinfo has all stats fields nullable, so it's perfectly valid to not include them (and i'd say preferable over serving fake numbers).

the underlying reasoning is valid and I don't want to oppose it: instance admins must have ways to opt out of data collection for their instance. this should not clash with those wanting data collection and public listing, however. i'd also say that this is counter-productive for any GtS instance wanting to be private in good faith: showing astronomical numbers just puts your instance more in the spotlight, not less. the only use case i can think of that benefits from this is inflating user stats, which is not at all respectful of the network.

on the previous issue you mention that GtS explicitly opts out of crawling in robots.txt. would you consider this behavior a violation of your crawler rules? if yes, can you propose alternative ways to identify server software? i believe this behavior is fair and warranted, but i'd still like to ask for feedback. with this said, just adding robots.txt rules to hide nodeinfo isn't per se enough to have such (node)info never be used by remote software, and the last change makes every GtS instance potentially a bad actor, answering such queries in bad faith and poisoning local instance counts.

these complaints may seem excessive: the active user count just isn't that much of a deal for most software, but i still claim this is data poisoning and reflects poorly on GtS "network respectability". for example, i'm working on a "threadiverse view" which would use instance activity stats to provide a barebones "sort by active". GtS faking millions of MAU will be very detrimental to this view, unless I explicitly exclude GtS instances.

i would kindly ask to reconsider this change: the options should be between serving actual stats or serving no stats at all (not even zeroes). if you really want to keep the randomized feature in, please mark it somewhere: nodeinfo has a metadata dictionary, maybe include a field there that flags the stats as randomized.
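for concreteness, here's a minimal sketch of the kind of nodeinfo 2.0 document i mean: stats left out entirely, plus a hypothetical marker key (the `statsMode` name is just an illustration, not anything standard, and how much of `usage` can be omitted depends on your reading of the schema):

```json
{
  "version": "2.0",
  "software": { "name": "gotosocial", "version": "0.18.0" },
  "protocols": ["activitypub"],
  "services": { "inbound": [], "outbound": [] },
  "openRegistrations": false,
  "usage": { "users": {} },
  "metadata": { "statsMode": "random" }
}
```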
Thank you for creating this issue and for your intention to reconsider the shape of the introduced change. I'd like to share my perspective (as a GoToSocial user/admin) on this matter, in the hope that it will help you decide on the future direction of this feature. I kindly ask you to reconsider this feature in a way that allows hiding stats for those who don't want to share them, but also allows instances to be visible in services like fediverse.observer for those who wish so. I'm afraid that leaving only an option to serve random statistics won't be enough to achieve that.
For me, random or "real" are the only two options I care about. Either I want my stats to be tracked, then I allow this in robots.txt, or I don't want them tracked, and then whoever ignores that wish gets random data.
I feel like the party to address regarding that matter is the people maintaining these stat-trackers, not GtS. IMHO the solution to this problem is: don't track instances that disallow that in their robots.txt.
One thing I'd suggest would be something like a 'fairplay' mode that works like a combination of serve (for good requests) and some other variation for scrapers blocked in robots.txt. Basically, if something that's blocked by robots.txt hits the endpoint, GTS serves up random, zero, or otherwise misleading data. But if a standard user from the web or some other allowed user agent were to hit it, they'd get correct stats. This would encourage scrapers to respect robots.txt, since the data they get back wouldn't be trustworthy anyway, and would allow for reporting of true stats to sources that have the good faith to respect robots.txt (to which you could grant an allow rule).
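A rough sketch of what I mean, assuming a hypothetical handler and a hand-maintained list of blocked agents (none of this is GtS code, and all the names are made up for illustration):

```go
package main

import (
	"encoding/json"
	"math/rand"
	"net/http"
	"strings"
)

// Agents already disallowed in robots.txt (hypothetical examples).
var disallowedAgents = []string{"FediDB", "FediverseObserver", "GPTBot"}

type userStats struct {
	Total       int `json:"total"`
	ActiveMonth int `json:"activeMonth"`
}

func realStats() userStats {
	return userStats{Total: 12, ActiveMonth: 7} // whatever the instance really has
}

func randomStats() userStats {
	return userStats{Total: rand.Intn(1_000_000), ActiveMonth: rand.Intn(1_000_000)}
}

// fairplayStats serves true numbers to agents that respect robots.txt,
// and junk to agents that robots.txt already asked to stay away.
func fairplayStats(w http.ResponseWriter, r *http.Request) {
	ua := r.Header.Get("User-Agent")
	s := realStats()
	for _, bad := range disallowedAgents {
		if strings.Contains(ua, bad) {
			s = randomStats() // they ignored robots.txt, so nothing they get is trustworthy
			break
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s)
}

func main() {
	http.HandleFunc("/nodeinfo/2.0", fairplayStats)
	http.ListenAndServe(":8080", nil)
}
```

The matching could of course reuse whatever robots.txt rules the server already maintains, rather than a separate list like this.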
While I understand there are concerns around the validity of statistics with this feature, it seems that those with these concerns are not recognizing that the crawler site stats are not at all accurate anyway. fedidb.org is always missing instances when I've searched, and fediverse.observer is as well. It's limited by the nature of federation; it has to be aware of an instance to even crawl it, so obviously it cannot and does not have every single instance for every AP software in its database either. The stats they both provide are good for getting an idea of activity, but to tout them as fact or use them as proof of anything is careless at best. I also strongly agree with @moan0s here:
Why should GtS be coded to play nice with tools that aren't respecting something as simple as nobot? Why does the notion of "respectability" fall only on GtS when just looking at an instance's logs shows that there are tons of crawlers with no identifying information or announced purpose scraping them all day every day? Enhancing this feature to add more options is great for admins to have better control over their data, but the truth of the matter is that this would all be a non-issue if crawlers respected robots.txt in the first place.
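For reference, the polite request in question is only a couple of lines of robots.txt; something along these lines (the paths here are illustrative, GtS ships its own rules):

```
# Ask crawlers to stay away from the stats endpoints.
User-agent: *
Disallow: /nodeinfo/
Disallow: /.well-known/nodeinfo
```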
I understand the concerns about poisoning data sets, given that some people rely on those -- for better or worse -- to make claims about "the fediverse" as a coherent, mappable entity, and to promote it to folks who are invested in things like user counts, activity counts, the network effect, etc.

I think this issue is interesting (and it's had me pondering all weekend) because it highlights some fundamental ideological tensions between different groups of people deploying and using ActivityPub software.

On the one hand you have people who are interested in shining light into every corner of the wider federating network, and making an overview or "map" of it in order to make it knowable, visualize it in charts, and "sell" it to others who don't know about it yet. In other words, this group of folks uses ontological claims about "the fediverse" as a basis for its relevance, so statements like "there is x number of users on this software" / "this instance has x number of users" / "the fediverse has x million users on it, so it's worthy of consideration by outside interests, and by users from other platforms". Such folks are often oriented towards keeping track of growth and activity, I think at least partially because those are drilled into all of us by capitalism as being inherently good things to work towards. Many folks have a dream that everyone should move from other platforms to the fediverse, and take pleasure in pointing at graphs going upwards, which indicate that this dream is on its way to becoming true.

On the other hand you have people who recognize "the fediverse" as a shibboleth, and either don't really care about numbers and statistics, or are actively opposed to the activity of creating a "map" of the federating network, because they recognize (instinctively?) that there's a danger in being seen and known -- mapping a space is, after all, the first step towards taking control over it. These are folks who don't post on public much, they usually don't have huge follower/following counts, and they mostly use their AP instance to hang out and crack jokes with their friends, not to participate in "public square" conversations. Many such folks run allowlist instances to separate themselves entirely from the gaze of the wider network. To such folks, trolling scrapers and crawlers is not a controversial thing to do, because they would prefer to opt-out of being scraped and crawled entirely, and if they can't opt out (because crawlers ignore directives not to crawl), they can at least provide comically ridiculous numbers or no numbers at all.

Software like GoToSocial tends to appeal to this latter group of folks because it provides lots of granular privacy controls and makes no bones about being opposed to opt-out discoverability features and that sort of thing. I think this issue (and the PR that spawned it) reflects a clash in expectations between these two groups about the fundamental nature of what "the fediverse" means, and that is not something that will be resolved in this issue :P

I would like to point out a few things here that informed the creation of the original PR, and led to me opening this followup issue:
I'm still pondering what the right course of action forward is here -- whether to remove this feature now that it's served its purpose, or whether to change it as suggested above so that admins can choose to just show 0 for everything. One thing I'm considering is always showing stats linked in
To my knowledge, not even the crawler bundled with Mastodon for getting instance stats respects robots.txt. I'd argue that respecting robots.txt is no longer the norm, as it's a lot of overhead when you want to do a single HTTP request that is almost as cheap as serving the robots.txt file itself, for both server and client. I don't think the data poisoning mode achieves any of your own stated ideological goals of reducing visibility (of your own profile), and it makes it harder to develop reliable applications on the fediverse for everybody else. I don't think "it's already not reliable" is a valid argument, because reliability of overall stats is a spectrum, and GTS making data poisoning this easy definitely pushes the entire network far towards one end. As a user of GTS, I fear that hostile and, to be honest, petty actions like these will cause application developers to special-case GTS further and focus on Mastodon only, as is already the case with a bunch of client software. Even though it's off by default and off on my instance.
Just to clarify, I don't consider other federating instances to be crawlers if they're just gathering information necessary for federation: that would be an obviously ridiculous and contradictory stance, as the fedi AP endpoints are also politely requested not to be crawled in robots.txt. I don't think anyone could sensibly make an argument that other AP instances need to parse robots.txt in order to federate. GoToSocial itself certainly doesn't do this. Mastodon doesn't have a crawler bundled with it afaik, it fetches stuff in response to user requests, which is distinct from scraping (EDIT: I am wrong about this, Mastodon is bundled with a CLI command that can crawl instances). Moreover, this should have no effect on client application developers, as they don't need to be looking at nodeinfo for anything apart from possibly identifying the server software the user is attempting to use the client to sign in with, in order to do feature gating. Probably a few clients do this but afaik they don't use the stats for anything.
I must add I find it very interesting throughout this whole conversation that the onus has been put on GoToSocial to behave and to be respectable for the sake of the abstract concept of the fediverse, and not on crawlers to behave and to be respectable, even though they are already breaking basic rules by compiling statistics. I think there are some really fundamental assumptions being made about respectability which are interesting to interrogate.
@tsmethurst to be clear, I myself suffered through a lot of crawler abuse, both as a fediverse user and as an operator of other public services. to me it's not about who to put the onus on or whether GTS is the only bearer of moral responsibilities, I just prefer to reason with the parties that can be reasoned with. there are many crawlers out there and the worst of them don't have a point of contact. forcing them to behave one by one is, to me, a losing proposition. and mainly I disagree with the way it's being done here, not trying to single out GTS as the worst actor on the fediverse.

@untitaker Got you, thanks for the clarity, that makes sense :)

with regard to "the official mastodon crawler", I mean this thing: https://github.com/mastodon/mastodon/blob/a85a9f98d97991edce7fff744a7b131471e8291a/lib/mastodon/cli/domains.rb#L104

Oh wow, that's wild, I didn't even know that existed. Thanks for linking.
From a technical standpoint, connection pooling/re-use should really mitigate this regardless. There is also the
Yes, this. The onus is clearly on the operator of such a service to behave. I am a GtS admin who is totally happy to have my instance stats public and even crawled for some purposes - I request AI bots to go away, but not search engines or fedi indexers! But also, it's clearly the individual admin's right, and I wholeheartedly support the option of randomized (or zero'd) stats.
i don't feel like my concerns are being addressed. this is not about crawlers: if it were, you could just add a user agent rule to serve stupid numbers to fedidb or fedi observer. also i think my point is being misrepresented: i don't feel like i belong to the "line goes up" crowd, and i have many times argued that fedi can't compete with centralized architectures.
definitely, so why are you penalizing other AP instances to "hurt" these crawlers? I claim that my concern is reasonable: I don't host a crawler and I'm getting hit by this change. Does my software deserve it?
of course numbers aren't precise. an error range of ±100 users/posts is fine. GtS right now is throwing numbers in the millions, and every GtS instance is one config option away from doing that. this isn't just "unreliable stats", this is GtS trolling other fedi instances (and i'm using the exact word tsmethurst picked).
sorry, but to avoid serving a static document you expect crawlers to fetch another static document from you first? really, can't GtS just serve a blank nodeinfo, with just the software version and no stats numbers, and cache that? I'd argue that serving such a static nodeinfo is just as much load as serving a robots.txt telling crawlers NOT to go to nodeinfo.
This is just false: I made it very clear in my first post. Any fedi software willing to be as compatible as possible with the whole network has to consider which software the remote instance is running, and currently the only way to see that is nodeinfo. Also, some clients will get nodeinfo to know which features they can use.
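to illustrate, this is roughly the two-step lookup any compatible software has to do (a sketch, not my actual code; only `software.name` is read, the stats aren't needed at all):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// wellKnown mirrors the /.well-known/nodeinfo discovery document.
type wellKnown struct {
	Links []struct {
		Rel  string `json:"rel"`
		Href string `json:"href"`
	} `json:"links"`
}

// nodeInfo holds the only part of the nodeinfo document needed here.
type nodeInfo struct {
	Software struct {
		Name    string `json:"name"`
		Version string `json:"version"`
	} `json:"software"`
}

// softwareName resolves which server software a host runs.
func softwareName(host string) (string, error) {
	resp, err := http.Get("https://" + host + "/.well-known/nodeinfo")
	if err != nil {
		return "", err
	}
	var wk wellKnown
	err = json.NewDecoder(resp.Body).Decode(&wk)
	resp.Body.Close()
	if err != nil {
		return "", err
	}
	for _, l := range wk.Links {
		// Any linked schema version works for reading software.name.
		doc, err := http.Get(l.Href)
		if err != nil {
			continue
		}
		var ni nodeInfo
		err = json.NewDecoder(doc.Body).Decode(&ni)
		doc.Body.Close()
		if err != nil || ni.Software.Name == "" {
			continue
		}
		return ni.Software.Name, nil
	}
	return "", fmt.Errorf("no nodeinfo document found for %s", host)
}

func main() {
	name, err := softwareName("example.org")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("remote software:", name) // e.g. "gotosocial"
}
```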
"an eye for an eye" is not a great way to reason. is my software behaving unfairly with GtS? why am I being hit with this? GtS is behaving badly right now, and I think it's fair to open an issue with the involved software itself. I have no issue with crawlers getting my nodeinfo: if you don't like that you should mail fedidb or fediobserver, trolling every AP instance to get back at them feels... misguided this really feels like a super petty way to deal with crawlers: "oh you won't stop crawling me? fine, have junk numbers!!!". except you are hitting bystanders. once again: what's the point of this?
I once again stress: this feature is misguided, and GtS trolling the network is a very bad look for the project.
User agent rules are completely inadequate for the use case. There are active crawlers copying legitimate user agents as it is - I've seen Mastodon, Chrome, and Tusky UAs hitting honeypots that I know no legitimate user interacts with (a deprecated Pleroma endpoint that at least one badly written scraper checked first). Easier to block IP ranges of known bad actors... which will also catch plenty of legitimate users on VPNs (like my wife, for instance).
I think the general problem here is that the said parties are not really cooperative, and new, non-cooperative ones keep cropping up from season to season. I don't really think it's fair to require GtS devs and admins to keep track of these people, communicate with each of them, and suck it up if they refuse to cooperate. It's also an unnecessary burden for anybody who maintains these supposed services to handle each case promptly and fairly. If they won't honour robots.txt, there is little reason to expect them to honour anything else.

As for the matter of requesting robots.txt first: it is a small static document, and clients can cache it.

Lastly, any willing administrator of any available fediverse server can easily patch their software to serve whatever bogus data they want. All that's been done here is to include an implementation of this in this one piece of software as an optional, default-off, built-in feature. It's not even the first time this has happened, I think; if I am not mistaken, people have modified Mastodon in the past to act similarly.

The fact that these "services" just pick up self-reported numbers and dump them into their calculations, without reasonable assertions and without any way to flag falsification and improbable numbers beyond someone noticing it, is... concerning, to say the least. It goes to show that the numbers they publish are unreliable already, from a statistical standpoint. Uncurated, automatically gathered numbers are not data; not all collections of numbers are data.

[Sorry for deleting and reposting, I accidentally hit Ctrl+Enter.]
"Hurt," "deserve," and "getting hit" are very emotionally loaded terms frankly. On a purely technical level I'd say that if you are serving self-reported remote instance user numbers to your software's users in anything but a purely advisory way, eg., "this is how many users the instance says it has, which may not be accurate," then that is not really a good idea. As I wrote above, it's already very possible for people to fake stats, and it has happened multiple times over the years that some instance admin says "hey I'm going to mess with these numbers for fun" (see: hellsite.site). I totally concede that the feature that spawned this issue makes it easier for instance admins to do that, though, but it's also not even in a release yet and we're already considering how to temper it, which is why this issue was raised in the first place. Incidentally, I'm also not aware of any other AP softwares that use those nodeinfo numbers for anything. Are there others that you know of? I'd be interested to know. This isn't meant as a "gotcha," I am genuinely curious because I actually don't know.
robots.txt has been a de facto standard for over 30 years, and its behavior is now formalized in RFC 9309; we at GtS didn't decide how robots.txt should work, we just implement it.
It's not false.
Also not really true. The whole point of a protocol is that you shouldn't require working around edge cases from software to software, because the "rules" for how to transmit information are there in the protocol document that everyone has access to[^1]. For example, GoToSocial federates with a whole bunch of different ActivityPub-capable servers and we don't check what kind of software we're federating with; it's not relevant to us, as we do all of our server-to-server communication via signed ActivityPub calls, webfinger requests, and GETs to media links (I may be missing a few non-essential functions off the top of my head). We do try to fetch /api/v1/instance or nodeinfo to get the instance's description of itself, and we store a version number for instances we know about, but we don't even use that for anything, and should probably just remove it as it's wasting lots of space. We don't fail to federate if we can't fetch an instance's nodeinfo endpoint, because that doesn't make any sense to do.
Again, this is a misrepresentation of what we're doing here, because as I wrote above, "every AP instance" does not use those nodeinfo numbers for something, and I think/suspect that very few do (yours is the only one I know of). Yes, I used the word "trolling" above but I specifically said: "To [some] folks, trolling scrapers and crawlers is not a controversial thing to do." That is, trolling scrapers and crawlers! We can argue semantics about what constitutes a scraper or a crawler but I think most people would agree that other federating servers are not those things, because other fedi servers make requests in response to user actions, and do not do automated scrapes/crawls.
I'm sorry you feel that way, but I have been pondering this all day and discussing it with other devs, and in this issue here, and on the fedi also (https://gts.superseriousbusiness.org/@dumpsterqueer/statuses/01JK5PXHCHWP58172Y40QKE2ZZ), and we (the GtS devs) are trying to figure out a way to make the feature still useful, but less chaotic. So please be assured that it is being worked on and likely won't end up in a final release in its current form.

[^1]: Whether or not the protocol document has been successful, in ActivityPub's case, at getting everyone on the same page is another matter. But you really don't need to check which software you're federating with in order to parse incoming AP data, if you have sensible fallbacks.
In discussion of #3718 it was pointed out that admins may wish to just serve 0 for all stats instead of serving accurate stats or random stats.
We could do this by changing `instance-stats-randomize` to something like `instance-stats-mode` and allowing multiple options like, I don't know, `serve`, `zero`, or `random`, with `serve` being the default as it is currently.
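In config file terms the proposal might look something like this (option and value names are just the ones floated above, nothing final):

```yaml
# Would replace the boolean instance-stats-randomize flag.
# "serve"  -> report real stats (the current default behaviour)
# "zero"   -> report 0 for all stats
# "random" -> report randomized stats
instance-stats-mode: "serve"
```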