The case of DuckAssistBot: real-time vs LLM bots #53
Comments
Very interesting. According to What is DuckAssist on DuckDuckGo Search?:
My understanding, which may be wrong, is that DuckAssistBot's scrapings are used by an AI model ("AI-powered natural language technology") in generating answers to search queries, but the scrapings are not themselves used to train an AI model. Does DuckAssistBot fall into the class of crawlers we want to block?

According to our FAQ, we block mainly because of copyright abuse. (We also mention excessive load on crawled sites, but I would question this, since any other crawler could potentially be guilty of it and I wouldn't want our scope to creep to all crawlers.) I would add, and maybe this also needs to go in the FAQ, the excessive environmental impact of training LLMs, due to very high electricity consumption (not all of which is green energy) and the increase in e-waste from LLM training hardware.

I'm not sure that DuckAssistBot is much worse than a normal search engine index in terms of copyright abuse, and if there is no LLM training using the crawled data, there shouldn't be excessive environmental impact. So I'd personally be happy to see DuckAssistBot removed from our list of crawlers.
I deleted the point about excessive load on crawled sites as any other crawler could potentially be guilty of this and I wouldn't want our scope to creep to all crawlers. Ref: ai-robots-txt#53 (comment)
Since it adds source links, I vote for leaving this one off the list.
Note that DuckAssistBot came from Dark Visitors. If we decide not to include it, we'll have to change our logic that ensures the whole Dark Visitors AI assistants list is rolled into ours. We could either omit AI assistants or add further logic to filter out DuckAssistBot. /cc @cdransf
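For illustration, the "further logic to filter out DuckAssistBot" option might look roughly like the sketch below. It is not the repository's actual sync code: the function name `merge_assistants`, the `EXCLUDED_AGENTS` set, and the shape of the entries are all assumptions made for the example.

```python
# Hypothetical sketch: names and data shapes are assumptions,
# not the project's real Dark Visitors sync logic.

# Agents we have decided not to list, even if the upstream source lists them.
EXCLUDED_AGENTS = {"DuckAssistBot"}


def merge_assistants(existing, upstream_assistants, excluded=EXCLUDED_AGENTS):
    """Roll upstream AI-assistant entries into our list, skipping excluded agents.

    Both arguments are assumed to be dicts mapping a user-agent name to a
    dict of metadata. Entries we already have are kept unchanged.
    """
    # Drop excluded agents that may already be present in our own list.
    merged = {name: info for name, info in existing.items() if name not in excluded}
    for name, info in upstream_assistants.items():
        if name in excluded:
            continue  # deliberately left off the list
        merged.setdefault(name, info)  # do not overwrite our own metadata
    return merged
```

An explicit exclusion set keeps the rest of the upstream assistants list flowing in automatically, at the cost of having to maintain that set by hand.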
I'm comfortable including this: it still contributes to the behavior at issue, if perhaps less directly. AI-driven search has been demonstrated to be unreliable and still requires data that's often not clearly sourced.
Not sure we've covered that in the FAQ "Why should we block these crawlers?" If we're going to keep DuckAssistBot in the list, please could you propose an extension to that FAQ? (I personally think that would be scope creep, but this is your repo, so your opinion should count more than mine.)
@technoogies claims DDG adds source links. If that's the case, what's not clearly sourced about the data gathered by DuckAssistBot?
Or we could offer two robots.txt files: one that covers every AI-related crawler and another one focussed on LLM bots. That way, we're putting the choice into users' hands and giving them more granular control over what to block, rather than imposing one single interpretation of 'bad' AI bots. Having two files also makes it easy for users to automate their own flows by pulling from the respective file.
I like that solution, though it may require some manual intervention if the workflow that checks Dark Visitors is adding them without that distinction in mind.
I think that proposal would work OK for crawlers listed by Dark Visitors, so long as the two files were split by Dark Visitors classifications. But what about other contributions made directly to this site? We'd need to classify those crawlers too. This could be solved by adding another field to the entries in robots.json. We'd need to be able to map Dark Visitors classifications to that field, so the values of that field would need careful thought.
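As a rough sketch of how such a field could drive the two-file proposal, assume each robots.json entry carries either an explicit `scope` value or a Dark Visitors category that maps to one. The field names `scope` and `dark_visitors_category`, the scope values, the category labels in the mapping, and the output file name `robots-llm.txt` are all hypothetical and would need the careful thought mentioned above.

```python
# Hypothetical sketch: field names, scope values, category labels and
# file names are assumptions for illustration only.

# Assumed mapping from upstream category labels to our own "scope" field.
CATEGORY_TO_SCOPE = {
    "AI Data Scraper": "llm-training",
    "AI Assistant": "real-time",
    "AI Search Crawler": "real-time",
}


def scope_for(entry):
    """Return the entry's scope: an explicit value wins, else map the category."""
    if "scope" in entry:
        return entry["scope"]
    return CATEGORY_TO_SCOPE.get(entry.get("dark_visitors_category", ""), "unclassified")


def render_robots_txt(agents):
    """Render one block that disallows everything for the given user agents."""
    if not agents:
        return ""
    lines = [f"User-agent: {name}" for name in sorted(agents)]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def build_files(robots):
    """Build both variants: every listed crawler, and LLM-training crawlers only."""
    llm_only = [name for name, entry in robots.items() if scope_for(entry) == "llm-training"]
    return {
        "robots.txt": render_robots_txt(list(robots)),
        "robots-llm.txt": render_robots_txt(llm_only),
    }
```

With this shape, a crawler contributed directly to the repository only needs a `scope` value, while crawlers pulled from Dark Visitors are classified through the mapping. Entries that match neither rule fall back to `unclassified` and appear only in the broad file.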
I think DuckAssistBot is a good test case for where we want to draw the line between AI crawlers and other crawlers.
The README currently says
If it’s the training of LLMs that’s the problem, DuckDuckGo shouldn’t make the list since it states that ‘This data is not used in any way to train AI models.’ https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
If AI assistants that generate real-time answers should also be blocked, then we can probably improve our docs to make it clear that any crawler associated with AI use is within our scope, even when it is not being used to train LLMs.