
The case of DuckAssistBot: real-time vs LLM bots #53

Open
nisbet-hubbard opened this issue Nov 9, 2024 · 8 comments

@nisbet-hubbard
Contributor

I think DuckAssistBot is a good test case for where we want to draw the line between AI crawlers and other crawlers.

The README currently says

> This is an open list of web crawlers associated with AI companies and the training of LLMs to block.

If it’s the training of LLMs that’s the problem, DuckAssistBot shouldn’t make the list, since DuckDuckGo states that ‘This data is not used in any way to train AI models.’ https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/

If AI assistants that generate real-time answers should also be blocked, then we can probably improve our docs to make it clear that any crawler associated with AI use is within our scope even when they’re not being used to train LLMs.

@glyn
Contributor

glyn commented Nov 9, 2024

Very interesting. According to What is DuckAssist on DuckDuckGo Search?:

> DuckAssist is an optional feature in our search results that can anonymously generate answers to search queries. To do this, it scans the web for relevant content and then uses AI-powered natural language technology to generate a brief answer based on relevant information found.

My understanding, which may be wrong, is that DuckAssistBot's scrapings are used by an AI model ("AI-powered natural language technology") in generating answers to search queries, but the scrapings are not themselves used to train an AI model.

Does DuckAssistBot fall into the class of crawlers we want to block? According to our FAQ, we block mainly because of copyright abuse. (We also mention excessive load on crawled sites, but I would query this as any other crawler could potentially be guilty of this and I wouldn't want our scope to creep to all crawlers.)

I would add -- and maybe this also needs to go in the FAQ -- the excessive environmental impact of training LLMs, due to very high electricity consumption (not all of which is green energy) and the increase in e-waste from LLM training hardware.

I'm not sure that DuckAssistBot is much worse than a normal search engine index in terms of copyright abuse and if there is no LLM training using the crawled data, there shouldn't be excessive environmental impact. So I'd personally be happy to see DuckAssistBot removed from our list of crawlers.

glyn added a commit to glyn/ai.robots.txt that referenced this issue Nov 9, 2024
I deleted the point about excessive load on
crawled sites as any other crawler could potentially
be guilty of this and I wouldn't want our scope to
creep to all crawlers.

Ref: ai-robots-txt#53 (comment)
@technoogies

Since it adds source links, I vote for leaving this one off of the list.
Just my take on it.

@glyn
Contributor

glyn commented Nov 9, 2024

Note that DuckAssistBot came from Dark Visitors. If we decide not to include it, we'll have to change our logic that ensures the whole Dark Visitors AI assistants list is rolled into ours. We could either omit AI assistants or add further logic to filter out DuckAssistBot.

/cc @cdransf
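
The "further logic to filter out DuckAssistBot" glyn mentions could be sketched roughly as below. This is a hypothetical illustration, not the repo's actual merge code: the function and variable names are assumptions.

```python
# Hypothetical sketch: merge the Dark Visitors AI-assistants list into
# our combined crawler dict while skipping specific excluded agents.
# Structure and names are assumptions for illustration only.

EXCLUDED_AGENTS = {"DuckAssistBot"}

def merge_dark_visitors(robots, dark_visitors_assistants):
    """Add Dark Visitors AI-assistant entries, skipping excluded agents.

    Existing entries in `robots` are preserved (setdefault), so a
    hand-maintained entry is never overwritten by the rollup.
    """
    for agent, info in dark_visitors_assistants.items():
        if agent in EXCLUDED_AGENTS:
            continue
        robots.setdefault(agent, info)
    return robots
```

The alternative (omitting AI assistants entirely) would just skip the call for that category.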

@cdransf
Member

cdransf commented Nov 9, 2024

I'm comfortable including this — it still contributes to the behavior at issue if perhaps less directly. AI driven search has been demonstrated to be unreliable and still requires data that's often not clearly sourced.

@glyn
Contributor

glyn commented Nov 10, 2024

> I'm comfortable including this — it still contributes to the behavior at issue if perhaps less directly. AI driven search has been demonstrated to be unreliable

Not sure we've covered that in the FAQ "Why should we block these crawlers?" If we're going to keep DuckAssistBot in the list, please could you propose an extension to that FAQ? (I personally think that would be scope creep, but this is your repo, so your opinion should count more than mine.)

> and still requires data that's often not clearly sourced.

@technoogies claims DDG adds source links. If that's the case, what's not clearly sourced about the data gathered by DuckAssistBot?

@nisbet-hubbard
Contributor Author

Or we could offer two robots.txt files: one that covers every AI-related crawler and another focussed on LLM bots.

That way, we're putting the choice into users' hands and giving them more granular control over what to block, rather than imposing one single interpretation of 'bad' AI bots.

Having two files also makes it easy for users to automate their own flows by pulling from the respective file.
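
The two-file idea could be sketched along these lines, assuming each robots.json entry gains a classification field (the field name "category" and its values are assumptions, not an agreed scheme):

```python
# Hypothetical sketch: generate two robots.txt variants from a
# robots.json whose entries carry a "category" field. The field name
# and category values are illustrative assumptions.
import json

def render_robots(agents):
    """Render one User-agent line per crawler, then a blanket Disallow."""
    lines = [f"User-agent: {a}" for a in sorted(agents)]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

def generate(robots_json):
    """Return (all-AI robots.txt, LLM-training-only robots.txt)."""
    entries = json.loads(robots_json)
    all_ai = list(entries)
    llm_only = [name for name, e in entries.items()
                if e.get("category") == "llm-training"]
    return render_robots(all_ai), render_robots(llm_only)
```

Users could then pull whichever file matches their interpretation of 'bad' AI bots.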

@cdransf
Member

cdransf commented Nov 11, 2024

I like that solution — though it may require some manual intervention if the workflow that checks Dark Visitors is adding them without that distinction in mind.

@glyn
Contributor

glyn commented Nov 12, 2024

I think that proposal would work ok for crawlers listed by Dark Visitors, so long as the two files were split by Dark Visitors classifications. But what about other contributions made directly to this repo? We'd need to classify those crawlers too. This could be solved by adding another field to the entries in robots.json. We'd need to be able to map Dark Visitors classifications to that field, so the values of that field would need careful thought.
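
One possible shape for that mapping, purely as a starting point for the "careful thought" needed: both the field name and the category values below are invented for illustration, and the Dark Visitors type strings shown are assumptions rather than a confirmed list.

```python
# Hypothetical sketch: map Dark Visitors agent types onto a new
# per-entry classification field in robots.json. All names and values
# here are illustrative assumptions.
DARK_VISITORS_TO_CATEGORY = {
    "AI Data Scraper": "llm-training",
    "AI Assistant": "ai-assistant",
    "AI Search Crawler": "ai-search",
}

def classify(dv_type):
    # Default unknown types to a catch-all category so new Dark
    # Visitors classifications are never silently dropped from the
    # combined (all-AI) list.
    return DARK_VISITORS_TO_CATEGORY.get(dv_type, "ai-other")
```

Direct contributions would set the same field by hand, keeping both sources consistent.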
