
The case of DuckAssistBot: real-time vs LLM bots #53

Open
nisbet-hubbard opened this issue Nov 9, 2024 · 8 comments

@nisbet-hubbard
Contributor

I think DuckAssistBot is a good test case for where we want to draw the line between AI crawlers and other crawlers.

The README currently says

> This is an open list of web crawlers associated with AI companies and the training of LLMs to block.

If it’s the training of LLMs that’s the problem, DuckAssistBot shouldn’t make the list, since DuckDuckGo states that ‘This data is not used in any way to train AI models.’ https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/

If AI assistants that generate real-time answers should also be blocked, then we can probably improve our docs to make it clear that any crawler associated with AI use is within our scope even when they’re not being used to train LLMs.

@glyn
Contributor

glyn commented Nov 9, 2024

Very interesting. According to What is DuckAssist on DuckDuckGo Search?:

> DuckAssist is an optional feature in our search results that can anonymously generate answers to search queries. To do this, it scans the web for relevant content and then uses AI-powered natural language technology to generate a brief answer based on relevant information found.

My understanding, which may be wrong, is that DuckAssistBot's scrapings are used by an AI model ("AI-powered natural language technology") in generating answers to search queries, but the scrapings are not themselves used to train an AI model.

Does DuckAssistBot fall into the class of crawlers we want to block? According to our FAQ, we block mainly because of copyright abuse. (We also mention excessive load on crawled sites, but I would query this as any other crawler could potentially be guilty of this and I wouldn't want our scope to creep to all crawlers.)

I would add -- and maybe this also needs to go in the FAQ -- the excessive environmental impact of training LLMs, due to very high electricity consumption (not all of which is green energy) and the increase in e-waste from LLM training hardware.

I'm not sure that DuckAssistBot is much worse than a normal search engine index in terms of copyright abuse and if there is no LLM training using the crawled data, there shouldn't be excessive environmental impact. So I'd personally be happy to see DuckAssistBot removed from our list of crawlers.

glyn added a commit to glyn/ai.robots.txt that referenced this issue Nov 9, 2024
I deleted the point about excessive load on
crawled sites as any other crawler could potentially
be guilty of this and I wouldn't want our scope to
creep to all crawlers.

Ref: ai-robots-txt#53 (comment)
@technoogies

Since it adds source links, I vote for leaving this one off of the list.
Just my take on it.

@glyn
Contributor

glyn commented Nov 9, 2024

Note that DuckAssistBot came from Dark Visitors. If we decide not to include it, we'll have to change our logic that ensures the whole Dark Visitors AI assistants list is rolled into ours. We could either omit AI assistants or add further logic to filter out DuckAssistBot.

/cc @cdransf
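
The "further logic to filter out DuckAssistBot" glyn mentions could be sketched roughly as below. This is a hypothetical illustration, not the repo's actual merge code: the function and variable names are assumptions.

```python
# Hypothetical sketch: merge the Dark Visitors AI-assistants list into
# our combined crawler dict while skipping specific excluded agents.
# Structure and names are assumptions for illustration only.

EXCLUDED_AGENTS = {"DuckAssistBot"}

def merge_dark_visitors(robots, dark_visitors_assistants):
    """Add Dark Visitors AI-assistant entries, skipping excluded agents.

    Existing entries in `robots` are preserved (setdefault), so a
    hand-maintained entry is never overwritten by the rollup.
    """
    for agent, info in dark_visitors_assistants.items():
        if agent in EXCLUDED_AGENTS:
            continue
        robots.setdefault(agent, info)
    return robots
```

The alternative (omitting AI assistants entirely) would just skip the call for that category.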

@cdransf
Member

cdransf commented Nov 9, 2024

I'm comfortable including this — it still contributes to the behavior at issue if perhaps less directly. AI driven search has been demonstrated to be unreliable and still requires data that's often not clearly sourced.

@glyn
Contributor

glyn commented Nov 10, 2024

> I'm comfortable including this — it still contributes to the behavior at issue if perhaps less directly. AI driven search has been demonstrated to be unreliable

Not sure we've covered that in the FAQ "Why should we block these crawlers?" If we're going to keep DuckAssistBot in the list, please could you propose an extension to that FAQ? (I personally think that would be scope creep, but this is your repo, so your opinion should count more than mine.)

> and still requires data that's often not clearly sourced.

@technoogies claims DDG adds source links. If that's the case, what's not clearly sourced about the data gathered by DuckAssistBot?

@nisbet-hubbard
Contributor Author

Or we could offer two robots.txt files: one that covers every AI-related crawler and another focussed on LLM bots.

That way, we're putting the choice into users' hands and giving them more granular control over what to block, rather than imposing one single interpretation of 'bad' AI bots.

Having two files also makes it easy for users to automate their own flows by pulling from the respective file.
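
The two-file idea could be sketched along these lines, assuming each robots.json entry gains a classification field (the field name "category" and its values are assumptions, not an agreed scheme):

```python
# Hypothetical sketch: generate two robots.txt variants from a
# robots.json whose entries carry a "category" field. The field name
# and category values are illustrative assumptions.
import json

def render_robots(agents):
    """Render one User-agent line per crawler, then a blanket Disallow."""
    lines = [f"User-agent: {a}" for a in sorted(agents)]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

def generate(robots_json):
    """Return (all-AI robots.txt, LLM-training-only robots.txt)."""
    entries = json.loads(robots_json)
    all_ai = list(entries)
    llm_only = [name for name, e in entries.items()
                if e.get("category") == "llm-training"]
    return render_robots(all_ai), render_robots(llm_only)
```

Users could then pull whichever file matches their interpretation of 'bad' AI bots.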

@cdransf
Member

cdransf commented Nov 11, 2024

I like that solution — though it may require some manual intervention if the workflow that checks Dark Visitors is adding them without that distinction in mind.

@glyn
Contributor

glyn commented Nov 12, 2024

I think that proposal would work ok for crawlers listed by Dark Visitors, so long as the two files were split by Dark Visitors classifications. But what about other contributions made directly to this repo? We'd need to classify those crawlers too. This could be solved by adding another field to the entries in robots.json. We'd need to be able to map Dark Visitors classifications to that field, so the values of that field would need careful thought.
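
One possible shape for that mapping, purely as a starting point for the "careful thought" needed: both the field name and the category values below are invented for illustration, and the Dark Visitors type strings shown are assumptions rather than a confirmed list.

```python
# Hypothetical sketch: map Dark Visitors agent types onto a new
# per-entry classification field in robots.json. All names and values
# here are illustrative assumptions.
DARK_VISITORS_TO_CATEGORY = {
    "AI Data Scraper": "llm-training",
    "AI Assistant": "ai-assistant",
    "AI Search Crawler": "ai-search",
}

def classify(dv_type):
    # Default unknown types to a catch-all category so new Dark
    # Visitors classifications are never silently dropped from the
    # combined (all-AI) list.
    return DARK_VISITORS_TO_CATEGORY.get(dv_type, "ai-other")
```

Direct contributions would set the same field by hand, keeping both sources consistent.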
