Target Audience & AI Accuracy - How will a junior developer be able to fact-check AI-generated responses? #411
-
My main concern is that AI Help does not have a place in technical documentation. Yes, in theory it could help out a few people, but the target audience it seems to aim for (new developers, or someone unfamiliar with the concept they are trying to learn about), coupled with our current understanding and research about LLMs (in a nutshell: they can confidently present inaccurate information), seems to be a hugely concerning mismatch. You need someone to fact-check the response from an LLM; a four-eyes principle is often applied to technical docs (one writer and at least one reviewer), and that is missing here. Therefore, there is a significantly increased risk that the LLM provides wrong information to someone not knowledgeable enough about the subject to tell whether the AI is confidently providing misinformation or is actually accurate. How does the team behind AI Explain hope to alleviate this concern, beyond plastering the user with warnings (which might be a hint that this is not a product-market fit)? (pasted from this comment: mdn/yari#9230 (comment))
-
We have recorded your question as number 20.
-
I'll probably end up doing a worse job explaining this than I did in the community call, but let's have a go. (Reading this bit again when giving it the once-over before submitting: ha, I've written a bit of an essay here, apologies, and buckle up!)

To answer the very direct question in the title:
AI Help provides sources at the end of its answer, always pointing to our reference documentation, which a developer can read and fact-check against the response provided. These sources are not generated by the LLM (they're the result of our embedding similarity search, concatenated onto the response), so they can't be hallucinated. There is, though, the possibility of improving the embedding model or approach, since we sometimes return irrelevant documentation (e.g. a question about "array with" returning an unrelated page).

There's also the possibility of building other concrete features to help developers out: something like expanding the integration we have between code blocks and the MDN Playground to cover the code blocks the LLM may output, which would allow inline verification of whether any of that code works at all, rather than trusting that a developer might try to run and debug it in isolation before incorporating it into their own code.
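To make that distinction concrete, here's a minimal sketch of that shape of flow, assuming precomputed page embeddings and a plain cosine-similarity search. All names here are my own illustration, not Yari's actual code: the point is that the sources list is produced by the search and appended after the model's text, so the model never gets a chance to invent it.

```typescript
// Illustrative sketch only: types and function names are hypothetical,
// not taken from the Yari codebase.

interface DocChunk {
  url: string;          // e.g. an MDN reference page
  title: string;
  embedding: number[];  // precomputed embedding of the page content
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pick the documentation pages most similar to the user's question.
function topSources(questionEmbedding: number[], index: DocChunk[], k = 3): DocChunk[] {
  return [...index]
    .sort((a, b) =>
      cosineSimilarity(questionEmbedding, b.embedding) -
      cosineSimilarity(questionEmbedding, a.embedding))
    .slice(0, k);
}

// The sources shown under an answer come from the search result above and
// are appended verbatim after the LLM's text -- the model never writes them.
function renderAnswer(llmAnswer: string, sources: DocChunk[]): string {
  const list = sources.map(s => `- [${s.title}](${s.url})`).join("\n");
  return `${llmAnswer}\n\nSources:\n${list}`;
}
```

The design choice that matters here is appending the links in code rather than asking the model to cite anything: the links can be irrelevant if the similarity search misfires, but they always point at real pages.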
This is the core of your question, though, and the above answers don't really address it. There's seemingly a bit of a paradox in this feature: it potentially outputs misinformation, more junior developers don't know enough to fact-check that, and if they did they'd be a more senior developer and probably not need the feature in the first place. I think that's a convincing argument, and I appreciate that you've distilled it into that form rather than just "LLMs are inaccurate and therefore bad", but I think there's still some nuance to the issue missing.

First of all, it assumes that everything a junior developer is already reading out there is accurate, and that's clearly not true. I think we've all read answers on discussion forums which are wrong or outdated. Junior developers may also be in, or have very recent experience of, peer learning groups where very wrong answers may be thrown around in search of the solution. So I reckon junior developers have a much greater ability to weed out incorrect information than we give them credit for, because they already do it so much.

Of course, you might say in response, "but users know inherently not to trust Stack Overflow answers or answers they receive from their peers, and approach them with a degree of suspicion or caution they won't have on a site with the reputation and prestige of MDN", and I think you might be right there. For those of us working on the site day in, day out, I think it's easy to forget that the architecture of the site isn't really clear to the outside world (nor does it need to be): so while it may be very apparent to us that this is a supplemental feature, doesn't affect the reference documentation, and should be held to a different standard and subject to more scepticism, it's not clear to everyone else. I (personally) think that's one of the areas where we clearly made a mistake, have improved a bit (by adding some warnings), and probably need to improve more (again, these are my very personal opinions): either with a UX which reflects that nature (I've referred to this as a "choose your own adventure" in another comment, in terms of explicitly opting into the summary-in-context step after the similarity search step), prompt engineering to make the answers come out in a less authoritative tone, or even something silly like giving the "assistant" a bit of a personality which doesn't scream "authority" (goofy-looking dinosaur or fox, anyone?).

Those previous examples of inaccurate sources of information also leave out the fact that it's not like we were the ones to open Pandora's box here. Developers are already using tools like ChatGPT to answer these questions, but they're potentially not prompting it in the best way: maybe they are asking for sources, but don't check whether they're hallucinated; maybe they do check, but it takes more time and they need to refine the question a bit before getting the answer. With "AI Help", because of the similarity search step and the prompt we use, the answers should, when the question is directly addressed in MDN's reference documentation, have a level of accuracy beyond that of "naive" use of ChatGPT. In that sense, if the alternative to this feature existing is that a developer just asks ChatGPT anyway, then the feature doesn't need to be perfect, just better. Think of it in terms of harm reduction.

Finally, I'd like to compare this feature to what we could possibly consider the "ideal" way for a less experienced developer to find an answer if they're stuck: just ask a senior developer.
Now, ignoring the fact that not everyone has the privilege of being able to do that: they might not be comfortable doing so, the senior developer might just be busy, or they might not exist at all; some people are learning to code in contexts where they just don't have access to anyone to ask questions. But ignoring that, how does "AI Help" compare to a senior developer? Second for second, quite well, I'd say! Unless you knew the answer off the top of your head (which you might), could you really give a response in the few seconds it takes AI Help to generate one? I wouldn't be able to. And if you did know the answer off the top of your head, when subject to really critical scrutiny, would it hold up and be guaranteed to contain absolutely no inaccuracies? Again, in my case: I can't say it would. There was one interesting case which was reported, where the explanation provided by AI Help to a question about […].

So I think that AI Help has been held to an unreasonably high standard overall: which is fine, we should strive to make it as accurate as possible, and the expectation setting was indeed a mistake on our part. But fundamentally, considering the speed of answer, the level of accuracy in most cases, and the ability to answer near-limitless variations of a question, I genuinely think it's a helpful feature with lots more potential. I'm sad I can't say that I was the one who built it; I only reviewed the PRs!

Thanks for your question, and your measured responses in the various issue threads: I enjoyed thinking about the answer to this one, and it led to a clarity of thought I hadn't had when reviewing the feature internally. This sort of feedback, questioning and discussion is exactly the kind we wanted to have when releasing the feature in beta, and exactly the kind which overall helps make the output of FOSS projects better. It's just a bit of a shame we had to wade through some personal nastiness (not from you) and very "gotcha" bad-faith argument to get here (again, not from you), so thanks for not being a part of that ugly side of FOSS contribution.

As a final note, this whole discussion has been based on the premise that this feature is only useful to a less experienced developer, as a more experienced developer will already know how to get directly to the answer quickly. I'm not sure that's true either, and I can give an example of where I found "AI Help" really helpful the other day: I was writing a regexp, realised it was matching greedily when I needed a non-greedy match, and just wanted the bit of syntax needed to do that. So I asked AI Help: [cropped screenshot of the AI Help answer]. I've cropped the image there because that's where I stopped reading: I had already appended […] (the greedy vs. non-greedy difference is illustrated in the short sketch at the end of this reply).

So I think, when approached with appropriate scepticism (which I guess we tend to have as more experienced developers, and which a more junior developer may not have in this context), these answers can be super helpful despite the inherent possibility of inaccuracies due to the fundamental nature of LLMs. The problem then is no longer the fact that those inaccuracies can exist (which we can never really solve for), but how we instil that scepticism into the users of this feature, and that's a very concrete, easier-to-work-on problem.
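For the curious, here's a minimal sketch of the greedy vs. non-greedy behaviour described in the regexp example above. It's my own illustrative snippet, not the original regexp or the AI Help output: the only point it relies on is that adding `?` after a quantifier makes it lazy.

```typescript
// Illustrative snippet only -- not the original regexp or the AI Help output.
// Greedy quantifiers match as much as possible; adding `?` makes them lazy.

const html = "<b>bold</b> and <i>italic</i>";

// Greedy: `.*` runs to the last `>`, swallowing everything in between.
const greedy = html.match(/<.*>/);
console.log(greedy?.[0]); // "<b>bold</b> and <i>italic</i>"

// Non-greedy (lazy): `.*?` stops at the first `>` it can.
const lazy = html.match(/<.*?>/);
console.log(lazy?.[0]); // "<b>"
```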
-
@LeoMcA - thanks for the elaborate response. I've read through your answer and will let it sink in. I've marked it as the answer to the question for now. (There's a lot to digest, and time, like everything, is limited ;))