Target Audience & AI Accuracy - How will a junior developer be able to fact-check AI-generated responses? #411
-
My main concern is that AI Help does not have a place in technical documentation. Yes, in theory it could help out a few people, but the target audience it seems to aim for (new developers, or someone unfamiliar with the concept they are trying to learn about), coupled with our current understanding and research about LLMs (in a nutshell: they can confidently present inaccurate information), seems to be a hugely concerning mismatch. You need someone to fact-check the response from an LLM; a four-eyes principle is often applied to technical docs (one writer and at least one reviewer), and that is missing here. Therefore, there is a significantly increased risk that the LLM provides wrong information to someone not knowledgeable enough about the subject to tell whether the AI is confidently providing misinformation or is actually accurate. How does the team behind AI Explain hope to alleviate this concern, beyond plastering the user with warnings (which might be a hint that this is not a product-market fit)? (pasted from this comment: mdn/yari#9230 (comment))
-
We have recorded your question as number 20.
-
I'll probably end up doing a worse job explaining this than I did in the community call, but let's have a go. (Reading this bit again when giving it the once-over before submitting: ha, I've written a bit of an essay here, apologies, and buckle up!)

To answer the very direct question in the title:
AI Help provides sources at the end of its answer, always pointing to our reference documentation, which a developer can read and fact-check against the response provided. These sources are not generated by the LLM (they're the result of our embedding similarity search, concatenated onto the response), so they can't be hallucinated. There is, though, the possibility of improving the embedding model or approach, since we sometimes return irrelevant documentation (e.g. a question about "array with" returning an unrelated page).

There's also the possibility of building other concrete features to help developers out: something like expanding the integration we have between code blocks and the MDN Playground to cover the code blocks the LLM may output, which would allow inline verification of whether any of that code works at all, rather than trusting that a developer might try to run and debug it in isolation before incorporating it into their own code.
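To make that distinction concrete, here's a minimal sketch of that shape of flow, assuming precomputed page embeddings and a plain cosine-similarity search. All names here are my own illustration, not Yari's actual code: the point is that the sources list is produced by the search and appended after the model's text, so the model never gets a chance to invent it.

```typescript
// Illustrative sketch only: types and function names are hypothetical,
// not taken from the Yari codebase.

interface DocChunk {
  url: string;          // e.g. an MDN reference page
  title: string;
  embedding: number[];  // precomputed embedding of the page content
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pick the documentation pages most similar to the user's question.
function topSources(questionEmbedding: number[], index: DocChunk[], k = 3): DocChunk[] {
  return [...index]
    .sort((a, b) =>
      cosineSimilarity(questionEmbedding, b.embedding) -
      cosineSimilarity(questionEmbedding, a.embedding))
    .slice(0, k);
}

// The sources shown under an answer come from the search result above and
// are appended verbatim after the LLM's text -- the model never writes them.
function renderAnswer(llmAnswer: string, sources: DocChunk[]): string {
  const list = sources.map(s => `- [${s.title}](${s.url})`).join("\n");
  return `${llmAnswer}\n\nSources:\n${list}`;
}
```

The design choice that matters here is appending the links in code rather than asking the model to cite anything: the links can be irrelevant if the similarity search misfires, but they always point at real pages.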
This is the core of your question, though, and the above answers don't really address it. There's seemingly a bit of a paradox in this feature: it potentially outputs misinformation, more junior developers don't know enough to fact-check that, and if they did they'd be a more senior developer and probably not need the feature in the first place. I think that's a convincing argument, and I appreciate that you've distilled it into that form rather than just "LLMs are inaccurate and therefore bad", but I think there's still some nuance to the issue missing.

First of all, it assumes that everything a junior developer is already reading out there is accurate, and that's clearly not true. I think we've all read answers on discussion forums which are wrong or outdated. Junior developers may also be in, or have very recent experience of, peer learning groups where very wrong answers may be thrown around in search of the solution. So I reckon junior developers have a much greater ability to weed out incorrect information than we give them credit for, because they already do it so much.

Of course, you might say in response, "but users know inherently not to trust Stack Overflow answers or answers they receive from their peers, and approach them with a degree of suspicion or caution they won't have on a site with the reputation and prestige of MDN", and I think you might be right there. For those of us working on the site day in, day out, I think it's easy to forget that the architecture of the site isn't really clear to the outside world (nor does it need to be): so while it may be very apparent to us that this is a supplemental feature, doesn't affect the reference documentation, and should be held to a different standard and subject to more scepticism, it's not clear to everyone else. I (personally) think that's one of the areas where we clearly made a mistake, have improved a bit (by adding some warnings), and probably need to improve more (again, these are my very personal opinions): either with a UX which reflects that nature (I've referred to this as a "choose your own adventure" in another comment, in terms of explicitly opting into the summary-in-context step after the similarity search step), prompt engineering to make the answers come out in a less authoritative tone, or even something silly like giving the "assistant" a bit of a personality which doesn't scream "authority" (goofy-looking dinosaur or fox, anyone?).

Those previous examples of inaccurate sources of information also leave out the fact that it's not like we were the ones to open Pandora's box here. Developers are already using tools like ChatGPT to answer these questions, but they're potentially not prompting it in the best way: maybe they are asking for sources, but don't check whether they're hallucinated; maybe they do check, but it takes more time and they need to refine the question a bit before getting the answer. With "AI Help", because of the similarity search step and the prompt we use, the answers should, when the question is directly addressed in MDN's reference documentation, have a level of accuracy beyond that of "naive" use of ChatGPT. In that sense, if the alternative to this feature existing is that a developer just asks ChatGPT anyway, then the feature doesn't need to be perfect, just better. Think of it in terms of harm reduction.

Finally, I'd like to compare this feature to what we could possibly consider the "ideal" way for a less experienced developer to find an answer if they're stuck: just ask a senior developer.
Now, ignoring the fact that not everyone has the privilege of being able to do that: they might not be comfortable doing so, the senior developer might just be busy, or they might not exist at all; some people are learning to code in contexts where they just don't have access to anyone to ask questions. But ignoring that, how does "AI Help" compare to a senior developer? Second for second, quite well, I'd say! Unless you knew the answer off the top of your head (which you might), could you really give a response in the few seconds it takes AI Help to generate one? I wouldn't be able to. And if you did know the answer off the top of your head, when subject to really critical scrutiny, would it hold up and be guaranteed to contain absolutely no inaccuracies? Again, in my case: I can't say it would. There was one interesting case which was reported, where the explanation provided by AI Help to a question about […].

So I think that AI Help has been held to an unreasonably high standard overall: which is fine, we should strive to make it as accurate as possible, and the expectation setting was indeed a mistake on our part. But fundamentally, considering the speed of answer, the level of accuracy in most cases, and the ability to answer near-limitless variations of a question, I genuinely think it's a helpful feature with lots more potential. I'm sad I can't say that I was the one who built it; I only reviewed the PRs!

Thanks for your question, and your measured responses in the various issue threads: I enjoyed thinking about the answer to this one, and it led to a clarity of thought I hadn't had when reviewing the feature internally. This sort of feedback, questioning and discussion is exactly the kind we wanted to have when releasing the feature in beta, and exactly the kind which overall helps make the output of FOSS projects better. It's just a bit of a shame we had to wade through some personal nastiness (not from you) and very "gotcha" bad-faith argument to get here (again, not from you), so thanks for not being a part of that ugly side of FOSS contribution.

As a final note, this whole discussion has been based on the premise that this feature is only useful to a less experienced developer, as a more experienced developer will already know how to get directly to the answer quickly. I'm not sure that's true either, and I can give an example of where I found "AI Help" really helpful the other day: I was writing a regexp, realised it was matching greedily when I needed a non-greedy match, and just wanted the bit of syntax needed to do that. So I asked AI Help: [cropped screenshot of the AI Help answer]. I've cropped the image there because that's where I stopped reading: I had already appended […] (the greedy vs. non-greedy difference is illustrated in the short sketch at the end of this reply).

So I think, when approached with appropriate scepticism (which I guess we tend to have as more experienced developers, and which a more junior developer may not have in this context), these answers can be super helpful despite the inherent possibility of inaccuracies due to the fundamental nature of LLMs. The problem then is no longer the fact that those inaccuracies can exist (which we can never really solve for), but how we instil that scepticism into the users of this feature, and that's a very concrete, easier-to-work-on problem.
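For the curious, here's a minimal sketch of the greedy vs. non-greedy behaviour described in the regexp example above. It's my own illustrative snippet, not the original regexp or the AI Help output: the only point it relies on is that adding `?` after a quantifier makes it lazy.

```typescript
// Illustrative snippet only -- not the original regexp or the AI Help output.
// Greedy quantifiers match as much as possible; adding `?` makes them lazy.

const html = "<b>bold</b> and <i>italic</i>";

// Greedy: `.*` runs to the last `>`, swallowing everything in between.
const greedy = html.match(/<.*>/);
console.log(greedy?.[0]); // "<b>bold</b> and <i>italic</i>"

// Non-greedy (lazy): `.*?` stops at the first `>` it can.
const lazy = html.match(/<.*?>/);
console.log(lazy?.[0]); // "<b>"
```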
-
@LeoMcA - thanks for the elaborate response. I've read through your answer and will let it sink in. I've marked it as the answer to the question for now. (There's a lot to digest, and time, like everything, is limited ;))