When building AI solutions, we face many challenges. It is a multidisciplinary task that requires careful planning and effective execution across several areas: business goals, logic, data, models, and infrastructure. Here, let's focus on one key issue that has recently become increasingly apparent: hallucinations, that is, false responses generated by AI.
Where Do Hallucinations Come From?
In the rush to create ever-new AI-based gadgets, it is often overlooked that AI relies on data. A lack of appropriate data leads to poorer responses and encourages the model to fabricate plausible-sounding results. On the other hand, overfitting the model to the available data produces the opposite issue: rote "memorization" (as we know it from school) rather than genuine generalization.
It's worth noting that simply adding the phrase "don't hallucinate" to a prompt, however well intentioned, won't solve the problem. The foundation of effective AI solutions is a solid data layer. Without the right data, it is impossible to build a meaningful product.
How Can AI Help in Data Acquisition?
Interestingly, large language models (LLMs) can themselves assist in enriching a data repository. So how can artificial intelligence be used for this purpose?
1. Leveraging Knowledge Contained in the Model
LLMs are trained on vast collections of digital material, and this is one of their major strengths; it shouldn't be underestimated, because it can be put to work directly. There are examples of LLMs being applied successfully to describing large financial data sets: because this information is widely available online, the margin of error was minimal. Used this way, the model's built-in knowledge is an effective means of enriching large sets of private data.
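As a concrete illustration, here is a minimal sketch of this pattern: asking the model to enrich records whose subject matter is well covered online. It assumes the OpenAI Python client; the model name, prompt wording, and the `describe_instrument` helper are illustrative choices, not a prescribed setup.

```python
# A minimal sketch: enriching records with knowledge already stored in the model.
# Assumes the OpenAI Python client; prompt and field names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_instrument(ticker: str) -> str:
    """Ask the model for a short, factual description of a public company."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer in one factual sentence. "
                                          "If you are not sure, say 'unknown'."},
            {"role": "user", "content": f"Describe the company behind the stock ticker {ticker}."},
        ],
        temperature=0,  # favor the model's stored knowledge over creative variation
    )
    return response.choices[0].message.content

# Enrich a private dataset with publicly well-documented knowledge.
portfolio = ["AAPL", "MSFT", "NVDA"]
enriched = {t: describe_instrument(t) for t in portfolio}
```

Setting the temperature to 0 and explicitly allowing the model to answer "unknown" are cheap first guards: they bias the model toward its stored knowledge rather than creative invention.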
However, there are also pitfalls. As we approach specific areas of knowledge that are poorly represented online, LLMs start to "hallucinate." In one test, for example, we asked a model to generate a list of all the aromas present in a particular bottle of wine, and it did so with the grace of a consultant and wine expert. The response was flawless, but... completely false. This was a classic case of hallucination.
2. Extracting Knowledge from Unstructured Materials
Another way to enrich data using language models is to extract information from private, unstructured data sets that contain real, useful knowledge. This is an ideal application for LLMs, as it surfaces valuable information from materials that were previously difficult to process.
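A minimal sketch of this kind of extraction, under the same assumptions as above (OpenAI Python client; the contract schema and field names are purely illustrative):

```python
# A sketch of structured extraction from unstructured text. The schema,
# document contents, and key names are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

SCHEMA_HINT = (
    "Return only JSON with the keys: customer (string), "
    "contract_value (number), start_date (YYYY-MM-DD). "
    "Use null for anything not stated in the text."
)

def extract_contract_facts(document: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SCHEMA_HINT},
            {"role": "user", "content": document},
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
        temperature=0,
    )
    facts = json.loads(response.choices[0].message.content)  # fails loudly if the model drifted
    # Whitelist the keys so stray model output never enters the dataset.
    return {k: facts.get(k) for k in ("customer", "contract_value", "start_date")}
```

Pinning the output to a declared schema and whitelisting the keys means the model can only fill in fields we asked for, and a parse failure surfaces immediately instead of silently polluting the data.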
Hallucinations are a real issue to be mindful of when building AI solutions. An effective remedy is to heavily constrain the results the model can produce, using available data or private knowledge. Proper constraints ensure that the solution delivers accurate results. Achieving this takes not only a good strategy but also a willingness to test and improve incrementally.
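To make the idea of constraining concrete, here is a minimal sketch in the spirit of the wine example above: model output is checked against a curated private list, and anything unverified is dropped rather than shown to the user. The aroma list and function name are hypothetical.

```python
# A minimal sketch of constraining model output with private knowledge:
# anything outside a curated list is rejected instead of passed to the user.
KNOWN_AROMAS = {"cherry", "blackcurrant", "vanilla", "oak", "tobacco"}  # curated, private data

def constrain_aromas(model_output: list[str]) -> list[str]:
    """Keep only aromas confirmed by our own data; drop everything the model invented."""
    accepted = [a for a in model_output if a.lower() in KNOWN_AROMAS]
    rejected = [a for a in model_output if a.lower() not in KNOWN_AROMAS]
    if rejected:
        # Log instead of guessing: hallucinated values should never reach the product.
        print(f"Discarded unverified aromas: {rejected}")
    return accepted

print(constrain_aromas(["Cherry", "Vanilla", "graphite"]))  # -> ['Cherry', 'Vanilla']
```

The same gate works for any closed vocabulary (product names, account identifiers, document sections); the key design choice is that the model proposes and the data layer decides.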
Author: Mariusz Jażdżyk