Emerging retrieval-augmented generation (RAG)-powered enterprise conversational QA applications leverage massive internal knowledge bases and conversational data to improve customer service, increase employee productivity, and help make better decisions.
Generative large language models (LLMs) are instrumental in RAG and can produce human-like responses. However, LLMs are limited by context length: the attention layer's computation time and memory grow quadratically with the number of input tokens (and linearly with the embedding dimension).
To address this limitation, practitioners use vector databases to improve both vanilla QA and RAG performance, storing enormous amounts of text data and retrieving relevant data efficiently. However, vector databases bring challenges of their own, including high latency, high cost, the expense of re-indexing everything whenever the embedding model is upgraded, and reliability risks inherent to a relatively young technology.
This article shows how FAISS, Meta's open-source similarity-search library, combined with a tag-based indexing approach, tackles the challenges mentioned above.
RAGs to riches: an introduction to RAGs
Emerging retrieval-augmented generation (RAG)-powered enterprise conversational QA applications leverage massive internal knowledge bases. These QA applications can help companies to:
- Improve customer service: QA applications can power chatbots and other conversational interfaces that answer customer questions accurately and efficiently. A QA bot can mine millions of customer issue records to answer “What were the top 3 refund requests or tech issues for our product last month?”
- Increase employee productivity: QA applications can power internal knowledge bases and other resources that help employees find the information they need quickly and easily. For a software product development team, a QA bot can answer “What are the top 3 key areas and challenges in resolving poor adaptability?” by drawing on product feedback, issue documents, emails, and messages.
- Help make better decisions: QA applications can analyze large datasets of conversational data to identify trends and patterns, which in turn inform better decisions about product development, marketing, and customer service.
Question answering can be taken to the next level with generative question answering, which uses large language models (LLMs) such as Llama, Falcon, Vicuna, or GPT to go beyond extracting answers and produce human-like, creative, and personalized responses to user queries.
The challenge of context
Apart from cost and latency, the biggest challenge is context length: LLMs have a maximum token limit that must cover the input prompt, the supporting data, and the generated output. LLMs have a restricted input size (for example, 4,096 tokens for GPT-3.5), so when handling long documents, the model truncates the text to fit within the allowable context length. While fine-tuning an LLM with confidential data is a viable alternative, it may be hard to get good results, and it can be a costly operation.
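As a rough illustration (the limit and output headroom below are illustrative assumptions, not fixed values), a prompt can be checked against a model's context window with OpenAI's tiktoken library:

```python
import tiktoken

CONTEXT_LIMIT = 4096        # e.g., GPT-3.5's window
RESERVED_FOR_OUTPUT = 500   # assumed headroom for the generated answer

def fits_context(prompt: str, model: str = "gpt-3.5-turbo") -> bool:
    """Return True if the prompt plus output headroom fits the window."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(prompt)) + RESERVED_FOR_OUTPUT <= CONTEXT_LIMIT

print(fits_context("What were the top 3 refund requests for our product last month?"))
```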
Vector databases, a much-hyped recent technology, aim to address these problems and enable the use cases above. Storing data in an external vector database and letting the LLM fetch only the relevant pieces needed to answer the prompted question can improve LLM performance. Vector databases are designed for efficient similarity search and, thanks to advanced indexing and distributed systems, can scale to accommodate massive volumes of data.
But vector databases have their own issues. As presented in this article on latency, the cost of generating and maintaining indexes over millions of data chunks can reach tens of thousands of dollars per month, and retrieval is susceptible to scenarios where related information is scattered across chunks, resulting in poor quality and missing information.
This article gives step-by-step instructions for improving vector search: drastically shrinking the indexing footprint to massively reduce indexing costs, improving search precision, and using targeted search to get users the answers they want.
Custom Architecture for Scalable Real-Time Queries
Data labeling entails creating labels for data chunks within large volumes of text data. These labels must be focused on search, intent, and problem statements.
For example, assume 10 documents, each with 1,500 words spread across 50-75 sentences and divided into 10-15 chunks. With data labeling, each document is indexed and maintained through only a few words of labels rather than its full 1,500 words, a reduction of more than 99%. This also yields much better search precision, because the system can quickly identify which tranches of text contain information relevant to the query.
Labels or tags are ideal for indexing data; unstructured data, however, doesn't come with any. Since raw text lacks the e-commerce equivalent of tags (metadata), this article presents an ML technique for automatically tagging and labeling millions of documents, down to the paragraph level, so those tags can serve as search indexes.
A further issue is the excessively casual language common in informal texts such as client reviews and comments. For the precise, information-based search that prospective users want, this noise must be eliminated.
Figure-1 - Tag-indexed documents: offline indexing architecture and real-time queries
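To make the Figure-1 design concrete, here is a minimal sketch, not the production implementation, of an inverted index that maps short tags to chunk IDs; all names and sample data are hypothetical:

```python
from collections import defaultdict

# Offline stage: each chunk has already been auto-tagged
# (the NER step shown later in this article produces such tags).
chunks = {
    "doc1#3": {"text": "They have the best deviled crab I've had in Tampa.",
               "tags": ["deviled crab"]},
    "doc2#7": {"text": "The Cuban sandwich was on point.",
               "tags": ["cuban sandwich"]},
}

# Index only the short tags, not the full chunk text.
tag_index = defaultdict(set)
for chunk_id, chunk in chunks.items():
    for tag in chunk["tags"]:
        tag_index[tag].add(chunk_id)

# Real-time stage: a query is matched against tags and resolved to chunks.
print(tag_index["deviled crab"])  # {'doc1#3'}
```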
Solution walkthrough with a search problem
This example uses Yelp restaurant recommendations, but the approach could equally be applied to other categories in the Yelp data, such as automobiles, cuisine, health and services, or beauty and spas.
The idea is to use a retrieval-based chatbot with semantic search to find the best answer among numerous candidate responses. Tags help natural language processing models discover and analyze content by marking the spans of text that contain the relevant entities. Yelp reviews of restaurants in the Tampa, Florida, area were used to create this example search.
How does it work?
The recommendations shown were based on five-star reviews from Tampa, Florida, retrieved for "crab" searches.
Figure-2 - RAG output using LLMs and FAISS running on index-tags
Steps for creating this search
- Dataset Used
Yelp provides a selection of business, review, and user data for academic research: 908,915 tips contributed by 1,987,897 distinct users, plus more than 1.2 million business attributes covering hours of operation, parking, product and service availability, and ambience.
Figure-3 - Yelp dataset dimensions for query build
- Data Aggregation
The Yelp dataset was filtered by city. This LLM-based consumer recommendation system is intended to analyze Yelp reviews from consumers, and the project concentrates on gathering Tampa-based reviews; the selection procedure keeps only five-star reviews, as sketched below.
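A minimal sketch of this aggregation using pandas; the file names follow the Yelp open dataset's JSON-lines layout, while paths and chunk size are assumptions:

```python
import pandas as pd

# Find the business IDs located in Tampa.
business = pd.read_json("yelp_academic_dataset_business.json", lines=True)
tampa_ids = set(business.loc[business["city"] == "Tampa", "business_id"])

# Stream the (large) review file in chunks, keeping five-star Tampa reviews.
selected = []
for reviews in pd.read_json("yelp_academic_dataset_review.json",
                            lines=True, chunksize=100_000):
    mask = reviews["business_id"].isin(tampa_ids) & (reviews["stars"] == 5)
    selected.append(reviews[mask])

five_star_tampa = pd.concat(selected, ignore_index=True)
print(len(five_star_tampa), "five-star Tampa reviews")
```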
- Data Removal: excluding emotional sentences and retaining only food-related sentences
Statements with emotional content are classified for exclusion, while statements linked to food are retained for subsequent modeling stages. The code can be found on GitHub.
Few-shot learning was used: a collection of 300 instances was manually labeled, and these examples were used to train and evaluate a model that was then deployed on a dataset containing over 300,000 reviews to differentiate between emotional and food-related sentences. A stand-in sketch of this classifier follows the examples below.
Training labels:
{'review': 'So excited to find a great Mexican place in Tampa', 'label': 'emotional'},
{'review': 'And OMG the warm carrot cake', 'label': 'food_talk'}
Prediction based on a customer review:
{'This place is a Tampa classic': 'emotional', 'The Cuban sandwich was on point': 'food_talk'}
Original review:
‘This place is a Tampa classic. The Cuban sandwich was on point. The breakfast was another winner. I enjoyed also the devil crabs. Appreciate the reasonable prices, can't wait to come back and have again.’
Results after selecting for food-only content:
‘The Cuban sandwich was on point. I enjoyed also the devil crabs.’
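The authors' exact training code is on their GitHub; as a stand-in, the sketch below embeds sentences with a sentence-transformers model and fits a lightweight scikit-learn classifier. The model name and classifier choice are assumptions, not the confirmed setup:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# ~300 hand-labeled examples in the real project; two shown here.
labeled = [
    ("So excited to find a great Mexican place in Tampa", "emotional"),
    ("And OMG the warm carrot cake", "food_talk"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
X = encoder.encode([text for text, _ in labeled])
y = [label for _, label in labeled]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Applied to the 300,000+ reviews: keep only food_talk sentences.
sentences = ["This place is a Tampa classic", "The Cuban sandwich was on point"]
predictions = clf.predict(encoder.encode(sentences))
print([s for s, p in zip(sentences, predictions) if p == "food_talk"])
```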
- Named Entity Recognition-powered tagging
Generating "tags" is the practice of labeling texts; these tags enable expedited searching of databases for relevant material. Hashtags, such as those seen on Twitter, are a familiar example: they facilitate discoverability among previously untapped audiences.
On these principles, a food-related entity labeling model is used in this project to enhance food item search. InstaFoodRoBERTa-NER is a RoBERTa model that has been fine-tuned and optimized for named entity recognition (NER) of food entities.
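A minimal sketch of the tagging step using the Hugging Face pipeline API; the checkpoint ID and aggregation strategy are reasonable defaults rather than the authors' confirmed settings:

```python
from transformers import pipeline

# Load the food-entity NER model published on the Hugging Face hub.
ner = pipeline("ner",
               model="Dizex/InstaFoodRoBERTa-NER",
               aggregation_strategy="simple")

review = "The Cuban sandwich was on point. I enjoyed also the devil crabs."
tags = [entity["word"].strip() for entity in ner(review)]
print(tags)  # e.g., ['Cuban sandwich', 'devil crabs']
```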
Figure-4 - Yelp example output of InstaFoodRoBERTa-NER
- Vector search using billion-scale FAISS search
The FAISS library encompasses a variety of techniques for similarity search. Approaches such as binary vectors and compact quantization codes eliminate the need to retain the original vectors. These techniques trade some search precision for scalability, which lets them handle billions of vectors within the main memory of a single server.
Here, the FAISS library performs semantic search over the NER-derived food tags, and the highest-ranking review results are returned, as sketched below.
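A minimal sketch of the tag search; the embedding model and the flat index are illustrative choices (at billion scale, a quantized index such as IndexIVFPQ would replace the flat index):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Tags extracted by the NER step; a tiny sample for illustration.
tags = ["deviled crab", "cuban sandwich", "crab legs", "carrot cake"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = encoder.encode(tags, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on unit vectors
index.add(embeddings)

query = encoder.encode(["crab"], normalize_embeddings=True)
scores, ids = index.search(query, 3)
print([tags[i] for i in ids[0]])  # best-matching tags first
```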
Examples of top results with the FAISS search for the ‘crab’ query:
Gaspar's Grotto has a little something for everyone! This is my go to bar during the week and the weekend, whether you're in the mood to chill, dance, or brunch with the best of them, there's something for everyone. They have the best deviled crab I've had since in Tampa.
So delicious!!! Do not hesitate to come eat your little Crab Heart out. Crab spice is just hot enough, and I love hot.
I'm blown away by the service and the food! I ordered the ayce crab and they don't lie! Big claws and clusters and they bring it out faster then I could eat it. I'm a PICKY person and this was hands down the BEST crab I've had in town!!! See ya next week! ….
- Use an LLM prompt to get outputs that resemble human-like responses
LLM prompt used:
“I want you to act as a data scientist and analyze the dataset. Provide me with the exact and definitive answer to each question. Do not provide me with the code snippets for the questions. The dataset is provided below. Consider the given dataset for analysis. The first row of the dataset contains the header. Convert the filtered_customer_review column into 20 words. return as summarized food-magazine article to lure foodie customer to restaurant in 'name' column as per food in column filtered_customer_review”
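A sketch of the generation step; the client usage and model choice are assumptions, and the prompt is the one quoted above (abbreviated here):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "I want you to act as a data scientist and analyze the dataset. ..."  # full prompt quoted above
dataset_csv = "name,filtered_customer_review\n..."  # top FAISS results as CSV rows

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": prompt + "\n" + dataset_csv}],
)
print(response.choices[0].message.content)
```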
Final results for “crab” as a query:
Gaspar's Grotto: This popular bar offers something for everyone, including the best deviled crab in Tampa. Come for a chill night, a dance, or brunch with friends.
Mr and Mrs Crab: Delicious crab dishes await at this restaurant. Try the Crab Spice for a hot and tasty treat.
Surf Shack: Ayce crab is the star of the show here. Big claws and clusters are served quickly and are sure to please even the pickiest of eaters.
Brocato's Sandwich Shop: This spot has the best deviled crabs around. Try the Italian sandwich for a meal that weighs 3 pounds.
Crafty Crab: Take your mom here for a great time and delicious crab legs.
Seascapes Beach Club: Outstanding crab legs are the highlight of this beach club. Enjoy a delicious meal with a view….
Another search, for the query “Cubano” food:
Pepo's Cafe: Enjoy the best Cuban food in Tampa at Pepo's Cafe.
Franci's Cafe: Franci's Cafe offers fresh, authentic Cuban food that will delight your taste buds.
Columbia Restaurant: When you're craving Cuban, head to Columbia Restaurant for a delicious meal.
The Stone Soup Company: The Stone Soup Company offers award-winning Cuban cuisine.
La Teresita Cafe: La Teresita Cafe is the best place to get Cuban food in Tampa.
Pinky's: Pinky's is the go-to spot for the best breakfast Cuban in Tampa.
Summary
Prior to vector search: domain-based reduction in indexing footprint
Figure-5 - Example of reduction in number of tokens to be indexed
Using a vector database and semantic search improves the performance of GPT in answering questions about massive amounts of text data. This article provided step-by-step instructions for reducing indexing costs by automatically tagging and labeling documents for indexing.
The data presented above indicates an 85% decrease in the quantity of tokens that need to be indexed, measured following OpenAI's example of counting tokens with tiktoken. The sketch below reproduces the measurement.
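A minimal sketch of that measurement; the encoding name and sample strings are illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

full_review = ("This place is a Tampa classic. The Cuban sandwich was on point. "
               "The breakfast was another winner. I enjoyed also the devil crabs. "
               "Appreciate the reasonable prices, can't wait to come back.")
tags = "cuban sandwich, devil crabs"

# Compare tokens indexed for the full text versus for the tags alone.
n_full, n_tags = len(enc.encode(full_review)), len(enc.encode(tags))
print(n_full, n_tags, f"{1 - n_tags / n_full:.0%} reduction")
```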
Authors: Anshuman Guha, Sid Kashiramka, Ravi Krishnan