import React from 'react';
import Header from '../components/Header';
import Footer from '../components/Footer';

import "./EmbeddingsArticle.css";

const EmbeddingsArticle = () => {

    return (
        <div className="article-section">

            <Header lightOrDark="dark" />

            <div className="article-container">
                <div className="article-title-container">
                    <p className="article-title-text">Embeddings Are Not All You Need</p>
                    <p className="article-name-text">Zach McCormick</p>
                    <p className="article-date-text">July 3, 2024</p>
                </div>

                <div className="article-section-row">
                    <p className="article-paragraph">LLMs are severely limited by their lack of access to external data. Most valuable use cases for LLMs require connecting external data in some way. What’s the use of an enterprise AI assistant if it doesn’t know anything about your company? The now-standard solution is retrieval-augmented generation, or RAG. Embeddings and vector databases are at the core of most RAG systems. Embeddings have gotten very good, but RAG still comes up short for many use cases. Why is that?</p>
                    <p className="article-paragraph">Consider the personal assistant use case: we'd want the AI to understand our preferences, habits, and needs based on our emails, documents, and other personal data. It should be able to suggest a vacation spot not just by finding documents mentioning "vacation," but by understanding our past travel preferences, budget constraints, and current work schedule, much like a human assistant would.</p>
                    <p className="article-paragraph">We need the LLM to be “connected” to our data and to “understand” it at a deep level. We need the LLM to have seamless access to this data and to be able to do any reasonable task that a human could do with it. But that’s not how RAG systems work.</p>
                    <p className="article-paragraph">RAG systems have their roots in traditional information retrieval (IR), a field with decades of research behind it. Some argue that RAG could benefit from adopting more advanced IR techniques, such as multi-factor ranking or relevance feedback. I agree with this argument. And yet, borrowing more from IR isn’t going to get us to that vision of a personal assistant with deep context about our lives and preferences. Or to the countless other LLM use cases that aren’t quite possible with current RAG techniques. Connecting LLMs to data is more than a pure IR problem.</p>
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">The problem, precisely</p>
                    <p className="article-paragraph">Let’s get more specific. What exactly doesn’t work with standard RAG systems?</p>
                    <p className="article-bold-paragraph">1. User inputs that ask for summarization of things, listing of things, or aggregating information across multiple documents</p>
                    <p className="article-paragraph indented-paragraph">RAG systems are generally designed to retrieve specific pieces of information, not to aggregate large amounts of information. The problem is that many use cases require aggregating and summarizing large amounts of information. Even simple-sounding queries like “What is Chapter 1 about?” require this capability.</p>
                    <p className="article-bold-paragraph">2. Data that has conflicting or out-of-date information</p>
                    <p className="article-paragraph indented-paragraph">Real-world data is messy and full of errors. If LLM systems are restricted to only using data that is guaranteed to be error-free and always up-to-date, then they're going to be severely limited. Tons of useful information is contained in messy data, but current retrieval systems generally can't handle data like this.</p>
                    <p className="article-bold-paragraph">3. Use cases where factors other than just relevancy, such as quality or recency, need to be taken into account</p>
                    <p className="article-paragraph indented-paragraph">If your data isn't uniformly high-quality and up-to-date (and most real-world data isn't), then you need to bias your results towards higher quality and more recent ones for reliable results. This is the sort of thing that traditional search systems are actually very good at, but the current RAG paradigm has mostly ignored this need. Beyond quality and recency, many use cases have additional factors that need to be considered in the retrieval process.</p>
                    <p className="article-bold-paragraph">4. Data that is really raw, like chat logs, call transcripts, or emails</p>
                    <p className="article-paragraph indented-paragraph">The problem with really raw data is that the questions we would want to ask over it are generally higher-level questions that require aggregating information across large amounts of data. For example, “List all instances of unhappy customers this week and see if there are any commonalities.” This isn’t something that the current RAG paradigm can handle.</p>
                    <p className="article-bold-paragraph">5. Data that is multimodal, such as charts and graphs embedded in mostly text documents</p>
                    <p className="article-paragraph indented-paragraph">Most of the leading LLMs are multimodal now, but retrieval systems have not caught up. For data that includes visual data like images, charts, and graphs, alongside text, the LLM is going to really struggle if it only gets to see the text data.</p>
                    <p className="article-bold-paragraph">6. Semi-structured data, like CSVs and JSONs</p>
                    <p className="article-paragraph indented-paragraph">All kinds of problems get introduced when you treat semi-structured data the same way as unstructured data. CSVs are an especially tricky file type, because some CSVs can be treated like unstructured data with some minor special handling, such as carrying over column headers to each chunk. But other types of CSVs, especially those that are larger or contain mostly quantitative data, need to be treated more like structured data and put into a SQL table for effective querying.</p>
                    <p className="article-bold-paragraph">7. Use cases where you need the LLM to have a holistic understanding of the data</p>
                    <p className="article-paragraph indented-paragraph">Consider the personal assistant use case. You connect your email inbox, personal docs, and whatever other data sources may be useful. Now you want your AI assistant to find you somewhere to go for your next vacation. Ideally it’ll be able to figure out what kind of places you’d enjoy directly from the connected data. The problem is there isn’t going to be a single document in that connected data that says “I love the mountains, and don’t really like the beach or big cities.” That’s the kind of information that has to be inferred through a holistic understanding of the data. It can’t just be retrieved.</p>
                    <p className="article-bold-paragraph">8. Data that has some implicit or explicit structure that needs to be taken into consideration, like code or legal statutes</p>
                    <p className="article-paragraph indented-paragraph">Code and legal statutes are two examples of data with a clear structure of references. For code, the references come from imports, and for legal statutes the references come from one statute referencing another statute. Making sense of data like this requires being able to follow these chains of references during the retrieval process.</p>
                    <p className="article-bold-paragraph">9. Any use case where completeness matters</p>
                    <p className="article-paragraph indented-paragraph">Suppose you want to search across the local government regulations for a given municipality to find all rules related to short-term rentals. A search system that just returns the top 5 most relevant rules is going to lead to a very incomplete and misleading answer from the LLM.</p>
                    <p className="article-bold-paragraph">10. Any use case where hallucinations can't be tolerated</p>
                    <p className="article-paragraph indented-paragraph">Even on simple tasks, accuracy rates for RAG systems often hover in the 70-90% range, which is unacceptable for production applications.</p>
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">How do we solve this?</p>
                    <p className="article-bold-paragraph">1. Maximize context to the LLM and embedding model</p>
                    <p className="article-paragraph indented-paragraph">This was the focus of an open-source RAG project I recently launched, called <b id="sp-rag-link" onClick={() => window.open("https://github.com/D-Star-AI/dsRAG", "_blank")}>dsRAG</b>. There were two main ideas there. First, we prepend concise summaries of documents to each chunk prior to embedding. This brings document-level context into the chunk embedding, allowing for much more reliable retrieval for chunks that otherwise would not contain sufficient context. Second, we dynamically reconstruct multi-chunk segments of text at query time. This provides more complete context to the LLM, and makes it much less likely for the LLM to misunderstand a chunk of text. There are still improvements to be made here, but this approach has shown very promising results.</p>
                    <p className="article-bold-paragraph">2. Use hybrid search and multi-factor ranking</p>
                    <p className="article-paragraph indented-paragraph">Hybrid search (using a combination of embeddings and keyword search) has become fairly common practice now, but multi-factor ranking is a lot less common. Incorporating additional ranking factors, like quality and recency, in addition to relevancy has many potential benefits for a wide range of applications.</p>
                    <p className="article-bold-paragraph">3. Return complete results when needed, not just the top-k most relevant ones</p>
                    <p className="article-paragraph indented-paragraph">This seems pretty simple on the surface. The challenge is that some queries only require a handful of results, while others require many pages of results. You only want to return a comprehensive set of results when it’s really needed. If you’re looking for a specific piece of information, and it happens to be contained in fifty different documents, it would be a waste to return all fifty of those results.</p>
                    <p className="article-bold-paragraph">4. Respect the heterogeneity of data</p>
                    <p className="article-paragraph indented-paragraph">Different types of data require different types of retrieval processes. This means you need to split your raw data into multiple knowledge bases, each with its own tailored preprocessing and retrieval systems. Some data, for example, has both unstructured and structured elements to it. Effectively querying data like this requires representing it as both structured and unstructured.</p>
                    <p className="article-bold-paragraph">5. Use an agent approach to querying</p>
                    <p className="article-paragraph indented-paragraph">No matter how good your retrieval system is, you’re not going to get all the information needed on the first try 100% of the time. By using an agent approach, you can have the agent generate queries, evaluate the information that was retrieved, and generate follow-up queries if needed. For more complex tasks, you can use a query planning and/or task decomposition approach. The downside, of course, is that this adds cost and latency.</p>
                    <p className="article-bold-paragraph">6. Organize and consolidate raw data into knowledge</p>
                    <p className="article-paragraph indented-paragraph">Similar to how the human brain compresses raw information into consolidated knowledge, and then uses that knowledge whenever relevant, we need a system that can generate knowledge and learnings from raw data and experiences and then provide that as context to the agent whenever relevant. This will enable a holistic understanding of data that goes far beyond pure retrieval, and it will allow agents to learn and improve over time.</p>
                    <p className="article-paragraph">Of course, there are tons of other avenues people are exploring, so this is by no means a comprehensive list. But these are the major areas for improvement that I’ve seen from my time helping customers build and debug RAG systems.</p>
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">Introducing D-Star</p>
                    <p className="article-paragraph">To this end, my longtime co-founder Nick and I are starting a new project called D-Star. We’re going directly after these fundamental problems that are standing in the way of so many potential use cases for LLMs.</p>
                    <p className="article-paragraph">We’ve already built out a lot of the core technology. We’ll be releasing D-Star as an open-source project later this summer. We’re currently looking for a few more design partners that we can work closely with to solve their most challenging LLM+data problems. This will help us iterate on the project and get the core abstractions right before the public launch. If you’d like to be one of these design partners, reach out to me at <b id="email-link">zach@d-star.ai</b> and tell me about your use case.</p>
                </div>

            </div>

            <Footer />

        </div>
    )

}

// https://www.youtube.com/embed/zlhCIqa8s9M?si=27Xn4vf044W-Dt5z

/*

<iframe
                    src={"https://www.youtube.com/embed/zlhCIqa8s9M?si=27Xn4vf044W-Dt5z"}
                    title={"D-Star Demo"}
                    frameBorder="0"
                    loading="lazy"
                    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
                    allowFullScreen>
                </iframe>
                */

export default EmbeddingsArticle;