import React from 'react';
import Header from '../components/Header';
import Footer from '../components/Footer';

import ChunkHeaderImage from '../assets/chunk_header_image.png';
import ChunkHeaderPerformanceImage from '../assets/chunk_header_performance_image.png';
import FullRSEImage from '../assets/full_rse_image.png';
import RseCloseImage from '../assets/rse_close_image.png';
import RseFromDocumentImage from '../assets/rse_from_document_image.png';
import SectionsImage from '../assets/sections_image.png';

import "./ContextArticle.css";

const ContextArticle = () => {

    return (
        <div className="article-section">

            <Header lightOrDark="dark" />

            <div className="article-container">
                <div className="article-title-container">
                    <p className="article-title-text">Solving the out-of-context chunk problem for RAG</p>
                    <p className="article-name-text">Zach McCormick</p>
                    <p className="article-date-text">July 17, 2024</p>
                </div>

                <div className="article-section-row">
                    <p className="article-paragraph">Many of the problems developers face with RAG come down to this: individual chunks don’t contain enough context to be properly used by the retrieval system or the LLM. This leads to failures on seemingly simple questions and, more worryingly, to hallucinations.</p>
                    <p className="article-paragraph">Examples of this problem:</p>
                    <ul>
                        <li className="article-bullet-point">Chunks often refer to their subject via implicit references and pronouns. This causes them to not be retrieved when they should be, or to not be properly understood by the LLM.</li>
                        <li className="article-bullet-point">Individual chunks often don’t contain the complete answer to a question. The answer may be scattered across a few adjacent chunks.</li>
                        <li className="article-bullet-point">Adjacent chunks presented to the LLM out of order cause confusion and can lead to hallucinations.</li>
                        <li className="article-bullet-point">Naive chunking can split text “mid-thought,” leaving neither chunk with useful context.</li>
                        <li className="article-bullet-point">Individual chunks often only make sense in the context of the entire section or document, and can be misleading when read on their own.</li>
                    </ul>
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">What would a solution look like?</p>
                    <p className="article-paragraph">We’ve found two methods that together solve the bulk of these problems.</p>
                    <p className="context-article-bold-paragraph">Contextual chunk headers</p>
                    <p className="article-paragraph">The idea here is to add higher-level context to the chunk by prepending a chunk header. This chunk header could be as simple as the document title, or it could combine the document title, a concise document summary, and the full hierarchy of section and sub-section titles.</p>
                    <p className="context-article-bold-paragraph">{"Chunks -> segments"}</p>
                    <p className="article-paragraph">Large chunks provide better context to the LLM than small chunks, but they also make it harder to precisely retrieve specific pieces of information. Some queries (like simple factoid questions) are best handled by small chunks, while other queries (like higher-level questions) require very large chunks. What we really need is a more dynamic system that can retrieve short chunks when that's all that's needed, but can also retrieve very large chunks when required. How do we do that?</p>
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">Break the document into sections</p>
                    <p className="article-paragraph">Information about the section a chunk comes from can provide important context, so our first step is to break the document into semantically cohesive sections. There are many ways to do this, but we’ll use a semantic sectioning approach. This works by annotating the document with line numbers and then prompting an LLM to identify the starting and ending lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. They are then broken into smaller chunks if needed.</p>
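                    {/*
Editor's sketch (not rendered): the line-number annotation step described above could look
roughly like this. The function names, the "[n] " prefix format, and the { title, start, end }
shape assumed for the LLM response are all hypothetical choices made for illustration; the
prompt and the LLM call itself are left out.

```javascript
// Prefix every line with its index so an LLM can cite section boundaries.
// The "[n] " format is an assumption made for this sketch.
function annotateWithLineNumbers(text) {
  return text
    .split("\n")
    .map((line, index) => `[${index}] ${line}`)
    .join("\n");
}

// Turn boundaries like { title, start, end } (inclusive line indices, a
// hypothetical response shape) back into section text.
function extractSections(text, boundaries) {
  const lines = text.split("\n");
  return boundaries.map(({ title, start, end }) => ({
    title,
    content: lines.slice(start, end + 1).join("\n"),
  }));
}
```

The annotated text would go into the sectioning prompt, and the original (un-annotated) text
is what gets sliced, so the line numbers never leak into the stored sections.
                    */}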
                    <p className="article-paragraph">We’ll use Nike’s 2023 10-K to illustrate this. Here are the first 10 sections we identified:</p>
                    <img src={SectionsImage} className="context-blog-image" id="sections-image" />
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">Add contextual chunk headers</p>
                    <img src={ChunkHeaderImage} className="context-blog-image" id="sections-image" />
                    <p className="article-paragraph">The purpose of the chunk header is to add context to the chunk text. Rather than using the chunk text by itself when embedding and reranking the chunk, we use the concatenation of the chunk header and the chunk text, as shown in the image above. This helps the ranking models (embeddings and rerankers) retrieve the correct chunks, even when the chunk text itself has implicit references and pronouns that make it unclear what it’s about. For this example, we just use the document title and the section title as context, but there are many ways to do this. For example, we’ve also seen great results using a concise document summary as the chunk header.</p>
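                    {/*
Editor's sketch (not rendered): one way the header-plus-chunk concatenation described above
might be written. The function names and the exact "Document: / Section:" wording are
assumptions for illustration only; the actual header format is the one shown in the image.

```javascript
// Build a header from document-level context. The wording of the header
// is an assumption; any consistent format should work.
function buildChunkHeader({ documentTitle, sectionTitle }) {
  const parts = [`Document: ${documentTitle}`];
  if (sectionTitle) {
    parts.push(`Section: ${sectionTitle}`);
  }
  return parts.join("\n");
}

// Text handed to the embedding model and reranker: header, blank line, chunk text.
function textForRanking(header, chunkText) {
  return `${header}\n\n${chunkText}`;
}
```

The same concatenated string would be used for both embedding and reranking, so the two
ranking models see identical context for a given chunk.
                    */}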
                    <p className="article-paragraph">Let’s see how much of an impact the chunk header has for the chunk shown above.</p>
                    <img src={ChunkHeaderPerformanceImage} id="chunk-performance-image" />
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">{"Chunks -> segments"}</p>
                    <p className="article-paragraph">Now let’s run a query and visualize chunk relevance across the entire document. We’ll use the query “Nike stock-based compensation expenses.”</p>
                    <img src={FullRSEImage} className="context-blog-image" id="full-rse-image" />
                    <p className="article-paragraph">In the plot above, the x-axis represents the chunk index. The first chunk in the document has index 0, the next chunk has index 1, etc. There are 483 chunks in total for this document. The y-axis represents the relevance of each chunk to the query. Viewing it this way lets us see how relevant chunks tend to be clustered in one or more sections of a document. For this query we can see that there’s a cluster of relevant chunks around index 400, which likely indicates there’s a multi-page section of the document that covers the topic we’re interested in. Not all queries will have clusters of relevant chunks like this. Queries for specific pieces of information where the answer is likely to be contained in a single chunk may just have one or two isolated chunks that are relevant.</p>
                    <p className="context-article-bold-paragraph">What can we do with these clusters of relevant chunks?</p>
                    <p className="article-paragraph">The core idea is that clusters of relevant chunks, in their original contiguous form, provide much better context to the LLM than individual chunks can. Now for the hard part: how do we actually identify these clusters?</p>
                    <p className="article-paragraph">If we can calculate chunk values in such a way that the value of a segment is just the sum of the values of its constituent chunks, then finding the optimal segment is a version of the maximum subarray problem, which has a well-known, efficient solution. How do we define chunk values in such a way? We'll start with the idea that highly relevant chunks are good, and irrelevant chunks are bad. We already have a good measure of chunk relevance (shown in the plot above), on a scale of 0-1, so all we need to do is subtract a constant threshold value from it. This makes the value of irrelevant chunks negative, while keeping the values of relevant chunks positive. We call this threshold the irrelevant_chunk_penalty. A value around 0.2 seems to work well empirically. Lower values will bias the results towards longer segments, and higher values will bias them towards shorter segments.</p>
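                    {/*
Editor's sketch (not rendered): the chunk-value idea above, combined with a standard
maximum-subarray scan (Kadane's algorithm). This is a minimal illustration that finds only
the single best segment; it is not presented as the dsRAG implementation, and the function
names are hypothetical. Only irrelevant_chunk_penalty comes from the article.

```javascript
// Chunk value = relevance (0 to 1) minus a constant penalty.
function chunkValues(relevanceScores, irrelevantChunkPenalty = 0.2) {
  return relevanceScores.map((score) => score - irrelevantChunkPenalty);
}

// Kadane's algorithm: returns the contiguous run of chunks with the highest
// summed value as { start, end, score } (end inclusive), or null when no
// segment has a positive score.
function bestSegment(values) {
  let best = null;
  let runStart = 0;
  let runScore = 0;
  for (let i = 0; i < values.length; i++) {
    // A run that has dropped to zero or below can only hurt what follows.
    if (runScore <= 0) {
      runStart = i;
      runScore = 0;
    }
    runScore += values[i];
    if (runScore > (best ? best.score : 0)) {
      best = { start: runStart, end: i, score: runScore };
    }
  }
  return best;
}
```

For relevance scores [0.1, 0.9, 0.1, 0.8, 0.05, 0.0] and the default penalty, the best
segment spans indices 1 through 3: the weakly scored chunk at index 2 is kept because it is
sandwiched between two strong ones. Returning several segments per query would need an
extension, such as repeating the search while excluding chunks already selected.
                    */}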
                    <p className="article-paragraph">For this query, the algorithm identifies chunks 397-410 as the most relevant segment of text from the document. It also identifies chunk 362 as sufficiently relevant to include in the results. Here is what the first segment looks like:</p>
                    <img src={RseFromDocumentImage} className="context-blog-image" id="full-rse-image" />
                    <p className="article-paragraph">This looks like a great result. Let’s zoom in on the chunk relevance plot for this segment.</p>
                    <img src={RseCloseImage} className="context-blog-image" id="full-rse-image" />
                    <p className="article-paragraph">Looking at the content of each of these chunks, it's clear that chunks 397-401 are highly relevant, as expected. But looking closely at chunks 402-404 (this is the section about stock options), we can see they're actually also relevant, despite being marked as irrelevant by our ranking model. This is a common theme: chunks that are marked as not relevant, but are sandwiched between highly relevant chunks, are often quite relevant. In this case, the chunks were about stock option valuation, so while they weren't explicitly discussing stock-based compensation expenses (which is what we were searching for), in the context of the surrounding chunks it's clear that they are actually relevant. So in addition to providing more complete context to the LLM, this method of dynamically constructing segments of relevant text also makes our retrieval system less sensitive to mistakes made by the ranking model.</p>
                </div>

                <div className="article-section-row">
                    <p className="article-header-text">Try it for yourself</p>
                    <p className="article-paragraph">
                        If you want to try these methods, we’ve open-sourced a retrieval engine that implements them, called <b className="context-article-link" onClick={() => window.open("https://github.com/D-Star-AI/dsRAG", "_blank")}>dsRAG</b>. You can also play around with the <b className="context-article-link" onClick={() => window.open("https://github.com/D-Star-AI/dsRAG/blob/main/examples/dsRAG_motivation.ipynb", "_blank")}>Jupyter notebook</b> we used to run these examples and generate the plots. And if you want to use dsRAG with LangChain, we have a <b className="context-article-link" onClick={() => window.open("https://github.com/D-Star-AI/dsRAG/blob/main/integrations/langchain_retriever.py", "_blank")}>LangChain custom retriever</b> implementation as well.
                    </p>
                </div>

            </div>

            <Footer />

        </div>
    )

}


export default ContextArticle;