Goals
This is part 4 of 6 in our tutorial on Arabic RAG. I’ll be using this blog as a guide, but to actually run this tutorial, it’s best that you run the notebook as described in part 1.
Before diving in, I want you to think about how much money it costs to embed 2M articles. Make an estimate and see how accurate your guess is at the end of the blog. You can find out what other people thought in this poll I conducted (spoiler in the comments).
Info
In this blog you will learn:
- What the challenges are with embedding for RAG at scale
- How to solve this quickly using 🤗 TEI and Inference Endpoints
- How to embed your chunks in a cost-effective manner
Getting your embeddings isn’t talked about enough. It’s always surprisingly difficult as you scale, and most tutorials don’t operate at any real scale. The right answer used to be exporting to ONNX with O4 level optimization and running it from there. While not difficult, that took a little know-how and some preparation, which usually led to people just running the model out of the box and not putting much love and care into the process. Recently there have been a lot of developments.
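For context, here is a minimal sketch of what that older ONNX workflow roughly looked like with 🤗 Optimum. Treat the class names and the O4 config as assumptions to check against the current Optimum docs, since the API has evolved.

# Rough sketch of the old approach: export to ONNX and apply O4-level optimization
# with 🤗 Optimum (O4 includes fp16 and is GPU-only). Verify against current docs.
from optimum.onnxruntime import (
    AutoOptimizationConfig,
    ORTModelForFeatureExtraction,
    ORTOptimizer,
)

model_id = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

# Export the PyTorch checkpoint to ONNX
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

# Apply the highest optimization level and save the optimized graph
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(
    optimization_config=AutoOptimizationConfig.O4(),
    save_dir="onnx-minilm-o4",
)
# From here you would still have to handle tokenization, mean pooling, and batching yourself.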
Why TEI
The problem with getting embeddings efficiently is that the necessary techniques exist, but they are not widely applied. 🤗 TEI brings a number of these together:
- Token Based Dynamic Batching
- Using latest optimizations (Flash Attention, Candle and cuBLASLt)
- Fast and safe loading with safetensors
- Parallel Tokenization workers
Applying most of these is doable but quite tedious. Big thanks to Olivier for creating this!
Set up TEI
There are 2 ways you can go about running TEI: locally or with 🤗 Inference Endpoints. Given that not everyone will want to use Inference Endpoints, since it is a paid service, I have included instructions on how to do this locally to be more inclusive!
Start TEI Locally
I have this running in an nvidia-docker container, but there are other ways to install it too. Note that I ran the following docker startup cell in a different terminal for monitoring and separation.
Note
The way I have written the docker command will pull the latest image each time it’s run. TEI is at a very early stage at the time of writing, so you may want to pin this to a specific version if you want more repeatability.
As described in part 3, I chose sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 based on the STS ar-ar performance on mteb/leaderboard, it’s the top performer and half the size of second place! TEI is fast, but this will also make our life easy for storage and retrieval.
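The small size pays off later because each embedding is only 384 dimensions. As a quick aside, here is a sketch of how you could confirm that locally, assuming you have sentence-transformers installed:

# Confirm the embedding dimension of the chosen model (384 for this MiniLM variant)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print(model.get_sentence_embedding_dimension())  # 384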
I use revision=refs/pr/8 because this pull request adds safetensors, which is optional but faster. If you want to use a different embedding model and it doesn’t have safetensors, you can convert it here.
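If you’d rather not use that converter, a minimal local alternative (assuming the checkpoint loads with transformers) is to re-save the weights with safe serialization; the model id below is a placeholder:

# Sketch: convert a PyTorch checkpoint to safetensors locally,
# then push or point TEI at the converted folder.
from transformers import AutoModel, AutoTokenizer

model_id = "your-org/your-embedding-model"  # placeholder
AutoTokenizer.from_pretrained(model_id).save_pretrained("converted-model")
AutoModel.from_pretrained(model_id).save_pretrained(
    "converted-model", safe_serialization=True
)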
echo "Copy and paste this in another terminal. Make sure you have installed nvidia-docker and build as described here:
https://github.com/huggingface/text-embeddings-inference#docker-build"
volume=$PWD/tei
model=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
revision=refs/pr/8
docker run \
--gpus all \
-p 8080:80 \
-v $volume:/data \
-v /home/ec2-user/.cache/huggingface/token:/root/.cache/huggingface/token \
--pull always \
ghcr.io/huggingface/text-embeddings-inference:latest \
--model-id $model \
--revision $revision \
--pooling mean \
--max-batch-tokens 65536
Test Endpoint
echo "This is just a sanity check so you don't get burned later."
response_code=$(curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json')
if [ "$response_code" -eq 200 ]; then
echo "passed"
else
echo "failed"
fi
Warning
I’ll be running the rest of the notebook with Inference Endpoints. If you are using the local version, you will need to make some minor code updates to point to the right endpoint.
Start TEI with Inference Endpoints
Another option is to run TEI on Inference Endpoints. It’s cheap and fast. It took me less than 5 minutes to get it up and running!
Check here for a comprehensive guide. For our tutorial make sure to set these options IN ORDER:
- Model Repository = sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Name your endpoint TEI-is-the-best
- Choose a GPU, I chose Nvidia A10G which is $1.3/hr
- Advanced Configuration
  - Task = Sentence Embeddings
  - Revision (based on this pull request for safetensors) = a21e6630
  - Container Type = Text Embeddings Inference

Set the other options as you prefer.
Test Endpoint
I chose to keep these hidden because they’re sensitive; getpass is a great way to hide sensitive data entry in public-facing code.
import getpass
API_URL = getpass.getpass(prompt='What is your API_URL?')
bearer_token = getpass.getpass(prompt='What is your BEARER TOKEN? Check your endpoint.')
What is your API_URL? ········
What is your BEARER TOKEN? Check your endpoint. ········
Here we have a couple of constants: the headers for our request and our MAX_WORKERS for parallel inference.
# Constants
HEADERS = {
"Authorization": f"Bearer {bearer_token}",
"Content-Type": "application/json"
}
MAX_WORKERS = 512
Let’s make sure everything is running correctly. Error messages with 512 workers are no fun.
import requests
def query(payload):
response = requests.post(API_URL, headers=HEADERS, json=payload)
return response.json()
output = query({
"inputs": "This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music!",
})
print(f'{output[0][:5]}...')
[0.0047912598, -0.03164673, -0.018051147, -0.057739258, -0.04498291]...
Ok great! We have TEI up and running!
Get Embeddings
Imports
import asyncio
from pathlib import Path
import json
import time
from aiohttp import ClientSession, ClientTimeout
from tqdm.notebook import tqdm
proj_dir = Path.cwd().parent
print(proj_dir)
/home/ec2-user/arabic-wiki
Config
As mentioned before, I think it’s always a good idea to have a central place where you can parameterize/configure your notebook.
files_in = list((proj_dir / 'data/processed/').glob('*.ndjson'))
folder_out = proj_dir / 'data/embedded/'
folder_out_str = str(folder_out)
Strategy
TEI allows multiple concurrent requests, so it’s important that we don’t waste the potential we have. I used the default max-concurrent-requests value of 512, so I want to use that many MAX_WORKERS. This way we can maximize the utilization of the GPU.
I’m using an async way of making requests that uses aiohttp as well as a nice progress bar. What better way to do Arabic NLP than with tqdm, which has an Arabic name?
Note that I’m using 'truncate': True because even with our 225-word split earlier, there are always exceptions. For some use cases this wouldn’t be the right answer, as throwing away information our retriever could use can lower performance and cause insidious error propagation. If that matters for you, you can pre-tokenize, assess token count, and re-chunk as needed; see the sketch below.
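Here is a minimal sketch of that pre-tokenize-and-check step. The 512-token limit and the helper name flag_long_documents are assumptions for illustration; verify the actual limit of your model and deployment.

# Sketch: flag documents that TEI would truncate, assuming the model's tokenizer
# and a 512-token limit (verify the real limit for your deployment).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
MAX_TOKENS = 512  # assumption

def flag_long_documents(documents):
    """Return (index, token_count) for documents whose content exceeds MAX_TOKENS."""
    flagged = []
    for i, document in enumerate(documents):
        n_tokens = len(tokenizer(document['content'], truncation=False)['input_ids'])
        if n_tokens > MAX_TOKENS:
            flagged.append((i, n_tokens))
    return flagged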
This is our code to efficiently leverage TEI.
async def request(document, semaphore):
# Semaphore guard
async with semaphore:
payload = {
"inputs": document['content'],
"truncate": True
}
timeout = ClientTimeout(total=10) # Set a timeout for requests (10 seconds here)
async with ClientSession(timeout=timeout, headers=HEADERS) as session:
async with session.post(API_URL, json=payload) as resp:
if resp.status != 200:
raise RuntimeError(await resp.text())
result = await resp.json()
document['embedding'] = result[0] # Assuming the API's output can be directly assigned
return document
async def main(documents):
# Semaphore to limit concurrent requests. Adjust the number as needed.
semaphore = asyncio.BoundedSemaphore(512)
# Creating a list of tasks
tasks = [request(document, semaphore) for document in documents]
# Using tqdm to show progress. It's been integrated into the async loop.
for f in tqdm(asyncio.as_completed(tasks), total=len(documents)):
await f
Now that we have all the pieces in place, we can get embeddings. Our high-level approach:
- Read each processed .ndjson chunk file into memory
- Use parallel workers to:
  - Get embeddings for each document
  - Update each document with the corresponding embedding
- Verify that we got embeddings (always error check)
- Write these to file.
start = time.perf_counter()
for i, file_in in tqdm(enumerate(files_in)):
with open(file_in, 'r') as f:
documents = [json.loads(line) for line in f]
# Get embeddings
await main(documents)
# Make sure we got it all
count = 0
for document in documents:
if document['embedding'] and len(document['embedding']) == 384:
count += 1
print(f'Batch {i+1}: Embeddings = {count} documents = {len(documents)}')
# Write to file
with open(folder_out/file_in.name, 'w', encoding='utf-8') as f:
for document in documents:
json_str = json.dumps(document, ensure_ascii=False)
f.write(json_str + '\n')
# Print elapsed time
elapsed_time = time.perf_counter() - start
minutes, seconds = divmod(elapsed_time, 60)
print(f"{int(minutes)} min {seconds:.2f} sec")
0it [00:00, ?it/s]
0%| | 0/243068 [00:00<?, ?it/s]
Batch 1: Embeddings = 243068 documents = 243068
...
0%| | 0/70322 [00:00<?, ?it/s]
Batch 23: Embeddings = 70322 documents = 70322
104 min 32.33 sec
Wow, it only took ~1hr 45 min!
Let’s make sure that we have all our documents:
!echo "$folder_out_str" && cat "$folder_out_str"/*.ndjson | wc -l
/home/ec2-user/arabic-wiki/data/embedded
2094596
Great, it looks like everything is correct.
Performance and Cost Analysis
You can see that we are quite cost-effective! For only $2.30 we got text embeddings for ~2M articles!! How does that compare with your expectations?
Note that the performance shown is over just the last 30 min window. Observations:
- We have a throughput of ~334/s
- Our median latency per request is ~50ms
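Those figures line up with a quick back-of-the-envelope check using the wall-clock time from the run above and the $1.3/hr A10G rate:

# Back-of-the-envelope check of the throughput and cost figures above
docs = 2_094_596              # total documents embedded
seconds = 104 * 60 + 32.33    # wall-clock time from the run above

print(f"throughput = {docs / seconds:.0f} docs/s")   # ~334/s
print(f"cost = ${seconds / 3600 * 1.3:.2f}")         # ~$2.26 at $1.3/hr, close to the ~$2.30 billed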
Next Steps
In the next blog post I’ll show an exciting solution for VectorDB choice and usage. I was blown away, since I expected the full ingestion to take about a day, but it was MUCH FASTER 😱.