Goal
This is part 1 of 6 of a tutorial on Arabic RAG. RAG is short for Retrieval Augmented Generation. It took its name from Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, though its current usage is much closer to the RALM paper.
In this tutorial you will learn:
- Why RAG is important
- How to download Wikipedia
- How to format Wikipedia for scalable processing
Addressing Hallucinations
Large Language Models (LLMs) get blamed (though unfairly, IMHO) for “hallucinating”. We often anthropomorphize LLMs as if they were a competent knowledge base, when in fact they don’t really “know” things the way humans do. The reality is that they are just good at predicting the next token (similar to a word) in a sequence of text.
How does RAG work
RAG is a way of conditioning the prompt on grounded truth. We retrieve relevant documents and ask the LLM to answer questions based only on the information in those documents. This is a much easier problem, and we have seen great results and far fewer “hallucinations”.
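To make the idea concrete, here is a minimal sketch of how a RAG prompt can be assembled. The `build_rag_prompt` helper and its wording are illustrative assumptions, not the exact prompt used later in this series.

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Condition the prompt on retrieved passages so the LLM answers only from them."""
    # Number the retrieved passages so the model (and we) can refer to them
    context = "\n\n".join(
        f"Passage {i}:\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "If the answer is not in the passages, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The retrieval step that supplies `documents` is what the rest of this series builds.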
Tutorial Notes
I’ll be downloading and processing Arabic Wikipedia as a knowledge base for RAG. I chose this because it’s a great source of knowledge on a wide variety of topics. With millions of articles, it will also force us to use best practices for scale, which many tutorials skip. Arabic Wikipedia is a best-case-scenario source of information, since all (or most) LLMs are pre-trained on Wikipedia. I’ll be using jais-13B from core42 as the LLM. Check out the paper here to see the training details.
Note
There are a couple of sticking points in this tutorial, since we need a directory structure and supporting files, and these don’t translate easily to a tutorial format.
Warning
The easiest way to run this would be to:
git clone https://huggingface.co/spaces/derek-thomas/arabic-RAG
cd arabic-RAG
jupyter lab
Then use this notebook as a guide.
Get Data
The data from Wikipedia starts in XML; our approach will be to convert this into a series of .ndjson files for easy processing.
If you have any ideas on how this could be improved please do let me know in the comments!
Initialize
from pathlib import Path
import sys
proj_dir_path = Path.cwd().parent
proj_dir = str(proj_dir_path)
# So we can import later
sys.path.append(proj_dir)
Install
Note
This is the first sticking point. You can find the file here
%pip install -q -r "$proj_dir"/requirements.txt
Note: you may need to restart the kernel to use updated packages.
Download Arabic Wikipedia
Tip
I’m getting “latest”, but it’s good practice to check which version it actually is.
!curl -I https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2 --silent | grep "Last-Modified"
Last-Modified: Sat, 21 Oct 2023 02:57:42 GMT
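If you prefer to record the dump date from Python rather than curl, a small stdlib-only check works too. This is just a convenience sketch and not part of the repo.

```python
from urllib.request import Request, urlopen

url = (
    "https://dumps.wikimedia.org/arwiki/latest/"
    "arwiki-latest-pages-articles-multistream.xml.bz2"
)

# HEAD request: fetch only the headers, not the 1.6 GB dump itself
with urlopen(Request(url, method="HEAD")) as resp:
    print(resp.headers["Last-Modified"])
```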
Download Arabic Wikipedia.
Note
Note that there is an implied directory structure here: you will be putting your data in ../data for this and the subsequent tutorials in this series.
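If you are not working from the cloned repo, you can create the expected folders up front. This is a small convenience sketch; the data/consolidated path matches what the consolidation step below writes to.

```python
from pathlib import Path

proj_dir_path = Path.cwd().parent

# Create the implied directory layout (a no-op if the folders already exist)
for sub in ("data/raw", "data/consolidated"):
    (proj_dir_path / sub).mkdir(parents=True, exist_ok=True)
```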
!wget -nc -P "$proj_dir"/data/raw https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2
--2023-10-28 08:09:45-- https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles-multistream.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142, 2620:0:861:2:208:80:154:142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671369109 (1.6G) [application/octet-stream]
Saving to: ‘/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2’
100%[====================================>] 1,671,369,109 4.54MB/s in 5m 54s
2023-10-28 08:15:39 (4.51 MB/s) - ‘/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2’ saved [1671369109/1671369109]
Extract
The downloaded Wikipedia dump is in XML. wikiextractor will convert this into a jsonl format split across many folders and files.
!wikiextractor -o "$proj_dir"/data/raw/output --json "$proj_dir"/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2
INFO: Preprocessing '/home/ec2-user/arabic-wiki/data/raw/arwiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
...
INFO: Extracted 2200000 articles (2416.5 art/s)
INFO: Finished 3-process extraction of 2254650 articles in 641.2s (3516.3 art/s)
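To sanity-check the extraction, you can peek at the first line of one of the generated files. The keys noted in the comment are what wikiextractor’s --json mode typically emits; treat this as an exploratory sketch rather than part of the pipeline.

```python
import json
from pathlib import Path

output_dir = Path(proj_dir) / "data" / "raw" / "output"

# wikiextractor writes files named wiki_00, wiki_01, ... inside sub-folders (AA, AB, ...)
first_file = sorted(output_dir.rglob("wiki_*"))[0]

with open(first_file, encoding="utf-8") as f:
    article = json.loads(f.readline())

print(article.keys())      # typically: id, revid, url, title, text
print(article["title"])
```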
Consolidate
The split format is tedious to deal with, so now we will consolidate it into a series of .ndjson files. This is important, as our data processing machine might not have enough RAM to fit the whole dataset.
I like .ndjson because it:
- Allows us to stream easily, since we can split on rows/lines; regular .json is too messy to easily get the first x rows (see the streaming sketch after this list)
- Has higher fault tolerance: if one line is corrupted, the file is not corrupted
- Facilitates parallel processing
- Allows incremental loading
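Here is the streaming sketch referenced above: reading an .ndjson file one line at a time keeps memory flat regardless of file size. The file path and the title key in the usage comment are illustrative assumptions; check the consolidated output for the actual names.

```python
import json

def stream_ndjson(path):
    """Yield one JSON object per line without loading the whole file into RAM."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example usage (path and keys are illustrative):
# for i, article in enumerate(stream_ndjson("../data/consolidated/some_file.ndjson")):
#     print(article.get("title"))
#     if i == 2:
#         break
```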
Dive into the consolidate functionality for more details.
Note
This is the second sticking point, as it’s not easy to place extra code. Here is consolidate.py in case you aren’t using the repository.
from src.preprocessing.consolidate import folder_to_json
folder = proj_dir_path / 'data/raw/output'
folder_out = proj_dir_path / 'data/consolidated/'
folder_to_json(folder, folder_out, 'ar_wiki')
Processing: 100%|█████████████████| 6119/6119 [01:11<00:00, 85.38file/s, File: wiki_18 | Dir: /home/ec2-user/arabic-wiki/data/raw/output/CJ]
Wiki processed in 72.87 seconds!
We did it! We now have the latest Arabic Wikipedia in a digestible format. Let’s do some analysis in Part 2 to understand how we should chunk our data.
Tutorial Design Decisions
I structured this in a repo for a couple of reasons:
- It allows good software practices like abstraction
- It is easier to create a data system that will work for the future tutorials
- It’s cleaner and easier to maintain
It does pose a challenge when I describe my work. Many times readers would rather just click a button and rush through a notebook. That’s fair, I’ve been there. In this case I wanted to take a deeper, more intentional look. Feel free to comment with your thoughts!