Ask HN: 16 yo Nephew, in E. Africa, wants to train an LLM with on disk Wikipedia

Hello HN!

My 16 year old nephew lives in an East African nation where there is practically no internet access.

Last week he asked me for advice on how to go about training an open-source LLM using an on-disk Wikipedia (~80 GB).

Any suggestions? Thanks!

14 points | by a_w 9 days ago

5 comments

  • runjake 8 days ago
    In addition to the other great suggestions, point him to Karpathy's YouTube channel[1]. Karpathy has an approachable communication style.

    Here's his "1 hour intro to LLMs" video: https://www.youtube.com/watch?v=zjkBMFhNj_g

    1. https://www.youtube.com/c/AndrejKarpathy

    • a_w 7 days ago
      Thanks!

      I will try to download it and send it to him.

  • FrenchDevRemote 7 days ago
    Not an expert, but maybe using RAG/embeddings on the on-disk Wikipedia would be better than fine-tuning on Wikipedia?

    Most decent LLMs were probably already trained on Wikipedia; that doesn't stop them from hallucinating when asked questions about it.
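
    Something like this, roughly, for the retrieval half (TF-IDF with scikit-learn as a simple, fully offline stand-in for embeddings; the articles/ folder and the example question are just placeholders, and the dump would need extracting to plain text first):

      import glob
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Plain-text articles extracted from the on-disk dump (placeholder path).
      paths = glob.glob("articles/*.txt")
      docs = [open(p, encoding="utf-8").read() for p in paths]

      # Build a TF-IDF index: simple and fully offline. A dense embedding model
      # could be swapped in here if one is available on disk.
      vectorizer = TfidfVectorizer(stop_words="english")
      doc_matrix = vectorizer.fit_transform(docs)

      def retrieve(question, k=3):
          """Return the k articles most similar to the question."""
          q_vec = vectorizer.transform([question])
          scores = cosine_similarity(q_vec, doc_matrix).ravel()
          top = scores.argsort()[::-1][:k]
          return [(paths[i], scores[i]) for i in top]

      # The retrieved text then goes into the LLM's prompt as context.
      for path, score in retrieve("What causes the East African rift?"):
          print(f"{score:.3f}  {path}")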

    • more_corn 3 days ago
      ^ This is the way
    • a_w 7 days ago
      Thanks for the suggestion! I will look into this.
  • throwaway11460 5 days ago
    Would it be possible to ship him a Starlink terminal? Internet access could do wonders for a young interested guy like that... And he could share that connectivity with people around him too.
    • a_w 2 days ago
      I have been thinking about that, but I haven't gotten around to researching its availability in the country yet.

      I will do some research over the weekend. Thanks for mentioning it!

  • icsa 9 days ago
    Use a model already trained on Wikipedia, using llamafile.

    You can download llamafile and several models, put them on a USB drive or hard drive, then send the drive to him via DHL.
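
    Once it is running, he can also script against it locally. Roughly, assuming the default llamafile web server on localhost:8080 and its OpenAI-compatible chat endpoint (the question text is just an example):

      import json
      import urllib.request

      # Assumes a llamafile is already running in its default server mode,
      # which exposes an OpenAI-compatible API on http://localhost:8080.
      URL = "http://localhost:8080/v1/chat/completions"

      payload = {
          "model": "local",  # llamafile serves whatever model it was packaged with
          "messages": [
              {"role": "user",
               "content": "Summarise the Great Rift Valley in two sentences."}
          ],
          "temperature": 0.7,
      }

      req = urllib.request.Request(
          URL,
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          reply = json.load(resp)

      print(reply["choices"][0]["message"]["content"])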

    • a_w 9 days ago
      That is a great suggestion, thank you!

      I think he wants to tinker and learn more about how they work. What I neglected to mention is that he has already learned to program (he develops Android apps and has also learned Python). He is a very bright and curious kid.

      • icsa 9 days ago
        Have him check out:

        LLM training in simple, raw C/CUDA

        ----------------------------------

        https://github.com/karpathy/llm.c

        It is only 1,000 lines of easy to read C code. There is also Python reference code.

      • icsa 6 days ago
        Btw, I support some Kenyan high school students and am looking at supplying a few schools with llamafile+models on flash drives for their computer science curricula.
        • a_w 5 days ago
          That's interesting. Could you expand on this a bit more? Which models are you considering, and how do you see the CS teachers/students using this?
          • icsa 3 days ago
            I'm reviewing models, at the moment. Model selection will depend greatly on the hardware capabilities at each school. Phi-3 could be a good starting point.

            The project is an idea at the moment. My contact in Kenya has direct access to the Principals of the schools that our supported students attend.

            My thought is that the teachers would not have to do much. Many of the students already know Python and could do self-learning individually or in groups.

            A flash drive with llamafile+models and documentation might be all that it would take to get them started - even offline.

            Bonus: Using llamafile, the same binary distribution works on macOS, Linux, and Windows.

            • a_w 2 days ago
              Thanks for the detailed response.

              I wasn't aware of Phi-3 - I will look into it.

  • joegibbs 9 days ago
    What kind of GPUs does he have?
    • a_w 9 days ago
      I believe he has a laptop with an Intel i5 with integrated graphics.