
Development & AI | Alper Akgun

September 2023

My local AI Copilot & LLM Inference Server for Emacs

As an Emacs user, on my quest to AIfy my coding experience, I experimented with several cloud AI code completion tools a few months ago. To my horror, my private Org mode data was sent to the AI provider.

This prompted me to build a local inference runner for code completion that keeps my data private and fully customizable: the Kodlokal Inference Server was born. I've also created kodlokal.el, an Emacs client for it.

The benefits of Kodlokal include enhanced security, privacy, offline access, and customization. The drawbacks are slower response times and lower-quality code completion. Still, there is hope that in the near future we will have a small code completion model with a HumanEval score in the viable range.

Here are the steps I used to set up kodlokal.el in Emacs.


# 1. Clone the kodlokal.el repo
git clone https://github.com/kodlokal/kodlokal.el.git ~/.emacs.d/kodlokal.el

# 2. Add kodlokal as a code completion provider
(add-to-list 'load-path "~/.emacs.d/kodlokal.el")

(use-package kodlokal
  :init
  (add-to-list 'completion-at-point-functions #'kodlokal-completion-at-point)
  :config
  (setq use-dialog-box nil))
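
;; Optionally, instead of registering kodlokal globally as above, you can
;; enable it buffer-locally only in programming modes. This is just a
;; sketch using standard Emacs hooks; adapt it to your setup.
(add-hook 'prog-mode-hook
          (lambda ()
            (add-hook 'completion-at-point-functions
                      #'kodlokal-completion-at-point nil t)))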

# 3. Make sure you have a company mode configuration
(use-package company
  :defer 0.1
  :config
  (global-company-mode t)
  (setq-default
   company-idle-delay 0.900
   company-require-match nil
   company-minimum-prefix-length 3
   company-frontends '(company-preview-frontend)
   ))
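
;; Company surfaces completion-at-point-functions (and therefore kodlokal)
;; through its company-capf backend. company-capf is in company-backends by
;; default; the safeguard below only matters if your config overrides the
;; backend list.
(add-to-list 'company-backends 'company-capf)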

# 4. I also set a shortcut to extend code completion
(global-set-key (kbd "s-/") 'company-complete-common-or-cycle)
            

Here's how I run the kodlokal inference server:


# Install the repo; you need Python 3.10 or later
git clone https://github.com/kodlokal/kodlokal.git
cd kodlokal
python -m venv v
source v/bin/activate
pip install -r requirements.txt
# pip install ctransformers[cuda] # only if you have CUDA environment for an Nvidia GPU

# Install some models.
mkdir models && cd models
wget https://huggingface.co/TheBloke/stablecode-completion-alpha-3b-4k-GGML/resolve/main/stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin
wget https://huggingface.co/SlyEcho/open_llama_3b_v2_ggml/resolve/main/open-llama-3b-v2-q4_0.bin

# Create a config.py from the config.py.sample
HOST = '127.0.0.1'
PORT = 3737
THREADS = 1
TIMEOUT = 60
MODELS_FOLDER = './models/'
TEXT_MODEL = 'open-llama-3b-v2-q4_0.bin'
TEXT_MODEL_TYPE = 'llama'
TEXT_TEMPERATURE = 0.37
TEXT_MAX_NEW_TOKENS = 73
TEXT_GPU_LAYERS = 0
CODE_MODEL = 'stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin'
CODE_MODEL_TYPE = 'gpt-neox'
CODE_TEMPERATURE = 0.20
CODE_MAX_NEW_TOKENS = 37
CODE_GPU_LAYERS = 0

# Finally run
python run.py
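
Once the server is running, I like to smoke-test it without leaving Emacs. The snippet below is only a rough sketch: it assumes the server accepts the raw prompt as a POST body on a /code path at 127.0.0.1:3737 (matching the config above); check the kodlokal README for the actual endpoint and request format.

;; Rough smoke test from Emacs with the built-in url.el library.
;; The /code path and raw-body request format are assumptions on my part.
(require 'url)
(let ((url-request-method "POST")
      (url-request-data "def fibonacci(n):"))
  (with-current-buffer
      (url-retrieve-synchronously "http://127.0.0.1:3737/code")
    (message "%s" (buffer-string))))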