
Run Local Model: Image-to-Text Model


Setup Overview

My setup is an NVIDIA GTX 1060 GPU running CachyOS Linux.


Python Prerequisites

Create Python Venv

CachyOS uses fish as the default shell; adapt the activate.fish step if you use a different shell.

# Create virtual environment
python3 -m venv .venv-hf-project-1

# Activate environment
source .venv-hf-project-1/bin/activate.fish

# Upgrade pip
pip install --upgrade pip

# (Deactivate venv)
deactivate
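
With the venv active, a quick sanity check is to confirm that the interpreter actually comes from the venv:

# Sanity check: the active interpreter should live inside the venv
import sys

print(sys.prefix)      # should point into .venv-hf-project-1
print(sys.executable)  # .../.venv-hf-project-1/bin/python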

Pip Requirements

Create Pinned Version List

Find the available CUDA versions at: https://download.pytorch.org/whl/

# List available versions
pip index versions torch \
  --index-url https://download.pytorch.org/whl/cu118

# Shell output:
torch (2.7.1+cu118)
Available versions: 2.7.1+cu118, 2.7.0+cu118, 2.6.0+cu118, 2.5.1+cu118, 2.5.0+cu118

Create a file with the intended packages:

  • requirements.in
--index-url https://download.pytorch.org/whl/cu118
--extra-index-url https://pypi.org/simple

torch==2.7.1+cu118

python-dotenv
pillow

transformers
accelerate
datasets
huggingface_hub

# Install pip-tools
pip install pip-tools

# Create a requirements.txt with the exact pinned package and dependency versions
pip-compile -o requirements.txt requirements.in

The pinned requirements and their dependencies look like this:

  • requirements.txt
#
# This file is autogenerated by pip-compile with Python 3.13
# by the following command:
#
#    pip-compile --output-file=requirements.txt requirements.in
#
--index-url https://download.pytorch.org/whl/cu118
--extra-index-url https://pypi.org/simple

accelerate==1.12.0
    # via -r requirements.in
aiohappyeyeballs==2.6.1
    # via aiohttp
aiohttp==3.13.2
    # via fsspec
aiosignal==1.4.0
    # via aiohttp
anyio==4.12.0
    # via httpx
attrs==25.4.0
    # via aiohttp
certifi==2025.11.12
    # via
    #   httpcore
    #   httpx
    #   requests
charset-normalizer==3.4.4
    # via requests
datasets==4.4.1
    # via -r requirements.in
dill==0.4.0
    # via
    #   datasets
    #   multiprocess
filelock==3.20.0
    # via
    #   datasets
    #   huggingface-hub
    #   torch
    #   transformers
frozenlist==1.8.0
    # via
    #   aiohttp
    #   aiosignal
fsspec[http]==2025.10.0
    # via
    #   datasets
    #   huggingface-hub
    #   torch
h11==0.16.0
    # via httpcore
hf-xet==1.2.0
    # via huggingface-hub
httpcore==1.0.9
    # via httpx
httpx==0.28.1
    # via datasets
huggingface-hub==0.36.0
    # via
    #   -r requirements.in
    #   accelerate
    #   datasets
    #   tokenizers
    #   transformers
idna==3.11
    # via
    #   anyio
    #   httpx
    #   requests
    #   yarl
jinja2==3.1.6
    # via torch
markupsafe==3.0.3
    # via jinja2
mpmath==1.3.0
    # via sympy
multidict==6.7.0
    # via
    #   aiohttp
    #   yarl
multiprocess==0.70.18
    # via datasets
networkx==3.6.1
    # via torch
numpy==2.3.5
    # via
    #   accelerate
    #   datasets
    #   pandas
    #   transformers
nvidia-cublas-cu11==11.11.3.6
    # via
    #   nvidia-cudnn-cu11
    #   nvidia-cusolver-cu11
    #   torch
nvidia-cuda-cupti-cu11==11.8.87
    # via torch
nvidia-cuda-nvrtc-cu11==11.8.89
    # via torch
nvidia-cuda-runtime-cu11==11.8.89
    # via torch
nvidia-cudnn-cu11==9.1.0.70
    # via torch
nvidia-cufft-cu11==10.9.0.58
    # via torch
nvidia-curand-cu11==10.3.0.86
    # via torch
nvidia-cusolver-cu11==11.4.1.48
    # via torch
nvidia-cusparse-cu11==11.7.5.86
    # via torch
nvidia-nccl-cu11==2.21.5
    # via torch
nvidia-nvtx-cu11==11.8.86
    # via torch
packaging==25.0
    # via
    #   accelerate
    #   datasets
    #   huggingface-hub
    #   transformers
pandas==2.3.3
    # via datasets
pillow==12.0.0
    # via -r requirements.in
propcache==0.4.1
    # via
    #   aiohttp
    #   yarl
psutil==7.1.3
    # via accelerate
pyarrow==22.0.0
    # via datasets
python-dateutil==2.9.0.post0
    # via pandas
python-dotenv==1.2.1
    # via -r requirements.in
pytz==2025.2
    # via pandas
pyyaml==6.0.3
    # via
    #   accelerate
    #   datasets
    #   huggingface-hub
    #   transformers
regex==2025.11.3
    # via transformers
requests==2.32.5
    # via
    #   datasets
    #   huggingface-hub
    #   transformers
safetensors==0.7.0
    # via
    #   accelerate
    #   transformers
six==1.17.0
    # via python-dateutil
sympy==1.14.0
    # via torch
tokenizers==0.22.1
    # via transformers
torch==2.7.1+cu118
    # via
    #   -r requirements.in
    #   accelerate
tqdm==4.67.1
    # via
    #   datasets
    #   huggingface-hub
    #   transformers
transformers==4.57.3
    # via -r requirements.in
triton==3.3.1
    # via torch
typing-extensions==4.15.0
    # via
    #   huggingface-hub
    #   torch
tzdata==2025.2
    # via pandas
urllib3==2.6.2
    # via requests
xxhash==3.6.0
    # via datasets
yarl==1.22.0
    # via aiohttp

# The following packages are considered to be unsafe in a requirements file:
# setuptools

Install Requirements

# Install pinned requirements and dependencies
pip install -r requirements.txt
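
Before moving on, verify that the cu118 build of torch actually detects the GPU. A minimal check:

# Verify the CUDA build of torch sees the GPU
import torch

print(torch.__version__)          # expect 2.7.1+cu118
print(torch.cuda.is_available())  # expect True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should show the GTX 1060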



Python App

Hugging Face Model

Model Link: https://huggingface.co/Salesforce/blip-image-captioning-base
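
The app below wraps the model in a transformers pipeline. For reference, the model can also be called directly via its processor and model classes; a minimal sketch, using image1.jpg from this project as input:

# Sketch: call BLIP directly instead of via pipeline()
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("image1.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))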


File and Folder Structure

The file and folder structure of the project looks like this:

hf-project-1
├── app.py
├── Dockerfile
├── .env
├── image1.jpg
├── image2.jpg
├── image3.jpg
├── requirements.in
└── requirements.txt

.env File & HF API Token

Create Hugging Face API token:

  • Go to: “Settings” > “Access Tokens” > “Create new token”

  • Select “Read”

  • Click “Create token”


Add the token to the .env file:

  • .env
# Hugging Face read token
HUGGINGFACEHUB_API_TOKEN=mysecuretoken
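
Note that huggingface_hub itself reads the HF_TOKEN environment variable, and the public BLIP model downloads without any token. To make the token from .env count (e.g. for gated or private models), one option is an explicit login; a minimal sketch:

# Optional: authenticate with the token from .env
# (not required for the public BLIP model)
import os

from dotenv import find_dotenv, load_dotenv
from huggingface_hub import login

load_dotenv(find_dotenv())
token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
if token:
    login(token=token)  # registers the token for huggingface_hub downloads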

Python app.py

from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
import sys
import os

# Load variables from .env file
load_dotenv(find_dotenv())

# Function to load image2text model
def load_model_img_to_text():
    return pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",
        device="cuda:0"
    )  # Omit device to let it auto-detect

# Load the image2text model; the result is a pipeline object
model_img_to_text = load_model_img_to_text()

# Function to use image2text model 
def img_to_text(image_path):
    textresult = model_img_to_text(image_path)
    print(textresult)  # Print output to the terminal
    return textresult  # Return the result for further processing

# Run model
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python app.py <image-path>")
        sys.exit(1)

    image_path = sys.argv[1]

    if not os.path.isfile(image_path):
        print(f"File not found: {image_path}")
        sys.exit(1)

    # Run image2text function
    img_to_text(image_path)

  • The command python app.py foo.jpg results in sys.argv: ["app.py", "foo.jpg"]

Run App

# Run the app: pass an image as argument
python app.py image2.jpg

# Shell output:
[{'generated_text': 'a man feeding seaguls on a boat'}]
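
The pipeline also accepts a list of inputs, so all images in the project can be captioned in one call; a small sketch:

# Sketch: caption multiple images in one pipeline call
from transformers import pipeline

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",
    device="cuda:0",
)

images = ["image1.jpg", "image2.jpg", "image3.jpg"]
for path, result in zip(images, captioner(images)):
    print(path, "->", result[0]["generated_text"])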

Hugging Face Cache

The model is downloaded to the local Hugging Face cache:

# Verify model cache
ls ~/.cache/huggingface/hub/models--Salesforce--blip-image-captioning-base

# Shell output:
drwxr-xr-x - cachyos  6 Dec 18:14  .no_exist
drwxr-xr-x - cachyos  6 Dec 18:15  blobs
drwxr-xr-x - cachyos  6 Dec 18:14  refs
drwxr-xr-x - cachyos  6 Dec 18:14  snapshots
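
The cache can also be inspected from Python with huggingface_hub, e.g. to list all cached models and their size on disk:

# Sketch: scan the local Hugging Face cache programmatically
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(repo.repo_id, repo.size_on_disk_str)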