
Hugging Face: Run Image 2 Text Model with Streamlit based UI


Setup Overview

My setup is an NVIDIA GTX 1060 GPU on a CachyOS Linux system.

Python Prerequisites

Create Python Venv

CachyOS uses fish as the default shell; adapt the activate.fish step if you use a different shell.

# Create virtual environment
python3 -m venv .venv-hf-project-1

# Activate environment
source .venv-hf-project-1/bin/activate.fish

# Upgrade pip
pip install --upgrade pip

# (Deactivate venv)
deactivate

Install Pip Requirements

Find the PyTorch wheel index that matches your CUDA version: https://download.pytorch.org/whl/

  • requirements-cuda.txt
--index-url https://download.pytorch.org/whl/cu118  # CUDA 11.8

torch  # PyTorch framework
torchvision  # PyTorch companion package for computer vision
torchaudio  # PyTorch companion package for audio processing / speech & audio models
  • requirements-core.txt
# Utilities
python-dotenv

# Hugging Face ecosystem
transformers[torch]  # Hugging Face main library for transformer models
accelerate  # Hugging Face utility toolkit for device management and training support
datasets  # library for handling datasets
huggingface_hub  # API for interacting with the Hugging Face Model Hub
streamlit  # web UI framework
pillow  # image loading and processing (PIL fork)

# Install requirements
pip install -r requirements-cuda.txt
pip install -r requirements-core.txt
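
After installing, it is worth checking that PyTorch actually sees the GPU before moving on. A minimal sketch (the reported version and device name will differ depending on your hardware and wheel build):

# Quick CUDA sanity check
import torch

print(torch.__version__)            # a +cu118 build is expected with the index URL above
print(torch.cuda.is_available())    # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce GTX 1060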



Python App

Hugging Face Model

Model Link: https://huggingface.co/Salesforce/blip-image-captioning-base
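
Before wrapping the model in a UI, the pipeline can be smoke-tested from a plain Python script. A minimal sketch, with test.jpg as a placeholder for any local image:

# Standalone test of the BLIP captioning pipeline (test.jpg is a placeholder)
from transformers import pipeline
from PIL import Image

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",
    device="cuda:0"  # use device=-1 to run on the CPU
)

image = Image.open("test.jpg").convert("RGB")
print(captioner(image))  # returns a list of dicts, e.g. [{'generated_text': '...'}]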


File and Folder Structure

The file and folder structure of the project looks like this:

hf-project-1
├── app.py
├── .env
├── requirements-core.txt
└── requirements-cuda.txt

.env File & HF API Token

Create Hugging Face API token:

  • Go to: “Settings” > “Access Tokens” > “Create new token”

  • Select “Read”

  • Click “Create token”


Add the token to the .env file:

  • .env
# Hugging Face read token
HUGGINGFACEHUB_API_TOKEN=mysecuretoken
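
The load_dotenv call in app.py below makes the token available as an environment variable. The BLIP model is public, so a token is not strictly required for the download, but if authentication is needed (for example for gated models), the token can be read and passed to huggingface_hub roughly like this (a sketch):

# Sketch: read the token from .env and authenticate against the Hub
import os
from dotenv import find_dotenv, load_dotenv
from huggingface_hub import login

load_dotenv(find_dotenv())
token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
if token:
    login(token=token)  # authenticates Hub downloads made via huggingface_hub / transformers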

Python app.py

from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
from PIL import Image
import streamlit as st

# Load variables from .env file
load_dotenv(find_dotenv())

# Load model
@st.cache_resource  # Cache the pipeline so Streamlit reruns don't reload the model
def load_model_img_to_text():
    return pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",
        device="cuda:0"  # first GPU; use device=-1 to run on the CPU
    )

model_img_to_text = load_model_img_to_text()

# ---- Streamlit UI ----
st.title("Image to Text Demo")
st.write("Upload an image to generate a text description")

uploaded_file = st.file_uploader(
    "Upload image:",
    type=["jpg", "jpeg", "png"]
)

if uploaded_file is not None:
    # Open and show the image
    image = Image.open(uploaded_file).convert("RGB")
    st.image(image, caption="Uploaded image", width=700)

    if st.button("Generate caption"):
        with st.spinner("Running model..."):
            result = model_img_to_text(image)
            caption = result[0]["generated_text"]

        st.success("Caption:")
        st.write(caption)

Run App

# Run app
streamlit run app.py

# Shell output:
You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.168.70.21:8501

Hugging Face Cache

The model is downloaded to the local Hugging Face cache:

# Verify model cache
ls ~/.cache/huggingface/hub/models--Salesforce--blip-image-captioning-base

# Shell output:
drwxr-xr-x - cachyos  6 Dec 18:14  .no_exist
drwxr-xr-x - cachyos  6 Dec 18:15  blobs
drwxr-xr-x - cachyos  6 Dec 18:14  refs
drwxr-xr-x - cachyos  6 Dec 18:14  snapshots
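
The cache can also be inspected programmatically via huggingface_hub; a small sketch that lists the cached repositories and their size on disk:

# Sketch: list cached models and their disk usage
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(repo.repo_id, repo.size_on_disk, "bytes")
print("Total:", cache_info.size_on_disk, "bytes")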