Setup Overview #
My setup consists of an NVIDIA GTX 1060 GPU on a system running CachyOS Linux.
Python Prerequisites #
Create Python Venv #
CachyOS uses fish as the default shell; if you use bash or zsh instead, source activate rather than activate.fish.
# Create virtual environment
python3 -m venv .venv-hf-project-1
# Activate environment
source .venv-hf-project-1/bin/activate.fish
# Upgrade pip
pip install --upgrade pip
# (Deactivate venv)
deactivate
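To confirm the virtual environment is active, check which interpreter is in use (a minimal sketch; the expected path assumes the .venv-hf-project-1 environment created above):
# Verify the active interpreter (run inside the activated venv)
import sys
print(sys.executable)  # Expected: .../.venv-hf-project-1/bin/python
print(sys.prefix)      # Root directory of the active environment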
Install Pip Requirements #
Find the index URL that matches your CUDA version at: https://download.pytorch.org/whl/
- requirements-cuda.txt
--index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8
torch # PyTorch framework
torchvision # PyTorch companion package for computer vision
torchaudio # PyTorch companion package for audio processing / speech & audio models
- requirements-core.txt
# Utilities
python-dotenv
# Hugging Face ecosystem
transformers[torch] # Hugging Face main library for transformer models (with PyTorch extras)
accelerate # Hugging Face utility toolkit for device management and training support
datasets # library for handling datasets
huggingface_hub # API for interacting with the Hugging Face Model Hub
streamlit # Web UI framework for the demo app
pillow # Image loading and handling (PIL fork)
# Install requirements
pip install -r requirements-cuda.txt
pip install -r requirements-core.txt
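After installing the requirements, a quick check confirms that PyTorch detects the GPU (a minimal sketch using standard torch calls; run it inside the activated venv):
# Verify that PyTorch can use the CUDA GPU
import torch
print(torch.__version__)              # Installed PyTorch version
print(torch.cuda.is_available())      # Should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the GTX 1060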
Python App #
Hugging Face Model #
Model Link: https://huggingface.co/Salesforce/blip-image-captioning-base
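Before building the Streamlit app, the model can be tested with a minimal standalone script (a sketch; the image path example.jpg is a placeholder for any local test image):
# Quick standalone test of the BLIP image captioning pipeline
from transformers import pipeline
from PIL import Image

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base"
)

image = Image.open("example.jpg").convert("RGB")  # Placeholder test image
result = captioner(image)
print(result[0]["generated_text"])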
File and Folder Structure #
The file and folder structure of the project looks like this:
hf-project-1
├── app.py
├── .env
├── requirements-core.txt
└── requirements-cuda.txt
.env File & HF API Token #
Create a Hugging Face API token:
- Go to: “Settings” > “Access Tokens” > “Create new token”
- Select “Read”
- Click “Create token”
Add the token to the .env file:
- .env
# Hugging Face read token
HUGGINGFACEHUB_API_TOKEN=mysecuretoken
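To verify that the token is picked up from the .env file, a quick check can be run (a minimal sketch; whoami from huggingface_hub is one way to validate the token against the Hub):
# Verify the Hugging Face token from the .env file
import os
from dotenv import find_dotenv, load_dotenv
from huggingface_hub import whoami

load_dotenv(find_dotenv())
token = os.environ["HUGGINGFACEHUB_API_TOKEN"]
print(whoami(token=token)["name"])  # Prints the Hugging Face account name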
Python app.py #
from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
from PIL import Image
import streamlit as st
# Load variables from .env file
load_dotenv(find_dotenv())
# Load model
@st.cache_resource # Cache the model
def load_model_img_to_text():
    return pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",
        device="cuda:0"  # Omit device to let it auto-detect
    )
model_img_to_text = load_model_img_to_text()
# ---- Streamlit UI ----
st.title("Image to Text Demo")
st.write("Upload an image and output the description")
uploaded_file = st.file_uploader(
    "Upload image:",
    type=["jpg", "jpeg", "png"]
)
if uploaded_file is not None:
    # Open and show the image
    image = Image.open(uploaded_file).convert("RGB")
    st.image(image, caption="Uploaded image", width=700)

    if st.button("Generate caption"):
        with st.spinner("Running model..."):
            result = model_img_to_text(image)

        caption = result[0]["generated_text"]
        st.success("Caption:")
        st.write(caption)
Run App #
# Run app
streamlit run app.py
# Shell output:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://192.168.70.21:8501
Hugging Face Cache #
The model is downloaded to the local Hugging Face cache:
# Verify model cache
ls ~/.cache/huggingface/hub/models--Salesforce--blip-image-captioning-base
# Shell output:
drwxr-xr-x - cachyos 6 Dec 18:14 .no_exist
drwxr-xr-x - cachyos 6 Dec 18:15 blobs
drwxr-xr-x - cachyos 6 Dec 18:14 refs
drwxr-xr-x - cachyos 6 Dec 18:14 snapshots
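The cache contents can also be listed programmatically (a minimal sketch using scan_cache_dir from huggingface_hub):
# List cached Hugging Face repositories and their size on disk
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(repo.repo_id, repo.size_on_disk_str)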