Building a GPU-Accelerated Ollama LangChain Workflow with RAG Agents, Multi-Session Chat, and Performance Monitoring

In this tutorial, we build a GPU-capable local LLM stack that unifies Ollama and LangChain. We install the required libraries, launch the Ollama server, pull a model, and wrap it in a custom LangChain LLM, allowing us to control temperature, token limits, and context. We add a Retrieval-Augmented Generation layer that ingests PDFs or text, chunks them, embeds them with Sentence-Transformers, and serves grounded answers. We manage multi-session chat memory, register tools (web search + RAG query), and spin up an agent that reasons about when to call them.

import os
import sys
import subprocess
import time
import threading
import queue
import json
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
from contextlib import contextmanager
import asyncio
from concurrent.futures import ThreadPoolExecutor


def install_packages():
    """Install required packages for Colab environment"""
    packages = [
        "langchain",
        "langchain-community",
        "langchain-core",
        "chromadb",
        "sentence-transformers",
        "faiss-cpu",
        "pypdf",
        "python-docx",
        "requests",
        "psutil",
        "pyngrok",
        "gradio"
    ]
   
    for package in packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])


install_packages()


import requests
import psutil
import threading
from queue import Queue
from langchain.llms.base import LLM
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.schema import BaseMessage, HumanMessage, AIMessage, SystemMessage
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryBufferMemory
from langchain.chains import ConversationChain, RetrievalQA
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain.agents import AgentType, initialize_agent, Tool
from langchain.tools import DuckDuckGoSearchRun

We import the necessary Python utilities in Colab for concurrency, system calls, and JSON handling. We define and run install_packages() to pull in LangChain, embeddings, vector stores, document loaders, monitoring, and UI dependencies. We then import LangChain LLM, memory, retrieval, and agent tools (including DuckDuckGo search) to build an extensible RAG and agent workflow.
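
Reinstalling every package on each run is slow in Colab. As a minimal sketch that is not part of the original notebook, the helper below (the name `missing_modules` is hypothetical) uses the standard library's `importlib.util.find_spec` to report which top-level modules are not yet importable. Keep in mind that import names can differ from pip names: `faiss-cpu` is imported as `faiss` and `python-docx` as `docx`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot currently be imported."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "json" ships with Python; the second name is deliberately made up
    print(missing_modules(["json", "surely_not_a_real_module_xyz"]))
```

The result could be used to decide whether calling install_packages() is needed at all.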

[Download the full codes with notebook here]

@dataclass
class OllamaConfig:
    """Configuration for Ollama setup"""
    model_name: str = "llama2"
    base_url: str = "http://localhost:11434"
    max_tokens: int = 2048
    temperature: float = 0.7
    gpu_layers: int = -1  
    context_window: int = 4096
    batch_size: int = 512
    threads: int = 4

We define an OllamaConfig dataclass so we keep all Ollama runtime settings in a single clean place. We set the model name and local API endpoint, along with the generation behavior (max_tokens, temperature, and context_window). We control performance with gpu_layers (-1 = load all to GPU when possible), batch_size, and threads for parallelism.
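
The configuration fields only take effect once they reach the server. The wrapper later in this tutorial forwards temperature and max_tokens; as a hedged sketch, the remaining fields could be mapped onto Ollama's request options (`num_ctx`, `num_gpu`, `num_batch`, `num_thread`). The `to_ollama_options` helper is hypothetical, the dataclass is repeated so the block runs on its own, and omitting `num_gpu` when gpu_layers is negative (leaving the choice to Ollama) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class OllamaConfig:
    """Copy of the tutorial's config so this sketch is self-contained."""
    model_name: str = "llama2"
    base_url: str = "http://localhost:11434"
    max_tokens: int = 2048
    temperature: float = 0.7
    gpu_layers: int = -1
    context_window: int = 4096
    batch_size: int = 512
    threads: int = 4


def to_ollama_options(config: OllamaConfig) -> Dict[str, Any]:
    """Translate config fields into the "options" object of an Ollama request."""
    options: Dict[str, Any] = {
        "temperature": config.temperature,
        "num_predict": config.max_tokens,   # maximum tokens to generate
        "num_ctx": config.context_window,   # context window size
        "num_batch": config.batch_size,     # prompt processing batch size
        "num_thread": config.threads,       # CPU threads
    }
    # Assumption: a negative value means "leave GPU offloading to Ollama",
    # so num_gpu is only sent when a layer count is given explicitly.
    if config.gpu_layers >= 0:
        options["num_gpu"] = config.gpu_layers
    return options


if __name__ == "__main__":
    print(to_ollama_options(OllamaConfig(gpu_layers=20)))
```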

class OllamaManager:
    """Advanced Ollama manager for Colab environment"""

    def __init__(self, config: OllamaConfig):
        self.config = config
        self.process = None
        self.is_running = False
        self.models_cache = {}
        self.performance_monitor = PerformanceMonitor()
       
    def install_ollama(self):
        """Install Ollama in Colab environment"""
        try:
            subprocess.run([
                "curl", "-fsSL", "https://ollama.com/install.sh", "-o", "/tmp/install.sh"
            ], check=True)

            subprocess.run(["bash", "/tmp/install.sh"], check=True)
            print("✅ Ollama installed successfully")

        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to install Ollama: {e}")
            raise
   
    def start_server(self):
        """Start Ollama server with GPU support"""
        if self.is_running:
            print("Ollama server is already running")
            return

        try:
            env = os.environ.copy()
            env["OLLAMA_NUM_PARALLEL"] = str(self.config.threads)
            env["OLLAMA_MAX_LOADED_MODELS"] = "3"

            self.process = subprocess.Popen(
                ["ollama", "serve"],
                env=env,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE
            )

            # Give the server a moment to start before the health check
            time.sleep(5)

            if self.health_check():
                self.is_running = True
                print("✅ Ollama server started successfully")
                self.performance_monitor.start()
            else:
                raise Exception("Server failed to start properly")

        except Exception as e:
            print(f"❌ Failed to start Ollama server: {e}")
            raise
   
    def health_check(self) -> bool:
        """Check if Ollama server is healthy"""
        try:
            response = requests.get(f"{self.config.base_url}/api/tags", timeout=10)
            return response.status_code == 200
        except requests.RequestException:
            return False
   
    def pull_model(self, model_name: str) -> bool:
        """Pull a model from Ollama registry"""
        try:
            print(f"🔄 Pulling model: {model_name}")
            result = subprocess.run(
                ["ollama", "pull", model_name],
                capture_output=True,
                text=True,
                timeout=1800  # allow up to 30 minutes for large models
            )

            if result.returncode == 0:
                print(f"✅ Model {model_name} pulled successfully")
                self.models_cache[model_name] = True
                return True
            else:
                print(f"❌ Failed to pull model {model_name}: {result.stderr}")
                return False

        except subprocess.TimeoutExpired:
            print(f"❌ Timeout pulling model {model_name}")
            return False
        except Exception as e:
            print(f"❌ Error pulling model {model_name}: {e}")
            return False
   
    def list_models(self) -> List[str]:
        """List available local models"""
        try:
            result = subprocess.run(
                ["ollama", "list"],
                capture_output=True,
                text=True
            )

            models = []
            # Skip the header row, then take the first column of each line
            for line in result.stdout.split('\n')[1:]:
                if line.strip():
                    model_name = line.split()[0]
                    models.append(model_name)

            return models

        except Exception as e:
            print(f"❌ Error listing models: {e}")
            return []
   
    def stop_server(self):
        """Stop Ollama server"""
        if self.process:
            self.process.terminate()
            self.process.wait()
            self.is_running = False
            self.performance_monitor.stop()
            print("✅ Ollama server stopped")

We create the OllamaManager class to install, start, monitor, and manage the Ollama server in the Colab environment. We set environment variables for GPU parallelism, run the server in the background, and verify it's up with a health check. We pull models on demand, cache them, list available ones locally, and gracefully shut down the server when the task is complete, all while tracking performance.
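
The start_server method above waits a fixed five seconds before its single health check. A common alternative is to poll until the server answers or a deadline passes. The sketch below shows one way to do that; `wait_until_ready` and its parameters are hypothetical names, and it accepts any zero-argument callable, such as the manager's health_check method.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     timeout: float = 30.0,
                     interval: float = 0.5) -> bool:
    """Poll check() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for OllamaManager.health_check: succeeds on the third call
    calls = {"n": 0}

    def fake_health_check() -> bool:
        calls["n"] += 1
        return calls["n"] >= 3

    print(wait_until_ready(fake_health_check, timeout=5.0, interval=0.01))
```

Polling returns as soon as the server is up, and it fails with a clear False when the server never becomes reachable.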

class PerformanceMonitor:
    """Monitor system performance and resource usage"""
   
    def __init__(self):
        self.monitoring = False
        self.stats = {
            "cpu_usage": [],
            "memory_usage": [],
            "gpu_usage": [],
            "inference_times": []
        }
        self.monitor_thread = None
   
    def start(self):
        """Start performance monitoring"""
        self.monitoring = True
        self.monitor_thread = threading.Thread(target=self._monitor_loop)
        self.monitor_thread.daemon = True
        self.monitor_thread.start()

    def stop(self):
        """Stop performance monitoring"""
        self.monitoring = False
        if self.monitor_thread:
            self.monitor_thread.join()
   
    def _monitor_loop(self):
        """Main monitoring loop"""
        while self.monitoring:
            try:
                cpu_percent = psutil.cpu_percent(interval=1)
                memory = psutil.virtual_memory()

                self.stats["cpu_usage"].append(cpu_percent)
                self.stats["memory_usage"].append(memory.percent)

                # Keep only the 100 most recent samples per metric
                for key in ["cpu_usage", "memory_usage"]:
                    if len(self.stats[key]) > 100:
                        self.stats[key] = self.stats[key][-100:]

                time.sleep(5)

            except Exception as e:
                print(f"Monitoring error: {e}")
   
    def get_stats(self) -> Dict[str, Any]:
        """Get current performance statistics"""
        return {
            "avg_cpu": sum(self.stats["cpu_usage"][-10:]) / max(len(self.stats["cpu_usage"][-10:]), 1),
            "avg_memory": sum(self.stats["memory_usage"][-10:]) / max(len(self.stats["memory_usage"][-10:]), 1),
            "total_inferences": len(self.stats["inference_times"]),
            "avg_inference_time": sum(self.stats["inference_times"]) / max(len(self.stats["inference_times"]), 1)
        }

We define a PerformanceMonitor class to track CPU, memory, and inference times in real time while the Ollama server runs. We launch a background thread to collect stats every few seconds, store recent metrics, and provide average usage summaries. This helps us monitor system load and optimize performance during model inference.
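
The monitor above trims each list to its last 100 entries by slicing. As an illustrative alternative, and not the tutorial's own code, the same bounded window can be expressed with `collections.deque(maxlen=...)`, which discards the oldest sample automatically. The class name `RollingAverage` is hypothetical.

```python
from collections import deque
from typing import Deque


class RollingAverage:
    """Keep the most recent `size` samples and report their mean."""

    def __init__(self, size: int = 100):
        # deque drops the oldest sample automatically once maxlen is reached
        self.samples: Deque[float] = deque(maxlen=size)

    def add(self, value: float) -> None:
        self.samples.append(value)

    def mean(self) -> float:
        # Return 0.0 instead of dividing by zero when nothing was recorded
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


if __name__ == "__main__":
    cpu = RollingAverage(size=3)
    for reading in [10.0, 20.0, 30.0, 40.0]:
        cpu.add(reading)
    print(list(cpu.samples), cpu.mean())
```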

class OllamaLLM(LLM):
    """Custom LangChain LLM for Ollama"""

    model_name: str = "llama2"
    base_url: str = "http://localhost:11434"
    temperature: float = 0.7
    max_tokens: int = 2048
    performance_monitor: Optional[PerformanceMonitor] = None

    @property
    def _llm_type(self) -> str:
        return "ollama"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Make API call to Ollama"""
        start_time = time.time()

        try:
            payload = {
                "model": self.model_name,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": self.temperature,
                    "num_predict": self.max_tokens,
                    "stop": stop or []
                }
            }

            response = requests.post(
                f"{self.base_url}/api/generate",
                json=payload,
                timeout=120
            )

            response.raise_for_status()
            result = response.json()

            inference_time = time.time() - start_time

            if self.performance_monitor:
                self.performance_monitor.stats["inference_times"].append(inference_time)

            return result.get("response", "")

        except Exception as e:
            print(f"❌ Ollama API error: {e}")
            return f"Error: {str(e)}"

We wrap the Ollama API inside a custom OllamaLLM class compatible with LangChain's LLM interface. We define how prompts are sent to the Ollama server and record each inference time for performance monitoring. This lets us plug Ollama directly into LangChain chains, agents, and memory components while tracking performance.
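
To make the request shape easier to inspect without a running server, the payload construction and response parsing can be pulled into small pure functions. This is a sketch mirroring the fields used in _call above; the function names `build_generate_payload` and `extract_text` are hypothetical, and the sample response dictionary is hand-written for illustration.

```python
from typing import Any, Dict, List, Optional


def build_generate_payload(model: str,
                           prompt: str,
                           temperature: float = 0.7,
                           max_tokens: int = 2048,
                           stop: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build the JSON body for a non-streaming POST to /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
            "stop": stop or [],
        },
    }


def extract_text(response_json: Dict[str, Any]) -> str:
    """Pull the generated text out of a decoded /api/generate response."""
    return response_json.get("response", "")


if __name__ == "__main__":
    print(build_generate_payload("llama2", "Hello", stop=["\n\n"]))
    # Hand-written stand-in for a decoded server reply
    print(extract_text({"response": "Hi there", "done": True}))
```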

class RAGSystem:
    """Retrieval-Augmented Generation system"""
   
    def __init__(self, llm: OllamaLLM, embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.llm = llm
        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
        self.vector_store = None
        self.qa_chain = None
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
   
    def add_documents(self, file_paths: List[str]):
        """Add documents to the vector store"""
        documents = []

        for file_path in file_paths:
            try:
                if file_path.endswith('.pdf'):
                    loader = PyPDFLoader(file_path)
                else:
                    loader = TextLoader(file_path)

                docs = loader.load()
                documents.extend(docs)

            except Exception as e:
                print(f"❌ Error loading {file_path}: {e}")

        if documents:
            splits = self.text_splitter.split_documents(documents)

            if self.vector_store is None:
                self.vector_store = FAISS.from_documents(splits, self.embeddings)
            else:
                self.vector_store.add_documents(splits)

            # Retrieve the 3 most similar chunks for each question
            self.qa_chain = RetrievalQA.from_chain_type(
                llm=self.llm,
                chain_type="stuff",
                retriever=self.vector_store.as_retriever(search_kwargs={"k": 3}),
                return_source_documents=True
            )

            print(f"✅ Added {len(splits)} document chunks to vector store")

    def query(self, question: str) -> Dict[str, Any]:
        """Query the RAG system"""
        if not self.qa_chain:
            return {"answer": "No documents loaded. Please add documents first."}

        try:
            result = self.qa_chain({"query": question})
            return {
                "answer": result["result"],
                "sources": [doc.metadata for doc in result.get("source_documents", [])]
            }
        except Exception as e:
            return {"answer": f"Error: {str(e)}"}
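
The RAGSystem above splits documents into chunks of 1000 characters with an overlap of 200. To show what those two numbers mean, here is a deliberately simplified fixed-size sliding window. It is not the RecursiveCharacterTextSplitter algorithm, which also tries to break on separators such as paragraphs and sentences; the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000,
                          chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size chunks where neighbors share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With the overlap, a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.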

We use ConversationManager to manage multi-session memory, enabling both buffer-based and summary-based chat histories for each session. Then, in OllamaLangChainSystem, we bring all components together (server, LLM, RAG, memory, tools, and agents) into one unified interface. We configure the system to install Ollama, pull models, build agents with tools like web search and RAG, and expose chat, document upload, and model-switching capabilities for seamless interaction.

class ConversationManager:
    """Manage conversation history and memory"""
   
    def __init__(self, llm: OllamaLLM, memory_type: str = "buffer"):
        self.llm = llm
        self.conversations = {}
        self.memory_type = memory_type
       
    def get_conversation(self, session_id: str) -> ConversationChain:
        """Get or create conversation for session"""
        if session_id not in self.conversations:
            if self.memory_type == "buffer":
                memory = ConversationBufferWindowMemory(k=10)
            elif self.memory_type == "summary":
                memory = ConversationSummaryBufferMemory(
                    llm=self.llm,
                    max_token_limit=1000
                )
            else:
                memory = ConversationBufferWindowMemory(k=10)
           
            self.conversations[session_id] = ConversationChain(
                llm=self.llm,
                memory=memory,
                verbose=True
            )
       
        return self.conversations[session_id]
   
    def chat(self, session_id: str, message: str) -> str:
        """Chat with specific session"""
        conversation = self.get_conversation(session_id)
        return conversation.predict(input=message)

    def clear_session(self, session_id: str):
        """Clear conversation history for session"""
        if session_id in self.conversations:
            del self.conversations[session_id]


class OllamaLangChainSystem:
    """Main system integrating all components"""

    def __init__(self, config: OllamaConfig):
        self.config = config
        self.supervisor = OllamaManager(config)
        self.llm = None
        self.rag_system = None
        self.conversation_manager = None
        self.tools = []
        self.agent = None
       
    def setup(self):
        """Complete system setup"""
        print("🚀 Setting up Ollama + LangChain system...")

        self.supervisor.install_ollama()
        self.supervisor.start_server()

        if not self.supervisor.pull_model(self.config.model_name):
            print("❌ Failed to pull default model")
            return False
       
        self.llm = OllamaLLM(
            model_name=self.config.model_name,
            base_url=self.config.base_url,
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens,
            performance_monitor=self.supervisor.performance_monitor
        )
       
        self.rag_system = RAGSystem(self.llm)
       
        self.conversation_manager = ConversationManager(self.llm)
       
        self._setup_tools()
       
        print("✅ System setup complete!")
        return True
   
    def _setup_tools(self):
        """Setup tools for the agent"""
        search = DuckDuckGoSearchRun()

        self.tools = [
            Tool(
                name="Search",
                func=search.run,
                description="Search the internet for current information"
            ),
            Tool(
                name="RAG_Query",
                func=lambda q: self.rag_system.query(q)["answer"],
                description="Query loaded documents using RAG"
            )
        ]

        self.agent = initialize_agent(
            tools=self.tools,
            llm=self.llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True
        )
   
    def chat(self, message: str, session_id: str = "default") -> str:
        """Simple chat interface"""
        return self.conversation_manager.chat(session_id, message)

    def rag_chat(self, question: str) -> Dict[str, Any]:
        """RAG-based chat"""
        return self.rag_system.query(question)

    def agent_chat(self, message: str) -> str:
        """Agent-based chat with tools"""
        return self.agent.run(message)

    def switch_model(self, model_name: str) -> bool:
        """Switch to a different model"""
        if self.supervisor.pull_model(model_name):
            self.llm.model_name = model_name
            print(f"✅ Switched to model: {model_name}")
            return True
        return False

    def load_documents(self, file_paths: List[str]):
        """Load documents into RAG system"""
        self.rag_system.add_documents(file_paths)

    def get_performance_stats(self) -> Dict[str, Any]:
        """Get system performance statistics"""
        return self.supervisor.performance_monitor.get_stats()

    def cleanup(self):
        """Clean up resources"""
        self.supervisor.stop_server()
        print("✅ System cleanup complete")

We use the ConversationManager to maintain separate chat sessions, each with its own memory type, either buffer-based or summary-based, allowing us to preserve or summarize context as needed. In the OllamaLangChainSystem, we integrate everything: we install and launch Ollama, pull the required model, wrap it in a LangChain-compatible LLM, connect a RAG system, initialize chat memory, and register external tools like web search.
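
The idea behind per-session buffer memory can be shown without LangChain. The sketch below keeps the last k exchanges for each session id in plain Python; it only illustrates the bookkeeping and is not how ConversationBufferWindowMemory is implemented. The class name `SessionWindowMemory` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class SessionWindowMemory:
    """Keep the last k (user, assistant) exchanges for each session id."""

    def __init__(self, k: int = 10):
        self.k = k
        self._sessions: Dict[str, Deque[Tuple[str, str]]] = {}

    def add_exchange(self, session_id: str, user: str, assistant: str) -> None:
        # Create the per-session window lazily on first use
        window = self._sessions.setdefault(session_id, deque(maxlen=self.k))
        window.append((user, assistant))

    def history(self, session_id: str) -> List[Tuple[str, str]]:
        return list(self._sessions.get(session_id, []))

    def clear(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)


if __name__ == "__main__":
    memory = SessionWindowMemory(k=2)
    memory.add_exchange("alice", "q1", "a1")
    memory.add_exchange("alice", "q2", "a2")
    memory.add_exchange("alice", "q3", "a3")
    memory.add_exchange("bob", "hello", "hi")
    print(memory.history("alice"))
    print(memory.history("bob"))
```

Each session only ever sees its own window, so one user's context never leaks into another user's prompt.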

def principal():
    """Main function demonstrating the system"""
   
    config = OllamaConfig(
        model_name="llama2",
        temperature=0.7,
        max_tokens=2048
    )
   
    system = OllamaLangChainSystem(config)
   
    try:
        if not system.setup():
            return
       
        print("\n🗣️ Testing basic chat:")
        response = system.chat("Hello! How are you?")
        print(f"Response: {response}")

        print("\n🔄 Testing model switching:")
        models = system.supervisor.list_models()
        print(f"Available models: {models}")

        print("\n🤖 Testing agent:")
        agent_response = system.agent_chat("What's the current weather like?")
        print(f"Agent Response: {agent_response}")

        print("\n📊 Performance Statistics:")
        stats = system.get_performance_stats()
        print(json.dumps(stats, indent=2))

    except KeyboardInterrupt:
        print("\n⏹️ Interrupted by user")
    except Exception as e:
        print(f"❌ Error: {e}")
    finally:
        system.cleanup()


def create_gradio_interface(system: OllamaLangChainSystem):
    """Create a Gradio interface for simple interaction"""
    try:
        import gradio as gr
       
        def chat_interface(message, history, mode):
            if mode == "Basic Chat":
                response = system.chat(message)
            elif mode == "RAG Chat":
                result = system.rag_chat(message)
                response = result["answer"]
            elif mode == "Agent Chat":
                response = system.agent_chat(message)
            else:
                response = "Unknown mode"

            history.append((message, response))
            return "", history

        def upload_docs(files):
            if files:
                file_paths = [f.name for f in files]
                system.load_documents(file_paths)
                return f"Loaded {len(file_paths)} documents into RAG system"
            return "No files uploaded"
       
        def get_stats():
            stats = system.get_performance_stats()
            return json.dumps(stats, indent=2)
       
        with gr.Blocks(title="Ollama + LangChain System") as demo:
            gr.Markdown("# 🦙 Ollama + LangChain Advanced System")

            with gr.Tab("Chat"):
                chatbot = gr.Chatbot()
                mode = gr.Dropdown(
                    ["Basic Chat", "RAG Chat", "Agent Chat"],
                    value="Basic Chat",
                    label="Chat Mode"
                )
                msg = gr.Textbox(label="Message")
                clear = gr.Button("Clear")

                msg.submit(chat_interface, [msg, chatbot, mode], [msg, chatbot])
                clear.click(lambda: ([], ""), outputs=[chatbot, msg])

            with gr.Tab("Document Upload"):
                file_upload = gr.File(file_count="multiple", label="Upload Documents")
                upload_btn = gr.Button("Upload to RAG System")
                upload_status = gr.Textbox(label="Status")

                upload_btn.click(upload_docs, file_upload, upload_status)

            with gr.Tab("Performance"):
                stats_btn = gr.Button("Get Performance Stats")
                stats_output = gr.Textbox(label="Performance Statistics")

                stats_btn.click(get_stats, outputs=stats_output)
       
        return demo
       
    except ImportError:
        print("Gradio not installed. Skipping interface creation.")
        return None


if __name__ == "__main__":
    print("🚀 Ollama + LangChain System for Google Colab")
    print("=" * 50)
   
    principal()
   
    # Or create a system instance for interactive use
    # config = OllamaConfig(model_name="llama2")
    # system = OllamaLangChainSystem(config)
    # system.setup()
   
    # # Create Gradio interface
    # demo = create_gradio_interface(system)
    # if demo:
    #     demo.launch(share=True)  # share=True for public link

We wrap everything up in the principal() function to run a full demo: setting up the system, testing chat, agent tools, model listing, and performance statistics. Then, in create_gradio_interface(), we build a user-friendly Gradio app with tabs for chatting, uploading documents to the RAG system, and monitoring performance. Finally, we call principal() in the __main__ block for direct Colab execution, or optionally launch the Gradio UI for interactive exploration and public sharing.
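
The chat tab routes each message by the selected mode using an if/elif chain. As an optional sketch of the same routing, a dictionary of handlers keeps the mode names and their functions in one place. The `make_router` helper is hypothetical, and the lambda handlers below are stubs standing in for the system's chat, rag_chat, and agent_chat methods.

```python
from typing import Callable, Dict


def make_router(handlers: Dict[str, Callable[[str], str]]) -> Callable[[str, str], str]:
    """Return a function that sends a message to the handler for a mode."""
    def route(mode: str, message: str) -> str:
        handler = handlers.get(mode)
        if handler is None:
            return "Unknown mode"
        return handler(message)
    return route


if __name__ == "__main__":
    # Stub handlers stand in for the real chat, RAG, and agent methods
    route = make_router({
        "Basic Chat": lambda m: f"chat: {m}",
        "RAG Chat": lambda m: f"rag: {m}",
        "Agent Chat": lambda m: f"agent: {m}",
    })
    print(route("RAG Chat", "hello"))
    print(route("Other", "hello"))
```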

In conclusion, we have a flexible playground: we switch Ollama models, converse with buffered or summary memory, question our own documents, reach out to web search when context is missing, and monitor basic resource stats to stay within Colab limits. The code is modular, allowing us to extend the tool list, tune inference options (temperature, maximum tokens, concurrency) in OllamaConfig, or adapt the RAG pipeline to larger corpora or different embedding models. We launch the Gradio app with share=True to collaborate or embed these components in our projects. We now own an extensible template for quick local LLM experimentation.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.
