What It Does
A pocket-sized study partner that listens to lectures and talks back — like having a tutor who attended every class and never forgets.
The Elevator Pitch
Lecture Buddy is a Raspberry Pi app that records your lectures, transcribes them with AI, and then lets you have a voice conversation with an AI that knows exactly what was taught.
Record
Captures audio and video simultaneously from the Pi’s microphone and camera. Records entire lectures hands-free.
Transcribe
Converts recorded audio to text using Whisper — either locally on the Pi or via the cloud.
Study
Two study modes: text chat (type a question, read an answer) and voice conversation (speak and hear the AI respond in real-time).
See
Extracts key frames from the lecture video — whiteboard photos, slide changes — so the AI knows what was shown, not just what was said.
A Day with Lecture Buddy
Imagine you are sitting in a calculus lecture with Lecture Buddy on your desk. Here is what happens.
The app starts capturing audio at 16,000 samples/second and video at 2 frames/second — simultaneously, in separate background threads.
The Pi quietly records. A live waveform on screen shows audio is being captured. A timer counts up.
Audio is saved as a WAV file, video as MP4. Whisper converts the audio to text. Smart algorithms extract key video frames.
Choose “Voice Study” — the flagship mode. Hold the red button and ask: “Can you explain the chain rule?”
It knows the transcript, the whiteboard photos, and the formulas. It answers in a natural voice, referencing what the professor actually said.
The Tech Stack
Under the hood, Lecture Buddy connects seven Python modules and three external AI services.
PySide6 (Qt6)
The entire touchscreen UI — buttons, screens, animations, touch gestures. Qt is a professional-grade UI framework used in KDE, Tesla dashboards, and VLC.
OpenAI Whisper
Speech-to-text AI. Runs locally on the Pi (free but slow) or via cloud API (fast but costs money).
OpenAI Realtime API
Voice-to-voice conversations over WebSocket. The AI speaks back as it thinks — no waiting.
GPT-4o-mini
Text chat + Vision. Answers typed questions and describes whiteboard photos in words so the voice AI can “see.”
OpenCV + scikit-image
Computer vision libraries for video capture, frame comparison, and keyframe extraction.
SQLite
A local database that stores lecture metadata, study sessions, and cached AI descriptions.
Check Your Understanding
What makes the voice study mode feel like a real conversation instead of a slow back-and-forth?
Why does the app have BOTH text chat and voice conversation modes?
How does the voice AI know what was written on the whiteboard?
Meet the Cast
Like a film crew where each person has a specialized job — seven files, seven roles, one production.
The Cast of Characters
Lecture Buddy is built from seven Python modules. Each file has one job, just like each person on a film crew.
main.py (~3,370 lines). The central hub: all six UI screens, the database, the audio recorder, and the transcriber. Everything flows through here.
454 lines. Manages WebSocket voice conversations with OpenAI’s Realtime API. Streams audio in both directions simultaneously.
config.py (57 lines). Every setting and constant in one place: API keys, model names, audio rates, screen sizes. The single source of truth.
frame_selector.py (499 lines). Picks the best keyframes from lecture videos using 4 different strategies. Finds the moments where the whiteboard actually changed.
vision_context.py (461 lines). Sends keyframe images to GPT-4o-mini Vision and gets back text descriptions. Caches results so the same image is never described twice.
218 lines. Handles video recording and the live camera preview. Runs in a background thread so the UI stays responsive.
import_lecture.py (161 lines). A CLI tool for importing existing video files from outside the app. Useful when you already have lecture recordings.
The Tech Stack Pyramid
Four layers, from physical hardware at the bottom to AI brains at the top. Each layer depends on the one below it.
AI Layer
Whisper for transcription, GPT-4o-mini for text chat and vision descriptions, and the Realtime API for live voice conversations.
Framework Layer
PySide6/Qt6 builds the touchscreen UI, OpenCV handles video capture and processing, and sounddevice records audio from the USB microphone.
Storage Layer
SQLite stores lecture metadata and study sessions. Audio files (WAV), video files (MP4), and transcripts live on the local filesystem.
Hardware Layer
Raspberry Pi as the brain, a 3.5″ touchscreen for input/output, a USB microphone for audio capture, and a camera module for video.
Startup Sequence: “Voice Study”
When a student taps “Voice Study” on a lecture, here’s the behind-the-scenes conversation between the files.
Code Translation: Database Schema
This is the schema that stores every lecture and study session. Let’s read it line by line.
-- main.py (DatabaseManager)
c.execute('''CREATE TABLE IF NOT EXISTS lectures
             (id INTEGER PRIMARY KEY,
              title TEXT NOT NULL,
              filename TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              transcription_path TEXT,
              video_path TEXT)''')
CREATE TABLE IF NOT EXISTS lectures → “Create a table called lectures if it doesn’t already exist.”
id INTEGER PRIMARY KEY → “Each lecture gets a unique number. This is its ID — like a student ID card.”
title TEXT NOT NULL → “The lecture’s name (e.g., ‘Calculus Lecture 3’). Can’t be blank.”
filename TEXT NOT NULL → “The audio file name on disk. Also required.”
date_created TEXT NOT NULL → “When the recording was made. Stored as text (e.g., ‘2025-03-15 09:30:00’).”
duration_seconds INTEGER → “How long the lecture was, in seconds. Optional.”
transcription_path TEXT → “Where to find the text transcript file. Optional until transcription finishes.”
video_path TEXT → “Where to find the video file. Optional — some lectures might be audio-only.”
c.execute('''CREATE TABLE IF NOT EXISTS study_sessions
             (id INTEGER PRIMARY KEY,
              lecture_id INTEGER NOT NULL,
              session_number INTEGER NOT NULL,
              title TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              conversation_log TEXT,
              FOREIGN KEY (lecture_id) REFERENCES lectures(id))''')
CREATE TABLE IF NOT EXISTS study_sessions → “Create a table called study_sessions to track every study conversation.”
id INTEGER PRIMARY KEY → “Each study session gets its own unique number.”
lecture_id INTEGER NOT NULL → “Which lecture was this session about? Links back to the lectures table.”
session_number INTEGER NOT NULL → “Is this the 1st, 2nd, or 5th time studying this lecture?”
title TEXT NOT NULL → “A name for this session (e.g., ‘Study Session 2’).”
date_created TEXT NOT NULL → “When the study session started.”
duration_seconds INTEGER → “How long you studied. Optional.”
conversation_log TEXT → “The full transcript of your conversation with the AI. Optional.”
FOREIGN KEY (lecture_id) REFERENCES lectures(id) → “A rule: lecture_id must match an actual lecture. You can’t have an orphan study session.”
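The schema above can be exercised end to end with Python's built-in sqlite3 module. This is a sketch, not the app's code; in particular, the `PRAGMA foreign_keys = ON` line is our addition, since SQLite only enforces FOREIGN KEY rules when that pragma is enabled on the connection:

```python
import sqlite3

# In-memory database so the demo leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS lectures
             (id INTEGER PRIMARY KEY,
              title TEXT NOT NULL,
              filename TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              transcription_path TEXT,
              video_path TEXT)''')
c.execute('''CREATE TABLE IF NOT EXISTS study_sessions
             (id INTEGER PRIMARY KEY,
              lecture_id INTEGER NOT NULL,
              session_number INTEGER NOT NULL,
              title TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              conversation_log TEXT,
              FOREIGN KEY (lecture_id) REFERENCES lectures(id))''')

# A valid lecture row: only the NOT NULL columns are required.
c.execute("INSERT INTO lectures (title, filename, date_created) "
          "VALUES (?, ?, ?)",
          ("Calculus Lecture 3", "lec3.wav", "2025-03-15 09:30:00"))
lecture_id = c.lastrowid

# An orphan study session (lecture_id 999 doesn't exist) is rejected.
try:
    c.execute("INSERT INTO study_sessions "
              "(lecture_id, session_number, title, date_created) "
              "VALUES (999, 1, 'Session 1', '2025-03-15 10:00:00')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
```

The second insert fails because lecture 999 does not exist, which is exactly the "no orphan study sessions" rule described above.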
Check Your Understanding
Which file would you modify to change the AI’s voice from ‘cedar’ to a different voice?
Why is main.py so much larger than the other files?
What does import_lecture.py do that main.py doesn’t?
How They Talk
Like a relay race where each runner passes the baton to the next — signals fly between components, and data flows from microphone to AI tutor.
Signals and Slots
Signals and slots are Qt’s way of letting components talk without knowing about each other. This is the framework pattern used everywhere in Lecture Buddy.
“Hey, something happened!” For example: recording_finished, transcription_complete. The sender doesn’t care who’s listening.
A regular function that runs when it hears a specific signal. For example: on_recording_finished saves the file and starts transcription.
signal.connect(slot) — now whenever the signal fires, the slot runs automatically. One line of code creates the relationship.
This is how Qt components talk without knowing about each other. The recorder doesn’t import the transcriber. It just announces “I’m done” and whoever is listening takes over.
Code Translation: Signal/Slot in Action
Here’s the real code from the audio recorder. Two signals declared, two connections made.
# main.py (AudioRecorder)
class AudioRecorder(QThread):
    recording_finished = Signal(str, int)  # filename, duration
    recording_error = Signal(str)          # error message
class AudioRecorder(QThread): → “Create a class called AudioRecorder that runs in its own thread.”
recording_finished = Signal(str, int) → “Declare an announcement called recording_finished that will carry two pieces of data: a filename (text) and a duration (number).”
recording_error = Signal(str) → “Declare another announcement called recording_error that carries an error message (text).”
# main.py (RecorderScreen)
self.recorder.recording_finished.connect(self.on_recording_finished)
self.recorder.recording_error.connect(self.on_recording_error)
self.recorder.recording_finished.connect(self.on_recording_finished) → “When the recorder announces recording_finished, automatically call my on_recording_finished method.”
self.recorder.recording_error.connect(self.on_recording_error) → “When the recorder announces recording_error, automatically call my on_recording_error method.”
The Full Journey: Record to Study
Follow the baton from the moment a student taps Record to the moment they’re chatting with the AI about their lecture.
The Config Hub
config.py is the single source of truth. Every setting lives in one 57-line file — no hunting through 3,000 lines of main.py.
OPENAI_API_KEY
Loaded from a .env file — never hardcoded! If this key leaked, anyone could run up your OpenAI bill.
OPENAI_MODEL = "gpt-4o-mini"
The text chat brain. Used for answering typed questions and generating vision descriptions of keyframes.
OPENAI_VOICE_MODEL = "gpt-4o-realtime-preview"
The voice brain. Used for real-time voice conversations over WebSocket.
AUDIO_RATE = 16000
16 kHz sample rate, optimized for Whisper. Higher rates waste storage without improving transcription quality for speech.
VIDEO_FPS = 2.0
Only 2 frames per second. Lectures are mostly static — recording at 24 FPS would waste gigabytes of storage for no benefit.
MIN_TOUCH_TARGET = 40
Minimum button size in pixels. On a tiny 3.5″ touchscreen, buttons smaller than 40px are nearly impossible to tap accurately.
AUTO_FULLSCREEN = True
Fills the tiny 480×320 screen completely. Every pixel matters when your display is the size of a business card.
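Taken together, the settings above suggest a config.py shaped roughly like this sketch. The names and values come from this section; reading the key via `os.environ` is a stand-in for the app's .env loading step, which is not shown:

```python
import os

# Sketch of config.py based on the settings described above.
# The app loads the API key from a .env file; os.environ here is a
# stand-in for that step, never a hardcoded secret.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_MODEL = "gpt-4o-mini"                    # text chat + vision
OPENAI_VOICE_MODEL = "gpt-4o-realtime-preview"  # realtime voice

AUDIO_RATE = 16000        # Hz, Whisper's preferred sample rate
VIDEO_FPS = 2.0           # lectures are mostly static
MIN_TOUCH_TARGET = 40     # px, smallest reliably tappable button
AUTO_FULLSCREEN = True    # fill the whole 480x320 screen
```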
Check Your Understanding
A signal fires but nothing happens. What’s the most likely cause?
Why does config.py load the API key from a .env file instead of writing it directly in the code?
What would break if you changed AUDIO_RATE from 16000 to 44100?
The Recording Pipeline
Two streams running in parallel — like a recording studio with separate audio and video booths capturing simultaneously, all triggered by a single button.
Two Streams, One Button
When the student taps Record, four things happen in rapid sequence. Two parallel pipelines spin up and run independently until the student taps Stop.
Captures audio via sounddevice at 16 kHz. Each chunk of audio data goes into a queue for later file writing.
Captures video via OpenCV at 2 frames per second. Each frame is written directly to an MP4 file on disk.
Each recorder lives in its own background thread. The main UI thread stays responsive — the student can still interact with the touchscreen while recording.
Both threads finish cleanly. Audio is saved as a WAV file, video as MP4. Then the Whisper transcription pipeline kicks off automatically.
Inside the Audio Recorder
This callback function fires every time the microphone captures a chunk of sound.
def callback(self, indata, frames, time, status):
    if status:
        logger.warning("Audio callback: %s", status)
    if self.recording:
        self.audio_queue.put(indata.copy())
Every time the microphone captures a tiny chunk of sound, this function runs automatically.
If there was a hardware problem (like a glitch), log a warning so we can debug later.
If we are currently in recording mode...
...make a copy of the sound data and add it to our collection queue. The copy is critical — the original buffer gets recycled by the audio system and would be overwritten.
Inside the Video Recorder
This is the main loop that runs as long as recording is active. It grabs frames from the camera, rate-limits them to 2 FPS, and queues copies for live analysis.
while self.recording:
    ret, frame = self._cap.read()
    if not ret:
        QThread.msleep(10)
        continue
    elapsed = (datetime.datetime.now() - self.start_time).total_seconds()
    if elapsed - last_frame_time >= frame_interval:
        self._writer.write(frame)
        self.frame_count += 1
        last_frame_time = elapsed
        try:
            self._frame_queue.put_nowait((frame.copy(), elapsed))
        except queue.Full:
            pass  # Drop frame if queue is full
Keep looping as long as we are recording...
Ask the camera for the next frame.
If the camera did not give us a frame (maybe it was busy)...
...wait 10 milliseconds and try again.
Skip to the next loop iteration.
Calculate how many seconds have passed since recording started.
Only save a frame if enough time has passed (this is the rate limiter — limits us to 2 FPS).
Write the frame to the MP4 video file on disk.
Increment our frame counter.
Remember when we last saved a frame.
Try to send a copy of the frame to the live-analysis queue...
...but if the processing queue is already full...
...just drop this frame silently. Better to lose one frame than crash from running out of memory.
Import Lecture: The Secondary Feature
Not every lecture is recorded live on the Pi. The import_lecture.py script lets you bring in recordings made elsewhere — a CLI tool for batch importing.
python import_lecture.py video.mp4
Imports an existing video file that was recorded outside the app — from a phone, webcam, or lecture capture system.
ffmpeg → extract audio
Uses ffmpeg to rip the audio track from the video file into a separate WAV for transcription.
Whisper → transcription
Runs the same Whisper transcription pipeline as live recordings — audio in, text out.
SQLite → database
Saves the imported lecture to the same SQLite database as live recordings. It shows up in the library alongside everything else.
frame_selector.py → keyframes
Extracts keyframes using the same smart algorithms that live recordings use — content change, stability detection, audio cues.
import_lecture.py is a command-line tool, not a touchscreen interface. It lets you batch-import old recordings from a laptop or desktop without needing to interact with the 3.5-inch screen. Same pipeline, different entry point.
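The ffmpeg step can be sketched as follows. The function names are ours and the script's exact flags are not shown, but these are the standard ffmpeg options for producing a Whisper-ready WAV:

```python
import pathlib
import subprocess

def ffmpeg_audio_cmd(video_path: str, wav_path: str) -> list:
    """Build the ffmpeg command that rips the audio track to a
    mono 16 kHz WAV, the format the transcription step expects."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",           # drop the video stream
            "-ac", "1",      # mono
            "-ar", "16000",  # match AUDIO_RATE
            wav_path]

def extract_audio(video_path: str) -> str:
    """Run ffmpeg and return the path of the extracted WAV."""
    wav_path = str(pathlib.Path(video_path).with_suffix(".wav"))
    subprocess.run(ffmpeg_audio_cmd(video_path, wav_path), check=True)
    return wav_path
```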
Why These Specific Numbers?
Every setting in config.py was chosen deliberately for the Pi’s limited hardware. Here is the reasoning behind each one.
AUDIO_RATE = 16000
16 kHz sample rate — Whisper works best at exactly this frequency. Higher rates waste storage without improving transcription accuracy.
VIDEO_FPS = 2.0
2 frames per second — lectures are mostly static. That is 7,200 frames per hour, plenty to catch every slide change and whiteboard update.
VIDEO_WIDTH = 640
640 pixels wide — low resolution means smaller files and faster processing on the Pi’s limited CPU.
AUDIO_FORMAT = "int16"
16-bit integer audio — the sweet spot for speech. int16 captures voice clearly without the file bloat of 32-bit float.
Running audio and video capture in separate threads is like having two employees work simultaneously instead of taking turns. This is called “concurrency” — without it, capturing audio would freeze the video, and capturing video would freeze the UI.
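The idea can be demonstrated with plain threads. This is a toy sketch with fake data, not the app's QThread-based recorders: two workers run at the same time, and neither blocks the other.

```python
import queue
import threading
import time

audio_q: queue.Queue = queue.Queue()
frames_written = []

def audio_worker(stop: threading.Event) -> None:
    # Stand-in for the sounddevice callback: one fake chunk every 10 ms.
    while not stop.is_set():
        audio_q.put(b"\x00" * 320)
        time.sleep(0.01)

def video_worker(stop: threading.Event) -> None:
    # Stand-in for the OpenCV loop: one fake frame every 0.5 s (2 FPS).
    while not stop.is_set():
        frames_written.append(object())
        time.sleep(0.5)

stop = threading.Event()
workers = [threading.Thread(target=audio_worker, args=(stop,)),
           threading.Thread(target=video_worker, args=(stop,))]
for w in workers:
    w.start()
time.sleep(0.3)   # "record" for 0.3 seconds
stop.set()
for w in workers:
    w.join()
# The audio queue filled while video frames were written, concurrently.
```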
Check Your Understanding
What happens when the video frame queue is full and a new frame arrives?
Why does the audio callback use indata.copy() instead of just indata?
What would happen if VIDEO_FPS was set to 30 instead of 2?
The AI Brain
Three AI systems working together — like a multilingual interpreter converting between spoken words, written text, and AI understanding in real-time.
Three AI Systems
Lecture Buddy uses three separate AI systems, each optimized for a different job. They never compete — each handles a distinct piece of the puzzle.
Whisper (Local or Cloud)
Converts audio to text. Can run ON the Pi (free but slow) or via OpenAI’s API (fast, costs money). Toggle between modes with the USE_WHISPER_API flag in config.py.
GPT-4o-mini (Cloud)
Text chat + Vision. Student types a question, gets formatted text back with LaTeX math rendering. Also describes keyframe images in words for the voice AI.
Realtime API (Cloud)
Voice-to-voice over WebSocket. Student speaks, AI speaks back in real-time. The flagship study mode — feels like talking to a real tutor.
Dual-Mode Transcription
The WhisperTranscriber can work two ways — cloud or local — controlled by a single boolean flag.
def run(self):
    try:
        self.transcription_progress.emit(10)
        if USE_WHISPER_API:
            client = openai.OpenAI()
            with open(self.audio_file, "rb") as f:
                result = client.audio.transcriptions.create(
                    model="whisper-1", file=f,
                    language=WHISPER_LANGUAGE)
            self.transcription_progress.emit(100)
            self.transcription_complete.emit(result.text)
        else:
            result = self.model.transcribe(
                self.audio_file, language=WHISPER_LANGUAGE,
                fp16=False, verbose=False)
            self.transcription_progress.emit(100)
            self.transcription_complete.emit(result["text"])
    except Exception as e:
        self.transcription_error.emit(str(e))
This function runs in a background thread when transcription starts.
Wrap everything in error handling...
Tell the UI we are 10% done (progress bar updates).
Check which mode we are in — cloud or local?
Cloud mode: Create an OpenAI API client.
Open the audio file for reading...
Send it to OpenAI’s Whisper API...
...using the whisper-1 model and our configured language.
Tell the UI we are 100% done.
Send the transcribed text back to the UI via a signal.
Local mode: (no internet needed)
Run Whisper directly on the Pi...
...with our language setting.
fp16=False because the Pi lacks GPU support for half-precision math. verbose=False to keep logs clean.
Tell the UI we are done.
Send the text back. Local Whisper returns a dict, so we access ["text"].
If anything goes wrong (bad file, network error, out of memory)...
...send the error message to the UI so the student sees what happened.
The System Prompt: Grounding the AI
This is the most important piece of prompt engineering in the entire app. It tells the AI what it knows and what it must refuse to discuss.
{"role": "system", "content": f"""You are a helpful study buddy.
The student has provided a lecture transcription.
Help them understand concepts, answer questions,
create summaries, and provide study guidance
based SOLELY on the lecture content.
If asked about content not in the lecture, say
you can only discuss the lecture material.
Lecture content: {lecture_text[:8000]}"""}
This is a system prompt — the AI’s hidden instructions.
Tell the AI it has a lecture transcription to work from.
Define what it can do: explain concepts, answer questions...
...create summaries, give study advice.
The critical word: SOLELY. The AI must ONLY use lecture content — no making things up.
If the student asks something outside the lecture...
...the AI must refuse and say it can only discuss what was taught.
Paste the actual lecture text, but truncate to 8,000 characters to fit within the AI’s context window.
Without this word, the AI would happily answer any question — even about things the professor never mentioned. The SOLELY instruction grounds the AI in the lecture, preventing it from inventing facts. This is the difference between a reliable study tool and a hallucination machine.
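A helper that assembles this grounded payload might look like the sketch below. The function name is ours; the prompt text and the 8,000-character truncation come from the excerpt above.

```python
def build_study_messages(lecture_text: str, question: str) -> list:
    """Assemble the grounded chat payload (helper name is ours)."""
    system_content = (
        "You are a helpful study buddy. "
        "The student has provided a lecture transcription. "
        "Help them understand concepts, answer questions, "
        "create summaries, and provide study guidance "
        "based SOLELY on the lecture content. "
        "If asked about content not in the lecture, say "
        "you can only discuss the lecture material.\n"
        # Truncate so the prompt fits in the model's context window.
        f"Lecture content: {lecture_text[:8000]}")
    return [{"role": "system", "content": system_content},
            {"role": "user", "content": question}]
```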
Configuring the Voice Session
When a voice study session starts, the app sends this configuration to OpenAI over the WebSocket. Every setting shapes how the conversation will work.
session_update = {
    "type": "session.update",
    "session": {
        "voice": OPENAI_VOICE,
        "instructions": self.instructions,
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1",
            "language": "en",
        },
        "turn_detection": None,
    },
}
self._send(session_update)
Build a configuration message to send to OpenAI...
This is a “session update” command.
Configure the session with these settings:
Use the configured AI voice (a natural-sounding speaker).
Give the AI the full lecture context as its instructions — so it knows what was taught.
Enable both text AND audio responses simultaneously.
Send audio as raw 16-bit PCM16 (uncompressed, fast).
Receive audio back in the same raw format.
Also transcribe what the student says using Whisper...
...in English.
turn_detection set to None — this means push-to-talk. More reliable in noisy classrooms than automatic voice detection.
Send this configuration over the WebSocket connection.
The Resampling Trick
The microphone records at 16,000 samples/second but OpenAI’s Realtime API expects 24,000. This code mathematically “stretches” the audio to fill the gap.
src_positions = np.linspace(0.0, duration, num=data.size, endpoint=False)
dst_positions = np.linspace(0.0, duration, num=target_samples, endpoint=False)
resampled = np.interp(dst_positions, src_positions, data).astype(np.int16)
Create 16,000 evenly-spaced time markers for the original audio samples (where we HAVE data).
Create 24,000 evenly-spaced time markers for the target rate (where we NEED data).
Use interpolation to calculate what the audio would sound like at each new marker. Convert back to 16-bit integers for the API.
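Wrapped into a self-contained function (the name and signature are ours; the app's surrounding code is not shown), the same interpolation looks like this:

```python
import numpy as np

def resample_pcm16(data: np.ndarray, src_rate: int = 16000,
                   dst_rate: int = 24000) -> np.ndarray:
    """Linearly interpolate int16 PCM from src_rate to dst_rate."""
    duration = data.size / src_rate
    target_samples = int(round(duration * dst_rate))
    # Where we HAVE samples vs. where we NEED samples on the time axis.
    src_positions = np.linspace(0.0, duration, num=data.size, endpoint=False)
    dst_positions = np.linspace(0.0, duration, num=target_samples,
                                endpoint=False)
    return np.interp(dst_positions, src_positions, data).astype(np.int16)

chunk = np.zeros(1600, dtype=np.int16)  # 100 ms of silence at 16 kHz
out = resample_pcm16(chunk)             # 100 ms at 24 kHz
```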
The Realtime API streams audio chunks as they are generated — the AI starts speaking before it has finished thinking. This is like how a human interpreter starts translating mid-sentence rather than waiting for you to finish. It makes conversations feel natural instead of robotic.
Check Your Understanding
Why does Lecture Buddy use push-to-talk instead of automatic voice detection?
What does the system prompt’s “SOLELY” instruction prevent?
The mic records at 16 kHz but OpenAI expects 24 kHz. How is the gap resolved?
Voice Study Session
The flagship feature under a microscope — like watching a symphony conductor’s hands and understanding every gesture.
The Flagship Feature
The ConversationScreen class is ~877 lines of code — the single largest component in the entire app. It orchestrates push-to-talk recording, real-time audio streaming, live transcription display, visual context integration, and session saving.
When the screen opens, it loads the full transcript from the database. A background worker thread fetches or generates visual descriptions of keyframes.
The transcript + visual descriptions are combined into a single instructions string. This is what the AI “knows” about the lecture.
The RealtimeConversationClient connects to OpenAI, sends the session config, and waits for the student to speak.
Student holds the red button → audio captured → resampled 16 kHz → 24 kHz → sent via WebSocket → AI responds with streaming audio + text.
When the student leaves, the full conversation log (every question and answer) is saved to the database with a timestamp and duration.
Data Flow: One Question, One Answer
Follow the data as the student asks “What is the chain rule?” and hears the AI respond.
Giving the Voice AI “Eyes”
The Realtime API can only handle text and audio — no images. So before the session starts, a VisualDescriptionWorker runs in the background to convert keyframe images into text descriptions.
self.visual_worker = VisualDescriptionWorker(lecture_id, title)
self.visual_worker.finished.connect(
    self._on_visual_descriptions_ready)
self.visual_worker.start()

def _on_visual_descriptions_ready(self, descriptions):
    if descriptions:
        self.instructions += "\n\nVisual content:\n" + descriptions
        if self.realtime_client:
            self.realtime_client.update_instructions(
                self.instructions)
Create a background worker to generate visual descriptions (won’t freeze the UI).
When the worker finishes, call our handler method.
Start the background work.
When descriptions arrive from the background worker...
If we got valid descriptions...
...append them to the AI’s instructions (“here’s what the whiteboard showed”).
If the WebSocket is already connected...
...hot-update the AI’s instructions mid-session.
The AI now “sees” the whiteboard through text descriptions.
Notice how the visual descriptions can arrive after the WebSocket is already open. The code handles this gracefully — it appends the descriptions and sends a session update to the live connection. This means the student can start talking immediately while the visual context loads in the background.
Saving Study Sessions
Every voice conversation is saved so students can review what they discussed. The session includes the full conversation log, duration, and automatic naming.
def save_study_session(self, lecture_id, lecture_title,
                       duration_seconds, conversation_log):
    session_number = self.get_next_session_number(lecture_id)
    short_title = lecture_title[:30]
    session_title = f"{short_title} - Session {session_number}"
    c.execute(
        """INSERT INTO study_sessions
           (lecture_id, session_number, title,
            date_created, duration_seconds, conversation_log)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (lecture_id, session_number, session_title,
         datetime.datetime.now().isoformat(),
         duration_seconds, json.dumps(conversation_log)))
Save a study session for a given lecture...
Figure out what number this session is (Session 1, 2, 3...).
Truncate the lecture title to 30 characters so it fits on the small screen.
Auto-generate a name like “Calculus Lecture 3 - Session 2.”
Insert a new row into the study_sessions table...
...with the lecture ID, session number, auto-generated title,
current timestamp,
how long the session lasted, and the full conversation as JSON.
When the Connection Drops
WebSocket connections are fragile. WiFi glitches, server timeouts, or network switches can kill the connection mid-conversation. The app handles this gracefully with auto-reconnect.
The reconnection preserves the lecture context (transcript + visual descriptions) but loses the conversation history from the current session. This is a deliberate trade-off — rebuilding the full conversation context would be slow and expensive. The student can simply re-ask their last question.
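The reconnect logic itself is not shown, but the standard pattern is exponential backoff: retry, wait, double the wait, give up after a few attempts. A generic sketch:

```python
import time

def reconnect_with_backoff(connect, max_attempts: int = 5,
                           base_delay: float = 1.0):
    """Retry `connect` with exponential backoff (1 s, 2 s, 4 s, ...)."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))
```

On a successful reconnect the app would resend the session configuration (lecture transcript and visual descriptions included) but not the prior conversation turns, matching the trade-off described above.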
Check Your Understanding
What happens if the visual descriptions take 5 seconds to generate but the student starts talking after 2 seconds?
What does the app preserve when a WebSocket connection drops and reconnects?
How is the conversation log stored in the database?
The Smart Eye
Picking the few frames that matter from hours of video — like a photographer culling thousands of shots down to the best twenty.
Why Not Save Every Frame?
At 2 frames per second, a one-hour lecture produces 7,200 frames. Most are nearly identical. Sending all of them to an AI would be slow, expensive, and pointless.
frame_selector.py uses four strategies to find only the frames worth keeping, then removes duplicates and enforces a budget.
Content Change (SSIM)
Compare each frame to the previous one using SSIM. If they differ enough, something changed — keep the new frame.
Stability Detection
If the frame has been unchanged for 3+ seconds, the professor finished writing. Capture the completed work.
Audio Cue Matching
Scan the transcript for phrases like “as you can see” or “this equation shows.” The professor is pointing at something visual.
Perceptual Dedup
After all candidates are chosen, remove frames that look the same using perceptual hashing.
Strategy 1: Content Change Detection
The core comparison converts frames to grayscale and blurs them to suppress sensor noise, then uses SSIM to measure similarity. A score below the 0.85 threshold signals a meaningful change.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)
if self._prev_gray is not None:
    similarity = ssim(self._prev_gray, gray)
    if similarity < self.ssim_threshold:
        trigger = "content_change"
        score = 1.0 - similarity
Convert the color frame to grayscale — color is irrelevant for change detection.
Blur slightly to ignore tiny sensor noise.
If we have a previous frame to compare against...
Calculate how similar this frame is to the last one (0 = totally different, 1 = identical).
If similarity dropped below 0.85, something meaningful changed...
...mark this as a “content change” trigger.
Score = how MUCH it changed (more change = higher score = more important).
Strategy 3: Listening to the Professor
The code has 25+ regex patterns that match common phrases professors use when referencing visual content.
"as you can see here"
Pointing phrases — professor directing attention. Confidence: 0.9
"this equation shows"
Equation phrases — formula being presented. Confidence: 0.85
"which completes the proof"
Completion phrases — something just finished. Confidence: 0.9
"draw a line from..."
Diagram phrases — spatial references. Confidence: 0.7–0.85
def _check_audio_cues(self, timestamp, segments, window=3.0):
    for segment in segments:
        if not (segment.start <= timestamp + window
                and segment.end >= timestamp - window):
            continue
        text = segment.text.lower()
        for pattern, confidence in self._cue_patterns:
            if pattern.search(text):
                return VisualCue(timestamp, segment.text, ...)
Check if the professor said something visual near this moment (3-second window).
Look through each transcript segment...
If this segment doesn’t overlap our time window...
...skip it.
Convert to lowercase for case-insensitive matching.
Test every regex pattern against the text...
If a pattern matches, we found a visual cue!
Return the cue with its timestamp, text, and confidence score.
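A miniature version of the pattern table makes the mechanism concrete. These four entries are a hypothetical subset of the app's 25+ patterns, using the confidences quoted above:

```python
import re

# Hypothetical subset of the cue patterns: (compiled regex, confidence).
CUE_PATTERNS = [
    (re.compile(r"as you can see"), 0.9),      # pointing phrase
    (re.compile(r"this equation shows"), 0.85),  # equation phrase
    (re.compile(r"completes the proof"), 0.9),   # completion phrase
    (re.compile(r"draw a line"), 0.7),           # diagram phrase
]

def first_cue_confidence(text: str):
    """Return the confidence of the first matching cue, or None."""
    lowered = text.lower()  # case-insensitive matching
    for pattern, confidence in CUE_PATTERNS:
        if pattern.search(lowered):
            return confidence
    return None
```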
From Frames to Words
vision_context.py sends keyframes to GPT-4o-mini Vision, which describes what it sees. Those descriptions are cached in SQLite so you never pay for the same frames twice.
def get_or_generate_visual_descriptions(lecture_id, title):
    c.execute("SELECT visual_descriptions FROM lectures WHERE id = ?",
              (lecture_id,))
    row = c.fetchone()
    if row and row[0]:
        return row[0]  # Cached! Free!
    descriptions = generate_visual_descriptions(lecture_id, title)
    if descriptions:
        c.execute("UPDATE lectures SET visual_descriptions = ? WHERE id = ?",
                  (descriptions, lecture_id))
    return descriptions
Given a lecture, get visual descriptions — from cache if possible.
Ask the database: “Do we already have descriptions for this lecture?”
Get the result.
If cached descriptions exist...
...return them. No API call needed — free!
No cache hit. Call GPT-4o-mini Vision to describe the keyframes.
If we got descriptions back...
...save them to the database for next time.
Return descriptions (cached or freshly generated).
This is a universal pattern in software: check the cache first, compute only if needed, then cache the result. The first study session pays for the Vision API call. Every session after that gets the descriptions for free from SQLite. At ~85 tokens per image in “low” detail mode, 10 keyframes cost about $0.000128 — but only once.
Strategy 4: Removing Lookalikes
After all candidates are selected, perceptual deduplication removes frames that look essentially the same despite passing the other filters.
def deduplicate_frames(self, keyframes, hash_threshold=10):
    unique = [keyframes[0]]
    hashes = [self._compute_phash(keyframes[0].image)]
    for kf in keyframes[1:]:
        h = self._compute_phash(kf.image)
        is_duplicate = any(h - existing < hash_threshold
                           for existing in hashes)
        if not is_duplicate:
            unique.append(kf)
            hashes.append(h)
    return unique
Remove visually similar frames. Threshold of 10 means “hashes within 10 bits are duplicates.”
Start with the first frame — it’s always unique.
Compute its perceptual hash (digital fingerprint).
For each remaining candidate frame...
Compute its hash.
Check if this hash is too close to ANY hash we’ve already kept.
(Subtracting hashes gives the “distance” — how different the images look.)
If it’s NOT a duplicate...
...keep it and add its hash to our collection.
Return only the unique frames.
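The app uses perceptual hashing (pHash). As a simplified stand-in, here is an average hash, which captures the same idea: a compact visual fingerprint whose Hamming distance measures how different two images look.

```python
import numpy as np

def average_hash(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Compact fingerprint: downscale to size x size block averages,
    then threshold each cell at the overall mean."""
    h, w = img.shape
    cropped = img[:h - h % size, :w - w % size]  # make dims divisible
    hh, ww = cropped.shape
    blocks = cropped.reshape(size, hh // size,
                             size, ww // size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).astype(np.uint8)

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """How many fingerprint bits differ (0 = visually identical)."""
    return int(np.count_nonzero(h1 != h2))
```

With the dedup threshold of 10, two frames whose fingerprints differ by fewer than 10 bits would be treated as duplicates.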
Bonus: LaTeX Rendering via Matplotlib
When the AI includes math formulas in text chat responses, they need to look good. The app renders LaTeX expressions as PNG images using matplotlib — a plotting library repurposed as a math typesetter.
def _render_latex_to_image(latex_expr):
    fig, ax = plt.subplots(figsize=(0.01, 0.01))
    ax.axis('off')
    text = ax.text(0, 0, latex_expr, fontsize=10,
                   color='white')
    fig.canvas.draw()
    bbox = text.get_window_extent()
    dpi = fig.dpi
    fig.set_size_inches((bbox.width + 6) / dpi,
                        (bbox.height + 6) / dpi)
    buf = io.BytesIO()
    fig.savefig(buf, format='png', transparent=True)
    plt.close(fig)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f'<img src="data:image/png;base64,{b64}" />'
Turn a LaTeX math expression into an inline image.
Create a tiny matplotlib figure (a “canvas” for drawing).
Hide the axes — we only want the math, not a chart.
Place the LaTeX text on the canvas with white color
(for the dark theme).
Render the figure to calculate the exact size of the text.
Measure how big the text actually is.
Resize the figure to tightly fit the text
(with a tiny 6-pixel margin).
Create an in-memory buffer (no file on disk).
Save the figure as a transparent PNG into the buffer.
Convert the PNG bytes to a base64 string.
Return an HTML img tag with the image embedded inline. No file needed!
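The final two steps, base64 encoding and inline embedding, work independently of matplotlib. A minimal sketch (the `embed_png` name is hypothetical, and the byte string is a stand-in, not a real PNG):

```python
import base64

def embed_png(png_bytes):
    """Wrap raw PNG bytes in an HTML <img> tag via a base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode()
    return f'<img src="data:image/png;base64,{b64}" />'

# Any PNG bytes work; rich-text widgets can render the tag with no
# temporary file ever touching the disk.
tag = embed_png(b"\x89PNG\r\n\x1a\n")
# tag starts with '<img src="data:image/png;base64,'
```

Data URIs trade a ~33% size increase (base64 overhead) for zero file management, a good deal for small formula images.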
Check Your Understanding
Why capture a frame when the video has been stable for 3 seconds?
Why generate text descriptions of keyframes instead of sending images to the voice AI?
How does Lecture Buddy render mathematical formulas in the chat?
The Interface
Fitting a full cockpit dashboard into a wristwatch — every pixel counts on a 3.5-inch touchscreen.
480 × 320 Pixels
The Raspberry Pi’s touchscreen is smaller than most phones. Every design choice revolves around fat fingers on a tiny display.
Defined in config.py as MIN_TOUCH_TARGET (40 px). Apple's and Google's design guidelines recommend comparable minimum tappable sizes — 44 pt and 48 dp, respectively.
Dark backgrounds (#1E1E1E) with bright cyan (#00BCD4). High contrast for readability in dim lecture halls.
AUTO_FULLSCREEN = True — no title bar, no taskbar. Every pixel goes to the UI.
All screens live in a QStackedWidget. Switching is instant — no loading, no lag.
The Six Screens
Each screen is a separate QWidget class, created once at startup and stacked invisibly.
Instant Screen Switching
All six screens are created once at startup. Navigating is a single line of code.
self.stack = QStackedWidget()
self.setCentralWidget(self.stack)
self.main_menu = MainMenu(self)
self.recorder_screen = RecorderScreen(self)
self.library_screen = LibraryScreen(self)
self.chat_screen = ChatScreen(self)
self.conversation_screen = ConversationScreen(self)
self.device_test_screen = DeviceTestScreen(self)
def show_recorder(self):
self.stack.setCurrentWidget(self.recorder_screen)
Create a stack that holds all screens (only one visible at a time).
Make it the main content area of the window.
Create all six screens upfront:
Main menu — the carousel home screen.
Recording with timer, waveform, and mic controls.
Library for browsing saved lectures.
Text chat (type a question, read an answer).
Voice conversation (the flagship feature).
Device test (camera preview, mic recording test).
To navigate, just tell the stack which screen to show.
One line. Instant switch. Zero loading time.
The Xbox 360 Carousel
The home screen uses a custom carousel inspired by the Xbox 360’s “Aurora” dashboard. The center item is large; adjacent items shrink smoothly. This is done with linear interpolation — every visual property transitions fluidly.
def apply_continuous(self, abs_dist):
    d = max(0.0, min(abs_dist, 2.0))
    anchors = {
        0: (60, 11, 3, "#00bcd4", "#1e1e2e"),
        1: (44, 9, 2, "#444444", "#1a1a28"),
        2: (32, 8, 1, "#333333", "#161622"),
    }
    lo = int(d); hi = min(lo + 1, 2)
    frac = d - lo
    a0, a1 = anchors[lo], anchors[hi]
    icon_sz = int(a0[0] + frac * (a1[0] - a0[0]))
Given how far this card is from center (0=center, 1=adjacent, 2=outer)...
Clamp distance to 0–2.
Define three “anchor” states:
Center: big icon (60px), large font (11pt), thick cyan border, dark bg.
Adjacent: medium icon (44px), smaller font, thin gray border.
Outer: small icon (32px), smallest font, thinnest border, darkest bg.
Find the two nearest anchors and how far between them we are.
E.g., distance 1.3 = 30% between “adjacent” and “outer.”
Blend the icon size smoothly between the two anchor values.
result = start + fraction * (end - start). At fraction 0, you get start. At 1, you get end. Anywhere between, a smooth blend. This one formula powers the carousel’s icon sizes, font sizes, border widths, border colors, and background colors — all interpolated simultaneously for buttery-smooth swiping.
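The lerp formula can be demonstrated in plain Python, no Qt required. A small sketch (helper names `lerp` and `lerp_color` are illustrative, not from the app) showing both a size blend and the channel-by-channel color blend the paragraph mentions:

```python
def lerp(start, end, frac):
    """Linear interpolation: frac=0 gives start, frac=1 gives end."""
    return start + frac * (end - start)

def lerp_color(c0, c1, frac):
    """Blend two '#rrggbb' colors channel by channel."""
    ch0 = [int(c0[i:i + 2], 16) for i in (1, 3, 5)]
    ch1 = [int(c1[i:i + 2], 16) for i in (1, 3, 5)]
    blended = [round(lerp(a, b, frac)) for a, b in zip(ch0, ch1)]
    return "#{:02x}{:02x}{:02x}".format(*blended)

# Icon size 30% of the way from "adjacent" (44px) toward "outer" (32px):
size = lerp(44, 32, 0.3)
# Border color halfway between center cyan and adjacent gray:
mid = lerp_color("#00bcd4", "#444444", 0.5)   # "#22808c"
```

Because every property uses the same fraction, all of them arrive at their targets together, which is what makes the swipe feel like one continuous motion instead of several staggered ones.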
TouchButton: Fat-Finger Friendly
Every button inherits from TouchButton, enforcing 40px minimum height and adding a drop shadow for depth.
class TouchButton(QPushButton):
def __init__(self, text, parent=None):
super().__init__(text, parent)
self.setMinimumHeight(MIN_TOUCH_TARGET)
self.setStyleSheet("""
QPushButton { background: #00bcd4;
color: white; border-radius: 8px; }
QPushButton:pressed { background: #0097a7; }
""")
shadow = QGraphicsDropShadowEffect(self)
shadow.setBlurRadius(10)
shadow.setColor(QColor(0, 0, 0, 60))
shadow.setOffset(0, 2)
self.setGraphicsEffect(shadow)
Create a custom button that inherits everything from Qt’s standard button.
When created...
...do the default button setup.
Enforce 40px minimum height — fingers are big, buttons must be too.
Apply visual styling (CSS-like syntax in Qt):
Cyan background, white text, rounded corners.
When pressed, darken to deeper teal — instant visual feedback.
Add a subtle drop shadow below the button...
...10px blur for a soft look.
Black at 24% opacity — visible but not harsh.
Offset 2px down (light from above).
Apply the shadow. Every TouchButton now “floats.”
Three rules: (1) 40px minimum touch targets. (2) Instant visual feedback — the :pressed state darkens the button so users know their tap registered. (3) Generous spacing — every layout.setSpacing() and setContentsMargins() call enforces breathing room to prevent accidental taps.