What It Does
A pocket-sized study partner that listens to lectures and talks back — like having a tutor who attended every class and never forgets.
The Elevator Pitch
Lecture Buddy is a Raspberry Pi app that records your lectures, transcribes them with AI, and then lets you have a voice conversation with an AI that knows exactly what was taught.
Record
Captures audio and video simultaneously from the Pi’s microphone and camera. Records entire lectures hands-free.
Transcribe
Converts recorded audio to text using Whisper — either locally on the Pi or via the cloud.
Study
Two study modes: text chat (type a question, read an answer) and voice conversation (speak and hear the AI respond in real-time).
See
Extracts key frames from the lecture video — whiteboard photos, slide changes — so the AI knows what was shown, not just what was said.
A Day with Lecture Buddy
Imagine you are sitting in a calculus lecture with Lecture Buddy on your desk. Here is what happens.
The app starts capturing audio at 16,000 samples/second and video at 2 frames/second — simultaneously, in separate background threads.
The Pi quietly records. A live waveform on screen shows audio is being captured. A timer counts up.
Audio is saved as a WAV file, video as MP4. Whisper converts the audio to text. Smart algorithms extract key video frames.
Choose “Voice Study” — the flagship mode. Hold the red button and ask: “Can you explain the chain rule?”
It knows the transcript, the whiteboard photos, and the formulas. It answers in a natural voice, referencing what the professor actually said.
The Tech Stack
Under the hood, Lecture Buddy connects seven Python modules and three external AI services.
PySide6 (Qt6)
The entire touchscreen UI — buttons, screens, animations, touch gestures. Qt is a professional-grade UI framework used in KDE, Tesla dashboards, and VLC.
OpenAI Whisper
Speech-to-text AI. Runs locally on the Pi (free but slow) or via cloud API (fast but costs money).
OpenAI Realtime API
Voice-to-voice conversations over WebSocket. The AI speaks back as it thinks — no waiting.
GPT-4o-mini
Text chat + Vision. Answers typed questions and describes whiteboard photos in words so the voice AI can “see.”
OpenCV + scikit-image
Computer vision libraries for video capture, frame comparison, and keyframe extraction.
SQLite
A local database that stores lecture metadata, study sessions, and cached AI descriptions.
Check Your Understanding
What makes the voice study mode feel like a real conversation instead of a slow back-and-forth?
Why does the app have BOTH text chat and voice conversation modes?
How does the voice AI know what was written on the whiteboard?
Meet the Cast
Like a film crew where each person has a specialized job — seven files, seven roles, one production.
The Cast of Characters
Lecture Buddy is built from seven Python modules. Each file has one job, just like each person on a film crew.
main.py (~3,370 lines). The central hub: all six UI screens, the database, the audio recorder, and the transcriber. Everything flows through here.
454 lines. Manages WebSocket voice conversations with OpenAI’s Realtime API. Streams audio in both directions simultaneously.
config.py (57 lines). Every setting and constant in one place: API keys, model names, audio rates, screen sizes. The single source of truth.
frame_selector.py (499 lines). Picks the best keyframes from lecture videos using 4 different strategies. Finds the moments where the whiteboard actually changed.
vision_context.py (461 lines). Sends keyframe images to GPT-4o-mini Vision and gets back text descriptions. Caches results so the same image is never described twice.
218 lines. Handles video recording and the live camera preview. Runs in a background thread so the UI stays responsive.
import_lecture.py (161 lines). A CLI tool for importing existing video files from outside the app. Useful when you already have lecture recordings.
The Tech Stack Pyramid
Four layers, from physical hardware at the bottom to AI brains at the top. Each layer depends on the one below it.
AI Layer
Whisper for transcription, GPT-4o-mini for text chat and vision descriptions, and the Realtime API for live voice conversations.
Framework Layer
PySide6/Qt6 builds the touchscreen UI, OpenCV handles video capture and processing, and sounddevice records audio from the USB microphone.
Storage Layer
SQLite stores lecture metadata and study sessions. Audio files (WAV), video files (MP4), and transcripts live on the local filesystem.
Hardware Layer
Raspberry Pi as the brain, a 3.5″ touchscreen for input/output, a USB microphone for audio capture, and a camera module for video.
Startup Sequence: “Voice Study”
When a student taps “Voice Study” on a lecture, here’s the behind-the-scenes conversation between the files.
Code Translation: Database Schema
This is the schema that stores every lecture and study session. Let’s read it line by line.
-- main.py (DatabaseManager)
c.execute('''CREATE TABLE IF NOT EXISTS lectures
             (id INTEGER PRIMARY KEY,
              title TEXT NOT NULL,
              filename TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              transcription_path TEXT,
              video_path TEXT)''')
CREATE TABLE IF NOT EXISTS lectures → “Create a table called lectures if it doesn’t already exist.”
id INTEGER PRIMARY KEY → “Each lecture gets a unique number. This is its ID — like a student ID card.”
title TEXT NOT NULL → “The lecture’s name (e.g., ‘Calculus Lecture 3’). Can’t be blank.”
filename TEXT NOT NULL → “The audio file name on disk. Also required.”
date_created TEXT NOT NULL → “When the recording was made. Stored as text (e.g., ‘2025-03-15 09:30:00’).”
duration_seconds INTEGER → “How long the lecture was, in seconds. Optional.”
transcription_path TEXT → “Where to find the text transcript file. Optional until transcription finishes.”
video_path TEXT → “Where to find the video file. Optional — some lectures might be audio-only.”
c.execute('''CREATE TABLE IF NOT EXISTS study_sessions
             (id INTEGER PRIMARY KEY,
              lecture_id INTEGER NOT NULL,
              session_number INTEGER NOT NULL,
              title TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              conversation_log TEXT,
              FOREIGN KEY (lecture_id) REFERENCES lectures(id))''')
CREATE TABLE IF NOT EXISTS study_sessions → “Create a table called study_sessions to track every study conversation.”
id INTEGER PRIMARY KEY → “Each study session gets its own unique number.”
lecture_id INTEGER NOT NULL → “Which lecture was this session about? Links back to the lectures table.”
session_number INTEGER NOT NULL → “Is this the 1st, 2nd, or 5th time studying this lecture?”
title TEXT NOT NULL → “A name for this session (e.g., ‘Study Session 2’).”
date_created TEXT NOT NULL → “When the study session started.”
duration_seconds INTEGER → “How long you studied. Optional.”
conversation_log TEXT → “The full transcript of your conversation with the AI. Optional.”
FOREIGN KEY (lecture_id) REFERENCES lectures(id) → “A rule: lecture_id must match an actual lecture. You can’t have an orphan study session.”
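The schema above can be exercised end to end with Python's built-in sqlite3 module. This is a sketch, not the app's code; in particular, the `PRAGMA foreign_keys = ON` line is our addition, since SQLite only enforces FOREIGN KEY rules when that pragma is enabled on the connection:

```python
import sqlite3

# In-memory database so the demo leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS lectures
             (id INTEGER PRIMARY KEY,
              title TEXT NOT NULL,
              filename TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              transcription_path TEXT,
              video_path TEXT)''')
c.execute('''CREATE TABLE IF NOT EXISTS study_sessions
             (id INTEGER PRIMARY KEY,
              lecture_id INTEGER NOT NULL,
              session_number INTEGER NOT NULL,
              title TEXT NOT NULL,
              date_created TEXT NOT NULL,
              duration_seconds INTEGER,
              conversation_log TEXT,
              FOREIGN KEY (lecture_id) REFERENCES lectures(id))''')

# A valid lecture row: only the NOT NULL columns are required.
c.execute("INSERT INTO lectures (title, filename, date_created) "
          "VALUES (?, ?, ?)",
          ("Calculus Lecture 3", "lec3.wav", "2025-03-15 09:30:00"))
lecture_id = c.lastrowid

# An orphan study session (lecture_id 999 doesn't exist) is rejected.
try:
    c.execute("INSERT INTO study_sessions "
              "(lecture_id, session_number, title, date_created) "
              "VALUES (999, 1, 'Session 1', '2025-03-15 10:00:00')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
```

The second insert fails because lecture 999 does not exist, which is exactly the "no orphan study sessions" rule described above.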
Check Your Understanding
Which file would you modify to change the AI’s voice from ‘cedar’ to a different voice?
Why is main.py so much larger than the other files?
What does import_lecture.py do that main.py doesn’t?
How They Talk
Like a relay race where each runner passes the baton to the next — signals fly between components, and data flows from microphone to AI tutor.
Signals and Slots
Signals and slots are Qt’s way of letting components talk without knowing about each other. This is the framework pattern used everywhere in Lecture Buddy.
“Hey, something happened!” For example: recording_finished, transcription_complete. The sender doesn’t care who’s listening.
A regular function that runs when it hears a specific signal. For example: on_recording_finished saves the file and starts transcription.
signal.connect(slot) — now whenever the signal fires, the slot runs automatically. One line of code creates the relationship.
This is how Qt components talk without knowing about each other. The recorder doesn’t import the transcriber. It just announces “I’m done” and whoever is listening takes over.
Code Translation: Signal/Slot in Action
Here’s the real code from the audio recorder. Two signals declared, two connections made.
# main.py (AudioRecorder)
class AudioRecorder(QThread):
    recording_finished = Signal(str, int)  # filename, duration
    recording_error = Signal(str)          # error message
class AudioRecorder(QThread): → “Create a class called AudioRecorder that runs in its own thread.”
recording_finished = Signal(str, int) → “Declare an announcement called recording_finished that will carry two pieces of data: a filename (text) and a duration (number).”
recording_error = Signal(str) → “Declare another announcement called recording_error that carries an error message (text).”
# main.py (RecorderScreen)
self.recorder.recording_finished.connect(self.on_recording_finished)
self.recorder.recording_error.connect(self.on_recording_error)
self.recorder.recording_finished.connect(self.on_recording_finished) → “When the recorder announces recording_finished, automatically call my on_recording_finished method.”
self.recorder.recording_error.connect(self.on_recording_error) → “When the recorder announces recording_error, automatically call my on_recording_error method.”
The Full Journey: Record to Study
Follow the baton from the moment a student taps Record to the moment they’re chatting with the AI about their lecture.
The Config Hub
config.py is the single source of truth. Every setting lives in one 57-line file — no hunting through 3,000 lines of main.py.
OPENAI_API_KEY
Loaded from a .env file — never hardcoded! If this key leaked, anyone could run up your OpenAI bill.
OPENAI_MODEL = "gpt-4o-mini"
The text chat brain. Used for answering typed questions and generating vision descriptions of keyframes.
OPENAI_VOICE_MODEL = "gpt-4o-realtime-preview"
The voice brain. Used for real-time voice conversations over WebSocket.
AUDIO_RATE = 16000
16 kHz sample rate, optimized for Whisper. Higher rates waste storage without improving transcription quality for speech.
VIDEO_FPS = 2.0
Only 2 frames per second. Lectures are mostly static — recording at 24 FPS would waste gigabytes of storage for no benefit.
MIN_TOUCH_TARGET = 40
Minimum button size in pixels. On a tiny 3.5″ touchscreen, buttons smaller than 40px are nearly impossible to tap accurately.
AUTO_FULLSCREEN = True
Fills the tiny 480×320 screen completely. Every pixel matters when your display is the size of a business card.
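Taken together, the settings above suggest a config.py shaped roughly like this sketch. The names and values come from this section; reading the key via `os.environ` is a stand-in for the app's .env loading step, which is not shown:

```python
import os

# Sketch of config.py based on the settings described above.
# The app loads the API key from a .env file; os.environ here is a
# stand-in for that step, never a hardcoded secret.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_MODEL = "gpt-4o-mini"                    # text chat + vision
OPENAI_VOICE_MODEL = "gpt-4o-realtime-preview"  # realtime voice

AUDIO_RATE = 16000        # Hz, Whisper's preferred sample rate
VIDEO_FPS = 2.0           # lectures are mostly static
MIN_TOUCH_TARGET = 40     # px, smallest reliably tappable button
AUTO_FULLSCREEN = True    # fill the whole 480x320 screen
```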
Check Your Understanding
A signal fires but nothing happens. What’s the most likely cause?
Why does config.py load the API key from a .env file instead of writing it directly in the code?
What would break if you changed AUDIO_RATE from 16000 to 44100?
The Recording Pipeline
Two streams running in parallel — like a recording studio with separate audio and video booths capturing simultaneously, all triggered by a single button.
Two Streams, One Button
When the student taps Record, four things happen in rapid sequence. Two parallel pipelines spin up and run independently until the student taps Stop.
Captures audio via sounddevice at 16 kHz. Each chunk of audio data goes into a queue for later file writing.
Captures video via OpenCV at 2 frames per second. Each frame is written directly to an MP4 file on disk.
Each recorder lives in its own background thread. The main UI thread stays responsive — the student can still interact with the touchscreen while recording.
Both threads finish cleanly. Audio is saved as a WAV file, video as MP4. Then the Whisper transcription pipeline kicks off automatically.
Inside the Audio Recorder
This callback function fires every time the microphone captures a chunk of sound.
def callback(self, indata, frames, time, status):
    if status:
        logger.warning("Audio callback: %s", status)
    if self.recording:
        self.audio_queue.put(indata.copy())
Every time the microphone captures a tiny chunk of sound, this function runs automatically.
If there was a hardware problem (like a glitch), log a warning so we can debug later.
If we are currently in recording mode...
...make a copy of the sound data and add it to our collection queue. The copy is critical — the original buffer gets recycled by the audio system and would be overwritten.
Inside the Video Recorder
This is the main loop that runs as long as recording is active. It grabs frames from the camera, rate-limits them to 2 FPS, and queues copies for live analysis.
while self.recording:
    ret, frame = self._cap.read()
    if not ret:
        QThread.msleep(10)
        continue
    elapsed = (datetime.datetime.now() - self.start_time).total_seconds()
    if elapsed - last_frame_time >= frame_interval:
        self._writer.write(frame)
        self.frame_count += 1
        last_frame_time = elapsed
        try:
            self._frame_queue.put_nowait((frame.copy(), elapsed))
        except queue.Full:
            pass  # Drop frame if queue is full
Keep looping as long as we are recording...
Ask the camera for the next frame.
If the camera did not give us a frame (maybe it was busy)...
...wait 10 milliseconds and try again.
Skip to the next loop iteration.
Calculate how many seconds have passed since recording started.
Only save a frame if enough time has passed (this is the rate limiter — limits us to 2 FPS).
Write the frame to the MP4 video file on disk.
Increment our frame counter.
Remember when we last saved a frame.
Try to send a copy of the frame to the live-analysis queue...
...but if the processing queue is already full...
...just drop this frame silently. Better to lose one frame than crash from running out of memory.
Import Lecture: The Secondary Feature
Not every lecture is recorded live on the Pi. The import_lecture.py script lets you bring in recordings made elsewhere — a CLI tool for batch importing.
python import_lecture.py video.mp4
Imports an existing video file that was recorded outside the app — from a phone, webcam, or lecture capture system.
ffmpeg → extract audio
Uses ffmpeg to rip the audio track from the video file into a separate WAV for transcription.
Whisper → transcription
Runs the same Whisper transcription pipeline as live recordings — audio in, text out.
SQLite → database
Saves the imported lecture to the same SQLite database as live recordings. It shows up in the library alongside everything else.
frame_selector.py → keyframes
Extracts keyframes using the same smart algorithms that live recordings use — content change, stability detection, audio cues.
import_lecture.py is a command-line tool, not a touchscreen interface. It lets you batch-import old recordings from a laptop or desktop without needing to interact with the 3.5-inch screen. Same pipeline, different entry point.
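The ffmpeg step can be sketched as follows. The function names are ours and the script's exact flags are not shown, but these are the standard ffmpeg options for producing a Whisper-ready WAV:

```python
import pathlib
import subprocess

def ffmpeg_audio_cmd(video_path: str, wav_path: str) -> list:
    """Build the ffmpeg command that rips the audio track to a
    mono 16 kHz WAV, the format the transcription step expects."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",           # drop the video stream
            "-ac", "1",      # mono
            "-ar", "16000",  # match AUDIO_RATE
            wav_path]

def extract_audio(video_path: str) -> str:
    """Run ffmpeg and return the path of the extracted WAV."""
    wav_path = str(pathlib.Path(video_path).with_suffix(".wav"))
    subprocess.run(ffmpeg_audio_cmd(video_path, wav_path), check=True)
    return wav_path
```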
Why These Specific Numbers?
Every setting in config.py was chosen deliberately for the Pi’s limited hardware. Here is the reasoning behind each one.
AUDIO_RATE = 16000
16 kHz sample rate — Whisper works best at exactly this frequency. Higher rates waste storage without improving transcription accuracy.
VIDEO_FPS = 2.0
2 frames per second — lectures are mostly static. That is 7,200 frames per hour, plenty to catch every slide change and whiteboard update.
VIDEO_WIDTH = 640
640 pixels wide — low resolution means smaller files and faster processing on the Pi’s limited CPU.
AUDIO_FORMAT = "int16"
16-bit integer audio — the sweet spot for speech. int16 captures voice clearly without the file bloat of 32-bit float.
Running audio and video capture in separate threads is like having two employees work simultaneously instead of taking turns. This is called “concurrency” — without it, capturing audio would freeze the video, and capturing video would freeze the UI.
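The idea can be demonstrated with plain threads. This is a toy sketch with fake data, not the app's QThread-based recorders: two workers run at the same time, and neither blocks the other.

```python
import queue
import threading
import time

audio_q: queue.Queue = queue.Queue()
frames_written = []

def audio_worker(stop: threading.Event) -> None:
    # Stand-in for the sounddevice callback: one fake chunk every 10 ms.
    while not stop.is_set():
        audio_q.put(b"\x00" * 320)
        time.sleep(0.01)

def video_worker(stop: threading.Event) -> None:
    # Stand-in for the OpenCV loop: one fake frame every 0.5 s (2 FPS).
    while not stop.is_set():
        frames_written.append(object())
        time.sleep(0.5)

stop = threading.Event()
workers = [threading.Thread(target=audio_worker, args=(stop,)),
           threading.Thread(target=video_worker, args=(stop,))]
for w in workers:
    w.start()
time.sleep(0.3)   # "record" for 0.3 seconds
stop.set()
for w in workers:
    w.join()
# The audio queue filled while video frames were written, concurrently.
```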
Check Your Understanding
What happens when the video frame queue is full and a new frame arrives?
Why does the audio callback use indata.copy() instead of just indata?
What would happen if VIDEO_FPS was set to 30 instead of 2?
The AI Brain
Three AI systems working together — like a multilingual interpreter converting between spoken words, written text, and AI understanding in real-time.
Three AI Systems
Lecture Buddy uses three separate AI systems, each optimized for a different job. They never compete — each handles a distinct piece of the puzzle.
Whisper (Local or Cloud)
Converts audio to text. Can run ON the Pi (free but slow) or via OpenAI’s API (fast, costs money). Toggle between modes with the USE_WHISPER_API flag in config.py.
GPT-4o-mini (Cloud)
Text chat + Vision. Student types a question, gets formatted text back with LaTeX math rendering. Also describes keyframe images in words for the voice AI.
Realtime API (Cloud)
Voice-to-voice over WebSocket. Student speaks, AI speaks back in real-time. The flagship study mode — feels like talking to a real tutor.
Dual-Mode Transcription
The WhisperTranscriber can work two ways — cloud or local — controlled by a single boolean flag.
def run(self):
    try:
        self.transcription_progress.emit(10)
        if USE_WHISPER_API:
            client = openai.OpenAI()
            with open(self.audio_file, "rb") as f:
                result = client.audio.transcriptions.create(
                    model="whisper-1", file=f,
                    language=WHISPER_LANGUAGE)
            self.transcription_progress.emit(100)
            self.transcription_complete.emit(result.text)
        else:
            result = self.model.transcribe(
                self.audio_file, language=WHISPER_LANGUAGE,
                fp16=False, verbose=False)
            self.transcription_progress.emit(100)
            self.transcription_complete.emit(result["text"])
    except Exception as e:
        self.transcription_error.emit(str(e))
This function runs in a background thread when transcription starts.
Wrap everything in error handling...
Tell the UI we are 10% done (progress bar updates).
Check which mode we are in — cloud or local?
Cloud mode: Create an OpenAI API client.
Open the audio file for reading...
Send it to OpenAI’s Whisper API...
...using the whisper-1 model and our configured language.
Tell the UI we are 100% done.
Send the transcribed text back to the UI via a signal.
Local mode: (no internet needed)
Run Whisper directly on the Pi...
...with our language setting.
fp16=False because the Pi lacks GPU support for half-precision math. verbose=False to keep logs clean.
Tell the UI we are done.
Send the text back. Local Whisper returns a dict, so we access ["text"].
If anything goes wrong (bad file, network error, out of memory)...
...send the error message to the UI so the student sees what happened.
The System Prompt: Grounding the AI
This is the most important piece of prompt engineering in the entire app. It tells the AI what it knows and what it must refuse to discuss.
{"role": "system", "content": f"""You are a helpful study buddy.
The student has provided a lecture transcription.
Help them understand concepts, answer questions,
create summaries, and provide study guidance
based SOLELY on the lecture content.
If asked about content not in the lecture, say
you can only discuss the lecture material.
Lecture content: {lecture_text[:8000]}"""}
This is a system prompt — the AI’s hidden instructions.
Tell the AI it has a lecture transcription to work from.
Define what it can do: explain concepts, answer questions...
...create summaries, give study advice.
The critical word: SOLELY. The AI must ONLY use lecture content — no making things up.
If the student asks something outside the lecture...
...the AI must refuse and say it can only discuss what was taught.
Paste the actual lecture text, but truncate to 8,000 characters to fit within the AI’s context window.
Without this word, the AI would happily answer any question — even about things the professor never mentioned. The SOLELY instruction grounds the AI in the lecture, preventing it from inventing facts. This is the difference between a reliable study tool and a hallucination machine.
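A helper that assembles this grounded payload might look like the sketch below. The function name is ours; the prompt text and the 8,000-character truncation come from the excerpt above.

```python
def build_study_messages(lecture_text: str, question: str) -> list:
    """Assemble the grounded chat payload (helper name is ours)."""
    system_content = (
        "You are a helpful study buddy. "
        "The student has provided a lecture transcription. "
        "Help them understand concepts, answer questions, "
        "create summaries, and provide study guidance "
        "based SOLELY on the lecture content. "
        "If asked about content not in the lecture, say "
        "you can only discuss the lecture material.\n"
        # Truncate so the prompt fits in the model's context window.
        f"Lecture content: {lecture_text[:8000]}")
    return [{"role": "system", "content": system_content},
            {"role": "user", "content": question}]
```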
Configuring the Voice Session
When a voice study session starts, the app sends this configuration to OpenAI over the WebSocket. Every setting shapes how the conversation will work.
session_update = {
    "type": "session.update",
    "session": {
        "voice": OPENAI_VOICE,
        "instructions": self.instructions,
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1",
            "language": "en",
        },
        "turn_detection": None,
    },
}
self._send(session_update)
Build a configuration message to send to OpenAI...
This is a “session update” command.
Configure the session with these settings:
Use the configured AI voice (a natural-sounding speaker).
Give the AI the full lecture context as its instructions — so it knows what was taught.
Enable both text AND audio responses simultaneously.
Send audio as raw 16-bit PCM16 (uncompressed, fast).
Receive audio back in the same raw format.
Also transcribe what the student says using Whisper...
...in English.
turn_detection set to None — this means push-to-talk. More reliable in noisy classrooms than automatic voice detection.
Send this configuration over the WebSocket connection.
The Resampling Trick
The microphone records at 16,000 samples/second but OpenAI’s Realtime API expects 24,000. This code mathematically “stretches” the audio to fill the gap.
src_positions = np.linspace(0.0, duration, num=data.size, endpoint=False)
dst_positions = np.linspace(0.0, duration, num=target_samples, endpoint=False)
resampled = np.interp(dst_positions, src_positions, data).astype(np.int16)
Create 16,000 evenly-spaced time markers for the original audio samples (where we HAVE data).
Create 24,000 evenly-spaced time markers for the target rate (where we NEED data).
Use interpolation to calculate what the audio would sound like at each new marker. Convert back to 16-bit integers for the API.
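Wrapped into a self-contained function (the name and signature are ours; the app's surrounding code is not shown), the same interpolation looks like this:

```python
import numpy as np

def resample_pcm16(data: np.ndarray, src_rate: int = 16000,
                   dst_rate: int = 24000) -> np.ndarray:
    """Linearly interpolate int16 PCM from src_rate to dst_rate."""
    duration = data.size / src_rate
    target_samples = int(round(duration * dst_rate))
    # Where we HAVE samples vs. where we NEED samples on the time axis.
    src_positions = np.linspace(0.0, duration, num=data.size, endpoint=False)
    dst_positions = np.linspace(0.0, duration, num=target_samples,
                                endpoint=False)
    return np.interp(dst_positions, src_positions, data).astype(np.int16)

chunk = np.zeros(1600, dtype=np.int16)  # 100 ms of silence at 16 kHz
out = resample_pcm16(chunk)             # 100 ms at 24 kHz
```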
The Realtime API streams audio chunks as they are generated — the AI starts speaking before it has finished thinking. This is like how a human interpreter starts translating mid-sentence rather than waiting for you to finish. It makes conversations feel natural instead of robotic.
Check Your Understanding
Why does Lecture Buddy use push-to-talk instead of automatic voice detection?
What does the system prompt’s “SOLELY” instruction prevent?
The mic records at 16 kHz but OpenAI expects 24 kHz. How is the gap resolved?
Voice Study Session
The flagship feature under a microscope — like watching a symphony conductor’s hands and understanding every gesture.
The Flagship Feature
The ConversationScreen class is ~877 lines of code — the single largest component in the entire app. It orchestrates push-to-talk recording, real-time audio streaming, live transcription display, visual context integration, and session saving.
When the screen opens, it loads the full transcript from the database. A background worker thread fetches or generates visual descriptions of keyframes.
The transcript + visual descriptions are combined into a single instructions string. This is what the AI “knows” about the lecture.
The RealtimeConversationClient connects to OpenAI, sends the session config, and waits for the student to speak.
Student holds the red button → audio captured → resampled 16 kHz → 24 kHz → sent via WebSocket → AI responds with streaming audio + text.
When the student leaves, the full conversation log (every question and answer) is saved to the database with a timestamp and duration.
Data Flow: One Question, One Answer
Follow the data as the student asks “What is the chain rule?” and hears the AI respond.
Giving the Voice AI “Eyes”
The Realtime API can only handle text and audio — no images. So before the session starts, a VisualDescriptionWorker runs in the background to convert keyframe images into text descriptions.
self.visual_worker = VisualDescriptionWorker(lecture_id, title)
self.visual_worker.finished.connect(
    self._on_visual_descriptions_ready)
self.visual_worker.start()

def _on_visual_descriptions_ready(self, descriptions):
    if descriptions:
        self.instructions += "\n\nVisual content:\n" + descriptions
        if self.realtime_client:
            self.realtime_client.update_instructions(
                self.instructions)
Create a background worker to generate visual descriptions (won’t freeze the UI).
When the worker finishes, call our handler method.
Start the background work.
When descriptions arrive from the background worker...
If we got valid descriptions...
...append them to the AI’s instructions (“here’s what the whiteboard showed”).
If the WebSocket is already connected...
...hot-update the AI’s instructions mid-session.
The AI now “sees” the whiteboard through text descriptions.
Notice how the visual descriptions can arrive after the WebSocket is already open. The code handles this gracefully — it appends the descriptions and sends a session update to the live connection. This means the student can start talking immediately while the visual context loads in the background.
Saving Study Sessions
Every voice conversation is saved so students can review what they discussed. The session includes the full conversation log, duration, and automatic naming.
def save_study_session(self, lecture_id, lecture_title,
                       duration_seconds, conversation_log):
    session_number = self.get_next_session_number(lecture_id)
    short_title = lecture_title[:30]
    session_title = f"{short_title} - Session {session_number}"
    c.execute(
        """INSERT INTO study_sessions
           (lecture_id, session_number, title,
            date_created, duration_seconds, conversation_log)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (lecture_id, session_number, session_title,
         datetime.datetime.now().isoformat(),
         duration_seconds, json.dumps(conversation_log)))
Save a study session for a given lecture...
Figure out what number this session is (Session 1, 2, 3...).
Truncate the lecture title to 30 characters so it fits on the small screen.
Auto-generate a name like “Calculus Lecture 3 - Session 2.”
Insert a new row into the study_sessions table...
...with the lecture ID, session number, auto-generated title,
current timestamp,
how long the session lasted, and the full conversation as JSON.
When the Connection Drops
WebSocket connections are fragile. WiFi glitches, server timeouts, or network switches can kill the connection mid-conversation. The app handles this gracefully with auto-reconnect.
The reconnection preserves the lecture context (transcript + visual descriptions) but loses the conversation history from the current session. This is a deliberate trade-off — rebuilding the full conversation context would be slow and expensive. The student can simply re-ask their last question.
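The reconnect logic itself is not shown, but the standard pattern is exponential backoff: retry, wait, double the wait, give up after a few attempts. A generic sketch:

```python
import time

def reconnect_with_backoff(connect, max_attempts: int = 5,
                           base_delay: float = 1.0):
    """Retry `connect` with exponential backoff (1 s, 2 s, 4 s, ...)."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))
```

On a successful reconnect the app would resend the session configuration (lecture transcript and visual descriptions included) but not the prior conversation turns, matching the trade-off described above.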
Check Your Understanding
What happens if the visual descriptions take 5 seconds to generate but the student starts talking after 2 seconds?
What does the app preserve when a WebSocket connection drops and reconnects?
How is the conversation log stored in the database?
The Smart Eye
Picking the few frames that matter from hours of video — like a photographer culling thousands of shots down to the best twenty.
Why Not Save Every Frame?
At 2 frames per second, a one-hour lecture produces 7,200 frames. Most are nearly identical. Sending all of them to an AI would be slow, expensive, and pointless.
frame_selector.py uses four strategies to find only the frames worth keeping, then removes duplicates and enforces a budget.
Content Change (SSIM)
Compare each frame to the previous one using SSIM. If they differ enough, something changed — keep the new frame.
Stability Detection
If the frame has been unchanged for 3+ seconds, the professor finished writing. Capture the completed work.
Audio Cue Matching
Scan the transcript for phrases like “as you can see” or “this equation shows.” The professor is pointing at something visual.
Perceptual Dedup
After all candidates are chosen, remove frames that look the same using perceptual hashing.
Strategy 1: Content Change Detection
The core comparison converts frames to grayscale and blurs them to suppress sensor noise, then uses SSIM to measure similarity. A score below the 0.85 threshold signals a meaningful change.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)
if self._prev_gray is not None:
    similarity = ssim(self._prev_gray, gray)
    if similarity < self.ssim_threshold:
        trigger = "content_change"
        score = 1.0 - similarity
Convert the color frame to grayscale — color is irrelevant for change detection.
Blur slightly to ignore tiny sensor noise.
If we have a previous frame to compare against...
Calculate how similar this frame is to the last one (0 = totally different, 1 = identical).
If similarity dropped below 0.85, something meaningful changed...
...mark this as a “content change” trigger.
Score = how MUCH it changed (more change = higher score = more important).
Strategy 3: Listening to the Professor
The code has 25+ regex patterns that match common phrases professors use when referencing visual content.
"as you can see here"
Pointing phrases — professor directing attention. Confidence: 0.9
"this equation shows"
Equation phrases — formula being presented. Confidence: 0.85
"which completes the proof"
Completion phrases — something just finished. Confidence: 0.9
"draw a line from..."
Diagram phrases — spatial references. Confidence: 0.7–0.85
def _check_audio_cues(self, timestamp, segments, window=3.0):
    for segment in segments:
        if not (segment.start <= timestamp + window
                and segment.end >= timestamp - window):
            continue
        text = segment.text.lower()
        for pattern, confidence in self._cue_patterns:
            if pattern.search(text):
                return VisualCue(timestamp, segment.text, ...)
Check if the professor said something visual near this moment (3-second window).
Look through each transcript segment...
If this segment doesn’t overlap our time window...
...skip it.
Convert to lowercase for case-insensitive matching.
Test every regex pattern against the text...
If a pattern matches, we found a visual cue!
Return the cue with its timestamp, text, and confidence score.
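A miniature version of the pattern table makes the mechanism concrete. These four entries are a hypothetical subset of the app's 25+ patterns, using the confidences quoted above:

```python
import re

# Hypothetical subset of the cue patterns: (compiled regex, confidence).
CUE_PATTERNS = [
    (re.compile(r"as you can see"), 0.9),      # pointing phrase
    (re.compile(r"this equation shows"), 0.85),  # equation phrase
    (re.compile(r"completes the proof"), 0.9),   # completion phrase
    (re.compile(r"draw a line"), 0.7),           # diagram phrase
]

def first_cue_confidence(text: str):
    """Return the confidence of the first matching cue, or None."""
    lowered = text.lower()  # case-insensitive matching
    for pattern, confidence in CUE_PATTERNS:
        if pattern.search(lowered):
            return confidence
    return None
```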
From Frames to Words
vision_context.py sends keyframes to GPT-4o-mini Vision, which describes what it sees. Those descriptions are cached in SQLite so you never pay for the same frames twice.
def get_or_generate_visual_descriptions(lecture_id, title):
    c.execute("SELECT visual_descriptions FROM lectures WHERE id = ?",
              (lecture_id,))
    row = c.fetchone()
    if row and row[0]:
        return row[0]  # Cached! Free!
    descriptions = generate_visual_descriptions(lecture_id, title)
    if descriptions:
        c.execute("UPDATE lectures SET visual_descriptions = ? WHERE id = ?",
                  (descriptions, lecture_id))
    return descriptions
Given a lecture, get visual descriptions — from cache if possible.
Ask the database: “Do we already have descriptions for this lecture?”
Get the result.
If cached descriptions exist...
...return them. No API call needed — free!
No cache hit. Call GPT-4o-mini Vision to describe the keyframes.
If we got descriptions back...
...save them to the database for next time.
Return descriptions (cached or freshly generated).
This is a universal pattern in software: check the cache first, compute only if needed, then cache the result. The first study session pays for the Vision API call. Every session after that gets the descriptions for free from SQLite. At ~85 tokens per image in “low” detail mode, 10 keyframes cost about $0.000128 — but only once.
Strategy 4: Removing Lookalikes
After all candidates are selected, perceptual deduplication removes frames that look essentially the same despite passing the other filters.
def deduplicate_frames(self, keyframes, hash_threshold=10):
    unique = [keyframes[0]]
    hashes = [self._compute_phash(keyframes[0].image)]
    for kf in keyframes[1:]:
        h = self._compute_phash(kf.image)
        is_duplicate = any(h - existing < hash_threshold
                           for existing in hashes)
        if not is_duplicate:
            unique.append(kf)
            hashes.append(h)
    return unique
Remove visually similar frames. Threshold of 10 means “hashes within 10 bits are duplicates.”
Start with the first frame — it’s always unique.
Compute its perceptual hash (digital fingerprint).
For each remaining candidate frame...
Compute its hash.
Check if this hash is too close to ANY hash we’ve already kept.
(Subtracting hashes gives the “distance” — how different the images look.)
If it’s NOT a duplicate...
...keep it and add its hash to our collection.
Return only the unique frames.
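The app uses perceptual hashing (pHash). As a simplified stand-in, here is an average hash, which captures the same idea: a compact visual fingerprint whose Hamming distance measures how different two images look.

```python
import numpy as np

def average_hash(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Compact fingerprint: downscale to size x size block averages,
    then threshold each cell at the overall mean."""
    h, w = img.shape
    cropped = img[:h - h % size, :w - w % size]  # make dims divisible
    hh, ww = cropped.shape
    blocks = cropped.reshape(size, hh // size,
                             size, ww // size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).astype(np.uint8)

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """How many fingerprint bits differ (0 = visually identical)."""
    return int(np.count_nonzero(h1 != h2))
```

With the dedup threshold of 10, two frames whose fingerprints differ by fewer than 10 bits would be treated as duplicates.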
Bonus: LaTeX Rendering via Matplotlib
When the AI includes math formulas in text chat responses, they need to look good. The app renders LaTeX expressions as PNG images using matplotlib — a plotting library repurposed as a math typesetter.
def _render_latex_to_image(latex_expr):
    fig, ax = plt.subplots(figsize=(0.01, 0.01))
    ax.axis('off')
    text = ax.text(0, 0, latex_expr, fontsize=10,
                   color='white')
    fig.canvas.draw()
    bbox = text.get_window_extent()
    dpi = fig.dpi
    fig.set_size_inches((bbox.width + 6) / dpi,
                        (bbox.height + 6) / dpi)
    buf = io.BytesIO()
    fig.savefig(buf, format='png', transparent=True)
    plt.close(fig)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f'<img src="data:image/png;base64,{b64}" />'
Turn a LaTeX math expression into an inline image.
Create a tiny matplotlib figure (a “canvas” for drawing).
Hide the axes — we only want the math, not a chart.
Place the LaTeX text on the canvas with white color
(for the dark theme).
Render the figure to calculate the exact size of the text.
Measure how big the text actually is.
Resize the figure to tightly fit the text
(with a tiny 6-pixel margin).
Create an in-memory buffer (no file on disk).
Save the figure as a transparent PNG into the buffer.
Convert the PNG bytes to a base64 string.
Return an HTML img tag with the image embedded inline. No file needed!
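The final two steps, base64 encoding and inline embedding, work independently of matplotlib. A minimal sketch (the `embed_png` name is hypothetical, and the byte string is a stand-in, not a real PNG):

```python
import base64

def embed_png(png_bytes):
    """Wrap raw PNG bytes in an HTML <img> tag via a base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode()
    return f'<img src="data:image/png;base64,{b64}" />'

# Any PNG bytes work; rich-text widgets can render the tag with no
# temporary file ever touching the disk.
tag = embed_png(b"\x89PNG\r\n\x1a\n")
# tag starts with '<img src="data:image/png;base64,'
```

Data URIs trade a ~33% size increase (base64 overhead) for zero file management, a good deal for small formula images.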
Check Your Understanding
Why capture a frame when the video has been stable for 3 seconds?
Why generate text descriptions of keyframes instead of sending images to the voice AI?
How does Lecture Buddy render mathematical formulas in the chat?
The Interface
Fitting a full cockpit dashboard into a wristwatch — every pixel counts on a 3.5-inch touchscreen.
480 × 320 Pixels
The Raspberry Pi’s touchscreen is smaller than most phones. Every design choice revolves around fat fingers on a tiny display.
Defined in config.py as MIN_TOUCH_TARGET (40 px). Apple's and Google's design guidelines recommend comparable minimum tappable sizes — 44 pt and 48 dp, respectively.
Dark backgrounds (#1E1E1E) with bright cyan (#00BCD4). High contrast for readability in dim lecture halls.
AUTO_FULLSCREEN = True — no title bar, no taskbar. Every pixel goes to the UI.
All screens live in a QStackedWidget. Switching is instant — no loading, no lag.
The Six Screens
Each screen is a separate QWidget class, created once at startup and stacked invisibly.
Instant Screen Switching
All six screens are created once at startup. Navigating is a single line of code.
self.stack = QStackedWidget()
self.setCentralWidget(self.stack)
self.main_menu = MainMenu(self)
self.recorder_screen = RecorderScreen(self)
self.library_screen = LibraryScreen(self)
self.chat_screen = ChatScreen(self)
self.conversation_screen = ConversationScreen(self)
self.device_test_screen = DeviceTestScreen(self)
def show_recorder(self):
self.stack.setCurrentWidget(self.recorder_screen)
Create a stack that holds all screens (only one visible at a time).
Make it the main content area of the window.
Create all six screens upfront:
Main menu — the carousel home screen.
Recording with timer, waveform, and mic controls.
Library for browsing saved lectures.
Text chat (type a question, read an answer).
Voice conversation (the flagship feature).
Device test (camera preview, mic recording test).
To navigate, just tell the stack which screen to show.
One line. Instant switch. Zero loading time.
The Xbox 360 Carousel
The home screen uses a custom carousel inspired by the Xbox 360’s “Aurora” dashboard. The center item is large; adjacent items shrink smoothly. This is done with linear interpolation — every visual property transitions fluidly.
def apply_continuous(self, abs_dist):
    d = max(0.0, min(abs_dist, 2.0))
    anchors = {
        0: (60, 11, 3, "#00bcd4", "#1e1e2e"),
        1: (44, 9, 2, "#444444", "#1a1a28"),
        2: (32, 8, 1, "#333333", "#161622"),
    }
    lo = int(d); hi = min(lo + 1, 2)
    frac = d - lo
    a0, a1 = anchors[lo], anchors[hi]
    icon_sz = int(a0[0] + frac * (a1[0] - a0[0]))
Given how far this card is from center (0=center, 1=adjacent, 2=outer)...
Clamp distance to 0–2.
Define three “anchor” states:
Center: big icon (60px), large font (11pt), thick cyan border, dark bg.
Adjacent: medium icon (44px), smaller font, thin gray border.
Outer: small icon (32px), smallest font, thinnest border, darkest bg.
Find the two nearest anchors and how far between them we are.
E.g., distance 1.3 = 30% between “adjacent” and “outer.”
Blend the icon size smoothly between the two anchor values.
result = start + fraction * (end - start). At fraction 0, you get start. At 1, you get end. Anywhere between, a smooth blend. This one formula powers the carousel’s icon sizes, font sizes, border widths, border colors, and background colors — all interpolated simultaneously for buttery-smooth swiping.
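The lerp formula can be demonstrated in plain Python, no Qt required. A small sketch (helper names `lerp` and `lerp_color` are illustrative, not from the app) showing both a size blend and the channel-by-channel color blend the paragraph mentions:

```python
def lerp(start, end, frac):
    """Linear interpolation: frac=0 gives start, frac=1 gives end."""
    return start + frac * (end - start)

def lerp_color(c0, c1, frac):
    """Blend two '#rrggbb' colors channel by channel."""
    ch0 = [int(c0[i:i + 2], 16) for i in (1, 3, 5)]
    ch1 = [int(c1[i:i + 2], 16) for i in (1, 3, 5)]
    blended = [round(lerp(a, b, frac)) for a, b in zip(ch0, ch1)]
    return "#{:02x}{:02x}{:02x}".format(*blended)

# Icon size 30% of the way from "adjacent" (44px) toward "outer" (32px):
size = lerp(44, 32, 0.3)
# Border color halfway between center cyan and adjacent gray:
mid = lerp_color("#00bcd4", "#444444", 0.5)   # "#22808c"
```

Because every property uses the same fraction, all of them arrive at their targets together, which is what makes the swipe feel like one continuous motion instead of several staggered ones.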
TouchButton: Fat-Finger Friendly
Every button inherits from TouchButton, enforcing 40px minimum height and adding a drop shadow for depth.
class TouchButton(QPushButton):
def __init__(self, text, parent=None):
super().__init__(text, parent)
self.setMinimumHeight(MIN_TOUCH_TARGET)
self.setStyleSheet("""
QPushButton { background: #00bcd4;
color: white; border-radius: 8px; }
QPushButton:pressed { background: #0097a7; }
""")
shadow = QGraphicsDropShadowEffect(self)
shadow.setBlurRadius(10)
shadow.setColor(QColor(0, 0, 0, 60))
shadow.setOffset(0, 2)
self.setGraphicsEffect(shadow)
Create a custom button that inherits everything from Qt’s standard button.
When created...
...do the default button setup.
Enforce 40px minimum height — fingers are big, buttons must be too.
Apply visual styling (CSS-like syntax in Qt):
Cyan background, white text, rounded corners.
When pressed, darken to deeper teal — instant visual feedback.
Add a subtle drop shadow below the button...
...10px blur for a soft look.
Black at 24% opacity — visible but not harsh.
Offset 2px down (light from above).
Apply the shadow. Every TouchButton now “floats.”
Three rules: (1) 40px minimum touch targets. (2) Instant visual feedback — the :pressed state darkens the button so users know their tap registered. (3) Generous spacing — every layout.setSpacing() and setContentsMargins() call enforces breathing room to prevent accidental taps.