Step 1: Project scaffold and OBS WebSocket client connection

What we're doing this step

Before any OCR, scene switching, or screen capture can land, the project needs two things in place: a Python package layout that the rest of the series can grow into, and a working OBS WebSocket v5 handshake so future steps have something to talk to OBS Studio with. This step keeps the surface area deliberately small. We add only the connect / Hello / Identify / Identified portion of the OBS protocol, and we drive it through an injected Transport so the tests never touch a real socket. Higher-level RPCs — SetCurrentProgramScene, GetSourceScreenshot, and friends — are postponed until the OCR pipeline in later steps actually needs them. The goal of step 1 is a clean foundation that compiles, tests, and pushes to the companion vytharion repo without dragging in a framework we don't yet need.

Setup

The package is a plain pyproject.toml project laid out as src/obs_scene_switcher/, with tests under tests/. Only one runtime dependency is pulled in (websockets), and pytest is added as a dev extra. The full pyproject.toml:

[project]
name = "obs-scene-switcher"
version = "0.1.0"
description = "OBS scene switcher driven by OCR of the loading screen."
requires-python = ">=3.9"
dependencies = [
    "websockets>=12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/obs_scene_switcher"]

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-ra -q"

The src/ layout is intentional: it forces tests to import the installed package rather than pick up a stray sibling directory, which would mask packaging mistakes that only show up after publish. Quiet pytest output (-q) keeps the article's test transcripts short and unambiguous.

The package's __init__.py re-exports exactly the three symbols downstream callers should reach for:

from .obs_client import (
    OBSClient,
    OBSIdentified,
    OBSProtocolError,
)

__all__ = ["OBSClient", "OBSIdentified", "OBSProtocolError"]

Everything else inside obs_client.py — the opcode constants, the helper functions, the dataclass — is treated as a private implementation detail and can change between steps without breaking importers.

Implementation

The OBS WebSocket v5 handshake is small: the server opens with a Hello (opcode 0) that optionally carries an authentication challenge, the client replies with Identify (opcode 1), and the server confirms with Identified (opcode 2). The whole conversation lives in three opcode constants and one method.

The transport is abstracted behind a Protocol. That choice is what makes the tests cheap — no need to stand up a real WebSocket server, no asyncio loop, no port binding:

class Transport(Protocol):
    def send(self, payload: str) -> None: ...
    def recv(self) -> str: ...

The actual handshake is a straight read-write-read driven by connect:

def connect(self) -> OBSIdentified:
    hello = self._recv_message()
    _expect_op(hello, HELLO_OP)
    identify_payload = self._build_identify_payload(hello.get("d", {}))
    self._send_message({"op": IDENTIFY_OP, "d": identify_payload})
    identified = self._recv_message()
    _expect_op(identified, IDENTIFIED_OP)
    return OBSIdentified(
        rpc_version=self._rpc_version,
        negotiated_rpc_version=int(
            identified.get("d", {}).get(
                "negotiatedRpcVersion", self._rpc_version
            )
        ),
    )

Three design choices are worth pointing out:

Opcode checking is its own helper. _expect_op raises OBSProtocolError on a mismatch. Inlining the check inside connect would push the function past the two-level nesting limit the codebase rules enforce, so it lives next to the constants.
Auth digest computation is isolated. When the Hello payload includes an authentication block, the client computes the SHA-256 / base64 chain that OBS expects. Pulling that into _compute_auth_response keeps connect flat and makes the formula auditable against the public OBS WebSocket v5 spec.
Missing password is an error, not a silent skip. If OBS asks for auth and the caller didn't supply a password, we raise OBSProtocolError immediately and no Identify is sent. That avoids leaking a partial handshake to the server and gives the caller a clear signal.

The auth helper itself mirrors the protocol exactly:

def _compute_auth_response(*, password: str, salt: str, challenge: str) -> str:
    secret = base64.b64encode(
        hashlib.sha256((password + salt).encode("utf-8")).digest()
    ).decode("ascii")
    return base64.b64encode(
        hashlib.sha256((secret + challenge).encode("utf-8")).digest()
    ).decode("ascii")
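
To see the helper in context, here is a small sketch that feeds it a made-up Hello payload and builds the Identify body from the result. The salt, challenge, and password values are invented for illustration, and the function is repeated without its leading underscore so the block runs on its own.

```python
import base64
import hashlib
from typing import Any, Dict


def compute_auth_response(*, password: str, salt: str, challenge: str) -> str:
    # Same two-stage SHA-256 / base64 chain as the helper above.
    secret = base64.b64encode(
        hashlib.sha256((password + salt).encode("utf-8")).digest()
    ).decode("ascii")
    return base64.b64encode(
        hashlib.sha256((secret + challenge).encode("utf-8")).digest()
    ).decode("ascii")


# Made-up Hello payload; a real server generates salt and challenge itself.
hello_data: Dict[str, Any] = {
    "rpcVersion": 1,
    "authentication": {"salt": "example-salt", "challenge": "example-challenge"},
}

auth = hello_data["authentication"]
identify_payload = {
    "rpcVersion": 1,
    "authentication": compute_auth_response(
        password="hunter2", salt=auth["salt"], challenge=auth["challenge"]
    ),
}

# A SHA-256 digest is 32 bytes, so its base64 form is always 44 characters.
print(len(identify_payload["authentication"]))  # 44
```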

Tests cover six scenarios: an unauthenticated handshake, an authenticated one (asserting the digest matches the spec formula), a missing-password rejection, a wrong-first-opcode error, a wrong-second-opcode error, and a non-object payload. The fake transport simply pops scripted JSON strings off a list, which makes failure modes trivial to script.
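
A scripted transport of that kind could be sketched as follows. ScriptedTransport and the simplified handshake function are hypothetical stand-ins written for this example; they are not the repository's test double or OBSClient.connect, and they cover only the unauthenticated path.

```python
import json
from typing import Any, Dict, List


class ScriptedTransport:
    """Hypothetical test double: replays canned frames and records sends."""

    def __init__(self, incoming: List[str]) -> None:
        self._incoming = list(incoming)
        self.sent: List[str] = []

    def send(self, payload: str) -> None:
        self.sent.append(payload)

    def recv(self) -> str:
        # Pop the next scripted message in order.
        return self._incoming.pop(0)


def handshake(transport: ScriptedTransport, rpc_version: int = 1) -> int:
    # Simplified, unauthenticated stand-in for the connect flow.
    hello: Dict[str, Any] = json.loads(transport.recv())
    if hello.get("op") != 0:
        raise RuntimeError("expected Hello (opcode 0)")
    transport.send(json.dumps({"op": 1, "d": {"rpcVersion": rpc_version}}))
    identified: Dict[str, Any] = json.loads(transport.recv())
    if identified.get("op") != 2:
        raise RuntimeError("expected Identified (opcode 2)")
    return int(identified.get("d", {}).get("negotiatedRpcVersion", rpc_version))


transport = ScriptedTransport([
    json.dumps({"op": 0, "d": {"rpcVersion": 1}}),
    json.dumps({"op": 2, "d": {"negotiatedRpcVersion": 1}}),
])
negotiated = handshake(transport)
print(negotiated, json.loads(transport.sent[0])["op"])  # 1 1
```

Scripting a failure is then a matter of changing one canned message, for example replacing the first frame's opcode with something other than 0.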

Test it

pytest

......                                                                   [100%]
6 passed in 0.01s

What we got

A package skeleton (pyproject.toml, src/obs_scene_switcher/, tests/) and an OBSClient that can negotiate the OBS WebSocket v5 handshake — with or without password authentication — and surface a typed OBSIdentified result. Six tests pin down the happy path, the auth digest, and four error branches without booting OBS or opening a socket. The next step builds on this client by sending the first real RPC and wiring in a concrete WebSocket transport.

Repository

The companion code for this article: https://github.com/vytharion/obs-scene-switcher-ocr-loading-screen

The state of the code after this step: a5ba6bd

Key commits to step through:

[a5ba6bd] — step 1: Project scaffold and OBS WebSocket client connection

Step 2: Screen region capture pipeline

What we're doing this step

Before OCR can answer the question "is the loading screen up right now?", something has to feed it pixels. Step 2 builds that feeder: a small capture module that takes a fixed rectangular region of the screen and yields image buffers on a steady cadence. The OBS client from step 1 still has nothing to do, because switching scenes only matters once we know what the screen looks like, so the focus here is purely the pixel-source half of the pipeline. We introduce a Region value object so the bounding box has a single, validated home; a Frame value object so downstream consumers (the OCR engine, in the next step) never have to guess pixel layout; a CaptureBackend Protocol so the actual screen-grab API stays swappable; and a RegionCapturer that exposes both a one-shot snapshot() and a long-running stream() driven by injected clock and sleep callables. Keeping time injectable is what makes the tests cheap: the new specs run in hundredths of a second instead of burning real wall-clock time waiting for cadences to tick.

Setup

Two new modules land under src/obs_scene_switcher/:

region.py — the Region dataclass (x, y, width, height) plus derived right, bottom, and as_box() helpers.
capture.py — Frame, the CaptureBackend Protocol, and RegionCapturer.

Two new test modules land under tests/:

test_region.py: seven specs covering edges, the Pillow-style box tuple, the origin corner case, and the four validation error paths.
test_capture.py: seventeen specs (ten for Frame and capture basics, seven for cadence) covering Frame validation, snapshot, the stream cadence under fast, slow, and very-slow backends, the max_frames=0 no-op, the non-positive interval guard, and the structural-typing path that makes CaptureBackend a Protocol rather than an ABC.

The package's __init__.py grows two new import lines so callers can reach Region, Frame, CaptureBackend, and RegionCapturer directly:

from .capture import CaptureBackend, Frame, RegionCapturer
from .obs_client import (
    OBSClient,
    OBSIdentified,
    OBSProtocolError,
)
from .region import Region

__all__ = [
    "CaptureBackend",
    "Frame",
    "OBSClient",
    "OBSIdentified",
    "OBSProtocolError",
    "Region",
    "RegionCapturer",
]

No new runtime dependency is added in this step. The real production backend will wrap mss (a thin, fast screen-grab library), but we postpone pulling it in until step 3 — when the OCR engine actually needs a real frame — so that step 2's tests stay backend-free.

Implementation

The Region is deliberately a tiny frozen dataclass. Four integers, two derived properties, one tuple converter, and a __post_init__ that pushes each field through a small named guard:

@dataclass(frozen=True)
class Region:
    x: int
    y: int
    width: int
    height: int

    def __post_init__(self) -> None:
        _require_non_negative("x", self.x)
        _require_non_negative("y", self.y)
        _require_positive("width", self.width)
        _require_positive("height", self.height)

    @property
    def right(self) -> int:
        return self.x + self.width

    @property
    def bottom(self) -> int:
        return self.y + self.height

    def as_box(self) -> Tuple[int, int, int, int]:
        return (self.x, self.y, self.right, self.bottom)

The split between _require_non_negative (for x, y) and _require_positive (for width, height) reflects what a valid capture region looks like: the origin corner (0, 0) is a perfectly legal place to start grabbing, but a zero-width or zero-height region is a degenerate request that should fail loudly rather than silently produce an empty buffer downstream. The as_box() shape, (left, top, right, bottom), matches the box tuple that Pillow's Image.crop takes and that mss.grab accepts, so the boundary between this codebase and either of those libraries is one well-named method call.
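
The two guards are not shown in the article; a minimal sketch of how they might look follows. The bodies, the messages, and the choice of ValueError are assumptions, and Region is repeated so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Tuple


def _require_non_negative(name: str, value: int) -> None:
    # Assumed body and exception type: zero is allowed.
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")


def _require_positive(name: str, value: int) -> None:
    # Assumed body and exception type: zero is rejected.
    if value <= 0:
        raise ValueError(f"{name} must be > 0, got {value}")


@dataclass(frozen=True)
class Region:
    x: int
    y: int
    width: int
    height: int

    def __post_init__(self) -> None:
        _require_non_negative("x", self.x)
        _require_non_negative("y", self.y)
        _require_positive("width", self.width)
        _require_positive("height", self.height)

    def as_box(self) -> Tuple[int, int, int, int]:
        # (left, top, right, bottom)
        return (self.x, self.y, self.x + self.width, self.y + self.height)


print(Region(0, 0, 640, 48).as_box())      # (0, 0, 640, 48)
print(Region(100, 200, 640, 48).as_box())  # (100, 200, 740, 248)
try:
    Region(0, 0, 0, 48)
except ValueError as exc:
    print(exc)  # width must be > 0, got 0
```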

The Frame carries the captured pixels along with just enough metadata for the OCR stage to interpret them — width, height, a mode string borrowed from Pillow's vocabulary ("RGB", "RGBA", "L"), and a captured_at timestamp:

@dataclass(frozen=True)
class Frame:
    pixels: bytes
    width: int
    height: int
    mode: str
    captured_at: float

    def __post_init__(self) -> None:
        expected = self.width * self.height * _bytes_per_pixel(self.mode)
        if len(self.pixels) != expected:
            raise ValueError(
                f"pixel buffer length {len(self.pixels)} does not match "
                f"{self.width}x{self.height} {self.mode!r} (expected {expected})"
            )

The size-check invariant is what catches the most expensive class of capture bug early: a backend that returns the wrong stride, or a region that was sliced before its dimensions were adjusted. A frame whose declared dimensions don't match its actual byte length must never reach a Pillow frombytes call downstream, where the mismatch would surface far from its cause or, worse, as plausible-looking garbage. Validating at construction time keeps the error close to the cause.

The capture transport is a Protocol — structural, not nominal — so any object with a matching grab(region) -> Frame method qualifies:

class CaptureBackend(Protocol):
    def grab(self, region: Region) -> "Frame": ...

RegionCapturer itself is the small piece that ties these together. Construction takes the backend, the region to grab, and two injected callables that default to the real time module:

class RegionCapturer:
    def __init__(
        self,
        backend: CaptureBackend,
        region: Region,
        *,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._backend = backend
        self._region = region
        self._clock = clock
        self._sleep = sleep

    def snapshot(self) -> Frame:
        return self._backend.grab(self._region)

The interesting work happens in stream. Three design choices shape it:

Schedule against the original deadline, not against grab completion. A 300 ms grab inside a 500 ms cadence should not push every subsequent frame back by 300 ms. The next deadline is computed as next_due += interval_seconds — independent of when this frame actually landed — so a single slow grab causes a single short sleep, not a permanent drift.
When already behind, don't sleep at all. If remaining is non-positive, _wait_until simply returns. This keeps the loop honest under load: it never inserts a phantom delay when there's nothing to wait for, and because every deadline advances by exactly one interval, a loop that has fallen behind grabs back-to-back only until the schedule catches up.
Emit the first frame immediately. next_due = self._clock() at loop entry means the first iteration's wait is zero. Callers who want a settled-state cadence can simply throw the first frame away; callers who want an instant first read get it for free.

def stream(
    self,
    *,
    interval_seconds: float,
    max_frames: Optional[int] = None,
) -> Iterator[Frame]:
    _require_positive_interval(interval_seconds)
    emitted = 0
    next_due = self._clock()
    while _should_emit(emitted, max_frames):
        self._wait_until(next_due)
        yield self.snapshot()
        emitted += 1
        next_due += interval_seconds

def _wait_until(self, deadline: float) -> None:
    remaining = deadline - self._clock()
    if remaining > 0:
        self._sleep(remaining)

_should_emit and _require_positive_interval are pulled out as named helpers rather than inlined so that stream itself stays at a single level of branching, in line with the codebase's two-level nesting cap.

The tests use a FakeClock whose time() and sleep() advance shared state, plus a FakeBackend that records every region it was asked to grab. The cadence-under-load specs script a backend that mutates the clock during grab itself, which proves the deadline arithmetic holds without needing real timing.
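
The idea can be sketched in a few lines. The FakeClock below and the stream_times function are simplified stand-ins written for this example, not the repository's test helpers; stream_times repeats the deadline arithmetic of stream but yields grab start times instead of frames.

```python
from typing import Callable, Iterator, List


class FakeClock:
    """Hypothetical test clock: sleep() advances time instead of waiting."""

    def __init__(self) -> None:
        self.now = 0.0
        self.sleeps: List[float] = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.sleeps.append(seconds)
        self.now += seconds


def stream_times(
    clock: FakeClock,
    grab_cost: Callable[[int], float],
    interval: float,
    frames: int,
) -> Iterator[float]:
    next_due = clock.time()
    for index in range(frames):
        remaining = next_due - clock.time()
        if remaining > 0:
            clock.sleep(remaining)
        started = clock.time()
        clock.now += grab_cost(index)  # the "backend" burns simulated time
        yield started
        next_due += interval  # schedule against the original deadline


clock = FakeClock()
# Frame 1 takes 0.75 s, every other grab is instant; cadence is 0.5 s.
starts = list(stream_times(clock, lambda i: 0.75 if i == 1 else 0.0, 0.5, 4))
print(starts)        # [0.0, 0.5, 1.25, 1.5]
print(clock.sleeps)  # [0.5, 0.25]
```

The slow grab delays frame 2 to 1.25 s, but frame 3 lands back on the original 0.5 s grid at 1.5 s, so the delay does not accumulate.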

Test it

pytest

..............................                                           [100%]
30 passed in 0.02s

Six of those thirty came from step 1 (the OBS client handshake suite). The remaining twenty-four — seven Region specs, ten Frame / capture specs, and seven cadence specs — are what this step adds.

What we got

A two-module capture pipeline (region.py + capture.py) and twenty-four new tests that pin down every behaviour worth pinning: region validation, frame buffer-size invariants, Pillow-compatible box tuples, snapshot-forwards-region, structural backend typing, the immediate first frame, the slow-grab cadence hold, the very-slow-grab no-sleep branch, the max_frames=0 no-op, and the rejection of non-positive intervals. None of it touches a real screen or a real clock. The next step plugs an mss-based backend into CaptureBackend, wraps the resulting Frame bytes in a Pillow Image, and hands that image to an OCR engine to extract whatever text the loading screen happens to be showing.

Repository

The companion code for this article: https://github.com/vytharion/obs-scene-switcher-ocr-loading-screen

The state of the code after this step: 810a0af

Key commits to step through:

[a5ba6bd] — step 1: Project scaffold and OBS WebSocket client connection
[810a0af] — step 2: Screen region capture pipeline

Step 4: Loading-screen detection heuristic with keyword matching and debounce

What we're doing this step

The OCR engine from step 3 emits one OCRResult per captured frame: a text string and a mean confidence score. That stream is honest about everything it sees — including a frame where Tesseract grabbed half of an animated logo, a frame where the screen was mid-transition, or a frame where the confidence dropped because the GPU stuttered. If we drove the scene switch directly off "did this frame's text contain the word loading?" we would get a strobe: scene flips on, scene flips off, scene flips on again, all within a second. Step 4 fixes that by introducing a small stateful detector that sits between OCR and the scene-switch dispatcher (which is the subject of the next step). The detector normalizes the OCR text so spelling and spacing noise stops mattering, drops low-confidence frames so a phantom keyword in a blurry frame can't extend the loading state, and applies a two-sided debounce — N consecutive matches to flip on, M consecutive misses to flip off — so a single bad frame on either side of the boundary is absorbed silently. The output of each observe() call carries both the raw per-frame verdict and the debounced state, plus a one-shot state_changed flag that the dispatcher will use to fire its switch exactly once per loading episode rather than every frame the loading screen is still up.

Setup

One new module lands under src/obs_scene_switcher/:

detector.py — DetectorObservation (the per-frame result record), LoadingScreenDetector (the stateful debouncer), plus the small private helpers _normalize_text, _normalize_keywords, _first_keyword_in, and _passes_confidence_floor.

One new test module lands under tests/:

test_detector.py — twenty specs covering the off-state default, the enter-debounce threshold, the streak reset on a mid-streak miss, the exit-debounce threshold, the NO_SIGNAL_CONFIDENCE sentinel as a miss, the low-confidence-counts-as-miss rule, case- and punctuation-insensitivity, multi-word phrase containment, the keyword-priority order, reset(), normalized-and-deduped keywords property, four validation error branches, the timestamp pass-through, and the two "state already settled, don't re-flip" specs.

The package's __init__.py grows two more re-exports so callers reach the detector without touching its private module path:

from .capture import CaptureBackend, Frame, RegionCapturer
from .detector import DetectorObservation, LoadingScreenDetector
from .obs_client import (
    OBSClient,
    OBSIdentified,
    OBSProtocolError,
)
from .ocr import (
    NO_SIGNAL_CONFIDENCE,
    OCREngine,
    OCRResult,
    TesseractOCR,
)
from .region import Region

__all__ = [
    "CaptureBackend",
    "DetectorObservation",
    "Frame",
    "LoadingScreenDetector",
    "NO_SIGNAL_CONFIDENCE",
    "OBSClient",
    "OBSIdentified",
    "OBSProtocolError",
    "OCREngine",
    "OCRResult",
    "Region",
    "RegionCapturer",
    "TesseractOCR",
]

No new runtime dependency is added. The detector is pure logic over OCRResult values — no I/O, no clock, no Tesseract — which is exactly what keeps its twenty specs running in single-digit milliseconds.

Implementation

The shape of the public output is fixed by a frozen dataclass. Callers should never have to inspect detector internals to know what a single frame meant; everything they need rides on one record:

@dataclass(frozen=True)
class DetectorObservation:
    is_loading: bool
    raw_match: bool
    matched_keyword: Optional[str]
    state_changed: bool
    frame_captured_at: float

Three of the five fields each exist for a specific downstream reason. raw_match is the per-frame verdict before debounce, useful for diagnostics ("the keyword was visible, we just haven't crossed the threshold yet"). state_changed is the edge-triggered flag, true on exactly the frame that flips the debounced state, so the dispatcher in step 5 can write if obs.state_changed: switch_scene(...) without having to remember the previous frame's value itself. And frame_captured_at is plumbed straight through from OCRResult.frame_captured_at so latency budgets can be computed against the actual frame the decision was made on, not the wall-clock moment the dispatcher reads the result.

Construction takes the keyword list and the three knobs that shape behaviour:

def __init__(
    self,
    *,
    keywords: Sequence[str],
    enter_frames: int = 2,
    exit_frames: int = 2,
    min_confidence: float = 50.0,
) -> None:
    normalized = _normalize_keywords(keywords)
    _require_keywords(normalized)
    _require_positive("enter_frames", enter_frames)
    _require_positive("exit_frames", exit_frames)
    self._keywords = normalized
    self._enter_frames = enter_frames
    self._exit_frames = exit_frames
    self._min_confidence = min_confidence
    self._is_loading = False
    self._match_streak = 0
    self._miss_streak = 0

The defaults (enter_frames=2, exit_frames=2, min_confidence=50.0) are chosen for a streamer-style capture loop running at roughly two to four frames per second. Two consecutive matches at that rate means "this has been on screen for roughly a quarter to half a second", which is longer than the transition frames Tesseract is likely to misread. Callers who run faster cadences can raise the thresholds; callers running slower can drop them to one and effectively turn debounce off.
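
How long a keyword has to stay on screen before the detector flips depends on the capture cadence. As a quick back-of-the-envelope sketch (min_visible_seconds is a name made up for this example), N consecutive matching frames span N - 1 capture intervals:

```python
def min_visible_seconds(enter_frames: int, frames_per_second: float) -> float:
    # N consecutive matching frames span (N - 1) capture intervals.
    return (enter_frames - 1) / frames_per_second


for fps in (2.0, 4.0):
    print(fps, min_visible_seconds(2, fps))
# 2.0 0.5
# 4.0 0.25
```

Setting enter_frames=1 makes the span zero, which is the "debounce off" case.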

Keyword normalization happens once, at construction. Two things matter about the routine: it lowercases each keyword and strips punctuation through a single translation table, and it deduplicates while preserving order so the "first keyword wins" behaviour stays deterministic:

def _normalize_keywords(keywords: Iterable[str]) -> Tuple[str, ...]:
    normalized: List[str] = []
    for keyword in keywords:
        cleaned = _normalize_text(keyword)
        if cleaned:
            normalized.append(cleaned)
    seen: set = set()
    unique: List[str] = []
    for keyword in normalized:
        if keyword not in seen:
            seen.add(keyword)
            unique.append(keyword)
    return tuple(unique)


def _normalize_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    lowered = text.lower().translate(_PUNCTUATION_TABLE)
    return " ".join(lowered.split())

The same _normalize_text is reused on the OCR text every frame. That symmetry is what makes the detector tolerant of Tesseract's case, punctuation, and spacing noise: a Tesseract output of "NOW LOADING..." becomes "now loading", matches the configured "now loading", and reports matched_keyword="now loading" without anyone having to hand-write a regex.

observe() is intentionally short — three sub-steps with one method call each:

def observe(self, result: OCRResult) -> DetectorObservation:
    matched = self._frame_match(result)
    self._update_streaks(matched.raw_match)
    state_changed = self._maybe_flip()
    return DetectorObservation(
        is_loading=self._is_loading,
        raw_match=matched.raw_match,
        matched_keyword=matched.keyword,
        state_changed=state_changed,
        frame_captured_at=result.frame_captured_at,
    )

The per-frame match step short-circuits on confidence before it bothers normalizing the text. That ordering is deliberate: a frame with confidence below the floor — or with the NO_SIGNAL_CONFIDENCE sentinel from step 3, meaning Tesseract returned zero recognized words — is treated as a non-match no matter what stray ASCII the engine reported. That's the rule the test test_low_confidence_frame_counts_as_miss_even_with_keyword pins down: even if the text literally contains the keyword, a 30% confidence frame is not allowed to drive state.

def _frame_match(self, result: OCRResult) -> "_FrameMatch":
    if not _passes_confidence_floor(result, self._min_confidence):
        return _FrameMatch(raw_match=False, keyword=None)
    keyword = _first_keyword_in(result.text, self._keywords)
    if keyword is None:
        return _FrameMatch(raw_match=False, keyword=None)
    return _FrameMatch(raw_match=True, keyword=keyword)


def _passes_confidence_floor(result: OCRResult, floor: float) -> bool:
    if result.mean_confidence == NO_SIGNAL_CONFIDENCE:
        return False
    return result.mean_confidence >= floor
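
A short sketch of the floor in action follows. OCRResult comes from step 3, which is not reproduced here, so the dataclass below only mirrors the three fields this article uses, and the sentinel value of -1.0 is an assumption for illustration.

```python
from dataclasses import dataclass

# Assumed sentinel value; the real constant lives in the step 3 OCR module.
NO_SIGNAL_CONFIDENCE = -1.0


@dataclass(frozen=True)
class OCRResult:
    # Only the fields this article relies on; the real class may carry more.
    text: str
    mean_confidence: float
    frame_captured_at: float


def passes_confidence_floor(result: OCRResult, floor: float) -> bool:
    if result.mean_confidence == NO_SIGNAL_CONFIDENCE:
        return False
    return result.mean_confidence >= floor


frames = [
    OCRResult("now loading", 91.0, 0.0),            # clean read
    OCRResult("now loading", 30.0, 0.5),            # keyword present, blurry
    OCRResult("", NO_SIGNAL_CONFIDENCE, 1.0),       # no recognized words
]
print([passes_confidence_floor(frame, 50.0) for frame in frames])
# [True, False, False]

# The explicit sentinel check matters when the floor is set very low.
print(passes_confidence_floor(frames[2], -5.0))  # False
```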

Streak bookkeeping is one tiny helper. A match increments the match streak and clears the miss streak; a miss does the inverse. The two streaks can never be non-zero at the same time, which is the invariant the flip routine relies on:

def _update_streaks(self, raw_match: bool) -> None:
    if raw_match:
        self._match_streak += 1
        self._miss_streak = 0
        return
    self._miss_streak += 1
    self._match_streak = 0

The flip routine is the heart of the debounce. Two conditions, both early-returning, with an explicit "neither fired" tail:

def _maybe_flip(self) -> bool:
    if not self._is_loading and self._match_streak >= self._enter_frames:
        self._is_loading = True
        return True
    if self._is_loading and self._miss_streak >= self._exit_frames:
        self._is_loading = False
        return True
    return False

Two consequences of this shape are worth calling out. First, state_changed is true on exactly one frame per transition — the one that crossed the threshold — which is what the dispatcher in step 5 needs to fire a single scene switch per loading episode rather than per frame. Second, the routine never combines an enter check with an exit check in the same call; the streak-update invariant guarantees that at most one of _match_streak and _miss_streak is non-zero, so the two if arms cannot both fire on a single observation.
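
To make the edge-triggered behaviour concrete, here is a stripped-down sketch that keeps only the streak and flip logic and drives it with a hand-written sequence of raw matches. The Debouncer class and the scene names are hypothetical; the real detector works on OCRResult values and returns DetectorObservation records.

```python
from typing import List, Tuple


class Debouncer:
    """Stand-in for the detector's streak bookkeeping and flip routine."""

    def __init__(self, enter_frames: int = 2, exit_frames: int = 2) -> None:
        self._enter_frames = enter_frames
        self._exit_frames = exit_frames
        self._is_loading = False
        self._match_streak = 0
        self._miss_streak = 0

    def observe(self, raw_match: bool) -> Tuple[bool, bool]:
        # Returns (is_loading, state_changed) for one frame.
        if raw_match:
            self._match_streak += 1
            self._miss_streak = 0
        else:
            self._miss_streak += 1
            self._match_streak = 0
        if not self._is_loading and self._match_streak >= self._enter_frames:
            self._is_loading = True
            return True, True
        if self._is_loading and self._miss_streak >= self._exit_frames:
            self._is_loading = False
            return False, True
        return self._is_loading, False


debouncer = Debouncer()
switches: List[str] = []
# Frame 4 is a single dropout and frame 5 a recovery; neither flips state.
for raw in [True, True, True, False, True, False, False]:
    is_loading, changed = debouncer.observe(raw)
    if changed:
        # Hypothetical scene names; a real dispatcher would call OBS here.
        switches.append("Loading" if is_loading else "Gameplay")
print(switches)  # ['Loading', 'Gameplay']
```

Seven frames produce exactly two switches: one when the second consecutive match arrives, and one when the second consecutive miss arrives.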

The detector also exposes a reset() that clears both streaks and forces state back to off. It's used by tests, and it's the hook the dispatcher will reach for after a scene switch fails and needs to be retried from a known state.

The whole module is just under 190 lines including docstrings, with no function exceeding two levels of if nesting — the codebase's hard cap. Anything that would have crossed the line (the keyword normalization loop, the confidence-floor check, the first-keyword-in-text scan) is pulled out as its own small named helper.

Test it

uv run pytest

...............................................................          [100%]
63 passed in 0.02s

The first forty-three tests are inherited from steps 1, 2, and 3 (OBS handshake, region capture, Tesseract OCR). The remaining twenty are what this step adds: the off-state default, the two enter-debounce threshold specs (one default, one with enter_frames=3), the mid-streak reset, the exit debounce, the NO_SIGNAL_CONFIDENCE-as-miss path, the low-confidence-with-keyword reject, the case- and punctuation-tolerance spec, the multi-word phrase containment spec, the first-keyword-wins priority, reset(), the normalize-and-dedupe keywords property, four validation error branches (empty list, whitespace-only list, enter_frames=0, exit_frames=0), the timestamp pass-through, the "already loading, keep loading, no state_changed" check, and the "match streak past the threshold only flips once" check.

What we got

A pure-logic LoadingScreenDetector plus a small frozen DetectorObservation record, twenty new specs, and zero new runtime dependencies. The module turns the noisy per-frame OCR stream into a stable boolean with normalized keyword matching, a confidence floor that rejects garbage frames, and a symmetric two-sided debounce that absorbs single-frame dropouts on either side of the transition. The next step takes the state_changed edge from DetectorObservation, wires it through a small dispatcher to the OBSClient from step 1, and finally closes the loop: when the loading screen appears on the captured region, OBS switches to a configured scene; when the loading screen leaves, OBS switches back.

Repository

The companion code for this article: https://github.com/vytharion/obs-scene-switcher-ocr-loading-screen

The state of the code after this step: 2f37daa

Key commits to step through:

[a5ba6bd] — step 1: Project scaffold and OBS WebSocket client connection
[810a0af] — step 2: Screen region capture pipeline
[abc7077] — step 3: Tesseract OCR turns captured frames into text + confidence
[2f37daa] — step 4: Loading-screen detection heuristic with keyword matching and debounce