jx.
jx10 min read

Build a Publisher-Safe MMO OCR Translation Overlay in Python with mss and Tesseract

Read foreign-language MMO UI text without touching game memory or input. Capture frames with mss, recognize glyphs with Windows OCR or Tesseract, render translations through a transparent Tkinter overlay.

Build a Publisher-Safe MMO OCR Translation Overlay in Python with mss and Tesseract

Foreign-language MMOs are a common pain point for solo players. Quest text, NPC dialogue, gear tooltips, and faction names all live behind a glyph wall the client never exposes through an API. The community workaround is a screen-reading overlay: capture pixels from the game window, run OCR, translate, and paint a transparent label back on top of the original UI region.

The architecture matters as much as the tooling. A memory-reading translator that hooks into the game process can read text in its raw form and ships pristine results, but it pokes at the client in ways most publishers explicitly forbid. A pixel-based OCR overlay never opens the process handle, never injects keys, never modifies a single byte of the game's address space. It does exactly what a streaming viewer's eyes do, just faster. That separation is what keeps this approach inside the boundaries publishers allow.

This article walks through the three load-bearing pieces of an OCR overlay: mss for screen capture, an OCR engine (Windows OCR API vs Tesseract, with numbers for both), and a transparent tkinter window for rendering. Translation gets a section too, since the choice between cloud APIs and offline neural models drives the latency budget of the whole loop.

What stays off-limits

Before any code, the line is worth stating sharply.

Permitted by the read-only overlay model:

  • mss.grab(region) calls against the desktop framebuffer
  • OCR run on the captured PNG bytes
  • A topmost transparent window drawn outside the game's process space
  • Local clipboard writes if the operator wants to paste names elsewhere

Not in this pipeline, and not in scope:

  • Reading the game's memory with ReadProcessMemory or pymem
  • Sending synthetic input via SendInput, pyautogui, pydirectinput, or a hardware key emulator
  • Hooking DirectX / Vulkan calls to intercept render data
  • Modifying client files, resource packs, or shaders

Treat that list as a hard contract. The moment the overlay starts driving the game, it stops being an accessibility tool and starts being something a TOS team will care about.

Architecture

A working overlay is a loop with three stages and one cache.

  1. Capture: grab a fixed region of the game window every N milliseconds.
  2. Detect change: hash the captured bytes; skip OCR if the region is identical to the previous frame.
  3. OCR + translate: only when the cache misses, send the bytes to OCR, then to the translator.
  4. Render: update a transparent topmost window with the translated string positioned over the original glyph region.

The cache is what makes the loop survive. MMOs spend most frames showing static UI; OCR on every frame would burn a CPU core for no reason. A simple FNV-1a or hashlib.blake2b over the captured bytes drops the OCR call rate by 80–95% on idle screens, which is the difference between a 5 watt overlay and a 35 watt one on a laptop.

Capturing with mss

The mss library is the right capture tool here because it talks directly to the OS framebuffer through BitBlt on Windows, XGetImage on Linux, and CGWindowListCreateImage on macOS. It is roughly 3× faster than PIL.ImageGrab on Windows in my measurements (around 4 ms vs 12 ms per full-screen capture on a 2560×1440 display) and supports arbitrary regions without an extra crop step. The project is maintained at github.com/BoboTiG/python-mss.

import mss
import hashlib

# Region: the tooltip area for an MMO item inspector.
# Pick these coordinates by eyeballing the UI once and recording them.
REGION = {"left": 1200, "top": 600, "width": 420, "height": 180}

def capture_loop(region):
    last_hash = None
    with mss.mss() as sct:
        while True:
            shot = sct.grab(region)
            digest = hashlib.blake2b(bytes(shot.raw), digest_size=8).digest()
            if digest != last_hash:
                last_hash = digest
                yield bytes(shot.raw), shot.size

The generator yields raw BGRA bytes plus the size tuple. Both go straight into OCR with no PNG roundtrip, which saves another 1–2 ms per frame. If the OCR backend wants a PIL Image, wrap the bytes once: Image.frombytes("RGB", shot.size, shot.rgb).

Pick the polling interval to match the game's UI cadence. Quest dialog at 4 Hz (250 ms) feels live without thrashing; a tooltip overlay can run at 10 Hz when the cursor is moving and idle when it parks.

OCR: Windows OCR API vs Tesseract

Two engines cover almost every MMO scenario. They make different trade-offs.

DimensionWindows OCR APITesseract 5.x
Install footprint0 MB (built into Windows 10+)~80 MB engine + ~15 MB per language pack
First-call latency80–120 ms cold, 20–40 ms warm60–90 ms warm (LSTM mode)
CJK accuracy on UI fontsHigh; trained on screen-rendered textMedium; needs --psm 7 plus a font hint
Cross-platformWindows-onlyLinux, macOS, Windows
Python bindingwinsdk.windows.media.ocrpytesseract subprocess

Most MMOs ship a Windows client, and the Windows OCR API was literally trained on screen-rendered Korean, Japanese, and Simplified Chinese, so when the target language is one of those, prefer it. The binding lives in the official winsdk package, which exposes the WinRT OcrEngine class through native Python.

When the overlay needs to run on a streaming viewer's Linux box or a macOS dev's MacBook, Tesseract from tesseract-ocr/tesseract is the portable answer. Install language packs by ISO code (tessdata/kor.traineddata, chi_sim.traineddata).

Here is a Tesseract path that runs on any platform with pip install pytesseract plus the system binary:

import pytesseract
from PIL import Image

def ocr_tesseract(bgra_bytes, size, lang="kor+eng"):
    # mss yields BGRA; PIL wants RGB for OCR.
    img = Image.frombytes("RGB", size, _bgra_to_rgb(bgra_bytes))
    # PSM 6 = uniform block of text; works well for MMO tooltips
    # which are short paragraphs in a single column.
    return pytesseract.image_to_string(img, lang=lang, config="--psm 6")

def _bgra_to_rgb(bgra):
    # Strip alpha and swap channels in one pass.
    mv = memoryview(bgra)
    return bytes((mv[i + 2], mv[i + 1], mv[i]) for i in range(0, len(mv), 4))

That --psm 6 flag is the most underrated knob in Tesseract. The default page-segmentation mode assumes a magazine page; for a 200×60 px tooltip with a single sentence, --psm 6 (uniform block) or --psm 7 (single line) cuts error rate roughly in half compared to PSM 3 on tested Korean MMO captures.

Translation backend

Once you have a recognized string, the choice is cloud vs offline.

Cloud APIs (DeepL, Google Translate, Microsoft Translator) deliver the best quality and ship in 150–400 ms per call from a residential connection. DeepL's free tier caps at 500K characters per month, which is more than enough for a personal overlay running 8 hours a day; a typical MMO session sends maybe 30K characters of unique text. Cache aggressively: an LRU dict keyed by the OCR output string skips the network for any tooltip you have already seen.

Offline neural translation is the alternative when network round-trips are unacceptable or when the operator wants zero external dependencies. Argos Translate (github.com/argosopentech/argos-translate) packages OPUS-MT models behind a clean Python API and runs entirely on CPU. Expect 200–600 ms per sentence on a modern laptop and quality that is 15–20% behind DeepL on idiomatic game text. For static UI strings the gap is smaller; for flavor text it is more obvious.

A pragmatic split is offline for hover-anything tooltips (low quality bar, high frequency) and cloud for quest dialogue (high quality bar, low frequency). The router is just a length check: route strings under 80 characters to the offline path, longer strings to the cloud.

Transparent Tkinter overlay

The render target is a topmost, frameless, click-through window with a chroma-key background. Tkinter handles this on Windows with two attributes:

import tkinter as tk

class Overlay:
    def __init__(self):
        self.root = tk.Tk()
        # Frameless, always on top, no taskbar entry.
        self.root.overrideredirect(True)
        self.root.attributes("-topmost", True)
        # Pick a fluorescent color the game UI will never use,
        # then make that color fully transparent.
        chroma = "#ff00ff"
        self.root.configure(bg=chroma)
        self.root.attributes("-transparentcolor", chroma)

        self.label = tk.Label(
            self.root,
            text="",
            font=("Segoe UI", 14),
            fg="white",
            bg="black",  # solid backdrop so text stays legible
            wraplength=400,
            justify="left",
        )
        self.label.pack(padx=4, pady=4)

    def show(self, text, x, y):
        self.label.configure(text=text)
        self.root.geometry(f"+{x}+{y}")
        self.root.update_idletasks()

The -transparentcolor trick is Windows-specific and the right answer here because it costs nothing per frame; the compositor handles the cutout. On Linux + X11, wm_attributes("-alpha", 0.0) plus a separately-alpha'd label widget gets close but with worse compositor performance. On Wayland, native Tkinter transparency is patchy enough that switching to a small Qt window via PySide6 is usually less pain than fighting the X server emulation layer.

Click-through is the missing piece. Out-of-the-box Tkinter swallows mouse events on the transparent region, which means the player cannot click through the overlay to interact with the game underneath. The fix on Windows is one ctypes call after the window is mapped:

import ctypes

WS_EX_LAYERED = 0x00080000
WS_EX_TRANSPARENT = 0x00000020
GWL_EXSTYLE = -20

def make_click_through(hwnd):
    user32 = ctypes.windll.user32
    style = user32.GetWindowLongW(hwnd, GWL_EXSTYLE)
    user32.SetWindowLongW(hwnd, GWL_EXSTYLE, style | WS_EX_LAYERED | WS_EX_TRANSPARENT)

Call it after root.update() so the HWND is real, passing root.winfo_id(). The overlay now sits inert on top of the game; the player sees translations but every click and keypress goes straight to the game window.

Putting the loop together

The three pieces meet in a simple coordinator. Capture yields changed frames; OCR + translate run only when the hash flips; the overlay is updated on the Tkinter event loop via after() to keep widget access on the main thread.

def run(overlay, region, ocr_fn, translate_fn):
    cache = {}  # ocr_text -> translated_text

    def tick():
        try:
            shot = next(capture_iter)
        except StopIteration:
            return
        bgra, size = shot
        text = ocr_fn(bgra, size).strip()
        if text:
            translated = cache.get(text)
            if translated is None:
                translated = translate_fn(text)
                cache[text] = translated
            overlay.show(translated, region["left"], region["top"] - 30)
        overlay.root.after(100, tick)

    capture_iter = capture_loop(region)
    overlay.root.after(100, tick)
    overlay.root.mainloop()

Anchoring the overlay 30 px above the captured region is the convention that reads best: the original glyphs stay visible for cross-checking, and the translated text floats in the negative space above the UI element.

Pitfalls worth knowing

A few rough edges show up consistently when shipping this kind of overlay.

Subpixel rendering destroys OCR. ClearType on Windows or LCD-pixel-order anti-aliasing on Linux smears glyph edges into red/blue fringes that Tesseract reads as noise. Pass the image through a quick grayscale + threshold first: img.convert("L").point(lambda p: 255 if p > 180 else 0) recovers 5–10% accuracy on light-on-dark MMO tooltips.

Animated UI elements thrash the cache. A quest tracker with a pulsing icon will hash differently every frame even when the text is identical. Mask the animated regions out of the captured bytes before hashing, or hash only the central text band of the region.

DPI scaling breaks coordinates. Windows reports virtual pixels by default, but mss and the game both work in physical pixels. Call ctypes.windll.shcore.SetProcessDpiAwareness(2) at startup so the whole Python process reports physical coordinates; otherwise a 1.5× display will offset capture regions by 50%.

Game fullscreen-exclusive mode hides the overlay. Borderless fullscreen and windowed both work; true exclusive fullscreen takes over the compositor and the Tkinter window vanishes. Set the game to borderless and the overlay reappears.

Where to take it next

The same architecture extends naturally. Add a second capture region for chat. Pipe OCR output to a TTS engine for accessibility. Bind a hotkey (using a global listener, not a synthetic input library) to toggle the overlay per encounter type. The boundary that keeps the project safe is the same one: read pixels, render to a separate window, never write back to the game.

References: