Dofus Retro Bot
Reverse engineering a 2004 Flash game with computer vision and packet sniffing
Autonomous bot engine for Dofus 1.29 — computer vision pipeline for UI state reading and passive TCP packet analysis for authoritative game state.
Why I built this
Dofus 1.29 is a Flash-based MMORPG from 2004. The official servers shut down years ago; the game lives on because the community runs private servers. I played it as a kid and ended up reverse engineering it as an adult — which is a fairly predictable outcome for someone who became a security engineer.
The problem is that version 1.29 has no API, no plugin interface, and a custom undocumented TCP protocol. There’s nothing to hook into. Automating it means reconstructing the full game state from scratch using only two signals: what pixels appear on screen and what bytes flow over the network.
I tried pure vision first. It breaks constantly — any resolution change, any UI skin difference, and your template matches start returning garbage. Pure network-based approaches have the opposite problem: you need to reverse-engineer a protocol that isn’t stable across server patches, and you don’t actually know what you’re missing until something silently fails.
The solution I landed on was running both in parallel and making each one authoritative for the part of the game state it’s most reliable for. Vision owns UI state; network owns server-authoritative game state. When they disagree, network wins.
Architecture
The engine is organized into layers with clear ownership boundaries.
Capture layer (src/capture/) grabs frames from the game window via mss on macOS, using Quartz APIs to get accurate window bounds. Retina display scaling is handled at init — the physical/logical pixel ratio is detected once and all coordinates are normalized downstream from that point.
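The detect-once-then-normalize idea can be sketched as a small helper, assuming hypothetical names (`ScreenScale`, `detect`, `to_logical`) that are not from the repo:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScreenScale:
    """Physical-to-logical pixel ratio, detected once at capture init.

    On a Retina display the captured frame is e.g. 2x the logical window
    size reported by the OS, so every downstream coordinate must be
    normalized by this ratio. Illustrative sketch, not the repo's class.
    """
    ratio: float

    @classmethod
    def detect(cls, physical_width: int, logical_width: int) -> "ScreenScale":
        # e.g. a 2880px-wide capture of a 1440pt-wide window -> ratio 2.0
        return cls(ratio=physical_width / logical_width)

    def to_logical(self, x: int, y: int) -> tuple[int, int]:
        # convert captured-frame pixels to clickable screen coordinates
        return round(x / self.ratio), round(y / self.ratio)

    def to_physical(self, x: int, y: int) -> tuple[int, int]:
        # convert screen coordinates back to frame pixels for matching
        return round(x * self.ratio), round(y * self.ratio)
```

Detecting the ratio once, instead of per-frame, keeps the capture path cheap and makes every downstream region coordinate unambiguous.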
Vision pipeline (src/vision/) runs frame analysis through OpenCV template matching, a scene detector that identifies which screen is active (combat, map, dialog, inventory), and a multi-engine OCR stack for reading text elements like HP values, coordinates, item names, and chat. OCR preprocessing uses per-context presets: DIGITS_GREEN, used for health bars, isolates the green hue range in HSV before thresholding; COORDINATES uses a simple binary threshold at brightness 200. When the primary OCR engine fails, the pipeline falls through to EasyOCR, then Tesseract, then a Claude API call as a last resort.
Network pipeline (src/network/) is a passive scapy-based TCP sniffer that captures raw packets on the game port (default 5489, auto-detected via port scanner). Stream reassembly is done per-connection using (src_ip, src_port, dst_ip, dst_port) keyed buffers. Reassembled messages feed into the protocol parser, which dispatches typed events through an EventDispatcher.
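The per-connection buffering reduces to a few lines once the delimiter is known. A sketch assuming the \x00 terminator described below (class and method names are illustrative, not the repo's):

```python
from collections import defaultdict

# One buffer per TCP flow, keyed exactly as in the text.
FlowKey = tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)

class StreamReassembler:
    """Buffer TCP payload bytes per connection and emit complete messages.

    Dofus 1.29 server messages are \\x00-terminated, so a complete
    message is everything up to the next null byte; whatever trails the
    last null stays buffered until the next packet arrives.
    """
    def __init__(self, delimiter: bytes = b"\x00") -> None:
        self.delimiter = delimiter
        self.buffers: dict[FlowKey, bytes] = defaultdict(bytes)

    def feed(self, key: FlowKey, payload: bytes) -> list[bytes]:
        self.buffers[key] += payload
        # split() always returns at least one element, so the starred
        # unpack is safe: complete messages, then the trailing partial
        *complete, rest = self.buffers[key].split(self.delimiter)
        self.buffers[key] = rest
        return complete
```

In the real pipeline each `complete` message would then go through the protocol parser and out the EventDispatcher; the sniffing itself (scapy `sniff` with a port filter) is orthogonal to this buffering logic.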
State machine reconciles vision and network state via a shared game state object. When the two pipelines disagree — network says the character moved but vision still shows the old position — network takes precedence for server-owned state.
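The precedence rule can be expressed as a source-tagged update on the shared state object. A minimal sketch, assuming a hypothetical field set and class shape (none of these names are from the repo):

```python
from dataclasses import dataclass, field
from typing import Any

# Fields the server is authoritative for; once the network pipeline has
# written one, vision may not overwrite it. Illustrative field set.
SERVER_OWNED = {"position", "hp"}

@dataclass
class GameState:
    values: dict[str, Any] = field(default_factory=dict)
    sources: dict[str, str] = field(default_factory=dict)

    def update(self, name: str, value: Any, source: str) -> bool:
        """Apply an update from 'vision' or 'network'.

        Returns False when vision tries to override server-owned state
        that the network pipeline has already set; network always wins.
        """
        if (source == "vision" and name in SERVER_OWNED
                and self.sources.get(name) == "network"):
            return False
        self.values[name] = value
        self.sources[name] = source
        return True
```

Tracking the source alongside the value is what makes the "network wins" rule cheap: disagreement never has to be detected explicitly, the lower-priority write is simply refused.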
Input layer (src/input/) operates at OS level only. No DLL injection, no memory reading. A Humanizer component adds Bezier-curve movement paths, lognormal reaction delays, and click scatter to make input patterns non-uniform. Three speed profiles: casual (250–800ms Bezier duration), normal (150–600ms), fast (100–400ms).
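The Bezier and lognormal pieces are standard techniques; a sketch of both, with parameter values that are illustrative rather than the repo's:

```python
import random

def bezier_path(start: tuple[float, float], end: tuple[float, float],
                steps: int = 20, scatter: float = 8.0) -> list[tuple[float, float]]:
    """Quadratic Bezier from start to end with a randomized control point,
    so repeated moves between the same two points never trace the same
    curve. A speed profile (e.g. normal: 150-600 ms) would control how
    fast these points are replayed."""
    (x0, y0), (x2, y2) = start, end
    # control point: midpoint plus a random offset, bending the curve
    cx = (x0 + x2) / 2 + random.uniform(-scatter, scatter) * 5
    cy = (y0 + y2) / 2 + random.uniform(-scatter, scatter) * 5
    pts = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        pts.append((x, y))
    return pts

def reaction_delay(mu: float = -1.5, sigma: float = 0.4) -> float:
    """Lognormal reaction time in seconds: right-skewed like human
    reactions, with most mass around exp(mu) ~ 0.22 s."""
    return random.lognormvariate(mu, sigma)
```

The lognormal choice matters: uniform or Gaussian delays produce symmetric timing histograms, while real human reaction times have a long right tail.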
Protocol reverse engineering
The first thing I got wrong was assuming the protocol was binary. Dofus 1.29 is actually text-based: messages are plain strings over TCP. The server terminates each message with \x00; the client terminates with \n\x00. Fine. The harder part is that message IDs are variable-length prefixes (1–4 characters) with no delimiter separating them from the payload, so you don't know where the ID ends and the data starts unless you already have a registry of known IDs.
I built the parser around longest-prefix-match against a registry of 46 known message types:
```python
def _resolve_message_id(raw: str) -> tuple[str, str]:
    # try longest prefix first (4, 3, 2, 1 chars)
    for length in (4, 3, 2, 1):
        if len(raw) >= length:
            prefix = raw[:length]
            if prefix in _SERVER_PARSERS:
                return prefix, raw[length:]
    # fallback: split on first pipe
    pipe_idx = raw.find("|")
    return (raw[:pipe_idx], raw[pipe_idx + 1:]) if pipe_idx != -1 else (raw, "")
```

Most messages use pipe-delimited fields, which is manageable. Combat action messages (GA) use semicolons instead and put the action type in the payload rather than the ID: GA0;1;123;path has ID GA, with the action type as the first field. This inconsistency only becomes obvious when you're capturing live packets and watching the parser silently discard events. Each message family had to be identified separately from packet captures, cross-referenced against a decompiled SWF of the game client.
A concrete example: a combat action packet GA0;1;456;abKfge decodes as action type 0 (movement), source entity 456, movement path abKfge — a base-64-like encoding of cell coordinates on the game map.
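A family-specific parser for that example might look like the following sketch, where the field names and the meaning of the unused second field are my own assumptions:

```python
def parse_ga(payload: str) -> dict:
    """Decode a combat action (GA) payload like '0;1;456;abKfge'.

    Field layout follows the example above: first field is the action
    type (0 = movement), third is the source entity, fourth the
    base-64-like path string. The second field's meaning is not covered
    here and is left unparsed.
    """
    fields = payload.split(";")
    action_type = int(fields[0])
    event = {"action_type": action_type, "source": int(fields[2])}
    if action_type == 0:  # movement action carries an encoded path
        event["path"] = fields[3]
    return event
```

Dispatching on the first payload field like this is exactly the wrinkle that makes GA messages awkward: the registry resolves the ID, but a second dispatch layer is still needed inside the family.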
Vision engine
Scene detection drives everything downstream. The SceneDetector checks marker templates at configurable regions to determine which screen is currently active. It caches the result for 500ms to avoid rechecking on every frame:
```python
def detect(self, frame: np.ndarray, force: bool = False) -> str:
    now = time.monotonic()
    if not force and (now - self._last_scene_time) < self.cache_ttl:
        return self._last_scene
    for screen in self._screens:
        if self._check_markers(frame, screen):
            self._last_scene = screen.name
            ...
```

Template matching runs multi-scale — templates are loaded at reference resolution and matched against frames that may be at a different actual resolution. A ResolutionScaler tracks the reference resolution and applies scaling factors to all region coordinates before matching. Going multi-scale dropped false-positive match rates from around 40% (fixed-pixel matching) to under 5%.
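The coordinate-scaling half of that is easy to show standalone. A sketch of what a ResolutionScaler could look like, assuming a `(x, y, w, h)` region convention and method names of my own choosing:

```python
class ResolutionScaler:
    """Map template regions defined at a reference resolution onto the
    actual frame resolution, so region coordinates stay valid when the
    game window is resized. Illustrative sketch of the idea."""

    def __init__(self, ref_size: tuple[int, int]) -> None:
        self.ref_w, self.ref_h = ref_size

    def factors(self, frame_size: tuple[int, int]) -> tuple[float, float]:
        w, h = frame_size
        return w / self.ref_w, h / self.ref_h

    def scale_region(self, region: tuple[int, int, int, int],
                     frame_size: tuple[int, int]) -> tuple[int, int, int, int]:
        """region = (x, y, w, h) at reference resolution."""
        fx, fy = self.factors(frame_size)
        x, y, w, h = region
        return (round(x * fx), round(y * fy), round(w * fx), round(h * fy))
```

The templates themselves would be resized by the same factors (e.g. `cv2.resize`) before `cv2.matchTemplate`, so both the search region and the pattern stay consistent with the live frame.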
Movement settling uses frame differencing rather than fixed sleep times. After injecting a click, the bot captures frames at intervals and computes mean absolute pixel difference between consecutive pairs. When N consecutive frame pairs fall below a threshold, movement is considered settled. This matters because map transitions have highly variable durations — a fixed sleep that works 90% of the time fails badly on the other 10%.
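The settling loop described above can be sketched as follows, with threshold and count values that are illustrative rather than the repo's:

```python
import numpy as np

def wait_until_settled(capture, threshold: float = 2.0,
                       consecutive: int = 3, max_frames: int = 100) -> bool:
    """Poll frames until `consecutive` successive pairs differ by less
    than `threshold` mean absolute pixel difference.

    `capture` is any zero-arg callable returning a grayscale ndarray.
    Casting to int16 avoids uint8 wraparound when subtracting frames.
    """
    prev = capture().astype(np.int16)
    calm = 0
    for _ in range(max_frames):
        frame = capture().astype(np.int16)
        diff = np.abs(frame - prev).mean()
        calm = calm + 1 if diff < threshold else 0
        if calm >= consecutive:
            return True  # screen has been static long enough
        prev = frame
    return False  # movement never settled within max_frames
```

Returning a boolean rather than blocking forever is deliberate: a map transition that never settles (e.g. an animated scene) should surface as a failure the caller can handle, not a hang.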
What works and what doesn’t
Resource farming — mining, harvesting — is the most reliable mode. The loop is predictable: walk to resource, interact, wait, loot, repeat. Vision and network are redundant here; either pipeline alone can drive the action sequence, which makes it resilient.
Map navigation handles simple paths via an A* graph built from game map data files. Complex terrain (water crossings, interiors with teleporters) requires human-assisted path definition. That’s a known gap.
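For reference, the pathfinding itself is textbook A* over an adjacency structure. A minimal sketch, independent of the repo's map-data format:

```python
import heapq

def astar(graph: dict, start, goal, heuristic=lambda a, b: 0):
    """A* over a map graph where graph[node] = [(neighbor, cost), ...].

    With the default zero heuristic this degrades to Dijkstra, a safe
    choice when map identifiers aren't metric coordinates. Returns the
    node path, or None if the goal is unreachable.
    """
    frontier = [(0, start, [start])]   # (f-score, node, path so far)
    best = {start: 0}                  # cheapest known g-score per node
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, cost in graph.get(node, []):
            g = best[node] + cost
            if g < best.get(nxt, float("inf")):
                best[nxt] = g
                heapq.heappush(frontier, (g + heuristic(nxt, goal), nxt, path + [nxt]))
    return None
```

The human-assisted paths mentioned above would simply be extra edges injected into `graph` for terrain the automatic graph builder can't model.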
Combat automation is partial. Turn-based combat is tractable — the bot can read turn order from network events, select targets, and cast spells. But there are edge cases that need better handling: spell range validation, position evaluation for AoE spells, multi-character team coordination. The network layer parses all the relevant events; the decision logic just isn’t complete yet.
Tech
Python 3.12, uv for package management. Core dependencies: OpenCV, EasyOCR, Tesseract, scapy, mss, Pydantic for config, loguru for structured logging, Anthropic SDK for OCR fallback. Windows target (bettercam, pydirectinput, win32gui) with macOS support added alongside for development. 1433 tests.