Reverse Engineering a 2004 Flash Game With OpenCV and Packet Sniffing
Dofus Retro (version 1.29) is a Flash-based MMORPG from 2004 that the publisher officially killed years ago. No API, no developer docs, no hooks — just a Flash client talking to community-run private servers over a custom undocumented TCP protocol. I wanted to automate it. The game wasn’t going to help me do that.
So I had to read its screen and sniff its packets.
What started as a weekend experiment turned into a proper reverse engineering project: a bot with two completely separate information sources — a computer vision pipeline for UI state and a TCP packet analyzer for game state — reconciled through a shared state object with explicit authority rules. This article covers the vision side of the bot and how the network layer fills in what vision can’t see reliably.
Why Two Engines
The naive approach to game automation is pure vision: screenshot the window, find what you’re looking for, click it. This works until it doesn’t. Resolution changes, UI skin variations, and partially-obscured windows all cause silent failures where the bot confidently acts on wrong data. I watched it try to click a button that was hidden behind a dialog box for twenty minutes before I gave up on pure vision.
The alternative is pure network analysis: intercept TCP traffic, parse the protocol, know the game state authoritatively. This is more reliable for server-owned state (character position, inventory, entity HP) but tells you nothing about UI state — which dialog is open, what NPC options are visible, whether a button is currently clickable.
The bot uses both. Vision is authoritative for UI state. Network is authoritative for game state that the server owns. When both are available for the same data point, network wins.
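The reconciliation logic can be sketched as a shared state object with per-field authority rules. This is a minimal illustration, not the bot's actual code; the field names and the SharedState shape are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
import time

class Source(Enum):
    VISION = auto()
    NETWORK = auto()

# Authority rules: which pipeline owns each field (field names hypothetical).
AUTHORITY = {
    "active_scene": Source.VISION,    # UI state: only vision can see it
    "char_position": Source.NETWORK,  # server-owned: network wins
    "entity_hp": Source.NETWORK,
}

@dataclass
class SharedState:
    # field name -> (value, source, timestamp)
    values: dict = field(default_factory=dict)

    def update(self, name: str, value, source: Source) -> None:
        owner = AUTHORITY.get(name)
        current = self.values.get(name)
        # Reject a lower-authority write when an authoritative value exists.
        if current is not None and owner is not None:
            _, cur_source, _ = current
            if cur_source == owner and source != owner:
                return
        self.values[name] = (value, source, time.monotonic())

state = SharedState()
state.update("char_position", (5, -18), Source.NETWORK)
state.update("char_position", (5, -17), Source.VISION)  # ignored: network owns this
print(state.values["char_position"][0])  # (5, -18)
```

The useful property is that a stale vision read can never silently overwrite an authoritative network value, which is exactly the class of bug described above.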
Screen Capture
The capture layer uses mss for frame grabbing, with Quartz APIs to get accurate window bounds on macOS. Retina displays require explicit handling — and this caught me off guard initially. Quartz reports coordinates in logical pixels but mss captures at physical pixel resolution. The capturer detects the scale factor at init time and normalizes all downstream coordinates:
```python
def _get_display_scale_factor() -> float:
    """Detect Retina scale factor (logical vs physical pixels)."""
    main_display = Quartz.CGMainDisplayID()
    mode = Quartz.CGDisplayCopyDisplayMode(main_display)
    physical_w = Quartz.CGDisplayModeGetPixelWidth(mode)
    logical_w = Quartz.CGDisplayModeGetWidth(mode)
    if logical_w > 0:
        return physical_w / logical_w
    return 1.0
```

Without this, template coordinates computed at logical resolution would be off by a factor of two on a Retina display. Every click landing in the wrong place with no obvious error — the kind of bug that makes you question your sanity.
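To make the normalization concrete: a click target found on the physical-resolution screenshot must be divided by the scale factor before the event is injected. A trivial sketch (helper name hypothetical):

```python
def to_logical(x_px: int, y_px: int, scale: float) -> tuple[int, int]:
    """Convert physical (screenshot) pixels to logical (event-injection) coords."""
    return round(x_px / scale), round(y_px / scale)

# A template match at physical (800, 600) on a 2x Retina display
# must be clicked at logical (400, 300).
print(to_logical(800, 600, 2.0))  # (400, 300)
```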
Scene Detection
Before anything else, the bot needs to know which screen is currently visible — combat, map, NPC dialog, inventory, loading screen. Each screen has a distinct UI layout, and the vision pipeline runs different analysis depending on the active scene.
Scene detection works through a marker system: each named screen has a set of marker templates with associated regions. If all markers for a screen are found in their expected regions, that screen is active. Markers are small distinctive image crops — typically UI chrome elements that only appear in specific screens.
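A marker registry along these lines would express that rule; the dataclass names and template paths here are illustrative, not the project's actual ones:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Marker:
    template_path: str                         # small crop of distinctive UI chrome
    region: tuple[float, float, float, float]  # fractional (x, y, w, h) search area

@dataclass(frozen=True)
class ScreenDef:
    name: str
    markers: tuple[Marker, ...]  # ALL must match for the screen to be active

# Hypothetical example: combat is identified by its timeline and end-turn button.
COMBAT = ScreenDef(
    name="combat",
    markers=(
        Marker("templates/combat/timeline.png", (0.75, 0.05, 0.25, 0.60)),
        Marker("templates/combat/end_turn_btn.png", (0.70, 0.85, 0.30, 0.15)),
    ),
)
```

Detection is then just a loop over these definitions: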
```python
def detect(self, frame: np.ndarray, force: bool = False) -> str:
    now = time.monotonic()
    if not force and (now - self._last_scene_time) < self.cache_ttl:
        return self._last_scene

    for screen in self._screens:
        if self._check_markers(frame, screen):
            self._last_scene = screen.name
            self._last_scene_time = now
            return screen.name

    return "unknown"
```

The result is cached for 500ms to avoid re-running on every frame. `force=True` bypasses the cache — used after state transitions where the scene change is expected.
Template Matching
Core UI element detection uses OpenCV’s matchTemplate with TM_CCOEFF_NORMED. The first version used templates at a fixed resolution — which worked fine until the window was resized slightly, and suddenly every detection broke. The fix: load templates at a reference resolution and rescale to match the actual window resolution at match time. That brought false-positive rates from ~40% down to under 5%.
The matcher runs multi-scale search within a configurable pixel range around the reference scale:
```python
def _match_multiscale(
    self, frame_gray: np.ndarray, template: np.ndarray, threshold: float
) -> tuple[float, tuple[int, int, int, int]] | None:
    best_confidence = 0.0
    best_rect = None

    for scale in self.scaler.get_scale_variants():
        w = max(1, int(template.shape[1] * scale))
        h = max(1, int(template.shape[0] * scale))
        scaled = cv2.resize(template, (w, h))

        result = cv2.matchTemplate(frame_gray, scaled, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)

        if max_val > best_confidence:
            best_confidence = max_val
            best_rect = (max_loc[0], max_loc[1], w, h)

    if best_confidence >= threshold:
        return best_confidence, best_rect
    return None
```

Region-of-interest support cuts down the search area for elements that only appear in predictable parts of the screen (inventory slots, HP bar region, minimap area). Fractional regions are supported — defined as proportions of the full frame so they stay valid across resolutions.
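Converting a fractional region to a pixel ROI is small but worth showing, since scaling mistakes here are a classic source of missed matches. A sketch (function name hypothetical):

```python
import numpy as np

def crop_fractional(
    frame: np.ndarray, region: tuple[float, float, float, float]
) -> np.ndarray:
    """Crop an (x, y, w, h) region given as fractions of the frame size."""
    fh, fw = frame.shape[:2]
    x, y, w, h = region
    x0, y0 = int(x * fw), int(y * fh)
    x1, y1 = int((x + w) * fw), int((y + h) * fh)
    return frame[y0:y1, x0:x1]

# Same fractional region yields a valid crop at any resolution.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
roi = crop_fractional(frame, (0.75, 0.05, 0.25, 0.60))  # e.g. right-side panel area
print(roi.shape)  # (648, 480, 3)
```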
OCR Pipeline
Some game state is only available as text: character coordinates (shown in the UI footer), HP values on health bars, item names in tooltips, chat messages. These require OCR.
The preprocessing step is where most of the accuracy comes from. Raw game screenshots are small, low-contrast, and use pixel fonts — standard OCR models trained on document text fail badly without preprocessing. The engine uses named presets per context:
```python
DIGITS_GREEN = PreprocessPreset(
    name="digits_green",
    upscale_factor=3.0,
    color_isolation={"lower": [35, 50, 50], "upper": [85, 255, 255]},
    binary_threshold=128,
)

COORDINATES = PreprocessPreset(
    name="coordinates",
    upscale_factor=2.0,
    binary_threshold=200,
)

CHAT_TEXT = PreprocessPreset(
    name="chat_text",
    upscale_factor=2.0,
    use_clahe=True,
    use_adaptive_threshold=True,
    threshold_block_size=11,
    threshold_c=2,
)
```

DIGITS_GREEN isolates HP bar values: convert to HSV, mask pixels outside the green range, upscale 3x, threshold. This drops background noise before the OCR model sees anything. COORDINATES uses a hard brightness threshold because the coordinate text is white on a dark background with minimal noise.
The OCR itself runs a fallback chain: Tesseract first (fast, good for digits with the right PSM config), then EasyOCR (slower but handles more variation), then a Claude API call for cases where both fail. The API fallback is intentionally last — it adds latency and cost but it works on edge cases that the local models can’t handle.
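The chain itself reduces to a loop over engines ordered fast to slow. A sketch with hypothetical engine wiring:

```python
from typing import Callable, Optional

def ocr_with_fallback(
    image, engines: list[tuple[str, Callable]]
) -> Optional[tuple[str, str]]:
    """Try each OCR engine in order; return (engine_name, text) on first success."""
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception:
            text = None  # an engine crash just falls through to the next one
        if text and text.strip():
            return name, text.strip()
    return None

# Hypothetical wiring: tesseract -> easyocr -> claude API fallback.
result = ocr_with_fallback("fake_image", [
    ("tesseract", lambda img: None),    # fails on this image
    ("easyocr", lambda img: " -2,5 "),  # succeeds
    ("claude", lambda img: "-2,5"),     # never reached
])
print(result)  # ('easyocr', '-2,5')
```

Wrapping each engine in a try/except matters in practice: a fallback chain where one engine's crash aborts the whole read defeats its purpose.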
Network State: The Protocol
The network sniffer runs in a background thread using scapy for raw packet capture. Packets are reassembled per TCP stream using four-tuple keys, and complete messages (terminated by \x00) are extracted from the reassembly buffer.
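The reassembly step, stripped of the scapy plumbing, might look like this; the class name and example key values are illustrative:

```python
from collections import defaultdict

class StreamReassembler:
    """Buffer TCP payload per stream; emit complete NUL-terminated messages."""

    def __init__(self) -> None:
        self._buffers: dict[tuple, bytearray] = defaultdict(bytearray)

    def feed(self, stream_key: tuple, payload: bytes) -> list[str]:
        buf = self._buffers[stream_key]
        buf.extend(payload)
        messages = []
        # A message may span several packets; emit only up to the last \x00.
        while (end := buf.find(b"\x00")) != -1:
            messages.append(buf[:end].decode("latin-1"))
            del buf[:end + 1]
        return messages

# Stream key is the TCP four-tuple: (src_ip, src_port, dst_ip, dst_port).
r = StreamReassembler()
key = ("10.0.0.2", 51234, "10.0.0.9", 5555)
r.feed(key, b"GTS456|30")                    # partial message: nothing emitted yet
print(r.feed(key, b"000\x00GTF456\x00GA1"))  # ['GTS456|30000', 'GTF456']
```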
The Dofus 1.29 protocol is text-based, not binary — which was the first surprise. I expected length-prefixed binary frames, the kind of thing you’d see in any modern networked game. Instead: variable-length strings, where the message ID is a 1–4 character prefix with no delimiter separating it from the payload.
The parser resolves IDs using longest-prefix-match against a registry of 46 known message types, built up incrementally from live packet captures correlated with in-game observations:
```
cMK*|123|PlayerName|Hello world -> id=cMK, channel=*, source=123, name=PlayerName
GDM|4327|1234567890|decryptKey  -> id=GDM (map_data), map_id=4327
GA0;1;456;abKfge                -> id=GA (action), type=0 (movement), source=456
GTS789|30000                    -> id=GTS (turn_start), entity=789, timeout=30000
```

The GA (game action) family is the most complex: combat actions use semicolons instead of pipes, and the action type is embedded in the first field. Action type 0 is movement, 100/300 are spell casts, 102 is HP change. Pinning this down took hundreds of captures before the structure became consistent.
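Longest-prefix-match over the registry is what makes the ID ambiguity tractable. A sketch using the message types shown above (registry subset only; the real registry has 46 entries):

```python
# Subset of the message-ID registry, mapped to semantic names.
REGISTRY = {
    "cMK": "chat_message",
    "GDM": "map_data",
    "GA": "game_action",
    "GTS": "turn_start",
    "GTF": "turn_end",
}

def resolve_id(message: str, registry: dict[str, str] = REGISTRY):
    """Longest-prefix-match: try the 4-char prefix first, down to 1 char."""
    for length in range(4, 1 - 1, -1):
        prefix = message[:length]
        if prefix in registry:
            return prefix, message[length:]
    return None  # unknown message type: log it for registry expansion

print(resolve_id("GTS789|30000"))  # ('GTS', '789|30000')
print(resolve_id("GA0;1;456"))     # ('GA', '0;1;456')
```

Trying longer prefixes first is essential: if "GT" were ever added to the registry, shortest-first matching would misparse every GTS and GTF message.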
A complete parsed combat sequence looks like:
```
GTS456|30000            # turn starts for entity 456, 30s timeout
GA300;1;456;789;3;abcde # entity 456 casts spell 789 level 3 at target with path abcde
GA102;0;789;-145        # entity 789 takes 145 HP damage
GTF456                  # entity 456 ends turn
```

The EventDispatcher converts parsed messages into typed GameEvent callbacks. Consumers subscribe by event type:
```python
dispatcher.on(GameEvent.TURN_START, handle_turn_start)
dispatcher.on(GameEvent.HP_CHANGE, track_hp)
dispatcher.on(GameEvent.MAP_LOADED, refresh_map_state)
```

60+ event types cover connection lifecycle, combat, navigation, inventory, chat, and character state.
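A dispatcher with this interface is a few lines; the internals here are a sketch, and only the on() signature and event names come from the calls above:

```python
from collections import defaultdict
from enum import Enum, auto
from typing import Callable

class GameEvent(Enum):
    TURN_START = auto()
    HP_CHANGE = auto()
    MAP_LOADED = auto()

class EventDispatcher:
    def __init__(self) -> None:
        self._handlers: dict[GameEvent, list[Callable]] = defaultdict(list)

    def on(self, event: GameEvent, handler: Callable) -> None:
        self._handlers[event].append(handler)

    def dispatch(self, event: GameEvent, payload: dict) -> None:
        # Called from the sniffer thread for each parsed message;
        # handlers should be fast and must not block packet processing.
        for handler in self._handlers[event]:
            handler(payload)

seen = []
d = EventDispatcher()
d.on(GameEvent.HP_CHANGE, lambda p: seen.append(p["delta"]))
d.dispatch(GameEvent.HP_CHANGE, {"entity": 789, "delta": -145})
print(seen)  # [-145]
```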
Movement: Vision-Confirmed
After injecting a click to move the character, the bot doesn’t use a fixed sleep. That approach breaks on map transitions and slow load screens. Instead, it captures frames at intervals and measures mean absolute pixel difference between consecutive frames:
```python
def _wait_for_movement_settle(
    self, capturer: FrameCapturer, timeout_s: float
) -> bool:
    settle_threshold = 5.0  # mean absolute pixel diff below this = settled
    required_checks = 3     # consecutive settled frames needed (illustrative value)
    interval_s = 0.1        # capture interval (illustrative value)
    deadline = time.monotonic() + timeout_s
    consecutive_settled = 0
    prev_frame = None

    while time.monotonic() < deadline:
        frame = capturer.grab()
        if prev_frame is not None:
            diff = np.mean(np.abs(
                frame.astype(np.float32) - prev_frame.astype(np.float32)
            ))
            if diff < settle_threshold:
                consecutive_settled += 1
                if consecutive_settled >= required_checks:
                    return True
            else:
                consecutive_settled = 0
        prev_frame = frame
        time.sleep(interval_s)
    return False
```

When the character stops moving, consecutive frames become nearly identical and the mean diff drops below 5.0 (out of 255). The threshold was tuned empirically — low enough to catch idle frames, high enough to not trigger on UI animations like chat scrolling or health bar updates.
Map transitions (loading screens) have highly variable duration and would break any fixed-time approach. The frame differencing approach handles them naturally: the loading screen is static once fully loaded, so the settle condition fires immediately.
What It Taught Me
The protocol reverse engineering was the most interesting part. The variable-length IDs with no delimiter mean you can’t parse a message without knowing the complete registry of valid prefixes — which itself required building up incrementally from live captures. You can’t start with the format. You have to discover the format.
The dual-authority state model (vision + network reconciled through a shared state object, with explicit authority rules) made debugging tractable. When the bot made a wrong decision, the state log showed exactly which pipeline provided which data and which rule resolved the conflict. Without that structure, debugging automation bugs in a live game session is nearly impossible.
The full source is on GitHub. The project page has an architecture overview.