A Python-based mobile automation agent that uses Qwen3-VL vision-language models to understand and interact with Android devices through visual analysis and ADB commands.
- 🤖 Vision-powered automation: Uses Qwen3-VL to visually understand phone screens
- 📱 ADB integration: Controls Android devices via ADB commands
- 🎯 Natural language tasks: Describe what you want in plain English
- 🖥️ Web UI: Built-in Gradio interface for easy control
- 📊 Real-time feedback: Live screenshots and execution logs
- Python 3.10+
- Android device with USB debugging & Developer Mode enabled
- ADB (Android Debug Bridge) installed
- GPU with sufficient VRAM (tested on a 128 GB Strix Halo with the Qwen3-VL-30B model)
- The repo is configured to use the dense Qwen3-VL 4B/8B models, which perform very well. To swap to an MoE model, see the configuration section below
Linux/Ubuntu:

```bash
sudo apt update
sudo apt install adb
```

Clone the repository:

```bash
git clone https://github.com/OminousIndustries/PhoneDriver.git
cd PhoneDriver
```

Create a virtual environment:

```bash
python -m venv phonedriver
source phonedriver/bin/activate
```

Install the Python dependencies:

```bash
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0  # v4.57.0 is not yet released
# Install the other requirements
pip install pillow gradio qwen_vl_utils requests
```

- Enable USB debugging on your Android device (Settings → Developer Options)
- Connect via USB
- Verify the connection:

```bash
adb devices
```

You should see your device listed.
Edit qwen_vl_agent.py to choose your model:

```python
# For the 4B model
model_name: str = "Qwen/Qwen3-VL-4B-Instruct"

# For the 8B model
#model_name: str = "Qwen/Qwen3-VL-8B-Instruct"
```

If you want to try a Qwen3 MoE model, you need to change the import in qwen_vl_agent.py to the following:

```python
#from transformers import Qwen3VLForConditionalGeneration, AutoProcessor  # comment this import out; it is for the dense models
# Uncomment the import below for the MoE variants
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
```

You will also need to change line 61:

```python
self.model = Qwen3VLForConditionalGeneration.from_pretrained(
```

Change it to:

```python
self.model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
```
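Before editing the agent, you can sanity-check that the chosen checkpoint loads at all. The snippet below is only a sketch; the dtype and device_map choices are assumptions, not the repo's settings (use the MoE class instead for the MoE variants):

```python
# Quick load check (a sketch, not the agent's own code): confirm the chosen
# checkpoint loads before wiring it into qwen_vl_agent.py.
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_name = "Qwen/Qwen3-VL-4B-Instruct"  # or the 8B / MoE variant

processor = AutoProcessor.from_pretrained(model_name)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # assumption: bf16 is comfortable for 4B/8B
    device_map="auto",
)
print(model.device)
```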
The agent can auto-detect your device resolution from the Web UI Settings tab, but you can also configure it manually in config.json:

```json
{
  "screen_width": 1080,
  "screen_height": 2340,
  ...
}
```

To get your device resolution, connect the device to your computer and run:

```bash
adb shell wm size
```
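If you want to script that lookup, the following is a minimal sketch of parsing the adb output (a hypothetical helper, not code from the repo):

```python
# Parse the output of `adb shell wm size` to fill in config.json automatically.
import re
import subprocess

def detect_resolution() -> tuple[int, int]:
    # Typical output: "Physical size: 1080x2340"
    out = subprocess.run(["adb", "shell", "wm", "size"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"(\d+)x(\d+)", out)
    if not match:
        raise RuntimeError(f"Could not parse resolution from: {out!r}")
    return int(match.group(1)), int(match.group(2))

if __name__ == "__main__":
    width, height = detect_resolution()
    print(f"screen_width={width}, screen_height={height}")
```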
Launch the Gradio interface:

```bash
python ui.py
```

Navigate to http://localhost:7860 and enter tasks like:
- "Open Chrome"
- "Search for weather in New York"
- "Open Settings and enable WiFi"
Or run a task directly from the command line:

```bash
python phone_agent.py "your task here"
```

Example:

```bash
python phone_agent.py "Open the camera app"
```

How it works:

- Screenshot Capture: Takes a screenshot of the phone via ADB
- Visual Analysis: Qwen3-VL analyzes the screen to understand UI elements
- Action Planning: Determines the best action to take (tap, swipe, type, etc.)
- Execution: Sends ADB commands to perform the action
- Repeat: Continues until task is complete or max cycles reached
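As a rough illustration of that cycle, here is a minimal sketch of the screenshot and execution steps driven over ADB (hypothetical helper names; the actual loop lives in phone_agent.py and the Qwen3-VL call in qwen_vl_agent.py):

```python
# Minimal sketch of one agent cycle: screenshot -> analysis -> ADB action.
import subprocess

def capture_screenshot(path: str = "screen.png") -> str:
    # `adb exec-out screencap -p` streams the current screen as a PNG
    png = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path

def execute(action: dict) -> None:
    # Translate a planned action into the matching ADB input command
    if action["action"] == "tap":
        subprocess.run(["adb", "shell", "input", "tap",
                        str(action["x"]), str(action["y"])], check=True)
    elif action["action"] == "swipe":
        subprocess.run(["adb", "shell", "input", "swipe",
                        str(action["x1"]), str(action["y1"]),
                        str(action["x2"]), str(action["y2"])], check=True)
    elif action["action"] == "type":
        # `input text` does not accept literal spaces; %s stands in for them
        subprocess.run(["adb", "shell", "input", "text",
                        action["text"].replace(" ", "%s")], check=True)

# One cycle would then look roughly like:
#   screenshot = capture_screenshot()
#   action = analyze_screen(screenshot, task)  # hypothetical Qwen3-VL wrapper
#   execute(action)
```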
Key settings in config.json:
- temperature: Model creativity (0.0-1.0, default: 0.1)
- max_tokens: Max response length (default: 512)
- step_delay: Wait time between actions in seconds (default: 1.5)
- max_retries: Maximum retry attempts (default: 3)
- use_flash_attention: Enable Flash Attention 2 for faster inference
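For reference, a config.json combining these defaults with the resolution fields shown earlier might look like the following (the repo's actual file may contain additional keys):

```json
{
  "screen_width": 1080,
  "screen_height": 2340,
  "temperature": 0.1,
  "max_tokens": 512,
  "step_delay": 1.5,
  "max_retries": 3,
  "use_flash_attention": true
}
```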
Device not detected:
- Ensure USB debugging is enabled
- Run `adb devices` to verify the connection
- Try `adb kill-server && adb start-server`
Wrong tap locations:
- Auto-detect resolution in Settings tab of UI
- Or manually verify with `adb shell wm size`
Model loading errors:
- Ensure you have sufficient VRAM
- Try the 8B model for lower memory requirements
- Check that transformers is installed from source
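To confirm the source install, you can check the reported version; a source build prints a development version (e.g. one ending in .dev0) rather than a plain release number:

```python
# Print the installed transformers version to confirm it came from source
import transformers
print(transformers.__version__)
```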
Out of memory:
- Use the 8B model instead of 30B
- Reduce `max_tokens` in config.json
- Close other applications using GPU memory
Apache License 2.0 - see LICENSE file for details