TeleEgo Benchmark

Live-in-the-Wild Personal Assistant Benchmark
🎥 Egocentric Video 🔊 Multi-Modal Audio ⏱️ Real-Time Processing 🧠 Long-Term Memory

Why TeleEgo?

Existing benchmarks evaluate AI assistants on curated, short-duration clips. TeleEgo challenges models with real-world, continuous streams spanning hours of daily activities across diverse scenarios.

🌍

Real-World Complexity

Authentic egocentric recordings from real users performing daily activities across work, study, social, shopping, health, and travel scenarios.

🎬

Streaming Protocol

Questions arrive dynamically throughout the video stream, mimicking real personal assistant interactions without pre-segmentation.

🧩

Multi-Modal Integration

Combines egocentric video, ambient audio, speech transcripts, and visual narrations requiring cross-modal reasoning.

Long-Horizon Memory

Tests a model's ability to retain and recall information across extended time spans, from seconds to hours.

Real-Time Constraints

Models must respond within decision windows, reflecting the temporal demands of live assistance.

📊

Verifiable Evidence

Every answer requires precise temporal and modality grounding, enabling auditable evaluation.

Leaderboard

Comprehensive evaluation results on TeleEgo benchmark. Models are tested on both Test Set A (public) and Test Set B (hidden).

🏆 Test Set A (Public Leaderboard)

Public test set with released ground truth. Models can be fine-tuned and optimized on this set.

| Rank | Model | Memory (Avg %) | Understanding (Avg %) | Cross-Memory (Avg %) | Overall (Avg %) | MPT (minutes) |
|------|-------|----------------|-----------------------|----------------------|-----------------|---------------|
| 🥇 1 | GPT-4o | 42.69 | 60.92 | 45.87 | 48.04 | 3.01 |
| 🥈 2 | Gemini-2.5-Pro | 42.23 | 57.98 | 40.26 | 46.35 | 2.76 |
| 3 | MiniCPM-o | 40.36 | 50.19 | 38.28 | 42.84 | 2.19 |
| 4 | Qwen2.5-VL-2.5 | 34.24 | 35.89 | 27.39 | 33.96 | 1.60 |
| 5 | Videochat-Online | 28.91 | 41.76 | 29.04 | 32.46 | 1.33 |
| 6 | Qwen2.5-Omni | 25.34 | 27.33 | 20.13 | 25.33 | 1.00 |

🔒 Test Set B (Hidden Test Set)

Hidden test set for unbiased evaluation. Ground truth is not released. Submit your model predictions for evaluation.

| Rank | Model | Memory (Avg %) | Understanding (Avg %) | Cross-Memory (Avg %) | Overall (Avg %) | MPT (minutes) |
|------|-------|----------------|-----------------------|----------------------|-----------------|---------------|
| 🥇 1 | GPT-4o | -- | -- | -- | -- | -- |
| 🥈 2 | Gemini-2.5-Pro | -- | -- | -- | -- | -- |
| 3 | MiniCPM-o | -- | -- | -- | -- | -- |
| 4 | Qwen2.5-VL-2.5 | -- | -- | -- | -- | -- |
| 5 | Videochat-Online | -- | -- | -- | -- | -- |
| 6 | Qwen2.5-Omni | -- | -- | -- | -- | -- |

🚀 Be among the first to submit your results! Submit now →

📊 Evaluation Metrics

⚡ Real-Time Accuracy (RTA)

Measures whether the model can produce a correct answer within the decision window (time interval between question arrival and deadline). This reflects practical usability in streaming scenarios.

RTA = (# correctly answered within decision window) / (# total questions)

Only the first correct output within the decision window receives credit.
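
For concreteness, a minimal sketch of the RTA computation is shown below. The record fields (correctness flag, question arrival time, answer time, deadline) are illustrative assumptions, not the official evaluation schema:

    from dataclasses import dataclass

    @dataclass
    class QARecord:
        # Illustrative fields, not the official TeleEgo schema.
        correct: bool        # first answer inside the decision window was correct
        arrival_s: float     # question arrival time in the stream (seconds)
        answered_s: float    # time of the model's first correct answer (seconds)
        deadline_s: float    # end of the decision window (seconds)

    def real_time_accuracy(records: list[QARecord]) -> float:
        """RTA = (# correctly answered within the decision window) / (# total questions)."""
        if not records:
            return 0.0
        hits = sum(
            1 for r in records
            if r.correct and r.arrival_s <= r.answered_s <= r.deadline_s
        )
        return hits / len(records)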

⏱️ Memory Persistence Time (MPT)

Among questions initially answered correctly, MPT measures how long the model can still recover the correct answer without re-exposure to the underlying evidence.

MPT = Average retention time across correctly answered items

Higher MPT indicates better long-horizon memory retention.
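
A minimal sketch of this average, assuming the retention time of each initially-correct item has already been measured (the measurement protocol and function name are illustrative, not the official one):

    def memory_persistence_time(retention_minutes: list[float]) -> float:
        """MPT = average retention time (minutes) over items that were initially
        answered correctly; retention_minutes[i] is how long after the first
        correct answer the model could still recover it without re-exposure
        to the evidence. Items never answered correctly are excluded upstream."""
        if not retention_minutes:
            return 0.0
        return sum(retention_minutes) / len(retention_minutes)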

Auditable Evidence Compliance

Every QA item includes time-stamped evidence spans and required-modality tags. Systems must localize verifiable support from the correct modalities.

🎯

Temporal Grounding

Models must identify the exact time intervals containing relevant evidence, preventing vague or hallucinated responses.

🔊

Modality Compliance

Each question specifies required modalities (video, speech, narration). Answers must draw from the correct sources.

📝

Evidence Overlap

Submitted evidence must overlap with annotated spans, enabling quantitative assessment of retrieval quality.
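
As a rough illustration of how such compliance could be scored, the sketch below checks the modality tag and the temporal overlap (IoU) between a predicted span and an annotated span; the span representation, threshold, and decision rule are assumptions rather than the official TeleEgo criterion:

    def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
        """Intersection-over-union of two time intervals given as (start_s, end_s)."""
        inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
        union = max(pred[1], gold[1]) - min(pred[0], gold[0])
        return inter / union if union > 0 else 0.0

    def evidence_compliant(pred_span, gold_span, pred_modality, required_modalities,
                           iou_threshold=0.3):
        """A predicted evidence span is treated as compliant if it comes from a
        required modality and overlaps the annotated span sufficiently
        (threshold value is illustrative)."""
        return (pred_modality in required_modalities
                and temporal_iou(pred_span, gold_span) >= iou_threshold)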

Dataset & Scenarios

💼

Work & Study

Meetings, focused learning, research, and collaborative tasks.

🏡

Lifestyle & Routines

Daily habits, home organization, wellness, and time management.

👥

Social Activities

Conversations, gatherings, group coordination, and shared events.

🎭

Outings & Culture

Dining, entertainment, museums, concerts, and city exploration.

Multi-Modal Streams

🎥

Egocentric Video

Continuous first-person visual stream captured from wearable cameras

🔊

Speech & Ambient Audio

Full audio track including conversations and environmental sounds

📝

Visual Narration

Human-authored descriptions of visual events with precise timestamps

💬

ASR Transcripts

Automatic speech recognition text aligned with audio stream

📤 Submit Your Results

We welcome submissions from all researchers! Choose one of the methods below to submit your model's evaluation results.

🔀 Method 1: GitHub Pull Request (Recommended)

This is the preferred method for official leaderboard submissions. Your results will be reviewed and added to the leaderboard.

📝 Steps:
  1. Fork our repository: github.com/Programmergg/TeleEgo
  2. Create a new directory under submissions/ with your model name:
    submissions/your-model-name-YYYY-MM-DD/
  3. Add your submission files (see requirements below)
  4. Create a pull request with the title: [Submission] Your Model Name
  5. Our team will review and merge your submission within 3-5 business days

📋 Required Files:

  • results.json - Your model's predictions (see format below)
  • metadata.json - Model information and configuration
  • README.md - Brief description of your model and approach
  • (Optional) paper.pdf - Technical paper if available

💬 Method 2: GitHub Issue

For quick submissions or if you're unable to create a pull request, you can submit via GitHub Issues.

📝 Steps:
  1. Go to our Issues page
  2. Click "New Issue" and select the "Submission" template
  3. Fill in all required information in the template
  4. Attach your results.json and metadata.json files
  5. Submit the issue and we'll process your submission

📧 Method 3: Email Submission

If GitHub is not accessible, you can email your submission directly to our team.

📝 Steps:
  1. Prepare your submission files as a ZIP archive
  2. Include all required files: results.json, metadata.json, README.md
  3. Email to: chengxuyuangg@gmail.com
  4. Use subject line: [TeleEgo Submission] Your Model Name
  5. We'll confirm receipt within 48 hours
⚠️ Note: Email submissions may take longer to process than GitHub submissions. Please allow 5-7 business days for review.

📄 Submission Format

results.json Format:

Your prediction results for each question in the test set.

{ "submission_info": { "model_name": "YourModel-v1.0", "submission_date": "2025-01-15", "team": "Your Organization" }, "predictions": [ { "question_id": "Q001", "answer": "Your model's answer", "response_time": 2.5, "evidence_spans": [ { "modality": ""video", "audio", "text"" } ] } ] }

metadata.json Format:

Information about your model and experimental setup.

{ "model_name": "YourModel-v1.0", "organization": "Your Organization", "authors": ["Author 1", "Author 2"], "contact_email": "your-email@example.com", "paper_url": "https://arXiv.org/abs/...", "code_url": "https://github.com/...", "model_description": "Brief description of your approach", "model_parameters": "7B", "training_data": "Description of training data used", "modalities_used": ["video", "audio", "text"], "inference_time": "Average time per question" }
⚠️ Important Guidelines:
  • Ensure your results are reproducible
  • Do not use the hidden test set (Test Set B) for training or hyperparameter tuning
  • Provide complete information in metadata.json
  • Include evidence spans for all predictions when possible
  • Follow the exact JSON format specified above
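
Before submitting, you may want to sanity-check that both files parse and contain the top-level fields shown above. The snippet below is an unofficial, minimal check; the key lists are taken from the examples in this section, not from an official validator:

    import json

    REQUIRED_RESULT_KEYS = {"submission_info", "predictions"}
    REQUIRED_PREDICTION_KEYS = {"question_id", "answer", "response_time"}
    REQUIRED_METADATA_KEYS = {"model_name", "organization", "contact_email", "modalities_used"}

    def check_submission(results_path="results.json", metadata_path="metadata.json"):
        """Lightweight structural check mirroring the formats documented above."""
        with open(results_path) as f:
            results = json.load(f)
        with open(metadata_path) as f:
            metadata = json.load(f)

        missing = REQUIRED_RESULT_KEYS - results.keys()
        assert not missing, f"results.json is missing keys: {missing}"
        for pred in results["predictions"]:
            missing = REQUIRED_PREDICTION_KEYS - pred.keys()
            assert not missing, f"prediction {pred.get('question_id')} is missing keys: {missing}"

        missing = REQUIRED_METADATA_KEYS - metadata.keys()
        assert not missing, f"metadata.json is missing keys: {missing}"
        print("Basic structure looks OK.")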

Need Help?

If you have questions about the submission process, please open an issue on our GitHub repository or email us at chengxuyuangg@gmail.com.

Citation

If you find TeleEgo useful in your research, please cite our paper:

@misc{yan2025teleegobenchmarkingegocentricai,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Jiaqi Yan and Ruilong Ren and Jingren Liu and Shuning Xu and Ling Wang and Yiheng Wang and Yun Wang and Long Zhang and Xiangyu Chen and Changzhi Sun and Jixiang Luo and Dell Zhang and Hao Sun and Chi Zhang and Xuelong Li},
  year={2025},
  eprint={2510.23981},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.23981},
}