VisionAssist Demo: Real-time object detection and distance estimation in action
VisionAssist is a groundbreaking AI-powered assistant that transforms the way visually impaired individuals interact with their environment. Using state-of-the-art computer vision and natural language processing, it provides real-time audio descriptions of surroundings, making the world more accessible and navigable.
🎯 Real-time Object Detection
- Powered by YOLOv8, one of the fastest and most accurate object detection models
- Detects the 80 COCO object classes in real time
- Smooth performance on standard hardware
📏 Precise Distance Estimation
- Distance estimated from the camera's focal length using similar-triangle geometry
- Real-time updates as objects move
- Distance reported in both inches and feet
🔊 Natural Audio Descriptions
- Crystal-clear text-to-speech descriptions
- Contextual information about object locations
- Adjustable speech rate and volume
🎤 Intuitive Voice Control
- Simple voice commands for system control
- Works in noisy environments
- Supports multiple accents and dialects
Requirements:
- Python 3.8 or higher
- Webcam or USB camera
- Microphone
- Internet connection (for speech recognition)
- 4GB RAM minimum (8GB recommended)
- NVIDIA GPU (optional, for better performance)
Clone the repository:
```bash
git clone https://github.com/yourusername/VisionAssist.git
cd VisionAssist
```
Set up a virtual environment (recommended):
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install required packages:
```bash
pip install -r requirements.txt
```
Download the YOLO model:
```bash
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt
```
Start the application:
```bash
python main.py
```
Voice Commands:
- Say "start" to begin object detection
- Say "stop" to pause detection
- Say "quit" to exit the application
Object Detection and Distance Estimation:
```python
import cv2
from ultralytics import YOLO

Known_width = 5.7      # Inches: real-world width of the reference object
Known_distance = 24.0  # Inches: reference distance used for calibration (illustrative value)

# Load the YOLO model
model = YOLO('yolov8n.pt')  # Load an official model
names = model.names         # Get class names
```
Text-to-Speech and Speech Recognition:
```python
import pyttsx3
import speech_recognition as sr

# Initialize the text-to-speech engine
engine = pyttsx3.init()

# Initialize the speech recognizer
recognizer = sr.Recognizer()
```
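The adjustable speech rate and volume mentioned in the features map directly to pyttsx3 engine properties. A minimal tuning sketch, with illustrative values:

```python
# Optional tuning: pyttsx3 exposes rate and volume as engine properties.
engine.setProperty('rate', 150)    # speaking rate in words per minute (illustrative)
engine.setProperty('volume', 0.9)  # volume from 0.0 to 1.0 (illustrative)
```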
Focal Length Calculation:
```python
def FocalLength(measured_distance, real_width, width_in_rf_image):
    # Similar triangles: recover the focal length (in pixels) from a reference
    # image where an object of known real width appears at a known distance.
    return (width_in_rf_image * measured_distance) / real_width
```
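A worked calibration example, with illustrative numbers (not from the original):

```python
# Hypothetical calibration: a 5.7-inch-wide reference object photographed
# 24 inches from the camera spans 320 pixels in the reference image.
focal_length = FocalLength(24.0, 5.7, 320)
print(focal_length)  # ~1347.4 pixels
```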
Distance Finder:
```python
def Distance_finder(Focal_Length, real_object_width, object_width_in_frame):
    # Invert the same similar-triangles relation to estimate distance
    # from the object's apparent width in the current frame.
    return (real_object_width * Focal_Length) / object_width_in_frame
```
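Continuing the illustrative calibration above, halving the apparent width doubles the estimated distance:

```python
# The same 5.7-inch object now spans 160 pixels in the live frame.
distance = Distance_finder(focal_length, 5.7, 160)
print(distance)  # 48.0 inches, i.e. twice the 24-inch calibration distance
```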
Generate Description:
```python
def generate_description(object_distance, class_id):
    return f"A {class_id} is at {object_distance} inches"
```
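The feature list promises distances in both inches and feet, while generate_description reports only inches. A minimal sketch of a variant that adds feet (a hypothetical helper, not part of the original code):

```python
def generate_description_with_feet(object_distance, class_id):
    # 12 inches per foot; one decimal place keeps the speech output short.
    feet = object_distance / 12
    return f"A {class_id} is at {object_distance} inches ({feet:.1f} feet)"
```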
Generate Speech:
```python
def generate_speech(description):
    engine.say(description)
    engine.runAndWait()
    print("Say 'start' to continue or 'stop' to end.")
    command = listen_for_command()
    return command
```
Listen for Command:
```python
def listen_for_command():
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        command = recognizer.recognize_google(audio).lower()
        print("Received command:", command)
        return command
    except sr.UnknownValueError:
        print("Sorry, could not understand audio.")
        return ""
    except sr.RequestError:
        print("Could not request results; check your internet connection.")
        return ""
```
Control Speech:
```python
def control_speech():
    while True:
        command = generate_speech("Description goes here")
        if command == "start":
            cap = cv2.VideoCapture(0)  # Camera object
            describe_objects(cap)
        elif command == "stop":
            print("Stopping speech generation...")
            engine.stop()
            break
        else:
            print("Sorry, could not understand the command.")
```
Describe Objects:
```python
def describe_objects(cap):
    Focal_length_found = None  # Computed once from the first frame
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        results = model(frame)  # Predict on an image
        result = results[0]
        if Focal_length_found is None:
            # Calibrate once, assuming the reference object spanned the full
            # frame width when photographed at Known_distance.
            Focal_length_found = FocalLength(Known_distance, Known_width, frame.shape[1])
        for box in result.boxes:
            cords = box.xyxy[0].tolist()
            # xyxy gives corner coordinates (x1, y1, x2, y2), not width/height
            x1, y1, x2, y2 = [round(c) for c in cords]
            class_id = result.names[int(box.cls[0].item())]
            object_width_in_frame = x2 - x1
            object_distance = Distance_finder(Focal_length_found, Known_width, object_width_in_frame)
            object_distance = round(object_distance, 2)
            description = generate_description(object_distance, class_id)
            command = generate_speech(description)
            if command == "stop":
                cap.release()
                cv2.destroyAllWindows()
                return
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"Object: {class_id}", (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
        cv2.imshow('Object Detection', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
```
Main Function:

```python
def main():
    control_speech()

if __name__ == "__main__":
    main()
```
Common issues and solutions:
Camera not detected:
```python
# Try changing the camera index
cv2.VideoCapture(1)  # Instead of 0
```
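If you're not sure which index your camera uses, a small probe loop (sketch) can list the ones that open:

```python
import cv2

# Probe the first few device indices and report any that open successfully.
for idx in range(3):
    cap = cv2.VideoCapture(idx)
    if cap.isOpened():
        print(f"Camera found at index {idx}")
    cap.release()
```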
Speech recognition errors:
- Ensure stable internet connection
- Check microphone permissions
- Try reducing background noise
Performance issues:
- Close other GPU-intensive applications
- Reduce the capture resolution (see the sketch below)
- Use a smaller, faster YOLO variant (yolov8n is already the smallest official model)
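A minimal sketch of both tweaks; the resolution values are illustrative, and the project's code does not set any of these:

```python
import cv2
from ultralytics import YOLO

# Request a lower capture resolution; the driver may round to a supported mode.
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

# YOLOv8 variants trade speed for accuracy: n (nano) < s < m < l < x.
model = YOLO('yolov8n.pt')  # nano is the fastest official variant
```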
We love your input! Check out our Contributing Guidelines to get started.
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this project useful, please consider giving it a star ⭐️
For any questions or support, please open an issue or contact us at your-email@example.com