Human-Centric Computer Vision for the Artificial Intelligence of Things

Doctoral Candidate Name: 
Christopher Neff
Program: 
Electrical and Computer Engineering
Abstract: 

This dissertation presents a comprehensive exploration of innovative approaches and systems at the intersection of edge computing, deep learning, and real-time video analytics, with a focus on real-world computer vision for the Artificial Intelligence of Things (AIoT). The research comprises four distinct articles, each contributing to the advancement of AIoT systems, intelligent surveillance, lightweight human pose estimation, and real-world domain adaptation for person re-identification.

The first article, REVAMP2T: Real-time Edge Video Analytics for Multicamera Privacy-aware Pedestrian Tracking, introduces REVAMP2T, an integrated end-to-end IoT system for decentralized situational awareness with privacy built in. REVAMP2T presents novel algorithmic and system constructs to push deep learning and video analytics next to IoT devices (i.e., video cameras). On the algorithm side, REVAMP2T proposes a unified, integrated computer vision pipeline for detection, re-identification, and tracking across multiple cameras without the need to store the streaming data. At the same time, it avoids facial recognition, tracking and re-identifying pedestrians based on their key features at runtime. On the IoT system side, REVAMP2T provides infrastructure to maximize hardware utilization on the edge, orchestrates global communication, and provides system-wide re-identification, without the use of personally identifiable information, across a distributed IoT network. For results and evaluation, this article also proposes a new metric, Accuracy•Efficiency (Æ), for holistic evaluation of AIoT systems for real-time video analytics based on accuracy, performance, and power efficiency. REVAMP2T outperforms the current state-of-the-art by as much as a thirteen-fold improvement in Æ.

The second article, Ancilia: Scalable Intelligent Video Surveillance for the Artificial Intelligence of Things, presents an end-to-end scalable intelligent video surveillance system tailored for the Artificial Intelligence of Things. Ancilia brings state-of-the-art artificial intelligence to real-world surveillance applications while respecting ethical concerns and performing high-level cognitive tasks in real time. Ancilia aims to revolutionize the surveillance landscape, bringing more effective, intelligent, and equitable security to the field and resulting in safer, more secure communities without requiring people to compromise their right to privacy.

The third article, EfficientHRNet: Efficient and Scalable High-Resolution Networks for Real-Time Multi-Person 2D Human Pose Estimation, addresses the increasing demand for lightweight multi-person pose estimation, a vital component of emerging smart IoT applications. Existing algorithms tend to have large model sizes and intense computational requirements, making them ill-suited for real-time applications and for deployment on resource-constrained hardware, while lightweight, real-time approaches are exceedingly rare and come at the cost of inferior accuracy. This article presents EfficientHRNet, a family of lightweight multi-person human pose estimators able to perform in real time on resource-constrained devices. By unifying recent advances in model scaling with high-resolution feature representations, EfficientHRNet creates highly accurate models while reducing computation enough to achieve real-time performance. The largest model comes within 4.4% accuracy of the current state-of-the-art while having 1/3 the model size and 1/6 the computation, achieving 23 FPS on an Nvidia Jetson Xavier. Compared to the top real-time approach, EfficientHRNet increases accuracy by 22% while achieving similar FPS at a fraction of the power. At every level, EfficientHRNet proves more computationally efficient than other bottom-up 2D human pose estimation approaches while achieving highly competitive accuracy.

The final article introduces the concept of R2OUDA: Real-world Real-time Online Unsupervised Domain Adaptation for Person Re-identification. Following the popularity of Unsupervised Domain Adaptation (UDA) in person re-identification, the recently proposed setting of Online Unsupervised Domain Adaptation (OUDA) attempts to bridge the gap toward practical applications by introducing the consideration of streaming data. However, this still falls short of truly representing real-world applications. The R2OUDA setting sets the stage for true real-world real-time OUDA, bringing to light four major limitations found in real-world applications that are often neglected in current research: system-generated person images, subset distribution selection, time-based data stream segmentation, and a segment-based time constraint. To address all aspects of this new setting, the article further proposes Real-World Real-Time Online Streaming Mutual Mean-Teaching (R2MMT), a novel multi-camera system for real-world person re-identification. Using a popular person re-identification dataset, R2MMT was employed to construct over 100 data subsets and train more than 3,000 models, exploring the breadth of the R2OUDA setting to understand the training-time and accuracy trade-offs and limitations for real-world applications. R2MMT, a real-world system able to respect the strict constraints of the proposed R2OUDA setting, achieves accuracies within 0.1% of comparable OUDA methods that cannot be applied directly to real-world applications.

Collectively, this dissertation contributes to the evolution of intelligent surveillance, lightweight human pose estimation, edge-based video analytics, and real-time unsupervised domain adaptation, advancing the capabilities of real-world computer vision for AIoT applications.

Defense Date and Time: 
Friday, October 27, 2023 - 10:00am
Defense Location: 
EPIC 3344
Committee Chair's Name: 
Dr. Hamed Tabkhi
Committee Members: 
Dr. Arun Ravindran, Dr. Ahmed Arafa, Dr. Vasily Astratov