Video synopsis

Video synopsis example: 9 hours of activity summarized in a 20-second simultaneous presentation of multiple objects and activities that occurred at different times.

Video synopsis is a method for automatically synthesizing a short, informative summary of a video. Unlike traditional video summarization, the synopsis is not composed simply of frames selected from the original video. [1] The algorithm detects, tracks and analyzes moving objects (also called events) and stores them in a database of objects and activities. [2] The final output is a new, short video clip in which objects and activities that originally occurred at different times are displayed simultaneously, so as to convey the information in the shortest possible time. Video synopsis has specific applications in the fields of video analytics and video surveillance where, despite technological advances and the growing deployment of CCTV (closed-circuit television) cameras, [3] viewing and analyzing recorded footage remains a costly, labor-intensive and time-intensive task.

Technology overview

Video synopsis combines a visual summary of stored video with an indexing mechanism.

When a summary is required, all objects from the target period are collected and shifted in time to create a much shorter synopsis video showing maximum activity. The synopsis clip is generated in real time, with objects and activities that originally occurred at different times displayed simultaneously. [4]

Tube packing – Schematic example: Creating the video summary by re-timing the space-time tubes (X represents the 2-dimensional XY axis of each frame).

The process begins by detecting and tracking objects of interest. Each object is represented as a "tube" in the space-time volume spanned by the video frames. Objects are detected and stored in the database in approximately real time.
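
The tube representation can be illustrated with a minimal sketch, assuming OpenCV is available: moving objects are found by background subtraction and each one is accumulated, frame by frame, into a list of bounding boxes forming its "tube". The Tube class, the area threshold and the nearest-centroid association rule below are illustrative assumptions, not the method of any particular product.

```python
# Minimal sketch: extract moving objects as space-time "tubes" via background
# subtraction. The Tube structure and the greedy nearest-centroid association
# are illustrative assumptions, not a specific product's algorithm.
import cv2


class Tube:
    """A space-time tube: per-frame bounding boxes of one tracked object."""
    def __init__(self, start_frame, box):
        self.start_frame = start_frame
        self.boxes = [box]                      # (x, y, w, h) per consecutive frame

    def last_centroid(self):
        x, y, w, h = self.boxes[-1]
        return (x + w / 2.0, y + h / 2.0)


def extract_tubes(video_path, min_area=500, max_jump=50.0):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    open_tubes, finished, frame_idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]   # drop shadow pixels
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]

        matched = set()
        for box in boxes:
            cx, cy = box[0] + box[2] / 2.0, box[1] + box[3] / 2.0
            # Greedy association: extend the nearest unmatched open tube, else start a new one.
            best, best_d = None, max_jump
            for t in open_tubes:
                tx, ty = t.last_centroid()
                d = ((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5
                if d < best_d and id(t) not in matched:
                    best, best_d = t, d
            if best is not None:
                best.boxes.append(box)
                matched.add(id(best))
            else:
                open_tubes.append(Tube(frame_idx, box))

        # Close tubes that were not extended in this frame.
        finished += [t for t in open_tubes if t.start_frame + len(t.boxes) <= frame_idx]
        open_tubes = [t for t in open_tubes if t.start_frame + len(t.boxes) > frame_idx]
        frame_idx += 1
    cap.release()
    return finished + open_tubes
```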

Following a request to summarize a time period, all objects from the desired period are extracted from the database and indexed to create a much shorter summary video containing maximum activity.

Real-time rendering is used to generate the summary video after the objects have been re-timed, allowing end-user control over object/event density.
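
The packing step can likewise be sketched under simple assumptions: each tube is taken to be its list of per-frame bounding boxes, and a greedy pass assigns every tube a new start frame in the synopsis so that its spatio-temporal overlap (collision) with tubes already placed stays low. The longest-first ordering, the candidate-start grid and the box-overlap collision measure are illustrative choices rather than the algorithm of any specific system.

```python
# Minimal sketch of tube "packing": re-time tubes (lists of per-frame
# (x, y, w, h) boxes) so that objects from different times play simultaneously
# with little overlap. Ordering, candidate grid and collision measure are assumptions.

def overlap(a, b):
    """Intersection area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return ix * iy


def collision(tube, start, placed):
    """Total box overlap between `tube` shifted to `start` and already-placed tubes."""
    total = 0
    for other, other_start in placed:
        for i, box in enumerate(tube):
            j = start + i - other_start          # corresponding frame inside `other`
            if 0 <= j < len(other):
                total += overlap(box, other[j])
    return total


def pack_tubes(tubes, synopsis_length, step=5):
    """Greedily assign each tube a start frame in the synopsis, longest tubes first."""
    placed = []
    for tube in sorted(tubes, key=len, reverse=True):
        latest = max(0, synopsis_length - len(tube))
        best = min(range(0, latest + 1, step),
                   key=lambda s: collision(tube, s, placed))
        placed.append((tube, best))
    return placed                                # list of (tube, synopsis_start_frame)
```

Rendering then overlays each shifted tube onto a background image at its assigned start frame; varying synopsis_length, or weighting the collision term, is one simple way to expose the end-user density control mentioned above.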

Video synopsis technology was invented by Prof. Shmuel Peleg [5] of The Hebrew University of Jerusalem, Israel, and is developed under commercial license by BriefCam, Ltd. [6] BriefCam licensed the technology from Yissum, which owns the patents registered for it. In May 2018, BriefCam was acquired by the Japanese digital imaging company Canon Inc. for an estimated $90 million. [7] Investors in the company include Motorola Solutions Venture Capital, Aviv Venture Capital, and OurCrowd. [8]

Another video synopsis example

Recent advances

Recent advances in the field of video synopsis include methods that focus on collecting key points (or key frames) from the long, uncut video and presenting them as a chain of key events that summarize it; this is only one of many approaches used in the modern literature. [9] More recently, such event-driven methods have moved toward correlating objects across frames in a more semantically meaningful way, an approach that has been called story-driven summarization. These methods have been shown to work well in egocentric [10] settings, where the video is essentially a point-of-view recording of a single person or a group of people.
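
As a hedged illustration of the key-frame idea only (not of any specific published method), the sketch below clusters colour histograms of sampled frames and keeps, for each cluster, the frame closest to its centre; the HSV-histogram features, the k-means clustering and the sampling stride are simplifying assumptions.

```python
# Minimal sketch of keyframe-based summarization: cluster colour histograms of
# sampled frames and keep one representative frame per cluster. The feature
# choice (HSV histograms), k-means and the stride are illustrative assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans


def keyframes(video_path, n_keyframes=10, stride=15):
    cap = cv2.VideoCapture(video_path)
    feats, indices, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
            feats.append(cv2.normalize(hist, None).flatten())
            indices.append(idx)
        idx += 1
    cap.release()

    feats = np.array(feats)
    k = min(n_keyframes, len(feats))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        chosen.append(indices[members[np.argmin(dists)]])   # frame nearest the centroid
    return sorted(chosen)
```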

Classification

Video synopsis techniques share a number of standardized properties, which can be stated as follows: (a) the synopsis should contain the maximum activity with the least redundancy; (b) the chronological order and spatial consistency of objects in space and time should be preserved; (c) the resulting synopsis video must contain minimal collisions between objects; and (d) the synopsis video must be smooth and allow viewing without losing the region of interest. [11] On this basis, techniques can be grouped into the following categories; a schematic energy formulation capturing these trade-offs is given after the list:

  1. Keyframe-Based Synopsis
  2. Object-Based Synopsis
  3. Action-Based Synopsis
  4. Collision Graph-Based Synopsis
  5. Abnormal Content-Based Synopsis
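
These competing requirements are often cast as an energy minimization over the temporal mapping M that assigns each tube b a time-shifted copy b̂. A schematic form of the objective, in the spirit of the formulation of Pritch et al. [4] (the exact terms and weights vary between papers), is

  E(M) = \sum_{b \in B} E_a(\hat{b}) + \sum_{b, b' \in B} \big( \alpha\, E_t(\hat{b}, \hat{b}') + \beta\, E_c(\hat{b}, \hat{b}') \big)

where E_a penalizes activity that is lost or truncated in the synopsis, E_t penalizes violations of chronological order, E_c penalizes collisions (spatio-temporal overlap) between shifted tubes, and α, β weight the trade-off.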

Related Research Articles

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

Video tracking is the process of locating a moving object over time using a camera. It has a variety of uses, some of which are: human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, medical imaging and video editing. Video tracking can be a time-consuming process due to the amount of data that is contained in video. Adding further to the complexity is the possible need to use object recognition techniques for tracking, a challenging problem in its own right.

In computer vision, the bag-of-words model, sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.

A lifelog is a personal record of one's daily life in a varying amount of detail, for a variety of purposes. The record contains a comprehensive dataset of a human's activities. The data could be used to increase knowledge about how people live their lives. In recent years, some lifelog data has been automatically captured by wearable technology or mobile devices. People who keep lifelogs about themselves are known as lifeloggers.

ActionShot is a method of capturing an object in action and displaying it in a single image with multiple sequential appearances of the object.

Route panorama is a continuous 2D image that includes all the scenes visible from a route; it first appeared in Zheng and Tsuji's work on panoramic views in 1990.

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques label pixels that share certain characteristics at a particular time. Here, pixels are segmented according to their relative movement over a period of time, i.e. over the video sequence.

Subhasis Chaudhuri is an Indian electrical engineer and the director of the Indian Institute of Technology Bombay. He is a former K. N. Bajaj Chair Professor of the Department of Electrical Engineering of IIT Bombay. He is known for his pioneering studies on computer vision and is an elected fellow of all three major Indian science academies, viz. the National Academy of Sciences, India, the Indian Academy of Sciences, and the Indian National Science Academy. He is also a fellow of the Institute of Electrical and Electronics Engineers and the Indian National Academy of Engineering. The Council of Scientific and Industrial Research, the apex agency of the Government of India for scientific research, awarded him the Shanti Swarup Bhatnagar Prize for Science and Technology, one of the highest Indian science awards, in 2004 for his contributions to engineering sciences.

Egocentric vision or first-person vision is a sub-field of computer vision that entails analyzing images and videos captured by a wearable camera, which is typically worn on the head or on the chest and naturally approximates the visual field of the camera wearer. Consequently, visual data capture the part of the scene on which the user focuses to carry out the task at hand and offer a valuable perspective to understand the user's activities and their context in a naturalistic setting.

Moving object detection is a technique used in computer vision and image processing. Multiple consecutive frames from a video are compared by various methods to determine if any moving object is detected.

In computer vision, object co-segmentation is a special case of image segmentation, which is defined as jointly segmenting semantically similar objects in multiple images or video frames.

Kristen Lorraine Grauman is a Professor of Computer Science at the University of Texas at Austin on leave as a research scientist at Facebook AI Research (FAIR). She works on computer vision and machine learning.

Jason Joseph Corso is Co-Founder / CEO of the computer vision startup Voxel51 and a Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan.

Jiebo Luo is a Chinese-American computer scientist, the Albert Arendt Hopeman Professor of Engineering and Professor of Computer Science at the University of Rochester. He is interested in artificial intelligence, data science and computer vision.

Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.

References

  1. Mademlis, Ioannis; Tefas, Anastasios; Pitas, Ioannis (2018). "A salient dictionary learning framework for activity video summarization via key-frame extraction" (PDF). Information Sciences. Elsevier. 432: 319–331. doi:10.1016/j.ins.2017.12.020 . Retrieved 4 December 2022.
  2. Y. Pritch, S. Ratovitch, A. Hendel, and S. Peleg, Clustered Synopsis of Surveillance Video, 6th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (AVSS'09), Genoa, Italy, Sept. 2-4, 2009
  3. Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg, Webcam Synopsis: Peeking Around the World, ICCV'07, October 2007. 8p.
  4. Y. Pritch, A. Rav-Acha, and S. Peleg, Nonchronological Video Synopsis and Indexing, IEEE Trans. PAMI, Vol 30, No 11, Nov. 2008, pp. 1971-1984.
  5. A. Rav-Acha, Y. Pritch, and S. Peleg, Making a Long Video Short: Dynamic Video Synopsis, CVPR'06, June 2006, pp. 435-441.
  6. S. Peleg, Y. Caspi, BriefCam White Paper
  7. Yablonko, Yasmin (9 May 2018). "Canon buys Israeli video solutions co BriefCam for $90m". Globes (in Hebrew). Retrieved 2018-12-05.
  8. CB Insights. "BriefCam Funding & Investors - CB Insights". www.cbinsights.com. Retrieved 2018-12-05.
  9. Muhammad Ajmal, Muhammad Husnain Ashraf, Muhammad Shakir, Yasir Abbas, Faiz Ali Shah, Video Summarization: Techniques and Classification
  10. Zheng Lu, Kristen Grauman Story-driven summarization for egocentric video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition - 2013.
  11. Ingle, P.Y. and Kim, Y.G., 2023. Video Synopsis Algorithms and Framework: A Survey and Comparative Evaluation. Systems, 11(2), p.108.

Patents