Shot transition detection

Last updated September 11, 2024

Shot transition detection (or simply shot detection), also called cut detection, is a field of research in video processing. Its subject is the automated detection of transitions between shots in digital video, with the purpose of temporally segmenting videos.^[1]

Use

Shot transition detection is used to split up a film into basic temporal units called shots; a shot is a series of interrelated consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space.^[2]

This operation is of great use in video post-production software. It is also a fundamental step in automated indexing and in content-based video retrieval and summarization applications, which provide efficient access to large video archives. For example, an application may choose a representative picture from each scene to create a visual overview of the whole film, and by processing such indexes a search engine can answer queries like "show me all films where there's a scene with a lion in it."

Cut detection can do nothing that a human editor couldn't do manually, but it saves time. Furthermore, the growing use of digital video, and consequently the growing importance of the indexing applications mentioned above, has made automatic cut detection increasingly important.

Basic technical terms

Figure (Dissolve.jpg): The dissolve blends one shot gradually into another with a transparency effect.

In simple terms, cut detection is about finding the positions in a video at which one scene is replaced by another with different visual content. Technically speaking, the following terms are used:

A digital video consists of frames that are presented to the viewer's eye in rapid succession to create the impression of movement. "Digital" in this context means both that a single frame consists of pixels and that the data is stored in binary form, so that it can be processed with a computer. Each frame within a digital video can be uniquely identified by its frame index, a serial number.

A shot is a sequence of frames shot uninterruptedly by one camera. Several film transitions are commonly used in film editing to juxtapose adjacent shots; in the context of shot transition detection they are usually grouped into two types:^[3]

Abrupt Transitions - This is a sudden transition from one shot to another, i.e. one frame belongs to the first shot and the next frame belongs to the second shot. They are also known as hard cuts or simply cuts.
Gradual Transitions - In this kind of transition the two shots are combined using chromatic, spatial or spatial-chromatic effects which gradually replace one shot by another. These are also often known as soft transitions and can be of various types, e.g. wipes, dissolves and fades.

"Detecting a cut" means that the position of a cut is determined; more precisely, a hard cut is reported as "hard cut between frame i and frame i+1", a soft cut as "soft cut from frame i to frame j".

A transition that is detected correctly is called a hit, a cut that is present but was not detected is called a missed hit, and a position at which the software assumes a cut where none is actually present is called a false hit.
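The three outcomes defined above can be illustrated with a small sketch. It assumes, purely for illustration, that hard cuts are represented by the frame index i of "hard cut between frame i and frame i+1" and that a detection only counts as a hit when it matches a true cut exactly; the function name `classify_detections` is hypothetical.

```python
def classify_detections(true_cuts, detected_cuts):
    """Split detections into hits, missed hits and false hits.

    Both arguments are iterables of frame indices i, each meaning
    "hard cut between frame i and frame i+1" (an assumed representation).
    """
    true_set = set(true_cuts)
    detected_set = set(detected_cuts)
    hits = sorted(true_set & detected_set)        # detected and really there
    missed = sorted(true_set - detected_set)      # really there, not detected
    false_hits = sorted(detected_set - true_set)  # detected, not really there
    return hits, missed, false_hits


hits, missed, false_hits = classify_detections([10, 42, 97], [10, 50, 97])
print(hits, missed, false_hits)  # [10, 97] [42] [50]
```

Real evaluations often allow a small tolerance around each true position, especially for soft cuts that span several frames; exact matching is used here only to keep the sketch short.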

An introduction to film editing and an exhaustive list of shot transition techniques can be found in the article on film editing.

Vastness of the problem

Although cut detection appears to be a simple task for a human being, it is a non-trivial task for computers. Cut detection would be a trivial problem if each frame of a video were enriched with additional information about when and by which camera it was taken. Possibly no algorithm for cut detection will ever be able to detect all cuts with certainty, unless it is provided with powerful artificial intelligence.^[citation needed]

While most algorithms achieve good results with hard cuts, many fail to recognize soft cuts. Hard cuts usually go together with sudden and extensive changes in the visual content, while soft cuts feature slow and gradual changes. A human being can compensate for this lack of visual diversity by understanding the meaning of a scene. While a computer assumes a black line wiping a shot away to be "just another regular object moving slowly through the on-going scene", a person understands that the scene ends and is replaced by a black screen.

Methods

Each method for cut detection works on a two-phase principle:

Scoring – Each pair of consecutive frames of a digital video is given a certain score that represents the similarity/dissimilarity between them.
Decision – All scores calculated previously are evaluated and a cut is detected if the score is considered high.

This principle is error-prone. First, because even a score that exceeds the threshold only slightly produces a hit, phase one must spread its values widely to maximize the average difference between the scores for "cut" and "no cut". Second, the threshold must be chosen with care; useful values can usually be obtained with statistical methods.
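The two-phase principle can be sketched as follows. This is a minimal illustration, not a reference implementation: frames are assumed to be flat lists of pixel intensities, and `detect_cuts` and `mean_abs_diff` are hypothetical names introduced here.

```python
def detect_cuts(frames, score, threshold):
    """Two-phase detection: score consecutive frame pairs, then decide."""
    # Phase 1 (scoring): one dissimilarity score per pair (i, i+1).
    scores = [score(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    # Phase 2 (decision): report index i when the score exceeds the threshold.
    return [i for i, s in enumerate(scores) if s > threshold]


def mean_abs_diff(a, b):
    """Placeholder score: mean absolute difference of two flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Three dark frames followed by two bright ones: a cut between frames 2 and 3.
frames = [[10, 10, 10, 10]] * 3 + [[200, 200, 200, 200]] * 2
print(detect_cuts(frames, mean_abs_diff, threshold=50))  # [2]
```

Keeping the scoring function a parameter mirrors the separation of the two phases: the score can be swapped without touching the decision step.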

Scoring

There are many possible scores used to assess the differences in the visual content; some of the most common are:

Sum of absolute differences (SAD). This is both the most obvious and the simplest algorithm: the two consecutive frames are compared pixel by pixel, summing up the absolute values of the differences between each pair of corresponding pixels. The result is a non-negative number that is used as the score. SAD reacts very sensitively to even minor changes within a scene: fast movements of the camera, explosions or the simple switching on of a light in a previously dark scene result in false hits. On the other hand, SAD hardly reacts to soft cuts at all. Nevertheless, SAD is often used to produce a basic set of "possible hits", as it detects visible hard cuts with very high probability.
Histogram differences (HD). Histogram differences is very similar to sum of absolute differences. The difference is that HD computes the difference between the histograms of two consecutive frames; a histogram is a table that contains, for each color within a frame, the number of pixels that have that color. HD is not as sensitive to minor changes within a scene as SAD and thus produces fewer false hits. One major problem of HD is that two images can have exactly the same histogram while the content shown differs extremely, e.g. a picture of the sea and a beach can have the same histogram as one of a corn field and the sky. HD offers no guarantee that it recognizes hard cuts.
Edge change ratio (ECR). The ECR attempts to compare the actual content of two frames. It transforms both frames into edge pictures, i.e. it extracts the probable outlines of objects within the pictures (see edge detection for details). Afterwards it compares these edge pictures using dilation to compute a probability that the second frame contains the same objects as the first frame. The ECR is one of the best performing algorithms for scoring. It reacts very sensitively to hard cuts and can detect many soft cuts by nature. In its basic form, however, even ECR cannot detect soft cuts such as wipes, as it treats the objects fading in as regular objects moving through the scene. Yet, ECR can be extended manually to recognize special forms of soft cuts.
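The difference between SAD and HD described above can be made concrete with a small sketch. It assumes grayscale frames stored as lists of rows of integer intensities in 0..255 and a coarse 4-bin histogram; these choices and the function names are illustrative assumptions, not part of any particular implementation.

```python
def sad(frame_a, frame_b):
    """Sum of absolute differences between two equally sized grayscale frames."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )


def histogram(frame, bins=4, max_value=256):
    """Count the pixels that fall into each intensity bin."""
    counts = [0] * bins
    for row in frame:
        for value in row:
            counts[value * bins // max_value] += 1
    return counts


def histogram_difference(frame_a, frame_b, bins=4):
    """Sum of absolute differences between the two frames' histograms."""
    hist_a = histogram(frame_a, bins)
    hist_b = histogram(frame_b, bins)
    return sum(abs(x - y) for x, y in zip(hist_a, hist_b))


# Same pixel values, different arrangement: SAD is large, HD is zero.
left_bright = [[255, 0], [255, 0]]
right_bright = [[0, 255], [0, 255]]
print(sad(left_bright, right_bright))                   # 1020
print(histogram_difference(left_bright, right_bright))  # 0
```

The two toy frames show the weakness of HD noted in the list: the content differs completely, yet the histograms are identical, so HD reports no change at all.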

Finally, a combination of two or more of these scores can improve the performance.
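For the edge change ratio described above, the following sketch shows the comparison step only. It assumes that edge detection has already been done and that each edge picture is given as a set of (row, col) pixel coordinates. It follows one commonly cited formulation, in which the score is the larger of the fraction of entering and the fraction of exiting edge pixels; the handling of edge-free frames and all names are assumptions made for this illustration.

```python
def dilate(edge_pixels, radius):
    """Grow a set of (row, col) edge pixels by `radius` in every direction."""
    return {
        (r + dr, c + dc)
        for r, c in edge_pixels
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
    }


def edge_change_ratio(edges_prev, edges_curr, radius=1):
    """ECR between two edge maps given as sets of (row, col) pixels."""
    if not edges_prev or not edges_curr:
        # Assumption: two edge-free frames count as "no change",
        # one edge-free frame next to one with edges as "maximal change".
        return 0.0 if not edges_prev and not edges_curr else 1.0
    # Entering edges: present now, not near any previous edge.
    entering = len(edges_curr - dilate(edges_prev, radius))
    # Exiting edges: present before, not near any current edge.
    exiting = len(edges_prev - dilate(edges_curr, radius))
    return max(entering / len(edges_curr), exiting / len(edges_prev))


vertical_line = {(r, 2) for r in range(5)}
shifted_line = {(r, 3) for r in range(5)}  # same object, moved by one pixel
far_line = {(r, 9) for r in range(5)}      # unrelated content

print(edge_change_ratio(vertical_line, shifted_line))  # 0.0
print(edge_change_ratio(vertical_line, far_line))      # 1.0
```

The dilation radius is what makes the score tolerant of small object or camera motion: an outline that moves by no more than `radius` pixels still counts as the same edge.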

Decision

In the decision phase the following approaches are usually used:

Fixed Threshold – In this approach, the scores are compared to a threshold which was set previously and if the score is higher than the threshold a cut is declared.
Adaptive Threshold – In this approach, the scores are compared to a threshold that takes other scores in the video into account, so that the threshold adapts to the properties of the current video. As in the previous case, if the score is higher than the corresponding threshold a cut is declared.
Machine Learning - Machine learning techniques can also be applied to the decision process.
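The first two approaches can be contrasted in a short sketch. The adaptive rule used here (local mean plus k standard deviations over a sliding window of neighbouring scores) is only one possible choice and is an assumption of this example, as are the parameter values `window=3` and `k=3.0` and the made-up score sequence.

```python
from statistics import mean, stdev


def fixed_threshold(scores, threshold):
    """Declare a cut wherever the score exceeds a preset threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def adaptive_threshold(scores, window=3, k=3.0):
    """Declare a cut where a score stands out from its neighbours.

    The local threshold is mean + k * stdev of up to `window` scores on
    each side, the score itself excluded. Illustrative values only.
    """
    cuts = []
    for i, s in enumerate(scores):
        neighbours = scores[max(0, i - window):i] + scores[i + 1:i + 1 + window]
        if len(neighbours) < 2:
            continue  # not enough context to estimate a spread
        if s > mean(neighbours) + k * stdev(neighbours):
            cuts.append(i)
    return cuts


# A quiet passage with a cut at index 4, then a busy passage
# (e.g. fast camera motion) with a cut at index 13.
scores = [2, 3, 2, 3, 30, 3, 2, 3, 2, 40, 42, 41, 43, 120, 42, 40, 43, 41]
print(fixed_threshold(scores, 25))   # [4, 9, 10, 11, 12, 13, 14, 15, 16, 17]
print(fixed_threshold(scores, 100))  # [13]
print(adaptive_threshold(scores))    # [4, 13]
```

In this toy sequence no single fixed value separates both cuts from the busy passage: a low threshold floods the second half with false hits, a high one misses the first cut, while the locally adapted threshold finds both.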

Cost

All of the above algorithms complete in O(n) — that is to say they run in linear time — where n is the number of frames in the input video. The algorithms differ in a constant factor that is determined mostly by the image resolution of the video.

Measures for quality

Usually the following three measures are used to measure the quality of a cut detection algorithm:

Recall is the probability that an existing cut will be detected:

V = \frac{C}{C + M}

Precision is the probability that an assumed cut is in fact a cut:

P = \frac{C}{C + F}

F1 is a combined measure that takes a high value if, and only if, both precision and recall are high:

F1 = \frac{2 \cdot P \cdot V}{P + V}

The symbols stand for: C, the number of correctly detected cuts ("correct hits"); M, the number of cuts that were not detected ("missed hits"); and F, the number of falsely detected cuts ("false hits"). All of these measures yield values between 0 and 1. The basic rule is: the higher the value, the better the algorithm performs.
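As a worked example of the three measures, the sketch below computes them from the counts C, M and F defined above. Returning 0.0 when a denominator is zero is a convention assumed here for convenience; the formulas themselves leave that case undefined.

```python
def recall(c, m):
    """V = C / (C + M): share of existing cuts that were detected."""
    return c / (c + m) if c + m else 0.0


def precision(c, f):
    """P = C / (C + F): share of reported cuts that are real cuts."""
    return c / (c + f) if c + f else 0.0


def f1(p, v):
    """F1 = 2 * P * V / (P + V): high only if both P and V are high."""
    return 2 * p * v / (p + v) if p + v else 0.0


# 6 correct hits, 2 missed hits, 6 false hits.
c, m, f = 6, 2, 6
v = recall(c, m)
p = precision(c, f)
print(v, p, f1(p, v))  # 0.75 0.5 0.6
```

Here the detector finds three quarters of the true cuts, but half of what it reports is wrong, and F1 lands between the two values, closer to the weaker one.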

Benchmarks

Comparison of benchmarks
Benchmark	Videos	Hours	Frames	Shot transitions	Participants	Years
TRECVid	12 - 42	4.8 - 7.5	545,068 - 744,604	2090 - 4806	57	2001 - 2007
MSU SBD	31	21.45	1,900,000+	10883	7	2020 - 2021

TRECVid SBD Benchmark 2001-2007^[4]

Automatic shot transition detection was one of the tracks of activity within the annual TRECVid benchmarking exercise from 2001 to 2007. There were 57 algorithms from different research groups. F scores were calculated for each algorithm on a dataset that was extended annually.

Top research groups
Group	F score	Processing speed (compared to real-time)	Open source	Used metrics and technologies
Tsinghua U.^[5]	0.897	×0.23	No	Mean of pixel intensities, standard deviation of pixel intensities, color histogram, pixel-wise difference, motion vector
NICTA^[6]	0.892	×2.30	No	Machine learning
IBM Research^[7]	0.876	×0.30	No	Color histogram, localized edges direction histogram, gray-level thumbnails comparison, frame luminance

MSU SBD Benchmark 2020-2021^[8]

The benchmark compared 6 methods on more than 120 videos from the RAI and MSU CC datasets with different types of scene changes, some of which were added manually.^[9] The authors state that the main feature of this benchmark is the complexity of the shot transitions in its dataset. To support this, they calculate the SI/TI metric of the shots and compare it with that of other publicly available datasets.

Top algorithms
Algorithm	F score	Processing speed (FPS)	Open source	Used metrics and technologies
Saeid Dadkhah^[10]	0.797	86	Yes	Color histogram, adaptive threshold
Max Reimann^[11]	0.787	76	Yes	SVM for cuts, neural networks for gradual transitions, color histogram
VQMT^[12]	0.777	308	No	Edge histograms, motion compensation, color histograms
PySceneDetect^[13]	0.776	321	Yes	Frame intensity
FFmpeg^[14]	0.772	165	Yes	Color histogram


References

[1] P. Balasubramaniam; R Uthayakumar (2 March 2012). Mathematical Modelling and Scientific Computation: International Conference, ICMMSC 2012, Gandhigram, Tamil Nadu, India, March 16-18, 2012. Springer. pp. 421–. ISBN 978-3-642-28926-2.
[2] Weiming Shen; Jianming Yong; Yun Yang (18 December 2008). Computer Supported Cooperative Work in Design IV: 11th International Conference, CSCWD 2007, Melbourne, Australia, April 26-28, 2007. Revised Selected Papers. Springer Science & Business Media. pp. 100–. ISBN 978-3-540-92718-1.
[3] Joan Cabestany; Ignacio Rojas; Gonzalo Joya (30 May 2011). Advances in Computational Intelligence: 11th International Work-Conference on Artificial Neural Networks, IWANN 2011, Torremolinos-Málaga, Spain, June 8-10, 2011, Proceedings. Springer Science & Business Media. pp. 521–. ISBN 978-3-642-21500-1. Shot detection is performed by means of shot transition detection algorithms. Two different types of transitions are used to split a video into shots: – Abrupt transitions, also referred as cuts or straight cuts, occur when a sudden change from one ...
[4] Smeaton, A. F., Over, P., & Doherty, A. R. (2010). Video shot boundary detection: Seven years of TRECVid activity. Computer Vision and Image Understanding, 114(4), 411–418. doi:10.1016/j.cviu.2009.03.011
[5] Yuan, J., Zheng, W., Chen, L., Ding, D., Wang, D., Tong, Z., Wang, H., Wu, J., Li, J., Lin, F., & Zhang, B. (2004). Tsinghua University at TRECVID 2004: Shot Boundary Detection and High-Level Feature Extraction. TRECVID.
[6] Yu, Zhenghua, S. Vishwanathan and Alex Smola. “NICTA at TRECVID 2005 Shot Boundary Detection Task.” TRECVID (2005).
[7] A. Amir, The IBM Shot Boundary Detection System at TRECVID 2003, in: TRECVID 2005 Workshop Notebook Papers, National Institute of Standards and Technology, MD, USA, 2003.
[8] "MSU SBD Benchmark 2020". Archived from the original on 2021-02-13. Retrieved 2021-02-19.
[9] "MSU SBD Benchmark 2020". Archived from the original on 2021-02-13. Retrieved 2021-02-19.
[10] "SaeidDadkhah/Shot-Boundary-Detection". GitHub. 19 September 2021.
[11] "Shot-Boundary-Detection". GitHub. 11 September 2021.
[12] "MSU Scene Change Detector (SCD)".
[13] "Home - PySceneDetect".
[14] "Ffprobe Documentation".

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
