STFCN: Spatio-Temporal FCN for Semantic Video Segmentation


by M Fayyaz, MH Saffar, M Sabokrou, M Fathy, R Klette, F Huang
Abstract:
This paper presents a novel method that exploits both spatial and temporal features for semantic video segmentation. Recent work on convolutional neural networks (CNNs) has shown that CNNs provide advanced spatial features that support strong performance in both image and video analysis, particularly for the semantic segmentation task. We investigate how incorporating temporal features also benefits the segmentation of video data. We propose a module based on a long short-term memory (LSTM) recurrent neural network architecture for interpreting the temporal characteristics of video frames over time. Our system takes the frames of a video as input and produces a correspondingly sized output; to segment the video, our method combines three components: first, the regional spatial features of each frame are extracted using a CNN; then, temporal features are added using an LSTM; finally, the spatio-temporal features are deconvolved to produce pixel-wise predictions. Our key insight is to build spatio-temporal convolutional networks (spatio-temporal CNNs) with an end-to-end architecture for semantic video segmentation. We adapt fully convolutional network architectures (such as FCN-AlexNet and FCN-VGG16), as well as dilated convolutions, into our spatio-temporal CNNs. Our spatio-temporal CNNs achieve state-of-the-art semantic segmentation, as demonstrated on the CamVid and NYUDv2 datasets.
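The three-stage pipeline the abstract describes (CNN spatial features, LSTM temporal fusion, deconvolution to pixel-wise labels) can be sketched at the shape level. The operators below are simplified stand-ins, not the paper's actual modules: strided mean-pooling replaces the FCN encoder, a single exponential gate replaces the LSTM, and nearest-neighbour upsampling replaces the learned deconvolution; all names, sizes, and the class count are illustrative assumptions.

```python
import numpy as np

def encode(frame, stride=8):
    """Stand-in CNN encoder: strided mean-pooling to a coarse feature map."""
    h, w = frame.shape[:2]
    return frame[:h - h % stride, :w - w % stride] \
        .reshape(h // stride, stride, w // stride, stride, -1).mean(axis=(1, 3))

def lstm_step(x, h_prev, alpha=0.5):
    """Stand-in temporal module: exponential gating of features over frames."""
    return alpha * x + (1 - alpha) * h_prev

def decode(feat, num_classes=11, stride=8):
    """Stand-in deconvolution: upsample the coarse map, then per-pixel argmax."""
    up = feat.repeat(stride, axis=0).repeat(stride, axis=1)
    return up[..., :num_classes].argmax(axis=-1)

video = np.random.rand(5, 64, 64, 16)  # 5 frames, 64x64 pixels, 16 channels
h = np.zeros((8, 8, 16))               # recurrent state on the coarse 8x8 grid
for frame in video:
    h = lstm_step(encode(frame), h)    # fuse spatial and temporal features
labels = decode(h, num_classes=11)     # pixel-wise class predictions
print(labels.shape)                    # (64, 64)
```

The point of the sketch is only the data flow: per-frame spatial encoding, a recurrent state carried across frames at the coarse feature resolution, and a final upsampling back to the input resolution for dense prediction.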
Reference:
STFCN: Spatio-Temporal FCN for Semantic Video Segmentation (M Fayyaz, MH Saffar, M Sabokrou, M Fathy, R Klette, F Huang), arXiv preprint arXiv:1608.05971, 2016.
Bibtex Entry:
@article{fayyaz2016stfcn:segmentation,
author = "Fayyaz, M and Saffar, MH and Sabokrou, M and Fathy, M and Klette, R and Huang, F",
journal = "arXiv preprint arXiv:1608.05971",
month = "Aug",
title = "STFCN: Spatio-Temporal FCN for Semantic Video Segmentation",
url = "http://arxiv.org/abs/1608.05971v2",
year = "2016",
abstract = "This paper presents a novel method that exploits both spatial and
temporal features for semantic video segmentation. Recent work on
convolutional neural networks (CNNs) has shown that CNNs provide advanced
spatial features that support strong performance in both image and video
analysis, particularly for the semantic segmentation task. We investigate how
incorporating temporal features also benefits the segmentation of video data.
We propose a module based on a long short-term memory (LSTM) recurrent neural
network architecture for interpreting the temporal characteristics of video
frames over time. Our system takes the frames of a video as input and produces
a correspondingly sized output; to segment the video, our method combines
three components: first, the regional spatial features of each frame are
extracted using a CNN; then, temporal features are added using an LSTM;
finally, the spatio-temporal features are deconvolved to produce pixel-wise
predictions. Our key insight is to build spatio-temporal convolutional
networks (spatio-temporal CNNs) with an end-to-end architecture for semantic
video segmentation. We adapt fully convolutional network architectures (such
as FCN-AlexNet and FCN-VGG16), as well as dilated convolutions, into our
spatio-temporal CNNs. Our spatio-temporal CNNs achieve state-of-the-art
semantic segmentation, as demonstrated on the CamVid and NYUDv2 datasets.",
keyword = "cs.CV",
day = "22",
}