INVITED PAPER
Large-Scale Situation Awareness With Camera Networks and Multimodal Sensing
This paper describes principles and practice of situational awareness in
communications; smart cameras, wireless infrastructure, context-aware
computing, and programming models are discussed.
By Umakishore Ramachandran, Senior Member IEEE, Kirak Hong,
Liviu Iftode, Senior Member IEEE, Ramesh Jain, Fellow IEEE, Rajnish Kumar,
Kurt Rothermel, Junsuk Shin, and Raghupathy Sivakumar
ABSTRACT | Sensors of various modalities and capabilities,
especially cameras, have become ubiquitous in our environ-
ment. Their intended use is wide ranging and encompasses
surveillance, transportation, entertainment, education, health-
care, emergency response, disaster recovery, and the like.
Technological advances and the low cost of such sensors
enable deployment of large-scale camera networks in large
metropolises such as London and New York. Multimedia algo-
rithms for analyzing and drawing inferences from video and
audio have also matured tremendously in recent times. Despite
all these advances, large-scale reliable systems for media-rich
sensor-based applications, often classified as situation-
awareness applications, are yet to become commonplace.
Why is that? There are several forces at work here. First, the
system abstractions are just not at the right level for quickly
prototyping such applications on a large scale. Second, while
Moore’s law has held true for predicting the growth of
processing power, the volume of data that applications are
called upon to handle is growing similarly, if not faster.
Enormous amounts of sensing data are continually generated
for real-time analysis in such applications. Further, due to the
very nature of the application domain, there are dynamic and
demanding resource requirements for such analyses. The lack
of the right set of abstractions for programming such applications,
coupled with their data-intensive nature, has hitherto made
realizing reliable large-scale situation-awareness applications
difficult. Incidentally, situation awareness is a very popular but
ill-defined research area that has attracted researchers from
many different fields. In this paper, we adopt a strong systems
perspective and consider the components that are essential in
realizing a fully functional situation-awareness system.
KEYWORDS | Large-scale distributed systems; programming
model; resource management; scalability; situation awareness;
video-based surveillance
I . INTRODUCTION
Situation awareness is both a property and an application
class that deals with recognizing when sensed data could
lead to actionable knowledge.
With advances in technology, it is becoming feasible to
integrate sophisticated sensing, computing, and communication in a single small-footprint sensor platform (e.g.,
smart cameras). This trend is enabling deployment of
powerful sensors of different modalities in a cost-effective
Manuscript received May 16, 2011; revised September 5, 2011; accepted October 20,
2011. Date of publication February 20, 2012; date of current version March 21, 2012.
U. Ramachandran, K. Hong, and J. Shin are with the College of Computing,
Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [emailprotected];
[emailprotected]; [emailprotected]).
L. Iftode is with the Department of Computer Science, Rutgers University,
Piscataway, NJ 08854 USA (e-mail: [emailprotected]).
R. Jain is with the School of Information and Computer Sciences, University of
California at Irvine, Irvine, CA 92697-3425 USA (e-mail: [emailprotected]).
R. Kumar was with the College of Computing, Georgia Institute of Technology,
Atlanta, GA 30332 USA. He is now with Weyond, Princeton, NJ 08540 USA
(e-mail: [emailprotected]).
K. Rothermel is with the Institute for Parallel and Distributed Systems
(IPVS), University of Stuttgart, 70569 Stuttgart, Germany
(e-mail: [emailprotected]).
R. Sivakumar is with the School of Electrical and Computer Engineering,
Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [emailprotected]).
Digital Object Identifier: 10.1109/JPROC.2011.2182093
878 Proceedings of the IEEE | Vol. 100, No. 4, April 2012 0018-9219/$31.00 ©2012 IEEE
manner. While Moore’s law has held true for predicting the growth of processing power, the volume of data that
applications handle is growing similarly, if not faster.
Situation-awareness applications are inherently distri-
buted, interactive, dynamic, stream based, computation-
ally demanding, and in need of real-time or near real-time
guarantees. A sense–process–actuate control loop charac-
terizes the behavior of this application class.
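This control loop can be sketched abstractly. The handlers below are illustrative placeholders for the domain-specific sensing, analysis, and actuation logic; none of the names or values come from a real system.

```python
def sense():
    # Stand-in for grabbing one reading or frame from a sensor.
    return {"motion_level": 42}

def process(sample):
    # Stand-in for analysis; maps sensed data to an action (or none).
    return "raise_alert" if sample["motion_level"] > 40 else None

def actuate(action):
    # Stand-in for the actuation step (notify an operator, pan a camera).
    return f"dispatched {action}"

# One iteration of the sense-process-actuate loop; a deployed system runs
# this continuously, under (near) real-time deadlines.
sample = sense()
action = process(sample)
result = actuate(action) if action else "no-op"
print(result)
```

The real-time requirement is precisely that the elapsed time from `sense` to `actuate` stay within the application's deadline on every iteration.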
There are three main challenges posed by data explosion for realizing situation awareness: overload on the
infrastructure, cognitive overload on humans in the loop,
and dramatic increase in false positives and false negatives
in identifying threat scenarios. Consider, for example,
providing situation awareness in a battlefield. It needs
complex fusion of contextual knowledge with time-
sensitive sensor data obtained from different sources to
derive higher level inferences. With an increase in the
sensed data, a fighter pilot will need to take more data into
account in decision making, leading to cognitive overload
and an increase in human errors (false positives and nega-
tives). Also, to process and disseminate the sensed data,
more computational and network resources are needed,
thus overloading the infrastructure.
Distributed video-based surveillance is a good canonical
example of this application class. Visual information plays a vital role in surveillance applications, as demonstrated by
the strategic use of video cameras as a routine means of
physical security. With advances in imaging technology,
video cameras have become increasingly versatile and so-
phisticated. They can be multispectral, can sense at varying
resolutions, can operate with differing levels of actuation
(stationary, moving, controllable), and can even be airborne (e.g., in military applications). Cameras are being
deployed on a large scale, from airports to city-scale infra-
structures. Such large-scale deployments result in massive
amounts of visual information that must be processed in
real time to extract useful and actionable knowledge for
timely decision making. The overall goal of surveillance
systems is to detect and track suspicious activities to ward
off potential threats. Reliable computer-automated surveillance using vision-based tracking, identification, and
activity monitoring can relieve operator tedium and allow
coverage of larger areas for various applications (airports,
cities, highways, etc.). Fig. 1 depicts the camera de-
ployment in an airport to serve as the infrastructure for
such a video-based surveillance system.
Video surveillance based on closed-circuit television
(CCTV) was first introduced in the United Kingdom in the middle of the last century. Since then, camera surveillance
networks have proliferated in the United Kingdom, with
over 200 000 cameras in London alone [1]. In the United
States, the penetration of CCTV has been relatively slower;
Chicago is leading with more than 2000 cameras, which
connect to an operation center constantly monitored by
police officers [2]. Apart from the legal and privacy aspects
of the CCTV technology [3], it is both expensive and hard to scale due to the huge human capital involved in moni-
toring the camera feeds.
Smart or intelligent surveillance combines sensing and
computing to automatically identify interesting objects and
suspicious behaviors. Advances in computer vision have
enabled a range of technologies including: human
Fig. 1. Cameras and people movement in an airport.
detection and discrimination [4]–[7]; single-camera and multicamera target tracking [8], [9]; biometric informa-
tion gathering, such as face [10] and gait signatures [11],
[12]; and human motion and activity classification [13]–
[15]. Such advances (often referred to as video analytics) are precursors to fully automated surveillance, and bode
well for use in many critical applications including and
beyond surveillance.
As image processing and interpretation tasks migrate from a manual to a computer-automated model, questions
of system scalability and efficient resource management
will arise and must be addressed. In large settings such as
airports or urban environments, processing the data
streaming continuously from multiple video cameras is a
computationally intensive task. Moreover, given the goals
of surveillance, images must be processed in real time in
order to provide the timeliness required by modern security practice. Questions of system scalability go beyond
video analytics, and fall squarely in the purview of distri-
buted systems research.
Consider a typical smart surveillance system in an air-
port with cameras deployed in a pattern to maintain con-
tinuous surveillance of the terminal concourses (Fig. 1).
Images from these cameras are processed by some
application-specific logic to produce the precise level of actionable knowledge required by the end user (human
and/or software agent). The application-specific processing
may analyze multiple camera feeds to extract higher level
information such as "motion," "presence of a human face," or "committing a suspicious activity." Additionally, a security agent can specify policies, e.g., "only specified people
are allowed to enter a particular area," which causes the
system to trigger an alert whenever such a policy is violated.
The surveillance system described above, fully realized,
is no longer a problem confined to computer vision but a
large-scale distributed systems problem with intensive
data-processing resource requirements. Consider, for ex-
ample, a simple small-scale surveillance system that does
motion sensing and Joint Photographic Experts Group
(JPEG) encoding/decoding. Fig. 2 shows the processing
requirements for such a system using a centralized setup [single 1.4-GHz Intel Pentium central processing unit
(CPU)]. In this system, each camera is restricted to stream
images at a slow rate of 5 frames/s, and each image has a
very coarse-grained resolution of only 320 × 240. Even
under the severely restricted data-processing conditions,
the results show that the test system cannot scale beyond
four cameras due to CPU saturation. Increasing the video
quality (frames per second and resolution) to those required by modern security applications would saturate
even a high-end computing system attempting to process
more than a few cameras. Clearly, scaling up to a large
number of cameras (on the order of hundreds or thou-
sands) warrants a distributed systems solution.
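A back-of-envelope calculation puts this in perspective. The saturation workload below comes from the experiment just described (four cameras, 5 frames/s, 320 × 240); the "modern" stream parameters (30 frames/s at 1920 × 1080) are our illustrative assumption of what current security practice demands, not a measured figure.

```python
# Pixel throughput at the saturation point of the centralized test system:
# four cameras, each streaming 320x240 frames at 5 frames/s.
sat_px_per_s = 4 * 320 * 240 * 5

# One hypothetical modern stream: 1080p at 30 frames/s (assumed values).
modern_px_per_s = 1920 * 1080 * 30

# Load of a single modern camera relative to the whole saturated setup.
ratio = modern_px_per_s / sat_px_per_s
print(f"saturation workload: {sat_px_per_s:,} px/s")
print(f"one modern camera:   {modern_px_per_s:,} px/s ({ratio:.1f}x)")
```

Even if processing cost scaled only linearly with pixel rate, a single such camera would present roughly 40 times the workload that saturated the test CPU, which is why scaling to hundreds or thousands of cameras calls for a distributed solution.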
We take a systems approach to scalable smart surveil-
lance, embodying several interrelated research threads:
1) determining the appropriate system abstractions to aid the computer-vision domain expert in developing such
complex applications; 2) determining the appropriate
execution model that fully exploits the resources across
the distributed system; and 3) identifying technologies
spanning sensing hardware and wireless infrastructures for
supporting large-scale situation awareness.
Situation awareness as a research area is still evolving.
It has attracted researchers from vastly different fields
spanning computer vision, robotics, artificial intelligence,
systems, and networking. In this paper, we discuss the
component technologies, rather than an end-to-end system,
that are essential to realize a fully functional situation-
awareness system. We start by understanding the applica-
tion requirements, especially in the domain of video-based
surveillance (Section II). We use this domain knowledge to
raise questions about the systems research that is needed to support large-scale situation awareness. We then pre-
sent a bird’s eye view of the enabling technologies of rele-
vance to large-scale situation awareness (Section III). This
tour of technologies spans computer vision, smart cameras
and other sensors, wireless, context-aware computing, and
programming models. We then report on our own experi-
ence in developing a system architecture for situation-
awareness applications (Section IV).
IBM’s S3 system [16] is perhaps the only complete end-
to-end system for situation awareness that we are aware of.
Fig. 2. Surveillance system resource utilization. (a) CPU load.
(b) Memory usage.
We include a case study of IBM’s S3 product, which represents the state of the art in online video-based surveil-
lance (Section V). We conclude with thoughts on where
we are headed in the future in the exploration of large-
scale situation awareness (Section VI).
II . APPLICATION MODEL
Using video-based surveillance as a concrete instance of the domain of situation-awareness applications, let us first
understand the application model. In a video-based sur-
veillance application, there are two key functions: detection and tracking. For the sake of this discussion, we will
say that detection is concerned with identifying any anoma-
lous behavior of a person or an object from a scene. For
example, in an airport, a person leaving a bag in a public
place and walking away is one such anomalous event. Such an event has to be captured in real time by an automated
surveillance system among thousands of normal activities
in the airport. As can be imagined, there could be several
such potentially anomalous events that may be happening
in an airport at any given time. Once such an event is
detected, the object or the person becomes a target, and the
automated surveillance system should keep track of the
target that triggered the event. While tracking the target
across multiple cameras, the surveillance system provides
all relevant information of the target including location
and multiple views captured by different cameras, to
eventually lead to a resolution of whether the original
event is a benign one or something serious warranting
appropriate action by a security team. For clarity, we will
use the terms detector and tracker to denote these two
pieces of the application logic.
The application model reveals the inherent parallel/
distributed nature of a video-based surveillance applica-
tion. Each detector is a per-camera computation and these
computations are inherently data parallel since there is no
data dependency among the detectors working on different
camera streams. Similarly, each tracker is a per-target
computation that can be run concurrently for each target.
If a target simultaneously appears in the field of view (FOV) of multiple cameras, the trackers following the tar-
get on each of the different camera streams need to work
together to build a composite knowledge of the target.
Moreover, there exist complex data sharing and commu-
nication patterns among the different instances of detec-
tors and trackers. For example, the detector and trackers
have to work together to avoid duplicate detection of the
same target.
The application model as presented above can easily be
realized on a small scale (i.e., on the order of tens of
camera streams) by implementing the application logic to
be executed on each of the cameras, and the output to be
centrally analyzed for correlation and refinement. Indeed,
there are already video analytics solution providers [17],
[18] that peddle mature commercial products for such
scenarios. However, programming such scenarios on a large scale requires a distributed approach, whose scalability is a
hard open problem.
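The parallel structure just described can be sketched compactly. The code below is an illustrative toy, not an interface from any real system: `detect` stands in for per-camera detection, frames are plain dictionaries, and a shared target registry is the simplest way to show the duplicate-suppression coordination between detectors and trackers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-camera detector: flags frames whose payload marks an
# anomalous event (a real system would run vision algorithms here).
def detect(camera_id, frames):
    return [(camera_id, f["target"]) for f in frames if f["anomalous"]]

# Shared registry of targets already being tracked, so that a target seen
# by several detectors spawns only one tracker (duplicate suppression).
tracked = set()

def assign_trackers(detections):
    new_targets = []
    for camera_id, target in detections:
        if target not in tracked:
            tracked.add(target)
            new_targets.append(target)   # one tracker per new target
    return new_targets

# Detectors are data parallel: one task per camera stream.
streams = {
    "cam1": [{"target": "bag17", "anomalous": True}],
    "cam2": [{"target": "bag17", "anomalous": True},   # same target, second view
             {"target": "person3", "anomalous": False}],
}
with ThreadPoolExecutor() as pool:
    results = pool.map(lambda kv: detect(*kv), streams.items())

trackers = [t for dets in results for t in assign_trackers(dets)]
print(trackers)   # bag17 is tracked once despite being seen by two cameras
```

In a distributed realization, the registry itself becomes shared state across machines, which is exactly the kind of coordination that makes scaling this application class a systems problem.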
How can a vision expert write an application for video-
based surveillance that spans thousands of cameras and
other sensors? How can we design a scalable infrastructure
that spans a huge geographical area such as an airport or a
city to support such applications? How do we reduce the
programming burden on the domain expert by providing the right high-level abstractions? What context information is
needed to support prioritization of data streams and asso-
ciated computations? How can we transparently migrate
computations between the edges of the network (i.e., at or
close to the sensors) and the computational workhorses
(e.g., cloud)? How do we adaptively adjust the fidelity of
the computation commensurate with the application dyna-
mics (e.g., more targets to be observed than can be sustained by the infrastructure)? These are some of
the questions that our vision for large-scale situation
awareness raises.
III . ENABLING TECHNOLOGIES
The objective in this section is to give a bird’s eye view of
the state of the art in technologies that are key enablers for large-scale situation awareness. We start with a brief survey
of computer vision technologies as they apply to video-
based surveillance (Section III-A). We then discuss smart
camera technology that is aimed at reducing the stress on
the compute and networking infrastructure by facilitating
efficient edge processing (such as filtering and motion
detection) to quench the uninteresting camera streams at
the source (Section III-B). We then survey wireless technologies (Section III-C) that allow smart cameras to be
connected to backend servers given the computationally
intensive nature of computer vision tasks. This is followed
by reviewing the middleware framework for context-aware
computing, a key enabler to paying selective attention to
streams of interest for deeper analysis in situation-
awareness applications (Section III-D). Last, we review
programming models and execution frameworks, perhaps the most important piece of the puzzle for developing large-
scale situation-awareness applications (Section III-E).
A. Computer Vision
Computer vision technologies have advanced dramat-
ically during the last decade in a number of ways. Many
algorithms have been proposed in different subareas of
computer vision and have significantly improved the performance of computer vision processing tasks. There are
two aspects to performance when it comes to vision tasks:
accuracy and latency. Accuracy has to do with the correct-
ness of the inference made by the vision processing task
(e.g., how precise is the bounding box around a face
generated by a face detection algorithm?). Latency, on the
other hand, has to do with the time it takes for a vision
processing task to complete its work. Traditionally, computer vision research has focused on developing
algorithms that increase the accuracy of detection, track-
ing, etc. However, when computer vision techniques are
applied to situation-awareness applications, there is a ten-
sion between accuracy and latency. Algorithms that
increase the accuracy of event detection are clearly pre-
ferable. However, if the algorithm is too slow then the
outcome of the event detection may be too late to serve as
actionable knowledge. In general, in a video-based surveil-
lance application, the objective is to shrink the elapsed time
(i.e., latency) between sensing and actuation. Since video
processing is continuous in nature, computer vision algo-
rithms strive to achieve a higher processing frame rate
(i.e., frames per second) to ensure that important events
are not missed. Therefore, computer vision research has
been focusing on improving performance in terms of both
accuracy and latency for computer vision tasks of relevance
to situation awareness, namely: 1) object detection; 2) object
tracking; and 3) event recognition.
Object detection algorithms, as the name suggests, de-
tect and localize objects in a given image frame. As the
same object can have significant variations in appearance
due to the orientation, lighting, etc., accurate object de-
tection is a hard problem. For example, with human sub-
jects there can be variation from frame to frame in poses,
hand position, and facial expressions [19]. Object detection
can also suffer from occlusion [20]. This is the reason
detection algorithms tend to be slow and do not achieve a
very high frame rate. For example, a representative detec-
tion algorithm proposed by Felzenszwalb et al. [19] takes 3 s to train one frame and 2 s to evaluate one frame. To put this
performance in perspective, camera systems are capable of grabbing frames at rates upwards of 30 frames/s.
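The mismatch between detector latency and capture rate is easy to quantify; the sketch below simply works out the implied frame skipping, using the 2-s evaluation cost cited above for [19] and a 30-frames/s capture rate.

```python
capture_fps = 30           # frames/s arriving from the camera
detector_s_per_frame = 2   # evaluation cost per frame, as cited for [19]

# Frames that arrive while the detector evaluates a single frame.
arrivals_per_eval = capture_fps * detector_s_per_frame
fraction_examined = 1 / arrivals_per_eval
print(arrivals_per_eval, f"{fraction_examined:.1%}")
```

At these rates the detector examines under 2% of captured frames, which is one reason expensive detection is run selectively, with cheaper triggers deciding which frames deserve it.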
Object tracking research has addressed online algo-
rithms that train and track in real time (see [21] for a
comprehensive survey). While previous research has typi-
cally used lab environments with static backgrounds and
slowly moving objects in the foreground, recent research
[22]–[24] has focused on 1) real-time proces-
sing; 2) handling occlusions; 3) handling movement of both
target and background; and 4) handling scenarios where an
object leaves the FOV of one camera and appears in front
of another. The tradeoff between real-time performance
and accuracy is evident in the
design and experimental results reported by the authors of
these algorithms. For example, the algorithm proposed by
Babenko et al. [22] runs at 25 frames/s while the algorithm
proposed by Kwon and Lee [24] takes 1–5 s/frame (for
similar-sized video frames). However, Kwon and Lee [24] show through experimental results that their algorithm
results in higher accuracy over a larger data set of videos.
Event recognition is a higher level computer vision task
that plays an important role in situation-awareness appli-
cations. There are many different types of event recogni-
tion algorithms that are trained to recognize certain events
and/or actions from video data. Examples of high level
events include modeling individual object trajectories [25],
[26], recognizing specific human poses [27], and detecting
anomalies and unusual activities [28]–[30].
In recent years, the state of the art in automated visual
surveillance has advanced considerably for many tasks in-
cluding: detecting humans in a given scene [4], [5]; track-
ing targets within a given scene from a single camera or
multiple cameras [8], [9]; following targets in a wide FOV
given overlapping sensors; classifying targets into people, vehicles, animals, etc.; collecting biometric infor-
mation such as face [10] and gait signatures [11]; and
understanding human motion and activities [13], [14].
In general, it should be noted that computer vision
algorithms for tasks of importance to situation awareness,
namely, detection, tracking, and recognition, are compu-
tationally intensive. The first line of defense is to quickly
eliminate streams that are uninteresting. This is one of the advantages of using smart cameras (to be discussed next).
More generally, facilitating real-time execution of such
algorithms on a large-scale deployment of camera sensors
necessarily points to a parallel/distributed solution (see
Section III-E).
B. Smart Cameras
One of the keys to a scalable infrastructure for large-
scale situation awareness is to quench the camera streams
at the source if they are not relevant (e.g., no action in
front of a camera). One possibility is moving some aspects
of the vision processing (e.g., object detection) to the
cameras themselves. This would reduce the communica-
tion requirements from the cameras to the backend servers
and add to the overall scalability of the wireless infrastruc-
ture (see Section III-C). Further, it will also reduce the overall computation requirements in the backend server
for the vision processing tasks.
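In-camera quenching can be as simple as frame differencing. The sketch below is an illustrative stand-in for such edge logic, not an algorithm from the literature: frames are flat grayscale pixel lists, and a frame is forwarded to the backend only when enough pixels change relative to its predecessor.

```python
def motion_score(prev, curr):
    """Mean absolute per-pixel difference between two grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def quench(frames, threshold=10.0):
    """Forward only frames that differ enough from their predecessor."""
    forwarded = []
    prev = frames[0]
    for curr in frames[1:]:
        if motion_score(prev, curr) >= threshold:
            forwarded.append(curr)   # interesting: ship to the backend
        prev = curr
    return forwarded

# A static scene followed by a sudden change (tiny 4-pixel "frames").
static = [100, 100, 100, 100]
changed = [100, 100, 180, 180]
print(len(quench([static, static, changed])))  # only the changed frame survives
```

Even this crude filter eliminates the all-static portion of a stream at the camera, so neither the access link nor the backend server ever sees it.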
With the evolution of sensing and computation tech-
nologies, smart cameras have also evolved along three di-
mensions: data acquisition, computational capability, and
configurability. Data acquisition sensors are used for cap-
turing the images of camera views. There are two current
alternatives for such sensors: charge-coupled device (CCD) sensors and complementary metal–oxide–
semiconductor (CMOS) sensors. Despite the superior
image quality obtainable with CCD sensors, CMOS sensors
are more common in today’s smart cameras mainly because
of their flexible digital control, high-speed exposure, and
other functionalities.
The computational element, needed for real-time proces-
sing of the sensed data, is in general implemented using
one of (or a combination of) the following technologies:
digital signal processors (DSPs), microcontroller or micro-
processors, field programmable gate arrays (FPGAs), mul-
timedia processors, and application-specific integrated
circuits (ASICs). Microcontrollers provide the most flexi-
bility among these options but may be less suitable for the
implementation of image processing algorithms compared
to DSPs or FPGAs. With recent advances, memory controllers and microcontrollers are integrated with FPGA
circuits to attain hardware-level parallelism while main-
taining the reconfigurability of microprocessors. Thus,
FPGAs are emerging as a good choice for implementing
the computational elements of smart cameras [31].
Finally, because of the reconfigurability of CMOS sen-
sors and FPGA-based computational flexibility of today’s
smart cameras, it is now possible to have fine-grained
control of both sensing and processing units, leading to a
whole new field of computational cameras [32]. This new
breed of cameras can quickly adjust their optical circuitry
to obtain high-quality images even under dynamic lighting
or depth of view conditions. Such computational cameras,
combined with pan-tilt-zoom (PTZ) controllers, FPGA-
based image-processing elements, and a communication
element to interact with other cameras or remote servers, can be considered today as the state-of-the-art design of a
smart camera. CITRIC [33] is a recent example from an
academic setting of a wireless camera platform. Multi-
tiered camera platforms have also been proposed wherein
low-power camera motes wake up higher resolution came-
ras to capture and process interesting images. SensEye [34]
is one such platform; it achieves low latency from sensing
to actuation without sacrificing energy efficiency.
Companies such as Philips, Siemens, Sony, and Texas
Instruments [35], [36] have commercial smart camera
products, and such smart cameras usually have program-
mable interfaces for customization. Axis [37], while focus-
ing on IP cameras, incorporates multimodal sensors and
passive infrared (PIR) sensors (for motion detection) in
their camera offerings. The entertainment industry has
also embraced cameras with additional sensing modalities; e.g., Microsoft Kinect [38] uses advanced sensor technol-
ogies to construct 3-D video data with depth information
using a combination of CMOS cameras and infrared sensing.
One of the problems with depending on only one
sensing technology is the potential for increasing false positives (a false alarm for a nonexistent threat situation) and
false negatives (a real threat missed by the system). Despite
the sophistication of computer vision algorithms, it is still the case that these algorithms are susceptible to lighting
conditions, ambient noise, occlusions, etc. One way of
enhancing the quality of the inference is to augment the
vision techniques with other sensing modalities that may
be less error prone. Because of the obvious advantage of
multimodal sensing, many smart camera manufacturers
today add different sensors along with optics and provide
an intelligent surveillance system that takes advantage of
the nonoptical data, e.g., the use of an integrated global position-
ing system (GPS) receiver to tag the streamed data with
location information.
C. Wireless Infrastructure
The physical deployment for a camera-based situation-
awareness application would consist of a plethora of wired
and wireless infrastructure components: simple and smart
cameras, wireless access points, wireless routers, gateways,
and Internet connected backend servers. The cameras will,
of course, be distributed spatially in a given region along
with wireless routers and gateways. The role of the wire-
less routers is to stream the camera images to backend
servers in the Internet (e.g., cloud computing resources)
using one or more gateways. The gateways connect the
wireless infrastructure with the wired infrastructure and
are connected to the routers using long-range links referred
to as backhaul links. Similarly, the links between the wire-
less cameras and the wireless routers are short-range links
and are referred to as access links. Additionally, wireless
access points may be available to directly connect the
cameras to the wired infrastructure. A typical deployment
may in fact combine wireless access points and gateways
together, or access points may be connected to gateways via gigabit Ethernet.
1) Short-Range Technologies: IEEE 802.11n [39] is a very
high throughput standard for wireless local area networks
(WLANs). The 802.11n standard has evolved considerably
from its predecessors: 802.11b and 802.11a/g. The 802.11n
standard includes unique capabilities such as the use of
multiple antennas at the transmitter and the receiver to realize high-throughput links along with frame aggregation
and channel bonding. These features enable a maximum
physical layer data rate of up to 600 Mb/s. The 802.11
standards provide an indoor communication range of less
than 100 m and hence are good candidates for short-range
links.
IEEE 802.15.4 (Zigbee) [40] is another standard for
small low-power radios intended for networking low-bit-rate sensors. The protocol specifies a maximum physical
layer data rate of 250 kb/s and a transmission range be-
tween 10 and 75 m. Zigbee uses multihop routing built
upon the ad hoc on demand distance vector (AODV; [41])
routing protocol. In the context of situation-awareness
applications, Zigbee would be useful for networking other
sensing modalities in support of the cameras (e.g., radio-
frequency identification (RFID), temperature, and humidity sensors).
The key issue in the use of the above technologies in a
camera sensor network is the performance versus energy
tradeoff. The IEEE 802.11n provides much higher data
rates and wider coverage but is less energy efficient when
compared to Zigbee. Depending on the power constraints
and data rate requirements in a given deployment, either
of these technologies would be more appropriate than the other.
2) Long-Range Technologies: The two main candidate
technologies for long-range links (for connecting routers
to gateways) are Long-Term Evolution (LTE) [42] and IEEE 802.16 Worldwide Interoperability for Microwave Access (WiMax) [43]. The LTE specification provides an uplink
data rate of 50 Mb/s and communication ranges from 1 to 100 km. WiMax provides high data rates of up to 128 Mb/s
uplink and a maximum range of 50 km. Thus, both these
technologies are well suited as backhaul links for camera
sensor networks. There is an interesting rate-range trade-
off between access links and backhaul links. To support the
high data rates (but short range) of access links, it is quite
common to bundle multiple backhaul links together.
While both the above technologies allow long-range communication, the use of one technology in a given environ-
ment would depend on the spectrum that is available in a
given deployment (licensed versus unlicensed) and the
existence of prior cellular core networks in the deploy-
ment area.
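The nominal rates quoted above make the access/backhaul provisioning arithmetic concrete. The 4-Mb/s per-camera figure below is our assumption for a compressed HD stream, and the link rates are best-case physical-layer numbers (600 Mb/s for 802.11n, 128 Mb/s for the WiMax uplink), so a real deployment would provision far more conservatively.

```python
camera_mbps = 4        # assumed compressed stream rate per camera
wifi_mbps = 600        # 802.11n maximum physical-layer rate (short range)
wimax_mbps = 128       # WiMax maximum uplink rate (long range)

cams_per_access_link = wifi_mbps // camera_mbps
cams_per_backhaul_link = wimax_mbps // camera_mbps
# Backhaul links needed to drain one fully loaded 802.11n access link
# (ceiling division), illustrating why backhaul links get bundled.
backhauls_per_access = -(-wifi_mbps // wimax_mbps)

print(cams_per_access_link, cams_per_backhaul_link, backhauls_per_access)
```

The asymmetry in the last figure is consistent with the practice, noted above, of bundling multiple backhaul links behind a single high-rate access segment.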
3) Higher Layer Protocols: In addition to the link layer
technologies that comprise a camera sensor network, higher layer protocols for routing are also essential for
successful operation of camera networks.
a) Surge mesh routing is a popular routing protocol
used in several commercial Zigbee devices such
as the Crossbow Micaz motes [44]. This provides
automatic rerouting when a camera sensor link
fails and constructs a topology dynamically by
keeping track of link conditions.
b) RPL [45] is an IPv6 routing protocol for com-
municating multipoint-to-point traffic from low-
power devices toward a central control point, as
well as point-to-multipoint traffic from the cen-
tral control point to the devices. RPL allows dy-
namic construction of routing trees with the root
at the gateways in a camera sensor network.
c) Mobile IP is a protocol that is designed to allow mobile device users to move from one network to
another while maintaining a permanent IP ad-
dress [46]. This is especially useful in applica-
tions with mobile cameras (mounted on vehicles
or robots). In the context of large-scale camera
sensor networks, high throughput and scalability
are essential.
While Mobile IP and surge mesh routing are easier to deploy, they are usable only for small-size networks. For
large-scale networks, RPL is more suited but is more
complex to operate as well.
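The RPL idea of a routing tree rooted at the gateway can be sketched as a breadth-first construction of parent pointers, giving each node an upward route for multipoint-to-point traffic. This toy model ignores RPL's objective functions, link metrics, and control messages, and the topology below is invented for illustration.

```python
from collections import deque

def build_routing_tree(links, root):
    """BFS sketch of RPL-style upward-route construction: each node learns
    a parent on a shortest path toward the gateway (the tree root)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neigh in links.get(node, []):
            if neigh not in parent:
                parent[neigh] = node
                queue.append(neigh)
    return parent

# toy topology: gateway G, two routers, and camera nodes C1..C3
links = {"G": ["R1", "R2"], "R1": ["G", "C1", "C2"], "R2": ["G", "C3"],
         "C1": ["R1"], "C2": ["R1"], "C3": ["R2"]}
parents = build_routing_tree(links, "G")
```

Each camera node forwards upward by following its parent pointer until the gateway is reached; downward (point-to-multipoint) routes would invert these pointers.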
D. Context-Aware Frameworks
Situation-awareness applications are context sensitive,
i.e., they adapt their behavior depending on the state of
their physical environment or information derived from the environment. Further, the context associated with ap-
plications is increasing in both volume and geographic
extent over time. The notion of context is application de-
pendent. For example, IBM’s Smarter Planet [47] vision
includes the global management of the scarce resources of
our earth like energy and water, traffic, transportation, and
manufacturing.
Specifically, with respect to video-based surveillance, the context information is the metadata associated with a
specific object that is being tracked. An infrastructure for
context management should integrate both the collected
sensor data and geographic data, including maps, floor
plans, 3-D building models, and utility network plans.
Such context information can be used by the automated
surveillance system to preferentially allocate computing,
networking, and cognitive resources. In other words, context awareness enables selective attention, leading to better
situation awareness.
An appropriate framework for managing and sharing
context information for situation-awareness applications is
the sensor web concept, exemplified by Microsoft Sense-
Web [48], and Nokia Sensor Planet [49]. Also various
middleware systems for monitoring and processing sensor
data have been proposed (e.g., [50]–[52]). These infrastructures provide facilities for sensor discovery (e.g., by
type or location), retrieval and processing of sensor data,
and more sophisticated operations such as filtering and/or
event recognition.
Federated Context Management goes one step further
than sensor webs. It integrates not only sensor data but
also geographic data from various sources. The Nexus
framework [53], [54] federates the context information of the different providers and offers context-aware applica-
tions a global and consistent view of the federated context.
This global context not only includes observable context
information (such as sensor values, road maps, and 3-D
models), but also higher level situation information
inferred from other contexts.
E. Programing Models
By far the biggest challenge in building large-scale
situation-awareness applications is the programing com-
plexity associated with large-scale distributed systems,
both in terms of ease of development for the domain expert
and efficiency of the execution environment to ensure
good performance. We will explore two different approaches that address the programing challenge.
1) Thread-Based Programing Model: The lowest level
approach to building surveillance systems is to have the
application developer handle all aspects of the system,
including traditional systems aspects, such as resource
management, and more application-specific aspects, such
as mapping targets to cameras. In this model, a developer
wishing to exploit the inherent application-level concur-
rency (see Section I) has to manage the concurrently executing threads over a large number of computing nodes.
This approach gives maximum flexibility to the developer
for optimizing the computational resources since he/she
has complete control of the system resources and the
application logic.
However, effectively managing the computational re-
sources for multiple targets and cameras is a daunting
responsibility for the domain expert. For example, the shared data structures between detectors and trackers en-
suring target uniqueness should be carefully synchronized
to achieve the most efficient parallel implementation.
Multiple trackers operating on different video streams may
also need to share data structures when they are monitor-
ing the same target. These complex patterns of data
communication and synchronization place an unnecessary
burden on an application developer, which is exacerbated by the need to scale the system to hundreds or even thou-
sands of cameras and targets in a large-scale deployment
(airports, cities).
2) Stream-Oriented Programing Model: Another approach is to use a stream-oriented programing model [55]–[58] as
a high-level abstraction for developing surveillance appli-
cations. In this model, a programer does not need to deal with low-level system issues such as communication and
synchronization. Rather, she can focus on writing an
application as a stream graph consisting of computation
vertices and communication edges. Once a programer
provides necessary information including a stream graph,
the underlying stream processing system manages the
computational resources to execute the stream graph over
multiple nodes. Various optimizations are applied at the system level, shielding the programers from having to con-
sider performance issues.
Fig. 3 illustrates a stream graph for a target tracking
application using IBM System S [55], one of the represen-
tative off-the-shelf stream processing engines. A detector
processes each frame from a camera, and produces a digest
containing newly detected blobs, the original camera
frame, and a foreground mask. A second stream stage,
trackerlist, maintains a list of trackers following different targets within a camera stream. It internally creates a new
tracker for newly detected blobs by the detector. Each
tracker in the trackerlist uses the original camera frame
and a foreground mask to update each target’s blob posi-
tion. The updated blob position will be sent back to the
detector (associated with this camera stream), to prevent
redundant detection of the same target.
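The digest flow between the detector and trackerlist stages can be mimicked in a few lines of plain Python; this is a behavioral sketch, not System S code, and representing blobs as simple string labels is an assumption.

```python
class Detector:
    """Per-camera stage: finds blobs not yet known and emits a digest.
    The feedback of updated blob positions from trackers is modeled by
    the 'known' set, which prevents redundant detection of a target."""
    def __init__(self):
        self.known = set()

    def process(self, frame_blobs):
        new = [b for b in frame_blobs if b not in self.known]
        self.known.update(new)
        return {"new_blobs": new, "frame": frame_blobs}

class TrackerList:
    """Single stage multiplexing one tracker per target: the workaround
    for the lack of dynamically created stages in System S."""
    def __init__(self):
        self.trackers = {}

    def process(self, digest):
        for blob in digest["new_blobs"]:
            self.trackers[blob] = {"positions": []}
        for target, state in self.trackers.items():
            state["positions"].append(target)  # stand-in for a position update
        return set(self.trackers)

det, tl = Detector(), TrackerList()
tl.process(det.process(["person1"]))
tracked = tl.process(det.process(["person1", "person2"]))
```

Note how the second frame creates a tracker only for the newly seen target; all trackers then run sequentially inside the one trackerlist stage, which is precisely the load-imbalance limitation discussed next.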
There are several limitations to using this approach for a large-scale surveillance application. First, a complete
stream graph should be provided by a programer. In a
large-scale setting, specifying such a stream graph with a
huge number of stream stages and connections among
them (taking into account camera proximities) is a very
tedious task. Second, it cannot exploit the inherent paral-
lelism of target tracking. Dynamically creating a new
stream stage is not supported by System S; therefore, a single stream stage, namely, trackerlist, has to execute
multiple trackers internally. This limitation creates a
significant load imbalance among the stream stages of
different trackerlists, as well as low target tracking per-
formance due to the sequential execution of the trackers by
a given stream stage. Last, stream stages can only com-
municate through statically defined stream channels
(internal to IBM System S), which prohibits arbitrary real-time data sharing among different computation mod-
ules. As shown in Fig. 3, a programer has to explicitly
connect stream stages using stream channels and deal with
the ensuing communication latency under conditions of
infrastructure overload.
IV. SYSTEM BUILDING EXPERIENCES
In this section, we report on our experiences in building a
system infrastructure for large-scale situation-awareness
applications. Specifically, we describe 1) a novel programing model called target container (TC) (Section IV-A) that addresses some of the limitations of the existing programing models described in Section III-E; and 2) a peer-to-
peer distributed software architecture called ASAP1 that
addresses many of the scalability challenges for scaling up
situation-awareness applications that were identified in
Section III-A–D. ASAP uses the principles in the wireless
model for deployment of the physical infrastructure (see Section III-C); allows for edge processing (where possible
with smart cameras) to conserve computing and network-
ing resources; incorporates multimodal sensing to reduce
the ill-effects of false positives and false negatives; and
exploits context awareness to prioritize camera streams of
interest.
A. TC Programing Model
The TC programing model is designed for domain
experts to rapidly develop large-scale surveillance
Fig. 3. Target tracking based on stream-oriented models.
1 ASAP stands for "priority-aware situation awareness" read backwards.
applications. The key insight is to elevate the target to a first-class entity, both from the perspective of the programer and
from the perspective of resource allocation by the execu-
tion environment. Consequently, all application level
vision tasks become target centric, which is more natural from the point of view of the domain expert. The runtime
system is also able to optimize the resource allocation
commensurate with the application’s perception of
importance (expressed as target priority; see Table 1)
and equitably across all equal priority targets. In principle,
the TC model generalizes to dealing with heterogeneous
sensors (cameras, RFID readers, microphones, etc.). How-
ever, for the sake of clarity of the exposition, we adhere to cameras as the only sensors in this section.
The TC programing model shares with large-scale stream processing engines [55], [56] the concept of providing a
high-level abstraction for large-scale stream analytics.
However, TC is specifically designed for real-time surveil-
lance applications with special support based on the notion of a target.
1) TC Handlers and API: The intuition behind the TC
programing model is quite simple and straightforward.
Fig. 4 shows the conceptual picture of how a surveillance
application will be structured using the new programing
model, and Table 1 summarizes APIs provided by the TC
system. The application is written as a collection of hand-lers. There is a detector handler associated with each
Table 1 Target Container API
Fig. 4. Surveillance application using TC model.
camera stream. The application is written as a collection of handlers. There is a detector handler associated with each camera stream. The role of the detector handler is to analyze each camera image it receives to detect any new target
that is not already known to the surveillance system. The
detector creates a target container for each new target it
identifies in a camera frame by calling TC_create_target with initial tracker and TC data.
In the simple case, where a target is observed in only
one camera, the target container contains a single tracker handler, which receives images from the camera and updates the target information on every frame arrival.
However, due to overlapping FOVs, a target may appear
in multiple cameras. Thus, in the general case, a target
container may contain multiple trackers following a
target observed by multiple cameras. A tracker can call
TC_stop_track to notify the TC system that this tracker
need not be scheduled anymore; it would do that upon
realizing that the target it is tracking is leaving the camera's FOV.
In addition to the detectors (one for each sensor
stream) and the trackers (one per target per sensor stream
associated with this target), the application must provide
additional handlers to the TC system for the purposes of
merging TCs as explained below. Upon detecting a new
target in its FOV, a detector would create a new target
container. However, it is possible that this is not a new target but simply an already identified target that hap-
pened to move into the FOV of this camera. To address
this situation, the application would also provide a
handler for equality checking of two targets. Upon estab-
lishing the equality of two targets, the associated contain-
ers will be merged to encompass the two trackers (see
target container in Fig. 4). The application would provide
a merger handler to accomplish this merging of two targets by combining two application-specific target data struc-
tures (TC data) into one. Incidentally, the application
may also choose to merge two distinct targets into a single
one (for example, consider a potential threat situation
when two cohorts join together and walk in unison in an
airport).
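A minimal mock of the detector-side flow might look as follows. Only TC_create_target and TC_stop_track are named in the text (see Table 1); the runtime class, handler signature, and data layout here are assumptions made for illustration.

```python
class TCRuntime:
    """Mock runtime exposing the two TC calls named in the text."""
    def __init__(self):
        self.containers = {}   # target id -> {"trackers": [...], "data": {...}}
        self.next_id = 0

    def TC_create_target(self, tracker, tc_data):
        tid = self.next_id
        self.next_id += 1
        self.containers[tid] = {"trackers": [tracker], "data": tc_data}
        return tid

    def TC_stop_track(self, tid, tracker):
        self.containers[tid]["trackers"].remove(tracker)

def detector_handler(runtime, frame_blobs, seen):
    """Creates one target container per newly detected blob."""
    for blob in frame_blobs:
        if blob not in seen:
            seen[blob] = runtime.TC_create_target(
                tracker=blob + "-tracker", tc_data={"blob": blob})

rt, seen = TCRuntime(), {}
detector_handler(rt, ["person1", "person2"], seen)
detector_handler(rt, ["person1"], seen)               # known target: no new TC
rt.TC_stop_track(seen["person2"], "person2-tracker")  # target left the FOV
```

The equality-checking and merger handlers would operate on the `containers` map, collapsing two entries into one when their TC data are judged to describe the same target.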
As shown in Fig. 4, there are three categories of data
with different sharing properties and life cycles. Detector data are the result of processing per-stream input, which is
associated with a detector. The data can be used to main-
tain detector context such as detection history and average
motion level in the camera’s FOV, which are potentially
useful for surveillance applications using per-camera infor-
mation. The detector data are potentially shared by the
detector and the trackers spawned thereof. The trackers
spawned by the detector as a result of blob detection may need to inspect this detector data. The tracker data main-
tain the tracking context for each tracker. The detector
may inspect these data to ensure target uniqueness. TC
data represent a target. It is the composite of the tracking
results of all the trackers within a single TC. The equality
checking handler inspects TC data to see if two TCs are
following the same target.
While all three categories of data are shared, the locality and degree of sharing for these three categories can be
vastly different. For example, the tracker data are unique
to a specific tracker and at most shared with the detector
that spawned the data. On the other hand, the TC data may
be shared by multiple trackers potentially spanning mul-
tiple computational nodes if an object is in the FOV of
several cameras. The detector data are also shared among
all the trackers that are working off a specific stream and the detector associated with that stream. This is the reason
our API (see Table 1) includes six different access calls for
these three categories of shared data.
When programing a target tracking application, the
developer has to be aware of the fact that the handlers
may be executed concurrently. Therefore, the handlers
should be written as sequential codes with no side effects
to shared data structures to avoid explicit application-level synchronization. The TC programing model allows an
application developer to use optimized handlers written in
low-level programing languages such as C and C++. To
shield the domain expert from having to deal with
concurrency bugs, data sharing between different hand-
lers is only allowed through TC API calls (shown in
Table 1), which subsume data access with synchronization
guarantees.
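One plausible way the TC API can subsume synchronization is to guard each category of shared data with a lock inside the access calls, so handler code remains straight-line sequential. The class below is a sketch of that idea under these assumptions, not the actual TC implementation.

```python
import threading

class SharedData:
    """One category of shared state (detector, tracker, or TC data):
    every access goes through a lock held inside the API call, so the
    handlers themselves contain no explicit synchronization."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value

    def read(self):
        with self._lock:
            return dict(self._value)   # return a copy to avoid aliasing

    def update(self, key, value):
        with self._lock:
            self._value[key] = value

# a detector and its trackers would share this object via the API only
detector_data = SharedData({"detections": 0})
detector_data.update("detections", detector_data.read()["detections"] + 1)
```

Returning a copy on `read` mirrors the "no side effects to shared data structures" rule: a handler never mutates shared state except through a guarded call.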
2) TC Merge Model: To seamlessly merge two TCs into
one while tracking the targets in real time, the TC system
periodically calls the equality checker on candidates for the merge
operation. After merge, one of the two TCs is eliminated,
while the other TC becomes the union of the two pre-
vious TCs.
Execution of the equality checker on different pairs of TCs can be done in parallel since it does not update any TC
data. Similarly, merger operations can go on in parallel so
long as the TCs involved in the parallel merges are all
distinct.
The TC system may use camera topology information for
efficient merge operations. For example, if many targets
are being tracked in a large-scale camera network, only
those targets in nearby cameras should be compared and merged to reduce the performance overhead of the real-time surveillance application.
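Topology-based pruning of merge candidates can be sketched as follows: only TC pairs whose cameras are identical or adjacent are handed to the (expensive) equality checker. The adjacency map and container-to-camera assignment below are invented for illustration.

```python
def merge_candidates(tc_cameras, neighbors):
    """Return only pairs of target containers whose cameras are the same
    or adjacent in the topology; distant pairs are never compared."""
    tids = sorted(tc_cameras)
    pairs = []
    for i, a in enumerate(tids):
        for b in tids[i + 1:]:
            ca, cb = tc_cameras[a], tc_cameras[b]
            if ca == cb or cb in neighbors.get(ca, set()):
                pairs.append((a, b))
    return pairs

# cam1 -- cam2 -- cam3 in a line; three containers, one per camera
neighbors = {"cam1": {"cam2"}, "cam2": {"cam1", "cam3"}, "cam3": {"cam2"}}
tc_cameras = {1: "cam1", 2: "cam2", 3: "cam3"}
pairs = merge_candidates(tc_cameras, neighbors)
```

Here containers 1 and 3 are never compared because their cameras are not adjacent, so the equality checker runs on two pairs instead of three; in a large deployment the savings grow quadratically.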
B. ASAP
Situation-awareness applications (such as video-based
surveillance) are capable of stressing the available compu-
tation and communication infrastructures to their limits.
Hence, the underlying system infrastructure should be: 1) highly scalable (i.e., designed to reduce infrastructure
overload, cognitive overload, and false positives and false
negatives); 2) flexible to use (i.e., provide query- and
policy-based user interfaces to exploit context-sensitive
information for selective attention); and 3) easily exten-
sible (i.e., accommodate heterogeneous sensing and allow
for incorporation of new sensing modalities).
We have designed and implemented a distributed
software architecture called ASAP for situation-awareness
applications. Fig. 5 shows the logical organization of the
ASAP architecture into control and data network. The
control network deals with low-level sensor-specific pro-
cessing to derive priority cues. These cues in turn are used
by the data network to prioritize the streams and carry out
further processing such as filtering and fusion of streams. It should be emphasized that this logical separation is
simply a convenient vehicle to partition the functionalities
of the ASAP architecture. The two networks are in fact
overlaid on the same physical network and share the
computational and sensing resources. For example, low
bitrate sensing, such as an RFID tag or a fire alarm, is part
of the control network. However, a high bitrate camera
sensor while serving the video stream for the data network may also be used by the control network for discerning
motion.
Fig. 6 shows the software architecture of ASAP: it is a
peer-to-peer network of ASAP agents (AAs) that execute on
independent nodes of the distributed system. The software
organization in each node consists of two parts: ASAP agent
(AA) and sensor agent (SA). There is one SA per sensor, and
a collection of SAs are assigned dynamically to an AA.
1) Sensor Agent: The SA provides a virtual sensor abstraction that offers a uniform interface for incorporating hete-
rogeneous sensing devices as well as to support multimodal
sensing in an extensible manner. This abstraction allows
new sensor types to be added without requiring any change of the AA. There is a potential danger in such a virtual-
ization that some specific capability of a sensor may get
masked from full utilization. To avoid such semantic loss,
we have designed a minimal interface that serves the needs
of situation-awareness applications.
The virtual sensor abstraction allows the same physical
sensor to be used for providing multiple sensing services.
For example, a camera can serve not only as a video data stream, but also as a motion or a face detection sensor.
Similarly, an SA may even combine multiple physical
sensors to provide a multimodal sensing capability. Once
these different sensing modalities are registered with AAs,
they are displayed as a list of available features that users
can select to construct a query for the ASAP platform.
Fig. 6. ASAP software architecture.
Fig. 5. Functional view of ASAP.
The ASAP platform uses these features as control cues for prioritization (to be discussed shortly).
2) ASAP Agent: As shown in Fig. 6, an AA is associated
with a set of SAs. The association is dynamic, and is engi-
neered at runtime in a peer-to-peer fashion among the
AAs. The components of AA are shown in Fig. 6. AA
provides a simple query interface with SQL-like2 syntax.
Clients can pose an SQL query using control cues as attributes. Different cues can be combined using "and" and "or" operators to create multimodal sensing queries.
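The effect of combining cues with "and"/"or" can be illustrated with a tiny recursive evaluator over boolean cue readings; the tuple-based expression encoding and the cue names are assumptions for illustration, not the actual ASAP query syntax.

```python
def eval_query(cues, expr):
    """Evaluate a nested and/or combination of named cues against the
    latest boolean sensor readings."""
    op, terms = expr
    values = [cues[t] if isinstance(t, str) else eval_query(cues, t)
              for t in terms]
    return all(values) if op == "and" else any(values)

# "motion AND (rfid_tag_42 OR face)" over one snapshot of cue readings
cues = {"motion": True, "rfid_tag_42": False, "face": True}
query = ("and", ["motion", ("or", ["rfid_tag_42", "face"])])
hit = eval_query(cues, query)
```

A hit on such a query is what determines which camera feeds the data network must disseminate to the posing client.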
False positives and negatives: Fig. 5 shows that sensed data lead to events, which, when filtered and fused, ulti-
mately lead to actionable knowledge. Unfortunately,
individual sensors may often be unreliable due to envi-
ronmental conditions (e.g., poor lighting conditions near a
camera). Thus, it may not always be possible to have high confidence in the sensed data; consequently, there is a
danger that the system may experience high levels of false
negatives and false positives. It is generally recognized
that multimodal sensors would help reduce the ill effects
of false positives and negatives. The virtual sensor
abstraction of ASAP allows multiple sensors to be fused
together and registered as a new sensor. Unlike multi-
feature fusion (a la face recognizer), where features are derived from the same (possibly noisy) image, multisensor
fusion uses different sensing modalities. ASAP exploits a
quorum system to make a decision. Even though a majority
vote is implemented at the present time, AA may assign
different weights to the different sensors commensurate
with the error rates of the sensors to make the voting more
accurate.
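The quorum decision can be sketched as a weighted vote over boolean detections from different modalities: with equal weights it reduces to the majority vote currently implemented, while unequal weights model sensors with different error rates. The sensor names and weight values are illustrative assumptions.

```python
def weighted_vote(readings, weights):
    """Fuse boolean detections from several modalities: fire if the
    weighted mass of agreeing sensors exceeds half the total weight."""
    agree = sum(weights[s] for s, fired in readings.items() if fired)
    total = sum(weights[s] for s in readings)
    return agree > total / 2.0

readings = {"camera_motion": True, "rfid": True, "microphone": False}
equal_weights = {s: 1.0 for s in readings}
decision = weighted_vote(readings, equal_weights)  # plain majority vote
```

With equal weights, two agreeing modalities out of three carry the vote; down-weighting an unreliable sensor (say, a camera in poor lighting) can flip the same readings to a non-detection.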
Prioritization strategies: ASAP needs to continuously extract prioritization cues from all the cameras and other
sensors (control network), and disseminate the selected
camera streams (data network) to interested clients
(which could be detectors/trackers of the TC system
from Section IV-A and/or an end user such as security
personnel). ASAP extracts information from a sensor
stream by invoking the corresponding SA. Since there may
be many SAs registered at any time, invoking all SAs may be very compute intensive. ASAP needs to prioritize the
invocations of SAs to scale well with the number of sen-
sors. This leads to the need for priority-aware computation in the control network. Once a set of SAs that are relevant
to client queries is identified, the corresponding camera
feeds need to be disseminated to the clients. If the band-
width required to disseminate all streams exceeds the
available bandwidth near the clients, the network will end up dropping packets. This leads to the need for priority-aware communication in the data network. Based on these needs,
the prioritization strategies employed by ASAP can be
grouped into the following categories: priority-aware computation and priority-aware communication.
Priority-aware computation: The challenge is dynami-
cally determining a set of SAs among all available SAs that
need to be invoked such that the overall value of the derived
actionable knowledge (benefit for the application) is maxi-
mized. We use the term measure of effectiveness (MOE) to
denote this overall benefit. ASAP currently uses a simple
MOE based on clients' priorities.
The priority of an SA should reflect the amount of
possibly "new" information the SA output may have and
its importance to the query in progress. Therefore, the
priority value is dynamic, and it depends on multiple fac-
tors, including the application requirements, and the in-
formation already available from other SAs. In its simplest
form, priority assignment can be derived from the priority
of the queries themselves. For instance, given two queries from an application, if the first query is more important
than the second one, the SAs relevant to the first query
will have higher priority compared to the SAs corre-
sponding to the second query. More importantly, com-
putations do not need to be initiated at all SAs since
1) such information extracted from sensed data may not
be required by any AA; and 2) unnecessary computation
can degrade overall system performance. The WHERE clause in the SQL-like query is used to activate a specific
sensing task. If multiple WHERE conditions exist, the
least computation-intensive task is initiated first, which in turn activates the next task. While this heuristic involves a tradeoff between latency and overhead, ASAP uses it for the sake of scalability.
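The staged activation heuristic can be sketched as ordering WHERE conditions by an estimated compute cost and letting each firing task gate the next; the cost table and cue names below are assumptions for illustration.

```python
ESTIMATED_COST = {"rfid": 1, "motion": 5, "face": 50}  # assumed relative costs

def activation_order(conditions):
    """Order WHERE conditions so the cheapest sensing task runs first."""
    return sorted(conditions, key=lambda c: ESTIMATED_COST[c])

def run_pipeline(conditions, fires):
    """Invoke tasks cheapest-first; each firing task activates the next."""
    invoked = []
    for cond in activation_order(conditions):
        invoked.append(cond)
        if not fires[cond]:
            break        # the chain stops; costlier tasks never run
    return invoked

# face detection is attempted only after the cheap motion cue fires
invoked = run_pipeline(["face", "motion"], {"motion": True, "face": True})
```

When the motion cue does not fire, the expensive face-detection task is skipped entirely, which is where the scalability benefit comes from, at the price of the added latency of running the stages in sequence.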
Priority-aware communication: The challenge is designing prioritization techniques for communication on the data network such that the application-specific MOE can be
maximized. Questions to be explored here include: How
do we assign priorities to different data streams and how
do we adjust their spatial or temporal fidelities that maxi-
mize the MOE?
In general, the control network packets are given higher
priority than data network packets. Since the control net-
work packets are typically much smaller than the data network packets, supporting a cluster of SAs with each AA
does not overload the communication infrastructure.
C. Summary of Results
We have built a testbed with network cameras and
RFID readers for object tracking based on RFID tags and
motion detection. This testbed allows us to both under-
stand the programmability of large-scale camera networks using the TC model, as well as understand the scalability of
the ASAP architecture. Specifically, in implementing
ASAP, we had three important goals: 1) platform neutrality
for the "box" that hosts the AA and SA; 2) ability to support a variety of sensors seamlessly (e.g., network
cameras as well as USB cameras); and 3) extensibility to
support a wide range of handheld devices. We augmented
2 SQL is derived from the original acronym SEQUEL, which stands for Structured English QUEry Language, for relational database systems.
our real testbed consisting of tens of cameras, RFID readers, and microphones with emulated sensors. The
emulated sensors use the uniform virtual sensor interface
discussed in Section IV-B. Due to the virtual sensor
abstraction, an AA does not distinguish whether data comes
from an emulated sensor or a real sensor. The emulated
camera sends JPEG images at a rate requested by a client.
The emulated RFID reader sends tag detection events based on an event file, where different event files mimic different object movement scenarios.
By using real devices (cameras, RFID readers, and micro-
phones) and emulated sensors, we were able to conduct
experiments to verify that our proposed software architecture scales to a large number of cameras. The
workload used is as follows. An area is assumed to be made of a set of cells, organized as a grid. Objects start from a randomly selected cell, wait for a predefined time, and move to a neighbor cell. The number of objects, the grid size, and the object wait time are workload parameters. We used end-to-end
latency (from sensing to actuation), network bandwidth usage,
and CPU utilization as figures of merit as we scale up the
system size (i.e., the number of cameras from 20 to 980) and
the number of queries (i.e., interesting events to be observed).
The scalability is attested by two facts: 1) the end-to-end
latency remains the same as we increase the number of queries (for a system with 980 camera streams); and 2) the CPU load
and the network bandwidth requirements grow linearly with
the number of interesting events to be observed (i.e., number
of queries) and not proportional to the size of the system
(i.e., the number of camera sensors in the deployment).3
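The grid workload described above can be emulated in a few lines; the cell-neighborhood rule (4-connected moves, staying put at the grid boundary) and the treatment of wait time are assumptions where the text leaves details open.

```python
import random

def simulate_objects(grid, num_objects, steps, wait_time, seed=0):
    """Objects start in random cells and, after waiting, move to a
    neighbor cell; moves that would leave the grid keep the object put."""
    rng = random.Random(seed)
    rows, cols = grid
    cells = [(rng.randrange(rows), rng.randrange(cols))
             for _ in range(num_objects)]
    trace = [list(cells)]
    for step in range(steps):
        if step % wait_time == wait_time - 1:    # move only after waiting
            moved = []
            for r, c in cells:
                r2, c2 = rng.choice([(r - 1, c), (r + 1, c),
                                     (r, c - 1), (r, c + 1)])
                if 0 <= r2 < rows and 0 <= c2 < cols:
                    r, c = r2, c2
                moved.append((r, c))
            cells = moved
        trace.append(list(cells))
    return trace

trace = simulate_objects((10, 10), num_objects=3, steps=6, wait_time=3)
```

Each snapshot in the trace stands in for the tag-detection and motion events that the emulated sensors would feed to the AAs in the experiments.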
V. CASE STUDY: IBM S3
The IBM Smart Surveillance project [16], [61] is one of the
few research projects in smart surveillance systems that
turned into a product, which has been recently used to augment Chicago's video surveillance network [62]. Quite
a bit of fundamental research in computer vision
technologies forms the cornerstone for IBM’s smart
surveillance solution. Indeed, IBM S3 transformed video-
based surveillance systems from a pure data acquisition
endeavor (i.e., recording the video streams on DVRs for
postmortem analysis) to an intelligent real-time online
video analysis engine that converts raw data into actionable knowledge. The IBM S3 product includes several
novel technologies [63] including multiscale video acquisi-
tion and analysis, salient motion detection, 2-D multiobject
tracking, 3-D stereo object tracking, video-tracking-based
object classification, object structure analysis, face categori-
zation following face detection to prevent "tailgating," etc. Backed by the IBM DB2 product, IBM S3 is a powerful
engine for online querying of live and historical data in the hands of security personnel.
Our work, focusing on the programing model for large-scale situation awareness and a scalable peer-to-peer system
architecture for multimodal sensing, is complementary to
the state of the art established by the IBM S3 research.
VI. CONCLUDING REMARKS
Large-scale situation-awareness applications will continue
to grow in scale and importance as our penchant for instrumenting the world with sensors of various modalities
and capabilities continues. Using video-based surveillance
as a concrete example, we have reviewed the enabling
technologies spanning wireless networking, smart cam-
eras, computer vision, context-aware frameworks, and
programing models. We have reported our own experi-
ences in building scalable programing models and software
infrastructure for situation awareness.
Any interesting research answers a few questions and
raises several more. This work is no different. One of the hairiest problems with physical deployment is the heteroge-
neity and lack of standards for smart cameras. A typical
large-scale deployment will include smart cameras of dif-
ferent models and capabilities. Vendors typically provide
their own proprietary software for analyzing camera feeds
and controlling the cameras from a dashboard. Interoperability of camera systems from different vendors is difficult
if not impossible.
From the perspective of computer vision, one of the
major challenges is increasing the accuracy of detection
and/or scene analysis in the presence of ambient noise,
occlusion, and rapid movement of objects. Multiple views
of the same object help in improving the accuracy of de-
tection and analysis; with the ubiquity of cameras in recent years, it is now feasible to deploy several tens if not hun-
dreds of cameras in relatively small spaces (e.g., one gate
of an airport); but the challenge of using these multiple
views to develop accurate and scalable object detection
algorithms still remains an open problem.
From a systems perspective, there is a considerable amount of work to be done in aiding the domain expert. There needs
to be closer synergy between vision researchers and systemsresearchers to develop the right abstractions for programing
large-scale camera networks, facilitating seamless handoff
from one camera to another as objects move, state and
computation migration between smart cameras and backend
servers, elastically increasing the computational resources to
deal with dynamic application needs, etc.
Last but not least, one of the thorniest problems plaguing the Internet today is bound to hit sensor-based distributed computing in the near future, namely,
spam. We have intentionally avoided discussing tamper-
proofing techniques such as steganography in camera sys-
tems; but as we explore mission-critical applications (such
as surveillance, urban terrorism, emergency response, and
healthcare) ensuring the veracity of sensor sources will
become increasingly important.
3 Details of the TC programing system can be found in [59], and detailed results of the ASAP system evaluation can be found in [60].
REFERENCES
[1] M. McCahill and C. Norris, "Estimating the extent, sophistication and legality of CCTV in London," in CCTV, M. Gill, Ed. London, U.K.: Palgrave Macmillan, 2003.
[2] R. Hunter, Chicago's Surveillance Plan is an Ambitious Experiment, Gartner Research, 2004. [Online]. Available: http://www.gartner.com/DisplayDocument?doc_cd=123919.
[3] C. Norris, M. McCahill, and D. Wood, "The growth of CCTV: A global perspective on the international diffusion of video surveillance in publicly accessible space," Surveill. Soc., vol. 2, no. 2/3, pp. 110–135, 2004.
[4] W. E. L. Grimson and C. Stauffer, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1999, DOI: 10.1109/CVPR.1999.784637.
[5] A. Elgammal, D. Harwood, and L. S. Davis, "Nonparametric background model for background subtraction," in Proc. 6th Eur. Conf. Comput. Vis., 2000, pp. 751–767.
[6] D. M. Gavrila, "Pedestrian detection from a moving vehicle," in Proc. 6th Eur. Conf. Comput. Vis. II, 2000, pp. 37–49.
[7] P. A. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.
[8] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Who? When? Where? What? A real time system for detecting and tracking people," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 222–227.
[9] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 780–785, Jul. 1997.
[10] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
[11] J. Little and J. Boyd, "Recognizing people by their gait: The shape of motion," Videre, vol. 1, no. 2, pp. 1–32, 1998.
[12] A. Bissacco, A. Chiuso, Y. Ma, and S. Soatto, "Recognition of human gaits," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2001, vol. 2, pp. 52–58.
[13] D. M. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[14] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001.
[15] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231–268, 2001.
[16] IBM Smart Surveillance System (S3). [Online]. Available: http://www.research.ibm.com/peoplevision/
[17] Video Surveillance Integrated Surveillance Systems. [Online]. Available: https://www.buildingtechnologies.siemens.com
[18] Products That Make Surveillance Smart. [Online]. Available: http://www.objectvideo.com/products/
[19] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[20] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proc. Int. Conf. Comput. Vis., 2009, pp. 32–39.
[21] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[22] B. Babenko, M. Yang, and S. J. Belongie, "Visual tracking with online multiple instance learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 983–990.
[23] D. A. Ross, J. Lim, R. Lin, and M. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, pp. 125–141, 2008.
[24] J. Kwon and K. M. Lee, "Visual tracking decomposition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1269–1276.
[25] A. Basharat, A. Gritai, and M. Shah, "Learning object motion patterns for anomaly detection and improved object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, DOI: 10.1109/CVPR.2008.4587510.
[26] I. Saleemi, K. Shafique, and M. Shah, "Probabilistic modeling of scene dynamics for applications in visual surveillance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 8, pp. 1472–1485, Aug. 2009.
[27] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 17–24.
[28] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1975–1981.
[29] X. Wang, X. Ma, and W. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 539–555, Mar. 2009.
[30] H. Zhong, J. Shi, and M. Visontai, "Detecting unusual activity in video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, vol. 2, pp. 819–826.
[31] I. Bravo, J. Baliñas, A. Gardel, J. L. Lázaro, F. Espinosa, and J. García. (2011). Efficient smart CMOS camera based on FPGAs oriented to embedded image processing. Sensors [Online]. 11(3), pp. 2282–2303. Available: http://www.mdpi.com/1424-8220/11/3/2282/
[32] S. K. Nayar, "Computational cameras: Redefining the image," IEEE Comput. Mag., vol. 39, no. 8, Special Issue on Computational Photography, pp. 30–38, Aug. 2006.
[33] P. Chen, P. Ahammad, C. Boyer, S. Huang, L. Lin, E. Lobaton, M. Meingast, S. Oh, S. Wang, P. Yan, A. Y. Yang, C. Yeo, L. Chung Chang, J. D. Tygar, and S. S. Sastry, "Citric: A low-bandwidth wireless camera network platform," in Proc. ACM/IEEE Int. Conf. Distrib. Smart Cameras, 2008, DOI: 10.1109/ICDSC.2008.4635675.
[34] P. Kulkarni, D. Ganesan, P. Shenoy, and Q. Lu, "Senseye: A multi-tier camera sensor network," in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 229–238.
[35] H. Broers, W. Caarls, P. Jonker, and R. Kleihorst, "Architecture study for smart cameras," in Proc. EOS Conf. Ind. Imag. Mach. Vis., 2005, pp. 39–49.
[36] Texas Instruments IP Camera. [Online]. Available: http://www.ti.com/ipcamera
[37] Axis Communications. [Online]. Available: http://www.axis.com/
[38] Microsoft Kinect. [Online]. Available: http://www.xbox.com/kinect
[39] IEEE Standard for Wireless Local Area Networks, 802.11n. [Online]. Available: http://www.ieee802.org/11.
[40] The Zigbee Specification. [Online]. Available: http://www.zigbee.org
[41] C. Perkins, E. B. Royer, and S. Das, "Ad hoc on-demand distance vector (AODV) routing," IETF RFC 3561, 2003.
[42] 3GPP Long Term Evolution. [Online]. Available: http://www.3gpp.org/article/lte
[43] Worldwide Interoperability for Microwave Access. [Online]. Available: http://www.wimaxforum.org
[44] Xbow Micaz Motes. [Online]. Available: http://www.xbow.com
[45] RPL: IPv6 Routing Protocol for Low Power and Lossy Networks, draft-ietf-roll-rpl-19. [Online]. Available: http://tools.ietf.org/html/draft-ietf-roll-rpl-19
[46] RFC 3344: IP Mobility Support for IPv4, U.S., 2002, Tech. Rep.
[47] IBM, A Smarter Planet. [Online]. Available: http://www.ibm.com/smarterplanet.
[48] L. Luo, A. Kansal, S. Nath, and F. Zhao, "Sharing and exploring sensor streams over geocentric interfaces," in Proc. 16th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., Irvine, CA, Nov. 2008, pp. 3–12.
[49] Nokia, Sensor Planet. [Online]. Available: http://www.sensorplanet.org/.
[50] L. Sanchez, J. Lanza, M. Bauer, R. L. Olsen, and M. G. Genet, "A generic context management framework for personal networking environments," in Proc. 3rd Annu. Int. Conf. Mobile Ubiquitous Syst., 2006, DOI: 10.1109/MOBIQW.2006.361743.
[51] S. Kang, J. Lee, H. Jang, H. Lee, Y. Lee, S. Park, T. Park, and J. Song, "Seemon: Scalable and energy-efficient context monitoring framework for sensor-rich mobile environments," in Proc. ACM Int. Conf. Mobile Syst., 2008.
[52] D. J. Lillethun, D. Hilley, S. Horrigan, and U. Ramachandran, "MB++: An integrated architecture for pervasive computing and high-performance computing," in Proc. 13th IEEE Int. Conf. Embedded Real-Time Comput. Syst. Appl., Aug. 2007, pp. 241–248.
[53] F. Hohl, U. Kubach, A. Leonhardi, K. Rothermel, and M. Schwehm, "Next century challenges: Nexus – An open global infrastructure for spatial-aware applications," in Proc. 5th ACM/IEEE Int. Conf. Mobile Comput. Netw., Seattle, WA, Aug. 1999, pp. 249–255.
[54] R. Lange, N. Cipriani, L. Geiger, M. Grossmann, H. Weinschrott, A. Brodt, M. Wieland, S. Rizou, and K. Rothermel, "Making the world wide space happen: New challenges for the nexus platform," in Proc. 7th IEEE Int. Conf. Pervasive Comput. Commun., 2009, DOI: 10.1109/PERCOM.2009.4912782.
[55] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo, "Spade: The system S declarative stream processing engine," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1123–1134. [Online]. Available: http://doi.acm.org/10.1145/1376616.1376729.
[56] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Proc. IEEE Int. Conf. Data Mining Workshops, 2010, pp. 170–177.
[57] W. Thies, M. Karczmarek, and S. P. Amarasinghe, "Streamit: A language for streaming applications," in Proc. 11th Int. Conf. Compiler Construct., London, U.K., 2002, pp. 179–196. [Online]. Available: http://portal.acm.org/citation.cfm?id=647478.727935.
[58] P. S. Pillai, L. B. Mummert, S. W. Schlosser, R. Sukthankar, and C. J. Helfrich, "Slipstream: Scalable low-latency interactive perception on streaming data," in Proc. 18th Int. Workshop Netw. Oper. Syst. Support Digital Audio Video, 2009, pp. 43–48. [Online]. Available: http://doi.acm.org/10.1145/1542245.1542256.
[59] K. Hong, B. Branzoi, J. Shin, S. Smaldone, L. Iftode, and U. Ramachandran, Target Container: A Target-Centric Parallel Programming Abstraction for Video-Based Surveillance, 2010. [Online]. Available: http://hdl.handle.net/1853/36186.
[60] J. Shin, R. Kumar, D. Mohapatra, U. Ramachandran, and M. Ammar, "ASAP: A camera sensor network for situation awareness," in Proc. OPODIS 2007, ser. Lecture Notes in Computer Science, vol. 4878. Berlin, Germany: Springer-Verlag, 2007, pp. 31–47.
[61] R. Feris, A. Hampapur, Y. Zhai, R. Bobbitt, L. Brown, D. Vaquero, Y. Tian, H. Liu, and M.-T. Sun, "Case study: IBM smart surveillance system," in Intelligent Video Surveillance: Systems and Technologies, Y. Ma and G. Qian, Eds. London, U.K.: Taylor & Francis/CRC Press, 2009.
[62] ABC News, ABC7 Puts Video Analytics to the Test, Feb. 23, 2010. [Online]. Available: http://abclocal.go.com/wls/story?section=news/special_segments&id=7294108.
[63] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, S. Pankanti, A. Senior, C.-F. Shu, and Y. L. Tian, "Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking," IEEE Signal Process. Mag., vol. 22, no. 2, pp. 38–51, Mar. 2005.
ABOUT THE AUTHORS
Umakishore Ramachandran (Senior Member,
IEEE) received the Ph.D. degree in computer
science from the University of Wisconsin-Madison,
Madison, in 1986.
He is the Director of Samsung Tech Advanced
Research (STAR) Center and a Professor in the
College of Computing, Georgia Institute of Tech-
nology, Atlanta. His research interests span par-
allel and distributed systems, sensor networks,
pervasive computing, and mobile and embedded
computing.
Kirak Hong received the B.S. degree in computer
science from Yonsei University, Seoul, Korea, in
2009. Currently, he is working towards the Ph.D.
degree at the College of Computing, Georgia
Institute of Technology, Atlanta. His dissertation
research focuses on programing models and
execution frameworks for large-scale situation
awareness applications.
His research interests span distributed sys-
tems, mobile and embedded computing, and
sensor networks.
Liviu Iftode (Senior Member, IEEE) received the
Ph.D. degree in computer science from Princeton
University, Princeton, NJ, in 1998.
He is a Professor of Computer Science at
Rutgers University, Piscataway, NJ. His re-
search interests include operating systems, dis-
tributed systems, mobile, vehicular, and pervasive
computing.
Prof. Iftode is a member of the Association for
Computing Machinery (ACM).
Ramesh Jain (Fellow, IEEE) received the B.E.
degree from Visvesvaraya Regional College of
Engineering, Nagpur, India, in 1969 and the Ph.D.
degree from the Indian Institute of Technology,
Kharagpur, India, in 1975.
He is a Donald Bren Professor in Information &
Computer Sciences at the University of California
at Irvine, Irvine, where he is doing research in
EventWeb and experiential computing. His current
research interests are in searching multimedia
data and creating EventWebs for experiential computing.
Dr. Jain is a Fellow of the Association for Computing Machinery (ACM),
the Association for the Advancement of Artificial Intelligence (AAAI), the
International Association for Pattern Recognition (IAPR), and The
International Society for Optics and Photonics (SPIE).
Rajnish Kumar received the Ph.D. degree in
computer science from Georgia Institute of Tech-
nology, Atlanta, in 2006. As part of his disserta-
tion, he designed and implemented SensorStack
that provides systems support for cross layering in
network stack for adaptability.
He is currently Chief Technology Officer at
Weyond, Princeton, NJ. His research interests are
in systems support for large-scale streaming data
analytics.
Kurt Rothermel received the Ph.D. degree in
computer science from University of Stuttgart,
Stuttgart, Germany, in 1985.
Since 1990, he has been with the University of
Stuttgart, where he is a Professor of Computer
Science and the Director of the Institute of Parallel
and Distributed Systems (IPVS). His research
interests span distributed systems, computer net-
works, mobile computing, and sensor networks.
Junsuk Shin received the B.S. degree in electrical
engineering from Yonsei University, Seoul, Korea
and the M.S. degree in computer science from
Georgia Institute of Technology, Atlanta, where he
is currently working towards the Ph.D. degree.
He joined Microsoft in 2009. His research
interests include distributed systems, sensor net-
works, mobile computing, and embedded systems.
Raghupathy Sivakumar received the Ph.D. de-
gree in computer science from the University of
Illinois at Urbana-Champaign, Urbana, in 2000.
He is a Professor in the School of Electrical and
Computer Engineering at Georgia Institute of
Technology, Atlanta. He leads the Georgia Tech
Networking and Mobile Computing (GNAN) Re-
search Group, conducting research in the areas of
wireless networking, mobile computing, and com-
puter networks.