INVITED PAPER
Large-Scale Situation Awareness With Camera Networks and Multimodal Sensing
This paper describes principles and practice of situational awareness in
communications; smart cameras, wireless infrastructure, context-aware
computing, and programming models are discussed.
By Umakishore Ramachandran, Senior Member IEEE, Kirak Hong,
Liviu Iftode, Senior Member IEEE, Ramesh Jain, Fellow IEEE, Rajnish Kumar,
Kurt Rothermel, Junsuk Shin, and Raghupathy Sivakumar
ABSTRACT | Sensors of various modalities and capabilities,
especially cameras, have become ubiquitous in our environ-
ment. Their intended use is wide ranging and encompasses
surveillance, transportation, entertainment, education, health-
care, emergency response, disaster recovery, and the like.
Technological advances and the low cost of such sensors
enable deployment of large-scale camera networks in large
metropolises such as London and New York. Multimedia algo-
rithms for analyzing and drawing inferences from video and
audio have also matured tremendously in recent times. Despite
all these advances, large-scale reliable systems for media-rich
sensor-based applications, often classified as situation-
awareness applications, are yet to become commonplace.
Why is that? There are several forces at work here. First, the
system abstractions are just not at the right level for quickly
prototyping such applications on a large scale. Second, while
Moore’s law has held true for predicting the growth of
processing power, the volume of data that applications are
called upon to handle is growing similarly, if not faster.
Enormous amounts of sensing data are continually generated
for real-time analysis in such applications. Further, due to the
very nature of the application domain, there are dynamic and
demanding resource requirements for such analyses. The lack
of the right set of abstractions for programming such applications,
coupled with their data-intensive nature, has hitherto made
realizing reliable large-scale situation-awareness applications
difficult. Incidentally, situation awareness is a very popular but
ill-defined research area that has attracted researchers from
many different fields. In this paper, we adopt a strong systems
perspective and consider the components that are essential in
realizing a fully functional situation-awareness system.
KEYWORDS | Large-scale distributed systems; programming
model; resource management; scalability; situation awareness;
video-based surveillance
I . INTRODUCTION
Situation awareness is both a property and an application
class that deals with recognizing when sensed data could
lead to actionable knowledge.
With advances in technology, it is becoming feasible to
integrate sophisticated sensing, computing, and communication in a single small-footprint sensor platform (e.g.,
smart cameras). This trend is enabling deployment of
powerful sensors of different modalities in a cost-effective
Manuscript received May 16, 2011; revised September 5, 2011; accepted October 20,
2011. Date of publication February 20, 2012; date of current version March 21, 2012.
U. Ramachandran, K. Hong, and J. Shin are with the College of Computing,
Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [emailprotected];
[emailprotected]; [emailprotected]).
L. Iftode is with the Department of Computer Science, Rutgers University,
Piscataway, NJ 08854 USA (e-mail: [emailprotected]).
R. Jain is with the School of Information and Computer Sciences, University of
California at Irvine, Irvine, CA 92697-3425 USA (e-mail: [emailprotected]).
R. Kumar was with the College of Computing, Georgia Institute of Technology,
Atlanta, GA 30332 USA. He is now with Weyond, Princeton, NJ 08540 USA
(e-mail: [emailprotected]).
K. Rothermel is with the Institute for Parallel and Distributed Systems
(IPVS), University of Stuttgart, 70569 Stuttgart, Germany
(e-mail: [emailprotected]).
R. Sivakumar is with the School of Electrical and Computer Engineering,
Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [emailprotected]).
Digital Object Identifier: 10.1109/JPROC.2011.2182093
878 Proceedings of the IEEE | Vol. 100, No. 4, April 2012 0018-9219/$31.00 ©2012 IEEE
manner. While Moore’s law has held true for predicting the growth of processing power, the volume of data that
applications handle is growing similarly, if not faster.
Situation-awareness applications are inherently distri-
buted, interactive, dynamic, stream based, computation-
ally demanding, and in need of real-time or near real-time
guarantees. A sense–process–actuate control loop charac-
terizes the behavior of this application class.
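This control loop can be sketched abstractly. The handlers below are illustrative placeholders for the domain-specific sensing, analysis, and actuation logic; none of the names or values come from a real system.

```python
def sense():
    # Stand-in for grabbing one reading or frame from a sensor.
    return {"motion_level": 42}

def process(sample):
    # Stand-in for analysis; maps sensed data to an action (or none).
    return "raise_alert" if sample["motion_level"] > 40 else None

def actuate(action):
    # Stand-in for the actuation step (notify an operator, pan a camera).
    return f"dispatched {action}"

# One iteration of the sense-process-actuate loop; a deployed system runs
# this continuously, under (near) real-time deadlines.
sample = sense()
action = process(sample)
result = actuate(action) if action else "no-op"
print(result)
```

The real-time requirement is precisely that the elapsed time from `sense` to `actuate` stay within the application's deadline on every iteration.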
There are three main challenges posed by data explosion for realizing situation awareness: overload on the
infrastructure, cognitive overload on humans in the loop,
and dramatic increase in false positives and false negatives
in identifying threat scenarios. Consider, for example,
providing situation awareness in a battlefield. It needs
complex fusion of contextual knowledge with time-
sensitive sensor data obtained from different sources to
derive higher level inferences. With an increase in the
sensed data, a fighter pilot will need to take more data into
account in decision making, leading to cognitive overload
and an increase in human errors (false positives and nega-
tives). Also, to process and disseminate the sensed data,
more computational and network resources are needed,
thus overloading the infrastructure.
Distributed video-based surveillance is a good canonical
example of this application class. Visual information plays a vital role in surveillance applications, as demonstrated by
the strategic use of video cameras as a routine means of
physical security. With advances in imaging technology,
video cameras have become increasingly versatile and so-
phisticated. They can be multispectral, can sense at varying
resolutions, can operate with differing levels of actuation
(stationary, moving, controllable), and can even be airborne (e.g., in military applications). Cameras are being
deployed on a large scale, from airports to city-scale infra-
structures. Such large-scale deployments result in massive
amounts of visual information that must be processed in
real time to extract useful and actionable knowledge for
timely decision making. The overall goal of surveillance
systems is to detect and track suspicious activities to ward
off potential threats. Reliable computer-automated surveillance using vision-based tracking, identification, and
activity monitoring can relieve operator tedium and allow
coverage of larger areas for various applications (airports,
cities, highways, etc.). Fig. 1 depicts the camera de-
ployment in an airport to serve as the infrastructure for
such a video-based surveillance system.
Video surveillance based on closed-circuit television
(CCTV) was first introduced in the United Kingdom in the middle of the last century. Since then, camera surveillance
networks have proliferated in the United Kingdom, with
over 200 000 cameras in London alone [1]. In the United
States, the penetration of CCTV has been relatively slower;
Chicago is leading with more than 2000 cameras, which
connect to an operation center constantly monitored by
police officers [2]. Apart from the legal and privacy aspects
of the CCTV technology [3], it is both expensive and hard to scale due to the huge human capital involved in moni-
toring the camera feeds.
Smart or intelligent surveillance combines sensing and
computing to automatically identify interesting objects and
suspicious behaviors. Advances in computer vision have
enabled a range of technologies including: human
Fig. 1. Cameras and people movement in an airport.
detection and discrimination [4]–[7]; single-camera and multicamera target tracking [8], [9]; biometric informa-
tion gathering, such as face [10] and gait signatures [11],
[12]; and human motion and activity classification [13]–
[15]. Such advances (often referred to as video analytics) are precursors to fully automated surveillance, and bode
well for use in many critical applications including and
beyond surveillance.
As image processing and interpretation tasks migrate from a manual to a computer-automated model, questions
of system scalability and efficient resource management
will arise and must be addressed. In large settings such as
airports or urban environments, processing the data
streaming continuously from multiple video cameras is a
computationally intensive task. Moreover, given the goals
of surveillance, images must be processed in real time in
order to provide the timeliness required by modern security practice. Questions of system scalability go beyond
video analytics, and fall squarely in the purview of distri-
buted systems research.
Consider a typical smart surveillance system in an air-
port with cameras deployed in a pattern to maintain con-
tinuous surveillance of the terminal concourses (Fig. 1).
Images from these cameras are processed by some
application-specific logic to produce the precise level of actionable knowledge required by the end user (human
and/or software agent). The application-specific processing
may analyze multiple camera feeds to extract higher level
information such as "motion," "presence of a human face," or "committing a suspicious activity." Additionally, a security agent can specify policies, e.g., "only specified people
are allowed to enter a particular area," which causes the
system to trigger an alert whenever such a policy is violated.
The surveillance system described above, fully realized,
is no longer a problem confined to computer vision but a
large-scale distributed systems problem with intensive
data-processing resource requirements. Consider, for ex-
ample, a simple small-scale surveillance system that does
motion sensing and Joint Photographic Experts Group
(JPEG) encoding/decoding. Fig. 2 shows the processing
requirements for such a system using a centralized setup [single 1.4-GHz Intel Pentium central processing unit
(CPU)]. In this system, each camera is restricted to stream
images at a slow rate of 5 frames/s, and each image has a
very coarse-grained resolution of only 320 × 240. Even
under the severely restricted data-processing conditions,
the results show that the test system cannot scale beyond
four cameras due to CPU saturation. Increasing the video
quality (frames per second and resolution) to those required by modern security applications would saturate
even a high-end computing system attempting to process
more than a few cameras. Clearly, scaling up to a large
number of cameras (on the order of hundreds or thou-
sands) warrants a distributed systems solution.
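A back-of-envelope calculation puts this in perspective. The saturation workload below comes from the experiment just described (four cameras, 5 frames/s, 320 × 240); the "modern" stream parameters (30 frames/s at 1920 × 1080) are our illustrative assumption of what current security practice demands, not a measured figure.

```python
# Pixel throughput at the saturation point of the centralized test system:
# four cameras, each streaming 320x240 frames at 5 frames/s.
sat_px_per_s = 4 * 320 * 240 * 5

# One hypothetical modern stream: 1080p at 30 frames/s (assumed values).
modern_px_per_s = 1920 * 1080 * 30

# Load of a single modern camera relative to the whole saturated setup.
ratio = modern_px_per_s / sat_px_per_s
print(f"saturation workload: {sat_px_per_s:,} px/s")
print(f"one modern camera:   {modern_px_per_s:,} px/s ({ratio:.1f}x)")
```

Even if processing cost scaled only linearly with pixel rate, a single such camera would present roughly 40 times the workload that saturated the test CPU, which is why scaling to hundreds or thousands of cameras calls for a distributed solution.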
We take a systems approach to scalable smart surveil-
lance, embodying several interrelated research threads:
1) determining the appropriate system abstractions to aid the computer-vision domain expert in developing such
complex applications; 2) determining the appropriate
execution model that fully exploits the resources across
the distributed system; and 3) identifying technologies
spanning sensing hardware and wireless infrastructures for
supporting large-scale situation awareness.
Situation awareness as a research area is still evolving.
It has attracted researchers from vastly different fields
spanning computer vision, robotics, artificial intelligence,
systems, and networking. In this paper, we discuss the
component technologies, rather than an end-to-end system,
that are essential to realize a fully functional situation-
awareness system. We start by understanding the applica-
tion requirements, especially in the domain of video-based
surveillance (Section II). We use this domain knowledge to
raise questions about the systems research that is needed to support large-scale situation awareness. We then pre-
sent a bird’s eye view of the enabling technologies of rele-
vance to large-scale situation awareness (Section III). This
tour of technologies spans computer vision, smart cameras
and other sensors, wireless, context-aware computing, and
programming models. We then report on our own experi-
ence in developing a system architecture for situation-
awareness applications (Section IV).
IBM’s S3 system [16] is perhaps the only complete end-
to-end system for situation awareness that we are aware of.
Fig. 2. Surveillance system resource utilization. (a) CPU load.
(b) Memory usage.
We include a case study of IBM’s S3 product, which represents the state of the art in online video-based surveil-
lance (Section V). We conclude with thoughts on where
we are headed in the future in the exploration of large-
scale situation awareness (Section VI).
II . APPLICATION MODEL
Using video-based surveillance as a concrete instance of the domain of situation-awareness applications, let us first
understand the application model. In a video-based sur-
veillance application, there are two key functions: detection and tracking. For the sake of this discussion, we will
say that detection is concerned with identifying any anoma-
lous behavior of a person or an object from a scene. For
example, in an airport, a person leaving a bag in a public
place and walking away is one such anomalous event. Such an event has to be captured in real time by an automated
surveillance system among thousands of normal activities
in the airport. As can be imagined, there could be several
such potentially anomalous events that may be happening
in an airport at any given time. Once such an event is
detected, the object or the person becomes a target, and the
automated surveillance system should keep track of the
target that triggered the event. While tracking the target
across multiple cameras, the surveillance system provides
all relevant information of the target including location
and multiple views captured by different cameras, to
eventually lead to a resolution of whether the original
event is a benign one or something serious warranting
appropriate action by a security team. For clarity, we will
use the terms detector and tracker to denote these two
pieces of the application logic.
The application model reveals the inherent parallel/
distributed nature of a video-based surveillance applica-
tion. Each detector is a per-camera computation and these
computations are inherently data parallel since there is no
data dependency among the detectors working on different
camera streams. Similarly, each tracker is a per-target
computation that can be run concurrently for each target.
If a target simultaneously appears in the field of view (FOV) of multiple cameras, the trackers following the tar-
get on each of the different camera streams need to work
together to build a composite knowledge of the target.
Moreover, there exist complex data sharing and commu-
nication patterns among the different instances of detec-
tors and trackers. For example, the detector and trackers
have to work together to avoid duplicate detection of the
same target.
The application model as presented above can easily be
realized on a small scale (i.e., on the order of tens of
camera streams) by implementing the application logic to
be executed on each of the cameras, and the output to be
centrally analyzed for correlation and refinement. Indeed,
there are already video analytics solution providers [17],
[18] that peddle mature commercial products for such
scenarios. However, programming such scenarios on a large scale requires a distributed approach, whose scalability is a
hard open problem.
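The parallel structure just described can be sketched compactly. The code below is an illustrative toy, not an interface from any real system: `detect` stands in for per-camera detection, frames are plain dictionaries, and a shared target registry is the simplest way to show the duplicate-suppression coordination between detectors and trackers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-camera detector: flags frames whose payload marks an
# anomalous event (a real system would run vision algorithms here).
def detect(camera_id, frames):
    return [(camera_id, f["target"]) for f in frames if f["anomalous"]]

# Shared registry of targets already being tracked, so that a target seen
# by several detectors spawns only one tracker (duplicate suppression).
tracked = set()

def assign_trackers(detections):
    new_targets = []
    for camera_id, target in detections:
        if target not in tracked:
            tracked.add(target)
            new_targets.append(target)   # one tracker per new target
    return new_targets

# Detectors are data parallel: one task per camera stream.
streams = {
    "cam1": [{"target": "bag17", "anomalous": True}],
    "cam2": [{"target": "bag17", "anomalous": True},   # same target, second view
             {"target": "person3", "anomalous": False}],
}
with ThreadPoolExecutor() as pool:
    results = pool.map(lambda kv: detect(*kv), streams.items())

trackers = [t for dets in results for t in assign_trackers(dets)]
print(trackers)   # bag17 is tracked once despite being seen by two cameras
```

In a distributed realization, the registry itself becomes shared state across machines, which is exactly the kind of coordination that makes scaling this application class a systems problem.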
How can a vision expert write an application for video-
based surveillance that spans thousands of cameras and
other sensors? How can we design a scalable infrastructure
that spans a huge geographical area such as an airport or a
city to support such applications? How do we reduce the
programming burden on the domain expert by providing the right high-level abstractions? What context information is
needed to support prioritization of data streams and asso-
ciated computations? How can we transparently migrate
computations between the edges of the network (i.e., at or
close to the sensors) and the computational workhorses
(e.g., cloud)? How do we adaptively adjust the fidelity of
the computation commensurate with the application dyna-
mics (e.g., more targets to be observed than can be sustained by the infrastructure)? These are some of
the questions that our vision for large-scale situation
awareness raises.
III . ENABLING TECHNOLOGIES
The objective in this section is to give a bird’s eye view of
the state of the art in technologies that are key enablers for large-scale situation awareness. We start with a brief survey
of computer vision technologies as they apply to video-
based surveillance (Section III-A). We then discuss smart
camera technology that is aimed at reducing the stress on
the compute and networking infrastructure by facilitating
efficient edge processing (such as filtering and motion
detection) to quench the uninteresting camera streams at
the source (Section III-B). We then survey wireless technologies (Section III-C) that allow smart cameras to be
connected to backend servers given the computationally
intensive nature of computer vision tasks. This is followed
by reviewing the middleware framework for context-aware
computing, a key enabler to paying selective attention to
streams of interest for deeper analysis in situation-
awareness applications (Section III-D). Last, we review
programming models and execution frameworks, perhaps the most important piece of the puzzle for developing large-
scale situation-awareness applications (Section III-E).
A. Computer Vision
Computer vision technologies have advanced dramat-
ically during the last decade in a number of ways. Many
algorithms have been proposed in different subareas of
computer vision and have significantly improved the performance of computer vision processing tasks. There are
two aspects to performance when it comes to vision tasks:
accuracy and latency. Accuracy has to do with the correct-
ness of the inference made by the vision processing task
(e.g., how precise is the bounding box around a face
generated by a face detection algorithm?). Latency, on the
other hand, has to do with the time it takes for a vision
processing task to complete its work. Traditionally, computer vision research has focused on developing
algorithms that increase the accuracy of detection, track-
ing, etc. However, when computer vision techniques are
applied to situation-awareness applications, there is a ten-
sion between accuracy and latency. Algorithms that
increase the accuracy of event detection are clearly pre-
ferable. However, if the algorithm is too slow then the
outcome of the event detection may be too late to serve as
actionable knowledge. In general, in a video-based surveil-
lance application, the objective is to shrink the elapsed time
(i.e., latency) between sensing and actuation. Since video
processing is continuous in nature, computer vision algo-
rithms strive to achieve a higher processing frame rate
(i.e., frames per second) to ensure that important events
are not missed. Therefore, computer vision research has
been focusing on improving performance in terms of both
accuracy and latency for computer vision tasks of relevance
to situation awareness, namely: 1) object detection; 2) object
tracking; and 3) event recognition.
Object detection algorithms, as the name suggests, de-
tect and localize objects in a given image frame. As the
same object can have significant variations in appearance
due to the orientation, lighting, etc., accurate object de-
tection is a hard problem. For example, with human sub-
jects there can be variation from frame to frame in poses,
hand position, and facial expressions [19]. Object detection
can also suffer from occlusion [20]. This is the reason
detection algorithms tend to be slow and do not achieve a
very high frame rate. For example, a representative detec-
tion algorithm proposed by Felzenszwalb et al. [19] takes 3 s to train one frame and 2 s to evaluate one frame. To put this
performance in perspective, camera systems are capable of grabbing frames at rates upwards of 30 frames/s.
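The mismatch between detector latency and capture rate is easy to quantify; the sketch below simply works out the implied frame skipping, using the 2-s evaluation cost cited above for [19] and a 30-frames/s capture rate.

```python
capture_fps = 30           # frames/s arriving from the camera
detector_s_per_frame = 2   # evaluation cost per frame, as cited for [19]

# Frames that arrive while the detector evaluates a single frame.
arrivals_per_eval = capture_fps * detector_s_per_frame
fraction_examined = 1 / arrivals_per_eval
print(arrivals_per_eval, f"{fraction_examined:.1%}")
```

At these rates the detector examines under 2% of captured frames, which is one reason expensive detection is run selectively, with cheaper triggers deciding which frames deserve it.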
Object tracking research has addressed online algo-
rithms that train and track in real time (see [21] for a
comprehensive survey). While previous research has typi-
cally used lab environments with static backgrounds and
slowly moving objects in the foreground, recent research
[22]–[24] has focused on 1) real-time proces-
sing; 2) handling occlusions; 3) handling movement of both
target and background; and 4) handling scenarios where an
object leaves the FOV of one camera and appears in front
of another. The tradeoff between real-time performance
and accuracy is evident in the
design and experimental results reported by the authors of
these algorithms. For example, the algorithm proposed by
Babenko et al. [22] runs at 25 frames/s while the algorithm
proposed by Kwon and Lee [24] takes 1–5 s/frame (for
similar-sized video frames). However, Kwon and Lee [24] show through experimental results that their algorithm
results in higher accuracy over a larger data set of videos.
Event recognition is a higher level computer vision task
that plays an important role in situation-awareness appli-
cations. There are many different types of event recogni-
tion algorithms that are trained to recognize certain events
and/or actions from video data. Examples of high level
events include modeling individual object trajectories [25],
[26], recognizing specific human poses [27], and detecting
anomalies and unusual activities [28]–[30].
In recent years, the state of the art in automated visual
surveillance has advanced considerably for many tasks in-
cluding: detecting humans in a given scene [4], [5]; track-
ing targets within a given scene from a single camera or
multiple cameras [8], [9]; following targets in a wide FOV
given overlapping sensors; classifying targets into people, vehicles, animals, etc.; collecting biometric infor-
mation such as face [10] and gait signatures [11]; and
understanding human motion and activities [13], [14].
In general, it should be noted that computer vision
algorithms for tasks of importance to situation awareness,
namely, detection, tracking, and recognition, are compu-
tationally intensive. The first line of defense is to quickly
eliminate streams that are uninteresting. This is one of the advantages of using smart cameras (to be discussed next).
More generally, facilitating real-time execution of such
algorithms on a large-scale deployment of camera sensors
necessarily points to a parallel/distributed solution (see
Section III-E).
B. Smart Cameras
One of the keys to a scalable infrastructure for large-
scale situation awareness is to quench the camera streams
at the source if they are not relevant (e.g., no action in
front of a camera). One possibility is moving some aspects
of the vision processing (e.g., object detection) to the
cameras themselves. This would reduce the communica-
tion requirements from the cameras to the backend servers
and add to the overall scalability of the wireless infrastruc-
ture (see Section III-C). Further, it will also reduce the overall computation requirements in the backend server
for the vision processing tasks.
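In-camera quenching can be as simple as frame differencing. The sketch below is an illustrative stand-in for such edge logic, not an algorithm from the literature: frames are flat grayscale pixel lists, and a frame is forwarded to the backend only when enough pixels change relative to its predecessor.

```python
def motion_score(prev, curr):
    """Mean absolute per-pixel difference between two grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def quench(frames, threshold=10.0):
    """Forward only frames that differ enough from their predecessor."""
    forwarded = []
    prev = frames[0]
    for curr in frames[1:]:
        if motion_score(prev, curr) >= threshold:
            forwarded.append(curr)   # interesting: ship to the backend
        prev = curr
    return forwarded

# A static scene followed by a sudden change (tiny 4-pixel "frames").
static = [100, 100, 100, 100]
changed = [100, 100, 180, 180]
print(len(quench([static, static, changed])))  # only the changed frame survives
```

Even this crude filter eliminates the all-static portion of a stream at the camera, so neither the access link nor the backend server ever sees it.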
With the evolution of sensing and computation tech-
nologies, smart cameras have also evolved along three di-
mensions: data acquisition, computational capability, and
configurability. Data acquisition sensors are used for cap-
turing the images of camera views. There are two current
alternatives for such sensors: charge-coupled device (CCD) sensors and complementary metal–oxide–
semiconductor (CMOS) sensors. Despite the superior
image quality obtainable with CCD sensors, CMOS sensors
are more common in today’s smart cameras mainly because
of their flexible digital control, high-speed exposure, and
other functionalities.
The computational element, needed for real-time proces-
sing of the sensed data, is in general implemented using
one of (or a combination of) the following technologies:
digital signal processors (DSPs), microcontroller or micro-
processors, field programmable gate arrays (FPGAs), mul-
timedia processors, and application-specific integrated
circuits (ASICs). Microcontrollers provide the most flexi-
bility among these options but may be less suitable for the
implementation of image processing algorithms compared
to DSPs or FPGAs. With recent advances, memory controllers and microcontrollers are integrated with FPGA
circuits to attain hardware-level parallelism while main-
taining the reconfigurability of microprocessors. Thus,
FPGAs are emerging as a good choice for implementing
the computational elements of smart cameras [31].
Finally, because of the reconfigurability of CMOS sen-
sors and FPGA-based computational flexibility of today’s
smart cameras, it is now possible to have fine-grained
control of both sensing and processing units, leading to a
whole new field of computational cameras [32]. This new
breed of cameras can quickly adjust their optical circuitry
to obtain high-quality images even under dynamic lighting
or depth of view conditions. Such computational cameras,
combined with pan-tilt-zoom (PTZ) controllers, FPGA-
based image-processing elements, and a communication
element to interact with other cameras or remote servers, can be considered today as the state-of-the-art design of a
smart camera. CITRIC [33] is a recent example from an
academic setting of a wireless camera platform. Multi-
tiered camera platforms have also been proposed wherein
low-power camera motes wake up higher resolution came-
ras to capture and process interesting images. SensEye [34]
is one such platform; it achieves low latency from sensing
to actuation without sacrificing energy efficiency.
Companies such as Philips, Siemens, Sony, and Texas
Instruments [35], [36] have commercial smart camera
products, and such smart cameras usually have program-
mable interfaces for customization. Axis [37], while focus-
ing on IP cameras, incorporates multimodal sensors and
passive infrared (PIR) sensors (for motion detection) in
their camera offerings. The entertainment industry has
also embraced cameras with additional sensing modalities; e.g., Microsoft Kinect [38] uses advanced sensor technol-
ogies to construct 3-D video data with depth information
using a combination of CMOS cameras and infrared sensing.
One of the problems with depending on only one
sensing technology is the potential for increasing false positives (a false alarm for a nonexistent threat situation) and
false negatives (a real threat missed by the system). Despite
the sophistication of computer vision algorithms, it is still the case that these algorithms are susceptible to lighting
conditions, ambient noise, occlusions, etc. One way of
enhancing the quality of the inference is to augment the
vision techniques with other sensing modalities that may
be less error prone. Because of the obvious advantage of
multimodal sensing, many smart camera manufacturers
today add different sensors along with optics and provide
an intelligent surveillance system that takes advantage of
the nonoptical data, e.g., the use of an integrated global position-
ing system (GPS) receiver to tag the streamed data with
location information.
C. Wireless Infrastructure
The physical deployment for a camera-based situation-
awareness application would consist of a plethora of wired
and wireless infrastructure components: simple and smart
cameras, wireless access points, wireless routers, gateways,
and Internet connected backend servers. The cameras will,
of course, be distributed spatially in a given region along
with wireless routers and gateways. The role of the wire-
less routers is to stream the camera images to backend
servers in the Internet (e.g., cloud computing resources)
using one or more gateways. The gateways connect the
wireless infrastructure with the wired infrastructure and
are connected to the routers using long-range links referred
to as backhaul links. Similarly, the links between the wire-
less cameras and the wireless routers are short-range links
and are referred to as access links. Additionally, wireless
access points may be available to directly connect the
cameras to the wired infrastructure. A typical deployment
may in fact combine wireless access points and gateways
together, or access points may be connected to gateways via gigabit Ethernet.
1) Short-Range Technologies: IEEE 802.11n [39] is a very
high throughput standard for wireless local area networks
(WLANs). The 802.11n standard has evolved considerably
from its predecessors: 802.11b and 802.11a/g. The 802.11n
standard includes unique capabilities such as the use of
multiple antennas at the transmitter and the receiver to realize high-throughput links along with frame aggregation
and channel bonding. These features enable a maximum
physical layer data rate of up to 600 Mb/s. The 802.11
standards provide an indoor communication range of less
than 100 m and hence are good candidates for short-range
links.
IEEE 802.15.4 (Zigbee) [40] is another standard for
small low-power radios intended for networking low-bit-rate sensors. The protocol specifies a maximum physical
layer data rate of 250 kb/s and a transmission range be-
tween 10 and 75 m. Zigbee uses multihop routing built
upon the ad hoc on demand distance vector (AODV; [41])
routing protocol. In the context of situation-awareness
applications, Zigbee would be useful for networking other
sensing modalities in support of the cameras (e.g., radio-
frequency identification (RFID), temperature, and humidity sensors).
The key issue in the use of the above technologies in a
camera sensor network is the performance versus energy
tradeoff. The IEEE 802.11n provides much higher data
rates and wider coverage but is less energy efficient when
compared to Zigbee. Depending on the power constraints
and data rate requirements in a given deployment, either
of these technologies would be more appropriate than the other.
2) Long-Range Technologies: The two main candidate
technologies for long-range links (for connecting routers
to gateways) are Long-Term Evolution (LTE) [42] and IEEE 802.16 Worldwide Interoperability for Microwave Access (WiMax) [43]. The LTE specification provides an uplink
data rate of 50 Mb/s and communication ranges from 1 to 100 km. WiMax provides high data rates of up to 128 Mb/s
uplink and a maximum range of 50 km. Thus, both these
technologies are well suited as backhaul links for camera
sensor networks. There is an interesting rate-range trade-
off between access links and backhaul links. To support the
high data rates (but short range) of access links, it is quite
common to bundle multiple backhaul links together.
While both the above technologies allow long-range communication, the use of one technology in a given environ-
ment would depend on the spectrum that is available in a
given deployment (licensed versus unlicensed) and the
existence of prior cellular core networks in the deploy-
ment area.
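The nominal rates quoted above make the access/backhaul provisioning arithmetic concrete. The 4-Mb/s per-camera figure below is our assumption for a compressed HD stream, and the link rates are best-case physical-layer numbers (600 Mb/s for 802.11n, 128 Mb/s for the WiMax uplink), so a real deployment would provision far more conservatively.

```python
camera_mbps = 4        # assumed compressed stream rate per camera
wifi_mbps = 600        # 802.11n maximum physical-layer rate (short range)
wimax_mbps = 128       # WiMax maximum uplink rate (long range)

cams_per_access_link = wifi_mbps // camera_mbps
cams_per_backhaul_link = wimax_mbps // camera_mbps
# Backhaul links needed to drain one fully loaded 802.11n access link
# (ceiling division), illustrating why backhaul links get bundled.
backhauls_per_access = -(-wifi_mbps // wimax_mbps)

print(cams_per_access_link, cams_per_backhaul_link, backhauls_per_access)
```

The asymmetry in the last figure is consistent with the practice, noted above, of bundling multiple backhaul links behind a single high-rate access segment.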
3) Higher Layer Protocols: In addition to the link layer
technologies that comprise a camera sensor network, higher layer protocols for routing are also essential for
successful operation of camera networks.
a) Surge mesh routing is a popular routing protocol
used in several commercial Zigbee devices such
as the Crossbow Micaz motes [44]. This provides
automatic rerouting when a camera sensor link
fails and constructs a topology dynamically by
keeping track of link conditions.
b) RPL [45] is an IPv6 routing protocol for com-
municating multipoint-to-point traffic from low-
power devices toward a central control point, as
well as point-to-multipoint traffic from the cen-
tral control point to the devices. RPL allows dy-
namic construction of routing trees with the root
at the gateways in a camera sensor network.
c) Mobile IP is a protocol that is designed to allow mobile device users to move from one network to
another while maintaining a permanent IP ad-
dress [46]. This is especially useful in applica-
tions with mobile cameras (mounted on vehicles
or robots). In the context of large-scale camera
sensor networks, high throughput and scalability
are essential.
While Mobile IP and surge mesh routing are easier to deploy, they are usable only for small-size networks. For
large-scale networks, RPL is more suited but is more
complex to operate as well.
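The RPL idea of a routing tree rooted at the gateway can be sketched as a breadth-first construction of parent pointers, giving each node an upward route for multipoint-to-point traffic. This toy model ignores RPL's objective functions, link metrics, and control messages, and the topology below is invented for illustration.

```python
from collections import deque

def build_routing_tree(links, root):
    """BFS sketch of RPL-style upward-route construction: each node learns
    a parent on a shortest path toward the gateway (the tree root)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neigh in links.get(node, []):
            if neigh not in parent:
                parent[neigh] = node
                queue.append(neigh)
    return parent

# toy topology: gateway G, two routers, and camera nodes C1..C3
links = {"G": ["R1", "R2"], "R1": ["G", "C1", "C2"], "R2": ["G", "C3"],
         "C1": ["R1"], "C2": ["R1"], "C3": ["R2"]}
parents = build_routing_tree(links, "G")
```

Each camera node forwards upward by following its parent pointer until the gateway is reached; downward (point-to-multipoint) routes would invert these pointers.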
D. Context-Aware Frameworks
Situation-awareness applications are context sensitive,
i.e., they adapt their behavior depending on the state of
their physical environment or information derived from the environment. Further, the context associated with ap-
plications is increasing in both volume and geographic
extent over time. The notion of context is application de-
pendent. For example, IBM’s Smarter Planet [47] vision
includes the global management of the scarce resources of
our earth like energy and water, traffic, transportation, and
manufacturing.
Specifically, with respect to video-based surveillance, the context information is the metadata associated with a
specific object that is being tracked. An infrastructure for
context management should integrate both the collected
sensor data and geographic data, including maps, floor
plans, 3-D building models, and utility network plans.
Such context information can be used by the automated
surveillance system to preferentially allocate computing,
networking, and cognitive resources. In other words, context awareness enables selective attention, leading to better
situation awareness.
An appropriate framework for managing and sharing
context information for situation-awareness applications is
the sensor web concept, exemplified by Microsoft Sense-
Web [48], and Nokia Sensor Planet [49]. Also various
middleware systems for monitoring and processing sensor
data have been proposed (e.g., [50]–[52]). These infrastructures provide facilities for sensor discovery (e.g., by
type or location), retrieval and processing of sensor data,
and more sophisticated operations such as filtering and/or
event recognition.
Federated Context Management goes one step further
than sensor webs. It integrates not only sensor data but
also geographic data from various sources. The Nexus
framework [53], [54] federates the context information of the different providers and offers context-aware applica-
tions a global and consistent view of the federated context.
This global context not only includes observable context
information (such as sensor values, road maps, and 3-D
models), but also higher level situation information
inferred from other contexts.
E. Programing Models
By far the biggest challenge in building large-scale
situation-awareness applications is the programing com-
plexity associated with large-scale distributed systems,
both in terms of ease of development for the domain expert
and efficiency of the execution environment to ensure
good performance. We will explore two different approaches that address the programing challenge.
1) Thread-Based Programing Model: The lowest level
approach to building surveillance systems is to have the
application developer handle all aspects of the system,
including traditional systems aspects, such as resource
management, and more application-specific aspects, such
as mapping targets to cameras. In this model, a developer
wishing to exploit the inherent application-level concur-
rency (see Section I) has to manage the concurrently executing threads over a large number of computing nodes.
This approach gives maximum flexibility to the developer
for optimizing the computational resources since he/she
has complete control of the system resources and the
application logic.
However, effectively managing the computational re-
sources for multiple targets and cameras is a daunting
responsibility for the domain expert. For example, the shared data structures between detectors and trackers en-
suring target uniqueness should be carefully synchronized
to achieve the most efficient parallel implementation.
Multiple trackers operating on different video streams may
also need to share data structures when they are monitor-
ing the same target. These complex patterns of data
communication and synchronization place an unnecessary
burden on an application developer, which is exacerbated by the need to scale the system to hundreds or even thou-
sands of cameras and targets in a large-scale deployment
(airports, cities).
2) Stream-Oriented Programing Model: Another approach is to use a stream-oriented programing model [55]–[58] as
a high-level abstraction for developing surveillance appli-
cations. In this model, a programer does not need to deal with low-level system issues such as communication and
synchronization. Rather, she can focus on writing an
application as a stream graph consisting of computation
vertices and communication edges. Once a programer
provides necessary information including a stream graph,
the underlying stream processing system manages the
computational resources to execute the stream graph over
multiple nodes. Various optimizations are applied at the system level, shielding the programers from having to con-
sider performance issues.
Fig. 3 illustrates a stream graph for a target tracking
application using IBM System S [55], one of the represen-
tative off-the-shelf stream processing engines. A detector
processes each frame from a camera, and produces a digest
containing newly detected blobs, the original camera
frame, and a foreground mask. A second stream stage,
trackerlist, maintains a list of trackers following different targets within a camera stream. It internally creates a new
tracker for newly detected blobs by the detector. Each
tracker in the trackerlist uses the original camera frame
and a foreground mask to update each target’s blob posi-
tion. The updated blob position will be sent back to the
detector (associated with this camera stream), to prevent
redundant detection of the same target.
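The digest flow between the detector and trackerlist stages can be mimicked in a few lines of plain Python; this is a behavioral sketch, not System S code, and representing blobs as simple string labels is an assumption.

```python
class Detector:
    """Per-camera stage: finds blobs not yet known and emits a digest.
    The feedback of updated blob positions from trackers is modeled by
    the 'known' set, which prevents redundant detection of a target."""
    def __init__(self):
        self.known = set()

    def process(self, frame_blobs):
        new = [b for b in frame_blobs if b not in self.known]
        self.known.update(new)
        return {"new_blobs": new, "frame": frame_blobs}

class TrackerList:
    """Single stage multiplexing one tracker per target: the workaround
    for the lack of dynamically created stages in System S."""
    def __init__(self):
        self.trackers = {}

    def process(self, digest):
        for blob in digest["new_blobs"]:
            self.trackers[blob] = {"positions": []}
        for target, state in self.trackers.items():
            state["positions"].append(target)  # stand-in for a position update
        return set(self.trackers)

det, tl = Detector(), TrackerList()
tl.process(det.process(["person1"]))
tracked = tl.process(det.process(["person1", "person2"]))
```

Note how the second frame creates a tracker only for the newly seen target; all trackers then run sequentially inside the one trackerlist stage, which is precisely the load-imbalance limitation discussed next.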
There are several limitations to using this approach for a large-scale surveillance application. First, a complete
stream graph should be provided by a programer. In a
large-scale setting, specifying such a stream graph with a
huge number of stream stages and connections among
them (taking into account camera proximities) is a very
tedious task. Second, it cannot exploit the inherent paral-
lelism of target tracking. Dynamically creating a new
stream stage is not supported by System S; therefore, a single stream stage, namely, trackerlist, has to execute
multiple trackers internally. This limitation creates a
significant load imbalance among the stream stages of
different trackerlists, as well as low target tracking per-
formance due to the sequential execution of the trackers by
a given stream stage. Last, stream stages can only com-
municate through statically defined stream channels
(internal to IBM System S), which prohibits arbitrary real-time data sharing among different computation mod-
ules. As shown in Fig. 3, a programer has to explicitly
connect stream stages using stream channels and deal with
the ensuing communication latency under conditions of
infrastructure overload.
IV. SYSTEM BUILDING EXPERIENCES
In this section, we report on our experiences in building a
system infrastructure for large-scale situation-awareness
applications. Specifically, we describe 1) a novel programing model called target container (TC) (Section IV-A) that addresses some of the limitations of the existing programing models described in Section III-E; and 2) a peer-to-
peer distributed software architecture called ASAP1 that
addresses many of the scalability challenges for scaling up
situation-awareness applications that were identified in
Section III-A–D. ASAP uses the principles in the wireless
model for deployment of the physical infrastructure (see Section III-C); allows for edge processing (where possible
with smart cameras) to conserve computing and network-
ing resources; incorporates multimodal sensing to reduce
the ill-effects of false positives and false negatives; and
exploits context awareness to prioritize camera streams of
interest.
A. TC Programing Model
The TC programing model is designed for domain
experts to rapidly develop large-scale surveillance
Fig. 3. Target tracking based on stream-oriented models.
1 ASAP stands for "priority-aware situation awareness" read backwards.
applications. The key insight is to elevate the target to a first-class entity, both from the perspective of the programer and
from the perspective of resource allocation by the execu-
tion environment. Consequently, all application level
vision tasks become target centric, which is more natural from the point of view of the domain expert. The runtime
system is also able to optimize the resource allocation
commensurate with the application’s perception of
importance (expressed as target priority; see Table 1)
and equitably across all equal priority targets. In principle,
the TC model generalizes to dealing with heterogeneous
sensors (cameras, RFID readers, microphones, etc.). How-
ever, for the sake of clarity of the exposition, we adhere to cameras as the only sensors in this section.
The TC programing model shares with large-scale stream processing engines [55], [56] the concept of providing a
high-level abstraction for large-scale stream analytics.
However, TC is specifically designed for real-time surveil-
lance applications with special support based on the notion of a target.
1) TC Handlers and API: The intuition behind the TC
programing model is quite simple and straightforward.
Fig. 4 shows the conceptual picture of how a surveillance
application will be structured using the new programing
model, and Table 1 summarizes APIs provided by the TC
system. The application is written as a collection of hand-lers. There is a detector handler associated with each
Table 1 Target Container API
Fig. 4. Surveillance application using TC model.
camera stream. The application is written as a collection of handlers. There is a detector handler associated with each camera stream. The role of the detector handler is to analyze each camera image it receives to detect any new target
that is not already known to the surveillance system. The
detector creates a target container for each new target it
identifies in a camera frame by calling TC_create_target with initial tracker and TC data.
In the simple case, where a target is observed in only
one camera, the target container contains a single tracker handler, which receives images from the camera and updates the target information on every frame arrival.
However, due to overlapping FOVs, a target may appear
in multiple cameras. Thus, in the general case, a target
container may contain multiple trackers following a
target observed by multiple cameras. A tracker can call
TC_stop_track to notify the TC system that this tracker
need not be scheduled anymore; it would do that upon
realizing that the target it is tracking is leaving the camera's FOV.
In addition to the detectors (one for each sensor
stream) and the trackers (one per target per sensor stream
associated with this target), the application must provide
additional handlers to the TC system for the purposes of
merging TCs as explained below. Upon detecting a new
target in its FOV, a detector would create a new target
container. However, it is possible that this is not a new target but simply an already identified target that hap-
pened to move into the FOV of this camera. To address
this situation, the application would also provide a
handler for equality checking of two targets. Upon estab-
lishing the equality of two targets, the associated contain-
ers will be merged to encompass the two trackers (see
target container in Fig. 4). The application would provide
a merger handler to accomplish this merging of two targets by combining two application-specific target data struc-
tures (TC data) into one. Incidentally, the application
may also choose to merge two distinct targets into a single
one (for example, consider a potential threat situation
when two cohorts join together and walk in unison in an
airport).
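A minimal mock of the detector-side flow might look as follows. Only TC_create_target and TC_stop_track are named in the text (see Table 1); the runtime class, handler signature, and data layout here are assumptions made for illustration.

```python
class TCRuntime:
    """Mock runtime exposing the two TC calls named in the text."""
    def __init__(self):
        self.containers = {}   # target id -> {"trackers": [...], "data": {...}}
        self.next_id = 0

    def TC_create_target(self, tracker, tc_data):
        tid = self.next_id
        self.next_id += 1
        self.containers[tid] = {"trackers": [tracker], "data": tc_data}
        return tid

    def TC_stop_track(self, tid, tracker):
        self.containers[tid]["trackers"].remove(tracker)

def detector_handler(runtime, frame_blobs, seen):
    """Creates one target container per newly detected blob."""
    for blob in frame_blobs:
        if blob not in seen:
            seen[blob] = runtime.TC_create_target(
                tracker=blob + "-tracker", tc_data={"blob": blob})

rt, seen = TCRuntime(), {}
detector_handler(rt, ["person1", "person2"], seen)
detector_handler(rt, ["person1"], seen)               # known target: no new TC
rt.TC_stop_track(seen["person2"], "person2-tracker")  # target left the FOV
```

The equality-checking and merger handlers would operate on the `containers` map, collapsing two entries into one when their TC data are judged to describe the same target.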
As shown in Fig. 4, there are three categories of data
with different sharing properties and life cycles. Detector data are the result of processing per-stream input, which is
associated with a detector. The data can be used to main-
tain detector context such as detection history and average
motion level in the camera’s FOV, which are potentially
useful for surveillance applications using per-camera infor-
mation. The detector data are potentially shared by the
detector and the trackers spawned thereof. The trackers
spawned by the detector as a result of blob detection may need to inspect this detector data. The tracker data main-
tain the tracking context for each tracker. The detector
may inspect these data to ensure target uniqueness. TC
data represent a target. It is the composite of the tracking
results of all the trackers within a single TC. The equality
checking handler inspects TC data to see if two TCs are
following the same target.
While all three categories of data are shared, the locality and degree of sharing for these three categories can be
vastly different. For example, the tracker data are unique
to a specific tracker and at most shared with the detector
that spawned the data. On the other hand, the TC data may
be shared by multiple trackers potentially spanning mul-
tiple computational nodes if an object is in the FOV of
several cameras. The detector data are also shared among
all the trackers that are working off a specific stream and the detector associated with that stream. This is the reason
our API (see Table 1) includes six different access calls for
these three categories of shared data.
When programing a target tracking application, the
developer has to be aware of the fact that the handlers
may be executed concurrently. Therefore, the handlers
should be written as sequential codes with no side effects
to shared data structures to avoid explicit application-level synchronization. The TC programing model allows an
application developer to use optimized handlers written in
low-level programing languages such as C and C++. To
shield the domain expert from having to deal with
concurrency bugs, data sharing between different hand-
lers is only allowed through TC API calls (shown in
Table 1), which subsume data access with synchronization
guarantees.
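One plausible way the TC API can subsume synchronization is to guard each category of shared data with a lock inside the access calls, so handler code remains straight-line sequential. The class below is a sketch of that idea under these assumptions, not the actual TC implementation.

```python
import threading

class SharedData:
    """One category of shared state (detector, tracker, or TC data):
    every access goes through a lock held inside the API call, so the
    handlers themselves contain no explicit synchronization."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value

    def read(self):
        with self._lock:
            return dict(self._value)   # return a copy to avoid aliasing

    def update(self, key, value):
        with self._lock:
            self._value[key] = value

# a detector and its trackers would share this object via the API only
detector_data = SharedData({"detections": 0})
detector_data.update("detections", detector_data.read()["detections"] + 1)
```

Returning a copy on `read` mirrors the "no side effects to shared data structures" rule: a handler never mutates shared state except through a guarded call.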
2) TC Merge Model: To seamlessly merge two TCs into
one while tracking the targets in real time, the TC system
periodically calls the equality checker on candidates for the merge
operation. After merge, one of the two TCs is eliminated,
while the other TC becomes the union of the two pre-
vious TCs.
Execution of the equality checker on different pairs of TCs can be done in parallel since it does not update any TC
data. Similarly, merger operations can go on in parallel so
long as the TCs involved in the parallel merges are all
distinct.
The TC system may use camera topology information for
efficient merge operations. For example, if many targets
are being tracked in a large-scale camera network, only
those targets in nearby cameras should be compared and merged to reduce the performance overhead of the real-time surveillance application.
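Topology-based pruning of merge candidates can be sketched as follows: only TC pairs whose cameras are identical or adjacent are handed to the (expensive) equality checker. The adjacency map and container-to-camera assignment below are invented for illustration.

```python
def merge_candidates(tc_cameras, neighbors):
    """Return only pairs of target containers whose cameras are the same
    or adjacent in the topology; distant pairs are never compared."""
    tids = sorted(tc_cameras)
    pairs = []
    for i, a in enumerate(tids):
        for b in tids[i + 1:]:
            ca, cb = tc_cameras[a], tc_cameras[b]
            if ca == cb or cb in neighbors.get(ca, set()):
                pairs.append((a, b))
    return pairs

# cam1 -- cam2 -- cam3 in a line; three containers, one per camera
neighbors = {"cam1": {"cam2"}, "cam2": {"cam1", "cam3"}, "cam3": {"cam2"}}
tc_cameras = {1: "cam1", 2: "cam2", 3: "cam3"}
pairs = merge_candidates(tc_cameras, neighbors)
```

Here containers 1 and 3 are never compared because their cameras are not adjacent, so the equality checker runs on two pairs instead of three; in a large deployment the savings grow quadratically.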
B. ASAP
Situation-awareness applications (such as video-based
surveillance) are capable of stressing the available compu-
tation and communication infrastructures to their limits.
Hence, the underlying system infrastructure should be: 1) highly scalable (i.e., designed to reduce infrastructure
overload, cognitive overload, and false positives and false
negatives); 2) flexible to use (i.e., provide query- and
policy-based user interfaces to exploit context-sensitive
information for selective attention); and 3) easily exten-
sible (i.e., accommodate heterogeneous sensing and allow
for incorporation of new sensing modalities).
We have designed and implemented a distributed
software architecture called ASAP for situation-awareness
applications. Fig. 5 shows the logical organization of the
ASAP architecture into control and data network. The
control network deals with low-level sensor-specific pro-
cessing to derive priority cues. These cues in turn are used
by the data network to prioritize the streams and carry out
further processing such as filtering and fusion of streams. It should be emphasized that this logical separation is
simply a convenient vehicle to partition the functionalities
of the ASAP architecture. The two networks are in fact
overlaid on the same physical network and share the
computational and sensing resources. For example, low
bitrate sensing, such as an RFID tag or a fire alarm, is part
of the control network. However, a high bitrate camera
sensor while serving the video stream for the data network may also be used by the control network for discerning
motion.
Fig. 6 shows the software architecture of ASAP: it is a
peer-to-peer network of ASAP agents (AAs) that execute on
independent nodes of the distributed system. The software
organization in each node consists of two parts: ASAP agent
(AA) and sensor agent (SA). There is one SA per sensor, and
a collection of SAs are assigned dynamically to an AA.
1) Sensor Agent: The SA provides a virtual sensor abstraction that offers a uniform interface for incorporating hete-
rogeneous sensing devices as well as to support multimodal
sensing in an extensible manner. This abstraction allows
new sensor types to be added without requiring any change of the AA. There is a potential danger in such a virtual-
ization that some specific capability of a sensor may get
masked from full utilization. To avoid such semantic loss,
we have designed a minimal interface that serves the needs
of situation-awareness applications.
The virtual sensor abstraction allows the same physical
sensor to be used for providing multiple sensing services.
For example, a camera can serve not only as a video data stream, but also as a motion or a face detection sensor.
Similarly, an SA may even combine multiple physical
sensors to provide a multimodal sensing capability. Once
these different sensing modalities are registered with AAs,
they are displayed as a list of available features that users
can select to construct a query for the ASAP platform.
Fig. 6. ASAP software architecture.
Fig. 5. Functional view of ASAP.
The ASAP platform uses these features as control cues for prioritization (to be discussed shortly).
2) ASAP Agent: As shown in Fig. 6, an AA is associated
with a set of SAs. The association is dynamic, and is engi-
neered at runtime in a peer-to-peer fashion among the
AAs. The components of AA are shown in Fig. 6. AA
provides a simple query interface with SQL-like2 syntax.
Clients can pose an SQL query using control cues as attributes. Different cues can be combined using "and" and "or" operators to create multimodal sensing queries.
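The effect of combining cues with "and"/"or" can be illustrated with a tiny recursive evaluator over boolean cue readings; the tuple-based expression encoding and the cue names are assumptions for illustration, not the actual ASAP query syntax.

```python
def eval_query(cues, expr):
    """Evaluate a nested and/or combination of named cues against the
    latest boolean sensor readings."""
    op, terms = expr
    values = [cues[t] if isinstance(t, str) else eval_query(cues, t)
              for t in terms]
    return all(values) if op == "and" else any(values)

# "motion AND (rfid_tag_42 OR face)" over one snapshot of cue readings
cues = {"motion": True, "rfid_tag_42": False, "face": True}
query = ("and", ["motion", ("or", ["rfid_tag_42", "face"])])
hit = eval_query(cues, query)
```

A hit on such a query is what determines which camera feeds the data network must disseminate to the posing client.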
False positives and negatives: Fig. 5 shows that sensed data lead to events, which, when filtered and fused, ulti-
mately lead to actionable knowledge. Unfortunately,
individual sensors may often be unreliable due to envi-
ronmental conditions (e.g., poor lighting conditions near a
camera). Thus, it may not always be possible to have high confidence in the sensed data; consequently, there is a
danger that the system may experience high levels of false
negatives and false positives. It is generally recognized
that multimodal sensors would help reduce the ill effects
of false positives and negatives. The virtual sensor
abstraction of ASAP allows multiple sensors to be fused
together and registered as a new sensor. Unlike multi-
feature fusion (a la face recognizer), where features are derived from the same (possibly noisy) image, multisensor
fusion uses different sensing modalities. ASAP exploits a
quorum system to make a decision. Even though a majority
vote is implemented at the present time, AA may assign
different weights to the different sensors commensurate
with the error rates of the sensors to make the voting more
accurate.
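The quorum decision can be sketched as a weighted vote over boolean detections from different modalities: with equal weights it reduces to the majority vote currently implemented, while unequal weights model sensors with different error rates. The sensor names and weight values are illustrative assumptions.

```python
def weighted_vote(readings, weights):
    """Fuse boolean detections from several modalities: fire if the
    weighted mass of agreeing sensors exceeds half the total weight."""
    agree = sum(weights[s] for s, fired in readings.items() if fired)
    total = sum(weights[s] for s in readings)
    return agree > total / 2.0

readings = {"camera_motion": True, "rfid": True, "microphone": False}
equal_weights = {s: 1.0 for s in readings}
decision = weighted_vote(readings, equal_weights)  # plain majority vote
```

With equal weights, two agreeing modalities out of three carry the vote; down-weighting an unreliable sensor (say, a camera in poor lighting) can flip the same readings to a non-detection.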
Prioritization strategies: ASAP needs to continuously extract prioritization cues from all the cameras and other
sensors (control network), and disseminate the selected
camera streams (data network) to interested clients
(which could be detectors/trackers of the TC system
from Section IV-A and/or an end user such as security
personnel). ASAP extracts information from a sensor
stream by invoking the corresponding SA. Since there may
be many SAs registered at any time, invoking all SAs may be very compute intensive. ASAP needs to prioritize the
invocations of SAs to scale well with the number of sen-
sors. This leads to the need for priority-aware computation in the control network. Once a set of SAs that are relevant
to client queries is identified, the corresponding camera
feeds need to be disseminated to the clients. If the band-
width required to disseminate all streams exceeds the
available bandwidth near the clients, the network will end up dropping packets. This leads to the need for priority-aware communication in the data network. Based on these needs,
the prioritization strategies employed by ASAP can be
grouped into the following categories: priority-aware computation and priority-aware communication.
Priority-aware computation: The challenge is dynami-
cally determining a set of SAs among all available SAs that
need to be invoked such that the overall value of the derived
actionable knowledge (benefit for the application) is maxi-
mized. We use the term measure of effectiveness (MOE) to
denote this overall benefit. ASAP currently uses a simple
MOE based on clients' priorities.
The priority of an SA should reflect the amount of
possibly "new" information the SA output may have and
its importance to the query in progress. Therefore, the
priority value is dynamic, and it depends on multiple fac-
tors, including the application requirements, and the in-
formation already available from other SAs. In its simplest
form, priority assignment can be derived from the priority
of the queries themselves. For instance, given two queries from an application, if the first query is more important
than the second one, the SAs relevant to the first query
will have higher priority compared to the SAs corre-
sponding to the second query. More importantly, com-
putations do not need to be initiated at all SAs since
1) such information extracted from sensed data may not
be required by any AA; and 2) unnecessary computation
can degrade overall system performance. The WHERE clause in the SQL-like query is used to activate a specific
sensing task. If multiple WHERE conditions exist, the
least computation-intensive task is initiated first, which in turn activates the next task. While this heuristic involves a tradeoff between latency and overhead, ASAP uses it for the sake of scalability.
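The staged activation heuristic can be sketched as ordering WHERE conditions by an estimated compute cost and letting each firing task gate the next; the cost table and cue names below are assumptions for illustration.

```python
ESTIMATED_COST = {"rfid": 1, "motion": 5, "face": 50}  # assumed relative costs

def activation_order(conditions):
    """Order WHERE conditions so the cheapest sensing task runs first."""
    return sorted(conditions, key=lambda c: ESTIMATED_COST[c])

def run_pipeline(conditions, fires):
    """Invoke tasks cheapest-first; each firing task activates the next."""
    invoked = []
    for cond in activation_order(conditions):
        invoked.append(cond)
        if not fires[cond]:
            break        # the chain stops; costlier tasks never run
    return invoked

# face detection is attempted only after the cheap motion cue fires
invoked = run_pipeline(["face", "motion"], {"motion": True, "face": True})
```

When the motion cue does not fire, the expensive face-detection task is skipped entirely, which is where the scalability benefit comes from, at the price of the added latency of running the stages in sequence.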
Priority-aware communication: The challenge is designing prioritization techniques for communication on the data network such that the application-specific MOE can be
maximized. Questions to be explored here include: How
do we assign priorities to different data streams and how
do we adjust their spatial or temporal fidelities that maxi-
mize the MOE?
In general, the control network packets are given higher
priority than data network packets. Since the control net-
work packets are typically much smaller than the data network packets, supporting a cluster of SAs with each AA
does not overload the communication infrastructure.
C. Summary of Results
We have built a testbed with network cameras and
RFID readers for object tracking based on RFID tags and
motion detection. This testbed allows us to both under-
stand the programmability of large-scale camera networks using the TC model, as well as understand the scalability of
the ASAP architecture. Specifically, in implementing
ASAP, we had three important goals: 1) platform neutrality
for the "box" that hosts the AA and SA; 2) ability to support a variety of sensors seamlessly (e.g., network
cameras as well as USB cameras); and 3) extensibility to
support a wide range of handheld devices. We augmented
2 SQL is derived from the original acronym SEQUEL, which stands for Structured English QUEry Language, for relational database systems.
our real testbed consisting of tens of cameras, RFID readers, and microphones with emulated sensors. The
emulated sensors use the uniform virtual sensor interface
discussed in Section IV-B. Due to the virtual sensor
abstraction, an AA does not distinguish whether data comes
from an emulated sensor or a real sensor. The emulated
camera sends JPEG images at a rate requested by a client.
The emulated RFID reader sends tag detection events based on an event file, where different event files mimic different object movement scenarios.
By using real devices (cameras, RFID readers, and micro-
phones) and emulated sensors, we were able to conduct
experiments to verify that our proposed software architecture scales to a large number of cameras. The
workload used is as follows. An area is assumed to be made of a set of cells, organized as a grid. Objects start from a randomly selected cell, wait for a predefined time, and move to a neighbor cell. The number of objects, the grid size, and the object wait time are workload parameters. We used end-to-end
latency (from sensing to actuation), network bandwidth usage,
and CPU utilization as figures of merit as we scale up the
system size (i.e., the number of cameras from 20 to 980) and
the number of queries (i.e., interesting events to be observed).
The scalability is attested by two facts: 1) the end-to-end
latency remains the same as we increase the number of queries (for a system with 980 camera streams); and 2) the CPU load
and the network bandwidth requirements grow linearly with
the number of interesting events to be observed (i.e., number
of queries) and not proportional to the size of the system
(i.e., the number of camera sensors in the deployment).3
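The grid workload described above can be emulated in a few lines; the cell-neighborhood rule (4-connected moves, staying put at the grid boundary) and the treatment of wait time are assumptions where the text leaves details open.

```python
import random

def simulate_objects(grid, num_objects, steps, wait_time, seed=0):
    """Objects start in random cells and, after waiting, move to a
    neighbor cell; moves that would leave the grid keep the object put."""
    rng = random.Random(seed)
    rows, cols = grid
    cells = [(rng.randrange(rows), rng.randrange(cols))
             for _ in range(num_objects)]
    trace = [list(cells)]
    for step in range(steps):
        if step % wait_time == wait_time - 1:    # move only after waiting
            moved = []
            for r, c in cells:
                r2, c2 = rng.choice([(r - 1, c), (r + 1, c),
                                     (r, c - 1), (r, c + 1)])
                if 0 <= r2 < rows and 0 <= c2 < cols:
                    r, c = r2, c2
                moved.append((r, c))
            cells = moved
        trace.append(list(cells))
    return trace

trace = simulate_objects((10, 10), num_objects=3, steps=6, wait_time=3)
```

Each snapshot in the trace stands in for the tag-detection and motion events that the emulated sensors would feed to the AAs in the experiments.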
V. CASE STUDY: IBM S3
The IBM Smart Surveillance project [16], [61] is one of the
few research projects in smart surveillance systems that
turned into a product, which has been recently used to augment Chicago's video surveillance network [62]. Quite
a bit of fundamental research in computer vision
technologies forms the cornerstone for IBM’s smart
surveillance solution. Indeed, IBM S3 transformed video-
based surveillance systems from a pure data acquisition
endeavor (i.e., recording the video streams on DVRs for
postmortem analysis) to an intelligent real-time online
video analysis engine that converts raw data into actionable knowledge. The IBM S3 product includes several
novel technologies [63] including multiscale video acquisi-
tion and analysis, salient motion detection, 2-D multiobject
tracking, 3-D stereo object tracking, video-tracking-based
object classification, object structure analysis, face categori-
zation following face detection to prevent "tailgating," etc. Backed by the IBM DB2 product, IBM S3 is a powerful
engine for online querying of live and historical data in the hands of security personnel.
Our work, focusing on the programing model for large-scale situation awareness and a scalable peer-to-peer system
architecture for multimodal sensing, is complementary to
the state of the art established by the IBM S3 research.
VI. CONCLUDING REMARKS
Large-scale situation-awareness applications will continue
to grow in scale and importance as our penchant for instrumenting the world with sensors of various modalities
and capabilities continues. Using video-based surveillance
as a concrete example, we have reviewed the enabling
technologies spanning wireless networking, smart cam-
eras, computer vision, context-aware frameworks, and
programing models. We have reported our own experi-
ences in building scalable programing models and software
infrastructure for situation awareness.
Any interesting research answers a few questions and
raises several more. This work is no different. One of the hairiest problems with physical deployment is the heteroge-
neity and lack of standards for smart cameras. A typical
large-scale deployment will include smart cameras of dif-
ferent models and capabilities. Vendors typically provide
their own proprietary software for analyzing camera feeds
and controlling the cameras from a dashboard. Interoperability of camera systems from different vendors is difficult
if not impossible.
From the perspective of computer vision, one of the
major challenges is increasing the accuracy of detection
and/or scene analysis in the presence of ambient noise,
occlusion, and rapid movement of objects. Multiple views
of the same object help in improving the accuracy of de-
tection and analysis; with the ubiquity of cameras in recent years, it is now feasible to deploy several tens if not hun-
dreds of cameras in relatively small spaces (e.g., one gate
of an airport); but the challenge of using these multiple
views to develop accurate and scalable object detection
algorithms still remains an open problem.
From a systems perspective, there is a considerable amount of work to be done in aiding the domain expert. There needs
to be closer synergy between vision researchers and systemsresearchers to develop the right abstractions for programing
large-scale camera networks, facilitating seamless handoff
from one camera to another as objects move, state and
computation migration between smart cameras and backend
servers, elastically increasing the computational resources to
deal with dynamic application needs, etc.
Last but not least, one of the thorniest problems plaguing the Internet today is bound to hit sensor-based distributed computing in the near future, namely,
spam. We have intentionally avoided discussing tamper-
proofing techniques such as steganography in camera sys-
tems; but as we explore mission-critical applications (such
as surveillance, urban terrorism, emergency response, and
healthcare) ensuring the veracity of sensor sources will
become increasingly important.
3 Details of the TC programing system can be found in [59], and detailed results of the ASAP system evaluation can be found in [60].
REFERENCES
[1] M. McCahill and C. Norris, "Estimating the extent, sophistication and legality of CCTV in London," in CCTV, M. Gill, Ed. London, U.K.: Palgrave Macmillan, 2003.
[2] R. Hunter, Chicago's Surveillance Plan is an Ambitious Experiment, Gartner Research, 2004. [Online]. Available: http://www.gartner.com/DisplayDocument?doc_cd=123919.
[3] C. Norris, M. McCahill, and D. Wood, "The growth of CCTV: A global perspective on the international diffusion of video surveillance in publicly accessible space," Surveill. Soc., vol. 2, no. 2/3, pp. 110–135, 2004.
[4] W. E. L. Grimson and C. Stauffer, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1999, DOI: 10.1109/CVPR.1999.784637.
[5] A. Elgammal, D. Harwood, and L. S. Davis, "Nonparametric background model for background subtraction," in Proc. 6th Eur. Conf. Comput. Vis., 2000, pp. 751–767.
[6] D. M. Gavrila, "Pedestrian detection from a moving vehicle," in Proc. 6th Eur. Conf. Comput. Vis. II, 2000, pp. 37–49.
[7] P. A. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.
[8] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Who? When? Where? What? A real time system for detecting and tracking people," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 1998, pp. 222–227.
[9] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 780–785, Jul. 1997.
[10] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
[11] J. Little and J. Boyd, "Recognizing people by their gait: The shape of motion," Videre, vol. 1, no. 2, pp. 1–32, 1998.
[12] A. Bissacco, A. Chiuso, Y. Ma, and S. Soatto, "Recognition of human gaits," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2001, vol. 2, pp. 52–58.
[13] D. M. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[14] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001.
[15] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231–268, 2001.
[16] IBM Smart Surveillance System (S3). [Online]. Available: http://www.research.ibm.com/peoplevision/
[17] Video Surveillance Integrated Surveillance Systems. [Online]. Available: https://www.buildingtechnologies.siemens.com
[18] Products That Make Surveillance Smart. [Online]. Available: http://www.objectvideo.com/products/
[19] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[20] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proc. Int. Conf. Comput. Vis., 2009, pp. 32–39.
[21] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[22] B. Babenko, M. Yang, and S. J. Belongie, "Visual tracking with online multiple instance learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 983–990.
[23] D. A. Ross, J. Lim, R. Lin, and M. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, pp. 125–141, 2008.
[24] J. Kwon and K. M. Lee, "Visual tracking decomposition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1269–1276.
[25] A. Basharat, A. Gritai, and M. Shah, "Learning object motion patterns for anomaly detection and improved object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, DOI: 10.1109/CVPR.2008.4587510.
[26] I. Saleemi, K. Shafique, and M. Shah, "Probabilistic modeling of scene dynamics for applications in visual surveillance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 8, pp. 1472–1485, Aug. 2009.
[27] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 17–24.
[28] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1975–1981.
[29] X. Wang, X. Ma, and W. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 539–555, Mar. 2009.
[30] H. Zhong, J. Shi, and M. Visontai, "Detecting unusual activity in video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, vol. 2, pp. 819–826.
[31] I. Bravo, J. Baliñas, A. Gardel, J. L. Lázaro, F. Espinosa, and J. García. (2011). Efficient smart CMOS camera based on FPGAs oriented to embedded image processing. Sensors [Online]. 11(3), pp. 2282–2303. Available: http://www.mdpi.com/1424-8220/11/3/2282/
[32] S. K. Nayar, "Computational cameras: Redefining the image," IEEE Comput. Mag., vol. 39, no. 8, Special Issue on Computational Photography, pp. 30–38, Aug. 2006.
[33] P. Chen, P. Ahammad, C. Boyer, S. Huang, L. Lin, E. Lobaton, M. Meingast, S. Oh, S. Wang, P. Yan, A. Y. Yang, C. Yeo, L. Chung Chang, J. D. Tygar, and S. S. Sastry, "Citric: A low-bandwidth wireless camera network platform," in Proc. ACM/IEEE Int. Conf. Distrib. Smart Cameras, 2008, DOI: 10.1109/ICDSC.2008.4635675.
[34] P. Kulkarni, D. Ganesan, P. Shenoy, and Q. Lu, "Senseye: A multi-tier camera sensor network," in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 229–238.
[35] H. Broers, W. Caarls, P. Jonker, and R. Kleihorst, "Architecture study for smart cameras," in Proc. EOS Conf. Ind. Imag. Mach. Vis., 2005, pp. 39–49.
[36] Texas Instruments IP Camera. [Online]. Available: http://www.ti.com/ipcamera
[37] Axis Communications. [Online]. Available: http://www.axis.com/
[38] Microsoft Kinect. [Online]. Available: http://www.xbox.com/kinect
[39] IEEE Standard for Wireless Local Area Networks, 802.11n. [Online]. Available: http://www.ieee802.org/11.
[40] The Zigbee Specification. [Online]. Available: http://www.zigbee.org
[41] C. Perkins, E. B. Royer, and S. Das, "Ad hoc on-demand distance vector (AODV) routing," IETF RFC 3561, 2003.
[42] 3GPP Long Term Evolution. [Online]. Available: http://www.3gpp.org/article/lte
[43] Worldwide Interoperability for Microwave Access. [Online]. Available: http://www.wimaxforum.org
[44] Xbow Micaz Motes. [Online]. Available: http://www.xbow.com
[45] RPL: IPv6 Routing Protocol for Low Power and Lossy Networks, draft-ietf-roll-rpl-19. [Online]. Available: http://tools.ietf.org/html/draft-ietf-roll-rpl-19
[46] RFC 3344: IP Mobility Support for IPv4, U.S., 2002, Tech. Rep.
[47] IBM, A Smarter Planet. [Online]. Available: http://www.ibm.com/smarterplanet.
[48] L. Luo, A. Kansal, S. Nath, and F. Zhao, "Sharing and exploring sensor streams over geocentric interfaces," in Proc. 16th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., Irvine, CA, Nov. 2008, pp. 3–12.
[49] Nokia, Sensor Planet. [Online]. Available: http://www.sensorplanet.org/.
[50] L. Sanchez, J. Lanza, M. Bauer, R. L. Olsen, and M. G. Genet, "A generic context management framework for personal networking environments," in Proc. 3rd Annu. Int. Conf. Mobile Ubiquitous Syst., 2006, DOI: 10.1109/MOBIQW.2006.361743.
[51] S. Kang, J. Lee, H. Jang, H. Lee, Y. Lee, S. Park, T. Park, and J. Song, "Seemon: Scalable and energy-efficient context monitoring framework for sensor-rich mobile environments," in Proc. ACM Int. Conf. Mobile Syst., 2008.
[52] D. J. Lillethun, D. Hilley, S. Horrigan, and U. Ramachandran, "MB++: An integrated architecture for pervasive computing and high-performance computing," in Proc. 13th IEEE Int. Conf. Embedded Real-Time Comput. Syst. Appl., Aug. 2007, pp. 241–248.
[53] F. Hohl, U. Kubach, A. Leonhardi, K. Rothermel, and M. Schwehm, "Next century challenges: Nexus – An open global infrastructure for spatial-aware applications," in Proc. 5th ACM/IEEE Int. Conf. Mobile Comput. Netw., Seattle, WA, Aug. 1999, pp. 249–255.
[54] R. Lange, N. Cipriani, L. Geiger, M. Grossmann, H. Weinschrott, A. Brodt, M. Wieland, S. Rizou, and K. Rothermel, "Making the world wide space happen: New challenges for the nexus platform," in Proc. 7th IEEE Int. Conf. Pervasive Comput. Commun., 2009, DOI: 10.1109/PERCOM.2009.4912782.
[55] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo, "Spade: The system S declarative stream processing engine," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1123–1134. [Online]. Available: http://doi.acm.org/10.1145/1376616.1376729.
[56] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Proc. IEEE Int. Conf. Data Mining Workshops, 2010, pp. 170–177.
[57] W. Thies, M. Karczmarek, and S. P. Amarasinghe, "Streamit: A language for streaming applications," in Proc. 11th Int. Conf. Compiler Construct., London, U.K., 2002, pp. 179–196. [Online]. Available: http://portal.acm.org/citation.cfm?id=647478.727935.
[58] P. S. Pillai, L. B. Mummert, S. W. Schlosser, R. Sukthankar, and C. J. Helfrich, "Slipstream: Scalable low-latency interactive perception on streaming data," in Proc. 18th Int. Workshop Netw. Oper. Syst. Support Digital Audio Video, 2009, pp. 43–48. [Online]. Available: http://doi.acm.org/10.1145/1542245.1542256.
[59] K. Hong, B. Branzoi, J. Shin, S. Smaldone, L. Iftode, and U. Ramachandran, Target Container: A Target-Centric Parallel Programming Abstraction for Video-Based Surveillance, 2010. [Online]. Available: http://hdl.handle.net/1853/36186.
[60] J. Shin, R. Kumar, D. Mohapatra, U. Ramachandran, and M. Ammar, "ASAP: A camera sensor network for situation awareness," in Proc. OPODIS 2007, ser. Lecture Notes in Computer Science, vol. 4878. Berlin, Germany: Springer-Verlag, 2007, pp. 31–47.
[61] R. Feris, A. Hampapur, Y. Zhai, R. Bobbitt, L. Brown, D. Vaquero, Y. Tian, H. Liu, and M.-T. Sun, "Case study: IBM smart surveillance system," in Intelligent Video Surveillance: Systems and Technologies, Y. Ma and G. Qian, Eds. London, U.K.: Taylor & Francis/CRC Press, 2009.
[62] ABC News, ABC7 Puts Video Analytics to the Test, Feb. 23, 2010. [Online]. Available: http://abclocal.go.com/wls/story?section=news/special_segments&id=7294108.
[63] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, S. Pankanti, A. Senior, C.-F. Shu, and Y. L. Tian, "Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking," IEEE Signal Process. Mag., vol. 22, no. 2, pp. 38–51, Mar. 2005.
ABOUT THE AUTHORS
Umakishore Ramachandran (Senior Member,
IEEE) received the Ph.D. degree in computer
science from the University of Wisconsin-Madison,
Madison, in 1986.
He is the Director of Samsung Tech Advanced
Research (STAR) Center and a Professor in the
College of Computing, Georgia Institute of Tech-
nology, Atlanta. His research interests span par-
allel and distributed systems, sensor networks,
pervasive computing, and mobile and embedded
computing.
Kirak Hong received the B.S. degree in computer
science from Yonsei University, Seoul, Korea, in
2009. Currently, he is working towards the Ph.D.
degree at the College of Computing, Georgia
Institute of Technology, Atlanta. His dissertation
research focuses on programing models and
execution frameworks for large-scale situation
awareness applications.
His research interests span distributed sys-
tems, mobile and embedded computing, and
sensor networks.
Liviu Iftode (Senior Member, IEEE) received the
Ph.D. degree in computer science from Princeton
University, Princeton, NJ, in 1998.
He is a Professor of Computer Science at
Rutgers University, Piscataway, NJ. His re-
search interests include operating systems, dis-
tributed systems, mobile, vehicular, and pervasive
computing.
Prof. Iftode is a member of the Association for
Computing Machinery (ACM).
Ramesh Jain (Fellow, IEEE) received the B.E.
degree from Visvesvaraya Regional College of
Engineering, Nagpur, India, in 1969 and the Ph.D.
degree from the Indian Institute of Technology,
Kharagpur, India, in 1975.
He is a Donald Bren Professor in Information &
Computer Sciences at the University of California
at Irvine, Irvine, where he is doing research in
EventWeb and experiential computing. His current
research interests are in searching multimedia
data and creating EventWebs for experiential computing.
Dr. Jain is a Fellow of the Association for Computing Machinery (ACM),
the Association for the Advancement of Artificial Intelligence (AAAI), the
International Association for Pattern Recognition (IAPR), and The
International Society for Optics and Photonics (SPIE).
Rajnish Kumar received the Ph.D. degree in
computer science from Georgia Institute of Tech-
nology, Atlanta, in 2006. As part of his disserta-
tion, he designed and implemented SensorStack
that provides systems support for cross layering in
network stack for adaptability.
He is currently Chief Technology Officer at
Weyond, Princeton, NJ. His research interests are
in systems support for large-scale streaming data
analytics.
Kurt Rothermel received the Ph.D. degree in
computer science from University of Stuttgart,
Stuttgart, Germany, in 1985.
Since 1990, he has been with the University of
Stuttgart, where he is a Professor of Computer
Science and the Director of the Institute of Parallel
and Distributed Systems (IPVS). His research
interests span distributed systems, computer net-
works, mobile computing, and sensor networks.
Junsuk Shin received the B.S. degree in electrical
engineering from Yonsei University, Seoul, Korea
and the M.S. degree in computer science from
Georgia Institute of Technology, Atlanta, where he
is currently working towards the Ph.D. degree.
He joined Microsoft in 2009. His research
interests include distributed systems, sensor net-
works, mobile computing, and embedded systems.
Raghupathy Sivakumar received the Ph.D. de-
gree in computer science from the University of
Illinois at Urbana-Champaign, Urbana, in 2000.
He is a Professor in the School of Electrical and
Computer Engineering at Georgia Institute of
Technology, Atlanta. He leads the Georgia Tech
Networking and Mobile Computing (GNAN) Re-
search Group, conducting research in the areas of
wireless networking, mobile computing, and com-
puter networks.