2012 IEEE High Performance Extreme Computing Conference (HPEC '12)
Sixteenth Annual HPEC Conference
10-12 September 2012, Westin Hotel, Waltham, MA, USA
A Third Generation Many-Core Processor for Secure Embedded Computing Systems
John Irza*, Coherent Logix, Inc.
Abstract: As compute-intensive products proliferate, there is an ever-growing need to provide security features to detect tampering, identify cloned or counterfeit hardware, and deter cybersecurity threats. This paper describes the security features of the third-generation 100-core HyperX™ processor, which addresses these needs. Programmable security barriers allow the processor to implement a red-black System on Chip solution. The implementation of Physically Unclonable Functions (PUFs), encryption/decryption engines, a secure boot controller, and anti-tamper features enables the engineer to realize a secure embedded computing solution in an ultra-low-power, many-core, C-programmable processor-memory network.

:::::::

Exploiting SPM-aware Scheduling on EPIC Architectures for High-Performance Real-Time Systems
Wei Zhang*, Virginia Commonwealth University
Abstract: In contemporary computer architectures, the Explicitly Parallel Instruction Computing (EPIC) architecture permits microprocessors to implement Instruction Level Parallelism (ILP) through the compiler, rather than through the complex on-die circuitry that superscalar architectures use to control parallel instruction execution. Building on EPIC, this paper proposes a time-predictable two-level scratchpad-based memory architecture, together with an ILP-based static memory-object assignment algorithm in the compiler that preserves the time predictability of scratchpad memories. Then, to exploit the load/store latencies that are statically known in this architecture, we study a scratchpad-aware scheduling method that improves performance by optimizing the load-to-use distance. Our experimental results indicate that scratchpad-aware scheduling improves the performance of the two-level scratchpad-based architecture on EPIC processors while preserving time predictability.

:::::::

Using Copper Water Loop Heat Pipes to Efficiently Cool CPUs and GPUs
Stephen Fried*, Microway Inc. and Passive Thermal Technology
Abstract: As the amount of power rejected by 1U servers approaches and exceeds 2 kW, the question in HPC continues to be not only how to cool devices that reject this much heat, but how to reject that heat efficiently.

:::::::

High Locality and Increased Intra-node Parallelism for Solving Finite Element Models on GPUs by Novel Element-by-Element Implementation
Zsolt Badics*, Tensor Research LLC
Abstract: The utilization of Graphical Processing Units (GPUs) for the element-by-element (EbE) finite element method (FEM) is demonstrated. EbE FEM is a long-known technique by which a conjugate gradient (CG) type iterative solution scheme can be entirely decomposed into computations at the element level, i.e., without assembling the global system matrix. In our implementation, NVIDIA's parallel computing solution, the Compute Unified Device Architecture (CUDA), is used to perform the required element-wise computations in parallel. Since element matrices need not be stored, the memory requirement can be kept extremely low. It is shown that this low-storage but computation-intensive technique is better suited to GPUs than techniques requiring the massive manipulation of large data sets. This first study of the proposed parallel model illustrates highly improved locality and minimized data movement, which could also significantly reduce energy consumption in other HPC architectures.
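For illustration, a minimal sketch of the matrix-free structure this abstract describes: a conjugate gradient solver that only ever applies the operator through per-element contributions, so no global matrix is assembled. This is an assumption-laden NumPy sketch, not the authors' CUDA implementation; the names (apply_elementwise, cg_matrix_free, elements, element_matrices) are invented, and the element matrices are kept in a list here for brevity even though a low-storage EbE GPU code would recompute them on the fly.

    import numpy as np

    def apply_elementwise(x, elements, element_matrices):
        """Accumulate y = A @ x from per-element contributions (gather, multiply, scatter)."""
        y = np.zeros_like(x)
        for dofs, Ke in zip(elements, element_matrices):
            y[dofs] += Ke @ x[dofs]          # local element matrix times local slice of x
        return y

    def cg_matrix_free(apply_A, b, tol=1e-8, max_iter=1000):
        """Standard conjugate gradient that touches the system only through apply_A(x)."""
        x = np.zeros_like(b)
        r = b - apply_A(x)
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # Usage: a toy 1D mesh of 4 two-node elements (5 degrees of freedom).
    elements = [np.array([i, i + 1]) for i in range(4)]
    Ke = np.array([[2.0, -1.0], [-1.0, 2.0]])
    x = cg_matrix_free(lambda v: apply_elementwise(v, elements, [Ke] * 4), np.ones(5))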
:::::::

Accelerating Fully Homomorphic Encryption Using GPU
Wei Wang, ECE, Worcester Polytechnic Institute; Yin Hu, ECE, Worcester Polytechnic Institute; Lianmu Chen, ECE, Worcester Polytechnic Institute; Xinming Huang*, ECE, Worcester Polytechnic Institute; Berk Sunar, ECE, Worcester Polytechnic Institute
Abstract: In a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows the evaluation of arbitrary functions directly on encrypted data on untrusted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions, which are time-consuming tasks for general-purpose processors. In the GH-FHE implementation, the most computation-intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is calculated in two steps: large-number multiplication and modular reduction. Strassen's FFT-based algorithm is used so that graphics processing units (GPUs) can employ massive parallelism to accelerate the large-number multiplication. The Barrett modular reduction algorithm is then used to complete the modular multiplication. We implemented the encryption, decryption and recryption primitives on the NVIDIA C2050. Experimental results show speedups of up to 7.68, 7.4 and 6.59 for encryption, decryption and recryption, respectively, when compared to the GH implementation for the small setting in dimension 2048.
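As a concrete illustration of the reduction step named above, here is a minimal sketch of Barrett modular reduction using plain Python integers. It is not the paper's CUDA kernel; the helper names and the toy modulus are assumptions, and the multi-million-bit operands of the GH scheme are replaced by small values.

    # Barrett reduction: one precomputed reciprocal estimate per modulus, then
    # each reduction needs only multiplications, shifts, and small corrections.
    def barrett_precompute(m):
        """Precompute k and mu = floor(2^(2k) / m) for a fixed modulus m."""
        k = m.bit_length()
        return k, (1 << (2 * k)) // m

    def barrett_reduce(x, m, k, mu):
        """Return x mod m for 0 <= x < m**2 without a division by m."""
        q = (x * mu) >> (2 * k)   # estimate of x // m, never larger than the true quotient
        r = x - q * m             # nonnegative, within a few multiples of m
        while r >= m:
            r -= m
        return r

    # Usage: modular multiplication a*b mod m built from the two steps.
    m = (1 << 61) - 1                     # toy modulus; the paper works with million-bit values
    k, mu = barrett_precompute(m)
    a, b = 123456789123456789 % m, 987654321987654321 % m
    assert barrett_reduce(a * b, m, k, mu) == (a * b) % m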
:::::::

Use of CUDA for the Continuous Space Language Model
Elizabeth Thompson*, Purdue University Fort Wayne
Abstract: The training phase of the Continuous Space Language Model (CSLM) was implemented in CUDA, NVIDIA's hardware/software parallel computing architecture. The implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach is demonstrated.

:::::::

Graph Programming Model - An Efficient Approach for the Sensor Signal Processing Domain
Steve Kirsch*, Raytheon
Abstract: The HPC community has struggled to find an optimal parallel programming model that can efficiently expose algorithmic parallelism in a sequential program and automate the implementation of a highly efficient parallel program. A plethora of parallel programming languages has been developed, along with sophisticated compilers and runtimes, but none of these approaches has been successful enough to become the de facto standard. The Graph Programming Model has the capability and efficiency to become that ubiquitous standard for the signal processing domain.

:::::::

An Application of Constraint Programming to the Design and Operation of Synthetic Aperture Radars
Michael Holzrichter*, Sandia National Laboratories
Abstract: The design and operation of synthetic aperture radars require compatible sets of hundreds of quantities. Compatibility is achieved when these quantities satisfy constraints arising from physics, geometry, etc. In the aggregate, these quantities and constraints form a logical model of the radar. In practice, the logical model is distributed over multiple people, documents and software modules, thereby becoming fragmented. Fragmentation gives rise to inconsistencies and errors. The SAR Inference Engine addresses the fragmentation problem by implementing the logical model of a Sandia synthetic aperture radar in a form that is intended to be usable from system design to mission planning to actual operation of the radar. These diverse contexts require extreme flexibility, which is achieved by employing the constraint programming paradigm.

:::::::

LLMORE: A Framework for Data Mapping and Architecture Analysis
Michael Wolf*, MIT Lincoln Laboratory
Abstract: We outline our recent efforts in developing MIT Lincoln Laboratory's Mapping and Optimization Runtime Environment (LLMORE). The LLMORE framework consists of several components that together estimate and optimize performance-critical sections of an application. This framework can be used to improve the performance of parallel applications and as an important tool for analyzing different hardware architectures. In this paper, we describe the use cases that have driven the development of LLMORE. We also give two concrete examples of how LLMORE can be used to improve the parallel performance of a numerical operation and to characterize the power efficiency of numerical algorithms and computer architectures.

:::::::

Unitary Qubit Lattice Algorithm for Two-Component Bose-Einstein Condensate Gases: the Kelvin-Helmholtz and Counter-Superflow Instabilities
George Vahala*, William & Mary
Abstract: A unitary qubit lattice algorithm, employing four qubits per lattice site, is introduced to model a set of coupled Bose-Einstein condensates (BECs) described by the Gross-Pitaevskii (GP) equation for the ground-state wave functions. Using a series of unitary collide-stream-rotate operators, the ideally parallelized (tested to over 210,000 cores) mesoscopic algorithm recovers the coupled GP equations in the diffusion limit. Both the quantum Kelvin-Helmholtz (KH) and quantum counter-superflow instabilities will be examined on high-resolution grids in both 2D and 3D. With mean velocity shear between the two components, the Kelvin-Helmholtz and counter-superflow instabilities are driven. Recent 2D simulations by Tsubota et al. [1] on such BECs, using pseudospectral codes, have uncovered novel features not seen in the classical analogues of these instabilities. In particular, as the shear-velocity interface forms a sawtooth oscillation of increasing amplitude, quantum vortices are spun off the crests and troughs and propagate within their own condensates, stabilizing the KH instability. For thicker 2D interface boundaries, the two-stream counterflow instability leads to the creation of quantum vortex pairs with complex dynamical behavior. These results will first be verified by our 2D qubit algorithms and then extended to 3D, where the quantum vortices can interact strongly and undergo reconnection and loop ejection. Because the qubit algorithms are so well parallelized, detailed 3D structures will be examined with excellent spatial resolution. The principal significance to DoD is in the development of unitary qubit codes that are immediately portable to quantum computers as they come online. It also aids in the interplay between quantum and classical turbulence and in the control of BECs. [1] H. Takeuchi, N. Suzuki, K. Kasamatsu, H. Saito and M. Tsubota, Phys. Rev. B 81, 094517 (2010).
:::::::

Early Experiences with Energy-Aware Scheduling
Kathleen Smith*, ARL DSRC
Abstract: This paper documents the early experiences and recent progress with employing the Energy-Aware Scheduler (EAS) at the DoD Supercomputing Resource Centers (DSRCs). The U.S. Army Research Laboratory (ARL) has partnered with Lockheed Martin, Altair, and Instrumental to assess feasibility on current DSRC High Performance Computing (HPC) systems. Developmental work was completed on the ARL DSRC Test and Development systems and ported to the production systems at the ARL DSRC. EAS is written in Python and works with the current program-wide scheduler, Altair PBS Professional, which is deployed across the DSRCs. EAS reduces power and cooling costs by intelligently powering off compute nodes that are not in use by currently running jobs or reserved for near-future jobs. It has been estimated that the Energy-Aware Scheduler could potentially save millions of kilowatt-hours each year throughout the program. We describe the extent of our work to date at the DSRC centers and our plans to complete the work by September 30, 2012.

:::::::

Isolating Runtime Faults with Callstack Debugging using TAU
Sameer Shende*, ParaTools, Inc.
Abstract: We present a tool that can help identify the nature of runtime errors in a program at the point of failure. This debugging tool, integrated into the TAU Performance System, allows a developer to isolate the fault in a multi-language program by capturing the signal associated with the fault and examining the program callstack. It captures the performance data at the point of failure, stores detailed information for each frame in the callstack, and generates a file that may be shipped back to the developers for further analysis. Technical Approach:
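The technical-approach text was not included in this listing. As a general illustration of the signal-and-callstack idea only, not TAU's implementation, the following Python sketch installs a handler for a fatal signal, walks the call stack at the point of failure, and writes the frames to a file that could be shipped back to developers. The handler name and report format are invented for illustration.

    import json
    import signal
    import sys
    import traceback

    def fault_handler(signum, frame):
        """On a fault signal, record the signal name and every stack frame, then exit."""
        report = {
            "signal": signal.Signals(signum).name,
            "callstack": [
                {"file": f.filename, "line": f.lineno, "function": f.name}
                for f in traceback.extract_stack(frame)
            ],
        }
        with open("fault_report.json", "w") as fh:   # file sent back for analysis
            json.dump(report, fh, indent=2)
        sys.exit(1)

    # Register the handler for representative fault signals.
    signal.signal(signal.SIGFPE, fault_handler)
    signal.signal(signal.SIGSEGV, fault_handler)

Note that pure Python cannot reliably survive a genuine segmentation fault raised inside compiled code; in a real multi-language program this interception happens at the native level, which is the role TAU's instrumentation (and, in the standard library, the faulthandler module) plays.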
:::::::

Fast Functional Simulation with a Dynamic Language
Craig Steele*, Exogi LLC
Abstract: Simulation of large computational systems-on-a-chip (SoCs) is increasingly challenging as the number and complexity of components is scaled up. With the ubiquity of programmable components in computational SoCs, fast functional instruction-set simulation (ISS) is increasingly important. Much ISS has been done with straightforward unit-delay models of a non-pipelined fetch-decode-execute iteration written in a low-to-mid-level C-family static language, delivering mid-level efficiency. Some ISS programs, such as QEMU, perform binary translation to allow software emulation to reach more usable speeds, but this relatively complex methodology has not been widely adopted for system modeling. We demonstrate a fresh approach to ISS that achieves much better performance than a fast binary-to-binary translator by exploiting recent advances in just-in-time (JIT) compilers for dynamic languages, such as JavaScript and Lua, together with a specific programming idiom inspired by pipelined processor design. We believe that this approach is relatively accessible to system designers familiar with C-family functional-simulator coding styles, and generally useful for fast modeling of complex SoC components.
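For readers unfamiliar with the baseline being improved upon, here is a minimal sketch of the straightforward unit-delay fetch-decode-execute loop mentioned above, written in Python for a made-up three-instruction ISA. It is not the paper's simulator, which relies on a JIT-compiled dynamic language and a pipeline-inspired coding idiom; the instruction set, register count, and program are invented for illustration.

    def run(program, max_steps=1000):
        """Interpret a list of (op, a, b, c) tuples, one instruction per simulated cycle."""
        regs, pc = [0] * 8, 0
        for _ in range(max_steps):
            if not 0 <= pc < len(program):       # fell off the program: halt
                break
            op, a, b, c = program[pc]            # fetch
            if op == "addi":                     # decode + execute: r[a] = r[b] + imm
                regs[a] = regs[b] + c
            elif op == "add":                    # r[a] = r[b] + r[c]
                regs[a] = regs[b] + regs[c]
            elif op == "beq" and regs[a] == regs[b]:
                pc = c - 1                       # branch target (offset by the increment below)
            pc += 1                              # unit delay: advance one instruction per step
        return regs

    # Usage: compute 5 * 3 by repeated addition; r3 ends up holding 15.
    prog = [
        ("addi", 1, 1, 5),   # r1 = 5 (counter)
        ("addi", 2, 2, 3),   # r2 = 3 (addend)
        ("beq", 1, 0, 6),    # if r1 == 0, jump past the end (halt)
        ("add", 3, 3, 2),    # r3 += r2
        ("addi", 1, 1, -1),  # r1 -= 1
        ("beq", 0, 0, 2),    # unconditional jump back to the test
    ]
    print(run(prog)[3])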
:::::::

Power and Performance Comparison of HPEC Challenge Benchmarks on Various Processors
Sharad Mehta*, Mercury Computer Systems, Inc.
Abstract: The HPEC Challenge Benchmarks may be used to compare the power and performance characteristics of various processors. The objective is to enable data-driven decisions during the selection of components and system architectures for high performance embedded computing applications. The system architect has a wide range of choices in software, firmware and hardware, including various types of processors (CPUs, GPUs, FPGAs). These components may be configured within various network topologies to accommodate the processing and data-rate requirements of the application at hand. In addition to the complexity of the algorithms and the volume and rate of the incoming data, embedded systems are challenged to be deployed in harsh conditions with restrictions in size, weight and power (SWaP). The system architect is driven to use new processing elements within new architectures. Predicting and comparing the performance of different processors is difficult, and a uniform methodology is needed to compare their performance in the various possible architectures. Several metrics may be used for comparison; for example, the processing latency, the data transfer rate in and out of the processor, and the power consumption for key mathematical kernels are some of the measurable parameters that drive the decision-making process. Component vendors provide the peak theoretical performance of a component or sub-system. However, when the system is constructed, the overall performance is generally found to be much lower than the peak theoretical performance of the sub-systems. The performance may be improved significantly by using optimization techniques, which depend, among other things, upon processor features, system capabilities, and programming and diagnostic tools.

:::::::

Synthetic Aperture Radar on a Low-Power Multi-Core Digital Signal Processor
Dan Wang*, Texas Instruments
Abstract: Commercial off-the-shelf (COTS) components have recently gained popularity in Synthetic Aperture Radar (SAR) applications. The compute capabilities of these devices have advanced to a level where real-time processing of complex SAR algorithms has become feasible. In this paper, we focus on a low-power multi-core Digital Signal Processor (DSP) from Texas Instruments Inc. and evaluate its capability for SAR signal processing. The specific DSP studied here is an eight-core device, the TMS320C6678, that provides a peak performance of 128 GFLOPS (single precision) for only 10 watts. We describe how the basic SAR operations, such as compression and corner turning, can be implemented efficiently on such a device. Our results indicate that a baseline SAR range-Doppler algorithm takes 0.2 sec for a 16 M (4K × 4K) image.
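To make the two basic operations named above concrete, here is a minimal NumPy sketch, an illustration rather than TI's optimized DSP code: range compression as FFT-based matched filtering against the transmitted chirp, and corner turning as the transpose that reorders data for the subsequent azimuth (Doppler) pass. The function names and toy data sizes are invented for this example.

    import numpy as np

    def range_compress(raw, chirp_replica):
        """Matched-filter each range line against the transmitted chirp via FFT."""
        n = raw.shape[1]
        H = np.conj(np.fft.fft(chirp_replica, n))       # matched filter in the frequency domain
        return np.fft.ifft(np.fft.fft(raw, axis=1) * H, axis=1)

    def corner_turn(data):
        """Reorganize from range-major to azimuth-major layout for the next stage."""
        return np.ascontiguousarray(data.T)

    # Usage on toy data: 512 pulses by 1024 range samples.
    rng = np.random.default_rng(1)
    chirp = np.exp(1j * np.pi * 0.05 * np.arange(64) ** 2)    # toy linear-FM chirp replica
    raw = rng.standard_normal((512, 1024)) + 1j * rng.standard_normal((512, 1024))
    compressed = range_compress(raw, chirp)
    azimuth_major = corner_turn(compressed)   # ready for azimuth (Doppler) processing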
:::::::

Integration and Development of the 500 TFLOPS Heterogeneous Cluster (Condor)
Mark Barnell*, Air Force Research Laboratory
Abstract: The Air Force Research Laboratory Information Directorate Advanced Computing Division (AFRL/RIT) High Performance Computing Affiliated Resource Center (HPC-ARC) is the host of a very large scale interactive computing cluster consisting of about 1800 nodes. Condor, the largest interactive Cell cluster in the world, consists of integrated heterogeneous processors: IBM Cell Broadband Engine (Cell BE) multicore CPUs, NVIDIA general-purpose graphics processing units (GPGPUs) and Intel x86 server nodes in a 10 Gb Ethernet star hub network and 20 Gb/s InfiniBand mesh, with a combined capability of 500 trillion floating-point operations per second (TFLOPS). Applications developed and running on Condor include large-scale computational intelligence models, video synthetic aperture radar (SAR) back-projection, Space Situational Awareness (SSA), video target tracking, linear algebra and others. This presentation will discuss the design and integration of the system. It will also show progress on performance optimization efforts and lessons learned about algorithm scalability on a heterogeneous architecture.

:::::::

Ruggedization of MXM Graphics Modules
Ivan Straznicky*, Curtiss-Wright Controls Defense Solutions
Abstract: MXM modules, used to package graphics processing devices for benign environments, have been tested for use in the harsh environments typical of deployed defense and aerospace systems. Results show that specially designed, mechanically ruggedized MXM GP-GPU modules can survive these environments and successfully provide the enormous processing capability of the latest generation of GPUs to harsh-environment applications.

:::::::

Parallel Search of k-Nearest Neighbors with Synchronous Operations
Nikos Pitsianis*, Aristotle University and Duke University
Abstract: We present a new study of parallel algorithms for locating the k nearest neighbors of each query in a high-dimensional (feature) space on a many-core processor or accelerator that favors synchronous operations, such as a graphics processing unit. Exploiting the intimate relationship between two primitive operations, select and sort, we introduce a cohort of truncated sort algorithms for select. The truncated bitonic sort (TBiS) in particular has desirable data locality, synchronous concurrency, and simple data and program structures, which outweigh its single drawback of requiring more logical comparisons. TBiS can serve two special roles. One is as a reference point or benchmark for quantitative study of the combined effect of multiple performance factors in algorithms and architectures for kNN search. The other is as the current record holder for fast kNN search on a parallel processor that imposes high synchronization cost. We provide algorithm analysis and experimental results.
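A minimal sketch of the overall kNN-by-select structure the abstract describes follows: compute query-to-corpus distances, then apply a select primitive to keep the k smallest per query. This is not the paper's TBiS kernel; NumPy's argpartition plays the role of the select here, whereas the paper replaces it with a truncated bitonic sort that maps well onto a GPU's synchronous execution model. The function name and data sizes are invented for illustration.

    import numpy as np

    def knn_search(corpus, queries, k):
        """Return indices of the k nearest corpus points for each query (L2 distance)."""
        # Squared distances via the expansion ||q - c||^2 = ||q||^2 - 2 q.c + ||c||^2.
        d2 = (
            np.sum(queries ** 2, axis=1, keepdims=True)
            - 2.0 * queries @ corpus.T
            + np.sum(corpus ** 2, axis=1)
        )
        nearest = np.argpartition(d2, k, axis=1)[:, :k]          # unordered k smallest (select)
        order = np.argsort(np.take_along_axis(d2, nearest, axis=1), axis=1)
        return np.take_along_axis(nearest, order, axis=1)        # sorted by distance

    # Usage: 10 queries against 10,000 corpus points in a 64-dimensional feature space.
    rng = np.random.default_rng(2)
    corpus, queries = rng.standard_normal((10_000, 64)), rng.standard_normal((10, 64))
    print(knn_search(corpus, queries, k=5).shape)   # (10, 5)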
:::::::

An Ingenious Approach for Improving Turnaround Time of Grid Jobs with Resource Assurance and Allocation Mechanism
Prachi Pandey*
Abstract: In a heavily used grid scenario, where many jobs compete for the best resources, the meta-scheduler is burdened with the task of judiciously allocating appropriate resources to the jobs. However, as demand for resources increases, it becomes increasingly difficult to manage the jobs and allocate resources to them, and most jobs end up in the queued state waiting for resources to become free. Gradually, this leads to a situation where jobs spend longer in the queued state than in the execution state, resulting in greatly increased turnaround times. The challenge, therefore, is to make sure that jobs do not take an unreasonable time to complete because of the increased waiting time. In this paper, we discuss the advance reservation mechanism adopted in the Garuda Grid for assuring the availability of compute resources, together with QoS-based resource allocation. Results of experiments carried out with this setup confirm a reduction in the queuing time of jobs in the grid, thereby improving turnaround time.

:::::::

Large-Scale Molecular Dynamics Simulations of Early- and Intermediate-Stage Sintering of Nanocrystalline SiC
Bryce Devine*, US Army Corps of Engineers
Abstract: Polycrystalline silicon carbide (SiC) has tremendous potential as a lightweight structural material if its fracture toughness and tensile strength can be significantly (a factor of 4) improved, which is the long-term goal of this and related research. Such a "super" ceramic would allow for a two-thirds weight reduction, or more, over steel and aluminum for most structural applications. The potential impact on military logistics is enormous. Key to the realization of such a super ceramic are the development of appropriate SiC composite designs and the development of methods to fabricate SiC composites to meet these designs through sintering. Technologies to support SiC composite design development are addressed in a companion paper; this paper discusses research to develop sintering fabrication methods. Recently developed sintering techniques allow for the production of ceramic materials with nanocrystalline grain structures and for the incorporation of organic reinforcements in ceramic composites. Both reduction in grain size and the incorporation of tensile members have been shown to improve the fracture toughness of SiC. We are performing multi-million-atom classical molecular dynamics (MD) simulations of early- and intermediate-stage Spark Plasma Sintering (SPS) of nanocrystalline SiC to better understand, and then engineer, the sintering process. We have developed continuum models to predict the thermal, electric, and displacement fields inside the sintering chamber. These provide boundary and initial conditions for the MD simulations of sintering. Several mechanisms were observed during each stage of sintering consolidation, with the rate-limiting mechanism dependent upon temperature, pressure and grain size. This research helps lay the technical foundation for the development of a lightweight structural "super" ceramic matrix composite.

:::::::

High Performance Java
Jordan Ruloff*, DRC
Abstract: It is apparent that future programming paradigms will be based around many-core processors and heterogeneous computing. Diversity in new processor architectures has led to a large variety of processors designed to address different issues found in past architectures while, unfortunately and unintentionally, burdening programmers with the task of using these new architectures effectively. As more programming libraries and languages are developed, programmers will be able to design algorithms for these different architectures to maximize their code efficiency, whether to maximize performance or minimize power usage. Unfortunately, not all code can scale efficiently on many-core architectures, nor can all code efficiently utilize heterogeneous architectures. Sometimes a programmer may have to deal with a task that is inherently serial in nature. Even if a task is trivially parallel, a programmer may find that, due to the limiting constraints of a particular architecture, such as memory or interconnect speed, the algorithm best suited to a particular problem may not be the most desirable for maximizing performance. In order to efficiently utilize the computing hardware, programmers must have a basic understanding of the fundamental differences between the various architectures and how best to utilize them. This paper covers the methods employed to address task and data parallelism within the Java language to maximize the performance of the World Wind Java Ballistic Interface code, namely Java 7's fork/join framework and AMD's Aparapi Java bindings, as well as the importance of parallel execution time and how to map it to the various execution frameworks.
:::::::

HPC-VMs: Virtual Machines in High Performance Computing Systems
Albert Reuther*, MIT Lincoln Laboratory
Abstract: The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the host operating system executing on a computer system, while isolating the guest operating systems from each other. In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of capabilities, from business management systems to email server farms. Those who have adopted virtual deployment environments have capitalized on a variety of advantages, including server consolidation, service migration, and higher service reliability, but they have also ended up with some challenges, including a sacrifice in performance and more complex system management. Some of these advantages and challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high performance computing (HPC) environment. We propose adding some virtual machine capability to already robust HPC environments for specific scenarios where the productivity gained outweighs the performance lost by using virtual machines. Finally, we discuss an implementation that adds virtual machines to the software stack of an HPC cluster, and we analyze the effect of this implementation on job launch time.

:::::::

Complex Network Modeling with an Emulab HPC
Virginia Ross*, AFRL/RITB
Abstract: To support DoD networks in the field, next-generation complex network product designs need to be evaluated for optimum performance. Network emulation plays an important role in evaluating these next-generation complex network product designs. From the component level to the system-of-systems level, emulation enables evaluation in a real system context, greatly reducing the cost and time of testing and validation throughout the design cycle. For accurate network synthesis, emulation must support real-time speed and full packet fidelity, and provide transparency. For example, the Joint Tactical Radio System (JTRS) has critical needs for network evaluation, including researching the JTRS networking waveforms. With JTRS currently undergoing massive revision, this emulation can help save time and resources in modeling the network for system development and testing. The Network Modeling and Simulation Environment (NEMSE) capability was developed and installed on the Air Force Research Laboratory/Information Directorate (AFRL/RI) EMULAB high performance computer (HPC), a network emulation testbed, to demonstrate this capability for future network modeling. The NEMSE environment has demonstrated the capability to incorporate hardware and software elements to provide hardware-in-the-loop network emulation testing and support true network emulation. NEMSE provides parallel execution, high-fidelity models, and the scalability and interactivity required to test and evaluate advanced network communication devices and architectures. This capability benefits the DoD by enabling rapid technology transition of complex network architectures from research laboratories to the field. Actual Joint Tactical Radio System (JTRS) radios, Operations Network (OPNET) emulations, and GNU (a recursive acronym: GNU's Not Unix) open-source software-defined radio software/firmware/hardware emulations can be accommodated.

:::::::

Signal & Image Processing Technology Transfer to Army Fielded Combat Robots
Peter Raeth*, DRC
Abstract:

:::::::

Use of Code Execution Profiles and Traces in the HPCMP Sustained Systems Performance Test
Paul Bennett*, U.S. Army Engineer Research and Development Center DoD Supercomputing
Abstract: The High Performance Computing Modernization Program (HPCMP) Sustained Systems Performance (SSP) test plays a vital role in ensuring that the highest level of performance is delivered to users of the HPCMP HPC systems. A subset of the benchmark codes from the system acquisition cycle is used to benchmark system performance in order to quantitatively evaluate updates to system software, hardware repairs, modifications to job queuing policies, and revisions to the job scheduler. The SSP codes have proven migration capability to HPCMP HPC systems and non-empirical tests for numerical accuracy. Metrics such as compilation time, queue wait time, benchmark execution time, and total test throughput time are gathered and compared against data from previous tests to monitor the systems under test while minimizing the impact on users. Jobs failing to execute properly, or executing in anomalously short or long times, are investigated, and the results are reported to systems administrators and Center Directors at each center for appropriate action. In the past few years, many of the SSP performance issues have been found to arise from contention for the interconnecting networks and, as such, are transient in nature. Unfortunately, without additional investigation, it is impossible to determine whether any given performance issue is a systemic problem or arises from network contention. This poster presents the results of a study made to determine the feasibility of using a lightweight profiling and tracing tool to more easily distinguish between systemic performance problems and transient interconnect problems.

:::::::

Parallel Circuit and Interconnect Simulation Using a Multi-core PC
Chun-Jung Chen*, Chinese Culture University
Abstract: This paper presents methods that utilize a multi-core PC to perform MOSFET circuit simulation and transmission line calculation. A very coarse-grained parallel computing strategy is proposed for circuit simulation. A parallel transmission line calculation method based on the Method of Characteristics is also described. All proposed methods have been implemented and tested. Experimental
:::::::

A Novel Probe Concept for Computational Imaging of the Physiological Activity of Large Numbers of Cells
Michael Henninger*, Massachusetts Institute of Technology
Abstract: We propose a new kind of imaging probe that is small enough to be implanted in the body, where it can monitor the physiology of many individual cells embedded in an intact tissue environment (e.g., neurons in the brain). Imaging cells in intact tissue presents the dual challenges of probe miniaturization and microscopic imaging in a highly scattering medium. We find a simultaneous solution to both of these challenges by replacing the lens optics of a conventional imaging system with a patterned aperture of opaque and transparent areas. The probe consists of a long, thin shank densely arrayed with CMOS imaging pixels. The pixels are covered with a transparent standoff, a patterned array of apertures, and a fluorescence emission filter. When implanted in conjunction with an excitation light source, such a device enables the measurement of fluorescence from arbitrary locations in the brain. The probe's patterned array causes the CMOS sensor to record both spatial and angular information about the incident light. The captured 4D light field, with two spatial and two angular dimensions, provides a powerful dataset for a variety of computational imaging techniques and reconstruction algorithms. Indeed, we demonstrate that this data can be used to reconstruct single-cell sources in the full 3D volume from a single image shot, without any moving parts. We present representative numerical simulations of the light field probe's operation and feasibility, and demonstrate simple examples of aperture patterns and reconstruction algorithms.
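As a rough illustration of how a patterned aperture can permit lensless, single-shot reconstruction, here is a toy linear-model sketch. It is an assumption, not the authors' probe geometry or reconstruction algorithm: the measurement is modeled as y = A x, where x holds candidate source intensities in the volume, A encodes a random binary aperture code, and the recovery is plain ridge-regularized least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    n_pixels, n_voxels = 400, 100
    A = (rng.random((n_pixels, n_voxels)) < 0.5).astype(float)   # random binary aperture code
    x_true = np.zeros(n_voxels)
    x_true[[12, 57, 83]] = 1.0                                   # three point-like cell sources
    y = A @ x_true + 0.01 * rng.standard_normal(n_pixels)        # one noisy image shot

    # Ridge-regularized least-squares reconstruction of the source distribution.
    lam = 1e-2
    x_hat = np.linalg.solve(A.T @ A + lam * np.eye(n_voxels), A.T @ y)
    print(np.argsort(x_hat)[-3:])   # indices of the three brightest recovered sources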