2012 IEEE High Performance Extreme Computing Conference (HPEC '12)
Sixteenth Annual HPEC Conference
10-12 September 2012, Westin Hotel, Waltham, MA, USA
A Third Generation Many-Core Processor for Secure Embedded Computing Systems
John Irza*, Coherent Logix, Inc.
Abstract: As compute-intensive products proliferate, there is an ever-growing need to provide security features to detect tampering, identify cloned or counterfeit hardware, and deter cybersecurity threats. This paper describes the security features of the third-generation 100-core HyperX™ processor, which addresses these needs. Programmable security barriers allow the processor to implement a red-black System on Chip solution. The implementation of Physically Unclonable Functions (PUFs), encryption/decryption engines, a secure boot controller, and anti-tamper features enables the engineer to realize a secure embedded computing solution in an ultra-low-power, many-core, C-programmable processor-memory network.

:::::::

Exploiting SPM-aware Scheduling on EPIC Architectures for High-Performance Real-Time Systems
Wei Zhang*, Virginia Commonwealth University
Abstract: In contemporary computer architectures, the Explicitly Parallel Instruction Computing (EPIC) architecture permits microprocessors to implement Instruction Level Parallelism (ILP) through the compiler, rather than through the complex on-die circuitry that superscalar architectures use to control parallel instruction execution. Building on EPIC, this paper proposes a time-predictable two-level scratchpad-based memory architecture, together with an ILP-based static memory-object assignment algorithm in the compiler that preserves the time predictability of scratchpad memories. Then, to exploit the load/store latencies that are statically known in this architecture, we study a scratchpad-aware scheduling method that improves performance by optimizing the load-to-use distance. Our experimental results indicate that scratchpad-aware scheduling improves the performance of the two-level scratchpad-based architecture on EPIC processors while preserving time predictability.

:::::::

Using Copper Water Loop Heat Pipes to Efficiently Cool CPUs and GPUs
Stephen Fried*, Microway Inc. and Passive Thermal Technology
Abstract: As the amount of power rejected by 1U servers approaches and exceeds 2 kW, the question in HPC continues to be not only how to cool devices that reject this much heat, but how to reject that heat efficiently.

:::::::

High Locality and Increased Intra-node Parallelism for Solving Finite Element Models on GPUs by Novel Element-by-Element Implementation
Zsolt Badics*, Tensor Research LLC
Abstract: The utilization of Graphical Processing Units (GPUs) for the element-by-element (EbE) finite element method (FEM) is demonstrated. EbE FEM is a long-known technique by which a conjugate gradient (CG) type iterative solution scheme can be entirely decomposed into computations at the element level, i.e., without assembling the global system matrix. In our implementation, NVIDIA's parallel computing solution, the Compute Unified Device Architecture (CUDA), is used to perform the required element-wise computations in parallel. Since element matrices need not be stored, the memory requirement can be kept extremely low. It is shown that this low-storage but computation-intensive technique is better suited to GPUs than techniques requiring the massive manipulation of large data sets. This first study of the proposed parallel model illustrates highly improved locality and minimized data movement, which could also significantly reduce energy consumption in other HPC architectures.
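For illustration, a minimal sketch of the matrix-free structure this abstract describes: a conjugate gradient solver that only ever applies the operator through per-element contributions, so no global matrix is assembled. This is an assumption-laden NumPy sketch, not the authors' CUDA implementation; the names (apply_elementwise, cg_matrix_free, elements, element_matrices) are invented, and the element matrices are kept in a list here for brevity even though a low-storage EbE GPU code would recompute them on the fly.

    import numpy as np

    def apply_elementwise(x, elements, element_matrices):
        """Accumulate y = A @ x from per-element contributions (gather, multiply, scatter)."""
        y = np.zeros_like(x)
        for dofs, Ke in zip(elements, element_matrices):
            y[dofs] += Ke @ x[dofs]          # local element matrix times local slice of x
        return y

    def cg_matrix_free(apply_A, b, tol=1e-8, max_iter=1000):
        """Standard conjugate gradient that touches the system only through apply_A(x)."""
        x = np.zeros_like(b)
        r = b - apply_A(x)
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # Usage: a toy 1D mesh of 4 two-node elements (5 degrees of freedom).
    elements = [np.array([i, i + 1]) for i in range(4)]
    Ke = np.array([[2.0, -1.0], [-1.0, 2.0]])
    x = cg_matrix_free(lambda v: apply_elementwise(v, elements, [Ke] * 4), np.ones(5))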
:::::::

Accelerating Fully Homomorphic Encryption Using GPU
Wei Wang, ECE, Worcester Polytechnic Institute; Yin Hu, ECE, Worcester Polytechnic Institute; Lianmu Chen, ECE, Worcester Polytechnic Institute; Xinming Huang*, ECE, Worcester Polytechnic Institute; Berk Sunar, ECE, Worcester Polytechnic Institute
Abstract: In a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows the evaluation of arbitrary functions directly on encrypted data on untrusted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions, which are time-consuming tasks for general-purpose processors. In the GH-FHE implementation, the most computation-intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is calculated in two steps: large-number multiplication and modular reduction. Strassen's FFT-based algorithm is used so that graphics processing units (GPUs) can employ massive parallelism to accelerate the large-number multiplication. The Barrett modular reduction algorithm is then used to complete the modular multiplication. We implemented the encryption, decryption and recryption primitives on the NVIDIA C2050. Experimental results show speedups of up to 7.68, 7.4 and 6.59 for encryption, decryption and recryption, respectively, when compared to the GH implementation for the small setting in dimension 2048.
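As a concrete illustration of the reduction step named above, here is a minimal sketch of Barrett modular reduction using plain Python integers. It is not the paper's CUDA kernel; the helper names and the toy modulus are assumptions, and the multi-million-bit operands of the GH scheme are replaced by small values.

    # Barrett reduction: one precomputed reciprocal estimate per modulus, then
    # each reduction needs only multiplications, shifts, and small corrections.
    def barrett_precompute(m):
        """Precompute k and mu = floor(2^(2k) / m) for a fixed modulus m."""
        k = m.bit_length()
        return k, (1 << (2 * k)) // m

    def barrett_reduce(x, m, k, mu):
        """Return x mod m for 0 <= x < m**2 without a division by m."""
        q = (x * mu) >> (2 * k)   # estimate of x // m, never larger than the true quotient
        r = x - q * m             # nonnegative, within a few multiples of m
        while r >= m:
            r -= m
        return r

    # Usage: modular multiplication a*b mod m built from the two steps.
    m = (1 << 61) - 1                     # toy modulus; the paper works with million-bit values
    k, mu = barrett_precompute(m)
    a, b = 123456789123456789 % m, 987654321987654321 % m
    assert barrett_reduce(a * b, m, k, mu) == (a * b) % m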
:::::::

Use of CUDA for the Continuous Space Language Model
Elizabeth Thompson*, Purdue University Fort Wayne
Abstract: The training phase of the Continuous Space Language Model (CSLM) was implemented in CUDA, NVIDIA's hardware/software parallel computing architecture. The implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach is demonstrated.

:::::::

Graph Programming Model - An Efficient Approach for the Sensor Signal Processing Domain
Steve Kirsch*, Raytheon
Abstract: The HPC community has struggled to find an optimal parallel programming model that can efficiently expose algorithmic parallelism in a sequential program and automate the implementation of a highly efficient parallel program. A plethora of parallel programming languages has been developed, along with sophisticated compilers and runtimes, but none of these approaches has been successful enough to become the de facto standard. The Graph Programming Model has the capability and efficiency to become that ubiquitous standard for the signal processing domain.

:::::::

An Application of Constraint Programming to the Design and Operation of Synthetic Aperture Radars
Michael Holzrichter*, Sandia National Laboratories
Abstract: The design and operation of synthetic aperture radars require compatible sets of hundreds of quantities. Compatibility is achieved when these quantities satisfy constraints arising from physics, geometry, etc. In the aggregate, these quantities and constraints form a logical model of the radar. In practice, the logical model is distributed over multiple people, documents and software modules, thereby becoming fragmented. Fragmentation gives rise to inconsistencies and errors. The SAR Inference Engine addresses the fragmentation problem by implementing the logical model of a Sandia synthetic aperture radar in a form that is intended to be usable from system design to mission planning to actual operation of the radar. These diverse contexts require extreme flexibility, which is achieved by employing the constraint programming paradigm.

:::::::

LLMORE: A Framework for Data Mapping and Architecture Analysis
Michael Wolf*, MIT Lincoln Laboratory
Abstract: We outline our recent efforts in developing MIT Lincoln Laboratory's Mapping and Optimization Runtime Environment (LLMORE). The LLMORE framework consists of several components that together estimate and optimize performance-critical sections of an application. This framework can be used to improve the performance of parallel applications and as an important tool for analyzing different hardware architectures. In this paper, we describe the use cases that have driven the development of LLMORE. We also give two concrete examples of how LLMORE can be used to improve the parallel performance of a numerical operation and to characterize the power efficiency of numerical algorithms and computer architectures.

:::::::

Unitary Qubit Lattice Algorithm for Two-Component Bose-Einstein Condensate Gases: the Kelvin-Helmholtz and Counter-Superflow Instabilities
George Vahala*, William & Mary
Abstract: A unitary qubit lattice algorithm, employing four qubits per lattice site, is introduced to model a set of coupled Bose-Einstein condensates (BECs) described by the Gross-Pitaevskii (GP) equation for the ground-state wave functions. Using a series of unitary collide-stream-rotate operators, the ideally parallelized (tested to over 210,000 cores) mesoscopic algorithm recovers the coupled GP equations in the diffusion limit. Both the quantum Kelvin-Helmholtz (KH) and quantum counter-superflow instabilities will be examined on high-resolution grids in both 2D and 3D. With mean velocity shear between the two components, the Kelvin-Helmholtz and counter-superflow instabilities are driven. Recent 2D simulations by Tsubota et al. [1] on such BECs, using pseudospectral codes, have uncovered novel features not seen in the classical analogues of these instabilities. In particular, as the shear-velocity interface forms a sawtooth oscillation of increasing amplitude, quantum vortices are spun off the crests and troughs and propagate within their own condensates, stabilizing the KH instability. For thicker 2D interface boundaries, the two-stream counterflow instability leads to the creation of quantum vortex pairs with complex dynamical behavior. These results will first be verified by our 2D qubit algorithms and then extended to 3D, where the quantum vortices can interact strongly and undergo reconnection and loop ejection. Because the qubit algorithms are so well parallelized, detailed 3D structures will be examined with excellent spatial resolution. The principal significance to DoD is in the development of unitary qubit codes that are immediately portable to quantum computers as they come online. It also aids in the interplay between quantum and classical turbulence and in the control of BECs. [1] H. Takeuchi, N. Suzuki, K. Kasamatsu, H. Saito and M. Tsubota, Phys. Rev. B 81, 094517 (2010).
:::::::

Early Experiences with Energy-Aware Scheduling
Kathleen Smith*, ARL DSRC
Abstract: This paper documents the early experiences and recent progress with employing the Energy-Aware Scheduler (EAS) at the DoD Supercomputing Resource Centers (DSRCs). The U.S. Army Research Laboratory (ARL) has partnered with Lockheed Martin, Altair, and Instrumental to assess feasibility on current DSRC High Performance Computing (HPC) systems. Developmental work was completed on the ARL DSRC Test and Development systems and ported to the production systems at the ARL DSRC. EAS is written in Python and works with the current program-wide scheduler, Altair PBS Professional, which is deployed across the DSRCs. EAS reduces power and cooling costs by intelligently powering off compute nodes that are not in use by currently running jobs or reserved for near-future jobs. It has been estimated that the Energy-Aware Scheduler could potentially save millions of kilowatt-hours each year throughout the program. We describe the extent of our work to date at the DSRC centers and our plans to complete the work by September 30, 2012.

:::::::

Isolating Runtime Faults with Callstack Debugging using TAU
Sameer Shende*, ParaTools, Inc.
Abstract: We present a tool that can help identify the nature of runtime errors in a program at the point of failure. This debugging tool, integrated into the TAU Performance System, allows a developer to isolate the fault in a multi-language program by capturing the signal associated with the fault and examining the program callstack. It captures the performance data at the point of failure, stores detailed information for each frame in the callstack, and generates a file that may be shipped back to the developers for further analysis. Technical Approach:
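The technical-approach text was not included in this listing. As a general illustration of the signal-and-callstack idea only, not TAU's implementation, the following Python sketch installs a handler for a fatal signal, walks the call stack at the point of failure, and writes the frames to a file that could be shipped back to developers. The handler name and report format are invented for illustration.

    import json
    import signal
    import sys
    import traceback

    def fault_handler(signum, frame):
        """On a fault signal, record the signal name and every stack frame, then exit."""
        report = {
            "signal": signal.Signals(signum).name,
            "callstack": [
                {"file": f.filename, "line": f.lineno, "function": f.name}
                for f in traceback.extract_stack(frame)
            ],
        }
        with open("fault_report.json", "w") as fh:   # file sent back for analysis
            json.dump(report, fh, indent=2)
        sys.exit(1)

    # Register the handler for representative fault signals.
    signal.signal(signal.SIGFPE, fault_handler)
    signal.signal(signal.SIGSEGV, fault_handler)

Note that pure Python cannot reliably survive a genuine segmentation fault raised inside compiled code; in a real multi-language program this interception happens at the native level, which is the role TAU's instrumentation (and, in the standard library, the faulthandler module) plays.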
:::::::

Fast Functional Simulation with a Dynamic Language
Craig Steele*, Exogi LLC
Abstract: Simulation of large computational systems-on-a-chip (SoCs) is increasingly challenging as the number and complexity of components is scaled up. With the ubiquity of programmable components in computational SoCs, fast functional instruction-set simulation (ISS) is increasingly important. Much ISS has been done with straightforward unit-delay models of a non-pipelined fetch-decode-execute iteration written in a low-to-mid-level C-family static language, delivering mid-level efficiency. Some ISS programs, such as QEMU, perform binary translation to allow software emulation to reach more usable speeds, but this relatively complex methodology has not been widely adopted for system modeling. We demonstrate a fresh approach to ISS that achieves much better performance than a fast binary-to-binary translator by exploiting recent advances in just-in-time (JIT) compilers for dynamic languages, such as JavaScript and Lua, together with a specific programming idiom inspired by pipelined processor design. We believe that this approach is relatively accessible to system designers familiar with C-family functional-simulator coding styles, and generally useful for fast modeling of complex SoC components.
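For readers unfamiliar with the baseline being improved upon, here is a minimal sketch of the straightforward unit-delay fetch-decode-execute loop mentioned above, written in Python for a made-up three-instruction ISA. It is not the paper's simulator, which relies on a JIT-compiled dynamic language and a pipeline-inspired coding idiom; the instruction set, register count, and program are invented for illustration.

    def run(program, max_steps=1000):
        """Interpret a list of (op, a, b, c) tuples, one instruction per simulated cycle."""
        regs, pc = [0] * 8, 0
        for _ in range(max_steps):
            if not 0 <= pc < len(program):       # fell off the program: halt
                break
            op, a, b, c = program[pc]            # fetch
            if op == "addi":                     # decode + execute: r[a] = r[b] + imm
                regs[a] = regs[b] + c
            elif op == "add":                    # r[a] = r[b] + r[c]
                regs[a] = regs[b] + regs[c]
            elif op == "beq" and regs[a] == regs[b]:
                pc = c - 1                       # branch target (offset by the increment below)
            pc += 1                              # unit delay: advance one instruction per step
        return regs

    # Usage: compute 5 * 3 by repeated addition; r3 ends up holding 15.
    prog = [
        ("addi", 1, 1, 5),   # r1 = 5 (counter)
        ("addi", 2, 2, 3),   # r2 = 3 (addend)
        ("beq", 1, 0, 6),    # if r1 == 0, jump past the end (halt)
        ("add", 3, 3, 2),    # r3 += r2
        ("addi", 1, 1, -1),  # r1 -= 1
        ("beq", 0, 0, 2),    # unconditional jump back to the test
    ]
    print(run(prog)[3])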
:::::::

Power and Performance Comparison of HPEC Challenge Benchmarks on Various Processors
Sharad Mehta*, Mercury Computer Systems, Inc.
Abstract: The HPEC Challenge Benchmarks may be used to compare the power and performance characteristics of various processors. The objective is to enable data-driven decisions during the selection of components and system architectures for high performance embedded computing applications. The system architect has a wide range of choices in software, firmware and hardware, including various types of processors (CPUs, GPUs, FPGAs). These components may be configured within various network topologies to accommodate the processing and data-rate requirements of the application at hand. In addition to the complexity of the algorithms and the volume and rate of the incoming data, embedded systems are challenged to be deployed in harsh conditions with restrictions in size, weight and power (SWaP). The system architect is driven to use new processing elements within new architectures. Predicting and comparing the performance of different processors is difficult, and a uniform methodology is needed to compare their performance in the various possible architectures. Several metrics may be used for comparison; for example, the processing latency, the data transfer rate in and out of the processor, and the power consumption for key mathematical kernels are some of the measurable parameters that drive the decision-making process. Component vendors provide the peak theoretical performance of a component or sub-system. However, when the system is constructed, the overall performance is generally found to be much lower than the peak theoretical performance of the sub-systems. The performance may be improved significantly by using optimization techniques, which depend, among other things, upon processor features, system capabilities, and programming and diagnostic tools.

:::::::

Synthetic Aperture Radar on a Low-Power Multi-Core Digital Signal Processor
Dan Wang*, Texas Instruments
Abstract: Commercial off-the-shelf (COTS) components have recently gained popularity in Synthetic Aperture Radar (SAR) applications. The compute capabilities of these devices have advanced to a level where real-time processing of complex SAR algorithms has become feasible. In this paper, we focus on a low-power multi-core Digital Signal Processor (DSP) from Texas Instruments Inc. and evaluate its capability for SAR signal processing. The specific DSP studied here is an eight-core device, the TMS320C6678, that provides a peak performance of 128 GFLOPS (single precision) for only 10 watts. We describe how the basic SAR operations, such as compression and corner turning, can be implemented efficiently on such a device. Our results indicate that a baseline SAR range-Doppler algorithm takes 0.2 sec for a 16 M (4K × 4K) image.
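To make the two basic operations named above concrete, here is a minimal NumPy sketch, an illustration rather than TI's optimized DSP code: range compression as FFT-based matched filtering against the transmitted chirp, and corner turning as the transpose that reorders data for the subsequent azimuth (Doppler) pass. The function names and toy data sizes are invented for this example.

    import numpy as np

    def range_compress(raw, chirp_replica):
        """Matched-filter each range line against the transmitted chirp via FFT."""
        n = raw.shape[1]
        H = np.conj(np.fft.fft(chirp_replica, n))       # matched filter in the frequency domain
        return np.fft.ifft(np.fft.fft(raw, axis=1) * H, axis=1)

    def corner_turn(data):
        """Reorganize from range-major to azimuth-major layout for the next stage."""
        return np.ascontiguousarray(data.T)

    # Usage on toy data: 512 pulses by 1024 range samples.
    rng = np.random.default_rng(1)
    chirp = np.exp(1j * np.pi * 0.05 * np.arange(64) ** 2)    # toy linear-FM chirp replica
    raw = rng.standard_normal((512, 1024)) + 1j * rng.standard_normal((512, 1024))
    compressed = range_compress(raw, chirp)
    azimuth_major = corner_turn(compressed)   # ready for azimuth (Doppler) processing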
:::::::

Integration and Development of the 500 TFLOPS Heterogeneous Cluster (Condor)
Mark Barnell*, Air Force Research Laboratory
Abstract: The Air Force Research Laboratory Information Directorate Advanced Computing Division (AFRL/RIT) High Performance Computing Affiliated Resource Center (HPC-ARC) is the host of a very large scale interactive computing cluster consisting of about 1800 nodes. Condor, the largest interactive Cell cluster in the world, consists of integrated heterogeneous processors: IBM Cell Broadband Engine (Cell BE) multicore CPUs, NVIDIA general-purpose graphics processing units (GPGPUs) and Intel x86 server nodes in a 10 Gb Ethernet star hub network and 20 Gb/s InfiniBand mesh, with a combined capability of 500 trillion floating-point operations per second (TFLOPS). Applications developed and running on Condor include large-scale computational intelligence models, video synthetic aperture radar (SAR) back-projection, Space Situational Awareness (SSA), video target tracking, linear algebra and others. This presentation will discuss the design and integration of the system. It will also show progress on performance optimization efforts and lessons learned about algorithm scalability on a heterogeneous architecture.

:::::::

Ruggedization of MXM Graphics Modules
Ivan Straznicky*, Curtiss-Wright Controls Defense Solutions
Abstract: MXM modules, used to package graphics processing devices for benign environments, have been tested for use in the harsh environments typical of deployed defense and aerospace systems. Results show that specially designed, mechanically ruggedized MXM GP-GPU modules can survive these environments and successfully provide the enormous processing capability of the latest generation of GPUs to harsh-environment applications.

:::::::

Parallel Search of k-Nearest Neighbors with Synchronous Operations
Nikos Pitsianis*, Aristotle University and Duke University
Abstract: We present a new study of parallel algorithms for locating the k nearest neighbors of each query in a high-dimensional (feature) space on a many-core processor or accelerator that favors synchronous operations, such as a graphics processing unit. Exploiting the intimate relationship between two primitive operations, select and sort, we introduce a cohort of truncated sort algorithms for select. The truncated bitonic sort (TBiS) in particular has desirable data locality, synchronous concurrency, and simple data and program structures, which outweigh its single drawback of requiring more logical comparisons. TBiS can serve two special roles. One is as a reference point or benchmark for quantitative study of the combined effect of multiple performance factors in algorithms and architectures for kNN search. The other is as the current record holder for fast kNN search on a parallel processor that imposes high synchronization cost. We provide algorithm analysis and experimental results.
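A minimal sketch of the overall kNN-by-select structure the abstract describes follows: compute query-to-corpus distances, then apply a select primitive to keep the k smallest per query. This is not the paper's TBiS kernel; NumPy's argpartition plays the role of the select here, whereas the paper replaces it with a truncated bitonic sort that maps well onto a GPU's synchronous execution model. The function name and data sizes are invented for illustration.

    import numpy as np

    def knn_search(corpus, queries, k):
        """Return indices of the k nearest corpus points for each query (L2 distance)."""
        # Squared distances via the expansion ||q - c||^2 = ||q||^2 - 2 q.c + ||c||^2.
        d2 = (
            np.sum(queries ** 2, axis=1, keepdims=True)
            - 2.0 * queries @ corpus.T
            + np.sum(corpus ** 2, axis=1)
        )
        nearest = np.argpartition(d2, k, axis=1)[:, :k]          # unordered k smallest (select)
        order = np.argsort(np.take_along_axis(d2, nearest, axis=1), axis=1)
        return np.take_along_axis(nearest, order, axis=1)        # sorted by distance

    # Usage: 10 queries against 10,000 corpus points in a 64-dimensional feature space.
    rng = np.random.default_rng(2)
    corpus, queries = rng.standard_normal((10_000, 64)), rng.standard_normal((10, 64))
    print(knn_search(corpus, queries, k=5).shape)   # (10, 5)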
:::::::

An Ingenious Approach for Improving Turnaround Time of Grid Jobs with Resource Assurance and Allocation Mechanism
Prachi Pandey*
Abstract: In a heavily used grid scenario, where many jobs compete for the best resources, the meta-scheduler is burdened with the task of judiciously allocating appropriate resources to the jobs. However, as demand for resources increases, it becomes increasingly difficult to manage the jobs and allocate resources to them, and most jobs end up in the queued state waiting for resources to become free. Gradually, this leads to a situation where jobs spend longer in the queued state than in the execution state, resulting in greatly increased turnaround times. The challenge, therefore, is to make sure that jobs do not take an unreasonable time to complete because of the increased waiting time. In this paper, we discuss the advance reservation mechanism adopted in the Garuda Grid for assuring the availability of compute resources, together with QoS-based resource allocation. Results of experiments carried out with this setup confirm a reduction in the queuing time of jobs in the grid, thereby improving turnaround time.

:::::::

Large-Scale Molecular Dynamics Simulations of Early- and Intermediate-Stage Sintering of Nanocrystalline SiC
Bryce Devine*, US Army Corps of Engineers
Abstract: Polycrystalline silicon carbide (SiC) has tremendous potential as a lightweight structural material if its fracture toughness and tensile strength can be significantly (a factor of 4) improved, which is the long-term goal of this and related research. Such a "super" ceramic would allow for a two-thirds weight reduction, or more, over steel and aluminum for most structural applications. The potential impact on military logistics is enormous. Key to the realization of such a super ceramic are the development of appropriate SiC composite designs and the development of methods to fabricate SiC composites to meet these designs through sintering. Technologies to support SiC composite design development are addressed in a companion paper; this paper discusses research to develop sintering fabrication methods. Recently developed sintering techniques allow for the production of ceramic materials with nanocrystalline grain structures and for the incorporation of organic reinforcements in ceramic composites. Both reduction in grain size and the incorporation of tensile members have been shown to improve the fracture toughness of SiC. We are performing multi-million-atom classical molecular dynamics (MD) simulations of early- and intermediate-stage Spark Plasma Sintering (SPS) of nanocrystalline SiC to better understand, and then engineer, the sintering process. We have developed continuum models to predict the thermal, electric, and displacement fields inside the sintering chamber. These provide boundary and initial conditions for the MD simulations of sintering. Several mechanisms were observed during each stage of sintering consolidation, with the rate-limiting mechanism dependent upon temperature, pressure and grain size. This research helps lay the technical foundation for the development of a lightweight structural "super" ceramic matrix composite.

:::::::

High Performance Java
Jordan Ruloff*, DRC
Abstract: It is apparent that future programming paradigms will be based around many-core processors and heterogeneous computing. Diversity in new processor architectures has led to a large variety of processors designed to address different issues found in past architectures while, unfortunately and unintentionally, burdening programmers with the task of using these new architectures effectively. As more programming libraries and languages are developed, programmers will be able to design algorithms for these different architectures to maximize their code efficiency, whether to maximize performance or minimize power usage. Unfortunately, not all code can scale efficiently on many-core architectures, nor can all code efficiently utilize heterogeneous architectures. Sometimes a programmer may have to deal with a task that is inherently serial in nature. Even if a task is trivially parallel, a programmer may find that, due to the limiting constraints of a particular architecture, such as memory or interconnect speed, the algorithm best suited to a particular problem may not be the most desirable for maximizing performance. In order to efficiently utilize the computing hardware, programmers must have a basic understanding of the fundamental differences between the various architectures and how best to utilize them. This paper covers the methods employed to address task and data parallelism within the Java language to maximize the performance of the World Wind Java Ballistic Interface code, namely Java 7's fork/join framework and AMD's Aparapi Java bindings, as well as the importance of parallel execution time and how to map it to the various execution frameworks.
:::::::

HPC-VMs: Virtual Machines in High Performance Computing Systems
Albert Reuther*, MIT Lincoln Laboratory
Abstract: The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the host operating system executing on a computer system, while isolating the guest operating systems from each other. In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of capabilities, from business management systems to email server farms. Those who have adopted virtual deployment environments have capitalized on a variety of advantages, including server consolidation, service migration, and higher service reliability, but they have also ended up with some challenges, including a sacrifice in performance and more complex system management. Some of these advantages and challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high performance computing (HPC) environment. We propose adding some virtual machine capability to already robust HPC environments for specific scenarios where the productivity gained outweighs the performance lost by using virtual machines. Finally, we discuss an implementation that adds virtual machines to the software stack of an HPC cluster, and we analyze the effect of this implementation on job launch time.

:::::::

Complex Network Modeling with an Emulab HPC
Virginia Ross*, AFRL/RITB
Abstract: To support DoD networks in the field, next-generation complex network product designs need to be evaluated for optimum performance. Network emulation plays an important role in evaluating these next-generation complex network product designs. From the component level to the system-of-systems level, emulation enables evaluation in a real system context, greatly reducing the cost and time of testing and validation throughout the design cycle. For accurate network synthesis, emulation must support real-time speed and full packet fidelity, and provide transparency. For example, the Joint Tactical Radio System (JTRS) has critical needs for network evaluation, including researching the JTRS networking waveforms. With JTRS currently undergoing massive revision, this emulation can help save time and resources in modeling the network for system development and testing. The Network Modeling and Simulation Environment (NEMSE) capability was developed and installed on the Air Force Research Laboratory/Information Directorate (AFRL/RI) EMULAB high performance computer (HPC), a network emulation testbed, to demonstrate this capability for future network modeling. The NEMSE environment has demonstrated the capability to incorporate hardware and software elements to provide hardware-in-the-loop network emulation testing and support true network emulation. NEMSE provides parallel execution, high-fidelity models, and the scalability and interactivity required to test and evaluate advanced network communication devices and architectures. This capability benefits the DoD by enabling rapid technology transition of complex network architectures from research laboratories to the field. Actual Joint Tactical Radio System (JTRS) radios, Operations Network (OPNET) emulations, and GNU (a recursive acronym: GNU's Not Unix) open-source software-defined radio software/firmware/hardware emulations can be accommodated.

:::::::

Signal & Image Processing Technology Transfer to Army Fielded Combat Robots
Peter Raeth*, DRC
Abstract:

:::::::

Use of Code Execution Profiles and Traces in the HPCMP Sustained Systems Performance Test
Paul Bennett*, U.S. Army Engineer Research and Development Center DoD Supercomputing
Abstract: The High Performance Computing Modernization Program (HPCMP) Sustained Systems Performance (SSP) test plays a vital role in ensuring that the highest level of performance is delivered to users of the HPCMP HPC systems. A subset of the benchmark codes from the system acquisition cycle is used to benchmark system performance in order to quantitatively evaluate updates to system software, hardware repairs, modifications to job queuing policies, and revisions to the job scheduler. The SSP codes have proven migration capability to HPCMP HPC systems and non-empirical tests for numerical accuracy. Metrics such as compilation time, queue wait time, benchmark execution time, and total test throughput time are gathered and compared against data from previous tests to monitor the systems under test while minimizing the impact on users. Jobs failing to execute properly, or executing in anomalously short or long times, are investigated, and the results are reported to systems administrators and Center Directors at each center for appropriate action. In the past few years, many of the SSP performance issues have been found to arise from contention for the interconnecting networks and, as such, are transient in nature. Unfortunately, without additional investigation, it is impossible to determine whether any given performance issue is a systemic problem or arises from network contention. This poster presents the results of a study made to determine the feasibility of using a lightweight profiling and tracing tool to more easily distinguish between systemic performance problems and transient interconnect problems.

:::::::

Parallel Circuit and Interconnect Simulation Using a Multi-core PC
Chun-Jung Chen*, Chinese Culture University
Abstract: This paper presents methods that utilize a multi-core PC to perform MOSFET circuit simulation and transmission line calculation. A very coarse-grained parallel computing strategy is proposed for circuit simulation. A parallel transmission line calculation method based on the Method of Characteristics is also described. All proposed methods have been implemented and tested. Experimental
:::::::

A Novel Probe Concept for Computational Imaging of the Physiological Activity of Large Numbers of Cells
Michael Henninger*, Massachusetts Institute of Technology
Abstract: We propose a new kind of imaging probe that is small enough to be implanted in the body, where it can monitor the physiology of many individual cells embedded in an intact tissue environment (e.g., neurons in the brain). Imaging cells in intact tissue presents the dual challenges of probe miniaturization and microscopic imaging in a highly scattering medium. We find a simultaneous solution to both of these challenges by replacing the lens optics of a conventional imaging system with a patterned aperture of opaque and transparent areas. The probe consists of a long, thin shank densely arrayed with CMOS imaging pixels. The pixels are covered with a transparent standoff, a patterned array of apertures, and a fluorescence emission filter. When implanted in conjunction with an excitation light source, such a device enables the measurement of fluorescence from arbitrary locations in the brain. The probe's patterned array causes the CMOS sensor to record both spatial and angular information about the incident light. The captured 4D light field, with two spatial and two angular dimensions, provides a powerful dataset for a variety of computational imaging techniques and reconstruction algorithms. Indeed, we demonstrate that this data can be used to reconstruct single-cell sources in the full 3D volume from a single image shot, without any moving parts. We present representative numerical simulations of the light field probe's operation and feasibility, and demonstrate simple examples of aperture patterns and reconstruction algorithms.
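As a rough illustration of how a patterned aperture can permit lensless, single-shot reconstruction, here is a toy linear-model sketch. It is an assumption, not the authors' probe geometry or reconstruction algorithm: the measurement is modeled as y = A x, where x holds candidate source intensities in the volume, A encodes a random binary aperture code, and the recovery is plain ridge-regularized least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    n_pixels, n_voxels = 400, 100
    A = (rng.random((n_pixels, n_voxels)) < 0.5).astype(float)   # random binary aperture code
    x_true = np.zeros(n_voxels)
    x_true[[12, 57, 83]] = 1.0                                   # three point-like cell sources
    y = A @ x_true + 0.01 * rng.standard_normal(n_pixels)        # one noisy image shot

    # Ridge-regularized least-squares reconstruction of the source distribution.
    lam = 1e-2
    x_hat = np.linalg.solve(A.T @ A + lam * np.eye(n_voxels), A.T @ y)
    print(np.argsort(x_hat)[-3:])   # indices of the three brightest recovered sources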