FastData: A symposium on the challenges of Big Data (and beyond) open to researchers, industrial stakeholders, students, and practitioners.
Computer Science Department, University of Torino
March 21st, 2016 – 9.00-16.00
As the planet evolves, an increasingly connected ecosystem of heterogeneous devices produces ever greater volumes and variety of digital data. These devices range from sensors in the fog to insanely complicated machines peering with ever-increasing precision into “the Ultimate Question of Life, the Universe, and Everything”. To keep up with the pace, very large volumes of dynamically changing data must be processed, synthesised, and eventually turned into knowledge. High-velocity data brings high value, especially to volatile business processes, mission-critical tasks, and scientific grand challenges. Some of this data loses its operational value within a short time frame; some is simply too voluminous to be stored. Ultimately, any forecast about tomorrow must arrive by tomorrow.
This is the ground where data science meets high-performance computing.
Both disciplines have undergone impressive change in recent years. On the one hand, the astonishing availability of digital data has boosted methods that automatically extract knowledge or insights from it. On the other hand, new architectures and the ubiquitous nature of data have promoted parallel and distributed techniques to the mainstream. To turn fast data into innovation, it is of paramount importance to jointly review and assess new developments and recent research achievements with both the academic community and industry. FastData@UNITO will provide an open, freely accessible forum for the presentation of these and other issues through scientific talks, and will facilitate the exchange of knowledge and new ideas at the highest technical level among researchers, graduate students, and industrial stakeholders.
Participation is free of charge
FastData Program (tentative)
- Scientific talks on tools and methods for Fast Data (9.00 – 11.15)
- Yoonho Park (IBM TJ Watson) OpenPOWER, Big Data and HPC
(abstract) Following the historic path of increased parallelism is unlikely to be sufficient for future high-performance computing needs. Growing data volumes and emerging analytics workloads are changing the demands placed on system performance. These changes lead us to a data-centric system view with new requirements at every level of the system. To manage this transition effectively, we propose system design principles to guide the development of next-generation systems and to lead us to new, more balanced systems. In the coming years, a broad array of analytics will be employed to drive business results, and these algorithms will be combined with traditional modeling and simulation methods for added value. This “convergence of purpose” (we no longer have one machine for business processing and one machine for physical modeling), combined with the growing system demands of big data, leads us to the realization that traditional machine balance points are no longer sustainable with standard technology approaches. At a macro level, workflows will take advantage of different elements of the system hierarchy, at dynamically varying times, in different ways, and with different data distributions throughout the hierarchy. This leads us to a data-centric design that has the flexibility to handle these data-driven demands. This flexibility will come from balanced and composable systems built from modular components, with computation distributed to all elements of the system hierarchy.
- Claudia Misale and Maurizio Drocco (UNITO; research interns at IBM TJ Watson and recipients of the IBM Scholarship award 2015). One Programming Model to dominate them all — Hadoop, Spark, Flink, Storm, Tensorflow: Instructions for Use
(abstract) Tools for Big Data analytics are fighting to be first-class citizens in a scenario that pays more attention to marketing than to research. A brief visit to the website of any such tool will convince you. This race is bringing us a series of tools that compete mostly on topping self-defined criteria, merely claiming to do more, faster, and better than the competitors. The documentation they provide looks somewhat “misleading on purpose”, and little attention is given to the actual programming model. This sometimes forces the user to think about non-functional aspects (e.g. parallel execution) that should be completely hidden. In this talk, we show that all these frameworks share a common, old-school DataFlow model at the top level. Moreover, we show that they provide similar expressiveness while differing in the underlying parallel runtime.
- Massimo Torquati (UNIPI). FastFlow, a programming model for FastData
(abstract) In recent years, we have seen an explosion of streaming data coming from a huge number of ubiquitous devices. Traditional big data management systems are designed for high throughput on batch jobs; they can hardly deal with urgency and latency constraints. Moreover, modern applications ask for elasticity and self-adaptivity features: they must adapt gracefully and dynamically to match real context needs. In the setting of the Internet of (Every)Thing and Fog computing, applications cannot be statically configured to sustain peak loads; this is simply too costly, both in terms of resources needed and power consumed. To tackle all these problems, we need suitable abstractions for application developers. Domain-specific experts need to concentrate on developing newer and better applications, and should not have to spend their time writing highly tuned low-level code. In this context, the FastFlow parallel programming framework may provide a step forward, not only in terms of the functionalities offered but also in terms of methodological approach. FastFlow was originally designed with three keywords in mind: streaming, performance, and flexibility. We believe these keywords are among the basic ingredients for building “FastData” applications. During the talk, we present the basic concepts of FastFlow and the latest research directions towards the “FastData” computing era.
- K. Selçuk Candan (Arizona State University). Assured and Scalable Data Engineering — Challenges and Opportunities
(abstract) A data revolution is transforming all aspects of our life, our science, and all sectors of our economy. This necessitates a fundamental shift from current ad hoc approaches to the design of data systems towards a principled framework for reliable and timely data-driven decision making. The CASCADE center will support the innovation of data architectures that can match the scale of the data and support timely and assured decision making, through data integration, processing, and analysis, to help non-data-experts make decisions and generate value. If we want to fundamentally alter the way data systems are designed and significantly change current practices, we need to first ensure that data analysis, data assurance, and data management technology components are developed synergistically to achieve the following targets: (a) the design and development of each component is informed by the requirements and limitations of the others; (b) each takes full advantage of the services and capabilities provided by the others; and (c) they continuously adapt as the analysis, assurance, and management contexts evolve with the needs of the deployed application systems that they all support. Therefore, there is an urgent need for assured and scalable data engineering paradigms to enable algorithms, tools, and systems that securely manage, share, access, and analyze heterogeneous sets of static or transient data to accommodate diverse security requirements, including trust, availability, confidentiality, and integrity.
- Francesco Bonchi (ISI Foundation). Big Data: what’s really new, what is not, and the need for new algorithms
(abstract) The data deluge we are witnessing brings in great opportunities for society, businesses, and science at large. But big promises always come with big challenges. In this non-technical and a bit provocative (given the context) position talk I will argue that the “Big Data” revolution is not only about massively parallel computation frameworks. In particular, I will stress the challenges that the increasing volume, velocity, and variety of information bring, and how these require the definition of new data analysis models and, ultimately, new algorithms.
- Carlo Nardone (NVidia Corp). Fast Data = Big Data + GPU acceleration
(abstract) In this short presentation, I will argue that Big Data Analytics requires GPU acceleration, as the field of Machine Learning (and HPC before it) vividly shows through the explosion of Deep Learning applications. Some early examples will be shown.
- Coffee break (11.15-11.30)
- Industrial-Academic panel (11.30 – 12.45):
- Marco Aldinucci – UNITO (moderator)
- Fabrizio Antonelli – Ernst & Young
- Stefano Bagnasco – INFN
- Daniele Bonetta – Oracle Labs
- Raffaele Calogero – MBC-UNITO
- Paolo Secondo Crosta – Italtel
- Cristian Dittamo – List group
- Cristina Chesta – Concept Reply
- Chiara Ferroni – Torino wireless
- Marco Panzeri – Noesis Solutions
- Carlo Nardone – NVidia Corp.
- Luca Vignaroli – RAI–CRIT
- Daniele Sereno – Nuance
- Top-IX, Camera di Commercio di Torino, Unione Industriali di Torino
- Lunch (13.00-14.00, invited only)
- Jam session (14.00 – 16.00, invited only): Rapid fire talks at the blackboard. No slides. Open discussion.
Marco Aldinucci leads the parallel computing research group and the NVidia GPU research center at the University of Torino. He is the author of over 110 papers with more than 90 different co-authors. He is the recipient of the HPC Advisory Council University Award 2011, the IEEE HPCC Outstanding Leadership Award 2014, and an IBM Faculty Award 2015. Overall, he has participated in over 40 competitive research grants concerning parallel computing, including 5 currently ongoing EU FP7 and H2020 projects (Repara, HiPEAC, Rephrase, Toreador, HyVar) and 2 EU COST Actions (Nesus and Chipset). His main research is focused on languages and tools for parallel and distributed computing, in particular on models and tools for high-level parallel programming, lock-free algorithms, and autonomic computing. He is a co-designer of a number of parallel programming frameworks, including FastFlow.
Francesco Bonchi is Research Leader at the ISI Foundation, Turin, Italy, where he leads the “Algorithmic Data Analytics” group. Before that, he was Director of Research at Yahoo Labs in Barcelona, Spain, where he led the Web Mining Research group. His recent research interests include mining query logs, social networks, and social media, as well as the privacy issues related to mining these kinds of sensitive data. In the past he has been interested in data mining query languages, constrained pattern mining, mining spatiotemporal and mobility data, and privacy-preserving data mining.
K. Selçuk Candan is a Professor of Computer Science and Engineering and the Director of the Center for Assured and Scalable Data Engineering (CASCADE) at Arizona State University. He has published over 170 journal and peer-reviewed conference articles, one book, and 16 book chapters, and holds 9 patents. Prof. Candan served as an associate editor of one of the most respected database journals, the Very Large Databases (VLDB) Journal. He is also on the editorial boards of ACM Transactions on Database Systems, IEEE Transactions on Multimedia, and the Journal of Multimedia. He has served on the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD’06, the flagship database conference of the ACM. In 2008, he served as a PC Chair for another leading, flagship ACM conference, this time focusing on multimedia research (MM’08). More recently, he served as a program committee group leader for ACM SIGMOD’10. He also serves on the review board of the Proceedings of the VLDB Endowment (PVLDB). In 2011, he served as a general co-chair for the ACM MM’11 conference, and in 2012 as a general co-chair for ACM SIGMOD’12. In 2015, he served as a general co-chair for the IEEE International Conference on Cloud Engineering (IC2E). He has successfully served as the PI or co-PI of numerous grants, including from the National Science Foundation, the Air Force Office of Scientific Research, the Army Research Office, the Mellon Foundation, and HP Labs. He also served as a Visiting Research Scientist at NEC Laboratories America for over 10 years. He is a member of the Executive Committee of ACM SIGMOD and an ACM Distinguished Scientist. More information about his research and an up-to-date resume can be found at http://aria.asu.edu/candan.
Maurizio Drocco is a Ph.D. student in Computer Science at the University of Turin. He has been a Research Intern at the IBM Thomas J. Watson Research Center (NY) and at the IBM Dublin Research Lab, and has been a research associate at the University of Torino since 2009. He has co-authored more than 20 papers in international journals and conference proceedings. He is participating in the EU-FP7 REPARA and EU-H2020 RePhrase projects; formerly, he participated in the EU-FP7 ParaPhrase and the Regione Piemonte BioBITs projects. His research focuses on high-level parallel programming, in particular models and methods for high-performance computing on heterogeneous platforms.
Claudia Misale is a PhD candidate at the Computer Science Department of the University of Torino and a member of the parallel computing Alpha group. She participated in the European STREP FP7 ParaPhrase and REPARA projects, and has been a research intern in the High Performance System Software research group in the Data Centric Systems Department at IBM T.J. Watson, working on optimizing big data analytics frameworks for Data-Centric Systems (DCS). Her research is focused on high-performance computing, in particular on high-level models and patterns for distributed computing and high-performance big data analytics on HPC platforms.
Carlo Nardone. A physicist by background, Carlo Nardone is an HPC professional with more than 25 years of experience in mathematical modelling and numerical simulation, scientific data analysis, parallel and distributed computing, and accelerated computing. He joined NVIDIA about 1.5 years ago as a Senior Solution Architect in the EMEA Enterprise team. His current focus, alongside traditional HPC projects, is on the latest “killer app” in this field, namely Deep Learning, helping partners and customers adopt NVIDIA Deep Learning technologies, particularly for autonomous driving applications.
Yoonho (Yoon) Park manages the System Software group in the Data Centric Systems Department at IBM T. J. Watson Research Center. The DCS Department produced the Blue Gene supercomputers and is now constructing the CORAL Sierra and Summit supercomputers. Yoon received a PhD in Computer Science and Engineering from the University of Michigan. He developed the virtual memory system and file system for IBM SawMill Linux — a decomposed Linux running on the L4 microkernel, the network stack for ReefEdge Networks, the network stack for IBM InfoSphere Streams, and IBM FusedOS — a hybrid operating system that combines Linux and CNK.
Massimo Torquati is a researcher in the Computer Science Department of the University of Pisa. He has published more than 50 peer-reviewed papers, mostly in the fields of parallel and distributed programming and runtime systems. He has been involved in a number of Italian government, EU, and industry-supported research projects, most recently the EU ParaPhrase, REPARA, and RePhrase projects. His current research interests are pattern-based parallel programming models, concurrent lock-free data structures, and autonomic management in parallel systems. He has contributed to the design and development of several frameworks for parallel programming. Currently, he is the maintainer and main developer of the FastFlow parallel programming framework.
Computer Science Department, University of Torino
Entryway from Via Pessinetto 12, Torino