Paper "“All Roads Lead to Rome:” Optimistic Recovery for Distributed Iterative Data Processing" accepted at CIKM 2013

21 Oct 2013

Our paper "“All Roads Lead to Rome:” Optimistic Recovery for Distributed Iterative Data Processing" authored by Sebastian Schelter, Kostas Tzoumas, Stephan Ewen and Volker Markl has been accepted at the ACM International Conference on Information and Knowledge Management (CIKM 2013) in San Francisco.

Abstract

Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point.

We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance. We illustrate the applicability of this approach for three wide classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs. In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures, our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In the presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.
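To make the idea of a compensation function concrete, here is a minimal single-process sketch in Python using PageRank as the fixpoint algorithm. It is an illustration under stated assumptions, not the implementation from the paper: the function names (pagerank_step, compensate, run), the toy graph, and the way a failure is simulated are all hypothetical. The sketch also assumes every vertex has at least one outgoing link, so the ranks always sum to one.

```python
def pagerank_step(ranks, links, damping=0.85):
    """One synchronous PageRank iteration (assumes no dangling vertices)."""
    n = len(ranks)
    new_ranks = {v: (1.0 - damping) / n for v in ranks}
    for v, outs in links.items():
        share = damping * ranks[v] / len(outs)
        for w in outs:
            new_ranks[w] += share
    return new_ranks


def compensate(ranks, lost):
    """Hypothetical compensation: rebuild a consistent state without a checkpoint.

    The ranks of the lost vertices are gone. The probability mass they held
    is spread uniformly over them so that all ranks sum to one again.
    """
    surviving = {v: r for v, r in ranks.items() if v not in lost}
    missing_mass = 1.0 - sum(surviving.values())
    for v in lost:
        surviving[v] = missing_mass / len(lost)
    return surviving


def run(links, iterations=200, fail_at=None, lost=()):
    """Iterate PageRank; optionally simulate losing part of the state once."""
    n = len(links)
    ranks = {v: 1.0 / n for v in links}
    for i in range(iterations):
        if i == fail_at:
            # Simulated failure: the state of `lost` is replaced by compensation.
            ranks = compensate(ranks, set(lost))
        ranks = pagerank_step(ranks, links)
    return ranks


# Toy graph (an assumption for illustration): vertex -> outgoing links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}

clean = run(links)
recovered = run(links, fail_at=5, lost=["b", "c"])
```

In this sketch both runs end up at the same ranks up to numerical tolerance, because the iteration converges to its fixpoint from any state whose ranks sum to one. The failure-free run pays nothing for fault tolerance, while the failed run only needs additional iterations to recover.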

Download the paper [PDF]


Demo Paper "Large-Scale Social-Media Analytics on Stratosphere" Accepted at WWW 2013

27 Mar 2013

Our demo submission
"Large-Scale Social-Media Analytics on Stratosphere"
by Christoph Boden, Marcel Karnstedt, Miriam Fernandez and Volker Markl
has been accepted for WWW 2013 in Rio de Janeiro, Brazil.

Visit our demo, and talk to us if you are attending WWW 2013.

Abstract:
The importance of social-media platforms and online communities - in business as well as public context - is more and more acknowledged and appreciated by industry and researchers alike. Consequently, a wide range of analytics has been proposed to understand, steer, and exploit the mechanics and laws driving their functionality and creating the resulting benefits. However, analysts usually face significant problems in scaling existing and novel approaches to match the data volume and size of modern online communities. In this work, we propose and demonstrate the usage of the massively parallel data processing system Stratosphere, based on second-order functions as an extended notion of the MapReduce paradigm, to provide a new level of scalability to such social-media analytics. Based on the popular example of role analysis, we present and illustrate how this massively parallel approach can be leveraged to scale out complex data-mining tasks, while providing a programming approach that eases the formulation of complete analytical workflows.


ICDE 2013 Demo Preview

21 Nov 2012

This is a preview of our demo that will be presented at ICDE 2013 in Brisbane.
The demo shows how static code analysis can be leveraged to reorder UDF operators in data flow programs.

Detailed information can be found in our papers which are available on the publication page.


Stratosphere Demo Paper Accepted for BTW 2013

12 Nov 2012

Our demo submission
"Applying Stratosphere for Big Data Analytics"
has been accepted for BTW 2013 in Magdeburg, Germany.
The demo focuses on Stratosphere's query language Meteor, which has been presented in our paper "Meteor/Sopremo: An Extensible Query Language and Operator Model" [pdf] at the BigData workshop associated with VLDB 2012 in Istanbul.

Visit our demo, and talk to us if you are going to attend BTW 2013.

Abstract:
Analyzing big data sets as they occur in modern business and science applications requires query languages that allow for the specification of complex data processing tasks. Moreover, these ideally declarative query specifications have to be optimized, parallelized and scheduled for processing on massively parallel data processing platforms. This paper demonstrates the application of Stratosphere to different kinds of Big Data Analytics tasks. Using examples from different application domains, we show how to formulate analytical tasks as Meteor queries and execute them with Stratosphere. These examples include data cleansing and information extraction tasks, and a correlation analysis of microblogging and stock trade volume data that we describe in detail in this paper.


Stratosphere Demo Accepted for ICDE 2013

15 Oct 2012

Our demo submission
"Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs"
has been accepted for ICDE 2013 in Brisbane, Australia.
The demo illustrates the contributions of our VLDB 2012 paper "Opening the Black Boxes in Data Flow Optimization" [PDF] and [Poster PDF].

Visit our poster, enjoy the demo, and talk to us if you are going to attend ICDE 2013.

Abstract:
Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates alone do not provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well-known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude.
We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs which is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties.
We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs which highlight the salient features of our approach.
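As a rough illustration of how information extracted from UDFs can drive reordering decisions, the following Python sketch checks whether two record-at-a-time operators may be swapped based on the fields each one reads and writes. This is a simplified sketch under assumptions, not the optimizer from the demo: the UdfProperties class, the can_reorder function, and the example operators are hypothetical, and the read and write sets are written by hand here instead of being derived by static code analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdfProperties:
    """Hypothetical summary of a UDF: which record fields it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def can_reorder(first, second):
    """Simplified conflict check for two record-at-a-time operators.

    Swapping is considered safe only if neither operator writes a field
    that the other one reads or writes.
    """
    return (
        first.writes.isdisjoint(second.reads)
        and second.writes.isdisjoint(first.reads)
        and first.writes.isdisjoint(second.writes)
    )


# Example operators (assumptions for illustration only).
filter_price = UdfProperties("filter_price", frozenset({"price"}), frozenset())
upper_name = UdfProperties("upper_name", frozenset({"name"}), frozenset({"name_upper"}))
apply_discount = UdfProperties("apply_discount", frozenset({"price"}), frozenset({"price"}))

swap_filter_and_upper = can_reorder(filter_price, upper_name)
swap_filter_and_discount = can_reorder(filter_price, apply_discount)
```

Here the filter on price can be moved past the operator that only touches name fields, but not past the operator that rewrites price, since the filter would then see a different value.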


Version 0.2 Released

21 Aug 2012

We are happy to announce that version 0.2 of the Stratosphere System has been released. It has a lot of performance improvements as well as a bunch of exciting new features like:

  • The new Sopremo Algebra Layer and the Meteor Scripting Language
  • An entirely new tuple data model for the PACT API
  • Fault tolerance through local checkpoints
  • A ton of performance improvements on all layers
  • Support for plug-ins on the data flow channel layer
  • Many new library classes (for example new Input-/Output-Formats)

For a complete list of new features, check out the change log.
