It combines the strengths of MapReduce/Hadoop with powerful programming abstractions in Java and Scala and a high performance runtime. Stratosphere has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
Download and run Stratosphere programs in less than 5 minutes.
Beauty of Scala programming: specify what you want out of the data, not how the job is executed.
Iterative, arbitrarily large programs with multiple inputs and outputs.
Instantly deploy Stratosphere on Amazon's EC2 and run your data analysis in the cloud.
Scale out to large clusters, exploit multi-core processors and in-memory processing.
Our optimizer automatically parallelizes and optimizes your programs.
Stratosphere extends the well-known MapReduce model with new operators. These operators represent many common data analysis tasks more naturally and efficiently. All operators will start working in memory and gracefully go out of core under memory pressure.
Stratosphere allows to model analysis programs as advanced data flow graphs. For many applications, this is a more natural fit than the constrained MapReduce interface (strictly Map followed by Reduce). Furthermore, data pipelining and in-memory data transfers increase performance by drastically reducing disk and network I/O.
See how Hadoop does complex data flowsExecuting the plan shown on the left using MapReduce leads to a composition of multiple MapReduce jobs. Intermediate results are stored in HDFS after each job. This causes a lot of network and disk I/O. Remember also that a just the setup of a MapReduce job itself takes some time. This example shows that many real world applications do not fit the MapReduce model. Also, the implementation of complex data flows using MapReduce is very time-consuming.
Stratosphere is able to natively execute the job. Everything is processed in-memory. Only if the data does not fit into the memory anymore, it starts using the local hard disks.
You can write your parallel, distributed applications for Stratosphere in Java or Scala. The Java API will feel similar to Hadoop's MapReduce abstraction, but offers more functions and a more flexible data model. Scala is a functional object oriented programming language with powerful language features. Stratosphere's Scala API supports fluent and concise, yet efficient, analysis programs. Behind the scenes, Stratosphere uses code generation techniques to bridge between the Scala language and the runtime.
See our Quickstart guides for Scala and Javaval input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
.groupBy { word => word }
.count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
Data Mining, Machine Learning and Graph processing algorithms often require to loop over the working data multiple times. Stratosphere supports iterative algorithms in its core. (The runtime allows for very fast iteration times and the optimizer deals with caching loop-invariant data.) The advanced incremental iterations support algorithms that focus on the "hot part" of the evolving solution and may converge even faster.
Iterative algorithms with HadoopIterative algorithms are implemented in Hadoop MapReduce using a central driver that spawns MapReduce jobs until the result has been computed. This approach has many disadvantages:
Stratosphere natively executes iterative algorithms. The result of the last operator is fed back to the input of the first operator (in-memory). It is not required to start a new job on each iteration. Stratosphere detects which parts of the data need processing for further iterations. Only those are loaded.
Stratosphere comes with an optimizer that is independent of the actual programming interface. It chooses a fitting execution strategy depending on the inputs and operations. For example the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm.
Focus on your application logic rather than parallel execution.
Stratosphere seamlessly integrates into existing Hadoop setups and runs side-by-side with Hadoop's TaskTrackers and DataNodes. Stratosphere can read data from Hadoop sources, but comes with its own efficient runtime. Similar to Hadoop, Stratosphere scales by adding more machines to the cluster.
Stratosphere runs also on Hadoop 2.2 (YARN), so you don not need to change your infrastructure.
The Local execution mode allows to debug and analyze your application right from your favorite IDE, without having Stratosphere installed.
Stratosphere is an active, community driven open-source project. It is licensed under the Apache License.
Our friendly community is always open to new users and developers. Join us and shape the future of Big Data.
Download and try Stratosphere. Our Quickstart scripts make it easy for developers to create an empty Stratosphere program skeleton to start from. Dependencies are seamlessly handled by Maven without any installation. Testing and debugging is possible directly inside the IDE with Stratosphere's embedded mode. Ready-made binaries for cluster setups are available as well.
Visit bigdataclass.org for Stratosphere programming exercises.