Using the Apache Spark Tool
The Apache Spark Big Data tool provides a solution for fast large-scale data processing. Like Apache Hadoop, Spark relies on parallelization: your data is not analyzed on a single machine or small laptop, but in parallel across the machines of a cluster. It is a flexible in-memory analytics framework that can handle both batch and real-time data processing workloads.
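To illustrate this processing model, here is a minimal PySpark sketch. It assumes PySpark is installed and runs on the local cores only; on a real cluster the same code would be spread across many machines. The application name and example data are made up for illustration.

from pyspark.sql import SparkSession

# Start a Spark session; "local[*]" uses all local CPU cores, while a cluster
# master URL (e.g. YARN) would distribute the same work over many machines.
spark = SparkSession.builder \
    .appName("parallel-example") \
    .master("local[*]") \
    .getOrCreate()

# Distribute a collection over the available workers and process it in parallel.
numbers = spark.sparkContext.parallelize(range(1000000))
sum_of_squares = numbers.map(lambda x: x * x).sum()
print(sum_of_squares)

spark.stop()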
Spark is very powerful, but it requires an underlying computing infrastructure that differs from a usual desktop computer or simple laptop. Users of this tool typically rely on a data center that offers such a computing infrastructure, and solutions based on Spark have been reported to run up to 100x faster than traditional Hadoop MapReduce solutions for in-memory workloads. Spark ships with several libraries, such as Spark Streaming, Spark SQL, GraphX for graph processing, and MLlib for machine learning.
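As a rough sketch of how one of these libraries is used, the following Spark SQL example reads a hypothetical CSV file (the file name and its columns `user` and `amount` are assumptions made for illustration) and aggregates it both through the DataFrame API and through plain SQL.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-example").master("local[*]").getOrCreate()

# Hypothetical input: a CSV file with columns `user` and `amount`.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Aggregation through the DataFrame API ...
per_user = events.groupBy("user").agg(F.sum("amount").alias("total"))
per_user.show()

# ... or through plain SQL after registering a temporary view.
events.createOrReplaceTempView("events")
spark.sql("SELECT user, SUM(amount) AS total FROM events GROUP BY user").show()

spark.stop()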
Spark offers two properties beyond traditional MapReduce: extensions towards Directed Acyclic Graphs (DAGs) of operations, and improved data-sharing capabilities through in-memory computing. The key concepts of Spark are Resilient Distributed Datasets (RDDs) and usability. RDDs are fault-tolerant collections of elements that can be operated on in parallel. Improved usability means writing less code through rich APIs with support for several programming languages (e.g. Scala, Java, Python). Apache Spark is open source and compatible with many data storage systems (e.g. HDFS, S3).
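The short sketch below shows these concepts together, assuming a hypothetical HDFS input path: transformations on an RDD only build up the DAG, cache() keeps the intermediate result in memory, and both subsequent actions reuse it instead of recomputing from the input.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: textFile() and filter() only add nodes to the DAG.
lines = sc.textFile("hdfs:///data/logs/*.log")   # hypothetical HDFS path
errors = lines.filter(lambda line: "ERROR" in line)

# cache() keeps the filtered RDD in memory, so both actions below reuse it
# instead of re-reading and re-filtering the input (in-memory data sharing).
errors.cache()

print(errors.count())                                    # first action triggers execution
print(errors.filter(lambda l: "timeout" in l).count())   # reuses the cached partitions

spark.stop()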
Free Apache Spark Book available
An interesting book about Apache Spark is officially available here.
More details on Apache Spark
There is also an interesting video about Apache Spark that we recommend watching.