Why You Should Use Apache Spark for Data Analytics

Within the growing field of data science, Apache Spark has established itself as a leading open source analytics engine. Spark includes components for SQL queries, machine learning, graph processing, and stream processing. This guide provides some background on Spark and explains its many advantages and use cases.

What Is Apache Spark?

Spark is a unified analytics engine for large-scale, distributed data processing. Its rich feature set and high performance have made it one of the premier big data frameworks. Spark also plays an increasingly central role in the machine learning and artificial intelligence domains.

Note
Throughout this guide, the terms “Apache Spark” and “Spark” are used interchangeably.

Spark is an open source application originally developed at the University of California, Berkeley, and subsequently donated to the Apache Software Foundation, which continues to maintain, refine, and enhance it. Spark relies on a cluster manager and distributed storage, but it does not provide its own distributed file management. On a cluster, it therefore requires a cluster manager such as Hadoop YARN, Kubernetes, or Apache Mesos, along with a shared file system such as HDFS. For testing or development purposes, it can also run on a single system. However, it is designed to process massive amounts of data in a parallelized manner, so it almost always runs on a large number of servers. A Spark cluster can operate either in the cloud or on physical servers.

Spark uses a driver-executor approach. Developers supply a driver program containing a sequence of high-level operations. The Spark Core engine then analyzes the program and determines the tasks to run. It dispatches these tasks to executor processes running on the cluster. The executors return partial results to the engine, which collates them into the final output.
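
The following is a minimal sketch of such a driver program, written with the Python (PySpark) API. The application name and sample lines are invented for illustration; in a real job the input would usually come from shared storage such as HDFS.

```python
from pyspark.sql import SparkSession

# Build a SparkSession; the driver connects to whatever cluster manager
# the environment provides, or runs locally for development.
spark = SparkSession.builder.appName("WordCountExample").getOrCreate()
sc = spark.sparkContext

# In a real job the lines would usually come from shared storage, e.g.:
# lines = sc.textFile("hdfs:///data/sample.txt")
lines = sc.parallelize(["to be or not to be", "that is the question"])

# High-level operations: split lines into words, pair each word with a
# count of 1, then sum the counts per word across the executors.
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# collect() is an action: it triggers execution on the executors and
# returns the combined results to the driver.
for word, total in counts.collect():
    print(word, total)

spark.stop()
```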

Spark can be used whenever there is a large amount of data to analyze and transform. It is packaged with several powerful tools, greatly extending its range. The main use cases for Spark include data engineering, data science, and machine learning. It is most commonly used within the retail, manufacturing, financial, technology, gaming, and media industries.

How Does Apache Spark Work?

At the heart of Spark is the Spark Core engine. This application component dispatches tasks and provides support for the various Spark tools. The responsibilities of the Spark Core include memory management, job scheduling, storage access, performance monitoring, and input/output operations. The Spark Core is accessed through APIs available for Java, Scala, Python, and R. The Python API supports both NumPy and Pandas. Third-party support is available for some other languages.

Spark is organized around the concept of a resilient distributed dataset (RDD). An RDD is a fault-tolerant, read-only collection of data, also known as a multiset. It can be distributed across a cluster and processed in parallel. An RDD is often created from data in external storage, such as Hadoop or a shared filesystem, or from a file. However, an existing RDD can be converted into a new RDD through data transformations. To increase efficiency, all analytical operations act upon an RDD rather than the original data. The Spark Core implements fault tolerance by tracking the lineage of operations applied to each RDD and reconstructing lost data in the event of errors.
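
A short PySpark sketch illustrates how an RDD can be created and transformed. The values are invented for the example; note that transformations are lazy and nothing executes until an action such as collect() is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory collection (it could also come from
# HDFS, a shared filesystem, or another external store).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations produce new RDDs; the original RDD is never modified,
# and no work is performed yet.
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# An action triggers execution across the cluster and returns the
# results to the driver.
print(squares.collect())  # [4, 16, 36]

spark.stop()
```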

Spark converts the instructions in the user’s driver program into a Directed Acyclic Graph (DAG). In a DAG, a node represents an RDD, while each edge signifies an operation on the data. Spark uses this graph to build an optimized execution schedule and to distribute the lower-level tasks to executor processes running on the cluster nodes.

A DataFrame forms a higher layer abstraction on top of the RDD object. It organizes an RDD into a series of columns, similar to a database table. The result is a collection of objects that can be stored in memory and reused throughout the program. DataFrames can also be derived from structured data files and other databases.
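
The sketch below, again in PySpark, builds a small DataFrame from in-memory rows and runs a column-oriented aggregation. The column names and values are placeholders; in practice a DataFrame is more often loaded from a structured file or a database.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a DataFrame with named columns, similar to a database table.
sales = spark.createDataFrame(
    [("books", 12.50), ("books", 7.25), ("games", 30.00)],
    ["category", "amount"],
)

# DataFrames can also be read from structured files, for example:
# sales = spark.read.json("hdfs:///data/sales.json")

# Column-oriented operations: group, aggregate, and display.
sales.groupBy("category").agg(F.sum("amount").alias("total")).show()

spark.stop()
```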

Spark uses a few different techniques to enhance its performance. Shared variables, namely broadcast variables and accumulators, give parallel tasks access to common values. They support iterative algorithms that cycle over the same data repeatedly. Spark uses the Catalyst component to optimize queries for better speed and lower latency. Catalyst analyzes each query and compiles it down to Java bytecode. It works with all Spark tools, but is especially helpful for SQL queries and stream processing.
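
The following sketch shows Spark’s two kinds of shared variables, broadcast variables and accumulators, using the PySpark API. The lookup table and country codes are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SharedVariablesExample").getOrCreate()
sc = spark.sparkContext

# A broadcast variable ships a read-only value to every executor once,
# instead of resending it with each task.
country_codes = sc.broadcast({"US": "United States", "DE": "Germany"})

# An accumulator is a shared counter that tasks add to; the driver reads
# its value after an action completes.
unknown_codes = sc.accumulator(0)

def expand(code):
    mapping = country_codes.value
    if code not in mapping:
        unknown_codes.add(1)
    return mapping.get(code, "Unknown")

records = sc.parallelize(["US", "DE", "FR", "US"])
print(records.map(expand).collect())
print("Unrecognized codes:", unknown_codes.value)

spark.stop()
```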

Spark can be downloaded from the Spark downloads page. It requires a Java Virtual Machine (JVM), and works best with Hadoop. However, many companies now use Kubernetes to manage Spark. To help users get started, Spark provides some Examples, including word count and text search algorithms. These code snippets can be used as templates for other programs.

The Advantages of Apache Spark

Apache Spark is well regarded due to its high performance and rich feature set. Some of its advantages and highlights include:

  • Free and Open Source Access: Apache Spark is free to use and the source code is publicly available.
  • Performance/Speed: Spark is very fast, with low latency. The SQL engine features optimized columnar storage for faster query results. Spark reuses data from earlier calculations in subsequent steps to reduce computational demands. It performs key data science transformations many times faster than its competitors.
  • Scalability: Spark is highly scalable. A cluster can grow to encompass thousands of nodes and process over a petabyte of data.
  • Memory Management: Spark stores data in memory for faster results with lower latency. But it also works on very large data sets that cannot fit into memory. In this case, it uses disk storage and recomputation during processing. Spark determines the best approach for the given data set and system capacity.
  • Cluster Support: Spark is optimized to run on a cluster. The Standalone deploy mode only requires a Java runtime. However, cluster managers like Hadoop YARN allow for quicker deployment and easier management. Spark can also run locally on a single instance through the use of parallel threads.
  • Ease of Use: Spark includes well-defined, stable, and straightforward APIs. Its collection of high-level operators reduces complexity, allowing developers to quickly build and deploy powerful applications and pipelines. Many jobs require few instructions. In some cases, a single command can read data, calculate the results, and display the output, as the sketch following this list demonstrates.
  • Code Reuse: Spark has a modular design, making it easy to reuse the same routines in different programs and for different tasks.
  • Language Support: Spark provides APIs for many popular programming languages, including Java, Scala, Python, and R. These APIs do not require language modifications or additional libraries. For example, Spark works well with the standard Python implementation and common libraries like NumPy.
  • Advanced Tools: Spark includes a full complement of useful tools. Each component also includes an extensive library containing the most essential algorithms for its domain. Spark includes the following built-in tools:
    • Spark SQL for queries.
    • MLlib for machine learning.
    • GraphX for graph processing.
    • Structured Streaming to incrementally process streams.
  • Fault Tolerance: Spark is stable, resilient, and can handle malformed data. It includes mid-query error handling and manages unexpected input gracefully.
  • Batched Processing: Spark can break data into batches for more efficient processing. It integrates data parallelism into its data structures. Developers can create a job to run on parallel systems without worrying about dispatching tasks or resource management. The Spark Engine handles scheduling and task distribution.
  • Widespread Usage: Spark has many users and contributors. In fact, the majority of Fortune 500 companies use it. Support is available through forums, online resources, and training materials.
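
To make the Memory Management and Ease of Use points above concrete, here is a small PySpark sketch that caches a DataFrame in memory and then answers a question with a single chained statement. The region and total columns are placeholders; a commented line shows where a shared-storage path would normally be used.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AdvantagesExample").getOrCreate()

# In practice the data would typically come from shared storage, e.g.:
# orders = spark.read.json("hdfs:///data/orders.json")
orders = spark.createDataFrame(
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
    ["region", "total"],
)

# cache() keeps the DataFrame in memory so repeated queries avoid
# recomputing it from the source.
orders.cache()

# A single chained statement aggregates the data and displays the output.
orders.groupBy("region").agg(F.avg("total").alias("avg_total")).show()

spark.stop()
```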

What Are the Apache Spark Tools?

Spark contains several built-in tools, and each adds a different capability to Spark, extending its range. The tools are thoroughly integrated into Spark and use the same Spark APIs. The main set of tools includes:

  • Spark SQL: This is the most important and widely used Spark tool. Spark SQL accepts standard ANSI SQL queries and runs them against either structured or unstructured data. It can interrogate Spark DataFrames or popular file formats such as JSON. Spark SQL can handle massive amounts of data and works well in conjunction with corporate dashboards and ad hoc queries. The API allows programmers to integrate interactive SQL queries into their programs, as shown in the sketch after this list. The performance of Spark SQL is comparable to, or even better than, most data warehousing applications.
  • Structured Streaming: This feature replaces the older Spark Streaming tool. It processes streams of data, enabling real-time analytics. Spark Structured Streaming can accept data from many applications and in many formats. Streams can be processed like tables, and tables can be acted upon like streams. Structured Streaming removes the underlying complexity and is built on the Spark SQL engine. This allows users to build streaming pipelines using the same APIs as the rest of Spark. Spark even provides tools for migrating batch jobs to streaming jobs.
  • MLlib: MLlib is the Spark machine learning library for data extraction, processing, and transformation. It can process data residing on thousands of machines. MLlib can be used with Java, Scala, R, or Python, where it inter-operates with NumPy. It uses iterative computation, enabling high performance and allowing for more complex and useful algorithms. The MLlib algorithms support classification, regression, decision trees, alternating least squares, clustering, topic modeling, filtering, and more. It can also run workflows for data set transformations, pipeline construction, model evaluation, parameter tuning, and persistence. It is designed to be used with a Hadoop data source, such as HDFS, and can integrate with Hadoop workflows. For more information on this complex tool, see the MLlib library introduction.
  • GraphX: This component, currently available as a beta version, specializes in graphs, collections, and graph-parallel computation. It allows users to transform and join graphs and create custom algorithms. The GraphX library includes routines for page ranking, label propagation, strongly connected components, triangle count, and singular value decomposition. It balances good performance with flexibility, robustness, and ease of use.
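
As a brief illustration of Spark SQL, the first tool listed above, the following sketch registers a DataFrame as a temporary view and queries it with a standard SQL statement. The table contents and view name are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Build a small DataFrame and expose it to the SQL engine as a view.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

# Run an ANSI SQL query against the view; the result is a DataFrame.
adults = spark.sql("SELECT name, age FROM people WHERE age >= 30 ORDER BY age")
adults.show()

spark.stop()
```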

Apache Spark also supports a large number of third-party libraries, add-ons, and extensions. These accessories provide additional language bindings or specialized algorithms for applications including web analytics, genome sequencing, and natural language processing (NLP). Some of these projects are open source while others are available commercially. See the List of Third-Party Spark Projects for additional information.

Potential Issues and Drawbacks

Spark is extremely powerful and useful, but it is not the best choice for every scenario. One drawback is that Spark does not provide its own file management system. Spark requires a shared file system to run on a cluster of multiple machines, which necessitates additional infrastructure and integration work. Fortunately, Spark is compatible with Hadoop and can use any supported input format. Most users operate the two applications together, but other solutions, such as Kubernetes, are also available. However, some of these setups can generate a large number of small interim data files, which increases metadata overhead and reduces performance.

Unlike some programs, Spark does not automatically optimize the user-supplied driver code. Programmers are responsible for writing efficient routines and tuning them by hand. Spark is also not particularly well-suited for concurrent multi-user use.

Spark RDD data sets are read-only, as are any data structures created from the data. This makes Spark a poor choice for any applications requiring real-time updates.

Conclusion

Apache Spark is a powerful analytics engine, with support for SQL queries, machine learning, stream analysis, and graph processing. Spark is very efficient, with fast performance and low latency, due to its optimized design. Spark can run on large clusters of thousands of nodes and manage a petabyte of data, but it requires a separate file management system. For more information on Spark, see the Apache Spark Documentation.

