Apache Spark has attracted growing attention from the Big Data community over the past three years. In fact, it has become a de facto standard for data processing on Hadoop due to both the functional and performance benefits it provides over legacy competitors such as MapReduce and Hive.
As with many rapidly emerging technologies, implementation of features often takes priority over non-functional aspects such as security. Therefore, a careful architectural analysis is required to ensure that the technology used for data processing does not pose a threat to the safety and confidentiality of the data.
Among a myriad of potential security concerns, this article focuses on how authorized access and confidentiality are addressed in the Apache Spark security model. Spark 2.0, the latest major release at the time of writing, contains a considerable number of architectural changes and has been chosen as the subject of exploration. Additional commentary is provided on features that changed dramatically between Spark 1.6 and 2.0.
Spark Architecture Overview
A security analysis would be incomplete without an in-depth understanding of a technology's architectural design. In this article, we concentrate on the YARN/Hadoop-based deployment configuration, which has been widely adopted by the data community. Other deployment modes supported by Spark (e.g. Mesos) may require additional components and control measures.
In general, Spark follows a master-slave architecture with slightly different terminology. The master is called the driver, while the slaves are referred to as executors. The component names are telling: the former schedules tasks, while the latter executes them. A task can be anything, but most frequently it is logic that performs some type of data processing. Both the driver and the executors are essentially Java Virtual Machine processes.
The driver and executors are run on a Yet Another Resource Negotiator (YARN) cluster, which has its own master-slave implementation. The Resource Manager acts as the master, and controls resources of the whole cluster. The workers behave as slaves that are administered by a Node Manager. The Node Manager controls worker resources, such as processor time, memory, and disk space, which is important for our analysis.
This is a typical YARN deployment model, which other frameworks, like MapReduce, follow.
Spark components are ephemeral by nature — they only exist while scheduled computations are performed. After completion, the components are destroyed and resources freed for other tasks. Thus, the Hadoop Distributed File System (HDFS) is used as persistent storage for data that serves as input and output for tasks. The architecture also supports additional forms of permanent storage, such as Amazon S3. The local worker’s disk is considered transient storage, and should not be trusted for any long term storage.
Internally, both the driver and the executors run a number of endpoints to fulfill their functionality. These are examined in the next section.
Components, Endpoints, and Control Mechanisms
The diagram below represents the most significant Spark components and control measures applied to them.
Drivers and Executors
Endpoints exposed by drivers and executors are characterized in the table below:
|Endpoint|Component|Purpose|Protocol|Protection|
|---|---|---|---|---|
|Driver port|Driver|Used for task scheduling. Executors connect to a driver and receive tasks for execution.|Binary RPC|DIGEST-MD5 via SASL (before 2.2); custom protocol with AES (since 2.2)|
|Block manager (aka block transfer service)|Driver/Executor|A protocol used to transfer data blocks between the cluster's nodes. Handles both the intermediate (shuffle) and the final results of a computation.|Binary RPC|DIGEST-MD5 via SASL (before 2.2); custom protocol with AES (since 2.2)|
|Web UI|Driver|A UI used by an end user to track a job's progress, configuration, and resource allocation. Also gives the capability to terminate a job.|HTTP|IP-based filter with authentication via YARN|
Integration with Hadoop
Spark relies on Hadoop as an execution environment and for permanent storage. It authenticates with Hadoop components via the Kerberos protocol and generates a Hadoop-specific authentication secret, known as the delegation token. The token guarantees that the job can continue to communicate with other components even after the Kerberos credentials expire. An additional significant benefit of the delegation token is the ability to tie activities performed on a cluster to a single user identity without propagating the original credentials to all of the cluster nodes.
It is important to note that Spark uses Kerberos for authentication with Hadoop services only. The other mechanisms, listed in the table above, are used to protect Spark's own components.
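As a sketch of how these component-level protections are switched on in practice, the following spark-submit invocation (the job class and JAR name are hypothetical) enables Spark's shared-secret authentication and SASL-based wire encryption via the documented `spark.authenticate*` properties:

```shell
# Hypothetical job submission: enables shared-secret authentication
# (spark.authenticate) and SASL encryption for the RPC and block
# transfer endpoints described in the table above. On YARN, the
# shared secret itself is generated per job and distributed by the
# Node Manager, so no secret appears on the command line.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.authenticate=true \
  --conf spark.authenticate.enableSaslEncryption=true \
  --class com.example.SampleJob \
  sample-job.jar
```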
Local Disk Storage
It is important to note that data blocks are not only transferred over the wire by the block manager. They can also be persisted as shuffle files to a worker's local disk if they do not fit into memory or if persistence has been explicitly requested by the job.
External Shuffle Service
On the topic of locally stored block files, it is worth mentioning a third important component in the YARN deployment model — the External Shuffle Service. Its purpose is to provide additional fault tolerance in the case of an executor failure. Even if an executor has failed, its intermediate blocks can still be of use to other executors; in a naïve implementation, however, such blocks would be lost. Spark provides an interesting solution to this problem. In addition to being available to an executor, shuffle files are shared with the External Shuffle Service, which permanently runs as a separate component of the Node Manager process. Therefore, if the executor has been prematurely terminated for any reason, the blocks remain available to other executors and the driver for as long as the job is still running on YARN.
The Shuffle Service exposes a single block manager interface that is similar to the executor’s block manager with the exception that it allows specification of a job ID.
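As an illustration (property names are from the Spark documentation; the corresponding `yarn-site.xml` wiring of the `spark_shuffle` auxiliary service varies by distribution), a job opts into the External Shuffle Service at submission time:

```shell
# With spark.shuffle.service.enabled, executors register their shuffle
# files with the Node Manager's shuffle service instead of serving them
# themselves, so the blocks survive executor loss for the lifetime of
# the YARN application. The service is also a prerequisite for dynamic
# executor allocation.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --class com.example.SampleJob \
  sample-job.jar
```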
Spark History Server
The last component to mention is the History Server. It provides the same functionality as a driver’s Web UI for jobs that have already completed.
It's important to note that the History Server provides no authentication capabilities and is open to anyone who can reach it.
Deprecated and Unused Endpoints
Akka, shown in gray on the diagram above, was used by Spark as a core mechanism for both task scheduling between executors and shuffling. It was gradually deprecated and has been completely removed in version 2.0. However, for older versions of Spark (pre-1.6), the Akka transport protocol should still be taken into consideration. An important caveat is that Akka does not support a proper authentication algorithm; instead, it uses a "secret" cookie that is sent in the clear over the wire. The best solution to this security issue is to upgrade to Spark 2.0 or, if that is not feasible, at the very least to enable SSL/TLS in order to mitigate eavesdropping.
Spark documentation also mentions an HTTP server used for broadcast and as a file server. Neither is used in YARN deployment mode: static variables are broadcast via the block manager by default, and the file server's functionality is replaced by HDFS, which stages a job's code (contained in JAR files) and configuration (contained in ZIP files).
Data in Motion Protection
As described in previous sections, Spark actively uses the Simple Authentication and Security Layer (SASL) framework for transport protection via service-to-service authentication and, optionally, encryption.
SASL is a well-known authentication framework that supports pluggable authentication and encryption algorithms. Among the dozens of authentication mechanisms provided by SASL, Spark uses the obsolete DIGEST-MD5. This creates potential issues for future maintainability and certification compliance, as MD5 is not FIPS-compliant.
The most plausible explanation for why an already outdated algorithm was chosen to support major security functionality in a new product with no legacy compatibility requirements is that it is the strongest authentication protocol supported out of the box by the Java Virtual Machine. The only better option is GSSAPI (i.e. Kerberos), which was probably excluded from consideration due to its dependency on a large infrastructure.
DIGEST-MD5 is based on a pre-shared secret that is delivered to both the driver and executor instances through the Node Manager’s communication channel. Thus, to guarantee that the secret cannot be eavesdropped, the Node Manager communication should be encrypted.
While DIGEST-MD5 provides a minimum necessary level of authentication guarantees, most of its drawbacks are not applicable to a Spark implementation:
- The man-in-the-middle attack via protocol downgrade is mitigated by enforcing quality of protection on all sides of the communication.
- Weak password brute force is mitigated by the generation of a random secret 256 bits in length on job start.
- MD5-based digest authentication has no known complete attacks (only some partial ones); however, this may simply be because MD5 is generally recognized as deprecated and therefore attracts little research interest.
On the encryption side, the situation is less bright. DIGEST-MD5 limits the set of usable encryption algorithms to:
- 3DES and RC4-128 — considered high strength
- DES and RC4-56 — medium strength
- RC4-40 — low strength
The encryption algorithm used for a particular connection is chosen by a negotiation process that prioritizes algorithms by strength. Spark makes no effort to disable the obviously weak encryption types (DES, RC4-56, and RC4-40), although such capability is provided by the Java API.
Additionally, the algorithm in question is generally vulnerable to an active man-in-the-middle attack, which can force a cipher downgrade to the lowest level of protection, RC4-40, which can be brute-forced on commodity hardware.
This issue has been recognized by the Spark community, and both authentication and data-in-transit encryption mechanisms have been reimplemented to support the Advanced Encryption Standard (AES). An initial redesign attempt resulted in a mechanism still based on DIGEST-MD5, however, and therefore remained vulnerable to the same attacks as the original mechanism. The second implementation (current at the moment of writing) employs a custom authentication protocol that has yet to receive a comprehensive security analysis. According to the documentation, "the protocol is influenced by the ISO/IEC 9798", which raises concerns because ISO/IEC 9798 is known to contain weaknesses. The changes are planned for release with Spark version 2.2.0.
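For clusters on 2.2 or later, the new AES-based transport can be selected and the SASL fallback disabled, so that a peer cannot force a downgrade back to the DIGEST-MD5 path (a sketch; the job class and JAR name are hypothetical, property names are from the Spark security documentation):

```shell
# spark.network.crypto.enabled switches RPC encryption to the AES-based
# implementation; disabling saslFallback rejects peers that only speak
# the legacy DIGEST-MD5/SASL mechanism.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  --conf spark.network.crypto.saslFallback=false \
  --class com.example.SampleJob \
  sample-job.jar
```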
It is worth noting that the External Shuffle Service uses the same authentication and encryption mechanism as other Spark services. It also contains authorization capabilities in order to verify that the service instance belongs to the job, which should have access to the data. Authorization is implemented via a check to determine if the job ID corresponds to the authentication secret and that the job ID owns the data.
Both the Spark driver and the Spark History Server expose a Web UI to report a job’s status, display its list of executors, and provide control mechanisms, including the ability to kill the job. However, they provide totally different control mechanisms.
The Spark driver enforces both authentication and authorization on the Web UI. Authentication relies on the YARN Resource Manager's Kerberos/SPNEGO support: if authentication succeeds, the Resource Manager forwards HTTP requests to the Spark driver, which in turn authenticates the Resource Manager by means of source IP verification. Thus, any code trusted to execute on the Resource Manager host, or any user allowed to SSH into it, could effectively impersonate any Spark Web UI user and perform a privilege escalation. Communication between the Resource Manager and the Spark driver can be encrypted with SSL/TLS.
The History Server does not provide any controls with the exception of optional SSL/TLS support.
Data in Use
As already mentioned, the Spark architecture relies on temporary files stored on a worker's hard drive. These files serve as an intermediate store for data that is reused between computations or spilled from memory when it does not fully fit there.
Both the driver and executor containers are executed on behalf of the Linux user who started the job in YARN. Thus, the single user identity is used across all architectural levels to provide a type of jail that protects from unauthorized access.
For example, a screenshot of a YARN-based Spark application UI shows that the job has been started by the jack.harkness user, and jack.harkness also owns the Spark temporary files stored on the worker node's file system. Sharing with the yarn group is necessary to provide access for the External Shuffle Service:
```
$ ls -l /data/3/yarn/nm/usercache/<...>/blockmgr-2f58c672-f415-46a5-a2b2-b96c3ba1e4cb/1c/
-rw-r----- 1 jack.harkness yarn 166988 Mar 19 23:05 rdd_2_0
```
This approach provides both good security guarantees and scalability. However, it only works on a Hadoop cluster with Kerberos authentication enabled. It is also important to note that jobs executed under the same user will have access to each other's temporary files.
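The per-user jail can also be audited mechanically. The sketch below runs against a scratch directory rather than a real usercache (so the file names are purely illustrative) and flags any block file readable by users outside the owner and the group:

```shell
# Simulate two block files and flag the one with world-readable
# permissions, which would break the per-user jail model.
scratch=$(mktemp -d)
touch "$scratch/rdd_2_0" "$scratch/leaky_block"
chmod 640 "$scratch/rdd_2_0"      # owner + group only: the expected state
chmod 644 "$scratch/leaky_block"  # world-readable: a misconfiguration
find "$scratch" -type f -perm -o=r  # prints only the leaky file
```

Running the same `find` predicate over the real YARN local directories is a cheap periodic check that no job is leaving world-readable shuffle or RDD blocks behind.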
Older versions of Spark did not encrypt temporary files, so additional disk encryption had to be applied at the infrastructure level to prevent a data leak. Temporary file encryption has been available since Spark 2.1.
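On 2.1 and later, the spill encryption mentioned above is toggled with the documented `spark.io.encryption.*` properties (shown here as a hypothetical submission; the Spark documentation recommends enabling RPC encryption alongside it):

```shell
# Encrypts shuffle and spill files written to the worker's local disk.
# Keys are generated per job; keySizeBits of 256 assumes the JVM allows
# unlimited-strength cryptography.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.authenticate=true \
  --conf spark.io.encryption.enabled=true \
  --conf spark.io.encryption.keySizeBits=256 \
  --class com.example.SampleJob \
  sample-job.jar
```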
Encrypted temporary files do not completely solve the data safety concern, due to how YARN implements shared secrets. The Node Manager shares secrets with a job's container using a container_tokens file stored on disk:
```
$ ls -l /data/1/yarn/nm/usercache/<...>/container_e105_1488497713812_12146_01_000002/
-rw------- 1 jack.harkness yarn   399 Mar 19 22:51 container_tokens
-rwx------ 1 jack.harkness yarn 59596 Mar 19 22:51 launch_container.sh
lrwxrwxrwx 1 jack.harkness yarn    95 Mar 19 22:51 spark_conf -> /data/1/yarn/nm/usercache/jack.harkness/filecache/18/__spark_conf__4461305504335666972.zip
drwxr-s--- 2 jack.harkness yarn  4096 Mar 19 23:05 tmp
```
To be more specific, this file contains a shared secret used by Spark’s service-to-service authentication mechanism and Hadoop’s delegation token.
Thus, disk-level encryption is still necessary to provide full end-to-end protection for the data used by the job.
Data at Rest
Spark does not provide any capability for persistent storage; therefore, permanent data storage protection concerns are out of scope for a discussion of Spark's security model. However, as it is a crucial part of the overall design, HDFS should be mentioned as a popular and natively supported storage technology for Spark.
HDFS authentication is based on the Kerberos protocol wrapped in the SASL abstraction layer (not to be confused with Spark's DIGEST-MD5 SASL). This SASL layer supports the tunable quality of protection concept; the recommended level is "privacy" (also known as auth-conf), which stands for an encrypted channel. The SASL implementation in HDFS provides stronger security guarantees than the one in older versions of Spark; specifically, it supports mutual authentication and strong encryption. As shown in the overview section, Spark uses the Kerberos protocol and a custom delegation token mechanism to interact with HDFS.
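A quick way to verify the configured protection level on an existing cluster is the stock hdfs getconf utility (key names are from the Hadoop secure mode documentation):

```shell
# Both keys should report "privacy" on a hardened cluster: the former
# covers Hadoop RPC, the latter the DataNode block transfer path.
hdfs getconf -confKey hadoop.rpc.protection
hdfs getconf -confKey dfs.data.transfer.protection
```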
Authorization in HDFS is performed via file system permissions. Only users who need to have access in order to perform their business duties should be granted such access.
HDFS additionally supports transparent data at rest encryption via integration with the Hadoop Key Management Service.
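Transparent encryption is configured per directory as an "encryption zone". A sketch with the stock Hadoop tooling (the key name and path are hypothetical; it assumes a running Hadoop KMS):

```shell
# Create a key in the Hadoop Key Management Service, then bind it to a
# directory; files written under /secure are encrypted and decrypted
# transparently for authorized clients.
hadoop key create spark-data-key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName spark-data-key -path /secure
```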
Summary and Recommendations
Due to outdated authentication and encryption capabilities, Spark should only be executed on a trusted network where, at minimum, a man-in-the-middle attack is not possible. The recent redesign of the service-to-service communication mechanisms reduces this concern; however, as it is based on a custom protocol, there is no guarantee that a weakness will not be exposed in the near future.
Spark's security relies on proper authentication and authorization of the core Hadoop components (such as HDFS and YARN). Therefore, a proper Kerberos implementation plays a key role in protecting Spark data processing. Specifically, the SASL quality of protection should be set to the "privacy" level (auth-conf) to protect both the data and the authentication secrets while they are in transit.
As temporary files are spilled to the hard drive, encrypted partitions or disks should be used to prevent a data leak. As of Spark 2.1 this is less of a concern; however, as with the service-to-service communications, a fresh implementation could contain vulnerabilities that have not yet been identified. With persistent disk storage the concern is even greater, as it could preserve data blocks indefinitely.
If unauthorized access to the Spark UI is a concern, effort should be made to guarantee that users cannot run arbitrary code on YARN Resource Manager instances.
Upgrading Spark to version 2.0 reduces the risk related to unused and improperly authenticated endpoints being exposed on an open network.
The Spark History Server allows unauthenticated access to a completed job's status and configuration, so an effort should be made to ensure that no sensitive information is exposed via the configuration or other elements of the UI (for example, job execution errors). An additional HTTP proxy (e.g. Apache Knox) could be used to provide an authentication layer.
TLS/SSL should be enabled in addition to service-to-service authentication to provide protection for Web UI interfaces.
Spark relies on client-side configuration for some of its critical security features, such as service-to-service authentication and encryption. As a result, protocol degradation is possible, for example by modifying the default configuration used by the spark-submit client to disable service-to-service authentication altogether. Thus, the default client configuration should be made read-only on cluster machines.
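One pragmatic hardening step is to strip write permission from the shipped defaults (the demonstration below uses a scratch copy rather than a live /etc/spark/conf, since the real layout varies by distribution):

```shell
# After chmod 444, the defaults file is no longer writable by its owner;
# on ext* filesystems, chattr +i would additionally require even root to
# take an explicit extra step before modifying it (needs privileges, so
# it is only shown commented out).
conf=$(mktemp)
echo "spark.authenticate true" > "$conf"
chmod 444 "$conf"
ls -l "$conf"
# chattr +i "$conf"
```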
To summarize, Spark cannot be considered secure out of the box. Additional countermeasures should be applied to protect sensitive and valuable data.