Spark Cluster on Google Compute Engine

gce+sparkWhat is Spark and Why?

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. In the past, I’ve wrote an intro on how to install Spark on GCE and since then, I wanted to do a follow up on the topic but with more real world example of installing a cluster. Luckily to me, a reader of the blog did the work! So after I got his approval, I wanted to share with you his script.

Installing Spark Cluster

In order to install it you just need to:

1. Install gcutil and authenticate your project.

2. Open a terminal and get the git repository with the python script in it.

$ git clone
$ cd spark_gce
$ python

You will need to create a new project in the Google Developer Console before you running ‘’ and make sure to add all the params.
Here is an example: project-name slaves slave-type master-type identity-file zone cluster-name

  • project-name: One of the hardest thing in software… Choosing good names. Here we wish a good name for our cluster.
  • slave: how many machines we will have in the cluster as slaves.
  • slave-type: Instance type. For example:  n1-standard-1 
  • master-type: Instance type for the master. Choose something powerful (e.g. n1-standard-1 and above).
  • identity-file: Identity file to authenticate. It will be around: ~/.ssh/google_compute_engine once you authenticate using gcutils.
  • zone: Specify the zone where you are going to launch the cluster. For example: us-central1-a
  • cluster-name: Name the Spark cluster.

That’s it.


  1. Spark Downloads and Spark quick start

  2. Google Compute Engine in 5min



6 thoughts on “Spark Cluster on Google Compute Engine

  1. Raghu says:

    I tried to install following the instruction but I am facing some errors.The errors something like

    ERROR: (gcloud.compute.instances.create) Some requests did not succeed:
    – Invalid value ‘Test’. Values must match the following regular expression: ‘(?:(?:[-a-z0-9]{1,63}.)*(?:a-z?):)?(?:[0-9]{1,19

    Could you help me to resolve this.

    I am passing the following

    python ‘Test’ ‘3’ ‘n1-standard-4’ ‘n1-standard-4’ ‘/home/raghuram_mundru_meredith_com/.ss
    h/google_compute_engine’ ‘us-central1-c’ ‘spark-test’

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s