Apache Spark is an open source cluster computing system that aims to make data analytics fast, both to run and to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce. A while back I wrote an intro on how to install Spark on GCE, and since then I have wanted to follow up on the topic with a more real-world example of installing a cluster. Luckily for me, a reader of the blog did the work! After getting his approval, I wanted to share his script with you.
Installing a Spark Cluster
In order to install it, you just need to:
1. Install gcutil and authenticate your project (a minimal sketch of the authentication flow follows these commands).
2. Open a terminal and clone the git repository containing the Python script:
$ git clone https://github.com/sigmoidanalytics/spark_gce.git
$ cd spark_gce
$ python spark_gce.py
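For step 1, the exact flow depends on your setup, but here is a minimal sketch of the gcutil side, assuming gcutil is already installed and on your PATH. The project ID below is hypothetical; replace it with your own:

# Running any gcutil command for the first time opens the OAuth
# consent flow in your browser; --project is your project ID.
$ gcutil --project=my-gce-project listinstances

# If ~/.ssh/google_compute_engine does not exist yet, gcutil generates
# it the first time you SSH into an instance; you can also create the
# key pair yourself:
$ ssh-keygen -t rsa -f ~/.ssh/google_compute_engine

This is the same key pair you will later pass to the script as the identity-file parameter.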
You will need to create a new project in the Google Developers Console before running ‘spark_gce.py’, and make sure to pass all of the parameters.
Here is the usage (a filled-in example follows the parameter list below):
spark_gce.py project-name slaves slave-type master-type identity-file zone cluster-name
- project-name: One of the hardest things in software is choosing good names. Here, though, it must match the name of the project you created in the Google Developers Console.
- slaves: How many slave machines the cluster will have.
- slave-type: Instance type for the slaves. For example: n1-standard-1
- master-type: Instance type for the master. Choose something powerful (e.g. n1-standard-1 and above).
- identity-file: Identity file used for SSH authentication. It will be at ~/.ssh/google_compute_engine once you authenticate using gcutil.
- zone: Specify the zone where you are going to launch the cluster. For example: us-central1-a
- cluster-name: Name the Spark cluster.
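Putting it all together, here is what a full invocation might look like. Every value below (project ID, slave count, machine types, zone, and cluster name) is a placeholder; substitute your own:

$ python spark_gce.py my-gce-project 4 n1-standard-1 n1-standard-2 ~/.ssh/google_compute_engine us-central1-a my-spark-cluster

Once the script finishes, a quick sanity check is to open the Spark standalone master's web UI, which typically listens on port 8080 of the master instance, and verify that all the slaves have registered.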