v0.9.4-SNAPSHOT

GeoWave Quickstart Guide

What you will need

Creating the Cluster

We will be using the GeoWave bootstrap script to provision our cluster. Then we will walk through the CLI commands to download, ingest, analyze and visualize the data.

GeoWave currently supports the use of either Accumulo or HBase, so the version of the bootstrap script you use will be dependent upon which system you want to use as your datastore.

  • For Accumulo use: s3.amazonaws.com/geowave/latest/scripts/emr/accumulo/bootstrap-geowave.sh

  • For HBase use: s3.amazonaws.com/geowave/latest/scripts/emr/hbase/bootstrap-geowave.sh

We have also provided scripts that will perform all of the steps automatically. This will allow you to verify your own steps, or test out other geowave commands and features on an already conditioned data set.

If you would prefer to have all of the steps run automatically, please use these bootstrap scripts instead of the ones listed previously:

  • For Accumulo use: s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/accumulo/bootstrap-geowave.sh

  • For HBase use: s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/hbase/bootstrap-geowave.sh

AWS CLI Method

This is the basic form of the command you will call to create your GeoWave test cluster. All variables, designated as ${VARIABLES}, will need to be replaced with your individual path, group, value, etc. An explanation of each variable is given below the command.

aws emr create-cluster \
--name ${CLUSTER_NAME} \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=${NUM_WORKERS},InstanceType=m4.xlarge \
--ec2-attributes "KeyName=${YOUR_KEYNAME},SubnetId=${YOUR_SUBNET_ID},EmrManagedMasterSecurityGroup=${YOUR_SECURITY_GROUP},EmrManagedSlaveSecurityGroup=${YOUR_SECURITY_GROUP}" \
--release-label ${EMR_VERSION} \
--applications Name=Hadoop Name=HBase \
--use-default-roles \
--no-auto-terminate \
--bootstrap-actions Path=s3.amazonaws.com/geowave/latest/scripts/emr/${DATASTORE}/bootstrap-geowave.sh,Name=Bootstrap_GeoWave \
--tags ${YOUR_TAGNAME} \
--region ${YOUR_REGION}
  • ${CLUSTER_NAME} - The name you want to show up in the Cluster list in AWS

    • Example: “geowave-guide-cluster”

  • ${NUM_WORKERS} - The number of core/worker nodes you want

    • You will be working with a relatively small amount of data in this walkthrough, so we recommend using two

  • ${YOUR_KEYNAME} - The name of the key pair you want to use for this cluster

    • Example: geowave-guide-keypair

    • If you have not created a keypair for this cluster please follow the steps here.

  • ${YOUR_SUBNET_ID} - The subnet id linked with your security group(s)

    • Example: subnet-bc123123

    • If you are unsure of which subnet to use please see the VPC (network interface/subnet id) section here.

  • ${YOUR_SECURITY_GROUP} - This is the security group(s) you want the cluster to be assigned to.

    • Example: sg-1a123456

    • If your AWS EMR account has default security groups set up, you can leave EmrManagedMasterSecurityGroup and EmrManagedSlaveSecurityGroup out of --ec2-attributes

    • If you are unsure of which groups to use here please see the EC2 Security Group section here.

  • ${EMR_VERSION} - The version of EMR that you want to use for your cluster

    • Example: emr-5.1.0

    • GeoWave version 0.9.3 supports up to EMR version 5.2.0

  • ${DATASTORE} - The datastore you want to use. GeoWave currently supports Accumulo and HBase.

    • Example: accumulo

  • ${YOUR_TAGNAME} - Tag name for the cluster you are creating

    • Example: “geowave-guide”

    • The --tags option is optional, but may help you search for this cluster if there are many on the AWS account you are using

  • ${YOUR_REGION} - Your aws region

    • Example: “us-east-1”
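
As a rough illustration, the placeholders can be set as shell variables before the command is run. The values below are the example values from the list above (with two workers, as recommended) and should be replaced with your own. This sketch only assembles and prints the command so it can be reviewed before it is run.

```shell
# Example values taken from the list above; replace each with your own.
CLUSTER_NAME="geowave-guide-cluster"
NUM_WORKERS=2
YOUR_KEYNAME="geowave-guide-keypair"
YOUR_SUBNET_ID="subnet-bc123123"
YOUR_SECURITY_GROUP="sg-1a123456"
EMR_VERSION="emr-5.1.0"
DATASTORE="accumulo"
YOUR_TAGNAME="geowave-guide"
YOUR_REGION="us-east-1"

# Assemble the command as text so it can be inspected before it is run.
CREATE_CMD="aws emr create-cluster \
--name ${CLUSTER_NAME} \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=${NUM_WORKERS},InstanceType=m4.xlarge \
--ec2-attributes KeyName=${YOUR_KEYNAME},SubnetId=${YOUR_SUBNET_ID},EmrManagedMasterSecurityGroup=${YOUR_SECURITY_GROUP},EmrManagedSlaveSecurityGroup=${YOUR_SECURITY_GROUP} \
--release-label ${EMR_VERSION} \
--applications Name=Hadoop Name=HBase \
--use-default-roles \
--no-auto-terminate \
--bootstrap-actions Path=s3.amazonaws.com/geowave/latest/scripts/emr/${DATASTORE}/bootstrap-geowave.sh,Name=Bootstrap_GeoWave \
--tags ${YOUR_TAGNAME} \
--region ${YOUR_REGION}"

echo "${CREATE_CMD}"
```

Once the printed command looks right, it can be pasted into the terminal to create the cluster.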

If your create-cluster command was successful it will return the ClusterId of your cluster, otherwise you will receive a message detailing why the command failed.

For more information on the create-cluster command please see the amazon documentation here.

The return of a ClusterId only verifies that aws understood your command and has begun setting up the desired cluster. There are many things that could still go wrong and cause the cluster to fail. You can open the AWS EMR GUI to follow the progress of your cluster’s creation.
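
Cluster progress can also be checked from the command line. The following is a minimal sketch, assuming the AWS CLI is configured with credentials for the account. The helper names cluster_state and classify_state are illustrative and not part of GeoWave or the AWS CLI.

```shell
# Query the current state of a cluster by its ClusterId.
# Needs AWS credentials when it is actually called.
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}

# Classify an EMR state string:
#   returns 0 if the cluster is up, 1 if it is still starting,
#   2 for anything else (terminating, terminated, failed).
classify_state() {
  case "$1" in
    WAITING|RUNNING) return 0 ;;
    STARTING|BOOTSTRAPPING) return 1 ;;
    *) return 2 ;;
  esac
}

# Local demonstration that needs no AWS access.
for state in STARTING WAITING TERMINATED_WITH_ERRORS; do
  rc=0
  classify_state "$state" || rc=$?
  echo "$state -> $rc"
done
```

With a real cluster, the two helpers can be combined by passing the output of cluster_state for your ClusterId to classify_state.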

Please view the Running The Steps section of this document for a walkthrough of downloading, ingesting, analyzing and visualizing data with GeoWave.

If you used the quickstart version of the bootstrap script, the script will now set up the environment, then download and process one month of GDELT data.

The entire process takes approximately 25 minutes on a three-node cluster.

Please see the Interacting with the Cluster section of this document to see how the data can be visualized.

AWS GUI Method

Log in to AWS and select EMR from the Services dropdown menu.

Select the “Create cluster” button in the top left side of the page. Once the Create Cluster application opens select the “Go to advanced options” link at the top of the page.

Step 1:

Software Configuration

  • Vendor

    • Select Amazon.

  • Release

    • Select emr-5.1.0 from the dropdown list (older versions of GeoWave may not support all functions on newer versions of EMR)

    • Ensure Hadoop is selected

    • If you are using HBase you will need to select it here

    • It won’t hurt to have other software selected as well, but they aren’t needed for this guide

  • Edit software settings

    • Don’t touch anything here

Add Steps

  • We won’t be adding any steps for this quickstart guide

---

Step 2:

Hardware Configuration

  • Network

    • Select your VPC

    • If you haven’t set up a VPC please see the Create EC2 VPC section here.

  • EC2 Subnet

    • Select the subnet (or one of the subnets) associated with your VPC

  • Master

    • Select m4.xlarge from the EC2 instance type dropdown list

  • Core

    • Select m4.xlarge from the EC2 instance type dropdown list

    • Select 2 for the Instance count

  • Task

    • We won’t be using a task node in this walkthrough so leave the instance count at 0

---

You can request spot instances here to save money.

Step 3:

General Options

  • Cluster name

    • Enter the desired name for your cluster

    • Cluster names do not have to be unique

  • Logging

    • Leave selected

    • Click on the folder icon and select your bucket

  • Debugging

    • Leave selected

  • Termination Protection

    • Leave selected

  • Tags

    • Enter a tag name for your cluster

    • This is completely optional, but may make it easier to search for your cluster later on

Additional Options

  • EMRFS consistent view

    • Leave unselected

  • Bootstrap Actions: Expand the Bootstrap Actions section

    • Select Custom action from the Add bootstrap action drop down list

    • Click the “Configure and add” button

  • Name

    • Enter a name for the custom action

    • This can be left as the default value of “Custom action”

  • Script location

    • Enter the location of your desired bootstrap script

      • For Accumulo use: s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/accumulo/bootstrap-geowave.sh

      • For HBase use: s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/hbase/bootstrap-geowave.sh

      • The URLs above are the quickstart versions; the standard versions listed in the Creating the Cluster section can be used here as well

    • If you have chosen to use your own bucket to host the bootstrap script, you can click on the folder icon to bring up a list of your available buckets and choose a script from there.

  • Click the “Add” button

---

Step 4:

Security Options

  • EC2 key pair

    • Select your key pair for this cluster

    • If you haven’t created a key pair please see the Create EC2 Key Pair section here.

  • Cluster visible to all IAM users in account

    • Leave selected

  • Permissions

    • Leave “Default” selected

  • Expand the EC2 Security Groups section

    • Master: select your security group for the master node

    • Core & Task: select your security group for the core nodes

    • If you haven’t created a security group yet please see the Create EC2 Security Group section here.

---

Click the “Create Cluster” button to create and provision your cluster.

Please view the Running The Steps section of this document for a walkthrough of downloading, ingesting, analyzing and visualizing data with GeoWave.

If you used the quickstart version of the bootstrap script, the script will now set up the environment, then download and process one month of GDELT data.

The entire process takes approximately 25 minutes on a three-node cluster.

Please see the Interacting with the Cluster section of this document to see how the data can be visualized.

Running The Steps

Connecting to the Cluster

Once your cluster is running and bootstrapped, ssh into the cluster.

Go to the Cluster List (“Services” dropdown, select EMR) and click on the cluster you created. You will use the “Master public DNS” value as your hostname and the security key you assigned to the cluster to access it.

If you are unsure of how to do this, click on the blue SSH link to the right of your Master public DNS to open a popup that will walk you through it.

The cluster status may show as waiting before the bootstrap script has completed. Please allow 5-7 minutes for the cluster to be set up and bootstrapped. This may take longer if you are using spot instances.
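
For reference, a sketch of the SSH command is shown below. It assumes the private key was saved as geowave-guide-keypair.pem in your home directory (a hypothetical location) and uses the example DNS name that appears later in this guide. The login user on an EMR master node is hadoop. The command is printed rather than executed.

```shell
KEY_FILE="$HOME/geowave-guide-keypair.pem"              # hypothetical key location
MASTER_DNS="ec2-52-91-215-215.compute-1.amazonaws.com"  # example value; use your Master public DNS

# Build the command as text; run it directly once the values are correct.
SSH_CMD="ssh -i $KEY_FILE hadoop@$MASTER_DNS"
echo "$SSH_CMD"
```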

Download and Source Files

Next you will need to download a few files that we will use later in this guide.

cd /mnt
sudo wget s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/geowave-env.sh
sudo wget s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/KDEColorMap.sld
sudo wget s3.amazonaws.com/geowave/latest/scripts/emr/quickstart/SubsamplePoints.sld

Note: It is recommended to use the /mnt directory for this guide.

The geowave-env.sh script has a number of predefined variables that we will use in the other commands, so we will source it here.

source /mnt/geowave-env.sh
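
After sourcing the script, it can be useful to confirm that the variables used by the later commands are actually set. The helper below is a small sketch, not part of GeoWave; the variable names are the ones that appear in the commands of this guide, and the assumption that geowave-env.sh defines all of them is ours.

```shell
# Print the names of any listed variables that are unset or empty.
check_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  echo "${missing# }"
}

# An empty line of output means every variable is set.
check_vars STAGING_DIR TIME_REGEX NUM_PARTITIONS WEST SOUTH EAST NORTH
```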

Download GDELT Data

We will be using data from the GDELT Project in this guide. For more information about the GDELT Project please visit their website here.

Download the GDELT data that matches $TIME_REGEX. Sourcing the geowave-env.sh script sets this to 201602 for the example.

sudo mkdir $STAGING_DIR/gdelt;cd $STAGING_DIR/gdelt
sudo wget http://data.gdeltproject.org/events/md5sums
for file in `cat md5sums | cut -d' ' -f3 | grep "^${TIME_REGEX}"` ; \
do sudo wget http://data.gdeltproject.org/events/$file ; done
md5sum -c md5sums 2>&1 | grep "^${TIME_REGEX}"
cd $STAGING_DIR

You can verify that this script worked by viewing the contents of the newly created /mnt/gdelt/ directory.
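
The file selection used in the download loop can be tried on its own without any network access. In this sketch the checksum lines are made up for illustration (shortened hashes, with file names following a date-based pattern); only the cut and grep pipeline is taken from the commands above. Each line of an md5sum listing has the hash, two spaces, then the file name, which is why the file name is the third space-delimited field.

```shell
# Made-up sample of an md5sums listing: "<hash>  <file name>".
sample='aaaa1111  20160131.export.CSV.zip
bbbb2222  20160201.export.CSV.zip
cccc3333  20160202.export.CSV.zip
dddd4444  20160301.export.CSV.zip'

TIME_REGEX=201602

# Same selection as in the download loop: take field 3, keep matching names.
matches=$(printf '%s\n' "$sample" | cut -d' ' -f3 | grep "^${TIME_REGEX}")
echo "$matches"
```

Only the two file names that begin with 201602 are selected; the January and March entries are skipped.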

Config and Ingest

Add a GeoWave store (Accumulo).

geowave config addstore gdelt --gwNamespace geowave.gdelt \
-t accumulo --zookeeper $HOSTNAME:2181 --instance $ACCUMULO_INSTANCE --user geowave --password geowave

Add a GeoWave store (HBase).

geowave config addstore gdelt --gwNamespace geowave.gdelt \
-t hbase --zookeeper $HOSTNAME:2181

Add a spatial index.

geowave config addindex -t spatial gdelt-spatial --partitionStrategy round_robin --numPartitions $NUM_PARTITIONS

Ingest the data into geowave.

geowave ingest localtogw $STAGING_DIR/gdelt gdelt gdelt-spatial -f gdelt \
--gdelt.cql "BBOX(geometry,${WEST},${SOUTH},${EAST},${NORTH})"

The ingest should take ~5 minutes.
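
The --gdelt.cql option above restricts the ingest to a bounding box built from four variables. The sketch below shows how the filter string expands. The coordinate values are hypothetical placeholders; the real values are defined in geowave-env.sh.

```shell
# Hypothetical bounding box; the real values come from geowave-env.sh.
WEST=-10
SOUTH=35
EAST=20
NORTH=60

# Sanity check for this sketch. The test uses integer comparison,
# so decimal coordinates would need a different check.
if [ "$WEST" -lt "$EAST" ] && [ "$SOUTH" -lt "$NORTH" ]; then
  CQL="BBOX(geometry,${WEST},${SOUTH},${EAST},${NORTH})"
  echo "$CQL"
else
  echo "invalid bounding box" >&2
fi
```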

Kernel Density Estimate (KDE)

Once the ingest has completed, add another store for the kde.

Add a GeoWave store (Accumulo).

geowave config addstore gdelt-kde --gwNamespace geowave.kde_gdelt \
-t accumulo --zookeeper $HOSTNAME:2181 --instance $ACCUMULO_INSTANCE --user geowave --password geowave

Add a GeoWave store (HBase).

geowave config addstore gdelt-kde --gwNamespace geowave.kde_gdelt \
-t hbase --zookeeper $HOSTNAME:2181

Run the KDE analytic.

hadoop jar ${GEOWAVE_TOOLS_JAR} analytic kde --featureType gdeltevent --minLevel 5 --maxLevel 26 \
--minSplits $NUM_PARTITIONS --maxSplits $NUM_PARTITIONS --coverageName gdeltevent_kde \
--hdfsHostPort ${HOSTNAME}:${HDFS_PORT} --jobSubmissionHostPort ${HOSTNAME}:${RESOURCE_MAN_PORT} --tileSize 1 gdelt gdelt-kde

Integrate with GeoServer

Once the data has been ingested and the KDE has completed you can set up GeoServer to display it.

Configure the local host.

geowave config geoserver --url "$HOSTNAME:8000"

Add layers for the point and kde representations of the data.

geowave gs addlayer gdelt
geowave gs addlayer gdelt-kde

Add the KDEColorMap and SubsamplePoints styles.

geowave gs addstyle kdecolormap -sld /mnt/KDEColorMap.sld
geowave gs addstyle SubsamplePoints -sld /mnt/SubsamplePoints.sld

Set the kde layer default style to kdecolormap

geowave gs setls gdeltevent_kde --styleName kdecolormap

Interacting with the Cluster

Enable Web Connections

Go to the Cluster List (“Services” dropdown, select EMR) and click on the cluster you created. Use the “Master public DNS” value as your hostname and the security key you assigned to the cluster to enable the web connection.

If you are unfamiliar with how to do this, click on the “Enable Web Connection” link for detailed instructions on how to enable the web connection for Linux or Windows.
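
One common approach described in the AWS documentation is an SSH tunnel with dynamic port forwarding, combined with a SOCKS proxy setting in the browser. The sketch below only prints such a tunnel command. The key location is hypothetical, the DNS name is the example value used in this guide, and the local port is an arbitrary choice.

```shell
KEY_FILE="$HOME/geowave-guide-keypair.pem"              # hypothetical key location
MASTER_DNS="ec2-52-91-215-215.compute-1.amazonaws.com"  # example value; use your Master public DNS
LOCAL_PORT=8157                                         # local SOCKS port; any free port works

# -N: do not run a remote command, -D: dynamic (SOCKS) port forwarding.
TUNNEL_CMD="ssh -i $KEY_FILE -N -D $LOCAL_PORT hadoop@$MASTER_DNS"
echo "$TUNNEL_CMD"
```

The browser then needs to be pointed at the local SOCKS port, as explained in the instructions behind the link.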

Accumulo Overview

You can follow the progress of the data ingest and scan (KDE) performed by the cluster on the Accumulo web server.

Open a new tab in your web browser and enter the Master public DNS of your cluster followed by :50095

  • Example: ec2-52-91-215-215.compute-1.amazonaws.com:50095

You should see the Accumulo overview page.

GeoServer

Open a new tab in your web browser and enter the Master public DNS of your cluster followed by :8000/geoserver/web/

  • Example: ec2-52-91-215-215.compute-1.amazonaws.com:8000/geoserver/web/

Log into Geoserver

  • Username: admin

  • Password: geoserver

Once the bootstrap-geowave.sh script is finished you will see two layers have been created. To view them click on the “Layer Preview” link under the Data menu on the left side of the page.

Click the OpenLayers link for either one to view it in another tab.

gdeltevent - shows all of the GDELT events in a bounding box around Western Europe as individual points.

You may have noticed that it took a fair amount of time to render the ~1.5 million points. To speed this process up we can set the default style to the Subsample Points style that we (or the quickstart version of the bootstrap script) downloaded previously. The style can be found in the geowave directory at geowave/examples/example-slds/SubsamplePoints.sld and can also be downloaded here.

If you haven’t added the Subsample Points style into GeoServer yet please see the Integrate with Geoserver section.

This can be done using the geowave cli commands or via the geoserver GUI.

Geowave CLI:

geowave gs setls gdeltevent --styleName SubsamplePoints

Geoserver GUI:

  • Click on the Layers link in the menu at the left side of the page and select the gdeltevent layer

  • Select the Publishing tab, open the Default Style dropdown and select SubsamplePoints

  • Click the Save button at the bottom of the page and reopen the image by going back to the Layer Preview and clicking the OpenLayers link

  • You should see a noticeable difference in the time it takes to render the points

gdeltevent_kde - a heat map produced using kernel density estimation in a bounding box around Western Europe.

Quickstart Bootstrap Script Breakdown

The quickstart bootstrap scripts listed as an option in this tutorial perform a few steps and run a number of other scripts to set up the environment, download the data, ingest the data, run the KDE and set up the layers for GeoServer. This section gives a basic breakdown of each script. All scripts can be found in the geowave s3 bucket in the geowave/latest/scripts/emr/quickstart/ directory.

  • bootstrap-geowave.sh

    • This is the main script and has five major steps:

      • Downloads and sources the other scripts

      • Delays the rest of the script until EMR is done setting up the desired environment

      • Configures zookeeper and accumulo

      • Runs the install_geowave and setup-geowave scripts

      • Initializes all volumes

  • geowave-install-lib.sh

    • This script is a group of methods that are called by the bootstrap-geowave script. It contains the majority of the actual code that will be run.

  • geowave-env.sh

    • Defines variables (port numbers, timeframe, bounding box, versions, etc.) for the other scripts.

  • ingest-and-kde-gdelt.sh

    • Creates an Accumulo user and namespace, downloads the GDELT dataset defined in the geowave-env script, ingests that data, and runs a KDE on the data. It also calls the setup-geoserver-geowave-workspace script. This is a good script to look through if you want to see the commands used to perform these actions.

  • setup-geoserver-geowave-workspace.sh

    • Uses the geowave CLI commands and the styles downloaded by the script to set up your GeoServer workspace, stores and layers. This can also be done by the user through the GeoServer GUI.
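
The step in bootstrap-geowave.sh that delays the script until EMR has finished its setup follows a familiar pattern: retry a check until it succeeds. The sketch below is a generic illustration of that pattern and is not taken from the actual script. The function name wait_until and the check used in the demonstration are hypothetical.

```shell
# Retry a check command until it succeeds or the attempts run out.
#   $1 = maximum attempts, $2 = seconds between attempts, rest = command to run
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Demonstration with a check that succeeds immediately.
wait_until 3 0 test -d / && echo "environment ready"
```

In a real bootstrap script the check would test for something the environment setup produces, such as a directory or a running service.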

Appendices

Version

This documentation was generated for GeoWave version 0.9.4-SNAPSHOT from commit 69120eaa2d725711bddb164d0de2913c40868ba1.

Create EC2 VPC (Network Interface/Subnet Id)

From the “Services” dropdown, select VPC. Then click on the “Start VPC Wizard” button.

The default VPC setup is VPC with a single public subnet. This is what we will use for the example here, however other VPC setups will work as well.

You can use the default values for everything in this step and create a useable VPC. We recommend that you add a VPC name and change the default Subnet name to make them both easier to identify later on.

Click the “Create VPC” button and after a short period of time you will receive a confirmation of your VPC creation.

Click the “Subnets” link on the left side of the page and find your new subnet.

Record the Subnet ID. You will need it if you are using the AWS CLI method to create your cluster.

For a more detailed walkthrough of creating an AWS VPC please see the Amazon documentation here.

Create AWS S3 Bucket

From the “Services” dropdown, select S3 then click the “Create Bucket” button.

Enter your desired name for the bucket, select your region and click the “Create” button.

For more detailed information on creating and using S3 buckets please see the Amazon documentation here.

Create EC2 Key Pair

From the “Services” dropdown, select EC2. Then select the “Key Pairs” link on the left side of the page and click the “Create Key Pair” button.

Ensure that your selected region (top right side of the page) is the same as the one you will be creating your cluster in. Key pairs cannot be used across regions.

Enter a name for the key pair in the popup and click the “Create” button.

When you create the key pair Amazon will automatically begin to download your private key. Save this somewhere you will remember, because you will need it to ssh into your cluster.

For more detailed information on AWS EC2 Key Pairs please see the Amazon documentation here.
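
A key pair can also be created from the command line. The following is a minimal sketch, assuming the AWS CLI is configured; the function name create_keypair is illustrative. It saves the private key next to where it is run and restricts the file permissions, since ssh refuses private keys that are readable by other users.

```shell
# Create a key pair and save the private key as <name>.pem.
#   $1 = key pair name, $2 = AWS region
# Needs AWS credentials when it is actually called.
create_keypair() {
  aws ec2 create-key-pair --key-name "$1" --region "$2" \
    --query 'KeyMaterial' --output text > "$1.pem"
  # Make the key readable by the owner only.
  chmod 400 "$1.pem"
}
```

For example, calling create_keypair with the arguments geowave-guide-keypair and us-east-1 would produce geowave-guide-keypair.pem in the current directory.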

Create EC2 Security Group

From the “Services” dropdown, select EC2. Then select the “Security Groups” link on the left side of the page and click the “Create Security Group” button.

Enter a name for the security group, a description (if desired) and select the VPC to associate this security group with.

If you haven’t created a VPC please see the Create EC2 VPC section.

Click the “Create” button to create your security group.

Select your security group from the list. Click on the “Inbound” tab towards the bottom of the page and click the “Edit” button.

In the popup window, select SSH from the “Type” drop down, Anywhere from the “Source” drop down, then click the “Save” button.

For more detailed information on AWS EC2 Security Groups please see the Amazon documentation here.
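
The same security group setup can be sketched with the AWS CLI. The group name and VPC ID below are hypothetical, and the group ID is the example value used earlier in this guide. The commands are printed rather than executed. The 0.0.0.0/0 range corresponds to the Anywhere source selected in the GUI steps.

```shell
SG_NAME="geowave-guide-sg"   # hypothetical group name
VPC_ID="vpc-1a2b3c4d"        # hypothetical VPC ID
SG_ID="sg-1a123456"          # example ID; the create command returns the real one

# Step 1: create the group inside the VPC.
CREATE_SG_CMD="aws ec2 create-security-group --group-name $SG_NAME --description geowave-guide --vpc-id $VPC_ID"
echo "$CREATE_SG_CMD"

# Step 2: allow inbound SSH (TCP port 22) from any address.
SSH_RULE_CMD="aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 22 --cidr 0.0.0.0/0"
echo "$SSH_RULE_CMD"
```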

AWS CLI Setup

Please see the Amazon documentation here.