HStreaming Console Overview¶
HStreaming Cloud is a web front end for managing real-time and batch-processing Hadoop jobs as a cloud service. The HStreaming Console is an easy-to-use platform for launching HStreaming technology hosted on Amazon Web Services.
The console allows you to launch jobs on freshly instantiated clusters, to add new jobs to an existing cluster, and to terminate jobs or clusters. Jobs can be native Hadoop jobs running directly on HStreaming’s distribution of Hadoop, or Pig scripts that are interpreted and executed by HStreaming’s enhanced version of Apache Pig. HStreaming’s distribution of Pig is fully backward-compatible with the original Apache Pig version; it contains enhancements that allow you to process streaming data on both the input and the output side of your job (as described below).
The console keeps all jobs in a Job Library, which can be accessed via the navigation menu on the left side. The navigation menu also lists all HStreaming job clusters currently running on your HStreaming Cloud platform, as well as a job history that allows you to easily re-run jobs you have executed in the past.
Launching an Example Pig Script from the Library¶
The following example walks through the steps to launch the Twitter Wordcount script, which counts the words of the Twitter sample stream.
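To give a sense of what such a script looks like, the sketch below expresses the core wordcount logic in standard Pig Latin. It is illustrative only: the actual library script uses HStreaming's streaming load and store functions to read the Twitter sample stream and to write each batch to S3, and the plain paths shown here are placeholders.

    -- Minimal wordcount sketch in standard Pig Latin (illustrative only);
    -- the library script would use HStreaming's streaming connectors in
    -- the LOAD and STORE steps instead of the plain paths shown here.
    tweets  = LOAD 'input/tweets.txt' AS (text:chararray);
    words   = FOREACH tweets GENERATE FLATTEN(TOKENIZE(text)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO 's3://mybucket/output';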
The first step is selecting the region you would like to launch your cluster in; choose the desired region from the drop-down menu at the top left of the console page.
Afterward, click Create new to launch a new job on a new cluster. A dialog box will appear, which guides you through the process of configuring the job and the cluster.
Step 1 will ask you to choose and configure the job you would like to run. Under New Job Name, give the job a meaningful name; in this example, we will name it TwitterWordcount. Right below the name field, you can choose the job you would like to run: Run your own new application lets the cluster run a custom Pig script or Hadoop job from an accessible location (for example, an HTTPS or S3 URL). Run an application from a library lets you choose a Pig script from our template and example library. Run a job from your history lets you reload the configuration of a job that you have previously launched.
For this example, we will choose Twitter Wordcount from Twitter -- S3 Output, which will count the words of the public Twitter sample stream and place the output into S3. Once selected, the dialog box will automatically load and show the default configuration values on the right side. Then click Continue to proceed to Step 2.
Step 2 will ask you to enter the specific job parameters, which may vary among individual jobs. For the Twitter example, you will need to enter your Twitter credentials to allow the job to connect to the Twitter streaming API; enter them under Twitter Username and Twitter Password. Then, under S3 Output URL, enter the S3 location where you would like to store the stream output, for example, s3://mybucket/output. Also, enter the desired Batch Interval in milliseconds and the desired Batch Window Size counted in batches. For this example, we will use the default values of 1000 and 1, respectively. Click Continue to proceed to Step 3.
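As an aside, parameters entered in this step typically reach a Pig script through Pig's standard parameter substitution. The sketch below shows how the example values might be referenced inside a script; the parameter names ($S3_OUTPUT, $BATCH_INTERVAL, $WINDOW_SIZE) are illustrative assumptions, not the names the library script actually uses.

    -- Hypothetical parameter names for illustration; the library script's
    -- real names may differ.
    %default S3_OUTPUT      's3://mybucket/output'
    %default BATCH_INTERVAL '1000'
    %default WINDOW_SIZE    '1'

    -- In the real script the batch parameters would presumably be consumed
    -- by HStreaming's streaming load/store functions; here only the output
    -- location is used, so the sketch stays valid as plain Pig.
    tweets = LOAD 'input/tweets.txt' AS (text:chararray);
    STORE tweets INTO '$S3_OUTPUT';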
Step 3 will ask you to define the cluster you would like to launch your job on. If you already have one or more clusters running, you may choose to add the job to one of those under Add to your currently running EC2 cluster. For this example, let us assume you would like to launch a new cluster. You can do so either by configuring a cluster from scratch under Launch a new EC2 cluster, or by selecting an existing specification under Use a cluster specification from your history. For this example, choose to configure from scratch. The dialog box will then ask for a set of configuration parameters for the cluster, such as the name, the desired number and type of instances, and the EC2 key pair (this will later allow you to log into the cluster nodes via SSH). Finally, you may choose to have the Hadoop log files stored in S3 under a given S3 path. For this example, we will use the default values. Click Continue to proceed to Step 4.
Step 4 will show a summary of the job and cluster and allow you to review their configuration settings. Once you have reviewed them, click Create JobFlow to launch the cluster.
Once you click Create JobFlow, EC2 machine instances will be launched on your behalf using the provided AWS credentials. The dialog box will close and you will see the newly created cluster at the top left of the console page under Currently running. Clicking the cluster will show its status and information about the jobs running on it. Among other information, you will find the public DNS name of the master node, which allows you to log directly into the Hadoop cluster via SSH.
For this example, however, this will not be necessary; once the wordcount script is running, it will place a file named output-part-r00000 in the S3 bucket, which is overwritten every second with the latest word counts.
When you are done with the job, select it from the cluster list at the top left of the console page and click Terminate in the main window to terminate the job. Similarly, to shut down the cluster, select it from the same list and click Terminate in the main window.