From f9e33e7526915a95ba9fdb5617bb26b55f3debca Mon Sep 17 00:00:00 2001 From: DennisDawson Date: Wed, 4 Mar 2015 15:13:18 -0800 Subject: [PATCH 1/5] Include use of CLI flume-config. --- tutorials/flume-capture-events.md | 179 ++++++++++++++++++++++++++++++ 1 file changed, 179 insertions(+) create mode 100644 tutorials/flume-capture-events.md diff --git a/tutorials/flume-capture-events.md b/tutorials/flume-capture-events.md new file mode 100644 index 0000000..cba95f0 --- /dev/null +++ b/tutorials/flume-capture-events.md @@ -0,0 +1,179 @@ +--- +layout: page +title: Capturing Events with Flume +--- + +Once you have an [Events dataset][events], you can create a web application that captures session events. + +This example shows how you can send log information via Flume to your Hadoop database using a JSP and custom servlet running on Tomcat. + +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +## Configuring Flume + +These are the steps to configure Flume to channel log information directly to the `events` dataset. + +1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`. +1. Copy the results. +1. Open Cloudera Manager. +1. Under __Status__, click the link to __Flume__. +1. Choose the __Configuration__ tab. +1. Click __Agent Base Group__. +1. Right-click the Configuration File text area and choose __Select All__. +1. Right-click the Configuration File text area and choose __Paste__. +1. Click __Save Changes__. +1. From the __Actions__ menu, choose __Restart__, and confirm the action. + +Flume is configured to receive events from Log4j and record them in the `events` dataset. + +## Creating Web Application Pages + +These JSP and servlet examples let you create message events to be captured by Flume. + +## index.jsp + +The default landing page for the web application is `index.jsp`. It defines a form with fields for an arbitrary User ID and a message. The __Send__ button submits the input values to the Tomcat server. 
+ +```JSP + + + Kite Example + + +

Kite Example

+
+ User ID: + Message: + +
+ + +``` + +## LoggingServlet + +When you submit a message from the JSP, the LoggingServlet receives and processes the request. The following is mostly standard servlet code, with some notes about application-specific snippets. + +```Java +package org.kitesdk.examples.demo; +``` + +The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the avro-maven-plugin runs before the compile phase. Any `.avsc` file in the `/main/avro` folder is defined as a Java class. The autogenerated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields. + +```Java + +import org.kitesdk.data.event.StandardEvent; +import java.io.IOException; +import java.io.PrintWriter; +import javax.servlet.ServletException; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +``` + +This example sends Log4j messages directly to the Hive data sink via Flume. + +```Java +import org.apache.log4j.Logger; + +public class LoggingServlet extends HttpServlet { + + private final Logger logger = Logger.getLogger(LoggingServlet.class); + + @Override + protected void doGet(HttpServletRequest request, HttpServletResponse + response) throws ServletException, IOException { + + response.setContentType("text/html"); +``` + +Create a PrintWriter instance to write the response page. + +```Java + PrintWriter pw = response.getWriter(); + + pw.println(""); + pw.println("Kite Example"); + pw.println(""); +``` + +Get the user ID and message values from the servlet request. + +```Java + String userId = request.getParameter("user_id"); + String message = request.getParameter("message"); +``` + +If there's no message, don't create a log entry. + +```Java + if (message == null) { + pw.println("

No message specified.

"); + +``` + +Otherwise, print the message at the top of the page body. + +```Java + } else { + pw.println("

Message: " + message + "

"); + +``` + +Create a new StandardEvent builder. + +```Java + StandardEvent event = StandardEvent.newBuilder() +``` +The event initiator is a user on the server. The event is a web message. These can be set as string literals, because the event initiator and event name are always the same. + +```Java + .setEventInitiator("server_user") + .setEventName("web:message") +``` + +Parse the arbitrary user ID, provided by the user, as a long integer. + +```Java + .setUserId(Long.parseLong(userId)) + +``` + +The application obtains the session ID and IP address from the request object, and creates a timestamp based on the local machine clock. + +```Java + .setSessionId(request.getSession(true).getId()) + .setIp(request.getRemoteAddr()) + .setTimestamp(System.currentTimeMillis()) +``` + +Build the StandardEvent object, then send the object to the logger with the level _info_. + +```Java + .build(); + logger.info(event); + } + pw.println("

Home

"); + pw.println(""); + } +} +``` + +## Running the Web Application + +Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the Hadoop dataset. + +1. In a terminal window, navigate to `/kite-examples/demo`. +1. Type the command `mvn install`. +1. In the terminal window, enter `mvn tomcat7:run`. +1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app]. +1. On the web form, enter any user ID and a message, and then click **Send** to create a web event. + +The Flume agent receives the events over inter-process communication (IPC), and the agent writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window. + +View the records in Hadoop using the Hue File Browser. + +[http://quickstart.cloudera:8888/filebrowser/view/tmp/data/default/events](http://quickstart.cloudera:8888/filebrowser/view/tmp/data/default/events) + +[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/ \ No newline at end of file From 16e423ce36afab49db4af82e087bea1d4a564ab8 Mon Sep 17 00:00:00 2001 From: DennisDawson Date: Thu, 5 Mar 2015 10:53:00 -0800 Subject: [PATCH 2/5] Moving files in this PR. Change prepare vm to point to install kite. 
--- tutorials/create-events-dataset.md | 108 +++++++++++++++++++++++++++++ tutorials/flume-capture-events.md | 8 +-- tutorials/generate-events.md | 69 ++++++++++++++++++ tutorials/preparing-the-vm.md | 88 +++++++++++++++++++++++ 4 files changed, 269 insertions(+), 4 deletions(-) create mode 100644 tutorials/create-events-dataset.md create mode 100644 tutorials/generate-events.md create mode 100644 tutorials/preparing-the-vm.md diff --git a/tutorials/create-events-dataset.md b/tutorials/create-events-dataset.md new file mode 100644 index 0000000..44fb834 --- /dev/null +++ b/tutorials/create-events-dataset.md @@ -0,0 +1,108 @@ +--- +layout: page +title: Creating the Events Dataset +--- + +This lesson shows you how to create a dataset suitable for storing standard event records. You define a dataset schema, a partition strategy, and a URI that specifies the storage scheme. + +## Defining the Schema + +The `standard_event.avsc` schema is self-describing, thanks to the _doc_ property for each of the fields. The fields store the `user_id` for the person who initiated the event, the user's IP address, and when the event occurred. + +### standard_event.avsc + +```JSON +{ + "name": "StandardEvent", + "namespace": "org.kitesdk.data.event", + "type": "record", + "doc": "A standard event type for logging, based on the paper 'The Unified Logging Infrastructure for Data Analytics at Twitter' by Lee et al, http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf", + "fields": [ + { + "name": "event_initiator", + "type": "string", + "doc": "Source of the event in the format {client,server}_{user,app}; for example, 'client_user'. Required." + }, + { + "name": "event_name", + "type": "string", + "doc": "A hierarchical name for the event, with parts separated by ':'. Required." + }, + { + "name": "user_id", + "type": "long", + "doc": "A unique identifier for the user. Required." + }, + { + "name": "session_id", + "type": "string", + "doc": "A unique identifier for the session. 
Required." + }, + { + "name": "ip", + "type": "string", + "doc": "The IP address of the host where the event originated. Required." + }, + { + "name": "timestamp", + "type": "long", + "doc": "The point in time when the event occurred, represented as the number of milliseconds since January 1, 1970, 00:00:00 GMT. Required." + } + ] +} +``` + +For convenience, save `standard_event.avsc` to the same directory where you installed the kite-dataset executable JAR. + +## Defining the Partition Strategy + +Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the chosen time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies]. + +The following code sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field. + +### standard_event.json + +``` +[ { + "source" : "timestamp", + "type" : "year", + "name" : "year" +}, { + "source" : "timestamp", + "type" : "month", + "name" : "month" +}, { + "source" : "timestamp", + "type" : "day", + "name" : "day" +} ] +``` + +For convenience, save `standard_event.json` to the same directory where you installed the `kite-dataset` executable JAR. + +[partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html + +## Creating the Events Dataset Using the Kite CLI + +Create the _events_ dataset using the default Hive scheme. + +To create the _events_ dataset: + +1. Open a terminal window and navigate to the directory where you saved the schema file. +1. Use the `create` command to create the dataset. + +``` +kite-dataset create events \ + --schema standard_event.avsc \ + --partition-by standard_event.json +``` + +Use Hue to look at the schema and confirm that the dataset is ready to use. 
+ +[http://quickstart.cloudera:8888/filebrowser/view//tmp/data/default/events/.metadata/schema.avsc](http://quickstart.cloudera:8888/filebrowser/view//tmp/data/default/events/.metadata/schema.avsc) + +## Next Steps + +You've created a dataset to store events captured as they happen. Now you can run a web application to create records in your new dataset. See [Capturing Events with Flume][capture-events]. + +[capture-events]:{{site.baseurl}}/tutorials/flume-capture-events.html \ No newline at end of file diff --git a/tutorials/flume-capture-events.md b/tutorials/flume-capture-events.md index cba95f0..94da64f 100644 --- a/tutorials/flume-capture-events.md +++ b/tutorials/flume-capture-events.md @@ -11,10 +11,10 @@ This example shows how you can send log information via Flume to your Hadoop dat ## Configuring Flume -These are the steps to configure Flume to channel log information directly to the `events` dataset. +These are the steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration information using the Kite command-line interface, copy the results, paste them in the Flume configuration file, and then restart Flume. 1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`. -1. Copy the results. +1. Copy the output from the terminal window. 1. Open Cloudera Manager. 1. Under __Status__, click the link to __Flume__. 1. Choose the __Configuration__ tab. @@ -24,11 +24,11 @@ These are the steps to configure Flume to channel log information directly to th 1. Click __Save Changes__. 1. From the __Actions__ menu, choose __Restart__, and confirm the action. -Flume is configured to receive events from Log4j and record them in the `events` dataset. +Flume is now configured to receive logging events and record them in the `events` dataset. ## Creating Web Application Pages -These JSP and servlet examples let you create message events to be captured by Flume. 
+These JSP and servlet examples create message events that can be captured by Flume. ## index.jsp diff --git a/tutorials/generate-events.md b/tutorials/generate-events.md new file mode 100644 index 0000000..dfb245a --- /dev/null +++ b/tutorials/generate-events.md @@ -0,0 +1,69 @@ +--- +layout: page +title: Generating Events +--- + +Kite applications work with Big Data. `GenerateEvents.java` generates 1-1.5 million random event records, a small amount of realistic Big Data you can use with Kite examples. + +Much of the class is devoted to creating random values. The two methods of interest are `run` and `generateRandomEvent`. + +The `run` method performs the following tasks: +* creates a view of the `hive:events` dataset +* creates a writer instance +* spends 36 seconds writing random events +* closes the writer, which stores the results in the `events` dataset. + +While the goal is to create random events, if they're _too_ random there won't be anything to aggregate. The `while` loop simulates a user session with random values for `sessionId`, `userId`, and `ip`. It then generates up to 25 random events for that session. + +```Java + View events = Datasets.load( + "dataset:hive:events", StandardEvent.class); + DatasetWriter writer = events.newWriter(); + try { + Utf8 sessionId = new Utf8("sessionId"); + long userId = 0; + Utf8 ip = new Utf8("ip"); + int randomEventCount = 0; + while (System.currentTimeMillis() - baseTimestamp < 36000) { + sessionId = randomSessionId(); + userId = randomUserId(); + ip = randomIp(); + randomEventCount = random.nextInt(25); + for (int i=0; i < randomEventCount; i++) { + writer.write(generateRandomEvent(sessionId, userId, ip)); + } + } + } finally { + writer.close(); + } +``` + +The `generateRandomEvent` method produces `StandardEvent` objects using random values for the event and time details. 
+ +```Java + public StandardEvent generateRandomEvent(Utf8 sessionId, long userId, Utf8 ip) { + return StandardEvent.newBuilder() + .setEventInitiator(new Utf8("client_user")) + .setEventName(randomEventName()) + .setUserId(userId) + .setSessionId(sessionId) + .setIp(ip) + .setTimestamp(randomTimestamp()) + .setEventDetails(randomEventDetails()) + .build(); + } +``` + +## Running GenerateEvents + +This example assumes that you've already created the [`hive:events` dataset][events]. + +These are the steps to run the GenerateEvents program to populate the `hive:events` dataset. + +1. In a terminal window, navigate to `/kite-examples/dataset`. +1. Enter `mvn compile`. +1. Run the Java utility with `mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.GenerateEvents"`. + +Use Hue to view the records in Hive. + +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html \ No newline at end of file diff --git a/tutorials/preparing-the-vm.md b/tutorials/preparing-the-vm.md new file mode 100644 index 0000000..4daf529 --- /dev/null +++ b/tutorials/preparing-the-vm.md @@ -0,0 +1,88 @@ +--- +layout: page +title: Preparing the Virtual Machine +--- + +Complete the following steps to run Kite example code on a Cloudera Quickstart VM. + +* Install a VirtualBox or VMWare [Cloudera QuickStart VM][getvm] version 5.2 or later. +* In that VM, run the following command from a terminal window. This command clones a local copy of Kite code examples you can build and run. + +```bash +git clone https://github.com/kite-sdk/kite-examples.git +``` + +* If you haven't already done so, download the [`kite-dataset` CLI JAR][install-cli]. + +[getvm]: http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html +[install-cli]:{{site.baseurl}}/Install-Kite.html + +## Configuring the VM + +Some Kite examples require Flume. If you use Cloudera Manager, Flume user impersonation is configured for you. If you do not use Cloudera Manager, you must enable Flume user impersonation. 
+ +### Enabling Flume User Impersonation + +Flume impersonates the dataset owner to write to your dataset, much like the Unix `sudo` utility. See [Configuring Flume's Security Properties](http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security_props.html#topic_4_2_1_unique_1). + +Add the following XML snippet to your `/etc/hadoop/conf/core-site.xml` file. + +``` +<property> + <name>hadoop.proxyuser.flume.groups</name> + <value>*</value> +</property> +<property> + <name>hadoop.proxyuser.flume.hosts</name> + <value>*</value> +</property> +``` + +Restart your NameNode by running the following command in a terminal window. + +``` +sudo service hadoop-hdfs-namenode restart +``` + +## Working with the VM + +All usernames/passwords for the VM are `cloudera`/`cloudera`. + +## Troubleshooting + +* __I can't find the VM files in VirtualBox (or VMWare).__ + * You might need to unpack the VM files. + * On Windows, install 7zip, and extract the VM files from the `.7z` file. + * On Linux or Mac: + 1. In a terminal window, navigate to where you copied the VM file. + 1. Enter the command `7zr e `. For example, to extract files for the 5.4 VirtualBox VM, you run the following command. + `7zr e cloudera-quickstart-vm-5.4-virtualbox.7z`. + * Import the extracted files to VirtualBox or VMWare. + +* __How do I open an `.ovf` file?__ + 1. Install and open [VirtualBox][vbox] on your computer. + 1. From the __File__ menu, choose __Import Appliance...__. + 1. Navigate to the `.ovf` file and open it. + +* __What is a `.vmdk` file?__ + * The `.vmdk` file is the VM disk image that accompanies an `.ovf` file. The `.ovf` file is a portable VM description. + +* __How do I open a `.vbox` file?__ + 1. Install and open [VirtualBox][vbox] on your computer. + 1. From the __Machine__ menu, choose __Add...__. + 1. Navigate to where you unpacked the `.vbox` file and select it. + 1. Click __Open__, and click __Continue__. + 1. Follow the steps in [Configuring the VM](#configuring-the-vm) to complete the installation. 
+ +* __How do I fix "VTx" errors?__ + 1. Reboot your computer and enter BIOS. + 1. Find the _Virtualization_ settings (usually under _Security_), and enable all of the virtualization options. + +* __How do I get my mouse back?__ + * If your mouse/keyboard is stuck in the VM (captured), you can usually release it by pressing the right `CTRL` key. If you don't have one (or if that didn't work), click the release key in the lower-right corner of the VirtualBox window. + +* __Other problems__ + * Using VirtualBox? Try using VMWare. + * Using VMWare? Try using VirtualBox. + +[vbox]: https://www.virtualbox.org/wiki/Downloads \ No newline at end of file From b9f115002b7eac9058aee64aa258d0c9ac75adcd Mon Sep 17 00:00:00 2001 From: DennisDawson Date: Thu, 12 Mar 2015 16:15:53 -0700 Subject: [PATCH 3/5] Release candidate review. --- tutorials/create-events-dataset.md | 43 ++++++++--------- tutorials/flume-capture-events.md | 74 +++++++++++++++++++----------- tutorials/generate-events.md | 61 ++++++++++++++---------- tutorials/preparing-the-vm.md | 25 ++++++---- 4 files changed, 122 insertions(+), 81 deletions(-) diff --git a/tutorials/create-events-dataset.md b/tutorials/create-events-dataset.md index 44fb834..e199376 100644 --- a/tutorials/create-events-dataset.md +++ b/tutorials/create-events-dataset.md @@ -2,12 +2,23 @@ layout: page title: Creating the Events Dataset --- +## Purpose -This lesson shows you how to create a dataset suitable for storing standard event records. You define a dataset schema, a partition strategy, and a URI that specifies the storage scheme. +This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a dataset schema, a partition strategy, and a URI that specifies the storage scheme. + +[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf + +### Prerequisites + +A VM or cluster with CDH installed. 
+ +### Result + +You create `dataset:hive:events`, where you can store standard event objects. You can use the dataset with several Kite tutorials that demonstrate data capture, storage, and analysis. ## Defining the Schema -The `standard_event.avsc` schema is self-describing, thanks to the _doc_ property for each of the fields. The fields store the `user_id` for the person who initiated the event, the user's IP address, and when the event occurred. +The `standard_event.avsc` schema is self-describing, with a _doc_ property for each field. StandardEvent records store the `user_id` for the person who initiates an event, the user's IP address, and a timestamp for when the event occurred. ### standard_event.avsc @@ -52,15 +63,13 @@ The `standard_event.avsc` schema is self-describing, thanks to the _doc_ propert } ``` -For convenience, save `standard_event.avsc` to the same directory where you installed the kite-dataset executable JAR. - ## Defining the Partition Strategy -Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the chosen time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies]. +Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies]. -The following code sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field. +The following sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field. 
-### standard_event.json +### partition_year_month_day.json ``` [ { @@ -78,8 +87,6 @@ The following code sample defines a strategy that partitions a dataset by _year_ } ] ``` -For convenience, save `standard_event.json` to the same directory where you installed the `kite-dataset` executable JAR. - [partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html ## Creating the Events Dataset Using the Kite CLI @@ -88,21 +95,15 @@ Create the _events_ dataset using the default Hive scheme. To create the _events_ dataset: -1. Open a terminal window and navigate to the directory where you saved the schema file. -1. Use the `create` command to create the dataset. +1. Open a terminal window. +1. Use the `create` command to create the dataset. This example assumes that you stored the schema and partition definitions in your home directory. Substitute the correct path if you stored them in a different location. ``` kite-dataset create events \ - --schema standard_event.avsc \ - --partition-by standard_event.json + --schema ~/standard_event.avsc \ + --partition-by ~/partition_year_month_day.json ``` -Use Hue to look at the schema and confirm that the dataset is ready to use. - -[http://quickstart.cloudera:8888/filebrowser/view//tmp/data/default/events/.metadata/schema.avsc](http://quickstart.cloudera:8888/filebrowser/view//tmp/data/default/events/.metadata/schema.avsc) - -## Next Steps - -You've created a dataset to store events captured as they happen. Now you can run a web application to create records in your new dataset. See [Capturing Events with Flume][capture-events]. +Use [Hue][hue] to confirm that the dataset appears in your table list and is ready to use. 
-[capture-events]:{{site.baseurl}}/tutorials/flume-capture-events.html \ No newline at end of file +[hue]:http://quickstart.cloudera:8888/beeswax/execute#query diff --git a/tutorials/flume-capture-events.md b/tutorials/flume-capture-events.md index 94da64f..af004ee 100644 --- a/tutorials/flume-capture-events.md +++ b/tutorials/flume-capture-events.md @@ -3,15 +3,33 @@ layout: page title: Capturing Events with Flume --- -Once you have an [Events dataset][events], you can create a web application that captures session events. +## Purpose -This example shows how you can send log information via Flume to your Hadoop database using a JSP and custom servlet running on Tomcat. +This lesson demonstrates how you can configure Flume to capture events from a web application with minimal impact on performance or the user. Flume collects individual events and writes them in groups to the dataset. +The Flume agent receives the events over inter-process communication (IPC), and writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window. + +This example demonstrates how to generate Flume configuration information from the Kite CLI. In addition, JSP and servlet samples allow you to test the data capture mechanism. + +### Prerequisites + +* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm]. +* An [Events dataset][events] in which to capture session events. + +[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html [events]:{{site.baseurl}}/tutorials/create-events-dataset.html +### Result + +Flume is configured to listen for events on a Tomcat server instance. Use the JSP and servlets to send events to Tomcat. Log4j logs each event to the terminal window. Flume stores the events in `dataset:hive:events`. + ## Configuring Flume -These are the steps to configure Flume to channel log information directly to the `events` dataset. 
You first generate the configuration information using the Kite command-line interface, copy the results, paste them in the Flume configuration file, and then restart Flume. +Follow these steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration information using the Kite command-line interface, copy the results, paste them in the Flume configuration file, and then restart Flume. + +You can configure Flume for this example using either Cloudera Manager or the command line. + +### Configuring Flume in Cloudera Manager 1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`. 1. Copy the output from the terminal window. @@ -24,11 +42,31 @@ These are the steps to configure Flume to channel log information directly to th 1. Click __Save Changes__. 1. From the __Actions__ menu, choose __Restart__, and confirm the action. -Flume is now configured to receive logging events and record them in the `events` dataset. +### Configuring Flume from the Command Line + +1. In a terminal window, enter `kite-dataset flume-config --channel-type memory events -o flume.conf`. +1. To update Flume configuration, enter `sudo cp flume.conf /etc/flume-ng/conf/flume.conf`. +1. To restart the Flume agent, enter `sudo /etc/init.d/flume-ng-agent restart`. + +Flume is now configured to listen for web application events and record them in the `events` dataset. + +## Running the Web Application + +Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the Hadoop dataset. + +1. In a terminal window, navigate to `kite-examples/demo`. +1. To compile the application, enter `mvn install`. +1. To start the Tomcat server, enter `mvn tomcat7:run`. +1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app]. +1. 
On the web form, enter any user ID and a message, and then click **Send** to create a web event. + +View the log messages in the terminal window where you launched Tomcat. View the records in Hive using the Hue File Browser. + +[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/ ## Creating Web Application Pages -These JSP and servlet examples create message events that can be captured by Flume. +These JSP and servlet examples create message events that can be captured by Flume. These examples are not Kite- or Flume-specific; they send messages to the Tomcat server, and Flume captures the events independent of the web application. ## index.jsp @@ -58,7 +96,7 @@ When you submit a message from the JSP, the LoggingServlet receives and processe package org.kitesdk.examples.demo; ``` -The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the avro-maven-plugin runs before the compile phase. Any `.avsc` file in the `/main/avro` folder is defined as a Java class. The autogenerated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields. +The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the `avro-maven-plugin` runs before the compile phase. Any `.avsc` file in the `/main/avro` folder is defined as a Java class. The autogenerated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields. ```Java @@ -126,10 +164,10 @@ Create a new StandardEvent builder. ```Java StandardEvent event = StandardEvent.newBuilder() ``` -The event initiator is a user on the server. 
The event is a web message. These can be set as string literals, because the event initiator and event name are always the same. +The event initiator is a user on the client. The event is a web message. You can set these values as string literals, because the event initiator and event name are always the same. ```Java - .setEventInitiator("server_user") + .setEventInitiator("client_user") .setEventName("web:message") ``` @@ -148,7 +186,7 @@ The application obtains the session ID and IP address from the request object, a .setTimestamp(System.currentTimeMillis()) ``` -Build the StandardEvent object, then send the object to the logger with the level _info_. +Build the StandardEvent object, and then send the object to the logger with the level _info_. ```Java .build(); @@ -159,21 +197,3 @@ Build the StandardEvent object, then send the object to the logger with the leve } } ``` - -## Running the Web Application - -Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the Hadoop dataset. - -1. In a terminal window, navigate to `/kite-examples/demo`. -1. Type the command `mvn install`. -1. In the terminal window, enter `mvn tomcat7:run`. -1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app]. -1. On the web form, enter any user ID and a message, and then click **Send** to create a web event. - -The Flume agent receives the events over inter-process communication (IPC), and the agent writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window. - -View the records in Hadoop using the Hue File Browser. 
- -[http://quickstart.cloudera:8888/filebrowser/view/tmp/data/default/events](http://quickstart.cloudera:8888/filebrowser/view/tmp/data/default/events) - -[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/ \ No newline at end of file diff --git a/tutorials/generate-events.md b/tutorials/generate-events.md index dfb245a..831173d 100644 --- a/tutorials/generate-events.md +++ b/tutorials/generate-events.md @@ -2,22 +2,51 @@ layout: page title: Generating Events --- +## Purpose -Kite applications work with Big Data. `GenerateEvents.java` generates 1-1.5 million random event records, a small amount of realistic Big Data you can use with Kite examples. +Kite applications work with Big Data. This example class, `GenerateEvents.java`, generates 1-1.5 million random event records, a small amount of realistic Big Data you can use with Kite examples. -Much of the class is devoted to creating random values. The two methods of interest are `run` and `generateRandomEvent`. +### Prerequisites + +* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm]. +* An [Events dataset][events] in which to capture session events. + +[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +### Result + +The `events` dataset is populated with realistic event records. Use these records for ad hoc queries and with Kite data analysis tutorials. + +## Running GenerateEvents + +Follow these steps to run GenerateEvents to populate `dataset:hive:events`. + +1. In a terminal window, navigate to `kite-examples/dataset`. +1. Enter `mvn compile`. +1. Run the Java utility with `mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.GenerateEvents"`. + +Use Hue to view the records in Hive. + +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +## Understanding GenerateEvents + +Much of the class GenerateEvents creates random values. 
The two methods of interest are `run` and `generateRandomEvent`. The `run` method performs the following tasks: -* creates a view of the `hive:events` dataset -* creates a writer instance -* spends 36 seconds writing random events -* closes the writer, which stores the results in the `events` dataset. -While the goal is to create random events, if they're _too_ random there won't be anything to aggregate. The `while` loop simulates a user session with random values for `sessionId`, `userId`, and `ip`. It then generates up to 25 random events for that session. +1. Creates a view of the `hive:events` dataset. +1. Creates a writer instance. +1. Spends 36 seconds writing random events. +1. Closes the writer, which stores the results in the `events` dataset. + +Although the goal is to create random events, if they're _too_ random, there won't be anything to aggregate. The `while` loop simulates a user session with random values for `sessionId`, `userId`, and `ip`. It then generates up to 25 random events for that session. ```Java View events = Datasets.load( - "dataset:hive:events", StandardEvent.class); + (args[0].isEmpty() ? "dataset:hive:events" : args[0]), + StandardEvent.class); DatasetWriter writer = events.newWriter(); try { Utf8 sessionId = new Utf8("sessionId"); @@ -38,7 +67,7 @@ While the goal is to create random events, if they're _too_ random there won't b } ``` -The `generateRandomEvent` method produces `StandardEvent` objects using random values for the event and time details. +The `generateRandomEvent` method produces `StandardEvent` objects, using random values for the event and time details. ```Java public StandardEvent generateRandomEvent(Utf8 sessionId, long userId, Utf8 ip) { @@ -53,17 +82,3 @@ The `generateRandomEvent` method produces `StandardEvent` objects using random v .build(); } ``` - -## Running GenerateEvents - -This example assumes that you've already created the [`hive:events` dataset][events]. 
- -These are the steps to run the GenerateEvents program to populate the `hive:events` dataset. - -1. In a terminal window, navigate to `/kite-examples/dataset`. -1. Enter `mvn compile`. -1. Run the Java utility with `mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.GenerateEvents"`. - -Use Hue to view the records in Hive. - -[events]:{{site.baseurl}}/tutorials/create-events-dataset.html \ No newline at end of file diff --git a/tutorials/preparing-the-vm.md b/tutorials/preparing-the-vm.md index 4daf529..e6771b2 100644 --- a/tutorials/preparing-the-vm.md +++ b/tutorials/preparing-the-vm.md @@ -2,28 +2,33 @@ layout: page title: Preparing the Virtual Machine --- +## Purpose +This lesson describes the steps for configuring a virtual machine to run Kite example code on a Cloudera Quickstart VM. -Complete the following steps to run Kite example code on a Cloudera Quickstart VM. +### Result +Your VM is ready for you to install and run sample programs from the Kite SDK Examples project. -* Install a VirtualBox or VMWare [Cloudera QuickStart VM][getvm] version 5.2 or later. -* In that VM, run the following command from a terminal window. This command clones a local copy of Kite code examples you can build and run. +## Installing the VM and Kite Examples -```bash -git clone https://github.com/kite-sdk/kite-examples.git -``` +1. Install a VirtualBox or VMWare [Cloudera QuickStart VM][getvm] version 5.2 or later. +1. In that VM, run the following command from a terminal window. This command clones a local copy of Kite code examples you can build and run. + + ```bash + git clone https://github.com/kite-sdk/kite-examples.git + ``` -* If you haven't already done so, download the [`kite-dataset` CLI JAR][install-cli]. +1. If you haven't already done so, install the [Kite CLI][install-cli]. 
[getvm]: http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html [install-cli]:{{site.baseurl}}/Install-Kite.html ## Configuring the VM -Some Kite examples require Flume. If you use Cloudera Manager, Flume user impersonation is configured for you. If do not use Cloudera Manager, you must enable Flume user impersonation. +Some Kite examples require Flume. If you use Cloudera Manager, Flume user impersonation is configured for you. If do not use Cloudera Manager, you must update Flume user impersonation in `core-site.xml`. ### Enabling Flume User Impersonation -Flume impersonates the dataset owner to write to your dataset, much like the Unix `sudo` utility. See [Configuring Flume's Security Properties](http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security_props.html#topic_4_2_1_unique_1). +To write to your dataset, Flume impersonates the dataset owner, much like the Unix `sudo` utility. See [Configuring Flume's Security Properties](http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security_props.html#topic_4_2_1_unique_1). Add the following XML snippet to your `/etc/hadoop/conf/core-site.xml` file. @@ -76,7 +81,7 @@ All usernames/passwords for the VM are `cloudera`/`cloudera`. * __How do I fix "VTx" errors?__ 1. Reboot your computer and enter BIOS. - 1. Find the _Virtualization_ settings (usually under _Security_), and enable all of the virtualization options. + 1. Find the _Virtualization_ settings (usually under _Security_), and enable all virtualization options. * __How do I get my mouse back?__ * If your mouse/keyboard is stuck in the VM (captured), you can usually release it by pressing the right `CTRL` key. If you don't have one (or if that didn't work), click the release key in the lower-right corner of the VirtualBox window. 
From eef961fa2564469a6c1235fc5741dd1498e96011 Mon Sep 17 00:00:00 2001 From: DennisDawson Date: Mon, 30 Mar 2015 12:16:08 -0700 Subject: [PATCH 4/5] Adding back CM configuration instructions. --- tutorials/preparing-the-vm.md | 47 +++++++++++++++++++++++++++++------ 1 file changed, 39 insertions(+), 8 deletions(-) diff --git a/tutorials/preparing-the-vm.md b/tutorials/preparing-the-vm.md index e6771b2..e4de216 100644 --- a/tutorials/preparing-the-vm.md +++ b/tutorials/preparing-the-vm.md @@ -6,31 +6,62 @@ title: Preparing the Virtual Machine This lesson describes the steps for configuring a virtual machine to run Kite example code on a Cloudera Quickstart VM. ### Result -Your VM is ready for you to install and run sample programs from the Kite SDK Examples project. +Your VM is ready for you to run sample programs from the Kite SDK Examples project. ## Installing the VM and Kite Examples -1. Install a VirtualBox or VMWare [Cloudera QuickStart VM][getvm] version 5.2 or later. -1. In that VM, run the following command from a terminal window. This command clones a local copy of Kite code examples you can build and run. +Install an Oracle VirtualBox or VMWare Fusion [Cloudera QuickStart VM][getvm] version 5.2 or later. + +Before you launch the VM, decide whether to use Cloudera Manager. If you choose to use Cloudera Manager, you'll need to allocate additional memory and processing resources to your VM. The advantages of using Cloudera Manager are that it provides a visual interface for monitoring the health of your system, it configures by default most of the settings for using Kite examples, and it makes it easier for you to perform additional optional configurations. + +### Configuring the VM for Cloudera Manager + +If you use Cloudera Manager, you must increase the VM memory allocation and the number of CPUs. + +#### Adding Memory and CPUs in a VirtualBox VM + +1. In VirtualBox Manager, select your VM instance and click __Settings__. +1. Select the __System__ tab. 
+1. On the __Motherboard__ page, set the __Base Memory__ slider to _8192 MB_. +1. Click the __Processor__ page tab. +1. Set the __Processor(s)__ slider to _2_. +1. Click __OK__. + +#### Adding Memory and CPUs in a VMware Fusion VM + +1. From the VMware Fusion menu bar, select __Window > Virtual Machine Library__. +1. Select your virtual machine and click __Settings__. +1. In the __Settings__ window, in the __System Settings__ section, select __Processors & Memory__. +1. Set the amount of memory to allocate to the VM to _8192 MB_ using the slider control. +1. Expand __Advanced Options__, and set the number of CPUs to _2_. +1. Click __OK__. + +### Downloading Resources to the VM + +1. Start your VM. +1. In the VM, run the following command from a terminal window. This command clones a local copy of Kite code examples you can build and run. ```bash git clone https://github.com/kite-sdk/kite-examples.git ``` -1. If you haven't already done so, install the [Kite CLI][install-cli]. +1. Install the [Kite CLI][install-cli] command. [getvm]: http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html [install-cli]:{{site.baseurl}}/Install-Kite.html ## Configuring the VM -Some Kite examples require Flume. If you use Cloudera Manager, Flume user impersonation is configured for you. If do not use Cloudera Manager, you must update Flume user impersonation in `core-site.xml`. +Some Kite examples require Flume. To write to your dataset, Flume impersonates the dataset owner, much like the Unix `sudo` utility. See [Configuring Flume's Security Properties](http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security_props.html#topic_4_2_1_unique_1). +If you use Cloudera Manager, Flume user impersonation is configured for you. If you don't use Cloudera Manager, you must update Flume user impersonation in `core-site.xml`. 
-### Enabling Flume User Impersonation +### Starting Cloudera Manager + +To run Cloudera Manager, double-click the __Launch Cloudera Manager__ icon on the VM desktop. Flume user impersonation is enabled by default. -To write to your dataset, Flume impersonates the dataset owner, much like the Unix `sudo` utility. See [Configuring Flume's Security Properties](http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security_props.html#topic_4_2_1_unique_1). +### Enabling Flume User Impersonation -Add the following XML snippet to your `/etc/hadoop/conf/core-site.xml` file. +If you choose not to use Cloudera Manager, add the following XML snippet to your `/etc/hadoop/conf/core-site.xml` file. ``` From bf0f4fafb2ead2c935745aa622b231e2758440ee Mon Sep 17 00:00:00 2001 From: DennisDawson Date: Mon, 30 Mar 2015 13:16:50 -0700 Subject: [PATCH 5/5] Add links to conceptual topics. --- tutorials/create-events-dataset.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/tutorials/create-events-dataset.md b/tutorials/create-events-dataset.md index e199376..1cdcc99 100644 --- a/tutorials/create-events-dataset.md +++ b/tutorials/create-events-dataset.md @@ -4,13 +4,21 @@ title: Creating the Events Dataset --- ## Purpose -This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a dataset schema, a partition strategy, and a URI that specifies the storage scheme. +This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a [dataset schema][schema], a [partition strategy][partstrat], and a URI that specifies the storage [scheme][scheme], then use [`kite-dataset create`][create] to make a Hive dataset. 
[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf +[schema]:{{site.baseurl}}/introduction-to-datasets.html#schemas +[partstrat]:{{site.baseurl}}/Partitioned-Datasets.html#partition-strategies +[scheme]:{{site.baseurl}}/introduction-to-datasets.html#uri-schemes +[create]:{{site.baseurl}}/cli-reference.html#create ### Prerequisites -A VM or cluster with CDH installed. +* A [Quickstart VM][prepare] or instance of CDH 5.2 or later. +* The [kite-dataset][kite-dataset] command. + +[prepare]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[kite-dataset]:{{site.baseurl}}/Install-Kite.html ### Result