diff --git a/tutorials/create-events-dataset.md b/tutorials/create-events-dataset.md new file mode 100644 index 0000000..1cdcc99 --- /dev/null +++ b/tutorials/create-events-dataset.md @@ -0,0 +1,117 @@ +--- +layout: page +title: Creating the Events Dataset +--- +## Purpose + +This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a [dataset schema][schema], a [partition strategy][partstrat], and a URI that specifies the storage [scheme][scheme], then use [`kite-dataset create`][create] to make a Hive dataset. + +[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf +[schema]:{{site.baseurl}}/introduction-to-datasets.html#schemas +[partstrat]:{{site.baseurl}}/Partitioned-Datasets.html#partition-strategies +[scheme]:{{site.baseurl}}/introduction-to-datasets.html#uri-schemes +[create]:{{site.baseurl}}/cli-reference.html#create + +### Prerequisites + +* A [Quickstart VM][prepare] or instance of CDH 5.2 or later. +* The [kite-dataset][kite-dataset] command. + +[prepare]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[kite-dataset]:{{site.baseurl}}/Install-Kite.html + +### Result + +You create `dataset:hive:events`, where you can store standard event objects. You can use the dataset with several Kite tutorials that demonstrate data capture, storage, and analysis. + +## Defining the Schema + +The `standard_event.avsc` schema is self-describing, with a _doc_ property for each field. StandardEvent records store the `user_id` for the person who initiates an event, the user's IP address, and a timestamp for when the event occurred. 
+ +### standard_event.avsc + +```JSON +{ + "name": "StandardEvent", + "namespace": "org.kitesdk.data.event", + "type": "record", + "doc": "A standard event type for logging, based on the paper 'The Unified Logging Infrastructure for Data Analytics at Twitter' by Lee et al, http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf", + "fields": [ + { + "name": "event_initiator", + "type": "string", + "doc": "Source of the event in the format {client,server}_{user,app}; for example, 'client_user'. Required." + }, + { + "name": "event_name", + "type": "string", + "doc": "A hierarchical name for the event, with parts separated by ':'. Required." + }, + { + "name": "user_id", + "type": "long", + "doc": "A unique identifier for the user. Required." + }, + { + "name": "session_id", + "type": "string", + "doc": "A unique identifier for the session. Required." + }, + { + "name": "ip", + "type": "string", + "doc": "The IP address of the host where the event originated. Required." + }, + { + "name": "timestamp", + "type": "long", + "doc": "The point in time when the event occurred, represented as the number of milliseconds since January 1, 1970, 00:00:00 GMT. Required." + } + ] +} +``` + +## Defining the Partition Strategy + +Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies]. + +The following sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field. 
+ +### partition_year_month_day.json + +``` +[ { + "source" : "timestamp", + "type" : "year", + "name" : "year" +}, { + "source" : "timestamp", + "type" : "month", + "name" : "month" +}, { + "source" : "timestamp", + "type" : "day", + "name" : "day" +} ] +``` + +[partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html + +## Creating the Events Dataset Using the Kite CLI + +Create the _events_ dataset using the default Hive scheme. + +To create the _events_ dataset: + +1. Open a terminal window. +1. Use the `create` command to create the dataset. This example assumes that you stored the schema and partition definitions in your home directory. Substitute the correct path if you stored them in a different location. + +``` +kite-dataset create events \ + --schema ~/standard_event.avsc \ + --partition-by ~/partition_year_month_day.json +``` + +Use [Hue][hue] to confirm that the dataset appears in your table list and is ready to use. + +[hue]:http://quickstart.cloudera:8888/beeswax/execute#query diff --git a/tutorials/flume-capture-events.md b/tutorials/flume-capture-events.md new file mode 100644 index 0000000..af004ee --- /dev/null +++ b/tutorials/flume-capture-events.md @@ -0,0 +1,199 @@ +--- +layout: page +title: Capturing Events with Flume +--- + +## Purpose + +This lesson demonstrates how you can configure Flume to capture events from a web application with minimal impact on performance or the user. Flume collects individual events and writes them in groups to the dataset. + +The Flume agent receives the events over inter-process communication (IPC), and writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window. + +This example demonstrates how to generate Flume configuration information from the Kite CLI. In addition, JSP and servlet samples allow you to test the data capture mechanism. + +### Prerequisites + +* A VM or cluster configured with Flume user impersonation. 
See [Preparing the Virtual Machine][vm]. +* An [Events dataset][events] in which to capture session events. + +[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +### Result + +Flume is configured to listen for events on a Tomcat server instance. Use the JSP and servlets to send events to Tomcat. Log4j logs each event to the terminal window. Flume stores the events in `dataset:hive:events`. + +## Configuring Flume + +Follow these steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration information using the Kite command-line interface, copy the results, paste them in the Flume configuration file, and then restart Flume. + +You can configure Flume for this example using either Cloudera Manager or the command line. + +### Configuring Flume in Cloudera Manager + +1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`. +1. Copy the output from the terminal window. +1. Open Cloudera Manager. +1. Under __Status__, click the link to __Flume__. +1. Choose the __Configuration__ tab. +1. Click __Agent Base Group__. +1. Right-click the Configuration File text area and choose __Select All__. +1. Right-click the Configuration File text area and choose __Paste__. +1. Click __Save Changes__. +1. From the __Actions__ menu, choose __Restart__, and confirm the action. + +### Configuring Flume from the Command Line + +1. In a terminal window, enter `kite-dataset flume-config --channel-type memory events -o flume.conf`. +1. To update Flume configuration, enter `sudo cp flume.conf /etc/flume-ng/conf/flume.conf`. +1. To restart the Flume agent, enter `sudo /etc/init.d/flume-ng-agent restart`. + +Flume is now configured to listen for web application events and record them in the `events` dataset. 
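
The command-line steps above can be collected into one small script. The following is a minimal sketch, assuming a bash shell on the VM. The function name `configure_flume`, the helper `run`, and the `DRY_RUN` switch are illustrative names and not part of Kite or Flume; the three commands themselves are the ones listed in the steps.

```bash
#!/usr/bin/env bash
# Sketch only: wraps the three command-line steps in one function.
# configure_flume, run, and DRY_RUN are hypothetical names used for illustration.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

configure_flume() {
  local dataset="${1:-events}"
  # 1. Generate the agent configuration for the dataset.
  run kite-dataset flume-config --channel-type memory "$dataset" -o flume.conf || return 1
  # 2. Copy it to the location the flume-ng agent reads.
  run sudo cp flume.conf /etc/flume-ng/conf/flume.conf || return 1
  # 3. Restart the agent so that it picks up the new configuration.
  run sudo /etc/init.d/flume-ng-agent restart
}

# Run the steps only when a dataset name is passed, for example:
#   ./configure-flume.sh events
if [ "$#" -gt 0 ]; then
  configure_flume "$1"
fi
```

Setting `DRY_RUN=1` prints the commands without running them, which is a cheap way to check the paths before touching the Flume configuration.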
+ +## Running the Web Application + +Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that Flume writes to the `events` dataset. + +1. In a terminal window, navigate to `kite-examples/demo`. +1. To compile the application, enter `mvn install`. +1. To start the Tomcat server, enter `mvn tomcat7:run`. +1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app]. +1. On the web form, enter a numeric user ID and a message, and then click **Send** to create a web event. + +View the log messages in the terminal window where you launched Tomcat. View the records in Hive using the Hue File Browser. + +[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/ + +## Creating Web Application Pages + +These JSP and servlet examples create message events that can be captured by Flume. These examples are not Kite- or Flume-specific; they send messages to the Tomcat server, and Flume captures the events independent of the web application. + +## index.jsp + +The default landing page for the web application is `index.jsp`. It defines a form with fields for an arbitrary User ID and a message. The __Send__ button submits the input values to the Tomcat server. + +```JSP +<html> +<head> +<title>Kite Example</title> +</head> +<body> +<h2>Kite Example</h2> +<form action="send" method="get"> + User ID: <input type="text" name="user_id"> + Message: <input type="text" name="message"> + <input type="submit" value="Send"> +</form>
+</body> +</html> +``` + +## LoggingServlet + +When you submit a message from the JSP, the LoggingServlet receives and processes the request. The following is mostly standard servlet code, with some notes about application-specific snippets. + +```Java +package org.kitesdk.examples.demo; +``` + +The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the `avro-maven-plugin` runs before the compile phase and generates a Java class for each `.avsc` file in the `/main/avro` folder. The autogenerated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields. + +```Java + +import org.kitesdk.data.event.StandardEvent; +import java.io.IOException; +import java.io.PrintWriter; +import javax.servlet.ServletException; +import javax.servlet.http.HttpServlet; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +``` + +This example sends Log4j messages directly to the Hive data sink via Flume. + +```Java +import org.apache.log4j.Logger; + +public class LoggingServlet extends HttpServlet { + + private final Logger logger = Logger.getLogger(LoggingServlet.class); + + @Override + protected void doGet(HttpServletRequest request, HttpServletResponse + response) throws ServletException, IOException { + + response.setContentType("text/html"); +``` + +Create a PrintWriter instance to write the response page. + +```Java + PrintWriter pw = response.getWriter(); + + pw.println("<html>"); + pw.println("<head><title>Kite Example</title></head>"); + pw.println("<body>"); +``` + +Get the user ID and message values from the servlet request. + +```Java + String userId = request.getParameter("user_id"); + String message = request.getParameter("message"); +``` + +If there's no message, don't create a log entry. + +```Java + if (message == null) { + pw.println("<p>No message specified.</p>"); + +``` + +Otherwise, print the message at the top of the page body. + +```Java + } else { + pw.println("<p>Message: " + message + "</p>"); + +``` + +Create a new StandardEvent builder. + +```Java + StandardEvent event = StandardEvent.newBuilder() +``` +The event initiator is a user on the client. The event is a web message. You can set these values as string literals, because the event initiator and event name are always the same. + +```Java + .setEventInitiator("client_user") + .setEventName("web:message") +``` + +Parse the arbitrary user ID, provided by the user, as a long integer. + +```Java + .setUserId(Long.parseLong(userId)) + +``` + +The application obtains the session ID and IP address from the request object, and creates a timestamp based on the local machine clock. + +```Java + .setSessionId(request.getSession(true).getId()) + .setIp(request.getRemoteAddr()) + .setTimestamp(System.currentTimeMillis()) +``` + +Build the StandardEvent object, and then send the object to the logger with the level _info_. + +```Java + .build(); + logger.info(event); + } + pw.println("<p><a href=\"/demo-logging-webapp/\">Home</a></p>"); + pw.println("</body></html>"); + } +} +``` diff --git a/tutorials/generate-events.md b/tutorials/generate-events.md new file mode 100644 index 0000000..831173d --- /dev/null +++ b/tutorials/generate-events.md @@ -0,0 +1,84 @@ +--- +layout: page +title: Generating Events +--- +## Purpose + +Kite applications work with Big Data. This example class, `GenerateEvents.java`, generates 1 to 1.5 million random event records, a small amount of realistic Big Data you can use with Kite examples. + +### Prerequisites + +* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm]. +* An [Events dataset][events] in which to capture session events. + +[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +### Result + +The `events` dataset is populated with realistic event records. Use these records for ad hoc queries and with Kite data analysis tutorials. + +## Running GenerateEvents + +Follow these steps to run GenerateEvents to populate `dataset:hive:events`. + +1. In a terminal window, navigate to `kite-examples/dataset`. +1. Enter `mvn compile`. +1. Run the Java utility with `mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.GenerateEvents"`. + +Use Hue to view the records in Hive. + +## Understanding GenerateEvents + +Much of the `GenerateEvents` class generates random values. The two methods of interest are `run` and `generateRandomEvent`. + +The `run` method performs the following tasks: + +1. Creates a view of the `dataset:hive:events` dataset. +1. Creates a writer instance. +1. Spends 36 seconds writing random events. +1. Closes the writer, which stores the results in the `events` dataset. + +Although the goal is to create random events, if they're _too_ random, there won't be anything to aggregate. The `while` loop simulates a user session with random values for `sessionId`, `userId`, and `ip`. 
It then generates between 0 and 24 random events for that session (`random.nextInt(25)` excludes the upper bound). + +```Java + View<StandardEvent> events = Datasets.load( + ((args.length == 0 || args[0].isEmpty()) ? "dataset:hive:events" : args[0]), + StandardEvent.class); + DatasetWriter<StandardEvent> writer = events.newWriter(); + try { + Utf8 sessionId = new Utf8("sessionId"); + long userId = 0; + Utf8 ip = new Utf8("ip"); + int randomEventCount = 0; + while (System.currentTimeMillis() - baseTimestamp < 36000) { + sessionId = randomSessionId(); + userId = randomUserId(); + ip = randomIp(); + randomEventCount = random.nextInt(25); + for (int i=0; i < randomEventCount; i++) { + writer.write(generateRandomEvent(sessionId, userId, ip)); + } + } + } finally { + writer.close(); + } +``` + +The `generateRandomEvent` method produces `StandardEvent` objects, using random values for the event and time details. + +```Java + public StandardEvent generateRandomEvent(Utf8 sessionId, long userId, Utf8 ip) { + return StandardEvent.newBuilder() + .setEventInitiator(new Utf8("client_user")) + .setEventName(randomEventName()) + .setUserId(userId) + .setSessionId(sessionId) + .setIp(ip) + .setTimestamp(randomTimestamp()) + .setEventDetails(randomEventDetails()) + .build(); + } +``` diff --git a/tutorials/preparing-the-vm.md b/tutorials/preparing-the-vm.md new file mode 100644 index 0000000..e4de216 --- /dev/null +++ b/tutorials/preparing-the-vm.md @@ -0,0 +1,124 @@ +--- +layout: page +title: Preparing the Virtual Machine +--- +## Purpose +This lesson describes the steps for configuring a Cloudera QuickStart VM to run Kite example code. + +### Result +Your VM is ready for you to run sample programs from the Kite SDK Examples project. + +## Installing the VM and Kite Examples + +Install a [Cloudera QuickStart VM][getvm], version 5.2 or later, for Oracle VirtualBox or VMware Fusion. + +Before you launch the VM, decide whether to use Cloudera Manager. If you choose to use Cloudera Manager, you'll need to allocate additional memory and processing resources to your VM. 
The advantages of using Cloudera Manager are that it provides a visual interface for monitoring the health of your system, it configures by default most of the settings for using Kite examples, and it makes it easier for you to perform additional optional configurations. + +### Configuring the VM for Cloudera Manager + +If you use Cloudera Manager, you must increase the VM memory allocation and the number of CPUs. + +#### Adding Memory and CPUs in a VirtualBox VM + +1. In VirtualBox Manager, select your VM instance and click __Settings__. +1. Select the __System__ tab. +1. On the __Motherboard__ page, set the __Base Memory__ slider to _8192 MB_. +1. Click the __Processor__ page tab. +1. Set the __Processor(s)__ slider to _2_. +1. Click __OK__. + +#### Adding Memory and CPUs in a VMware Fusion VM + +1. From the VMware Fusion menu bar, select __Window > Virtual Machine Library__. +1. Select your virtual machine and click __Settings__. +1. In the __Settings__ window, in the __System Settings__ section, select __Processors & Memory__. +1. Set the amount of memory to allocate to the VM to _8192 MB_ using the slider control. +1. Expand __Advanced Options__, and set the number of CPUs to _2_. +1. Click __OK__. + +### Downloading Resources to the VM + +1. Start your VM. +1. In the VM, run the following command from a terminal window. This command clones a local copy of Kite code examples you can build and run. + + ```bash + git clone https://github.com/kite-sdk/kite-examples.git + ``` + +1. Install the [Kite CLI][install-cli] command. + +[getvm]: http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html +[install-cli]:{{site.baseurl}}/Install-Kite.html + +## Configuring the VM + +Some Kite examples require Flume. To write to your dataset, Flume impersonates the dataset owner, much like the Unix `sudo` utility. 
See [Configuring Flume's Security Properties](http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security_props.html#topic_4_2_1_unique_1). +If you use Cloudera Manager, Flume user impersonation is configured for you. If you don't use Cloudera Manager, you must configure Flume user impersonation in `core-site.xml`. + +### Starting Cloudera Manager + +To run Cloudera Manager, double-click the __Launch Cloudera Manager__ icon on the VM desktop. Flume user impersonation is enabled by default. + +### Enabling Flume User Impersonation + +If you choose not to use Cloudera Manager, add the following XML snippet to your `/etc/hadoop/conf/core-site.xml` file. + +``` +<property> + <name>hadoop.proxyuser.flume.groups</name> + <value>*</value> +</property> +<property> + <name>hadoop.proxyuser.flume.hosts</name> + <value>*</value> +</property> +``` + +Restart your NameNode by running the following command in a terminal window. + +``` +sudo service hadoop-hdfs-namenode restart +``` + +## Working with the VM + +All usernames/passwords for the VM are `cloudera`/`cloudera`. + +## Troubleshooting + +* __I can't find the VM files in VirtualBox (or VMware).__ + * You might need to unpack the VM files. + * On Windows, install 7zip, and extract the VM files from the `.7z` file. + * On Linux or Mac: + 1. In a terminal window, navigate to where you copied the VM file. + 1. Enter the command `7zr e` followed by the name of the VM file. For example, to extract files for the 5.4 VirtualBox VM, run the following command. + `7zr e cloudera-quickstart-vm-5.4-virtualbox.7z`. + * Import the extracted files to VirtualBox or VMware. + +* __How do I open an `.ovf` file?__ + 1. Install and open [VirtualBox][vbox] on your computer. + 1. From the __File__ menu, choose __Import Appliance...__. + 1. Navigate to the `.ovf` file and open it. + +* __What is a `.vmdk` file?__ + * The `.vmdk` file is the VM disk image that accompanies an `.ovf` file. The `.ovf` file is a portable VM description. + +* __How do I open a `.vbox` file?__ + 1. Install and open [VirtualBox][vbox] on your computer. 
+ 1. From the __Machine__ menu, choose __Add...__. + 1. Navigate to where you unpacked the `.vbox` file and select it. + 1. Click __Open__, and click __Continue__. + 1. Follow the steps in [Configuring the VM](#configuring-the-vm) to complete the installation. + +* __How do I fix "VTx" errors?__ + 1. Reboot your computer and enter the BIOS setup. + 1. Find the _Virtualization_ settings (usually under _Security_), and enable all virtualization options. + +* __How do I get my mouse back?__ + * If your mouse/keyboard is stuck in the VM (captured), you can usually release it by pressing the right `CTRL` key. If you don't have one (or if that doesn't work), click the release key in the lower-right corner of the VirtualBox window. + +* __Other problems__ + * Using VirtualBox? Try using VMware. + * Using VMware? Try using VirtualBox. + +[vbox]: https://www.virtualbox.org/wiki/Downloads \ No newline at end of file