diff --git a/tutorials/create-events-dataset.md b/tutorials/create-events-dataset.md new file mode 100644 index 0000000..1cdcc99 --- /dev/null +++ b/tutorials/create-events-dataset.md @@ -0,0 +1,117 @@ +--- +layout: page +title: Creating the Events Dataset +--- +## Purpose + +This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a [dataset schema][schema], a [partition strategy][partstrat], and a URI that specifies the storage [scheme][scheme], then use [`kite-dataset create`][create] to make a Hive dataset. + +[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf +[schema]:{{site.baseurl}}/introduction-to-datasets.html#schemas +[partstrat]:{{site.baseurl}}/Partitioned-Datasets.html#partition-strategies +[scheme]:{{site.baseurl}}/introduction-to-datasets.html#uri-schemes +[create]:{{site.baseurl}}/cli-reference.html#create + +### Prerequisites + +* A [Quickstart VM][prepare] or instance of CDH 5.2 or later. +* The [kite-dataset][kite-dataset] command. + +[prepare]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[kite-dataset]:{{site.baseurl}}/Install-Kite.html + +### Result + +You create `dataset:hive:events`, where you can store standard event objects. You can use the dataset with several Kite tutorials that demonstrate data capture, storage, and analysis. + +## Defining the Schema + +The `standard_event.avsc` schema is self-describing, with a _doc_ property for each field. StandardEvent records store the `user_id` for the person who initiates an event, the user's IP address, and a timestamp for when the event occurred. 
+ +### standard_event.avsc + +```JSON +{ + "name": "StandardEvent", + "namespace": "org.kitesdk.data.event", + "type": "record", + "doc": "A standard event type for logging, based on the paper 'The Unified Logging Infrastructure for Data Analytics at Twitter' by Lee et al, http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf", + "fields": [ + { + "name": "event_initiator", + "type": "string", + "doc": "Source of the event in the format {client,server}_{user,app}; for example, 'client_user'. Required." + }, + { + "name": "event_name", + "type": "string", + "doc": "A hierarchical name for the event, with parts separated by ':'. Required." + }, + { + "name": "user_id", + "type": "long", + "doc": "A unique identifier for the user. Required." + }, + { + "name": "session_id", + "type": "string", + "doc": "A unique identifier for the session. Required." + }, + { + "name": "ip", + "type": "string", + "doc": "The IP address of the host where the event originated. Required." + }, + { + "name": "timestamp", + "type": "long", + "doc": "The point in time when the event occurred, represented as the number of milliseconds since January 1, 1970, 00:00:00 GMT. Required." + } + ] +} +``` + +## Defining the Partition Strategy + +Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies]. + +The following sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field. 
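To make the effect of such a strategy concrete before the strategy file itself, here is a minimal sketch of how a millisecond `timestamp` could map to _year_, _month_, and _day_ values. It is an illustration only, not Kite's implementation: the class name `PartitionValuesSketch` and the helper `partitionValues` are hypothetical, and the sketch assumes the calendar fields are computed in UTC.

```Java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class PartitionValuesSketch {

  // Hypothetical helper (not part of Kite): derives the year, month, and day
  // for a timestamp given in milliseconds since the epoch.
  // Assumption: calendar fields are computed in UTC.
  static int[] partitionValues(long timestampMillis) {
    ZonedDateTime utc = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
    return new int[] { utc.getYear(), utc.getMonthValue(), utc.getDayOfMonth() };
  }

  public static void main(String[] args) {
    long timestamp = 1420070400000L; // 2015-01-01T00:00:00Z
    int[] values = partitionValues(timestamp);
    // Prints: year=2015 month=1 day=1
    System.out.println("year=" + values[0] + " month=" + values[1] + " day=" + values[2]);
  }
}
```

Records whose timestamps fall on the same UTC day produce the same three values, which is why a query restricted to one day only needs the files stored under that partition. How the values are rendered in directory names is not shown here.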
+ +### partition_year_month_day.json + +``` +[ { + "source" : "timestamp", + "type" : "year", + "name" : "year" +}, { + "source" : "timestamp", + "type" : "month", + "name" : "month" +}, { + "source" : "timestamp", + "type" : "day", + "name" : "day" +} ] +``` + +[partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html + +## Creating the Events Dataset Using the Kite CLI + +Create the _events_ dataset using the default Hive scheme. + +To create the _events_ dataset: + +1. Open a terminal window. +1. Use the `create` command to create the dataset. This example assumes that you stored the schema and partition definitions in your home directory. Substitute the correct path if you stored them in a different location. + +``` +kite-dataset create events \ + --schema ~/standard_event.avsc \ + --partition-by ~/partition_year_month_day.json +``` + +Use [Hue][hue] to confirm that the dataset appears in your table list and is ready to use. + +[hue]:http://quickstart.cloudera:8888/beeswax/execute#query diff --git a/tutorials/flume-capture-events.md b/tutorials/flume-capture-events.md new file mode 100644 index 0000000..af004ee --- /dev/null +++ b/tutorials/flume-capture-events.md @@ -0,0 +1,199 @@ +--- +layout: page +title: Capturing Events with Flume +--- + +## Purpose + +This lesson demonstrates how you can configure Flume to capture events from a web application with minimal impact on performance or the user. Flume collects individual events and writes them in groups to the dataset. + +The Flume agent receives the events over inter-process communication (IPC), and writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window. + +This example demonstrates how to generate Flume configuration information from the Kite CLI. In addition, JSP and servlet samples allow you to test the data capture mechanism. + +### Prerequisites + +* A VM or cluster configured with Flume user impersonation. 
See [Preparing the Virtual Machine][vm]. +* An [Events dataset][events] in which to capture session events. + +[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +### Result + +Flume is configured to listen for events on a Tomcat server instance. Use the JSP and servlets to send events to Tomcat. Log4j logs each event to the terminal window. Flume stores the events in `dataset:hive:events`. + +## Configuring Flume + +Follow these steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration information using the Kite command-line interface, copy the results, paste them in the Flume configuration file, and then restart Flume. + +You can configure Flume for this example using either Cloudera Manager or the command line. + +### Configuring Flume in Cloudera Manager + +1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`. +1. Copy the output from the terminal window. +1. Open Cloudera Manager. +1. Under __Status__, click the link to __Flume__. +1. Choose the __Configuration__ tab. +1. Click __Agent Base Group__. +1. Right-click the Configuration File text area and choose __Select All__. +1. Right-click the Configuration File text area and choose __Paste__. +1. Click __Save Changes__. +1. From the __Actions__ menu, choose __Restart__, and confirm the action. + +### Configuring Flume from the Command Line + +1. In a terminal window, enter `kite-dataset flume-config --channel-type memory events -o flume.conf`. +1. To update Flume configuration, enter `sudo cp flume.conf /etc/flume-ng/conf/flume.conf`. +1. To restart the Flume agent, enter `sudo /etc/init.d/flume-ng-agent restart`. + +Flume is now configured to listen for web application events and record them in the `events` dataset. 
+ +## Running the Web Application + +Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the Hadoop dataset. + +1. In a terminal window, navigate to `kite-examples/demo`. +1. To compile the application, enter `mvn install`. +1. To start the Tomcat server, enter `mvn tomcat7:run`. +1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app]. +1. On the web form, enter any user ID and a message, and then click **Send** to create a web event. + +View the log messages in the terminal window where you launched Tomcat. View the records in Hive using the Hue File Browser. + +[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/ + +## Creating Web Application Pages + +These JSP and servlet examples create message events that can be captured by Flume. These examples are not Kite- or Flume-specific; they send messages to the Tomcat server, and Flume captures the events independent of the web application. + +## index.jsp + +The default landing page for the web application is `index.jsp`. It defines a form with fields for an arbitrary User ID and a message. The __Send__ button submits the input values to the Tomcat server. + +```JSP + +
+No message specified.
"); + +``` + +Otherwise, print the message at the top of the page body. + +```Java + } else { + pw.println("Message: " + message + "
"); + +``` + +Create a new StandardEvent builder. + +```Java + StandardEvent event = StandardEvent.newBuilder() +``` +The event initiator is a user on the client. The event is a web message. You can set these values as string literals, because the event initiator and event name are always the same. + +```Java + .setEventInitiator("client_user") + .setEventName("web:message") +``` + +Parse the arbitrary user ID, provided by the user, as a long integer. + +```Java + .setUserId(Long.parseLong(userId)) + +``` + +The application obtains the session ID and IP address from the request object, and creates a timestamp based on the local machine clock. + +```Java + .setSessionId(request.getSession(true).getId()) + .setIp(request.getRemoteAddr()) + .setTimestamp(System.currentTimeMillis()) +``` + +Build the StandardEvent object, and then send the object to the logger with the level _info_. + +```Java + .build(); + logger.info(event); + } + pw.println(""); + pw.println(""); + } +} +``` diff --git a/tutorials/generate-events.md b/tutorials/generate-events.md new file mode 100644 index 0000000..831173d --- /dev/null +++ b/tutorials/generate-events.md @@ -0,0 +1,84 @@ +--- +layout: page +title: Generating Events +--- +## Purpose + +Kite applications work with Big Data. This example class, `GenerateEvents.java`, generates 1-1.5 million random event records, a small amount of realistic Big Data you can use with Kite examples. + +### Prerequisites + +* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm]. +* An [Events dataset][events] in which to capture session events. + +[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +### Result + +The `events` dataset is populated with realistic event records. Use these records for ad hoc queries and with Kite data analysis tutorials. 
+ +## Running GenerateEvents + +Follow these steps to run `GenerateEvents` and populate `dataset:hive:events`. + +1. In a terminal window, navigate to `kite-examples/dataset`. +1. Enter `mvn compile`. +1. Run the Java utility with `mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.GenerateEvents"`. + +Use Hue to view the records in Hive. + +[events]:{{site.baseurl}}/tutorials/create-events-dataset.html + +## Understanding GenerateEvents + +Most of the `GenerateEvents` class is devoted to creating random values. The two methods of interest are `run` and `generateRandomEvent`. + +The `run` method performs the following tasks: + +1. Creates a view of the `hive:events` dataset. +1. Creates a writer instance. +1. Writes random events for 36 seconds. +1. Closes the writer, which stores the results in the `events` dataset. + +Although the goal is to create random events, if they're _too_ random, there won't be anything to aggregate. For that reason, the `while` loop simulates a user session with random values for `sessionId`, `userId`, and `ip`, and then generates up to 25 random events for that session. + +```Java + View