asahi7/PasswordCrackerInHadoop
#######################################################
# BDSA Lab
#
# Assignment 05 - PasswordCracker using MapReduce In Hadoop
#######################################################

This directory contains the files that you will use to build and run PasswordCrackerInHadoop.
In this assignment, you implement PasswordCrackerInHadoop.
(For more information on PasswordCracker, see the README in assignment 01.)

*********** 1. Overview ***********

In HDFS, files are divided into blocks (64 MB by default), and each block is stored as multiple replicas (3 by default). See the Hadoop tutorial for details:
https://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

Generally, inputs to Hadoop applications are files on HDFS. Hadoop transforms the byte sequences in file blocks into records for applications using the following classes:

  InputFormat  : reads data from file blocks and divides it into roughly equal-sized byte sequences called splits; the splits are of type InputSplit.
  InputSplit   : describes a single split produced by InputFormat.
  RecordReader : takes a split as input and creates records of key/value pairs; these keys and values are passed to the map function.

In this assignment, however, we use a list of strings, rather than files, as input to our Hadoop application (our password cracker). The input strings are generated by the code you implement in the CandidateRangeInputFormat class. Unlike a file-based InputFormat, CandidateRangeInputFormat does not read data from a file to generate splits; it generates each split from metadata. A split in our application represents a search space (as in our previous assignments); CandidateRangeInputFormat considers the password length, the user-defined numberOfsplits, etc. when creating input splits. These splits are transformed into records of key/value pairs, each representing a solution sub-space range (key : rangeStart, value : rangeEnd), in the CandidateRangeRecordReader class.
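The split generation described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions, not the template's implementation: the 36-character alphabet, the class name CandidateRangeSketch, and the helpers makeRanges() and candidateAt() are all hypothetical. Each Range corresponds to one record (key : rangeStart, value : rangeEnd).

```java
import java.util.ArrayList;
import java.util.List;

public class CandidateRangeSketch {

    // Assumed alphabet (digits and lowercase letters); the real assignment may differ.
    static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";

    /** One split: a half-open range [start, end) of candidate indices. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Length measured in candidates, not in bytes. */
        long length() {
            return end - start;
        }
    }

    /** Size of the search space: alphabetSize raised to passwordLength. */
    static long searchSpaceSize(int alphabetSize, int passwordLength) {
        long size = 1;
        for (int i = 0; i < passwordLength; i++) {
            size = Math.multiplyExact(size, alphabetSize); // throws on overflow
        }
        return size;
    }

    /** Divides [0, total) into numberOfSplits contiguous ranges of near-equal size. */
    static List<Range> makeRanges(long total, int numberOfSplits) {
        if (numberOfSplits <= 0) {
            throw new IllegalArgumentException("numberOfSplits must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        long base = total / numberOfSplits;
        long remainder = total % numberOfSplits;
        long start = 0;
        for (int i = 0; i < numberOfSplits; i++) {
            // The first 'remainder' ranges take one extra candidate each.
            long len = base + (i < remainder ? 1 : 0);
            ranges.add(new Range(start, start + len));
            start += len;
        }
        return ranges;
    }

    /** Converts a candidate index into a fixed-length string over ALPHABET. */
    static String candidateAt(long index, int length) {
        char[] chars = new char[length];
        for (int pos = length - 1; pos >= 0; pos--) {
            chars[pos] = ALPHABET.charAt((int) (index % ALPHABET.length()));
            index /= ALPHABET.length();
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        int passwordLength = 6; // assumed length for this demo
        long total = searchSpaceSize(ALPHABET.length(), passwordLength);
        for (Range r : makeRanges(total, 4)) {
            System.out.println("rangeStart=" + r.start + " rangeEnd=" + r.end
                    + " first=" + candidateAt(r.start, passwordLength));
        }
    }
}
```

In a real job, getSplits() would wrap each range in a CandidateRangeInputSplit, and the record reader would hand the range boundaries to the mapper.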
After the map function searches its given range, it passes the solution (the recovered password) to the reducer, if it found one. The reducer writes both the original password and the encrypted password to the output file.

**** 2. Template Code ****

In the template code, we provide the following classes:

  CrackerDriver class
  CandidateRangeInputFormat class
  CandidateRangeInputSplit class
  CandidateRangeRecordReader class
  PasswordCrackerMapper class
  TerminationChecker class
  PasswordCrackerReducer class
  PasswordCrackerUtil class

CrackerDriver class :
  Sets up the execution information for the MapReduce job and starts the job.

CandidateRangeInputFormat class :
  Defines how input is read in Hadoop.
  1) getSplits() : Generates the splits, each of which consists of strings (a solution space range), and returns them to the JobClient.

CandidateRangeInputSplit class :
  Represents the split information. Normally, an InputSplit has a length in bytes, but in this assignment its length is the size of its solution space.

CandidateRangeRecordReader class :
  1) initialize() : Called with an InputSplit as a parameter after the reader is created. It divides the InputSplit into key/value records.
  2) nextKeyValue() : Normally, this function in the RecordReader is called repeatedly to populate the key and value objects for the mapper. When the reader reaches the end of the stream, it returns false and the map task completes. In our case, however, it is called only once.

PasswordCrackerMapper class :
  map() : After reading a key/value pair, it computes the password using a function of the PasswordCrackerUtil class. If it finds the original password, it passes the original password to the reducer. Otherwise, it passes nothing.

TerminationChecker class :
  This class is used for early termination. In this assignment, we use the existence of a file (named "FoundXYZ.txt") as the signal of task completion. So, if a mapper successfully finds the original password, it should create the file with function XYZ. The other mappers then check for the existence of the file and terminate.

PasswordCrackerReducer class :
  Writes both the original password and the encrypted password to the output file.

PasswordCrackerUtil class :
  Utility class for PasswordCracker.

********* 4. What you need to implement *********

In the template code, you will find comments containing COMPLETE (all in capitals). You are required to implement the necessary code in those locations. Here we list all the functions you need to implement:

  1) CandidateRangeInputFormat class
     - getSplits()
  2) CandidateRangeRecordReader class
     - initialize()
     - nextKeyValue()
  3) PasswordCrackerUtil class

******** 5. Building the PasswordCrackerInHadoop ********

  1) ./compile.sh
  2) ./run.sh [OUTPUT_PATH] [ENCRYPTED_PASSWORD] [NUMBER_OF_SPLIT]

  ex) ./run.sh output c4b9942f2886cd34fce932f279000ef3 4

      part-r-00000
      +-----------------------------------------+
      |                                         |
      | c4b9942f2886cd34fce932f279000ef3 1294ab |
      |                                         |
      +-----------------------------------------+

******** 6. Requirements *********

- You need to fill in the blanks in the template code. (Search for the string 'COMPLETE', which marks the places where you need to implement your code.)
- You must write a report describing your implementation. The report must be in Word or PDF format.
- You must name your submission files as follows: hw5_studentNumber.tar and hw5_studentNumber.pdf (or .docx). Submit both files.
- You must fully describe the code you have implemented.
- The program must support early termination.
  ex) If one task discovers the password, all the tasks immediately stop the computation.

********* 7. Report *********

You must submit the report for this assignment as well as your source code. The report must be in PDF/Word format and include the following:

- Describe the code you have implemented.
- Test with the following encrypted passwords:
  1) 684f632dd58d42d436be855610737508
  2) 22719a4e80526053c143ec944c71090c
  3) ebd92e0b253e114e4d74a603a7295375
  4) bc11f06afb9b27070673471a23ecc6a9
- Discuss the advantages and disadvantages of using multithreading (as used in assignment 01) versus using Hadoop.
- Which kind of task is better suited to Hadoop: an I/O-intensive task or a CPU-intensive task? Why?

If you have any questions, please send an email to bongki@unist.ac.kr
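
As a rough illustration of the early-termination requirement in section 6, the file-existence signal used by TerminationChecker might look like the sketch below. It is not the template's code and rests on several assumptions: it uses the local filesystem through java.nio.file, whereas a real mapper would most likely go through Hadoop's FileSystem API on HDFS. The class name TerminationSketch, the marker name Found.txt, the check interval, and the numeric target that stands in for the hash comparison are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TerminationSketch {

    /** Creates the marker file; returns false if another task created it first. */
    static boolean markFound(Path marker) throws IOException {
        try {
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** True once any task has signalled that the password was found. */
    static boolean isFound(Path marker) {
        return Files.exists(marker);
    }

    /**
     * Scans the half-open range [start, end). The comparison with 'target'
     * stands in for "hash the candidate and compare with the encrypted password".
     * Returns the matching index, or -1 if nothing was found or the scan stopped early.
     */
    static long search(long start, long end, long target, Path marker, long checkInterval)
            throws IOException {
        for (long i = start; i < end; i++) {
            // Poll the marker only occasionally; a filesystem check per candidate is costly.
            if ((i - start) % checkInterval == 0 && isFound(marker)) {
                return -1;
            }
            if (i == target) {
                markFound(marker);
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("termination-sketch");
        Path marker = dir.resolve("Found.txt"); // hypothetical marker name

        long first = search(0, 1000, 421, marker, 100);
        System.out.println("first task result: " + first);

        // The second task sees the marker on its first poll and stops.
        long second = search(1000, 2000, 1500, marker, 100);
        System.out.println("second task result: " + second);
    }
}
```

The check interval trades responsiveness for filesystem load: a smaller interval stops other tasks sooner but polls the marker more often.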