Recreation of web search engine using custom C-based crawler, indexer, and querier

20eddibae/WebCrawler

Tiny Search Engine

Eddie Bae (GitHub username: 20eddibae)

This repository implements the three components of CS50’s Tiny Search Engine:

  1. crawler — web crawler that pulls pages from a seed URL
  2. indexer — builds an inverted index from the crawled pages
  3. querier — answers search queries against the index
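To make the term "inverted index" concrete, here is a minimal sketch in C of a structure that maps each word to the documents containing it, with a per-document occurrence count. The names `posting_t`, `entry_t`, `index_add`, `index_count`, and `index_free` are hypothetical and chosen for illustration; the indexer in this repository may be organized differently (for example, on top of the data structures in libcs50).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One (document ID, occurrence count) pair for a word. */
typedef struct posting {
    int doc_id;
    int count;
    struct posting *next;
} posting_t;

/* One word and the list of documents it appears in. */
typedef struct entry {
    char *word;
    posting_t *postings;
    struct entry *next;
} entry_t;

/* Linear search for a word; returns NULL if it is not indexed yet. */
static entry_t *find_entry(entry_t *head, const char *word) {
    for (entry_t *e = head; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) return e;
    }
    return NULL;
}

/* Record one occurrence of `word` in document `doc_id`.
 * Returns 0 on success, -1 if memory allocation fails. */
int index_add(entry_t **head, const char *word, int doc_id) {
    assert(head != NULL && word != NULL);
    entry_t *e = find_entry(*head, word);
    if (e == NULL) {
        e = malloc(sizeof *e);
        if (e == NULL) return -1;
        e->word = malloc(strlen(word) + 1);
        if (e->word == NULL) { free(e); return -1; }
        strcpy(e->word, word);
        e->postings = NULL;
        e->next = *head;
        *head = e;
    }
    for (posting_t *p = e->postings; p != NULL; p = p->next) {
        if (p->doc_id == doc_id) { p->count++; return 0; }
    }
    posting_t *p = malloc(sizeof *p);
    if (p == NULL) return -1;
    p->doc_id = doc_id;
    p->count = 1;
    p->next = e->postings;
    e->postings = p;
    return 0;
}

/* How many times `word` was recorded for `doc_id` (0 if never). */
int index_count(entry_t *head, const char *word, int doc_id) {
    entry_t *e = find_entry(head, word);
    if (e == NULL) return 0;
    for (posting_t *p = e->postings; p != NULL; p = p->next) {
        if (p->doc_id == doc_id) return p->count;
    }
    return 0;
}

/* Release every entry, word, and posting in the index. */
void index_free(entry_t *head) {
    while (head != NULL) {
        entry_t *next = head->next;
        while (head->postings != NULL) {
            posting_t *p = head->postings->next;
            free(head->postings);
            head->postings = p;
        }
        free(head->word);
        free(head);
        head = next;
    }
}
```

Starting from `entry_t *idx = NULL;`, calling `index_add(&idx, "home", 1)` twice and then `index_count(idx, "home", 1)` yields a count of 2. The linked list keeps the sketch short; a hash table keyed by word would be the usual choice once the vocabulary grows.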

Prerequisites

  • A UNIX‐compatible shell (macOS / Linux)
  • make, gcc, standard build tools
  • Internet connection (for crawling)

Build

From the top‐level directory:

# build libcs50 and all three tools
make all

Usage

  1. Crawl
# <pagedir> must either not exist yet or be empty
./crawler/crawler <seedURL> <pagedir> <maxDepth>

Example:

./crawler/crawler http://cs50tse.cs.dartmouth.edu/tse/letters pages 2
  2. Index
mkdir indexdir
./indexer/indexer pages indexdir
  3. Query
./querier/querier indexdir
  4. Clean
make clean
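
The crawl step requires that `<pagedir>` not exist or be empty. Reading that as "either absent or empty", the check can be sketched in C with POSIX `opendir`/`readdir`. The function name `pagedir_is_usable` is hypothetical, and this is an illustration of the precondition, not the crawler's actual validation code.

```c
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `path` does not exist, or exists as an empty directory.
 * Returns false for a non-empty directory, a regular file, or any other
 * error (for example, a permission problem). */
bool pagedir_is_usable(const char *path) {
    DIR *dir = opendir(path);
    if (dir == NULL) {
        /* Only "no such file or directory" counts as usable. */
        return errno == ENOENT;
    }

    bool empty = true;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        /* Every directory lists "." and ".."; ignore them. */
        if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
            empty = false;
            break;
        }
    }
    closedir(dir);
    return empty;
}
```

A caller could run this on the `<pagedir>` argument before fetching the seed URL and exit with a usage error when it returns false.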
