Skip to content

S3bas7ian-D/File-indexing-system

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

# Tema 2 - File Indexing System

## Description

A keyword-based file indexing system implemented in C. Documents are stored in a
doubly linked list and their keywords are indexed in a multiway trie. Each trie
leaf holds a singly linked list of pointers back to the documents that contain
that keyword. TOPK queries use an integer max-heap with a function-pointer
comparator; a parallel array of document pointers maps extracted scores back to
their owning documents.

---

## Data Structures

**kw_cell / kw_list** - singly linked list of keyword strings owned by a document.
Each node stores a heap-allocated keyword string and a pointer to the next node.

**TCell2 / TL2** - doubly linked list of documents. Each node stores the document
filename (`id`), its relevance score, a pointer to its keyword list, and `pre`/`urm`
pointers for O(1) unlinking. The list is headed by a sentinel node allocated by
`InitL2`, which carries no document data.

**arb_L_Cell / arb_L_L** - singly linked list stored at trie leaf nodes. Each node
holds a raw pointer (`cell`) to a document in the TL2 list. These nodes do not own
the documents; freeing them does not free the document.

**Node / Tree** - one node in the multiway trie. Stores the character (`letter`), a
dynamic array of child pointers sized to `alphabet_size` (26), the current child
count, an `isKeyword` flag set when the node has ever been a keyword endpoint, and
`p`, a pointer to the `arb_L_L` reference list (non-NULL only at active keyword
endpoints).

**THeap** - integer array heap. Stores a flat `int` array `v`, capacity `nrMax`,
current size `nrElem`, and a `TFCmp` function pointer comparator that determines
whether the heap acts as a min-heap or max-heap.

---

## Operations

**ADD** `<filename> <score> <n> <kw1> ... <kwn>` - creates a new document node,
appends it to the TL2 list, builds its keyword list (skipping duplicate keywords
within the same command), and inserts each keyword into the trie. Returns `EXISTS`
if the filename is already present, `OK` otherwise.

**DEL** `<filename>` - finds the document in the list, iterates its keyword list
calling `removeFromTree` for each keyword, frees the keyword list, unlinks the
document from the doubly linked list, and frees it. Returns `NOT FOUND` if the
document does not exist.

**ADDKW** `<filename> <keyword>` - finds the document, checks whether the keyword
is already in its keyword list (no-op if so, still returns `OK`), appends the new
keyword to the list, and inserts it into the trie. Returns `NOT FOUND` if the
document does not exist.

**DELKW** `<filename> <keyword>` - removes a keyword from a document's keyword list
and from the trie via `remove_Keyword`. If the document's keyword list becomes empty
after the removal, the document is also unlinked and freed from the list. Returns
`NOT FOUND` if the document or keyword is not found.

**FIND** `<keyword>` - navigates the trie character by character to the keyword's
leaf node, collects all document IDs from the leaf's reference list into a temporary
array, sorts them with `qsort` using `cmp_str`, and prints the count followed by the
sorted IDs. Prints `NOT FOUND` if the path does not exist, `EMPTY` if the leaf has
no references.

**TOPK** `<keyword> <k>` - navigates to the keyword leaf, counts the documents, then
fills a max-heap with their scores while mirroring document pointers into a parallel
`docs` array. Extracts up to `k` scores one by one; on each extraction it scans the
`docs` array for all documents matching that score and picks the lexicographically
smallest one, marking it as used. Prints the count followed by document IDs in
descending score order, with ties broken alphabetically. Prints `EMPTY` if the
keyword is not found or has no associated documents.

**PRINT** - performs a DFS over the trie starting from the root. At each level,
children are sorted alphabetically using `cmp_children` before recursion. Whenever a
node has a non-NULL reference list, the current path is printed as the keyword,
followed by the document count and sorted document IDs. Prints `EMPTY` if the root
has no children.

**PREFIX** `<prefix>` - navigates to the last character of the prefix in the trie,
then calls `collect_docs_prefix` to recursively gather all unique document pointers
from the entire subtree below that node. Duplicates (same document reachable via
multiple keywords sharing the prefix) are filtered during collection. The resulting
IDs are sorted with `qsort` and printed. Prints `NOT FOUND` if the prefix path does
not exist, `EMPTY` if no documents are found in the subtree.

---

## Functions

**check_command** - scans the command string with `strstr` and returns an integer
code (1–8) identifying the command type. ADDKW and DELKW are checked before ADD and
DEL to avoid false prefix matches.

**InitL2** - allocates and returns a sentinel head node for the document list with
all fields zeroed/NULL. The sentinel is never freed until `cleanup`.

**createNode** - allocates a trie node, sets its letter, allocates the children
pointer array to `maxChildren` capacity, and initialises all counters and pointers.

**addChild** - appends a child pointer to a parent's children array if capacity
allows. Returns 0 if the array is full.

**insertWord** - walks the trie character by character, creating missing nodes with
`createNode` and attaching them with `addChild`. At the endpoint, sets `isKeyword`
to 1 and appends a new `arb_L_Cell` referencing the document to the leaf's list.

**pruneNode** - post-order recursive traversal that removes any non-root node whose
`isKeyword` flag is 0, whose reference list is NULL, and which has no remaining
children. Called after every deletion to keep the trie minimal.

**removeFromTree** - navigates to the keyword leaf, finds and unlinks the specific
document reference from the leaf's `arb_L_L` list, frees the list node, then calls
`pruneNode` on the root to remove any nodes that are now dead.

**remove_Keyword** - removes a keyword string from a document's `kw_list` (freeing
the string and the list node) and delegates trie cleanup to `removeFromTree`.

**ADD** - parses filename, score, and keyword count with `sscanf`. Walks the list to
check for duplicates. Builds the new `TCell2` node, appends it, then tokenises a
copy of the command string with `strtok` to iterate the keywords, skipping
duplicates within the same command and inserting each unique keyword into the trie.

**DEL** - parses the filename, finds the document, iterates its keyword list calling
`removeFromTree` for each entry, frees each keyword node, unlinks the document node
from the doubly linked list, and frees it.

**ADDKW** - parses filename and keyword. Walks the keyword list to check for an
existing entry (returns 1 without re-inserting if found). Otherwise allocates a new
`kw_cell`, appends it to the document's list, and calls `insertWord`.

**DELKW** - parses filename and keyword, finds the document, calls `remove_Keyword`,
and if the document's keyword list is now empty, unlinks and frees the document from
the list.

**find** - navigates the trie to the keyword endpoint. Collects document IDs into a
temporary pointer array, sorts with `cmp_str`, and prints the result.

**TOPK** - navigates to the keyword endpoint, allocates a `TL2` pointer array and a
max-heap of the same size, fills both in a single pass over the leaf list, then
extracts up to `k` elements. Ties in score are resolved by linear scan over the
pointer array to find the lexicographically smallest unused document.

**PRINT** - entry point for the DFS traversal. Returns `EMPTY` if the root has no
children, otherwise calls `printDFS` with an empty path buffer.

**printDFS** - recursive DFS helper. Prints the accumulated keyword and sorted
document IDs whenever a node has a non-NULL reference list. Sorts the children array
in-place with `cmp_children` before descending so output is alphabetical.

**PREFIX** - navigates to the prefix endpoint, then calls `collect_docs_prefix` to
gather all unique document pointers in the subtree. Sorts the resulting IDs and
prints them.

**collect_docs_prefix** - recursive helper that visits every node in a subtree.
For each node with a non-NULL reference list it checks each document pointer against
the already-collected array and appends it only if not already present. Grows the
array with `realloc` as needed (doubles capacity each time).

**cmp_str** - `qsort`-compatible comparator for `char *` pointers; delegates to
`strcmp`.

**cmp_children** - `qsort`-compatible comparator for `Tree` pointers; compares by
`letter` field to sort trie children alphabetically.

**RetMinHeap / RetMaxHeap** - comparator functions for `THeap`. `RetMinHeap` returns
`a < b` (min-heap behaviour); `RetMaxHeap` returns `a > b` (max-heap behaviour).

**AlocHeap** - allocates a `THeap` struct and its internal integer array, sets
capacity, element count to 0, and stores the comparator.

**DestroyHeap** - frees the internal array and the struct, then sets the pointer to
NULL via a double pointer.

**swap** - exchanges two `int` values in place; used by `shiftUp` and `shiftDown`.

**shiftUp** - bubble-up step: compares `v[i]` with its parent and swaps upward while
the comparator holds. Used after `InsertHeap`.

**shiftDown** - sift-down step: compares the current node with its two children,
swaps with the winner while the comparator holds. Used after `ExtractTop`.

**InsertHeap** - places a value at the end of the array and calls `shiftUp`. Returns
0 if the heap is full.

**ExtractTop** - saves `v[0]`, moves the last element to the top, decrements the
count, and calls `shiftDown`. Returns 0 if empty.

**PeekTop** - reads `v[0]` without modifying the heap. Returns 0 if empty.

**printHeap** - prints the heap contents via in-order traversal using index
arithmetic; call with `pos = 0` for the whole heap. Intended for debugging.

**free_kw_list** - iterates a `kw_list`, freeing each keyword string and list node.

**free_doc** - frees a single `TCell2` by calling `free_kw_list` on its keyword
list, then freeing the `id` string and the node itself.

**free_doc_list** - iterates the entire document list from the sentinel head,
freeing each node's `id` string, keyword list, and the node itself.

**free_leaf_list** - iterates an `arb_L_L`, freeing each wrapper node (but not the
documents they point to, since those are owned by the document list).

**free_tree** - post-order recursive traversal: frees all children first, then calls
`free_leaf_list` on the node's reference list, frees the children array, and frees
the node itself.

**cleanup** - master teardown called at the end of `main`. Frees the trie first
(leaf reference lists only, not the documents), then frees the document list
including all keyword lists and strings.

---

## Files

- `tema.h` — struct definitions (`kw_cell`, `TCell2`, `arb_L_Cell`, `Node`,
  `THeap`) and all function declarations.
- `f-tema.c` — all function implementations.
- `tema2.c` — `main`; opens `indexare.in` / `indexare.out`, reads the command count,
  dispatches each command through `check_command`, and calls `cleanup` before exit.

---

## Input / Output

Commands are read from `indexare.in`. The first line contains the number of commands
`n`. Each subsequent line is one command. Results are written to `indexare.out`.

| Command | Output on success | Output on failure |
|---|---|---|
| `ADD f s n kw…` | `OK` | `EXISTS` |
| `DEL f` | `OK` | `NOT FOUND` |
| `ADDKW f kw` | `OK` | `NOT FOUND` |
| `DELKW f kw` | `OK` | `NOT FOUND` |
| `FIND kw` | `<count> <id> …` | `NOT FOUND` / `EMPTY` |
| `TOPK kw k` | `<count> <id> …` | `EMPTY` |
| `PRINT` | per-keyword lines | `EMPTY` |
| `PREFIX p` | `<count> <id> …` | `NOT FOUND` / `EMPTY` |

About

A file indexing system program in C.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages