S3bas7ian-D/File-indexing-system
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
# Tema 2 - File Indexing System ## Description A keyword-based file indexing system implemented in C. Documents are stored in a doubly linked list and their keywords are indexed in a multiway trie. Each trie leaf holds a singly linked list of pointers back to the documents that contain that keyword. TOPK queries use an integer max-heap with a function-pointer comparator; a parallel array of document pointers maps extracted scores back to their owning documents. --- ## Data Structures **kw_cell / kw_list** - singly linked list of keyword strings owned by a document. Each node stores a heap-allocated keyword string and a pointer to the next node. **TCell2 / TL2** - doubly linked list of documents. Each node stores the document filename (`id`), its relevance score, a pointer to its keyword list, and `pre`/`urm` pointers for O(1) unlinking. The list is headed by a sentinel node allocated by `InitL2`, which carries no document data. **arb_L_Cell / arb_L_L** - singly linked list stored at trie leaf nodes. Each node holds a raw pointer (`cell`) to a document in the TL2 list. These nodes do not own the documents; freeing them does not free the document. **Node / Tree** - one node in the multiway trie. Stores the character (`letter`), a dynamic array of child pointers sized to `alphabet_size` (26), the current child count, an `isKeyword` flag set when the node has ever been a keyword endpoint, and `p`, a pointer to the `arb_L_L` reference list (non-NULL only at active keyword endpoints). **THeap** - integer array heap. Stores a flat `int` array `v`, capacity `nrMax`, current size `nrElem`, and a `TFCmp` function pointer comparator that determines whether the heap acts as a min-heap or max-heap. --- ## Operations **ADD** `<filename> <score> <n> <kw1> ... <kwn>` - creates a new document node, appends it to the TL2 list, builds its keyword list (skipping duplicate keywords within the same command), and inserts each keyword into the trie. Returns `EXISTS` if the filename is already present, `OK` otherwise. **DEL** `<filename>` - finds the document in the list, iterates its keyword list calling `removeFromTree` for each keyword, frees the keyword list, unlinks the document from the doubly linked list, and frees it. Returns `NOT FOUND` if the document does not exist. **ADDKW** `<filename> <keyword>` - finds the document, checks whether the keyword is already in its keyword list (no-op if so, still returns `OK`), appends the new keyword to the list, and inserts it into the trie. Returns `NOT FOUND` if the document does not exist. **DELKW** `<filename> <keyword>` - removes a keyword from a document's keyword list and from the trie via `remove_Keyword`. If the document's keyword list becomes empty after the removal, the document is also unlinked and freed from the list. Returns `NOT FOUND` if the document or keyword is not found. **FIND** `<keyword>` - navigates the trie character by character to the keyword's leaf node, collects all document IDs from the leaf's reference list into a temporary array, sorts them with `qsort` using `cmp_str`, and prints the count followed by the sorted IDs. Prints `NOT FOUND` if the path does not exist, `EMPTY` if the leaf has no references. **TOPK** `<keyword> <k>` - navigates to the keyword leaf, counts the documents, then fills a max-heap with their scores while mirroring document pointers into a parallel `docs` array. Extracts up to `k` scores one by one; on each extraction it scans the `docs` array for all documents matching that score and picks the lexicographically smallest one, marking it as used. Prints the count followed by document IDs in descending score order, with ties broken alphabetically. Prints `EMPTY` if the keyword is not found or has no associated documents. **PRINT** - performs a DFS over the trie starting from the root. At each level, children are sorted alphabetically using `cmp_children` before recursion. Whenever a node has a non-NULL reference list, the current path is printed as the keyword, followed by the document count and sorted document IDs. Prints `EMPTY` if the root has no children. **PREFIX** `<prefix>` - navigates to the last character of the prefix in the trie, then calls `collect_docs_prefix` to recursively gather all unique document pointers from the entire subtree below that node. Duplicates (same document reachable via multiple keywords sharing the prefix) are filtered during collection. The resulting IDs are sorted with `qsort` and printed. Prints `NOT FOUND` if the prefix path does not exist, `EMPTY` if no documents are found in the subtree. --- ## Functions **check_command** - scans the command string with `strstr` and returns an integer code (1–8) identifying the command type. ADDKW and DELKW are checked before ADD and DEL to avoid false prefix matches. **InitL2** - allocates and returns a sentinel head node for the document list with all fields zeroed/NULL. The sentinel is never freed until `cleanup`. **createNode** - allocates a trie node, sets its letter, allocates the children pointer array to `maxChildren` capacity, and initialises all counters and pointers. **addChild** - appends a child pointer to a parent's children array if capacity allows. Returns 0 if the array is full. **insertWord** - walks the trie character by character, creating missing nodes with `createNode` and attaching them with `addChild`. At the endpoint, sets `isKeyword` to 1 and appends a new `arb_L_Cell` referencing the document to the leaf's list. **pruneNode** - post-order recursive traversal that removes any non-root node whose `isKeyword` flag is 0, whose reference list is NULL, and which has no remaining children. Called after every deletion to keep the trie minimal. **removeFromTree** - navigates to the keyword leaf, finds and unlinks the specific document reference from the leaf's `arb_L_L` list, frees the list node, then calls `pruneNode` on the root to remove any nodes that are now dead. **remove_Keyword** - removes a keyword string from a document's `kw_list` (freeing the string and the list node) and delegates trie cleanup to `removeFromTree`. **ADD** - parses filename, score, and keyword count with `sscanf`. Walks the list to check for duplicates. Builds the new `TCell2` node, appends it, then tokenises a copy of the command string with `strtok` to iterate the keywords, skipping duplicates within the same command and inserting each unique keyword into the trie. **DEL** - parses the filename, finds the document, iterates its keyword list calling `removeFromTree` for each entry, frees each keyword node, unlinks the document node from the doubly linked list, and frees it. **ADDKW** - parses filename and keyword. Walks the keyword list to check for an existing entry (returns 1 without re-inserting if found). Otherwise allocates a new `kw_cell`, appends it to the document's list, and calls `insertWord`. **DELKW** - parses filename and keyword, finds the document, calls `remove_Keyword`, and if the document's keyword list is now empty, unlinks and frees the document from the list. **find** - navigates the trie to the keyword endpoint. Collects document IDs into a temporary pointer array, sorts with `cmp_str`, and prints the result. **TOPK** - navigates to the keyword endpoint, allocates a `TL2` pointer array and a max-heap of the same size, fills both in a single pass over the leaf list, then extracts up to `k` elements. Ties in score are resolved by linear scan over the pointer array to find the lexicographically smallest unused document. **PRINT** - entry point for the DFS traversal. Returns `EMPTY` if the root has no children, otherwise calls `printDFS` with an empty path buffer. **printDFS** - recursive DFS helper. Prints the accumulated keyword and sorted document IDs whenever a node has a non-NULL reference list. Sorts the children array in-place with `cmp_children` before descending so output is alphabetical. **PREFIX** - navigates to the prefix endpoint, then calls `collect_docs_prefix` to gather all unique document pointers in the subtree. Sorts the resulting IDs and prints them. **collect_docs_prefix** - recursive helper that visits every node in a subtree. For each node with a non-NULL reference list it checks each document pointer against the already-collected array and appends it only if not already present. Grows the array with `realloc` as needed (doubles capacity each time). **cmp_str** - `qsort`-compatible comparator for `char *` pointers; delegates to `strcmp`. **cmp_children** - `qsort`-compatible comparator for `Tree` pointers; compares by `letter` field to sort trie children alphabetically. **RetMinHeap / RetMaxHeap** - comparator functions for `THeap`. `RetMinHeap` returns `a < b` (min-heap behaviour); `RetMaxHeap` returns `a > b` (max-heap behaviour). **AlocHeap** - allocates a `THeap` struct and its internal integer array, sets capacity, element count to 0, and stores the comparator. **DestroyHeap** - frees the internal array and the struct, then sets the pointer to NULL via a double pointer. **swap** - exchanges two `int` values in place; used by `shiftUp` and `shiftDown`. **shiftUp** - bubble-up step: compares `v[i]` with its parent and swaps upward while the comparator holds. Used after `InsertHeap`. **shiftDown** - sift-down step: compares the current node with its two children, swaps with the winner while the comparator holds. Used after `ExtractTop`. **InsertHeap** - places a value at the end of the array and calls `shiftUp`. Returns 0 if the heap is full. **ExtractTop** - saves `v[0]`, moves the last element to the top, decrements the count, and calls `shiftDown`. Returns 0 if empty. **PeekTop** - reads `v[0]` without modifying the heap. Returns 0 if empty. **printHeap** - prints the heap contents via in-order traversal using index arithmetic; call with `pos = 0` for the whole heap. Intended for debugging. **free_kw_list** - iterates a `kw_list`, freeing each keyword string and list node. **free_doc** - frees a single `TCell2` by calling `free_kw_list` on its keyword list, then freeing the `id` string and the node itself. **free_doc_list** - iterates the entire document list from the sentinel head, freeing each node's `id` string, keyword list, and the node itself. **free_leaf_list** - iterates an `arb_L_L`, freeing each wrapper node (but not the documents they point to, since those are owned by the document list). **free_tree** - post-order recursive traversal: frees all children first, then calls `free_leaf_list` on the node's reference list, frees the children array, and frees the node itself. **cleanup** - master teardown called at the end of `main`. Frees the trie first (leaf reference lists only, not the documents), then frees the document list including all keyword lists and strings. --- ## Files - `tema.h` — struct definitions (`kw_cell`, `TCell2`, `arb_L_Cell`, `Node`, `THeap`) and all function declarations. - `f-tema.c` — all function implementations. - `tema2.c` — `main`; opens `indexare.in` / `indexare.out`, reads the command count, dispatches each command through `check_command`, and calls `cleanup` before exit. --- ## Input / Output Commands are read from `indexare.in`. The first line contains the number of commands `n`. Each subsequent line is one command. Results are written to `indexare.out`. | Command | Output on success | Output on failure | |---|---|---| | `ADD f s n kw…` | `OK` | `EXISTS` | | `DEL f` | `OK` | `NOT FOUND` | | `ADDKW f kw` | `OK` | `NOT FOUND` | | `DELKW f kw` | `OK` | `NOT FOUND` | | `FIND kw` | `<count> <id> …` | `NOT FOUND` / `EMPTY` | | `TOPK kw k` | `<count> <id> …` | `EMPTY` | | `PRINT` | per-keyword lines | `EMPTY` | | `PREFIX p` | `<count> <id> …` | `NOT FOUND` / `EMPTY` |