docx2txt - unwrapping zip - fails and crashes #48

@ventz

File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 284, in load
    docs += self.process_pages(
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 439, in process_pages
    doc = self.process_page(
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 479, in process_page
    attachment_texts = self.process_attachment(page["id"], ocr_languages)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 557, in process_attachment
    text = title + self.process_doc(absolute_url)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 657, in process_doc
    return docx2txt.process(file_data)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/docx2txt/docx2txt.py", line 88, in process
    text += xml2text(zipf.read(doc_xml))
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1464, in read
    with self.open(name, "r", pwd) as fp:
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1503, in open
    zinfo = self.getinfo(name)
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1430, in getinfo
    raise KeyError(
KeyError: "There is no item named 'word/document.xml' in the archive"

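docx2txt.process() crashes with an unhandled KeyError when the archive has no 'word/document.xml' member. The read should probably be guarded like this:

if doc_xml not in zipf.namelist():
    # Handle the missing file: skip it, raise a soft error, etc.
    pass
else:
    text += xml2text(zipf.read(doc_xml))
...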
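
For illustration, here is a minimal, standalone sketch of that guard using only the standard-library zipfile module. The helper names read_member_or_none and make_zip are hypothetical and not part of docx2txt; the member name is the one from the KeyError above. It only shows the membership check, not how docx2txt should report the problem.

```python
import io
import zipfile
from typing import Dict, Optional

# Member name taken from the KeyError in the traceback above.
DOC_XML = "word/document.xml"


def read_member_or_none(zipf: zipfile.ZipFile, name: str) -> Optional[bytes]:
    """Hypothetical helper: return the bytes of `name`, or None if it is absent."""
    if name not in zipf.namelist():
        # Missing member: let the caller decide whether to skip or report it.
        return None
    return zipf.read(name)


def make_zip(members: Dict[str, bytes]) -> io.BytesIO:
    """Hypothetical helper: build an in-memory zip archive for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for member_name, data in members.items():
            zf.writestr(member_name, data)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # An archive that is a valid zip but has no word/document.xml.
    with zipfile.ZipFile(make_zip({"other.txt": b"not a docx"})) as zf:
        data = read_member_or_none(zf, DOC_XML)
        if data is None:
            print("skipping: no", DOC_XML)
        else:
            print("read", len(data), "bytes")
```

Until something like this lands, a caller can also wrap the docx2txt.process(file_data) call in try/except KeyError and skip the attachment, since that is the exception the traceback shows.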
