OCR Corpora Early 2025 Publication Thread #3

@ctschroeder

OCR publication thread for early 2025

Do not close this issue until all checkboxes below are complete or have been rescheduled:

List of corpora:

In Processed OCR folder (need chapter divisions, sentence splitting, and full automatic NLP processing like the Bible corpora)

  • Giron Legendes (11 docs)
    • chapter divisions should follow
    • Daughters Zenon 132
      • chapter divisions added/checked
      • pb xml:id numbers corrected
      • lb n tags changed to ed_line n
      • ed_pg n tags changed to ed_page n
      • pb xml:id n tags changed to pb xml:id
      • metadata updated
      • final validation at https://gucorpling.org/gitdox/validate_sgml.py
      • this document should be divided into 3: 1) BNF 132.1 ff. 19-20, 2) Leiden fragment, 3) BNF 132.1 f. 21
    • Daughters Zenon 78
      • chapter divisions added/checked
      • pb xml:id numbers corrected
      • lb n tags changed to ed_line n
      • ed_pg n tags changed to ed_page n
      • pb xml:id n tags changed to pb xml:id
      • metadata updated
      • final validation at https://gucorpling.org/gitdox/validate_sgml.py
    • Daughters Zenon Crawford
      • chapter divisions added/checked
      • pb xml:id numbers corrected
      • lb n tags changed to ed_line n
      • ed_pg n tags changed to ed_page n
      • pb xml:id n tags changed to pb xml:id
      • metadata updated
      • final validation at https://gucorpling.org/gitdox/validate_sgml.py
    • Emperors Daughter 1
    • Emperors Daughter 2
    • Emperors Daughter 3
    • Emperors Daughter 4
    • Eve Serpent
    • Marina BNF
    • Marina Clarendon
    • Sacrifice Abraham

All documents above need to be moved to https://github.com/CopticScriptorium/auto-corpora when done.
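
The two tag renames listed per document above (lb n to ed_line n, ed_pg n to ed_page n) could probably be scripted rather than done by hand. A minimal Python sketch, assuming the tags are plain SGML/XML elements named lb and ed_pg whose attributes should be kept unchanged. `rename_tags` and `RENAMES` are hypothetical names for illustration, not part of GitDox or any existing tooling. The pb xml:id fixes are left out because the checklist does not describe them in enough detail to automate.

```python
import re

# Assumed mapping, read off the checklist items above:
# "lb n" -> "ed_line n" and "ed_pg n" -> "ed_page n".
RENAMES = {"lb": "ed_line", "ed_pg": "ed_page"}

# Match the start of an opening, closing, or self-closing tag whose element
# name is exactly one of the keys. The lookahead prevents partial matches
# on longer element names that merely start with "lb" or "ed_pg".
TAG_RE = re.compile(
    r"<(/?)(" + "|".join(re.escape(name) for name in RENAMES) + r")(?=[\s/>])"
)


def rename_tags(sgml: str) -> str:
    """Rename element names only; attributes and text are left untouched."""
    return TAG_RE.sub(lambda m: "<" + m.group(1) + RENAMES[m.group(2)], sgml)


if __name__ == "__main__":
    sample = '<ed_pg n="5"><lb n="1">text</lb></ed_pg>'
    print(rename_tags(sample))
    # prints: <ed_page n="5"><ed_line n="1">text</ed_line></ed_page>
```

Any output from a script like this would still need the final validation step listed for each document.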

In GitDox

  • timothy.alexandria corpus prayer.athanasius

    • translation span (work by E. Sturgis via @amir-zeldes in progress)
    • entities and identities
    • metadata
  • apocalypse.paul (2)
    - [ ] corpus name needed
    - [ ] other metadata updated
    - possible error in data: the translation on p. 1043 begins with folio 24a, but the OCR Coptic begins in the middle of folio 6a (p. 533)

  • pscyril.alexandria

    • On Mary still in XML mode (auto tagging?)
    • entities and identities
  • pscyril.jerusalem

    • on the cross
      • needs corpus name
      • metadata updated
      • chapter & verse need to be updated in spreadsheet based on open tags in XML
    • on Mary
      • needs corpus name
      • metadata updated
      • chapter & verse need to be updated in spreadsheet based on open tags in XML
  • psepiphanius on Mary

    • translation span
    • entities and identities
    • metadata
  • pschrysostom

    • translation span
    • entities and identities
    • metadata
  • pscelestinus

  • pstimothy.alex

  • psote.psoi

  • timothy.discourse
