Operation 'hemp' parameter in FoLiA-stats #29

@martinreynaert

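The hemp parameter in FoLiA-stats collects spaced-out words, i.e. words printed with a space between the letters. It currently breaks on ligatures: in the example below, 'F r a n k r ij k.' is collected only as 'F_r_a_n_k_r'. It also fails to collect the last letter when that letter is followed by a punctuation mark, which happens often.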

reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal
Z. F r a n k r ij k.
uitgeoefend. Z. F r a n k r ij k.
F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r

1/ ligatures should be seen as single characters.
2/ a final character with a trailing punctuation mark should also be collected.

Perhaps both issues could be solved by allowing the occasional two-character token within a run of single characters, since such runs are typical of historically emphasised (letter-spaced) text.
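
To make the suggestion concrete, here is a minimal Python sketch of that heuristic. It is not the FoLiA-stats implementation; the names `classify`, `collect_hemp` and `min_singles` are hypothetical, and it assumes whitespace-separated tokens as they appear in the grep output above. A token counts as part of a spaced word if, after stripping trailing punctuation, it is one or two letters long; a token that carried punctuation closes the run.

```python
import string


def classify(token):
    """Return (core, ends_run) if the token may be part of a spaced word, else None.

    Assumption: a part is one or two letters, optionally followed by punctuation.
    """
    core = token.rstrip(string.punctuation)
    had_punctuation = core != token
    if core and core.isalpha() and len(core) <= 2:
        return core, had_punctuation
    return None


def collect_hemp(tokens, min_singles=2):
    """Join runs of spaced-out letters with '_' (hypothetical helper).

    A run is kept only if it has more than one part and at least
    `min_singles` single-letter parts, so isolated short tokens are ignored.
    """
    results, run, singles = [], [], 0

    def flush():
        nonlocal run, singles
        if len(run) > 1 and singles >= min_singles:
            results.append("_".join(run))
        run, singles = [], 0

    for token in tokens:
        part = classify(token)
        if part is None:
            flush()  # an ordinary word ends the current run
            continue
        core, ends_run = part
        run.append(core)
        singles += 1 if len(core) == 1 else 0
        if ends_run:
            flush()  # trailing punctuation closes the run
    flush()
    return results


print(collect_hemp("uitgeoefend. Z. F r a n k r ij k.".split()))
print(collect_hemp("F r a n s c h zal".split()))
```

One known weakness of this sketch: an ordinary two-letter word directly after a spaced word (without intervening punctuation) would be absorbed into the run, so a real fix would probably need an extra check there.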
