Description
The hemp parameter in FoLiA-stats collects spaced words. It currently breaks on ligatures: in the example below, collection stops just before the 'ij' in 'F r a n k r ij k.'. It also fails to collect the last letter when that letter is followed by a trailing punctuation mark, which happens often.
reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal
Z. F r a n k r ij k.
uitgeoefend. Z. F r a n k r ij k.
F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r
1/ Ligatures should be seen as single characters.
2/ A final character with a trailing punctuation mark should also be collected.
Perhaps both minor issues could be solved by allowing for the 'occasional' two-character sequence within a run of single characters, as found in historically emphasised text.
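To make the suggestion concrete, here is a rough Python sketch of the heuristic. It is not the FoLiA-stats implementation. The function name `collect_spaced_words`, the thresholds `min_parts` and `max_part_len`, and the rule that two-character parts must be outnumbered by single characters are all assumptions made for illustration. It also assumes that a part with trailing ASCII punctuation closes the run and that the punctuation itself is dropped from the output.

```python
import string

# Assumption: only ASCII punctuation is treated as "trailing punctuation".
PUNCT = string.punctuation


def collect_spaced_words(text, min_parts=3, max_part_len=2):
    """Collect letter-spaced words from whitespace-separated text.

    Hypothetical heuristic: a run consists of alphabetic parts of at most
    max_part_len characters. A run is kept only if it has at least
    min_parts parts and its two-character parts are outnumbered by its
    single-character parts (the 'occasional' two-character sequence).
    """
    results = []
    run = []

    def flush():
        # Decide whether the current run qualifies, then reset it.
        singles = sum(1 for part in run if len(part) == 1)
        doubles = len(run) - singles
        if len(run) >= min_parts and doubles < singles:
            results.append("_".join(run))
        run.clear()

    for token in text.split():
        core = token.rstrip(PUNCT)
        ends_run = core != token  # trailing punctuation closes the run
        if core and len(core) <= max_part_len and core.isalpha():
            run.append(core)
            if ends_run:
                flush()
        else:
            flush()
    flush()
    return results


if __name__ == "__main__":
    for line in ("F r a n s c h zal", "uitgeoefend. Z. F r a n k r ij k."):
        print(collect_spaced_words(line))
```

With these assumptions the lines from the grep output above would give `F_r_a_n_s_c_h` and `F_r_a_n_k_r_ij_k`, while the isolated `Z.` is dropped because it forms a run of only one part. Ordinary sequences of short words such as `in de stad` are not collected, since they contain no single characters.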