I'm trying to adapt TransformerSum to a non-English custom dataset, and I'm quite confused by this code in extractive.py:
```python
if tokenized:
    src_txt = [
        " ".join([token.text for token in sentence if str(token) != "."]) + "."
        for sentence in input_sentences
    ]
else:
    nlp = English()
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)

    src_txt = [
        " ".join([token.text for token in nlp(sentence) if str(token) != "."])
        + "."
        for sentence in input_sentences
    ]
```
- Why separate the words with spaces, when the resulting string is then tokenized again by the tokenizer from the `transformers` library? I assume those tokenizers are not usually trained on pre-tokenized text, and neither are the pretrained models.
- Why remove the space before "." characters, but not anywhere else?
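To make the second point concrete, here is a minimal sketch of what the joining does to punctuation. It uses a crude regex tokenizer as a stand-in for spaCy (so the exact token boundaries are an assumption), but it shows the behavior I mean: every token gets a space before it except the trailing ".", so other punctuation ends up with spaces around it.

```python
import re

def simple_tokenize(sentence):
    # crude stand-in for spaCy's tokenizer: split into words and punctuation
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "Hello, world (or moon)."
tokens = simple_tokenize(sentence)

# mirrors the snippet: drop "." tokens, space-join, re-append "."
joined = " ".join(t for t in tokens if t != ".") + "."
print(joined)  # Hello , world ( or moon ).
```

So "," and the parentheses keep a space before them, while the sentence-final "." does not, and that asymmetry is what I don't understand.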
Thanks for any explanations.
(For reference, the snippet is TransformerSum/src/extractive.py, lines 1093 to 1107 at commit 15bd11d.)