I'm trying to adapt TransformerSum to a non-English custom dataset, and I'm quite confused by this code in extractive.py:
```python
if tokenized:
    src_txt = [
        " ".join([token.text for token in sentence if str(token) != "."]) + "."
        for sentence in input_sentences
    ]
else:
    nlp = English()
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)

    src_txt = [
        " ".join([token.text for token in nlp(sentence) if str(token) != "."])
        + "."
        for sentence in input_sentences
    ]
```
- Why separate the words with spaces, when the resulting string is then tokenized again by the tokenizer from the `transformers` library? I assume those tokenizers are not usually trained on pre-tokenized text, and neither are the pretrained models.
- Why remove the space before "." characters, but not anywhere else?
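To make the second point concrete, here is a minimal sketch of what the joining does to punctuation. It uses a crude regex tokenizer as a stand-in for spaCy (so the exact token boundaries are an assumption), but it shows the behavior I mean: every token gets a space before it except the trailing ".", so other punctuation ends up with spaces around it.

```python
import re

def simple_tokenize(sentence):
    # crude stand-in for spaCy's tokenizer: split into words and punctuation
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "Hello, world (or moon)."
tokens = simple_tokenize(sentence)

# mirrors the snippet: drop "." tokens, space-join, re-append "."
joined = " ".join(t for t in tokens if t != ".") + "."
print(joined)  # Hello , world ( or moon ).
```

So "," and the parentheses keep a space before them, while the sentence-final "." does not, and that asymmetry is what I don't understand.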
Thanks for any explanations.
(For reference, the snippet is TransformerSum/src/extractive.py, lines 1093 to 1107 at commit 15bd11d.)