Hors

Hors is a data processing tool I made for putting together plain text corpora suitable for use with neural networks.

It is meant to work with dumps from Fimfarchive, such as the ones produced by the Fimfarchive tool.

Usage

hors [OPTIONS]

Commit crimes against human kind, just like God intended.

Optional arguments:
  -h,--help             Show this help message and exit
  -a,--arkdir ARKDIR    Directory where the EPUBs are stored,
                        directory hierarchy does not matter.
  -n,--runame RUNAME    Name of this run.
  -o,--output OUTPUT    Directory where to store the resulting file.
  -v,--verbose          Be verbose and slower.
  -u,--uniq             Filter adjacent matching lines from corpus.
  --overwrite           Overwrite existing corpus(es); useful when working with
                        limited disk space.
  --prefix PREFIX       Prefix to put before each chapter
  --suffix SUFFIX       Suffix to put after each chapter
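
To make the `--uniq`, `--prefix` and `--suffix` options a bit more concrete, here is a minimal Rust sketch of what they describe. The function names `dedup_adjacent` and `wrap_chapter`, as well as the marker strings in `main`, are hypothetical and not taken from the hors source; the real implementation may well differ.

```rust
/// Drop every line that is identical to the line right before it,
/// which is what the `--uniq` help text describes.
/// Hypothetical helper, not taken from hors.
fn dedup_adjacent(text: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    for line in text.lines() {
        // Only keep the line if it differs from the last kept line.
        if out.last() != Some(&line) {
            out.push(line);
        }
    }
    out.join("\n")
}

/// Put `prefix` before and `suffix` after the text of one chapter.
/// Hypothetical helper, not taken from hors.
fn wrap_chapter(chapter: &str, prefix: &str, suffix: &str) -> String {
    format!("{}{}{}", prefix, chapter, suffix)
}

fn main() {
    // The markers below are placeholders chosen for this example only.
    let chapter = "Once upon a time\n***\n***\nThe end";
    let cleaned = dedup_adjacent(chapter);
    println!("{}", wrap_chapter(&cleaned, "[CHAPTER]\n", "\n[END]\n"));
}
```

Only adjacent duplicates are removed, so a line that shows up again later in the chapter is kept.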

It works on my machine

I wrote this tool to work out of the box with the layout I use on my own machine. I open-sourced it because I thought it might be useful to someone someday. Also, to brag that I can use Rust :P

On my machine, things are stored like this:

project_folder:
    - pony:  # This project
        - corp:
            - $(runname):
                - $(timestamp)-$(runname).txt  # Default
                - $(runname).txt  # If overwrite == True
        - archive:  # This is the latest Fimfarchive dump
            - epub:
                - *[$(char)]:
                    - *[$(Author)]:
                        - *[$(Title)-$(id).epub]
            - index.json
            - about.json
            - readme.pdf
    - *foo:  # Other projects
        ...
    - tools:
        - hors.sh  # this is the old hors implementation
        - hors     # ideally, this binary
        - some.r
        - other.py
        - tools.sh

I wrote this tool to work out of the box in that configuration. I also added manual overrides, but they are kind of painful to use. In the end, it's up to you 🤷
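
The output naming in the layout above can be sketched in Rust roughly as follows. This is only an illustration under assumptions: the function `corpus_path` is hypothetical, and the timestamp is passed in as a ready-made string because the exact timestamp format hors uses is not stated here.

```rust
use std::path::{Path, PathBuf};

/// Build the corpus file path following the layout above:
/// `<output>/<runname>/<timestamp>-<runname>.txt` by default, or
/// `<output>/<runname>/<runname>.txt` when overwriting.
/// Hypothetical helper, not taken from hors.
fn corpus_path(output: &Path, run_name: &str, timestamp: &str, overwrite: bool) -> PathBuf {
    let file_name = if overwrite {
        format!("{}.txt", run_name)
    } else {
        format!("{}-{}.txt", timestamp, run_name)
    };
    output.join(run_name).join(file_name)
}

fn main() {
    // "corp", "test" and the timestamp value are placeholders for this example.
    let default_path = corpus_path(Path::new("corp"), "test", "20240101", false);
    let overwrite_path = corpus_path(Path::new("corp"), "test", "20240101", true);
    println!("{}", default_path.display());
    println!("{}", overwrite_path.display());
}
```

With overwriting enabled the file name no longer depends on the timestamp, so repeated runs land on the same file instead of piling up on disk.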

Original implementation

The original hors implementation can be found in the original folder.

It is slow and painful, but it gets the job done after about a day of runtime.

TODO

Add support for working on compressed dumps.
Add unit tests (because why not)
Add option to disable or change prefix and suffix
Add output formatting:
- JSON
- CSV
- ???
Add proper error handling and logging instead of just panicking
Add fancy progress bars
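
For the error handling item, one possible direction is to return `Result` values and report failures in `main` instead of panicking. The sketch below only illustrates that pattern; the `HorsError` type and the `count_entries` function are hypothetical and do not exist in hors.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type, not taken from hors.
#[derive(Debug)]
enum HorsError {
    /// The archive directory does not exist or is not a directory.
    MissingArchive(PathBuf),
    /// Any underlying I/O failure.
    Io(io::Error),
}

impl fmt::Display for HorsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HorsError::MissingArchive(p) => {
                write!(f, "archive directory not found: {}", p.display())
            }
            HorsError::Io(e) => write!(f, "I/O error: {}", e),
        }
    }
}

impl std::error::Error for HorsError {}

impl From<io::Error> for HorsError {
    fn from(e: io::Error) -> Self {
        HorsError::Io(e)
    }
}

/// Count the entries directly inside `arkdir`, returning an error
/// instead of panicking when something goes wrong.
fn count_entries(arkdir: &Path) -> Result<usize, HorsError> {
    if !arkdir.is_dir() {
        return Err(HorsError::MissingArchive(arkdir.to_path_buf()));
    }
    let mut count = 0;
    for entry in fs::read_dir(arkdir)? {
        entry?; // propagate per-entry I/O errors through `From<io::Error>`
        count += 1;
    }
    Ok(count)
}

fn main() {
    // "archive/epub" mirrors the layout described earlier; adjust as needed.
    match count_entries(Path::new("archive/epub")) {
        Ok(n) => println!("found {} entries", n),
        Err(e) => eprintln!("error: {}", e),
    }
}
```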

In the future

Perhaps adding filtering support directly in the program, such as:

filters:
    - by story:
        - by tags:
            - characters
            - series
            - warning
        - length:
            - specific
            - flash ( <1k words)
            - short ( <5k words)
            - small ( <10k words )
            - medium ( 10k - 50k words )
            - long ( >50k words)
            - custom word count
        - rating:
        - popularity
            - highly rated stories ( >80% rating )
            - poorly rated stories ( <20% rating )
            - custom
        - view count
        - content rating
        - publication state
        - completion status
        - publication date
        - chapter count
    - by author:
        - amount of stories
        - word count
        - creation date
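
As an illustration of the length filter in the list above, here is a small Rust sketch that maps a word count to one of the named buckets. None of this exists in hors yet. The `Length` enum and `classify` function are hypothetical, and the boundary handling (exactly 10k and exactly 50k words both count as medium) is an assumption, since the list does not pin it down.

```rust
/// Length buckets from the list above. Hypothetical type, not in hors.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Length {
    Flash,  // fewer than 1k words
    Short,  // fewer than 5k words
    Small,  // fewer than 10k words
    Medium, // 10k to 50k words (both ends included, by assumption)
    Long,   // more than 50k words
}

/// Map a word count to its length bucket.
fn classify(words: u64) -> Length {
    match words {
        0..=999 => Length::Flash,
        1_000..=4_999 => Length::Short,
        5_000..=9_999 => Length::Small,
        10_000..=50_000 => Length::Medium,
        _ => Length::Long,
    }
}

fn main() {
    // Made-up titles and word counts, for demonstration only.
    let stories = [("A", 800u64), ("B", 12_000), ("C", 75_000)];
    let wanted = Length::Medium;
    for &(title, words) in stories.iter() {
        if classify(words) == wanted {
            println!("{} ({} words)", title, words);
        }
    }
}
```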
