639 allow the user to define input and output file names #734

Open
mmutic wants to merge 12 commits into develop from 639-allow-the-user-to-define-input-and-output-file-names

mmutic (Collaborator) commented Aug 5, 2024

Description

This feature lets users supply two YAML files: input_settings.yml and results_settings.yml. In input_settings.yml, users can specify the names of their input files as well as the paths and names of the input folders. A new function, configure_input_settings, merges these names with the default names and adds the resulting dictionary to the setup dictionary.
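
The merge-with-defaults behavior described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the Julia code in this PR: the helper name configure_input_names, the "input_names" key, and every default name below are hypothetical placeholders.

```python
# Hypothetical defaults; the real GenX default names are not listed in this description.
DEFAULT_INPUT_NAMES = {
    "system_folder": "system",
    "demand": "Demand_data.csv",
    "fuel": "Fuels_data.csv",
}


def configure_input_names(user_settings, defaults=DEFAULT_INPUT_NAMES):
    """Return the defaults overridden by whatever the user specified."""
    merged = dict(defaults)             # copy so the defaults are never mutated
    merged.update(user_settings or {})  # user-supplied names win over defaults
    return merged


# Only the demand file is renamed; every other name falls back to its default.
setup = {}
setup["input_names"] = configure_input_names({"demand": "my_demand.parquet"})
```

With this approach, a user only needs to list the names they want to change in the YAML file.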

The function load_dataframe(), which is called to open all files in GenX, now uses DuckDB (per Greg's suggestion). DuckDB can open CSV, Parquet, and JSON files, all of which can also be compressed (e.g. .gz), so input files can now be any of those types.
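
As a rough illustration of reading by file type, the sketch below picks a DuckDB table function from the file extension and builds the corresponding query string. It is written in Python for illustration only and does not show how load_dataframe() actually calls DuckDB; the helper name duckdb_read_query is hypothetical. The table functions read_csv_auto, read_parquet, and read_json_auto are DuckDB SQL functions.

```python
def duckdb_read_query(path):
    """Build a DuckDB SELECT statement for a CSV, Parquet, or JSON file."""
    # Ignore a trailing ".gz" when deciding the file type.
    stem = path[:-3] if path.endswith(".gz") else path
    if stem.endswith(".csv"):
        reader = "read_csv_auto"
    elif stem.endswith(".parquet"):
        reader = "read_parquet"
    elif stem.endswith(".json"):
        reader = "read_json_auto"
    else:
        raise ValueError(f"Unsupported input file type: {path}")
    # Double any single quotes so the path is a valid SQL string literal.
    escaped = path.replace("'", "''")
    return f"SELECT * FROM {reader}('{escaped}')"


query = duckdb_read_query("system/Demand_data.csv.gz")
```

The resulting string could then be passed to a DuckDB connection to obtain a table.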

The file results_settings.yml can contain the desired names of the results files. Names in the YAML file can be entered with or without a file extension. Two new keys can be added to genx_settings.yml: ResultsFileType and ResultsCompressionType, both of which default to "auto_detect". Both keys are used in the function write_output_files().

The function write_output_files() has been added to write_outputs.jl. It uses DuckDB to save files according to a specified file type, which can be .csv, .csv.gz, .parquet, .json, or .json.gz. If the file type is set to "auto_detect", the function checks whether the file name contains an extension and uses it; if no extension is present, .csv is used. If the file type is set explicitly (e.g. .parquet) but that extension is not present in the file name, the extension is added. A compression type can also be specified: .gz for CSV and JSON files, and -snappy or -zstd for Parquet files. The compression type can also be set to "auto_detect".
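
The file name resolution rules above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Julia implementation: the helper name resolve_output_name is hypothetical, only gzip compression is modeled (the Parquet codecs are left out), and passing compression="none" to switch compression off is an assumption.

```python
KNOWN_TYPES = (".csv", ".parquet", ".json")


def resolve_output_name(name, filetype="auto_detect", compression="auto_detect"):
    """Return the file name to write, following the auto-detect rules."""
    # Split off a trailing ".gz" so "x.csv.gz" is seen as type ".csv" plus gzip.
    stem, name_has_gz = (name[:-3], True) if name.endswith(".gz") else (name, False)
    detected = next((ext for ext in KNOWN_TYPES if stem.endswith(ext)), None)

    type_has_gz = False
    if filetype == "auto_detect":
        filetype = detected or ".csv"   # no extension present: fall back to CSV
    elif filetype.endswith(".gz"):
        filetype, type_has_gz = filetype[:-3], True

    if detected != filetype:
        stem += filetype                # add the requested extension if missing

    if compression == "auto_detect":
        use_gz = name_has_gz or type_has_gz
    else:
        use_gz = compression == ".gz"
    if use_gz and filetype == ".parquet":
        raise ValueError(".gz applies to CSV and JSON files, not Parquet")
    return stem + (".gz" if use_gz else "")


print(resolve_output_name("capacity"))                       # capacity.csv
print(resolve_output_name("capacity", filetype=".parquet"))  # capacity.parquet
print(resolve_output_name("capacity.json.gz"))               # capacity.json.gz
```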

The goal is for write_output_files to replace all instances in which CSV.write is currently used. This is a work in progress and is only present in some places in GenX at the moment.

An example, 10_three_zones_define_input, contains the aforementioned YAML files.

Edit 9/11/24: Multistage inputs can now also be defined using input_settings.yml. The structure is slightly different: it uses indentation to create a separate subdictionary for each input stage (see 6_three_zones_w_multistage for an example YAML file). Multistage results file names can also be changed using results_settings.yml, and there the file structure is the same as in single stage. I deleted 10_three_zones_define_input because it was identical to 1_three_zones, and added input_settings.yml and results_settings.yml to 1_three_zones instead. The function write_output_files now replaces CSV.write() in almost all instances. The documentation has also been updated to reflect the new capabilities.

Notes from GenX Meeting 9/12

  1. The function write_output_file() could potentially be split into smaller functions as it is very long.
  2. The headers in NSE and PowerBalance are currently incompatible with DuckDB as they have repeated column names.
  3. Being able to read and output databases using DuckDB could be added (I think this can be done relatively easily).
  4. The structure of input_settings.yml for multistage could be altered to have global files. The last two commits on this PR (done on 9/11 and 9/12) are the ones involving multistage, so this PR can be split into single and multistage PRs.
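
To illustrate note 2, the sketch below shows one possible way to make repeated column names unique before writing. This is a hypothetical workaround written in Python, not something implemented or agreed on in this PR; the helper name make_unique and the sample headers are placeholders.

```python
from collections import Counter


def make_unique(headers):
    """Append a numeric suffix to the second and later occurrences of a name."""
    seen = Counter()
    unique = []
    for name in headers:
        seen[name] += 1
        # The first occurrence keeps its name; later ones become name_2, name_3, ...
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


headers = make_unique(["Zone", "1", "1", "2"])
```

Note that this simple scheme does not guard against a suffixed name colliding with a column that already exists under that name.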

Side note, not brought up in the meeting: the results files specific to multistage (capacities_multi_stage, etc.) have not been tested with write_output_file(); the code for them is present but commented out.

What type of PR is this? (check all applicable)

  • Feature

Related Tickets & Documents

Issue #639

Checklist

  • Code changes are sufficiently documented; i.e. new functions contain docstrings and .md files under /docs/src have been updated if necessary.
  • The latest changes on the target branch have been incorporated, so that any conflicts are taken care of before merging. This can be accomplished either by merging in the target branch (e.g. 'git merge develop') or by rebasing on top of the target branch (e.g. 'git rebase develop'). Please do not hesitate to reach out to the GenX development team if you need help with this.
  • Code has been tested to ensure all functionality works as intended.
  • CHANGELOG.md has been updated (if this is a 'notable' change).
  • I consent to the release of this PR's code under the GNU General Public license.

How this can be tested

Test functions are still being written. For now, testing can be done by altering the input and results YAML files in example 10 and checking that the results match what is expected.

Post-approval checklist for GenX core developers

After the PR is approved

  • Check that the latest changes on the target branch are incorporated, either via merge or rebase
  • Remember to squash and merge if incorporating into develop


Labels

enhancement New feature or request


5 participants