Skip to content

feat(db): extract get_trait_data_pft() as standalone file-I/O-free function#3978

Open
omkarrr2533 wants to merge 4 commits into
PecanProject:developfrom
omkarrr2533:gsoc/get-trait-data-pft-standalone
Open

feat(db): extract get_trait_data_pft() as standalone file-I/O-free function#3978
omkarrr2533 wants to merge 4 commits into
PecanProject:developfrom
omkarrr2533:gsoc/get-trait-data-pft-standalone

Conversation

@omkarrr2533
Copy link
Copy Markdown
Member

get.trait.data.pft() currently queries the DB, computes trait_data and
prior_distns, saves them to disk, then throws them away — forcing
run.meta.analysis.pft() to immediately reload the same files.

This PR cuts that first link in the chain:

  • Adds get_trait_data_pft(): same query logic, no disk reads or writes,
    returns list(trait_data, prior_distns, pft_info) directly
  • Patches the existing wrapper to attach trait_data and prior_distns
    to the returned pft — so downstream functions can use them in-memory
    without changing anything for callers that don't care
  • All existing behavior unchanged: dir.create, save(), CSV output,
    dbfile.insert(), caching all stay exactly as they were

Tests added: 3 validation tests (fire without DB), 8 DB-backed tests
including an end-to-end equivalence check that verifies the standalone
returns identical objects to what the wrapper writes to disk.

Part of GSoC 2026 Week 1 — follows the same pattern as
meta_analysis_standalone() which was already extracted this way.

@omkarrr2533 omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from fd354ca to 8525b71 Compare May 11, 2026 14:49
@mdietze
Copy link
Copy Markdown
Member

mdietze commented May 11, 2026

At a high level, I'm pretty confused about the relationship between get.trait.data.pft and get_trait_data_pft as neither calls the other, so which is the wrapper around the other? Instead of a thin backwards-compatible wrapper, we seems to have two implementations of the the same thing, which isn't what we're looking for.

Also, nothing seems to be calling get_trait_data_pft yet. If this is the intended replacement, it seems like you should be swapping out the call in get.trait.data

@S1DDHEY
Copy link
Copy Markdown
Contributor

S1DDHEY commented May 11, 2026

A straightforward fix that I could think of is get.trait.data.pft() should be refactored to call get_trait_data_pft() for the query part, then add the file I/O / dbfile.insert() on top of its result.

@omkarrr2533
Copy link
Copy Markdown
Member Author

Understood the wrapper needs to actually call get_trait_data_pft() for the query part rather than duplicating that logic. Will refactor get.trait.data.pft() to delegate to the standalone in the non-cache path, matching the pattern from #3860 where get.parameter.samples() delegates to get_parameter_samples(). Will push the fix shortly.

@omkarrr2533 omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from 8525b71 to 6a140a7 Compare May 12, 2026 02:23
@omkarrr2533 omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch 3 times, most recently from 12219a8 to e5f7353 Compare May 12, 2026 16:10
@omkarrr2533 omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from e5f7353 to 3fd20cf Compare May 12, 2026 16:14
@omkarrr2533 omkarrr2533 marked this pull request as ready for review May 12, 2026 16:56
Comment on lines +55 to +61
computed <- get_trait_data_pft(
pft_name = pft[["name"]],
modeltype = modeltype,
dbcon = dbcon,
trait_names = trait.names,
constants = if (!is.null(pft$constants)) pft$constants else list()
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this get_trait_data_pft is called before the forceupdate check, meaning every call hits the db, even when the cache is valid

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fix could be to move this inside the cache-miss section.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the get_trait_data_pft() call into the cache-miss path. The DB is no longer queried on a cache-hit. Added a cheap query_pfts() call before the cache check to resolve pftid, pfttype, and pft_member_filename without fetching full trait data.

Comment thread base/db/R/get.trait.data.pft.R
@omkarrr2533 omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch 2 times, most recently from c590f33 to 43919c2 Compare May 13, 2026 10:09
Comment thread base/db/R/get.trait.data.pft.R Outdated
##' instead.
##' Internally this wrapper delegates all database queries to
##' \code{\link{get_trait_data_pft}} exactly once. The returned objects are
##' used for both the cache-staleness check and the save step, so the database
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this bit about cacheing something the user needs to know to be able to use the function correctly, or something that should just be internally documented within the function? I suspect it's going to confuse more than enlighten

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point moved it out. The cache mechanics were noise in user-facing roxygen. The new wrapper delegates to get_trait_data_pft() exactly once at the top, so the cache details now live as a short inline comment block right above the staleness check, where someone reading the function body will see them and someone reading ?get.trait.data.pft won't.

##' detect changes between runs.}
##' }
##'
##' **Downstream contract:** The files `trait.data.Rdata` and
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't remove this until run.meta.analysis.pft no longer expects these files

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed restored the Downstream contract section. Added a one-line note that it'll be removed in the Week 2 PR once run.meta.analysis.pft() accepts in-memory inputs.

##' `pft` should be a list containing at least `name` and `outdir`, and
##' optionally `posteriorid` and `constants`.
##'
##' **File-based side effects (saved to `pft$outdir`):**
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these file-based side effects are still true and still really important information to retain. I'd restore these comments

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored. Kept the full File-based side effects section — every file listed is still being written exactly as before, so the docs should reflect that. 👍

Comment thread base/db/R/get.trait.data.pft.R Outdated
##' ID of the (possibly new) posterior record in BETYdb. The posterior ID can
##' be used to locate the output files (`trait.data.Rdata`, `prior.distns.Rdata`,
##' ID of the (possibly new) posterior record in BETYdb. Also contains
##' `pft$trait_data` and `pft$prior_distns` for in-memory chaining on both
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "also contains" part is critical but insufficiently described/documented and insufficiently connected to the output files part. Indeed, I'm not 100% sure this function should return these objects at all, at least by default, as that's a nontrivial, breaking change to the previous behavior. The problem is that if these objects are returned as part of a pft input list that is then used to updated that part of the settings object, that doing so will break the ability to save the settings object. I wonder if we should add a flag to control this behavior?

The "in-memory chaining on both the cache-hit and cache-miss paths" is also unneeded.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on the flag. Added a return_data parameter, default FALSE, so legacy behavior is unchanged. When return_data = TRUE, the returned pft list also includes trait_data and prior_distns for in-memory chaining; otherwise nothing is attached.
Updated the @return docs to make the tradeoff explicit:

"Default is FALSE to preserve legacy behavior — when pft is embedded inside a settings object, attaching these data frames by default would inflate the settings and break serialization."

Also dropped the "in-memory chaining on both cache-hit and cache-miss paths" line per your comment.

Comment thread base/db/R/get.trait.data.pft.R Outdated
pft_member_filename = "cultivars.csv"
pftid <- pftres[["id"]]
pfttype <- pftres[["pft_type"]]
pft_member_filename <- if (pfttype == "cultivar") "cultivars.csv" else "species.csv"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit is not equivalent to what was here previously and as such we seem to be loosing the logger.severe associated with an unmatched pfttype. This would be a good test to add if we don't have it already -- mock having an invalid pft_type and check that the function stops rather than continuing on thinking it's working with plant / species.csv data

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was a regression on my part. Restored the explicit three-branch check with logger.severe for unknown pft_type. The actual guard lives inside get_trait_data_pft() now (since the wrapper delegates there), but I added a wrapper-level test (test_that("wrapper errors for unknown pft_type, not silent fallback")) that mocks query_pfts to return a bad record and asserts the function stops. So the contract is covered from both ends.

Comment thread base/db/R/get.trait.data.pft.R Outdated
if (done) {
trait_env <- new.env(parent = emptyenv())
prior_env <- new.env(parent = emptyenv())
load(file.path(pft$outdir, "trait.data.Rdata"), envir = trait_env)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I thought the whole point of the refactor was that these objects would be returned by get_trait_data_pft without having to write them to disk and then load them back in?
  2. see previous comment about not being sure returning these data structures should be the default behavior for the legacy function

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that was a leftover from an earlier version of the diff. With the single-delegation structure, trait.data and prior.distns are already in scope from the top-of-function call, so the cache-hit path just attaches those directly (gated on return_data = TRUE). No load() from disk.

Comment thread base/db/R/get.trait.data.pft.R Outdated

## Attach computed objects so downstream callers can chain in-memory
## without loading files from pft$outdir.
pft$trait_data <- trait.data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as before about adding a flag to make this optional. Should default to not returning (legacy behavior)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done handled by the same return_data flag from the comment above. Both the cache-hit and cache-miss return paths attach trait_data / prior_distns only when return_data = TRUE. Default FALSE = legacy behavior, no breaking change.


# ---- Resolve PFT to a single database record ----
pft_record <- query_pfts(dbcon, pft_name, modeltype, strict = TRUE)
if (nrow(pft_record) > 1L) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get.trait.data.pft also has a check for length 0. Is that needed here? Also, if the checks are here, are they still needed in get.trait.data.pft too?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was a real duplication. With the new single-delegation structure, all DB queries and their associated validation (length-0, multi-PFT, pft_type guard, dbcon class, etc.) live only in get_trait_data_pft(). The wrapper inherits the errors through the delegation. So both checks are present once, in one place — no abstraction needed since there's no longer duplication.

# ---- Fetch PFT member species or cultivars ----
# The join table depends on pft_type. An unknown type is a hard error —
# silently falling through to the species path would produce wrong results.
if (identical(pft_type, "cultivar")) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, if this check is here, is it also needed in get.trait.data.pft? If all these checks are actually needed in both places, can they be abstracted to a function that is used in both places?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pft_type guard now lives only in get_trait_data_pft(). The wrapper's inline copy is gone in the new diff. If we ever do find a check that genuinely needs to be in both places, I agree a shared helper would be the right move, but right now there's nothing to abstract.

Comment thread base/db/R/get_trait_data_pft.R
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants