Add possibility to overwrite data in datapackage#73
Conversation
5dc6087 to
67cc18d
Compare
32032c8 to
fd1316d
Compare
| for res in self.resources: | ||
| if res.name == name: | ||
| continue | ||
| sm = safe(res, "schema") |
There was a problem hiding this comment.
I think this should stay. I'm not really fine with the fact that suppressing a resource is automatically deleting the fks, i prefer we raise, tell the user to delete the fks and try again. It's much more explicit and less error-prone
There was a problem hiding this comment.
Hum ok 👍 But then, when adding data in "overwrite" strategy, what should we do?
Should we just check that both schemas match, overwrite data if it does, and raise error if it does not (with a message saying that we should force removing foreign keys beforehand if we really want to overwrite)?
There was a problem hiding this comment.
IMO coordo should try to avoid doing side effects, if there is foreign keys then it's up to the user (the user being either a person or another project using coordo, data4trees for example) to decide how to manage it
| if self.resource_exists(resource.name): | ||
| # resource already exists | ||
|
|
||
| if strategy == ResourceExistsStrategy.overwrite: |
There was a problem hiding this comment.
Why removing and recreating the resource ? what happens if there is a foreignkey ? this code will remove the fks and recreating the resource without the user knowing
There was a problem hiding this comment.
this was the indended purpose indeed ;) in my opinion, there could two types of overwrite: "overwrite_strict", which overwrites data but checks that previous and current schemas match (and if it does not, it raises an error), and overwrite_flexible, which is the current behavior of overwrite. Do you think we should just raise an error whenever schema do not match and ask the user to remove foreign keys manually before?
There was a problem hiding this comment.
Let's see this from a user standpoint:
On the website, we want a user to be able to import new data in the datapackage.
For external files and form data, we want to be able to overwrite data, but we already said that form metadata won't be updatable.
So IMO the overwrite function we want here is just to erase data and import new one, without changing the datapackage.json. So the load command should not always expect the datapackage definition, at least a ressource name, and overwrite only the data if it conforms to the already exsiting schema in datapackage.json. If not, raise an error.
There was a problem hiding this comment.
Yeah i agree with arnaud, there is 2 possibility, either replace the whole table by the uploaded data or append the data (with a merging strategy which ensure primary keys are not duplicated) but in either case the schema should not be touched at all
if you wish to modify the schema, delete the resource and recreate it, but handling it in coordo will be error prone and difficult to maintain, let's stay as simple as possible
There was a problem hiding this comment.
I agree with you
There was a problem hiding this comment.
Really good job on the loader side, the implementation looks clean !
On the overwrite strategy I feel like this is overly complicated, because there is many ways of doing the same thing so I fear it will be difficult to maintain.
IMO ight now the add resource function is trying to do too much thing, and we should only have 3 specific actions :
Adding a resource : raise an error if name already exists and tell the user to either delete the existing resource or update the resource if he wants to insert data.
Updating a resource : by default this should show how many lines will be updated and created (you don't need to implement it in that PR because it will be done with the ducklake PR), and prompt the user if he wants to continue (or a flag in the func for integration with the data4trees web app)
Removing a resource: raise an error if there is foreign keys to it and tell the user to delete the fks
5758534 to
d2effc6
Compare
c8da8f0 to
342fc75
Compare
…e "schema" in "Resource" shadows an attribute in parent "BaseModel"'
697d7e3 to
46ae7bc
Compare
fe4f58d to
e7cee9b
Compare
|
|
||
| self.schema.foreignKeys.append(fk) | ||
|
|
||
| def remove_foreignkey(self, fk: ForeignKey) -> None: |
There was a problem hiding this comment.
Why add_foreignkey and remove_foreignkey doesn't have the same arguments ? It's not very coherent
There was a problem hiding this comment.
I did not change remove_foreignkey because it is not used anywhere, but actually I'm going to write a cli for it
|
|
||
| def add_foreignkey(self, fk: ForeignKey) -> None: | ||
| def add_foreignkey(self, fields: list[str], foreign_fields: list[str], foreign_resource: str) -> None: | ||
| fk = ForeignKey( |
There was a problem hiding this comment.
Composite foreign keys are not yet supported by coordo so we should limit to only one field pointing to one another field. Also we can't have multiple keys pointing to the same external resource (because the parser wouldn't know which one to use when auto-joining)
There was a problem hiding this comment.
I understand that the cli does not support that yet, but perhaps we could already implement something more general? It does not hurt anyway
There was a problem hiding this comment.
Yeah I understand that you want to support something more general but IMO it's better to block it and then add it later (and also update the parser at the same time) than to let users do something that is not supported by other parts of coordo
There was a problem hiding this comment.
Even if this is not exposed from the cli, we could for instance use this method from D4T backend , so it is exposed as a lib component. Therefore I rather agree with Mathias, let's not expose something we don't really support yet.
Maybe we can keep this current signature, but make an assert on the lists so if their size is > 1 , we raise "Composite forign keys are not supported yet". WDYT ?
There was a problem hiding this comment.
I made the modification :)
Add tests for datapackage
Closes #57
This PR adds a new parameter in coordo to choose which strategy to use whenever a resource is added while a resource with the same name already exists:
NOTE: for both
appendandappend_strict, the actual merging of dataframes IS NOT IMPLEMENTED YET. This will be performed by @Cabanon in another USIt is a big PR and I took some freedom to refactor and factorise code. Feel free to comment / refuse whenever you feel some portions of code were changed unecessarily