beers_dirty_5_implicitmissingvaluemedianmode: Accidental duplication of 2 rows #5

@visenger

Running the grouping command

dirtyData.groupBy(col("tid")).count().where(col("count") > 11).show()

suggests that two data points were duplicated (or triplicated):
+----+-----+
| tid|count|
+----+-----+
|   0|   55|
|1206|   66|
+----+-----+

In particular, the dirty dataset for rows with "tid"==0 looks like the following:

+---+------------+--------------------+--------------------+-----+
|tid|    attrName|         dirty-value|         clean-value|label|
+---+------------+--------------------+--------------------+-----+
|  0|          id|                 825|                1436|    1|
|  0|          id|                2222|                1436|    1|
|  0|          id|                2233|                1436|    1|
|  0|          id|                 665|                1436|    1|
|  0|   beer-name|      Bodacious Bock|            Pub Beer|    1|
|  0|   beer-name|              10 Ton|            Pub Beer|    1|
|  0|   beer-name|      American Lager|            Pub Beer|    1|
|  0|   beer-name|       Toasted Lager|            Pub Beer|    1|
|  0|       style|                Bock| American Pale Lager|    1|
|  0|       style|       Oatmeal Stout| American Pale Lager|    1|
|  0|       style|American Adjunct ...| American Pale Lager|    1|
|  0|       style|        Vienna Lager| American Pale Lager|    1|
|  0|      ounces|                16.0|                12.0|    1|
|  0|      ounces|                16.0|                12.0|    1|
|  0|         abv|               0.075|                0.05|    1|
|  0|         abv|                0.07|                0.05|    1|
|  0|         abv|0.040999999999999995|                0.05|    1|
|  0|         abv|               0.055|                0.05|    1|
|  0|         ibu|                 8.0|                null|    1|
|  0|         ibu|                28.0|                null|    1|
|  0|  brewery_id|                 499|                 408|    1|
|  0|  brewery_id|                  94|                 408|    1|
|  0|  brewery_id|                 129|                 408|    1|
|  0|  brewery_id|                 489|                 408|    1|
|  0|brewery-name|Wildwood Brewing ...|10 Barrel Brewing...|    1|
|  0|brewery-name|Warped Wing Brewi...|10 Barrel Brewing...|    1|
|  0|brewery-name|      Straub Brewery|10 Barrel Brewing...|    1|
|  0|brewery-name|Blue Point Brewin...|10 Barrel Brewing...|    1|
|  0|        city|        Stevensville|                Bend|    1|
|  0|        city|              Dayton|                Bend|    1|
|  0|        city|           St Mary's|                Bend|    1|
|  0|        city|           Patchogue|                Bend|    1|
|  0|       state|                  MT|                  OR|    1|
|  0|       state|                  OH|                  OR|    1|
|  0|       state|                  PA|                  OR|    1|
|  0|       state|                  NY|                  OR|    1|
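
To narrow down what happened, it may help to compare, for each affected tid and attribute, the number of rows with the number of distinct dirty values. The following is only a sketch, not the project's own tooling. It assumes the column names from the table above and a Spark dependency on the classpath. The object name InspectDuplicatedTids, the helper rowsPerAttribute, and its expectedRowsPerTid parameter are hypothetical. The small sample is copied from the "ounces" and "state" rows above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, countDistinct, lit}

// Hypothetical helper object, for illustration only.
object InspectDuplicatedTids {

  // For every tid with more rows than expected, report per attribute
  // how many rows exist and how many distinct dirty values they carry.
  def rowsPerAttribute(dirtyData: DataFrame, expectedRowsPerTid: Long): DataFrame = {
    // Same grouping as in the issue: tids whose row count exceeds the threshold.
    val suspiciousTids = dirtyData
      .groupBy(col("tid"))
      .count()
      .where(col("count") > expectedRowsPerTid)
      .select(col("tid"))

    dirtyData
      .join(suspiciousTids, Seq("tid"))
      .groupBy(col("tid"), col("attrName"))
      .agg(
        count(lit(1)).as("rows"),
        countDistinct(col("dirty-value")).as("distinctDirtyValues"))
      .orderBy(col("tid"), col("attrName"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-duplicated-tids")
      .getOrCreate()
    import spark.implicits._

    // A few rows copied from the tid == 0 listing above.
    val sample = Seq(
      (0, "ounces", "16.0", "12.0", 1),
      (0, "ounces", "16.0", "12.0", 1),
      (0, "state", "MT", "OR", 1),
      (0, "state", "OH", "OR", 1),
      (0, "state", "PA", "OR", 1),
      (0, "state", "NY", "OR", 1)
    ).toDF("tid", "attrName", "dirty-value", "clean-value", "label")

    // Threshold of 1 only because the sample is tiny; the issue's command uses 11.
    rowsPerAttribute(sample, expectedRowsPerTid = 1L).show()

    spark.stop()
  }
}
```

Attributes where "rows" exceeds "distinctDirtyValues" contain exact repeats (as with "ounces" above), while attributes where the two numbers match carry several different dirty values under the same tid (as with "state").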
