Changes from all commits
20 commits
099973b
Progress Bar: initial implementation using progresbar2 package
jigglepuff Apr 5, 2020
dfc37c2
Progress Bar: refactor using enlighten package for stdout workaround
jigglepuff Apr 6, 2020
20b77ae
Progress Bar: implemented detailed progress increments for Extractor
jigglepuff Apr 7, 2020
7401541
Progress Bar: Fixed comment
jigglepuff Apr 7, 2020
40d99eb
Merge branch 'master' of https://github.com/OpenSTL/StlOpenDataEtl in…
jigglepuff Apr 11, 2020
a4dbb77
changes to ensure merge is working
jigglepuff Apr 11, 2020
0e5e837
loader.py: fix debug message argument error
jigglepuff Apr 11, 2020
18aa332
Progress bar: change to singleton implementation
jigglepuff Apr 11, 2020
62ea3f3
Progress bar manager: added function to get child progress bars
jigglepuff Apr 11, 2020
bc86195
loader.py: added load_all() function
jigglepuff Apr 12, 2020
e7accfd
test_fetcher.py: script to run fetcher standalone
jigglepuff Apr 13, 2020
274cb13
test_parser.py, test_extractor.py: run parser and extractor standalone
jigglepuff Apr 13, 2020
2d03e9b
app.py: changed import calls such that library and variable name don'…
jigglepuff Apr 13, 2020
23ec6dc
test_transformer.py: run transformer standalone, added utils function…
jigglepuff Apr 13, 2020
07f6dd0
vacant_table.py: changed to_csv to custom utils.to_csv to preserve da…
jigglepuff Apr 13, 2020
b7e0627
test_loader: run loader standalone
jigglepuff Apr 13, 2020
9138b3c
data/test_*.yml: added test config files to go with test scripts
jigglepuff Apr 13, 2020
c94330b
vacant_table.py: revert change
jigglepuff Apr 13, 2020
914fdc8
.gitignore: ignore test_transfom_tasks.yaml
jigglepuff Apr 13, 2020
a000cec
vacant_table.py: revert more changes
jigglepuff Apr 14, 2020
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -148,3 +148,7 @@ transform_vacant_table.csv

# sqlite test db
vacancy.sqlite

# config files for unit tests
data/source/test_sources.yml
Collaborator comment:
the database config files seem to be named config_{environment}.yml but these files are named {env}_sources.yml or similar. I would say we should put the environment name in a consistent place in the filename, either front or back.

data/transform_tasks/test_transform_tasks.yml
37 changes: 37 additions & 0 deletions README.md
Expand Up @@ -59,7 +59,44 @@ python3 ./app.py --db prod
```
:warning: Example 3 will not work if you don't have the database admin credentials. For more details, see [Running with Production Database](#running-with-production-database).

#### Running individual stages
To run an individual stage (e.g. the Fetcher or Transformer on its own) without running the entire application, use the following commands:
1. Run `Fetcher` only:
Run with the default `test_sources.yml`:
```bash
python3 tests/test_fetcher.py
```
To run with a specific source YAML, replace the last argument with the path to your custom YAML:
```bash
python3 tests/test_fetcher.py ./data/sources/sources.yml
```

2. Run `Parser` only:
Use `--local-sources` to specify the local files to parse:
```bash
python3 tests/test_parser.py --local-sources ./src/prcl.mdb ./src/par.dbf ./src/prcl_shape.zip
```

3. Run `Extractor` only:
```bash
python3 tests/test_extractor.py --local-sources ./src/prcl.mdb ./src/par.dbf ./src/prcl_shape.zip
```

4. Run `Transformer` only:
```bash
python3 tests/test_transformer.py --local-sources src/BldgCom.csv src/BldgRes.csv src/par.dbf.csv src/prcl.shp.csv src/Prcl.csv
```

Collaborator comment:

Do those csv files come from running one of the other stages in isolation? If so we should document that here.

5. Run `Loader` only:
```bash
python3 tests/test_loader.py --local-sources ./src/vacant_table.csv
```

#### Running unit tests
To run the unit tests, run the following command from the project root directory:
```bash
pytest
```

#### Deactivating Virtual Environment

Expand Down
75 changes: 26 additions & 49 deletions app.py
Expand Up @@ -3,83 +3,60 @@
'''

import os
import logging
from etl import command_line_args, extractor, fetcher, fetcher_local, loader, parser, transformer, utils

CSV = '.csv' # comma separated values
DBF = '.dbf' # dbase
MDB = '.mdb' # microsoft access database (jet, access, etc.)
PRJ = '.prj' # .shp support file
SBN = '.sbn' # .shp support file
SBX = '.sbx' # .shp support file
SHP = '.shp' # shapes
SHX = '.shx' # .shp support file
import sys
import logging.config
from etl.constants import *
from etl import command_line_args, utils
from etl.fetcher import Fetcher
from etl.fetcher_local import FetcherLocal
from etl.extractor import Extractor
from etl.parser import Parser
from etl.transformer import Transformer
from etl.loader import Loader

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
SUPPORTED_FILE_EXT = [CSV, DBF, MDB, PRJ, SBN, SBX, SHP, SHX]

if __name__ == '__main__':
# Parse Command line arguments
commandLineArgs = command_line_args.getCommandLineArgs()
# Setup logging
logging.config.fileConfig('data/logger/config.ini')
logger = logging.getLogger(__name__)

# Load database config
# notify user if the app will be using test or prod db
if (commandLineArgs.db == 'prod'):
print('Using production database...')
logger.info('Using production database...')
db_yaml = utils.get_yaml('data/database/config_prod.yml')
else:
print('Using development database...')
logger.info('Using development database...')
db_yaml = utils.get_yaml('data/database/config_dev.yml')
# delete local db from previous run
utils.silentremove(db_yaml['database_credentials']['db_name'])

# Fetcher
if (commandLineArgs.local_sources):
print('using local data files', commandLineArgs.local_sources)
fetcher = fetcher_local.FetcherLocal()
logger.info("Using local data files: {}".format(' '.join(map(str, commandLineArgs.local_sources))))
fetcher = FetcherLocal()
filenames = commandLineArgs.local_sources
responses = fetcher.fetch_all(filenames)
else:
fetcher = fetcher.Fetcher()
fetcher = Fetcher()
src_yaml = utils.get_yaml('data/sources/sources.yml')
responses = fetcher.fetch_all(src_yaml)

# Parser
parser = parser.Parser()
for response in responses:
try:
response.payload = parser.flatten(response, SUPPORTED_FILE_EXT)
except Exception as err:
print(err)
parser = Parser()
responses = parser.parse_all(responses)

# Extractor
extractor = extractor.Extractor()
# Master entity list
entity_dict = dict()
entities = []
for response in responses:
for payload in response.payload:
if utils.get_file_ext(payload.filename) == CSV:
entities = extractor.get_csv_data(payload)
elif utils.get_file_ext(payload.filename) == MDB:
entities = extractor.get_mdb_data(payload)
elif utils.get_file_ext(payload.filename) == DBF:
entities = extractor.get_dbf_data(payload)
elif utils.get_file_ext(payload.filename) == SHP:
entities = extractor.get_shp_data(response, payload)
else:
entities = {}
# Add to master entity list
entity_dict.update(entities)
extractor = Extractor()
entity_dict = extractor.extract_all(responses)

# Transformer
transform_tasks = utils.get_yaml('data/transform_tasks/transform_tasks.yml')
transformer = transformer.Transformer()
transformer = Transformer()
transformed_dict = transformer.transform_all(entity_dict, transform_tasks)

# Loader
# read loader config
loader = loader.Loader(db_yaml)
# connect to database
loader.connect()
for tablename, transformed_df in transformed_dict.items():
loader.insert(tablename, transformed_df)
loader = Loader(db_yaml)
loader.load_all(transformed_dict)
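The refactor replaces app.py's per-extension if/elif chain with a single `extract_all` call on the Extractor. The repo's implementation of `extract_all` isn't shown in this diff; the sketch below is an assumption of what such a dispatcher might look like, with stub `get_*_data` handlers (and only two of the four extensions) standing in for the real extraction code:

```python
import os

CSV, DBF = '.csv', '.dbf'

class Extractor:
    """Sketch of the extension dispatch that extract_all consolidates.
    The get_*_data stubs stand in for the repo's real extraction logic."""

    def get_csv_data(self, payload):
        return {payload.filename: 'csv-entities'}   # stub

    def get_dbf_data(self, payload):
        return {payload.filename: 'dbf-entities'}   # stub

    def extract_all(self, responses):
        # Map extension -> handler once, instead of an if/elif chain.
        handlers = {CSV: self.get_csv_data, DBF: self.get_dbf_data}
        entity_dict = {}
        for response in responses:
            for payload in response.payload:
                ext = os.path.splitext(payload.filename)[1].lower()
                handler = handlers.get(ext)
                if handler is not None:
                    entity_dict.update(handler(payload))
        return entity_dict
```

Moving the loop inside the Extractor keeps app.py down to one call per stage, which is the shape the rest of this diff (parse_all, transform_all, load_all) follows.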
Empty file removed constants.py
Empty file.
23 changes: 23 additions & 0 deletions data/logger/config.ini
@@ -0,0 +1,23 @@
[loggers]
keys=root

[handlers]
keys=consoleHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
disable_existing_loggers=False
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s
datefmt=%Y-%m-%d:%H:%M:%S
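One quirk worth noting: `disable_existing_loggers` is an argument to `logging.config.fileConfig()`, not a handler option, so `fileConfig` silently ignores it where it sits in `[handler_consoleHandler]`. A self-contained way to exercise the same config with the flag passed where it takes effect (the config is written to a temp file here only so the snippet runs anywhere):

```python
import logging
import logging.config
import os
import tempfile

# Same content as data/logger/config.ini, minus the misplaced key;
# disable_existing_loggers is passed to fileConfig() below instead.
CONFIG = """
[loggers]
keys=root

[handlers]
keys=consoleHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s
datefmt=%Y-%m-%d:%H:%M:%S
"""

with tempfile.NamedTemporaryFile('w', suffix='.ini', delete=False) as f:
    f.write(CONFIG)
    path = f.name

logging.config.fileConfig(path, disable_existing_loggers=False)
os.unlink(path)

logging.getLogger(__name__).debug('fetcher stage started')
```

With `disable_existing_loggers=False`, module-level loggers created before the config loads (as in the etl submodules) keep working instead of being silenced.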
21 changes: 14 additions & 7 deletions data/sources/sources.yml
@@ -1,20 +1,27 @@
prcl_shape:
ESRI_Parcels_Shapefile:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=84
url: https://www.stlouis-mo.gov/data/upload/data-files/prcl_shape.zip

prcl:
Parcels_Key:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=83
url: https://www.stlouis-mo.gov/data/upload/data-files/prcl.zip

par:
Parcel_Data:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=85
url: https://www.stlouis-mo.gov/data/upload/data-files/par.zip

lra_public:
LRA_Iventory_Records:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=65
url: https://www.stlouis-mo.gov/data/upload/data-files/lra_public.zip

bldginsp:
Building_Inspections:
info: https://www.stlouis-mo.gov/data/datasets/dataset.cfm?id=11
url: https://www.stlouis-mo.gov/data/upload/data-files/bldginsp.zip

prmbdo:
Building_Permits:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=3
url: https://www.stlouis-mo.gov/data/upload/data-files/prmbdo.zip

forestry_maintenance_properties:
Forestry_Property_Maintenance_Data:
Info: https://www.stlouis-mo.gov/data/datasets/dataset.cfm?id=64
url: https://www.stlouis-mo.gov/data/upload/data-files/forestry-maintenance-properties.csv
27 changes: 27 additions & 0 deletions data/sources/test_sources.yml
@@ -0,0 +1,27 @@
ESRI_Parcels_Shapefile:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=84
url: https://www.stlouis-mo.gov/data/upload/data-files/prcl_shape.zip

Parcels_Key:
info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=83
url: https://www.stlouis-mo.gov/data/upload/data-files/prcl.zip

# Parcel_Data:
# info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=85
# url: https://www.stlouis-mo.gov/data/upload/data-files/par.zip
#
# LRA_Iventory_Records:
# info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=65
# url: https://www.stlouis-mo.gov/data/upload/data-files/lra_public.zip
#
# Building_Inspections:
# info: https://www.stlouis-mo.gov/data/datasets/dataset.cfm?id=11
# url: https://www.stlouis-mo.gov/data/upload/data-files/bldginsp.zip
#
# Building_Permits:
# info: https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=3
# url: https://www.stlouis-mo.gov/data/upload/data-files/prmbdo.zip
#
# Forestry_Property_Maintenance_Data:
# Info: https://www.stlouis-mo.gov/data/datasets/dataset.cfm?id=64
# url: https://www.stlouis-mo.gov/data/upload/data-files/forestry-maintenance-properties.csv
1 change: 1 addition & 0 deletions data/transform_tasks/test_transform_tasks.yml
@@ -0,0 +1 @@
- vacant_table
8 changes: 5 additions & 3 deletions etl/command_line_args.py
@@ -1,8 +1,10 @@
import argparse

def getCommandLineArgs():
def getCommandLineArgs(local_source=True, db=True):
parser = argparse.ArgumentParser()
parser.add_argument('--db', nargs='?', type=str, choices=['dev','prod'], default='dev', help='dev: use local database; prod: use production database')
parser.add_argument('--local-sources', nargs='+', type=str, help='local data files to use in place of internet sources.')
if db:
parser.add_argument('--db', nargs='?', type=str, choices=['dev','prod'], default='dev', help='dev: use local database; prod: use production database')
if local_source:
parser.add_argument('--local-sources', nargs='+', type=str, help='local data files to use in place of internet sources.')
args = parser.parse_args()
return args
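The conditional flag registration above lets the standalone test scripts request only the arguments they need. A sketch of how it behaves (an `argv` parameter is added here for testability; the repo's function reads `sys.argv` directly):

```python
import argparse

def get_command_line_args(argv=None, local_source=True, db=True):
    # Mirrors the refactored getCommandLineArgs: each flag is registered
    # only when the caller asks for it.
    parser = argparse.ArgumentParser()
    if db:
        parser.add_argument('--db', nargs='?', type=str,
                            choices=['dev', 'prod'], default='dev',
                            help='dev: use local database; prod: use production database')
    if local_source:
        parser.add_argument('--local-sources', nargs='+', type=str,
                            help='local data files to use in place of internet sources.')
    return parser.parse_args(argv)
```

A script that has no database stage, for example, calls `get_command_line_args(db=False)` and the resulting namespace simply has no `db` attribute, so stray `--db` flags are rejected rather than silently accepted.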
12 changes: 12 additions & 0 deletions etl/constants.py
@@ -0,0 +1,12 @@

# Declare global variables (callable across files)
TOTAL_STAGES = 5
CSV = '.csv' # comma separated values
DBF = '.dbf' # dbase
MDB = '.mdb' # microsoft access database (jet, access, etc.)
PRJ = '.prj' # .shp support file
SBN = '.sbn' # .shp support file
SBX = '.sbx' # .shp support file
SHP = '.shp' # shapes
SHX = '.shx' # .shp support file
SUPPORTED_FILE_EXT = [CSV, DBF, MDB, PRJ, SBN, SBX, SHP, SHX]
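A typical consumer of `SUPPORTED_FILE_EXT` checks a filename's extension before dispatching it to an extractor. A small sketch (the `is_supported` helper is an assumption for illustration, not a function from the repo):

```python
import os

# Mirrors etl/constants.py
CSV = '.csv'  # comma separated values
DBF = '.dbf'  # dbase
MDB = '.mdb'  # microsoft access database
PRJ = '.prj'  # .shp support file
SBN = '.sbn'  # .shp support file
SBX = '.sbx'  # .shp support file
SHP = '.shp'  # shapes
SHX = '.shx'  # .shp support file
SUPPORTED_FILE_EXT = [CSV, DBF, MDB, PRJ, SBN, SBX, SHP, SHX]

def is_supported(filename):
    """Hypothetical helper: True if the extension is one the ETL handles."""
    return os.path.splitext(filename)[1].lower() in SUPPORTED_FILE_EXT
```

Lower-casing the extension keeps mixed-case source files like `Prcl.csv` from slipping through the filter.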