ML_Project2

Quality of life in Swiss cities

In collaboration with the Laboratory for Human-Environment Relations in Urban Systems (HERUS), under the supervision of Massaro Emanuele.

Authors:

Habibollahi Saatlou Forough

Piquet Anthony

Erbacher Pierre

Kouh Kamari Hosseini Seyed Reza*

*This member is not enrolled in the Machine Learning course

Introduction

The HERUS (Human-Environment Relations in Urban Systems) lab wants a new model, or an improvement of its existing model, that can predict the quality-of-space indicators of Swiss cities from the insurance data of la Mobilière customers.

Dataset

The dataset was provided by the insurance company la Mobilière and is under a confidentiality agreement. It contains:

1. The data of thousands of people in Switzerland who subscribed to la Mobilière insurance; we call this dataset "Datacities".
2. Indicators for each city (about 90 indicators for 170 cities in Switzerland), provided by the Swiss Statistical Office, from 10 000 people.

The raw "Datacities" dataset contains 25 features and about 670 000 entries.

Feature Description

JobState: Status of employment
Civil: Civil status
YearOfBirth: Year of birth
Gender: Gender
Own/Rent: Whether the customer owns or rents a house
Lang: Spoken language
Nation: Nation of origin
Children0-26: Number of children
Car1Price: Price of the first car
Car1ClaimsCt5Y: Number of claims for the first car
Car1ClaimsSum5Y: Sum of money of claims for the first car
Car2Price: Price of the second car
Car2ClaimsCt5Y: Number of claims for the second car
Car2ClaimsSum5Y: Sum of money of claims for the second car
CarPremium: Premium class
HHInsSum: Insured sum
Standoffurn: Standard of furniture
Rooms: Number of rooms
BuildInsSum: Insured sum of the building
Yearofconstr: Year of construction
HHaBClaimsCt5Y: Number of claims
HHaBClaimsSum5Y: Sum of money of claims
HHandBldPrem: Premium class
Zip: Zip code of residence
BFS: BFS number
City: The city

Note: the descriptions are retrieved from the original data description file.

Indicator description:

11 Transportation indicators
29 Population indicators
11 Work & Workplaces indicators
8 Space and Territory indicators
18 Housing indicators
9 Finance indicators
4 Education indicators

Note that further details on the indicators can be found in the swissdatadescription.pdf file uploaded alongside the code.

Usage

If you have access to the dataset, you can use the Run.ipynb notebook to perform the analysis and reproduce the results.

The notebook includes all the steps, from preprocessing the data to training a prediction model. It consists of two main sections:

Features exploration and selection:

This part explores the raw dataset and selects the most relevant features through the following steps:

1. Loading raw dataset
2. Translation:

The categorical values are written in German; we translate them into English.

3. Uniformization of unknown values:

We group all NA or whitespace values under the name "unknown".

4. Features exploration:

4.1 Categorical features:

We look at the number of unknown values for each feature.

4.2 Numerical features:

We look at the number of zeroes for each feature.

5. Remove outliers:

We show the proportion of unknown and meaningless zero values, together with the distribution, to see whether the feature is exploitable. If there are too many unknown values, we consider the feature useless and drop it. We also remove rows that are considered outliers.

6. Creation of dummy variables and replacement of categorical values by numbers
7. Saving the selected and engineered features
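
The steps above can be sketched on a toy table as follows. This is only an illustration, not the notebook's code: the German labels, the `translation` mapping, the `max_unknown_share` threshold and the sample values are all assumptions, and only the column names come from the feature list. Row-level outlier removal is left out because the criterion is not stated here.

```python
import pandas as pd

# Toy stand-in for the confidential raw data (values invented for illustration).
raw = pd.DataFrame({
    "Gender": ["Mann", "Frau", " ", None, "Frau"],
    "Civil": ["ledig", "verheiratet", "ledig", "", "verheiratet"],
    "Nation": [None, " ", "", "CH", None],
    "Car1Price": [25000, 0, 18000, 0, 32000],
})
categorical = ["Gender", "Civil", "Nation"]
numerical = ["Car1Price"]

# Translation: map German category labels to English (hypothetical mapping).
translation = {"Mann": "male", "Frau": "female",
               "ledig": "single", "verheiratet": "married"}
df = raw.copy()
for col in categorical:
    df[col] = df[col].replace(translation)

# Uniformization: missing, empty and whitespace-only values become "unknown".
for col in categorical:
    is_blank = df[col].isna() | df[col].astype(str).str.strip().eq("")
    df[col] = df[col].mask(is_blank, "unknown")

# Exploration: share of unknown values (categorical) and of zeroes (numerical).
unknown_share = (df[categorical] == "unknown").mean()
zero_share = (df[numerical] == 0).mean()

# Drop features with too many unknown values (threshold is an assumption).
max_unknown_share = 0.5
too_sparse = unknown_share[unknown_share > max_unknown_share].index.tolist()
df = df.drop(columns=too_sparse)

# Dummy variables: one 0/1 column per remaining category value.
kept_categorical = [c for c in categorical if c in df.columns]
encoded = pd.get_dummies(df, columns=kept_categorical, dtype=int)
print(encoded.columns.tolist())
```

In this toy example `Nation` is dropped because four of its five values are unknown, while `Gender` and `Civil` are kept and expanded into dummy columns.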

Normalization and Learning:

In this part, we use the preprocessed dataset generated in the Features exploration and selection part. We normalize it and try to design a model that can predict the indicators of cities.

This part contains the following steps:

1. Loading the selected features and the indicators from the original dataset
2. Merging the dataset with the city indicators
3. Splitting the dataset into training and testing sets
4. Grouping cities by population size (assuming that smaller cities do not follow the same model as bigger, more populated cities)
5. Normalization of the categorized dataset
6. Training the selected model:

In this step, we have used three different models:

Multilayer Perceptron
Ridge Regression
Linear Regression

Note that the results of the latter two methods were very similar, so the report file only presents the predictions of the first two methods.

7. Extracting the scores for each indicator
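
A minimal sketch of these steps on synthetic data is given below. It is not the notebook's code: the data, the `size_class` column, the join on `BFS`, the model hyperparameters and the 80/20 split are assumptions made for illustration; only `num_cities_per_class` and the three model families come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

num_cities_per_class = 10  # parameter named in the README; value is arbitrary
rng = np.random.default_rng(0)
n_cities, n_customers = 30, 900

# Synthetic city table: one indicator and a population figure per city.
city_indicator = rng.normal(size=n_cities)
cities = pd.DataFrame({
    "BFS": np.arange(n_cities),
    "population": rng.integers(10_000, 400_000, size=n_cities),
    "indicator": city_indicator,
})
# Rank cities by population and cut the ranking into equally sized classes.
cities = cities.sort_values("population").reset_index(drop=True)
cities["size_class"] = np.arange(n_cities) // num_cities_per_class

# Synthetic customer table whose car price loosely follows the city indicator.
bfs = rng.integers(0, n_cities, size=n_customers)
customers = pd.DataFrame({
    "BFS": bfs,
    "Rooms": rng.integers(1, 7, size=n_customers),
    "Car1Price": 25_000 + 3_000 * city_indicator[bfs]
                 + rng.normal(0, 2_000, size=n_customers),
})

# Merge, then split into training and testing sets.
data = customers.merge(cities, on="BFS", how="inner")
train, test = train_test_split(data, test_size=0.2, random_state=0)

feature_cols = ["Rooms", "Car1Price"]
models = {
    "mlp": lambda: MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "ridge": lambda: Ridge(alpha=1.0),
    "linear": lambda: LinearRegression(),
}

rows = []
for size_class in sorted(data["size_class"].unique()):
    tr = train[train["size_class"] == size_class]
    te = test[test["size_class"] == size_class]
    # Fit the scaler on the training rows only, then apply it to both sets.
    scaler = StandardScaler().fit(tr[feature_cols])
    X_tr = scaler.transform(tr[feature_cols])
    X_te = scaler.transform(te[feature_cols])
    for name, make_model in models.items():
        model = make_model().fit(X_tr, tr["indicator"])
        # score() returns the R^2 of the predictions on the test rows.
        rows.append({"size_class": size_class, "model": name,
                     "r2": model.score(X_te, te["indicator"])})
scores = pd.DataFrame(rows)
print(scores)
```

Fitting the scaler inside each class keeps the normalization specific to that group of cities, which matches the idea that small and large cities may need different models.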

Parameter Setting:

You can modify the following parameters in the learning part:

Number of cities in each class: set num_cities_per_class to the desired value.

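Indicators you want to predict: change the range of the for loop over the file containing the indicators to choose a different group of them. The current range is range(50,70), which covers indicator number 50 to indicator number 69 (20 indicators, since the upper bound is excluded):

    for i in range(50,70):
        # select the i-th indicator column as the target
        y = X_Class.iloc[:,i]
        # standardize the target to zero mean and unit variance
        y = preprocessing.scale(y)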
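
To make the effect of the range concrete, here is a small self-contained sketch. The `X_Class` table below is a hypothetical stand-in filled with random numbers (in the notebook it holds the real data), and the names `first`, `stop` and `selected` are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

rng = np.random.default_rng(0)
# Hypothetical stand-in for X_Class: 40 rows, 90 columns named by position.
X_Class = pd.DataFrame(
    rng.normal(5.0, 2.0, size=(40, 90)),
    columns=[f"indicator_{k}" for k in range(90)],
)

first, stop = 50, 70  # edit these two values to predict another group
selected = []
for i in range(first, stop):
    y = X_Class.iloc[:, i]          # i-th column, selected by position
    y = preprocessing.scale(y)      # zero mean, unit variance
    selected.append(X_Class.columns[i])
    # ... fit and score the chosen model on y here ...

print(len(selected), selected[0], selected[-1])
```

Because `range` excludes its upper bound, the loop visits columns 50 through 69, which is where the count of 20 indicators comes from.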

Output:

You will get the R^2 score (scores) on the test data of the chosen model for each of the 20 chosen indicators and each city category. The mean values of these scores and more detailed results are presented in the report file.
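
For readers unfamiliar with the metric, the sketch below computes R^2 by hand and compares it with scikit-learn's `r2_score`. The numbers are made up for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)

# A model that always predicts the mean of the targets scores exactly 0.
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(round(r2_manual, 3), round(r2_sklearn, 3), r2_mean_baseline)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.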

The output files corresponding to each utilized model are named: prediction_results_50to70_mlregressor.csv, prediction_results_50to70_ridge.csv, and prediction_results_50to70_linear.csv.

Libraries Used:

Matplotlib

Numpy

Pandas

Scikit-Learn:

preprocessing / PCA / Ridge regression / Linear regression / MLPRegressor (neural network)

Reproducibility:

If you have access to the original dataset, make sure your libraries are at their latest versions, then simply run the Run.ipynb notebook.
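
One way to check which versions are installed before running the notebook is sketched below. The package list mirrors the Libraries Used section; the exact versions used by the authors are not stated in this README, so the snippet only reports what is present.

```python
from importlib import metadata

# Distribution names of the libraries listed in this README.
packages = ["matplotlib", "numpy", "pandas", "scikit-learn"]

versions = {}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = None  # package is not installed in this environment

for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```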
