MiniCat is short for Mini Text Categorizer.
The goals of this tool are to:
- Serve as a simple, interactive interface to categorize text documents into up to 10 custom categories using Google's Natural Language API and Google Cloud Machine Learning Engine.
- Demonstrate how Cloud ML Engine and the Natural Language API can improve performance/accuracy and provide an end-to-end solution for your ML needs.
- Serve as a template for your own end-to-end text classification workflows using Google Cloud Platform APIs.
It is recommended, but not required, to use a virtual environment. Installing the dependencies in a new virtual environment allows you to run the sample without changing global Python packages on your system.
There are two options for setting up a virtual environment:

- Virtualenv
  - Install virtualenv
  - Create the environment: `virtualenv MiniCat-env`
  - Activate it: `source MiniCat-env/bin/activate`
- Miniconda
  - Install Miniconda
  - Create the environment: `conda create --name MiniCat-env python=2.7`
  - Activate it: `source activate MiniCat-env`
Python 2.7 is required. Install the dependencies:

```
pip install -r requirements.txt
```

Set up a Google Cloud project and enable the following APIs:

- Cloud Natural Language API
- Cloud Machine Learning Engine
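If you use the gcloud CLI, the same APIs can be enabled from the terminal; the service names below are the standard identifiers for these two products:

```
gcloud services enable language.googleapis.com ml.googleapis.com
```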
Then create a Google Cloud Storage bucket. This is where all your model and training-related data will be stored. For more information, check out the tutorials in the documentation pages.
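For example, with gsutil (the bucket name and region below are placeholders):

```
gsutil mb -l us-central1 gs://your-bucket-name/
```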
A simple terminal-based tool that allows document labeling for training, as well as label curation.
```
python main.py label --data_csv_file <filename.csv> \
                     --local_working_dir <MiniCat/data>
```
- `data_csv_file`: Path to your csv, which should contain these 3 column headers:
  - `file_path`: Full file path from which the text is to be read.
  - `text`: Text for the data point. (Only one of `file_path` or `text` is required.)
  - `labels`: The class to which the text belongs (can be empty).
- `local_working_dir`: Where all the different csv versions of your data and the prediction results will be located.
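For illustration, a minimal input csv mixing both styles might look like this (the paths and labels are placeholders):

```
file_path,text,labels
~/emails/file1.txt,,Important
,You just won a prize for $5000 ...,Unimportant
~/emails/file3.txt,,
```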
Use the NL API and ML Engine to train a classifier using the text and labels prepared by the labeler.
```
python main.py train --local_working_dir <MiniCat/data> \
                     --version <version_number> \
                     --gcs_working_dir <gs://bucket_name/file_path> \
                     --vocab_size <number> \
                     --region <us-central1> \
                     --scale_tier
```
- `local_working_dir`: Directory where all the csv version files are located.
- `version`: Version number of the csv to be used for training.
- `gcs_working_dir`: Path to the Google Cloud Storage directory to use for training and for storing the models and dataset (of the form `gs://bucket_name/some_path`).
- `vocab_size`: Size of the vocabulary to use for training. (Default :- 20000)
- `region`: Region where training should occur. Ideally, set this to the same region where your Google Cloud Storage bucket is located. (Default :- us-central1)
- `scale_tier`: Pass this flag to train with GPUs; the scale tier will be set to `BASIC_GPU`.
This tool can be used to classify different types of text data such as emails, support tickets, movie reviews, news topics, etc.
Let's consider the case of emails.
Create a working directory `emails` in your home directory.
As an example, export your emails from Gmail into a mailbox file, then post-process it into the following csv format (a conversion sketch follows the first table).
Create a spreadsheet similar to :
| . | file_path | text | labels |
|---|---|---|---|
| 1 | ~/emails/file1.txt | | Important |
| 2 | ~/emails/file2.txt | | Unimportant |
| 3 | ~/emails/file3.txt | | |
| 4 | ~/emails/file4.txt | | Important |
| ... | ... | ... | ... |
In this example each email's text is in a file. There are some seed labels that can be used to partially label the set of emails.
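The post-processing step mentioned above could be done with Python's standard mailbox module. This is only a sketch: the export location, the one-file-per-message layout, and the handling of multipart messages are all assumptions, and it targets Python 2.7 to match this project.

```python
import csv
import mailbox
import os

MBOX_PATH = os.path.expanduser('~/emails/export.mbox')  # assumed Gmail export location
OUT_DIR = os.path.expanduser('~/emails')                # working directory from this guide

mbox = mailbox.mbox(MBOX_PATH)
with open(os.path.join(OUT_DIR, 'emails.csv'), 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['file_path', 'text', 'labels'])  # headers the labeler expects
    for i, message in enumerate(mbox):
        payload = message.get_payload(decode=True)
        if payload is None:  # skip multipart containers for simplicity
            continue
        file_path = os.path.join(OUT_DIR, 'file%d.txt' % i)
        with open(file_path, 'wb') as out:
            out.write(payload)
        writer.writerow([file_path, '', ''])  # labels left empty; seed some by hand
```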
The spreadsheet can also be in this format :

| . | file_path | text | labels |
|---|---|---|---|
| 1 | | You just won a prize for $5000 ... | Unimportant |
| 2 | | Your friend Alice tagged you in ... | Important |
| 3 | | Call #0000 and get a free iPhone ... | |
| 4 | | Signup today for holiday packages... | Important |
| ... | ... | ... | ... |
Note: You could also use a mix of both `text` and `file_path` in the spreadsheet.
Create the spreadsheet according to your requirements and save it in the working directory `emails` under the name `emails.csv`.
Make sure Python 2.7 is installed. Follow the commands in the Virtual Environments Setup section. Fork the git repository and, from inside the directory, run :-

```
pip install -r requirements.txt
```
Create a Google Cloud Platform project and set up billing and credentials. For info on how to do that, see steps 1, 2, 4, 5 and 6 on this page.
Set up APIs by following the setup mentioned above.
Create a Google Cloud Storage bucket `emails` and then create a directory under it called `working_dir`.
From the git-repo directory, run the following command :-

```
python main.py label --data_csv_file ~/emails/emails.csv \
                     --local_working_dir ~/emails/
```
First the tool will ask you to select a set of target labels :-
```
Automatically detected labels :
Important
Unimportant

Enter a new label or enter 'd' for done :
```
Then the tool will allow you to label the text :-
```
Id Label
0  Important
1  Unimportant

Call #0000 and get a free Pixel today. Select between all google phones........

Enter the Label id ('d' for done, 's' to skip) : 1
```
The labelling workflow will continue until you have labelled all the unlabelled text or you type 'd'.
The tool should exit at the end saying a new version 1 was created.
From the git-repo directory, run the following command :-

```
python main.py train --local_working_dir ~/emails/ \
                     --version 1 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier
```
Note: Don't use the flag scale_tier if you do not want to use a GPU
while training.
This will start the training on the version 1 labels file which was created
using the labeler tool. The tool will output a url which can be used to view the
job's progress. Wait for the job to finish and the results to be displayed.
There should be a file at ~/emails/v1/predictions.csv containing the
predicted labels and prediction confidence for all your data points.
At this point, if the results are unsatisfactory, label some more examples.
The predictions in ~/emails/v1/predictions.csv can be used to help with
labelling the new version of labels (see the sketch below).
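For example, one way to use that file is to review the least confident predictions first. This is a sketch only: the 'confidence' column name is an assumption, so check the actual header of your predictions.csv.

```python
import csv
import os

path = os.path.expanduser('~/emails/v1/predictions.csv')
with open(path) as f:
    rows = list(csv.DictReader(f))

# Sort ascending by confidence so the most uncertain examples come first.
rows.sort(key=lambda r: float(r.get('confidence', 0)))
for r in rows[:10]:  # the ten least confident predictions
    print r.get('file_path') or (r.get('text') or '')[:60], r.get('confidence')
```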
Run the command below to start labelling again :-

```
python main.py label --data_csv_file ~/emails/v1/predictions.csv \
                     --local_working_dir ~/emails/
```
Note: We run the labeler on the predictions.csv file from version 1.
This leads to the same labelling process. After labelling some more examples, call the trainer module :-
```
python main.py train --local_working_dir ~/emails/ \
                     --version 2 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier
```
Repeat the same process if the results are still unsatisfactory.
If you are still not satisfied with the training results, here are some things you could do :-

- Run the model for more epochs by changing the 'num_epochs' value in params.json.
- If you have a lot of training data (say > 20000 examples), you could increase the hyper-parameter values in params.json (a sketch follows this list).
- Provide more training examples for the labels that are performing badly.
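As a purely illustrative sketch, such a tweak to params.json might look like the following; apart from num_epochs, which is named above, the keys shown are assumptions and the real file in the repository may differ:

```json
{
  "num_epochs": 50,
  "vocab_size": 20000,
  "learning_rate": 0.001
}
```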
A few errors that might commonly occur and their possible solutions :-
`google.cloud.exceptions.TooManyRequests`:

This error is due to the tool making too many requests too quickly. Add some sort of throttling, like `time.sleep(0.1)`, before making the NL API requests in trainer.py (see the sketch below).
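A minimal sketch of that fix, assuming a loop in trainer.py that calls the NL API once per document (annotate_text stands in for the real call):

```python
import time

def annotate_text(doc):
    # Stand-in for the actual NL API request made in trainer.py.
    pass

documents = ['first document', 'second document']  # placeholder corpus

for document in documents:
    time.sleep(0.1)  # throttle so requests stay under the NL API quota
    annotate_text(document)
```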
`The provided GCS paths [] cannot be read by service account $SVCACCT`:

This error occurs when the service account `$SVCACCT` doesn't have write permissions to the GCS bucket. Run the following command to set the ACL permissions :-
```
gsutil defacl ch -u $SVCACCT:O gs://$BUCKET/
```
This is not an official Google product.