This repository contains the data and task descriptions for the AI/ML Summer Internship program.
Tasks per track:
- Data Science and Time Series Forecasting
- NLP and Deep Learning
- Image Processing and Computer Vision
- BI and Data Analytics
Dataset: https://www.kaggle.com/datasets/utathya/future-volume-prediction
Description: This is a demand forecasting dataset covering multiple agencies and SKUs. The same data is used in the pytorch-forecasting TFT example.
Tasks:
- The data is aggregated at the monthly level. The task is to create a forecasting model that forecasts the demand for the next 3 months. How would you evaluate and compare your models?
- Please list at least 3 models/methods that can be used for this task. Make sure that at least one of them is an ML method and at least one of them does not use any ML/DL. List their advantages and disadvantages.
- Please select a method that you think best suits the problem and create a forecasting model. You are free to do feature engineering or add additional data. Please explain how your method works.
- Agency06 and Agency14 are new and they want to decide which SKUs to sell. Please recommend 3 SKUs for each of them and explain why. How would you approach this problem?
- Assume one of the managers really wants to use an ML model. You are tasked with training that model. The model has 5 hyperparameters. How would you approach the problem?
- One of the managers thinks that increasing the discount (decreasing Sales in price_sales_promotion.csv) would increase the number of sales.
  - How would you approach solving this problem? Can you quantify the effect of changing the Sales by 1%?
  - Assume that it holds true. Should they decrease the price? By how much? Is there a problem with this? If yes, do you have any suggestions for reformulating the initial problem of forecasting the volume of sales?
- You have successfully created an ML model that has great performance. They have been using the model for 2 years. Then suddenly the model performance drops. What could have resulted in this outcome? Could you have prevented it? If yes, how? If not, what can you do now?
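For the evaluation question in the first task, one possible setup is sketched below: hold out the last 3 months of each series, forecast them with a seasonal naive baseline (a method that uses no ML), and score the forecast with MAE and sMAPE. The function names and the 12-month season length are assumptions made for illustration. They do not come from the dataset, and this is not a required approach.

```python
from statistics import mean


def split_holdout(series, horizon=3):
    """Keep the last `horizon` points of a monthly series for evaluation."""
    if len(series) <= horizon:
        raise ValueError("series too short for the requested horizon")
    return series[:-horizon], series[-horizon:]


def seasonal_naive(train, horizon=3, season=12):
    """Forecast each future month with the value observed one season earlier."""
    if len(train) < season:
        # Assumption: with short history, repeat the last observation instead
        return [train[-1]] * horizon
    start = len(train) - season
    return [train[start + (h % season)] for h in range(horizon)]


def mae(actual, predicted):
    """Mean absolute error, in the units of the series."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def smape(actual, predicted):
    """Symmetric MAPE in percent; pairs that are both zero count as zero error."""
    terms = []
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        terms.append(0.0 if denom == 0 else 2 * abs(a - p) / denom)
    return 100 * mean(terms)
```

Any candidate model can be scored on the same holdout with the same metrics, so the baseline gives a reference point that a more complex model should beat.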
Submission: Please use Python for the project. Please create a Jupyter notebook with your solutions. Use markdown to answer the questions and describe your code. Push your work to GitHub. Make sure to include all the information needed to recreate your working environment. Use this form to submit your work: https://forms.gle/wVAcjBgSWwcp7HyH9
Dataset: https://www.kaggle.com/datasets/mewbius/ecommerce-products
Description: The dataset contains a small sample of labeled examples and a lot of unlabeled product information. You will have several tasks on this dataset; the main task is classifying the product category based on the description.
Tasks:
- Cite 3 scientific papers that suggest approaches that can be used for the main task (aka classification of product categories using descriptions) of this problem. Present a very short summary of the key aspects of the papers (may be in bullet points).
- Select the productsfull.csv dataset. Use only the description column to extract the color of the product whenever applicable. Compare your results with the colorname column.
- Using the same dataset, extract a set of keywords describing each category. What methods did you use for keyword extraction? Explain.
- Set aside part of the productsclassified.csv dataset for evaluation (let's call the rest the training data). Then take productsfull.csv again, which has no Classification column (that is, no labels). Use the description column to pretrain a language model, and then use the training data to build a product classification pipeline with zero-shot and few-shot approaches separately. Report your model's performance on the held-out evaluation set.
- Which approach provided better results? Zero-shot learning or few-shot learning? Why?
- You have successfully created an ML model that has great performance. They have been using the model for 2 years. Then suddenly the model performance drops. What could have resulted in this outcome? Could you have prevented it? If yes, how? If not, what can you do now?
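For the color extraction task, a simple starting point is a dictionary-based matcher over the description text. This can later be compared against stronger approaches. The sketch below is only an illustration: the color vocabulary is a short hypothetical list, and the helper names are assumptions, not part of the dataset.

```python
import re

# Hypothetical, deliberately short color vocabulary; extend it for real data
COLORS = ["black", "white", "red", "blue", "green",
          "yellow", "grey", "gray", "pink", "brown"]
_PATTERN = re.compile(r"\b(" + "|".join(COLORS) + r")\b", re.IGNORECASE)


def extract_color(description):
    """Return the first color word found in the description, or None."""
    if not description:
        return None
    match = _PATTERN.search(description)
    return match.group(1).lower() if match else None


def agreement(predicted, reference):
    """Share of rows with a reference color where the extracted color matches it."""
    pairs = [(p, r) for p, r in zip(predicted, reference) if r]
    if not pairs:
        return 0.0
    hits = sum(1 for p, r in pairs if p is not None and p == r.strip().lower())
    return hits / len(pairs)
```

The word boundaries in the pattern keep a word such as "Redwood" from matching "red". The agreement score gives one simple way to compare the extracted values with the colorname column.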
Submission: Please use Python for the project. Please create a Jupyter notebook with your solutions. Use markdown to answer the questions and describe your code. Push your work to GitHub. Make sure to include all the information needed to recreate your working environment. Use this form to submit your work: https://forms.gle/wVAcjBgSWwcp7HyH9
Image: Image_for_CV_task_1.jpg
Description: Consider the image above. The task is to be able to detect different shapes and calculate their area in the picture.
Tasks: Please create 3 different methods:
- Find the objects by the color. The color should be input. Find all the objects that have the given color. Calculate their areas. Please explain your solution.
- Find the objects by their geometric shape (that is rectangle, circle, triangle etc.). Find all the objects with the given shape and calculate their areas. Please explain your solution.
- Create an unsupervised ML model that can find the area given a marker on a relatively brighter and homogeneous background. No need to filter for color or shape. Please explain your solution.
- Please compare the three approaches. List their advantages and disadvantages.
- Assume you have a high-resolution image (30000 x 30000 pixels). The image cannot be fully read into the memory of your machine, so you have to divide the image into grids and process the grids separately. The maximum size of a grid is 3000 x 3000. Will your solutions work in that case? If not, how can you make them work? How would you approach the problem? How would you do the splitting/cutting into grids? Please go over all three approaches and state whether each would work, and if not, what you can do to make it work.
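For the grid question, the splitting step alone can be sketched as a generator of tile coordinates. The function name and the optional overlap parameter are assumptions for illustration. Reading the pixels of each tile, and merging objects that cross tile borders, are left out on purpose.

```python
def _starts(length, tile, step):
    """Start offsets along one axis so that tiles cover [0, length)."""
    starts = [0]
    while starts[-1] + tile < length:
        starts.append(starts[-1] + step)
    return starts


def tile_boxes(width, height, tile=3000, overlap=0):
    """Yield (left, top, right, bottom) boxes no larger than tile x tile."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap
    for top in _starts(height, tile, step):
        for left in _starts(width, tile, step):
            # Clip the last row/column of tiles to the image border
            yield (left, top, min(left + tile, width), min(top + tile, height))
```

With no overlap, a 30000 x 30000 image gives a 10 x 10 grid. A non-zero overlap produces more tiles, but an object cut by a tile border then appears whole in at least one tile, as long as it is smaller than the overlap.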
Submission: Please use Python for the project. Please create a Jupyter notebook with your solutions. Use markdown to answer the questions and describe your code. Push your work to GitHub. Make sure to include all the information needed to recreate your working environment. Use this form to submit your work: https://forms.gle/wVAcjBgSWwcp7HyH9
Dataset: bigquery-public-data.crypto_ethereum on Google BigQuery
Description: Please create an account in Google BigQuery if you don't have one yet. The task is to calculate several features/variables for each address in the publicly available dataset bigquery-public-data.crypto_ethereum.
Tasks: Calculate the following features:
- Transactions sent
- Transactions received
- ETH sent
- ETH received
- Average monthly transactions sent
- Average monthly transactions received
- Average monthly ETH sent
- Average monthly ETH received
- Average time between transactions sent
- Average time between transactions received
- STD of time between transactions sent
- STD of time between transactions received
- Active months
- ETH balance
- After calculating the features, use different visualizations to try to find relationships between the features. You should be able to find/see distinct groups of addresses. Please use statistical tests to find groups or validate that the groups are indeed different.
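The two "time between transactions" features can be defined in more than one way. The sketch below shows one reading: sort the timestamps of one address and compute the mean and standard deviation of the gaps between consecutive ones. Using the sample standard deviation and seconds as the unit are assumptions, as is the helper name.

```python
from datetime import datetime
from statistics import mean, stdev


def gap_stats(timestamps):
    """Return (mean, std) of the gaps in seconds between consecutive timestamps.

    Returns (None, None) when there are fewer than two timestamps, and a std
    of None when there is only one gap (the sample std is undefined there).
    """
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return None, None
    return mean(gaps), (stdev(gaps) if len(gaps) > 1 else None)


# Toy timestamps, made up for illustration only
sent = [
    datetime(2021, 1, 1, 0, 0, 0),
    datetime(2021, 1, 1, 0, 1, 0),
    datetime(2021, 1, 1, 0, 3, 0),
]
avg_gap, std_gap = gap_stats(sent)
```

Addresses with a single transaction have no gaps at all. It is worth deciding early how to represent them, since they can dominate the dataset.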
Notes:
- Please note that 1 WEI is equal to 10^-18 ETH, so you need to convert all WEI values into ETH.
- Please use the Python library called google-cloud-bigquery to access the DB and calculate the features.
- You can learn more about the dataset here: https://www.kaggle.com/datasets/bigquery/ethereum-blockchain/code?utm_medium=partner&utm_source=cloud&utm_campaign=big+data+blog+ethereum
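The notes above can be combined into a short sketch: a WEI to ETH conversion helper and an example query run through google-cloud-bigquery. The table name `transactions` and the columns `from_address` and `value` are assumptions about the schema and should be checked against the dataset. The query is illustrative, and the `LIMIT` keeps the example result small.

```python
from decimal import Decimal

WEI_PER_ETH = Decimal(10) ** 18


def wei_to_eth(wei):
    """Convert a WEI amount (int or numeric string) to ETH without float rounding."""
    return Decimal(wei) / WEI_PER_ETH


# Assumed schema: a `transactions` table with `from_address` and `value` (in WEI)
SENT_SQL = """
SELECT
  from_address,
  COUNT(*) AS transactions_sent,
  SUM(CAST(value AS FLOAT64)) / 1e18 AS eth_sent
FROM `bigquery-public-data.crypto_ethereum.transactions`
GROUP BY from_address
LIMIT 100
"""


def run_query(sql):
    """Run a query and return the rows as a list of dicts."""
    # Imported lazily so wei_to_eth works without the package installed
    from google.cloud import bigquery

    client = bigquery.Client()  # relies on default credentials and project
    return [dict(row.items()) for row in client.query(sql).result()]
```

Calling `run_query(SENT_SQL)` requires a configured Google Cloud project and credentials. Doing the aggregation in SQL, rather than downloading raw transactions, keeps the transferred data small.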
Submission: Please use Python for the project. Please create a Jupyter notebook with your solutions. Use markdown to answer the questions and describe your code. Push your work to GitHub. Make sure to include all the information needed to recreate your working environment. Use this form to submit your work: https://forms.gle/wVAcjBgSWwcp7HyH9