Calvin committed
Commit 8575329
1 Parent(s): ee79db1

final touches

Files changed (1)
  1. README.md +22 -17
README.md CHANGED
@@ -1,12 +1,3 @@
- ---
- title: Offer/Deal Recommender
- sdk: streamlit
- sdk_version: 1.25.0
- app_file: offer_pipeline.py
- pinned: false
- ---
-
-
# Offer/Deal Recommender

To solve this problem, I first broke it down into its simplest form and determined the inputs and outputs. The input is a text string that represents a category, brand, or retailer. The first step is to analyze and process the data. The problem would be simple if we could associate every offer with a category, meaning we join on offer-brand and brand-category. However, there is a complication: a brand may be associated with several categories. For example, Barilla can fall under the categories of red sauce and dry pasta, which works out, but in the data it can also fall under chips, which doesn't make much sense. I processed the data in the following way.
@@ -18,27 +9,41 @@ To solve this problem, I first broke it down into its simplest form and determined the inputs and outputs.
* We iterate through these offers and categorize each according to its most likely labels (> 0.20 probability)
3. We then concatenate these tables so that we have a list of offers, each with its corresponding category (or categories)
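As a concrete sketch of this categorization step, the snippet below uses the HuggingFace zero-shot pipeline with the 0.20 cutoff described above. The checkpoint name, the example offers, and the label set are illustrative assumptions; the README does not name the exact model used.

```python
from transformers import pipeline

# Zero-shot classifier; facebook/bart-large-mnli is an assumed checkpoint,
# not necessarily the one this repository uses.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Illustrative offers and candidate child-category labels.
offers = ["Barilla pasta, buy 2 boxes", "Huggies diapers, spend $20"]
categories = ["dry pasta", "red sauce", "chips", "diapering"]

offer_categories = []
for offer in offers:
    result = classifier(offer, candidate_labels=categories)
    # Keep every label scoring above the 0.20 threshold.
    kept = [label for label, score in zip(result["labels"], result["scores"])
            if score > 0.20]
    offer_categories.append({"offer": offer, "categories": kept})
```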

- After this processing, we perform the following control flow logic using the same zero-shot learning model.
+ After this processing, we perform the following control flow logic using the same zero-shot learning model and a sentence embedding model.

## Pipeline
1. Search for the text input directly in the offers and return this list of offers
2. Determine if the text input is highly likely to be a retailer; otherwise, default to brand/category inference
* We use the zero-shot learning model with the labels "brand", "retailer", and "category". If there is a > 0.40 score of it being a retailer, then we continue as follows (see the sketch after this list):
- * Create a mapping of all the categories that a retailer falls under
- * Find the retailers that have the most overlap in the types of goods sold
- 3. If not highly likely to be a retailer, then we continue with brand/category inference
+ * Extract a sentence embedding using a pre-trained embedding model for each offer, compare it with the retailer text input, then sort and take the top 20
+ * Something that I didn't do but could have added: first find other retailers that have a high overlap in the types of goods sold, then narrow the search down before comparing with each offer
+
+ 3. If not highly likely to be a retailer, then we continue with brand/category inference. The rationale behind this separation is that we can leverage the human-labeled categories, which give us a more refined search when provided a brand or category. The same doesn't really apply to retailers, which carry a broad range of categories.
* We classify the text as one of the 22 parent categories using the zero-shot learning model
* We filter based on the ones that have a score greater than 0.20
* We find the child categories that are associated with this filtered list of parent categories
* We classify the text again according to this reduced set of child categories
+ * Extract a sentence embedding using a pre-trained embedding model for each offer, compare it with the text input, then sort and take the top 20
4. Return the corresponding offers
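Wired together, steps 2 through 4 might look like the following sketch (referenced from step 2 above). The model checkpoints, the `recommend` helper, the offer record shape, and the reuse of the 0.20 cutoff for child categories are illustrative assumptions, not this repository's actual code.

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoints; the README does not name the exact models used.
zsl = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def recommend(text, offers, parent_categories, children_of):
    """Hypothetical helper. `offers` holds {"text": str, "categories": set}
    records produced by the preprocessing step; `children_of` maps each
    parent category to its child categories."""
    # Step 2: route the query; a retailer score > 0.40 takes the retailer branch.
    routing = zsl(text, candidate_labels=["brand", "retailer", "category"])
    scores = dict(zip(routing["labels"], routing["scores"]))
    pool = offers
    if scores["retailer"] <= 0.40:
        # Step 3: keep parent categories scoring > 0.20, then re-classify the
        # text over only the child categories of the surviving parents.
        parents = zsl(text, candidate_labels=parent_categories)
        kept = [l for l, s in zip(parents["labels"], parents["scores"]) if s > 0.20]
        children = [c for p in kept for c in children_of[p]]
        child = zsl(text, candidate_labels=children)
        matched = {l for l, s in zip(child["labels"], child["scores"]) if s > 0.20}
        pool = [o for o in offers if o["categories"] & matched] or offers
    # Shared final step: rank offers by cosine similarity between sentence
    # embeddings of each offer text and the input, keeping the top 20.
    texts = [o["text"] for o in pool]
    sims = util.cos_sim(embedder.encode(text, convert_to_tensor=True),
                        embedder.encode(texts, convert_to_tensor=True))[0]
    top = sims.topk(min(20, len(texts)))
    return [texts[i] for i in top.indices.tolist()]
```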

- ## Assumptions and tradeoffs
- One assumption made is that the user will not try to fool the system by using strings that are not real words, contain numbers, or are sentences. Although the model is still probabilistic and will output something that it thinks it is closest to. Allowing for open-ended inputs is a tradeoff of flexibility over more refined results. By not keeping a specific set of brands or categories that is only found in the data, we can allow the model to generalize. Another tradeoff I made is to remove the need for the user to specify if the input is a retailer, brand, or category. If this is specified, then the search can be refined and we also may not accidentally misclassify something as a retailer or not a retailer. However, again we obtain flexibility and generalization this way. Using terms like "beef" and "steak" may lead to similar offers, but may not be actually found in the data.
+ ## Assumptions
+ 1. One assumption made is that the user will not try to fool the system by using strings that are not real words, contain numbers, or are sentences; the model is still probabilistic, though, and will output whatever it thinks the input is closest to.
+
+ 2. Another assumption is that we only need to provide the offer and not the corresponding retailer and brand. For example, one offer may be present for multiple brands or categories, so it would ideally appear for both. However, I made the assumption that we only care about the relationship between the offer and the text input.
+
+ ## Tradeoffs
+ 1. Allowing for open-ended inputs is a tradeoff of flexibility over more refined results. By not keeping to a specific set of brands or categories found only in the data, we allow the model to generalize.
+
+ 2. Another tradeoff I made is to remove the need for the user to specify if the input is a retailer, brand, or category. If this were specified, the search could be refined, and we would not accidentally misclassify something as a retailer or non-retailer. However, again, we obtain flexibility and generalization this way. Terms like "beef" and "steak" may lead to similar offers, even if the exact term is not actually found in the data.
+
+ 3. Lastly, I made the tradeoff of speed over performance by using a smaller zero-shot learning model. The performance difference is negligible according to online research that ran various experiments on zero-shot learning models.
+

+ ## Conclusion
+ I believe this pipeline would work much better if there were more offers associated with the provided categories. For example, if we use "Huggies" as our input, we see that the model correctly finds the subcategories "Diapering", "Potty Training", and "Baby Safety", but there are no offers associated with the brands in these categories. Therefore, it defaults to other categories that aren't super relevant.

## Requirements and Instructions
- I did not host this website because of the implications of hosting the huggingface model as well. However, this model can be hosted with something like sagemaker. Therefore, the instructions below indicate how to run the Streamlit app locally.
+ This app is hosted on HuggingFace Spaces: https://huggingface.co/spaces/cdy3870/Fetch_App. It takes a minute to load both models, but they are cached afterwards. Unfortunately, the free CPU they provide is quite slow for inference, so I would suggest running locally. Inference is still a bit slow locally, but speed obviously depends on your device. If hosted on a service where a GPU is enabled, the app would be much more efficient.

1. Install the requirements in a virtual environment

@@ -53,4 +58,4 @@ pip install -r requirements.txt
  streamlit run offer_pipeline.py
  ```

- 3. The HuggingFace model takes a minute to download, but it is cached after downloading.
+ 3. The HuggingFace models take a minute to download, but they are cached after downloading.