Amazon Interview Notes: ML

DS - How would you predict whether a person will buy something on Amazon in the next month? Which models could be used? https://mlwave.com/predicting-repeat-buyers-vowpal-wabbit/

https://www.researchgate.net/post/How_can_I_study_the_past_spending_behaviour_of_a_customer_in_a_banking_perspective_and_predict_the_next_purchase_category_and_amount_of_buy

To predict whether a first-time buyer will purchase next month, the model has to evaluate non-transactional customer data, such as how many times a customer clicked on an email or how the customer interacts with your website. These models can also take into account certain demographic data. For example, in consumer marketing they may compare gender, age, and zip code to those of other likely buyers. In business marketing, relevant demographics may include industry, job title, and geography.

Here’s how it works: the models compare the pre-purchase behavior of prospective buyers to the pre-purchase behavior of thousands or millions of previous customers who ended up buying, comparing attributes like what emails they opened and what products they spent the most time looking at. The prospects that behave most like the previous buyers are tagged as “high-likelihood buyers”.

Predicting likelihood to buy for repeat buyers is a lot easier than predicting it for first-time buyers because there is a lot more information to go on. Repeat-purchase predictions use all of the customer's interactions, such as purchased item type (some item types are purchased more frequently), the last purchase for an item type, returned purchases, order interval, general events (holidays, seasons), and phone calls to customer service.
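A few of the repeat-buyer signals listed above (order interval, last purchase, returns, item types) can be turned into model features roughly as follows. This is only a sketch: the `repeat_buyer_features` helper, the tuple layout of `orders`, and the sample data are hypothetical and not taken from the linked articles.

```python
from datetime import date

def repeat_buyer_features(orders, as_of):
    """Build simple features for one customer.

    orders: list of (order_date, item_type, returned) tuples (hypothetical layout).
    Assumes the customer has at least one order.
    """
    dates = sorted(d for d, _, _ in orders)
    # Days between consecutive orders.
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return {
        "n_orders": len(orders),
        "days_since_last_order": (as_of - dates[-1]).days,
        "mean_order_interval_days": sum(gaps) / len(gaps) if gaps else None,
        "return_rate": sum(1 for _, _, r in orders if r) / len(orders),
        "n_item_types": len({t for _, t, _ in orders}),
    }

# Made-up order history for illustration.
orders = [
    (date(2018, 1, 5), "grocery", False),
    (date(2018, 2, 4), "grocery", False),
    (date(2018, 3, 6), "electronics", True),
]
features = repeat_buyer_features(orders, as_of=date(2018, 3, 31))
print(features)
```

Features like these would then be fed, one row per customer, into any binary classifier with "bought in the following month" as the label.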

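DS - Gradient-based optimization for logistic regression. The variants of gradient descent and the pros and cons of each. Which first-order and second-order optimization methods exist, and what are their pros and cons?
DS - How do you select features?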
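For the logistic regression optimization question, the standard objective and the two families of update rules can be written out as below. The notation is an assumption for illustration and does not come from the original notes: $x_i$ are feature vectors, $y_i \in \{0,1\}$ are labels, $w$ is the weight vector, $\sigma$ is the sigmoid, and $\eta$ is a learning rate.

```latex
\begin{aligned}
J(w) &= -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \sigma(w^\top x_i) + (1-y_i)\log\big(1-\sigma(w^\top x_i)\big)\Big] \\
\nabla J(w) &= \frac{1}{m}\sum_{i=1}^{m}\big(\sigma(w^\top x_i) - y_i\big)\,x_i \\
H(w) &= \frac{1}{m}\sum_{i=1}^{m}\sigma(w^\top x_i)\big(1-\sigma(w^\top x_i)\big)\,x_i x_i^\top \\
\text{first order (gradient descent):}\quad w &\leftarrow w - \eta\,\nabla J(w) \\
\text{second order (Newton):}\quad w &\leftarrow w - H(w)^{-1}\,\nabla J(w)
\end{aligned}
```

First-order steps are cheap per iteration but need a tuned step size. Newton steps usually need far fewer iterations but require forming and solving with the $d \times d$ Hessian.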

Q1 - Which statistical method do you think is most overused?

Q2 - Suppose the company is awarding bonuses, and you are given the task of selecting the awardees. How would you do it? Describe your analysis as specifically as possible.

Q5 - What does the curse of dimensionality mean? Also asked about hash tables.

Q9 - feature selection

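Q13 - Two models have classification accuracies of 80% and 81%. Can you say that the 81% model is better? Why?

Weighted accuracy (WA) vs. unweighted accuracy (UA):

If there is class imbalance, only UA can select a model that is not biased toward the big class.

(A follow-up was asked in passing here: how to preprocess data with an unbalanced class distribution: random sampling, class weights, etc.)

Also consider whether the test sample size is large enough for the difference to be significant, the diversity of the test data, etc.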
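The WA/UA distinction can be made concrete with a small sketch. It assumes the common convention that WA is plain overall accuracy and UA is the mean of per-class recalls; the function names and the toy predictions are hypothetical.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all samples classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Made-up imbalanced test set: 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10
# Model A always predicts the big class.
always_majority = [0] * 100
# Model B gets 75 of 90 negatives and 8 of 10 positives right.
balanced_model = [0] * 75 + [1] * 15 + [1] * 8 + [0] * 2

for name, pred in [("always_majority", always_majority),
                   ("balanced_model", balanced_model)]:
    print(name, weighted_accuracy(y_true, pred), unweighted_accuracy(y_true, pred))
```

In this toy case the always-majority model wins on WA (0.90 vs. 0.83) but drops to 0.50 on UA, while the balanced model stays above 0.8 on both.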

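Q17 - Then I was asked to build an ML model for a given situation: pick out the unpaid accounts among AWS users. At first I did not quite understand "unpaid". I asked what data would be provided and analyzed it a bit. Finally I was asked how to validate the model and how to determine whether the model is viable.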

Q18 - What is the difference between maximum likelihood and maximum a posteriori?
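As a compact reference for this question, the two estimators differ only in the prior term. The symbols are assumptions for illustration: $D$ is the data, $\theta$ the parameters, and $\tau^2$ the variance of a hypothetical Gaussian prior.

```latex
\begin{aligned}
\hat\theta_{\mathrm{MLE}} &= \arg\max_{\theta}\ \log p(D \mid \theta) \\
\hat\theta_{\mathrm{MAP}} &= \arg\max_{\theta}\ \big[\log p(D \mid \theta) + \log p(\theta)\big] \\
\theta \sim \mathcal{N}(0, \tau^2 I)\ &\Rightarrow\ \log p(\theta) = -\frac{1}{2\tau^2}\lVert\theta\rVert_2^2 + \text{const}
\end{aligned}
```

With a flat prior the two coincide, and with the Gaussian prior above MAP is maximum likelihood with an L2 penalty.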

Q19 - Feature extraction and Word2Vec-related topics.

Q20 –

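Suppose you are given a pile of data, and the plot turns out to be a noisy sine function. How do you train a model from the data? At first I did not really understand the question and kept missing the point. Later the interviewer gave hints, and I moved toward a regression approach. You need to write out the derivation (optimization function, derivatives, etc.), how to train the parameters, and how to handle problems such as overfitting.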

Model: Y = a + sin(bX + c), where a, b, c are parameters.

Optimization/cost function: mean squared error, 1/m * sum((y - y_pred)^2)

Use gradient descent to minimize the cost function (first derivatives needed).

Overfitting can be addressed by regularization.
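The recipe above (sine model, mean squared error, gradient descent, optional regularization) might look like this in plain Python. Everything beyond the model form is an assumption: the synthetic data, the learning rate, the step count, the optional `l2` penalty, and in particular the starting value of `b`, since the loss is non-convex in the frequency and gradient descent needs a start reasonably close to it.

```python
import math
import random

def fit_sine(xs, ys, a=0.0, b=1.0, c=0.0, lr=0.02, steps=5000, l2=0.0):
    """Fit y = a + sin(b*x + c) by batch gradient descent on the MSE."""
    m = len(xs)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for x, y in zip(xs, ys):
            phase = b * x + c
            err = a + math.sin(phase) - y      # prediction minus target
            ga += err                          # d(err^2)/da, up to a factor 2
            gb += err * math.cos(phase) * x    # chain rule through sin(b*x + c)
            gc += err * math.cos(phase)
        # Gradient of MSE plus optional L2 penalty l2 * (a^2 + b^2 + c^2).
        a -= lr * (2 * ga / m + 2 * l2 * a)
        b -= lr * (2 * gb / m + 2 * l2 * b)
        c -= lr * (2 * gc / m + 2 * l2 * c)
    return a, b, c

def mse(xs, ys, a, b, c):
    return sum((a + math.sin(b * x + c) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic noisy sine data with made-up true parameters.
random.seed(0)
TRUE_A, TRUE_B, TRUE_C = 0.5, 2.0, 0.3
xs = [5.0 * i / 199 for i in range(200)]
ys = [TRUE_A + math.sin(TRUE_B * x + TRUE_C) + random.gauss(0, 0.1) for x in xs]

a_hat, b_hat, c_hat = fit_sine(xs, ys, b=1.9)  # start near the true frequency
print(a_hat, b_hat, c_hat, mse(xs, ys, a_hat, b_hat, c_hat))
```

In practice the initial frequency could come from a rough estimate such as the dominant FFT peak or the spacing of zero crossings.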

Q21 - How do you choose between random forest and linear regression, given that you want to figure out feature importance?

Q25 - Computing the Hessian, and its relationship to eigenvalues and eigenvectors.

Q26 - Describe an example of a data error and how you resolved it.

Q27 - Sensitivity analysis

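Q28 - PCA: high-level understanding -> how it is implemented -> the relationship between PCA and SVD -> why implementing it with SVD is better -> comparison of latent analysis methods -> other types of matrix decomposition.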
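The PCA/SVD relationship in this question can be summarized in a few lines. The notation is an assumption for illustration: $X$ is the $n \times d$ data matrix with column means already subtracted.

```latex
\begin{aligned}
X &= U \Sigma V^\top \\
C &= \frac{1}{n-1} X^\top X = V \,\frac{\Sigma^2}{n-1}\, V^\top \\
\text{principal directions} &= \text{columns of } V, \qquad
\text{explained variances} = \frac{\sigma_i^2}{n-1}, \qquad
\text{scores} = X V = U \Sigma
\end{aligned}
```

One common argument for the SVD route is numerical: it never forms $X^\top X$, whose condition number is the square of that of $X$.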

Q29 –

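First he asked: after you have built a model, new data arrives to be used for predicting the future, but you do not know the labels of this new data. How do you judge whether the model will predict accurately and whether it needs to be retrained?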

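(1) Use a validation set during training to prevent the model from overfitting; (2) compare whether the feature distribution of the newly added data is similar to that of the earlier training data; perhaps this could be checked with a t-test?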
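The second idea, comparing a feature's distribution in the new data against the training data, can be sketched with a Welch t-test on one numeric feature. This is a minimal illustration under assumptions: the data are synthetic, the `welch_t` helper is hypothetical, and the p-value uses a normal approximation that is only reasonable for large samples. A t-test only detects a shift in the mean, so a test on the whole distribution (for example a two-sample Kolmogorov-Smirnov test) may be a better fit in practice.

```python
import math
import random
from statistics import mean, variance

def welch_t(train, new):
    """Welch t statistic and an approximate two-sided p-value.

    The p-value uses the normal approximation, which assumes large samples.
    """
    se = math.sqrt(variance(train) / len(train) + variance(new) / len(new))
    t = (mean(new) - mean(train)) / se
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

# Synthetic values of a single feature.
random.seed(1)
train = [random.gauss(0.0, 1.0) for _ in range(2000)]
same = [random.gauss(0.0, 1.0) for _ in range(2000)]      # same distribution
shifted = [random.gauss(0.5, 1.0) for _ in range(2000)]   # mean has drifted

t_same, p_same = welch_t(train, same)
t_shift, p_shift = welch_t(train, shifted)
print("same:", t_same, p_same)
print("shifted:", t_shift, p_shift)
```

A very small p-value on many features would be a hint that the new data has drifted and that retraining should be considered.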

2016-10-18 ML SDE, ML Scientist

1. When an Amazon seller uploads a product, a category must be given. How would you recommend a suitable category and the related sub-category based on product name, description, brand, and similar information?

2. How to handle unbalanced data

3. How do you train logistic regression, and what is the objective function?

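4. How to combine multiple very similar listed products. For example, searching Amazon for a particular laptop may return 3 results, but most of the time they are actually the same item, with only differences in seller, description, and images.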

5. When is naive Bayes better than logistic regression?

6. Overfitting, Cross Validation etc.

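7. Briefly describe an ML-related project you have done: which ML methods you used, what the data looked like, how many features, how you handled overfitting/underfitting, the difference between L1 and L2, and feature selection.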

2018-9-27
onsite
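8. The questions were based entirely on projects I had done. He asked high-level questions, such as which project you completed on your own that was also meaningful, and what its significance was from a product perspective.
Which project was done in collaboration with others, especially people from different fields, and how did you collaborate? Also, what was the ML metric (for example AUC), and why use it? AUC may be hard to understand for customers or marketing people, so which metric would work better for them?
Then he would suddenly ask you to explain some ML methods, such as GBM. Also, because I interviewed with the Alexa team, he asked me to talk about how to identify the skill from language.
As I understood it at the time, a skill is a specific category, such as game or pizza. For example, if I ask "Alexa, can you suggest pizza?", it should recommend pizza places near my home based on my location. If I ask "Alexa, can you suggest a game?",
it should ask "What kind of game? Video game or something else?" I say "Video", and it follows up with more specific questions (RPG?) until it has enough detail to make a suggestion. So how would you design a method that lets Alexa ask questions this way?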

2016-11-23

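9. Model building, so they posed a problem: suppose you ran a survey and know each person's name, height, and other attributes. How would you build a model to predict their income?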

10. The difference between naive Bayes and logistic regression. The interviewer said there is a trade-off between them; what is it?

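11. Given a vector with only binary features, which of the two would you use?
Answer: I again said it depends. The interviewer said it does not depend; you get only one try, so which do you use? I said that if everything is binary I would use logistic regression…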

12. evaluate performance:

13. regularization

2017-2-8 applied scientist

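14. Explain deep learning models, their advantages, etc.
15. The difference between generative and discriminative models, with examples.
16. Explain generative adversarial networks. I had read the paper but never used them, so I gave a rough account of the principle.
17. Ways to avoid overfitting: regularization, dropout, cross validation, early stopping, etc.
18. Two models have classification accuracies of 80% and 81%. Can you say that the 81% model is better? Why?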

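19. Weighted accuracy (WA) vs. unweighted accuracy (UA):
If there is class imbalance, only UA can select a model that is not biased toward the big class.
(A follow-up was asked in passing here: how to preprocess data with an unbalanced class distribution: random sampling, class weights, etc.)
Also consider whether the test sample size is large enough for the difference to be significant, the diversity of the test data, etc.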
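Whether an 80% vs. 81% gap is meaningful depends heavily on the test set size, which a two-proportion z-test can illustrate. This sketch assumes the two accuracies were measured on independent test sets of the given sizes; the function name and the sample sizes are hypothetical. When both models are scored on the same test set, a paired test such as McNemar's is usually more appropriate.

```python
import math

def two_proportion_z_test(acc_a, acc_b, n_a, n_b):
    """Two-sided z-test for the difference between two accuracies.

    Assumes independent test sets of sizes n_a and n_b.
    """
    p_pool = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (acc_b - acc_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal distribution
    return z, p_value

# Same 1-point gap, two made-up test set sizes.
z_small, p_small = two_proportion_z_test(0.80, 0.81, 1_000, 1_000)
z_large, p_large = two_proportion_z_test(0.80, 0.81, 100_000, 100_000)
print("n=1,000:  ", z_small, p_small)
print("n=100,000:", z_large, p_large)
```

With 1,000 test samples per model the gap is well within noise, while with 100,000 it is clearly significant.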

Reposted from: https://www.cnblogs.com/ffeng0312/p/9938263.html