1. Please explain star schema?

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
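As a minimal illustration of the idea in R data frames (table and column names are made up): a central fact table holds IDs and measures, while a small lookup table maps IDs back to descriptive names.

sales_fact <- data.frame(product_id = c(1, 2, 1, 3),      # central fact table: IDs plus measures
                         store_id   = c(10, 10, 20, 20),
                         amount     = c(99, 25, 99, 40))

product_lookup <- data.frame(product_id = 1:3,             # lookup (dimension) table: ID -> name
                             product    = c("Laptop", "Mouse", "Monitor"))

merge(sales_fact, product_lookup, by = "product_id")       # join on the ID field to recover the names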

2. Do you know what are Recommender Systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

3. Tell us what are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

4. Explain me what are your technical competencies?

Before the interview, do your homework on the analytics environment that the interviewing company uses. During the IT interview, you will be asked to review your technical competencies and skillsets. How well the company feels your technical skills fit with the data analytics approaches and tools they use in their environment can have a make-or-break effect on whether you get the job.

5. Please explain what is Collaborative Filtering?

It is the process of filtering used by most recommender systems to find patterns and information through collaboration among multiple viewpoints, numerous data sources, and several agents.
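As a minimal sketch (the toy ratings below are made up, not a production implementation), user-based collaborative filtering can be illustrated in R by measuring how similar users' rating vectors are:

ratings <- matrix(c(5, 4, 1, 1,
                    4, 5, 1, 2,
                    1, 2, 5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))

sim <- cor(t(ratings))   # similarity between users' rating patterns
sim["user1", ]           # user1 is most similar to user2, so items user2 rates highly are good candidates for user1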

6. Tell us what are Eigenvalue and Eigenvector?

Eigenvectors are for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching, while the eigenvalues are the factors by which the transformation stretches or compresses along those directions.
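For example, in R the eigen-decomposition of a covariance matrix can be computed directly (the columns chosen from the built-in mtcars data are arbitrary):

X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])   # standardize a few numeric columns
e <- eigen(cov(X))                                   # covariance matrix of the standardized data

e$values    # eigenvalues: how much variance lies along each direction
e$vectors   # eigenvectors: the directions (principal axes) themselves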

7. Do you know what is selection Bias?

Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.

8. Tell me do gradient descent methods always converge to the same point?

No, they do not, because in some cases they converge to a local minimum or local optimum point rather than the global optimum. Whether this happens is governed by the data and the starting conditions.
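A minimal sketch in R (the function and learning rate are made up for illustration) shows how different starting points can converge to different local minima:

f      <- function(x) x^4 - 3 * x^2 + x       # non-convex: has two local minima
grad_f <- function(x) 4 * x^3 - 6 * x + 1     # its derivative

gradient_descent <- function(x0, lr = 0.01, steps = 1000) {
  x <- x0
  for (i in seq_len(steps)) x <- x - lr * grad_f(x)
  x
}

gradient_descent(-2)   # converges to the minimum near x = -1.30 (the global one)
gradient_descent(2)    # converges to the other local minimum near x = 1.13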

9. Can you please explain survivorship bias?

It is the logical error of focusing aspects that support surviving some process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous different means.

10. Do you know what is the Law of Large Numbers?

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample mean, the sample variance and the sample standard deviation converge to what they are trying to estimate.
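A quick illustration in R: the running mean of simulated die rolls approaches the true mean of 3.5 as the number of rolls grows.

set.seed(42)
rolls <- sample(1:6, 10000, replace = TRUE)

running_mean <- cumsum(rolls) / seq_along(rolls)
running_mean[c(10, 100, 1000, 10000)]   # drifts closer and closer to 3.5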

11. Do you know what is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault-sequence prevents the final undesirable event from recurring.

12. Do you know what is pruning in Decision Tree?

When we remove sub-nodes of a decision node, this process is called pruning; it is the opposite of splitting.
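As a hedged sketch (assuming the rpart package is installed), pruning a fitted tree back to a chosen complexity level looks like this:

library(rpart)

fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0.001))   # deliberately deep tree
printcp(fit)                      # cross-validated complexity table

pruned <- prune(fit, cp = 0.05)   # remove sub-nodes below the chosen complexity threshold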

13. Tell me what challenges have you encountered while working with big data?

Big data doesn't always work as advertised, which is why your IT interviewer will likely probe you about big data setbacks or limits that you've encountered, and ask how you worked through them. Be prepared to answer this question in a straightforward, factual manner, and cap your answer with a discussion of what you gained from the experience and how it benefits you now.

14. What are the different kernel functions in SVM?

There are four types of kernels in SVM.

☛ Linear Kernel
☛ Polynomial kernel
☛ Radial basis kernel
☛ Sigmoid kernel
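As a hedged sketch (assuming the e1071 package is installed), each of these kernels can be selected through the kernel argument of svm():

library(e1071)

svm(Species ~ ., data = iris, kernel = "linear")
svm(Species ~ ., data = iris, kernel = "polynomial", degree = 3)
svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.5)
svm(Species ~ ., data = iris, kernel = "sigmoid")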

15. Tell me how regularly must an algorithm be updated?

You will want to update an algorithm when:

☛ You want the model to evolve as data streams through infrastructure
☛ The underlying data source is changing
☛ There is a case of non-stationarity

16. Tell me how do you identify a barrier to performance?

This question will determine how the candidate approaches solving real-world issues they will face in their role as a data scientist. It will also determine how they approach problem-solving from an analytical standpoint. This information is vital to understand because data scientists must have strong analytical and problem-solving skills. Look for answers that reveal:

☛ Examples of problem-solving methods
☛ Steps to take to identify the barriers to performance
☛ Benchmarks for assessing performance

17. Explain me what is the goal of A/B Testing?

This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to identify changes to a web page that maximize or increase the outcome of a strategy.
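For instance, comparing the conversion rates of the two variants can be done in R with a two-proportion test (the counts below are made up for illustration):

conversions <- c(A = 120, B = 165)     # successes per variant
visitors    <- c(A = 2400, B = 2450)   # trials per variant

prop.test(conversions, visitors)       # a small p-value suggests the difference between A and B is real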

18. Tell me what is logistic regression?

Logistic Regression is also known as the logit model. It is a technique to predict a binary outcome from a linear combination of predictor variables.
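A minimal sketch in R, using the built-in mtcars data purely for illustration (am is a 0/1 outcome):

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)   # logit model

summary(fit)                            # coefficients on the log-odds scale
head(predict(fit, type = "response"))   # predicted probabilities of am = 1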

19. Tell me what is the difference between Regression and classification ML techniques?

Both regression and classification machine learning techniques come under supervised machine learning algorithms. In supervised machine learning, we have to train the model using a labeled dataset; while training, we have to explicitly provide the correct labels, and the algorithm tries to learn the pattern from input to output. If our labels are discrete values, then it is a classification problem, e.g. A, B, etc., but if our labels are continuous values, then it is a regression problem, e.g. 1.23, 1.333, etc.

20. Do you know how many data structures does R language have?

R's data structures fall into two categories:

Homogeneous data structures –

These contain objects of the same type – Vector, Matrix, and Array.

Heterogeneous data structures –

These contain objects of different types – Data frames and Lists.
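A quick illustration of both categories in R:

v <- c(1, 2, 3)                       # vector (homogeneous)
m <- matrix(1:6, nrow = 2)            # matrix (homogeneous)
a <- array(1:24, dim = c(2, 3, 4))    # array (homogeneous)

l  <- list(id = 1, name = "a", ok = TRUE)              # list (heterogeneous)
df <- data.frame(id = 1:3, name = c("a", "b", "c"))    # data frame (heterogeneous)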

21. Tell me what is the difference between supervised and unsupervised machine learning?

Supervised Machine learning:
Supervised machine learning requires labeled data for training.

Unsupervised Machine learning:
Unsupervised machine learning doesn't require labeled data.

22. Explain me a big data project you worked on?

Companies understand that they have to train and orient you to their business and technical environments, but they also expect you to bring skills, experience, and fresh ideas to the job.

The end business user and the IT interviewer will be especially interested in your project work. For the IT person, be sure to go into the data quality, analysis, publication, and operationalization processes, covering both the end business and the technical enablement details. For the end business person, review the project from a business results perspective, but avoid using technical jargon unless asked.

23. Please explain what the 'R' language does not do?

• R is an open-source language, but it does not come with a graphical user interface.
• It connects easily to Excel/Microsoft Office, but it does not provide any spreadsheet view of data.

24. Do you know what are the types of biases that can occur during sampling?

☛ Selection bias
☛ Undercoverage bias
☛ Survivorship bias

25. Tell me what is selection bias?

Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.

26. Tell me how can you produce correlations and covariances?

Correlations are produced with the cor() function and covariances with the cov() function.
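For example, with the built-in mtcars data:

cor(mtcars$mpg, mtcars$wt)             # correlation between two variables
cov(mtcars$mpg, mtcars$wt)             # covariance between the same pair
cor(mtcars[, c("mpg", "wt", "hp")])    # full correlation matrix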

27. Tell me why do you want to work at this company as a data scientist?

The purpose of this question is to determine the motivation behind the candidate's choice of applying and interviewing for the position. Their answer should reveal their inspiration for working for the company and their drive for being a data scientist. It should show the candidate is pursuing the position because they are passionate about data and believe in the company, two elements that can determine the candidate's performance. Answers to look for include:

☛ Interest in data mining
☛ Respect for the company's innovative practices
☛ Desire to apply analytical skills to solve real-world issues with data

28. Tell me which data object in R is used to store and process categorical data?

The factor data object in R is used to store and process categorical data.
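For example:

sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)   # the categories: "small" "medium" "large"
table(sizes)    # counts per category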

29. Tell me how has your prior experience prepared you for a role in data science?

This question helps determine the candidate's experience from a holistic perspective and reveals experience in demonstrating interpersonal, communication and technical skills. It is important to understand this because data scientists must be able to communicate their findings, work in a team environment and have the skills to perform the task. Here are some possible answers to look for:

☛ Project management skills
☛ Examples of working in a team environment
☛ Ability to identify errors

30. Tell us how do you work towards a random forest?

The underlying principle of this technique is that several weak learners combine to produce a strong learner. The steps involved are:

☛ Build several decision trees on bootstrapped training samples of the data
☛ On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
☛ Rule of thumb: at each split, m = √p
☛ Predictions: by majority rule
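As a hedged sketch (assuming the randomForest package is installed), these steps map onto the ntree and mtry arguments; iris has p = 4 predictors, so mtry = 2 follows the √p rule of thumb:

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf   # prints the out-of-bag error estimate and confusion matrix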

31. Tell me what are the drawbacks of the linear model?

Some drawbacks of the linear model are:

☛ The assumption of linearity of the errors.
☛ It can't be used for count outcomes or binary outcomes
☛ There are overfitting problems that it can't solve

32. Explain me what is TF/IDF vectorization?

tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
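A minimal base-R sketch of the computation (the toy documents are made up; packages such as tm or tidytext provide full implementations):

docs   <- c("the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are pets")
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

tf    <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))   # term frequency per document
idf   <- log(length(docs) / colSums(tf > 0))                                   # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)                                                # tf-idf weights
round(tfidf, 2)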

33. Tell me how to create a function in arguments using apply() in R?

What if we want to find how many data points (n) are in each column of a matrix my.matrx (assumed here to be a numeric matrix defined earlier, e.g. my.matrx <- matrix(1:30, nrow = 6))?

Because we are working over columns (MARGIN = 2), we can use the length function to do this:

apply(my.matrx, 2, length)

There isn't a built-in function in R to find n-1 for each column, so we have to create our own. Since the function is simple, you can create it right inside the arguments to apply() as an anonymous function that returns length(x) - 1:

apply(my.matrx, 2, function(x) length(x) - 1)

The call returns a vector of n-1 for each column.

34. Tell me how do you overcome challenges to your findings?

The reason for asking this question is to discover how well the candidate approaches solving conflicts in a team environment. Their answer shows the candidate's problem-solving and interpersonal skills in stressful situations. Understanding these skills is significant because group dynamics and business conditions change. Consider answers that:

☛ Encourage discussion
☛ Demonstrate leadership
☛ Acknowledge and respect different opinions

35. Explain me what is the difference between machine learning and deep learning?

☛ Machine learning:
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorized into the following three categories.

► Supervised machine learning,
► Unsupervised machine learning,
► Reinforcement learning

☛ Deep learning:
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.