What is the difference between data mining, statistics, machine learning and AI?

Would it be accurate to say that they are four fields attempting to solve very similar problems but with different approaches? What exactly do they have in common and where do they differ? If there is some kind of hierarchy between them, what would it be?

Similar questions have been asked previously but I still don’t get it:

## 12 Answers

There is considerable overlap among these, but some distinctions can be made. Of necessity, I will have to over-simplify some things or give short shrift to others, but I will do my best to give some sense of these areas.

Firstly, **Artificial Intelligence** is fairly distinct from the rest. AI is the study of how to create intelligent agents. In practice, it is how to program a computer to behave and perform a task as an intelligent agent (say, a person) would. This does not have to involve learning or induction at all; it can just be a way to ‘build a better mousetrap’. For example, AI applications have included programs to monitor and control ongoing processes (e.g., increase factor A if it seems too low). Notice that AI can include darn near anything that a machine does, so long as it doesn’t do it ‘stupidly’.

In practice, however, most tasks that require intelligence require an ability to induce new knowledge from experience. Thus, a large area within AI is **machine learning**. A computer program is said to learn some task from experience if its performance at the task improves with experience, according to some performance measure. Machine learning involves the study of algorithms that can extract information automatically (i.e., without on-line human guidance). It is certainly the case that some of these procedures include ideas derived directly from, or inspired by, classical statistics, but they don’t have to be. Similarly to AI, machine learning is very broad and can include almost everything, so long as there is some inductive component to it. An example of a machine learning algorithm might be a Kalman filter.
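To make that definition concrete, here is a toy sketch of my own (not from the answer above): a perceptron whose accuracy, the performance measure, on a made-up linearly separable dataset improves as it accumulates experience in the form of training passes.

```python
# Toy illustration: a program "learns" when its performance measure
# (accuracy) improves with experience (training passes).
# Dataset and parameters are made up for illustration only.

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

def accuracy(w, b, data):
    return sum(predict(w, b, x) == y for x, y in data) / len(data)

data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((0.5, 0.5), 0),
        ((2, 1), 1), ((1, 2), 1), ((2, 2), 1), ((3, 1), 1)]

w, b = [0.0, 0.0], 0.0
before = accuracy(w, b, data)       # no experience yet
for epoch in range(20):             # "experience" = repeated passes
    for x, y in data:
        err = y - predict(w, b, x)  # classic perceptron update
        w[0] += 0.1 * err * x[0]
        w[1] += 0.1 * err * x[1]
        b += 0.1 * err
after = accuracy(w, b, data)

print(before, after)  # → 0.5 1.0
```

The inductive component is exactly the update rule: the weights are adjusted from observed examples rather than programmed by hand.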

**Data mining** is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends. Data mining is carried out by a person, in a specific situation, on a particular data set, with a goal in mind. Typically, this person wants to leverage the power of the various pattern recognition techniques that have been developed in machine learning. Quite often, the data set is massive, complicated, and/or may have special problems (such as there being more variables than observations). Usually, the goal is either to discover / generate some preliminary insights in an area where there really was little knowledge beforehand, or to be able to predict future observations accurately. Moreover, data mining procedures could be either ‘unsupervised’ (we don’t know the answer–discovery) or ‘supervised’ (we know the answer–prediction). Note that the goal is generally not to develop a more sophisticated understanding of the underlying data-generating process. Common data mining techniques would include cluster analyses, classification and regression trees, and neural networks.
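As a small taste of one of those techniques, here is a minimal k-means clustering sketch on made-up one-dimensional data (my own illustration; a real analysis would use a library implementation):

```python
# Minimal k-means sketch on made-up 1-D data (k = 2), illustrative only.

def kmeans_1d(points, k=2, iters=10):
    centers = [min(points), max(points)]  # crude initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                  # assignment step
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters): # update step
            if cl:
                centers[i] = sum(cl) / len(cl)
    return centers

data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.5, 8.8, 9.2]  # two obvious groups
centers = kmeans_1d(data)
print(centers)  # → [1.025, 9.125]
```

This is ‘unsupervised’ in the sense above: nothing tells the algorithm the right answer; it simply discovers the grouping.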

I suppose I needn’t say much to explain what **statistics** is on this site, but perhaps I can say a few things. Classical statistics (here I mean both frequentist and Bayesian) is a sub-topic within mathematics. I think of it as largely the intersection of what we know about probability and what we know about optimization. Although mathematical statistics can be studied as simply a Platonic object of inquiry, it is mostly understood as more practical and applied in character than other, more rarefied areas of mathematics. As such (and notably in contrast to data mining above), it is mostly employed towards better understanding some particular data-generating process. Thus, it usually starts with a formally specified model, and from this are derived procedures to accurately estimate that model from noisy instances (i.e., estimation–by optimizing some loss function) and to be able to distinguish it from other possibilities (i.e., inferences based on known properties of sampling distributions). The prototypical statistical technique is regression.
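For instance, the slope and intercept of a simple regression line are estimated by minimising squared-error loss, which for one predictor has a closed form. A minimal sketch with made-up data:

```python
# Ordinary least squares for one predictor: estimate the model y = a + b*x
# by minimising squared-error loss (closed-form solution). Data is made up.

def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x plus noise
slope, intercept = ols(xs, ys)
print(round(slope, 2), round(intercept, 2))  # → 1.99 0.09
```

Note how this matches the description above: a formally specified model (a line) comes first, and the procedure recovers it from noisy instances.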

Many of the other answers have covered the main points, but you asked for a hierarchy if any exists, and the way I see it, although they are each disciplines in their own right, there is a hierarchy no one seems to have mentioned yet, since each builds upon the previous one.

**Statistics** is just about the numbers, and quantifying the data. There are many tools for finding relevant properties of the data, but this is pretty close to pure mathematics.

**Data Mining** is about using **Statistics** as well as other programming methods to find patterns hidden in the data, so that you can explain some phenomenon. Data Mining builds intuition about what is really happening in some data; it still leans a little more towards math than programming, but uses both.

**Machine Learning** uses **Data Mining** techniques and other learning algorithms to build models of what is happening behind some data, so that it can predict future outcomes. Math is the basis for many of the algorithms, but this leans more towards programming.

**Artificial Intelligence** uses models built by **Machine Learning** and other means to reason about the world and give rise to intelligent behavior, whether this is playing a game or driving a robot/car. Artificial Intelligence has some goal to achieve by predicting how actions will affect the model of the world, and chooses the actions that will best achieve that goal. Very programming-based.

Now, this being said, there will be some AI problems which fall only into AI, and similarly for the other fields, but most of the interesting problems today (self-driving cars, for example) could easily and correctly be called all of these. Hope this clears up the relationship between them you asked about.

**Statistics** is concerned with probabilistic models, specifically inference on these models using data. **Machine Learning** is concerned with predicting a particular outcome given some data. Almost any reasonable machine learning method can be formulated as a formal probabilistic model, so in this sense machine learning is very much the same as statistics, but it differs in that it generally doesn’t care about parameter estimates (just prediction) and it concentrates on computational efficiency and large datasets. **Data Mining** is (as I understand it) applied machine learning. It concentrates more on the practical aspects of deploying machine learning algorithms on large datasets. It is very much similar to machine learning. **Artificial Intelligence** is anything that is concerned with (some arbitrary definition of) intelligence in computers. So, it includes a lot of things.

In general, probabilistic models (and thus statistics) have proven to be the most effective way to formally structure knowledge and understanding in a machine, to such an extent that all three of the others (AI, ML and DM) are today mostly subfields of statistics. Not the first discipline to become a shadow arm of statistics. (Economics, psychology, bioinformatics, etc.)

We can say that they are all related, but they are all different things, although they have things in common, such as the use of clustering methods in both statistics and data mining.

Let me try to briefly define each:

Statistics is a very old discipline mainly based on classical mathematical methods, which can be used for the same purpose that data mining sometimes is, which is classifying and grouping things.

Data mining consists of building models in order to detect the patterns that allow us to classify or predict situations, given an amount of facts or factors.

Artificial intelligence (see Marvin Minsky) is the discipline that attempts to emulate how the brain works with programming methods, for example building a program that plays chess.

Machine learning is the task of building knowledge and storing it in some form in the computer; that form can be mathematical models, algorithms, etc. Anything that can help detect patterns.

I’m most familiar with the machine learning / data mining axis, so I’ll concentrate on that:

Machine learning tends to be interested in inference in non-standard situations, for example non-i.i.d. data, active learning, semi-supervised learning, learning with structured data (for example strings or graphs). ML also tends to be interested in theoretical bounds on what is learnable, which often form the basis for the algorithms used (e.g. the support vector machine). ML tends to be of a Bayesian nature.

Data mining is interested in finding patterns in data that you don’t already know about. I’m not sure that is significantly different from exploratory data analysis in statistics, whereas in machine learning there is generally a more well-defined problem to solve.

ML tends to be more interested in small datasets where over-fitting is the problem, and data mining tends to be interested in large-scale datasets where the problem is dealing with the sheer quantity of data.

Statistics and machine learning provide many of the basic tools used by data miners.

Here is my take on it. Let’s start with the two very broad categories:

- anything that even just pretends to be smart is **artificial intelligence** (including ML and DM).
- anything that summarizes data is **statistics**, although you usually only apply this term to methods that pay attention to the validity of the results (often used in ML and DM)

Both ML and DM are usually both AI and statistics, as they usually involve basic methods from both. Here are some of the differences:

- in **machine learning**, you have a well-defined goal (usually prediction)
- in **data mining**, you essentially have the goal of finding “something I did **not** know before”

Additionally, **data mining** usually involves much more data management, i.e. how to organize the data in efficient index structures and databases.

Unfortunately, they are not that easy to separate. For example, there is “unsupervised learning”, which is often more closely related to DM than to ML, as it cannot optimize towards the goal. On the other hand, DM methods are hard to evaluate (how do you rate something you do not know?) and are often evaluated on the same tasks as machine learning, by leaving out some information. This, however, will usually make them appear to work worse than machine learning methods that can optimize towards the actual evaluation goal.

Furthermore, they are often used in combination. For example, a data mining method (say, clustering, or unsupervised outlier detection) is used to preprocess the data, then the machine learning method is applied to the preprocessed data to train better classifiers.

Machine learning is usually much easier to evaluate: there is a goal such as a score or class prediction. You can compute precision and recall. In data mining, most evaluation is done by leaving out some information (such as class labels) and then testing whether your method discovered the same structure. This is naive in the sense that you assume the class labels encode the structure of the data completely; you effectively penalize data mining algorithms that discover something new in your data. Another way of evaluating it, indirectly, is how the discovered structure improves the performance of the actual ML algorithm (e.g. when partitioning data or removing outliers). Still, this evaluation is based on reproducing existing results, which is not really the data mining objective.
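As a reminder of how that goal-based evaluation works, here is a minimal precision/recall computation on hypothetical labels (my own sketch):

```python
# Precision and recall for a binary prediction task.
# true/pred are hypothetical label lists (1 = positive class).

def precision_recall(true, pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    return tp / (tp + fp), tp / (tp + fn)

true = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0]
p, r = precision_recall(true, pred)
print(p, r)  # → 0.75 0.75
```

Such a score is only computable because the task has a known goal; a genuine discovery task has no such ground truth to score against, which is exactly the evaluation problem described above.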

I’d add some observations to what’s been said.

AI is a very broad term for anything that has to do with machines doing reasoning-like or sentient-appearing activities, ranging from planning a task or cooperating with other entities, to learning to operate limbs to walk. A pithy definition is that AI is anything computer-related that we don’t know how to do well yet. (Once we know how to do it well, it generally gets its own name and is no longer “AI”.)

It’s my impression, contrary to Wikipedia, that Pattern Recognition and Machine Learning are the same field, but the former is practiced by computer-science folks while the latter is practiced by statisticians and engineers. (Many technical fields are discovered over and over by different subgroups, who often bring their own lingo and mindset to the table.)

Data Mining, to my mind anyhow, takes Machine Learning / Pattern Recognition (the techniques that work with the data) and wraps them in database, infrastructure, and data validation/cleaning technologies.

Sadly, the difference between these areas is largely where they’re taught: statistics is based in maths departments; AI and machine learning in computer science departments; and data mining is more applied (used by business or marketing departments, developed by software companies).

Firstly, AI (although it could mean any intelligent system) has traditionally meant logic-based approaches (e.g. expert systems) rather than statistical estimation. Statistics, based in maths departments, has had a very good theoretical understanding, together with strong applied practice in experimental sciences, where there is a clear scientific model and statistics is needed to deal with the limited experimental data available. The focus has often been on squeezing the maximum information from very small data sets. Furthermore, there is a bias towards mathematical proofs: you will not get published unless you can prove things about your approach. This has tended to mean that statistics has lagged in the use of computers to automate analysis. Again, the lack of programming knowledge has prevented statisticians from working on large-scale problems where computational issues become significant (consider GPUs and distributed systems such as Hadoop). I believe that areas such as bioinformatics have now moved statistics more in this direction. Finally, I would say that statisticians are a more sceptical bunch: they do not claim that you discover knowledge with statistics; rather, a scientist comes up with a hypothesis, and the statistician’s job is to check that the hypothesis is supported by the data. Machine learning is taught in CS departments, which unfortunately do not teach the appropriate mathematics: multivariable calculus, probability, statistics and optimisation are not commonplace. One has vague ‘sexy’ concepts such as learning from examples rather than boring statistical estimation (cf. e.g. Elements of Statistical Learning, page 30). This tends to mean that there is very little theoretical understanding and an explosion of algorithms, as researchers can always find some dataset on which their algorithm proves better. So there are huge waves of hype as ML researchers pursue the next big thing: neural networks, deep learning, etc.
Unfortunately, there is a lot more money in CS departments (think Google, Microsoft, together with the more marketable ‘learning’), so the more sceptical statisticians are overlooked. Finally, there is an empiricist bent: basically there is an underlying belief that if you throw enough data at the algorithm, it will ‘learn’ the correct predictions. Whilst I am biased against ML, there is a fundamental insight in ML which statisticians have disregarded: that computers can revolutionise the application of statistics.

There are two ways: (a) automating the application of standard tests and models, e.g. running a battery of models (linear regression, random forests, etc.), trying different combinations of inputs, parameter settings, etc. This hasn’t really happened, though I suspect that competitors on Kaggle develop their own automation techniques. (b) Applying standard statistical models to big data: think of e.g. Google Translate or recommender systems (no one is claiming that people translate or recommend like that, but it’s a useful tool). The underlying statistical models are straightforward, but there are enormous computational issues in applying these methods to billions of data points.
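Idea (a) can be sketched in a few lines, with entirely made-up models and data: fit a small battery of candidate models and keep the one with the lowest held-out error.

```python
# Sketch of automated model selection: run a battery of (toy) models
# and keep the one with the lowest held-out error. All data is made up.

def fit_mean(xs, ys):                 # model 1: constant prediction
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):                 # model 2: least-squares line
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + b * (x - mx)

def mse(model, xs, ys):               # held-out squared error
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1, 2, 3, 4], [1.1, 2.0, 2.9, 4.2]
test_x, test_y = [5, 6], [5.1, 5.9]   # held-out data

best = min((fit_mean, fit_line),
           key=lambda fit: mse(fit(train_x, train_y), test_x, test_y))
print(best.__name__)  # → fit_line
```

Real automation (Kaggle-style) would loop over far richer model families and parameter settings, but the selection logic is the same.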

Data mining is the culmination of this philosophy: developing automated ways of extracting knowledge from data. However, it has a more practical approach: essentially it is applied to behavioural data, where there is no overarching scientific theory (marketing, fraud detection, spam, etc.) and the goal is to automate the analysis of large volumes of data. No doubt a team of statisticians could produce better analyses given enough time, but it is more cost-effective to use a computer. Furthermore, as D. Hand explains, it is the analysis of secondary data – data that is logged anyway, rather than data that has been explicitly collected to answer a scientific question in a solid experimental design. (Data Mining: Statistics and More?, D. Hand)

So I would summarise that traditional AI is logic-based rather than statistical, machine learning is statistics without theory, statistics is ‘statistics without computers’, and data mining is the development of automated tools for statistical analysis with minimal user intervention.