For deep versus shallow learning in educational psychology, see Student approaches to learning. For more information, see Artificial neural network. Learning can be supervised, semi-murphy machine learning pdf or unsupervised.

Some representations are loosely based on interpretation of information processing and communication patterns in a biological nervous system, such as neural coding that attempts to define a relationship between various stimuli and associated neuronal responses in the brain. Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics and drug design, where they have produced results comparable to and in some cases superior to human experts. Each successive layer uses the output from the previous layer as input. Layers that have been used in deep learning include hidden layers of an artificial neural network and sets of propositional formulas.

They may also include latent variables organized layer-wise in deep generative models such as the nodes in Deep Belief Networks and Deep Boltzmann Machines. A chain of transformations from input to output. CAPs describe potentially causal connections between input and output. CAP depth is potentially unlimited.

The assumption underlying distributed representations is that observed data are generated by the interactions of layered factors. Deep learning adds the assumption that these layers of factors correspond to levels of abstraction or composition. Varying numbers of layers and layer sizes can provide different degrees of abstraction.

Deep learning exploits this idea of hierarchical explanatory factors where higher level, more abstract concepts are learned from the lower level ones. Deep learning architectures are often constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features are useful for improving performance. For supervised learning tasks, deep learning methods obviate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.

Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than labeled data.

Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks. Deep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference. The universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In 1989, the first proof was published by Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Hornik.

The probabilistic interpretation derives from the field of machine learning. It features inference, as well as the optimization concepts of training and testing, related to fitting and generalization, respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function.

The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks. The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop. The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986, and to Artificial Neural Networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons. In 2006, a publication by Geoff Hinton, Osindero and Teh showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.