Artificial intelligence is a term which for many people, I think, conjures up images of Hollywood movies of killer robots, or perhaps even the more subtle but perhaps more sinister type of artificial intelligence we saw in HAL in the film 2001, in which it becomes increasingly paranoid and eventually kills some of its astronaut masters. That's the world of science fiction, but something very interesting has been happening over the last five years. I think one of the evidence points for this is just some of the books which have appeared; these are all books which have been published in the last couple of years, and just look at some of the titles: Our Final Invention: Artificial Intelligence and the End of the Human Era; Superintelligence: Paths, Dangers, Strategies; The Artificial Intelligence Revolution: Will Artificial Intelligence Save Us or Replace Us. And various luminaries have commented on this: Stephen Hawking said the development of full AI could spell the end of the human race; Elon Musk said rampant AI is the biggest existential threat facing mankind. A somewhat pessimistic view of the subject, I would say. So this evening
what I want to do is to look at some of
the reasons why we’ve seen this
explosion of interest in artificial
intelligence to look at some of the
science behind the technologies and
hopefully to paint a slightly more
optimistic view of how AI might help
humanity rather than destroy us one of
the things I want to do as well in this
talk is to look a little bit at the
history of AI because the history is
quite fascinating and it goes back a
long way certainly at least as far as
this person Alan Turing I’m sure you’ve
all heard of Alan Turing he was a
Cambridge mathematician famous for many
things but amongst them in the 1930s he
really laid the foundations for modern
computer science, and in a sense conceptualized what today we call a digital computer. And he asked the following question: could such a machine emulate
the capabilities of the human mind could
a machine think well he was largely left
to theorize because he didn’t have in
his day the technology to take this
forward and really the big development
that has allowed us to think of
artificial intelligence potentially as
something of practical value is the development of the digital computer, in particular the modern silicon computer. This is a rather old silicon chip; a modern chip will contain billions of components. These chips are incredibly powerful: my laptop, for example, could probably do several billion operations a second. And
here’s an example of an operation take
two 16 digit numbers multiply them
together and get the answer correct to 16 digits. This humble, rather ordinary laptop can probably do several billion, several thousand million, of those per second. So what that machine is doing is slavishly following
instructions in this case the
instructions to multiply two numbers
together but back in the 1950s computer
scientists working with machines far
less powerful than this laptop wondered
whether a set of instructions could be
found to program a computer which would
cause it to exhibit intelligence so this
is an example of such a program this is
from 1964; it's called ELIZA and it was developed at the AI laboratory at MIT. This particular script, this particular version, was a psychotherapist. Today we would call this a chatbot, so
you had a little conversation with the
computer and it would use various tricks
it would ask you for your name and it
would call you by your name, and to many people, at least superficially, at least for a short period of time, it appeared that it was exhibiting something a bit like intelligence. Of course, if you interacted with it for more than a few minutes you realized it was extremely dumb and not in the slightest bit intelligent, but at least superficially it seemed to mimic intelligence. And this
field of artificial intelligence was
tremendously popular back in the 1950s and the 1960s.
it was founded on the idea that computer
scientists would be able to program
computers to be intelligent that is to
say to write a computer program or a set
of instructions such that when the
computer followed these instructions
perhaps blindingly fast, billions per second, it would exhibit
intelligence and there was a tremendous
amount of hype rather like there is
today a tremendous amount of excitement
about artificial intelligence and its
potential. However, that excitement didn't last forever, and probably didn't really last for very long. By the time we reached the 1970s things began to change, and one
landmark moment in the history of
artificial intelligence was the
publication of a report by Lighthill in
1973 in which he presented a very
pessimistic prognosis for this field
which led to an almost complete
cessation of funding first in the UK and
then elsewhere and led to what some
people have called the AI winter where
AI artificial intelligence became very
unfashionable. Now, that report was a seminal report in the field, and
shortly after its publication the BBC
televised a debate from this very
lecture theater from the Royal
Institution and I’m going to play a
short extract from that debate and this
is hosted by Sir George Porter who is
the president of the Royal Society and also
the director of the Royal Institution
good evening and welcome to the Royal
Institution tonight we are going to
enter a world where some of the oldest
visions that have stirred man’s
imagination
blend into the latest achievements of
his science. Tonight we're going to enter
the world of robots robots like shakey
developed by the Stanford Research
Institute shakey is controlled by a
large computer he’s directed through a
radio antenna through a television
camera he gets visual feedback from his
environment the box appears on the
monitor screen the computer analyzes the
traces which appear on the visual
display until it can interpret them as
an object it recognizes shakey gets
tactile feedback through his feelers
he's able to move boxes with his push bar
he’s programmed to solve certain
problems that can be contrived in his
environment to choose say an alternative
route to a certain point when his way
has been blocked shakey is
unquestionably an ingenious product of
computer science and engineering but is
he anything more is he the forerunner of
startling developments which will endow machines with artificial intelligence and enable them to compete with, or even outstrip, the human brain? One
man who’s pessimistic about the
long-term prospects of artificial
intelligence is our speaker tonight
Sir James Lighthill, one of Britain's most distinguished scientists. He's the Lucasian Professor of applied
mathematics at Cambridge and has worked
in many fields of applied mathematics
he’s a former director of the Royal
aircraft establishment at Farnborough
last year he compiled a report for the
science Research Council which condemned
work on general-purpose robots not
surprisingly scientists who’ve been
working on such robots reacted strongly
in defense of their field three of them
are here tonight to challenge Sir James's
findings after they’ve had their say the
discussions will be open to bring in
members of the audience here with many
mathematicians and engineers computer
scientists and psychologists among them
their contribution will be particularly
welcome. It's curious to see how hairstyles have changed since 1973.
so that was the beginning of the AI
winter and we had a period when it was
positively unfashionable to work on AI
or to say you were working on
artificial intelligence nevertheless of
course the field of computer science
advanced with great rapidity and there
were many exciting developments one
that I'll draw attention to is in the
field of computer programs that play
chess so in the heyday of old-fashioned
artificial intelligence chess was
considered to be one of the
pinnacles of human intellectual
achievement surely if a machine could
play chess everything else like solving
world poverty and global warming
whatever would be sort of trivial by
comparison
well, it turned out that in 1997 a computer program, a chess machine called Deep Blue built by IBM, beat the world chess champion Garry Kasparov. Now
what’s interesting about this is the way
in which it worked because this machine
was dedicated to playing chess it did
one thing and one thing only which was
to play chess, and it did it very well,
and it did it by following a series of
instructions to analyze moves and
responses to moves and to evaluate board
positions and it made use of the very
high speed of digital computers it would
analyze literally millions of possible
moves and countermoves in order to
choose a good move for the machine a
more recent example again from IBM is
the machine Watson, which defeated Ken Jennings and Brad Rutter, the two champions at the US television quiz show called Jeopardy! And so this
challenge was conducted on live
television and again very impressive
this is a sort of general knowledge
question answering type of quiz and
Watson made use of information that it
pulled off the internet including the
whole of Wikipedia and much else besides
again tuned by a team of very smart
people at IBM over a period of about
seven years in order to do this one very
specific thing which was to win at
jeopardy and so we have during this
period two examples and there are other
examples of computers doing things which
previously only humans could do, being able
to do things better than humans things
which appear to be very intellectual in
nature but they’re very specific and of
course every time another task was
completed by a machine to a level greater
than humans people said well ok that
wasn’t really intelligence after all
that’s not really artificial
intelligence. Indeed, some cynics have
actually said artificial intelligence is
simply anything that computers can’t yet
do
so every time we solve another problem
we've not advanced artificial intelligence. And in some ways that's fair, because one of the most extraordinary things about the human brain is not that we can do question answering or that we can play chess, but that
we can do all of these things and we can
learn to do new tasks so there’s
something rather special that we haven’t
captured in the examples I've shown so far.
but something very interesting began to
happen about five years ago, and it concerns a field which has been around, as you'll see later, going back to the 1960s: the field of neural networks and
the development of so-called deep
learning a little bit later in this talk
I’m going to dive into some of the
science behind neural networks and deep
learning deep learning was developed by
Geoffrey Hinton at the University of
Toronto and other academics and
colleagues around the world and appeared
to show great promise and so an attempt
was made to see whether this would scale
to some sort of real-world problem and
back in 2012 Geoff and others from
Toronto collaborated in this case with
Microsoft Research in Redmond to apply
deep learning to the problem of speech
recognition. Speech recognition is a tough challenge; it's been around for many years, and if you'd tried a speech recognition system from 10 years ago you'll know it was pretty bad. Could deep learning help with speech recognition? Now at the time there was a whole community of people
working on speech recognition people
doing PhDs going to conferences
publishing papers but the performance
the error rate of speech recognition
systems had been pretty much flat for an entire decade. Along came deep learning and it immediately produced a 30% reduction in error rate. Now that was dramatic if you think of the effort
that had gone into this field over the
previous 10 years so this particular
example, this is Rick Rashid, who was the
founding vice president who set up
Microsoft research worldwide he’s seen
here in Beijing and he’s illustrating
the power of deep learning by giving a
talk to a Chinese audience in Mandarin
now he doesn’t speak any Mandarin what’s
happening is that he's speaking in English and that is, first of all, captured and transcribed into text by a deep neural network; that English text is then translated into Mandarin by another deep network; and finally a third deep network is taking that Mandarin and synthesizing speech using samples of
Rick’s own voice so even though it’s
Mandarin it still sounds a little bit
like Rick so that was a sort of seminal
moment. These deep neural networks were not programmed to do this: they learned
how to do that they learned from data so
a very different approach to artificial
intelligence and this is a particularly
moving occasion, because there are about 5,000 Chinese students in the audience and some of the students are in tears at the thought that the language barrier might at last come crashing down. So that was interesting, but what's particularly interesting is that this one technique of deep neural networks seems capable of solving
many different tasks so this is a
British startup called deep mind and in
2014 they applied deep neural networks, together with a technique called reinforcement learning, to again tackle some games: not chess this time but some old Atari games, about 50 of them. The neural network learns to play these different games by effectively a process of trial and error: it makes random moves, it sees how its score is doing, it tries different things, and it gradually learns over a period of many, many games how to play the game successfully. In about half of the games it achieved human-level performance. Now what's interesting is that it wasn't programmed to solve a specific game: the exact same architecture, the exact same software, was able to learn to play a whole variety of different games.
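To make that trial-and-error idea concrete, here is a minimal sketch of reinforcement learning in Python. It is not the deep network DeepMind actually used, just plain tabular Q-learning on an invented five-square corridor "game" where the only score comes from reaching the right-hand end; the environment, reward and learning-rate choices are all illustrative assumptions.

```python
# Minimal tabular Q-learning sketch (an illustrative stand-in for the deep
# reinforcement learning DeepMind used): the agent starts at the left of a
# five-square corridor and only scores when it reaches the right-hand end.
import random

N_STATES = 5                              # positions 0..4; position 4 is the goal
ACTIONS = [-1, +1]                        # move left or move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.3     # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Trial and error: mostly pick the best-known move, sometimes a random one.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0   # the "score"
        # Move this state-action value towards reward plus best future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned best action in each non-goal square (should be +1, "move right").
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```

The point is simply that nothing in the code is specific to this particular game: the same update rule, fed a different environment and score, would learn a different behaviour.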
so this is much more like the
capabilities of the brain the ability to
learn and the flexibility to learn new
problems. DeepMind was acquired by
Google and in 2016 earlier this year
they used much the same technique of deep
neural networks and reinforcement
learning to tackle a very much harder
game the game of Go so this is a
deceptively simple-looking game
involving black and white stones on a
simple grid of squares
but the number of combinations of moves is very
much larger than in chess so from the
computational point of view this is a
very much harder problem than chess and
therefore building a machine that achieves human-level performance had proven very much harder than with chess, and it was thought it would be at least another decade before that was achieved. So it was very surprising when, only this year, AlphaGo, as the program was known, beat one of the world's leading Go players, probably about a decade earlier
than people had anticipated so that’s
game playing and that's speech recognition, but
much the same technique of deep learning
can be applied to very different fields
here’s another historically very
important example this is called
imagenet so this is a data set of the
order of a million images classified
according to many thousands of different
categories and the goal is for a machine
to take an image and to assign it to the
correct category so it needs to take the
top left image as its input and the output has to be the label "judo" and not the label "ocean front", for example. Well, that's a tough problem, and people had been working on this for a number of years. Then they applied deep learning and the immediate effect was to halve the error rate compared to any previous
technique so again a very dramatic
improvement in performance and back in
2015 a deep neural network developed by
Microsoft Research was applied to this
data set and achieved the same error
rate that a human makes now I should say
that one of the reasons the neural
network is as good as a human is that
it’s better than humans at
distinguishing 57 different varieties of
mushroom, but it also makes mistakes that we would think of as rather silly and that perhaps no human would make. Nevertheless it's remarkable that this same architecture, the same concept, can be applied to this very different domain and achieve human-level performance.
and so it’s really that spectrum of
different successes in many different
domains that really has underpinned this
explosion of excitement
around artificial intelligence so I
thought I’d spend a little bit of time
now looking at neural networks and deep
learning and tell you a little bit about
some of the science and some of the
technology behind these successes so
neural networks, as the name suggests, are inspired, albeit loosely, by the human brain and in particular by the neurons in the brain. So the interesting part of the brain is these electrically active cells called
neurons and here’s a photo micrograph
showing some of the neurons which have
been stained, and you can see they are very complex structures with lots of branches, and
they make lots of connections with each
other and they send electrical signals
to each other and thereby process
information now the human brain is often
described as the most complex object in
the known universe and I just want to
give you a little feeling for just how
extraordinary the brain is so this is a
picture of South America and outlined in
red is the Amazon rainforest and in
spite of our best attempts to destroy it
it’s nevertheless still absolutely
enormous and you can see a picture of
the UK for scale I used to have a
picture of Europe but I’ve had to update
it recently unfortunately now what’s
interesting is that the number of
neurons in the brain is of the same
order as the number of trees in the
Amazon rainforest. But that's not really the interesting part: the really interesting part about the brain is not the neurons but the connections between the neurons. These are called the synapses, and each neuron makes perhaps 10,000 connections with other neurons. The synapses are thought to be the seat of learning, and the number of synapses in the brain is of the same order as the number of leaves on the trees of the Amazon
rainforest so the brain is truly
extraordinary
we certainly don’t understand how the
brain works but our limited knowledge of
how the brain works has been
inspirational in developing a technology
called neural networks. So here are two neurons. The neuron on the left, if it is stimulated in an appropriate way, can fire: it can send an electrical impulse down this cable, the axon, and that axon makes connections called synapses with other neurons, and can stimulate those other neurons themselves to fire or can inhibit them from firing. And the strengths of those synapses can change as a result of the operation of the brain, as a result of processing information. So the brain has the capability to learn as a result of, effectively, the data that it sees,
the inputs that it receives and so going
back to as far as the 1940s people began
to build mathematical models of neurons
and synapses and learning in the brain
and there are some very sophisticated
models but the ones that interest us are
extremely simple ones and we can
describe them by this little picture so
the dots down the left-hand side represent inputs; if you like, you can think of those inputs as coming from other neurons, and they're combined together to cause the neuron here, labelled y, either to fire or not to fire, and the connections between them, labelled w, represent the strengths of those synapses. So this little model can be expressed mathematically, and this is the only equation in the entire talk, I promise you. The equation captures this very nicely: it says that you take each of those inputs x_i and you multiply it by a weight, a strength, w_i, that can be positive or negative; you add up all those weighted inputs and you pass them through this function sigma, and sigma just says that if the total combination of inputs is positive the output is a 1, or if it's negative the output is a 0. So you could
imagine this little neuron being a little classifier: imagine those inputs being something that's extracted from an image, and perhaps the output says whether or not there's a face in the image. So our goal is to have this neuron output a 1 if there is a face and output a 0 if there isn't a face, and so we could imagine adjusting all those little weights, those parameters, the synapse strengths if you like, using lots of examples of images of faces and images of not-faces, until we have tuned those parameters in such a way that the system has learned to solve that particular task. So that very simple mathematical model of a neuron is called a perceptron, and there was a lot of excitement and a lot of hype again around these very early neural networks back in the 1960s.
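As a rough sketch, here is that one equation turned into Python: a single artificial neuron that sums its weighted inputs and thresholds the result. The particular weights and inputs are made-up numbers, chosen only for illustration.

```python
# The talk's one equation as code: a single artificial neuron takes inputs x_i,
# multiplies each by a weight w_i, adds them up, and thresholds the total:
# output 1 if the weighted sum is positive, 0 otherwise.
def neuron(inputs, weights):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

# Illustrative example with made-up weights and inputs.
print(neuron([0.9, 0.2], weights=[1.0, -2.0]))   # 0.9 - 0.4 = 0.5 > 0  -> 1
print(neuron([0.1, 0.8], weights=[1.0, -2.0]))   # 0.1 - 1.6 = -1.5 < 0 -> 0
```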
One of the pioneers of this field is Frank Rosenblatt, who in the late 1950s and early 1960s did a lot of work, both theoretical and experimental, on perceptrons. And what's interesting, of course, is that he didn't have access to wonderful machines like this laptop; he couldn't just program these in software, so he had to build analogue hardware instantiations of perceptrons. And here he is in front of some symbols: at the back there's a triangle, a circle, a square, which shows a typical problem that the
perceptron would be asked to solve
could it distinguish between a circle
and a triangle could it be trained to
tell the difference between a circle and
triangle? Now the input to the perceptron
was this box on the desk this is an
array of twenty by twenty cadmium
sulfide photo cells photo sensitive or
light sensitive resistors which formed
effectively a very primitive digital
camera; these are like the pixels of a very
slow very low resolution camera and so
here’s a typical experimental setup in
this case it’s going to try to
distinguish between some letters of the
alphabet. This is, as I said, a very poor quality camera, so we need some very powerful lights shining on the object and there's a lens that focuses the image onto those photocells. So the output of that camera then goes into this big rack of equipment, and what you see here are effectively the synapses of these neuron models. So in the rack here, in this person's hand, each of those cylindrical objects is a combination of an electrical motor and a rotary resistor, or potentiometer, so by a purely electrical process the electric motor can change the resistance value, and the value of that resistance represents the strength of that synapse. And Rosenblatt
invented something called the perceptron
algorithm which was a mathematical
procedure by which those motors could
adjust the strengths of the synapses
in response to various inputs in order
that the system could learn. So let's say we're distinguishing between triangles and squares: you present a triangle as input, and if the output is "triangle", that's fine; if the system makes a mistake and outputs "square", the algorithm makes little adjustments to all of those synapses to make the output closer to the desired value. Now you present another image, say a square; if the output is "square", that's fine, and if it isn't we make some adjustments to the synapses. And so that perceptron learning algorithm allowed the system to learn by seeing lots of examples of each class, gradually improving in performance and hopefully solving the problem. Now there was a lot of excitement about these perceptrons because it turned out that they could actually learn to solve things like distinguishing shapes and letters of the alphabet and so on, and for the day that was remarkable. But this wasn't just an empirical result: Rosenblatt also proved a theorem.
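Rosenblatt's perceptron learning rule itself is only a few lines when written in software rather than in motors and potentiometers: whenever the output is wrong, nudge each weight a little in the direction that would have made it right. The tiny "triangle versus square" data set below is invented purely for illustration.

```python
# Perceptron learning rule: for each example, if the prediction is wrong,
# adjust every weight by (target - output) * input, scaled by a small step size.
def train_perceptron(examples, n_inputs, lr=0.1, epochs=20):
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in examples:
            output = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            error = target - output            # -1, 0 or +1
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w

# Invented toy data: 4 "pixel" features, label 1 for "triangle", 0 for "square".
examples = [([1, 0, 1, 0], 1), ([1, 1, 0, 0], 1),
            ([0, 1, 0, 1], 0), ([0, 0, 1, 1], 0)]
print(train_perceptron(examples, n_inputs=4))   # a separating set of weights
```

Because this toy data set happens to be separable by a single layer of weights, the convergence theorem described next guarantees that the loop will find a working solution.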
he showed mathematically that if the
perceptron was capable of solving a
problem that is to say if there existed
a setting of all of those resistors such that the system would solve the problem,
then the perceptron was guaranteed to
find that solution so people got very
excited about this I’ll just show you a
little bit more of the structure of the
perceptron. What you see on the right there are these racks of potentiometers, and on the left is this jungle of wiring. This jungle of wiring looks like it's just random, and the reason is that it is just random. These are the inputs to those neurons; they're called features, and they're just little combinations of those pixels. We've got a 20 by 20 grid of photocells, the pixels, and each of these neurons would combine some inputs, so-called features, which were just little combinations, subsets, of those pixels combined together in some fixed way chosen by the designer of the perceptron. And there are lots of ways of choosing these, and lots of research was done; one way of choosing them is just to take random subsets of those pixels and combine them, and that's what this random-looking wiring is. The reason this was
interesting is that even though you’d
randomized the inputs the system could
still learn to unscramble them and solve
the problem. So that was sort of remarkable, and what's even more remarkable is that you could take a system which has learned, which has been trained to solve a problem, go in with a pair of wire cutters and cut 10% of the wires, and its performance would degrade a little bit but it would continue to work. It's a little bit like, you know, going down the pub, having a few too many beers, a few extra neurons die, and the next day you can still function, maybe not quite as well as previously. They call that graceful degradation, and it's a property which I'm pretty sure my laptop doesn't have: if I started cutting wires in my laptop, very soon it would just stop working completely. So again this is a little bit more brain-like in some ways. So that generated a
tremendous amount of excitement and so
let me just summarize what's going on here with a little picture. The nodes or units, the neurons if you like, in the left-hand column represent, in the case of the perceptron, those pixels, the original raw pixels, and the dots down the middle are what we call the features. Each of those dots would be some combination, it might be just the sum, of a randomly chosen subset of those pixels, and that's represented by those green connections; those green connections correspond to that jungle of random-looking wiring, and that's fixed: it's chosen by the designer at the outset and then it doesn't change during learning. Then what we have is this red layer, and the red connections represent those resistors; these are the adjustable parameters in this perceptron. And so the neurons or nodes on the right-hand side again take combinations of some subsets of the features, but this time the strengths of the combinations are learned; those are the adjustable parameters. So what you see is that we have a layered structure, and in fact again this is reminiscent of the brain: if you think of visual processing, in the brain it occurs through a series of layers of neurons. An important thing is that only one of these layers is actually adaptive, only one layer changes during learning. Now perceptrons were interesting because they could learn to solve problems, and that's really exciting, but sometimes you would
give it a very similar problem which
looked just as easy and it would fail to
learn. So what was going on? Sometimes they worked, sometimes they didn't. Well, these two computer scientists, Minsky and Papert, analyzed perceptrons mathematically and they showed that there are some very severe limitations to the capabilities of perceptrons, that they are very limited in what they can do, and that limitation arises because there is only a single layer of adaptation. They published this in this famous book called Perceptrons, and it's often said that the publication of this book led to a loss of interest in this alternative approach to artificial intelligence, in which rather than programming the computer to be intelligent, the system learns to be intelligent. This book of course is a piece of mathematics; people accepted it, it was correct, so it was hard to refute. But the proof applied only to single-layer systems, to systems which have a single layer of adaptive connections. At the end of the book they conjectured that even if you had more than one layer, similar results would apply; they conjectured that these neural networks were never really going to be very useful. But that part was pure conjecture. So there we have the
perceptron with a single layer of
adaptation. The field of neural networks had been very exciting in the 1960s but had gone into abeyance, and people had lost interest as a result of the mathematical discussion of perceptrons and their limitations. Then something very interesting happened, which was the discovery of algorithms, different from the perceptron learning algorithm, which would allow networks having more than one layer of adaptation to be trained: techniques like so-called error backpropagation. And so people could now apply multi-layered systems to various problems and see if they worked, and they discovered that these systems were very much more powerful than the single-layer perceptrons. Now for various technical reasons it turned out that you could really only train these systems with usually at most a couple of layers, but nevertheless those systems were very powerful: they were capable of solving lots of problems that hitherto had been impossible, and they led to the second wave of excitement around neural networks in the late 1980s and 1990s.
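To give a flavour of why the extra trainable layer matters, here is a hedged sketch of a two-layer network trained with error backpropagation on the XOR problem, one of the simple tasks that a single-layer perceptron provably cannot solve. The network size, learning rate and number of iterations are arbitrary choices made for illustration.

```python
# A tiny two-layer neural network trained by error backpropagation on XOR,
# a problem no single-layer perceptron can solve. Sizes and learning rate
# are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))    # first adaptive layer
W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))    # second adaptive layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(10000):
    # Forward pass through both layers.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error back through the layers.
    delta2 = (y - t) * y * (1 - y)
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta2; b2 -= lr * delta2.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ delta1; b1 -= lr * delta1.sum(axis=0, keepdims=True)

print(np.round(y.ravel(), 2))   # should end up close to [0, 1, 1, 0]
```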
Now, I began my career as a physicist: I did a PhD in quantum field theory and then I went off to work on the fusion programme. So at this time I was at the Culham Laboratory in Oxfordshire as a theoretical plasma physicist working on nuclear fusion, and I read about the discovery of the backpropagation algorithm and these two-layer neural networks and their brain-like ability to learn to solve problems, and it reminded me of the whole idea of artificial intelligence. I thought this was tremendously exciting; my sense was that we were at the dawn of a new era. This was so exciting that I was actually going to change fields and change career, and I did that by taking neural networks and applying them to data that we were gathering from experiments. This is the inside of the JET tokamak, the world's largest tokamak, which is down in Oxfordshire and operates by containing a hydrogen plasma at about 200 million degrees, and the outside of it is bristling with all kinds of diagnostics and lasers and magnetic measurements and so on. So for the day we had a huge amount of data; we could analyze it in all sorts of interesting ways, and so I set about applying neural networks to analyzing this data. I became so excited about this that I changed fields: I left physics and actually moved into the field of computer science. So those are the two-layer
networks they were very powerful but
they were a long way short of achieving
human level performance on some of the
tasks of the kind that I’ve talked about
so these systems were deployed in
practical applications they were very
useful I think it’s fair to say they
remained reasonably niche
what happened then is other techniques came along; there was a technique called support vector machines that was very popular and that achieved slightly better performance than these neural networks, and so for the second time neural networks sort of went into decline: people lost interest a little bit and moved on to other techniques for so-called machine learning. And then about five years ago there was something of a breakthrough. People like Hinton, who himself had been pivotal in the development of backpropagation and the two-layer neural networks of the 1980s, discovered how to train networks having more than two layers, in fact having many layers, and so the term deep learning
refers to systems which have many many
layers of processing this makes them
extremely powerful because if you think
about a task such as taking an image and
then describing what’s going on in that
image in English language that’s
something which isn’t going to be done
in two simple steps there’ll be some
very low level processing discovering
edges in the image discovering
combinations of edges to make corners
discovering relationships between
corners that make shapes discovering how
those shapes combine together to make
objects like faces looking at shapes of
faces whether somebody’s smiling or not
combining those and using those to
generate words and combining those into
sentences eventually generating language
that describes what’s going on in that
image that’s many many many layers of
processing and so we need to be able to
train networks that are that are deep
that have many many layers of processing
and so really this is the breakthrough
that underpins the new excitement in the
field of artificial intelligence so
here's an actual deep neural network; this
is one that’s used for image processing
for example taking images and labeling
them according to the objects that are
present in the image and these blocks
represent groups of nodes or neurons so
in one of those blocks there are many
layers, each layer is a whole grid of units, and the units make connections to patches, or receptive fields, in the
previous block so you can see the
structure is pretty complex but that
whole system is adaptive and that whole
system is trained on large data sets and
so now we have neural networks containing thousands or even millions of adjustable parameters, trained on millions or sometimes even billions of examples of data points. So that's a modern deep neural network.
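The network in the slide is far bigger than anything that could be written out here, but as a rough sketch of the same idea, here is a small convolutional network in PyTorch: each block of units looks only at local patches of the previous layer, and every connection weight is learned from data. The layer sizes and the ten-class output are made-up choices, not the architecture actually shown.

```python
# A small convolutional network sketch (illustrative only, not the network in
# the slide): blocks of units, each connected to local patches of the previous
# layer, with every weight learned from labelled images.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, n_classes=10):                    # 10 classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # units see 3x3 patches
            nn.ReLU(),
            nn.MaxPool2d(2),                              # shrink the grid
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)  # assumes 32x32 input

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

net = SmallConvNet()
scores = net(torch.randn(1, 3, 32, 32))   # one random 32x32 colour "image"
print(scores.shape)                        # torch.Size([1, 10])
```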
And I like this: this is a fairly geeky magazine called Wired. If you're in the IT business you'll know Wired magazine; if you're not, you perhaps won't. But this is the front cover of Wired magazine from June of this year, and it just says
the end of code soon we won’t program
computers, we will train them. So behind, if you like, the hype and the excitement around artificial intelligence, there really is a very fundamental transformation happening in the field of computer science. That's a transformation from programming a computer directly to solve a problem, that is, a human or a team of humans devising a set of instructions such that when the computer follows those instructions it solves a particular task, to instead doing
something very different writing a set
of instructions or computer program
which allows the computer to learn and
then train the computer to solve the
task by using large amounts of data so I
view that as a very fundamental shift in
the nature of computation there’s
something else going on as well, and I'll illustrate it with this slide, where I asked a number of people just to write the word "uncertainty". What you see of course is
the tremendous variability in human
handwriting you can see from this
example why it’s so difficult to program
a computer directly to do something like
recognize human handwriting there’s
tremendous variability if you think of a
little rule which describes the shape of
the letter e you’ll very quickly find an
exception to that rule and so you can
write another rule that captures the
exceptions but there’ll be exceptions to
the exception there’s this combinatoric
explosion of possibilities that’s really
what defeated old-fashioned AI back in
the 1950s and 60s plus of course the
lack of fast computers
there's something else going on as well, and in a sense it's complementary to this idea of learning. So we're seeing a shift in computation, I think a revolution in computation, from software which is written by humans to software which is learned from data. But not only are we seeing that transformation, from software which is written by hand to software which is learned from data; we're also seeing a shift from software which is based on logic, that is, everything is zero or one, is deterministic, to software which deals with uncertainty: it quantifies uncertainty, it deals, if you like, in shades of grey and ambiguity. So I'm
going to show you a little
demonstration and this demonstration was
actually designed really to illustrate
this idea of uncertainty and to show you
a modern view of machine learning so
I’ve shown you what I would think of as
a traditional view of machine learning
adjusting these synapses, or adjusting these parameters, in a neural network to
bring it closer and closer to the
desired performance but there’s a very
different view of machine learning and
it shows you the critical role played by
uncertainty. So this is an example that will be very familiar to many of you: it's what we call a recommendation system, and in this case it's going to recommend films or movies to people. So this is a huge table; each column of the table
represents a different movie and each
row of the table is a different person
and our goal is to recommend movies to
somebody which we think they might enjoy
watching. Now in a real system we would certainly make use of features, for example features of the film: its length, its genre (is it a comedy or an action-adventure or whatever), who the actors are, and so on. We would also make use of features of the user: their age, their gender, their geographical location, other things we might know about them, and those are certainly very helpful in matching movies to people. But for the purpose of this demo let's ignore those features; all we know are the ratings which
the ratings which
people have given to movies and so we
know that a certain person has watched a
particular movie and they like that
movie that’s represented by the ticks in
these boxes so where a particular person
has watched a particular movie and
they’ve given it a positive rating
because sometimes people watch a movie
and they don't like it, and so they give it a negative rating, and those are the crosses. Now this is essentially
a big table it might have ten thousand
movies and ten million people so this is
an enormous table and it’s mostly
empty: we don't have very many ratings, and our goal effectively is to fill in the blanks. So where a person has not yet
watched a movie we want to predict will
they like the movie or will they not so
I’m going to show you a demonstration of
a system which solves exactly that task
it’s based on machine learning and this
is a little demonstration system,
although the actual technology behind
this is used in real systems and in this
case we've chosen a couple of hundred
movies and the system has already done a
certain amount of learning based on the
ratings of a few tens of thousands of
people on these two hundred movies now
what it’s going to do is make
recommendations for me and to learn about my preferences. Now I wasn't one of the people in the original dataset, so it knows nothing about me at this point; what I need to do is watch a movie and decide whether I like it or not. So let's say I've watched this movie and let's suppose
I do like it okay so what it’s doing now
is it’s reordering the other movies
according to whether it thinks I’ll like
them or not
now the vertical position on the screen
is irrelevant; they're just spread out
vertically so you can see them what
matters is the horizontal position so if
a movie is close to the right-hand edge of the white region, that is, close to that green region, then the system is very confident that I will like the movie, and it measures that confidence using probability: it assigns a high probability to my liking that movie.
conversely if it positions the movies on
the left-hand side of the white region
towards that red edge it’s very
confident that I won’t like it and if
it’s in the middle if it a 50/50 it’s
really very unsure
what you see is that most of the movies
are clustered around the middle there’s
a lot of white space down the right
there’s a lot of white space down the
left and that’s not surprising the only
thing it knows about me is that I like
that one movie; that's all it knows about me, so it hasn't had much data from which to learn, and so it's really very unsure
about most of this so let’s pick another
another example let’s suppose I don’t
like this one
so now what we see is the movies are spreading out: some of them are moving towards the right, where it's more confident that I will like them, and some are moving to the left, where it's more confident that I won't like them. And this, if you like, is the modern view of machine learning: it's the reduction in the uncertainty of the system as a result of seeing the data. And so I can carry on: I can pick another one, let's say I like this one; pick another one, suppose I don't like that one, so now it's seen four examples. So now you see a very
different picture you see a lot of
movies clustered down the right-hand
side very confident I should go and
watch those ones down the left-hand side
pretty confident I won’t like them most
of the white space is now in the middle
there are a few that it's really quite unsure about, The Sound of Music for some reason, but nevertheless you can see that it has learned from data through a reduction in uncertainty. So that, I think, is the modern view of machine learning. I'm going to use this demo to illustrate something else as well, which I think is really very powerful and which it illustrates very nicely, which is the concept of information. So there's a whole field called information theory; it was invented by Shannon back in the 1940s,
and he provided a mathematical basis for
the concept of information and that
really is foundational in modern
computer science and information
technology and I can illustrate that by
going to one of these movies down the right-hand side. So because it's on the right-hand side, it's pretty confident that I will like this movie; so let's suppose I watch it and indeed I do like it. So watch what happens, watch carefully, as I let go of the mouse
button okay actually I’ll pick another
one
so it’s confident that I’ll like this so
let's say I do like it. Watch what happens: a tiny change. The reason is that there was very little
surprise, there's very little surprise in that data, so there's very little information. And if I pick another one here that it's very confident I'll like, let's suppose that I don't like this movie; again watch what happens as I let go of the mouse button. OK, so this time we see a dramatic change: it's now got rather confused, a lot of things have gone back to the middle, because there was a high degree of surprise. It was really confident that I was going to like the movie and I said I didn't, and that was very surprising. And Shannon defined information as the degree of surprise. Now what's interesting is that this is, I think, a very nice illustration of the difference between data and information, because in every case the amount of data is the same: it's one bit, or one binary digit. In order to say that I like a movie or I don't like the movie I can encode it as a 0 or a 1, so each of those, let's say, seven ratings I've provided so far is represented by one bit of data. The amount of data is the same, but the amount of information is very different. For a movie at the right-hand side, when it becomes certain that I will like it and I say I do like it, the amount of information goes to zero; and the amount of information goes logarithmically to infinity as we go across to the left-hand side. So there's a very nice illustration of the distinction between data and information.
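In Shannon's terms, an outcome that occurs with probability p carries -log2(p) bits of information, so a rating the system already expected carries almost nothing, while a big surprise carries a lot. A quick illustration, with made-up probabilities:

```python
# Shannon information: an outcome with probability p carries -log2(p) bits.
# The probabilities below are invented to mirror the demo.
from math import log2

for p in (0.95, 0.5, 0.05):
    print(f"P(outcome) = {p:.2f}  ->  {-log2(p):.3f} bits of information")
# An expected outcome (p = 0.95) carries about 0.07 bits; an unexpected one
# (p = 0.05) carries about 4.3 bits, and the amount grows without limit as p -> 0.
```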
And that's an example of what we call collaborative filtering, because people are collaborating together to help each other work out which movies they're going to like.
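The system behind the demo is more sophisticated than this, but a hedged sketch of collaborative filtering can be very small: weight each previous user by how much they agree with your ratings so far, let them vote on the movies you haven't seen, and squash the result into a rough probability of your liking each one. The tiny ratings table below is invented for illustration.

```python
# A minimal collaborative-filtering sketch (invented data, not the demo's
# actual algorithm): +1 = liked, -1 = disliked, 0 = not rated.
import numpy as np

movies = ["Alien", "Up", "Casablanca", "Heat", "Frozen"]
R = np.array([             # one row per previous user
    [+1, -1,  0, +1, -1],
    [+1,  0, -1, +1, -1],
    [-1, +1, +1, -1, +1],
    [-1, +1,  0, -1, +1],
])
me = np.array([+1, 0, 0, 0, -1])     # I liked "Alien", disliked "Frozen"

# Weight each previous user by how much they agree with my ratings so far,
# then score unseen movies by those users' (weighted) opinions.
agreement = R @ me                    # higher = taste more similar to mine
scores = agreement @ R                # weighted vote for every movie
prob_like = 1.0 / (1.0 + np.exp(-scores / len(R)))   # squash to a rough "probability"

for title, known, p in zip(movies, me, prob_like):
    if known == 0:
        print(f"{title:10s}  P(I like it) ~ {p:.2f}")
```

With only two ratings from me, the probabilities stay close to a half; as more ratings arrive, the agreement weights sharpen and the predictions move towards the edges, which is exactly the reduction in uncertainty the demo shows.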
so this quantification of uncertainty is
based on a branch of mathematics that
goes back certainly 350 years called
probability theory. Now the mathematical equations of probability are deceptively simple;
it’s a very beautiful and very elegant
theory and it’s just a way of putting
numbers behind uncertainty in a way
that’s very consistent so probability is
really the calculus of uncertainty now
there are two kinds of probability so if
you were taught probability in school, you were probably taught probability as the limit of an infinite number of trials:
so the probability that a coin will land
heads if it’s a fair coin is 0.5 or 50%
what that means in precise terms is if
you flip the coin a number of times and
you measure the fraction of times it
lands heads, then as you flip more and more times, as you flip an
infinite number of times that fraction
it will fluctuate around but it’ll
eventually converge to a value and that
value is the probability we call that
the frequentist notion of probability;
it’s the frequency with which something
occurs there’s another view of
probability which we call the Bayesian
view which in a sense is more general
because it encompasses the notion of
frequency but it applies also to things like one-off events, unrepeated events. If
we want to ask what’s the probability
that the moon was once part of the earth
compared to being a separate body that
was captured by the Earth’s
gravitational field we can’t sort of
repeat the origin of the universe
millions of times and see in what fraction of the times, and so on; it doesn't make sense, it's a one-off event,
but we’re using probabilities to
describe uncertainty and it’s
interesting that we use the same
terminology and the reason we use the
same terminology is that if you try to ascribe numbers to quantify uncertainty, those numbers obey some very simple equations, and those equations are exactly the same equations as are obeyed by frequencies in lots of things like coin flips, and so we use the same terminology: we call it probability. But this is a much more general definition, and I'll try to illustrate it with this example. So
here's a coin, but it's a bent coin, so if I flip the coin it might land, say, concave side up more often than it lands concave side down. I don't know if that's right according to the physics, but let's suppose it is. So imagine that I flip this coin an infinite number of times and look at the fraction of times it lands concave side up; let's say that'll be 60%, or 0.6. That's the probability of landing concave side up, so that's a frequentist probability. But suppose
that one side of the coin is heads and
the other is tails but imagine that you
don’t know which side is heads you don’t
know whether the concave side is heads
or the convex side is heads. But suppose I ask you to take a bet, bet 5 pounds, according to whether it's heads or tails: which way should you bet? Well, from your point of view it's symmetrical: you don't know which side is heads or tails. So even though you know it's going to land concave side up more often than concave side down, because you don't know which is heads and which is tails it's symmetrical, and so if you're acting rationally you would bet according to a probability of 0.5. It doesn't
mean that you believe that in the limit
of an infinite number of trials it will
land heads 50% of the time you believe
that it will either land heads 60
percent of the time or it will land
heads 40 percent of the time but you
don't know which. So in a sense the frequency with which it lands concave side up is a bit like a frequentist probability, but this uncertainty over which side is heads or tails is a one-off event: one side is heads or the other is heads; it's not a repeatable event, it's a one-off thing, and you just don't know which it is. That's like this quantification of uncertainty, this Bayesian view of probability.
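Here is a small sketch of the two kinds of probability in that bent-coin story: the long-run frequency of "concave side up" really does settle near 0.6, but averaging over your ignorance of which side is labelled heads gives a rational betting probability of 0.5. The 60% figure is the speaker's assumption and the simulation is purely illustrative.

```python
# The bent coin: frequentist versus Bayesian probability in a few lines.
# Assume (as in the talk) it lands concave side up 60% of the time.
import random

random.seed(1)
flips = [random.random() < 0.6 for _ in range(100_000)]   # True = concave up
print(sum(flips) / len(flips))        # long-run frequency, close to 0.6

# You don't know which side is labelled "heads": it's the concave side with
# probability 1/2 or the convex side with probability 1/2 (a one-off unknown).
p_heads = 0.5 * 0.6 + 0.5 * 0.4
print(p_heads)                        # 0.5: the rational betting probability
```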
Now at this point you might be thinking, well, I'm making a lot of fuss here, because we've just made a tiny change: instead of having zeros and ones we've now got numbers between 0 and 1. It seems like a very small change, but let me just give you one little illustration of the
fact that probabilities can behave in
very peculiar ways so this is not a
trivial change at all; this is really, I think, quite significant. So
here’s a little example here’s a bus and
let’s suppose that the bus is longer
than the car and there’s a bicycle and
suppose the car is longer than the
bicycle, then if the bus is longer than the
car and the car is longer than the
bicycle and I think you’ll all agree
that the bus must be longer than the
bicycle does anybody not agree with that
good we call that mathematicians call
that property transitivity so these if
you like deterministic numbers these
certain numbers these lengths of objects
behave in this transitive way but if we
now go to uncertain quantities we
discover that they can be non transitive
and this is extremely peculiar
and it’s not just a theoretical thing
these are these are non transitive dice
and they’re very easy to construct
they’re just regular dice the only
unusual thing about them is the choice
of numbers on the faces so these are
unbiased dice: they can land on each of the six faces with equal probability, but the choice of numbers is a bit unusual, and a particular number only appears on one of the dice, so you can never have a draw. So if you roll one die against another, one of them will always come up with a higher number, and we'll call that the winner. And what you discover is that if you, let's say, roll the red die against the yellow die, then 2/3 of the time the number on the yellow die will be bigger than the number on the red die. The yellow die actually just has threes, so it always comes up with a three; the red die has a couple of sixes and four twos, so 2/3 of the time it rolls a two and one-third of the time it rolls a six, so 2/3 of the time yellow beats red. OK, that's fine. Likewise, 2/3 of the time purple will beat yellow, and 2/3 of the time green will beat purple, will have the higher number. And here's the amazing thing: two-thirds of the time red will beat green. So you can equip yourself with a set of non-transitive dice (I sell these at very reasonable rates, by the way) and you can have a very profitable evening down the pub with your friend, because you show them these dice, you let them examine them to their heart's content, and you say: now you pick whichever you like and I'll pick one. Of course you pick the next one in the sequence, and you say let's do the best of eleven or best of 15 throws, bet 5 pounds; after a reasonably large number of throws it's very, very likely that you'll win the bet. So they get a bit cross and they want the die that you've just used, and then you pick a different one, and so on, and you'll always win. So this is just one of many, many examples of the fact that probabilities can behave in very unusual and very peculiar ways.
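You can check those two-thirds figures by brute-force enumeration. The talk only spells out the faces of the yellow and red dice, so the purple and green faces below are an assumption, taken from Efron's classic non-transitive set, which is consistent with the yellow and red dice described.

```python
# Exhaustive check of the non-transitive cycle. Yellow and red faces are as
# described in the talk; purple and green are assumed (Efron's classic set).
from itertools import product

dice = {
    "yellow": [3, 3, 3, 3, 3, 3],
    "red":    [2, 2, 2, 2, 6, 6],
    "purple": [4, 4, 4, 4, 0, 0],   # assumed faces
    "green":  [5, 5, 5, 1, 1, 1],   # assumed faces
}

def p_beats(a, b):
    wins = sum(x > y for x, y in product(dice[a], dice[b]))
    return wins / 36

for a, b in [("yellow", "red"), ("purple", "yellow"),
             ("green", "purple"), ("red", "green")]:
    print(f"P({a} beats {b}) = {p_beats(a, b):.3f}")   # each prints 0.667
```

Each pair in the cycle comes out at 24 wins from the 36 equally likely face combinations, which is the two-thirds advantage described.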
so we’ve seen the idea that artificial
intelligence is being revolutionized by
learning from data and that learning
from data happens through the
quantification of uncertainty so
learning from data is one of the key
ideas so these algorithms things like
deep neural networks are one of the
foundations of this revolution
another of the foundations obviously is
the data the explosion that we’re seeing
in data is one of the things that’s
enabling this this revolution the amount
of data in the world is doubling on a
very short very short timescale probably
less than a year and so we’re seeing a
tremendous growth in the amount of data
that can be used to fuel machine
learning there’s a third ingredient as
well and that’s computation so these
techniques are very hungry for computer
power so today we use neural networks
with millions of adjustable parameters
trained on millions or billions of data
points using extremely powerful
computers and these computers live
increasingly in what are called data
centers so here’s a picture of a data
center this particular one is a
Microsoft data center somewhere in North
America
what you see are these low buildings
with no windows and inside racks and
racks and racks full of computers and
storage and networking so these are
really the world’s most powerful
computers these days. Now these data centers are the foundation of what we call cloud computing, the idea that computing
is now increasingly centralized in these
data centers and accessed anytime
anyplace from any device and the trend
is growth in cloud computing and many
companies and including in particular
Microsoft are expanding these data
centers. In this data center, if you look closely at the top of the picture, you'll see some bulldozers and some construction work going on, because this data center is being expanded, and if we fly around a hundred and eighty degrees and look from the other end you'll get some
idea of the scale of expansion so this
particular data center is obviously
increasing in size by an enormous factor
and so all around the world new data
centers are being constructed all the
time we’re seeing this tremendous growth
in the capacity of these data centers
now the last few weeks have been very interesting: there have been some very interesting announcements,
and in particular the announcement by
Microsoft of the world’s first exascale
supercomputer and it’s based on a
technology called FPGA or field
programmable gate array so the way to
understand what this is is to think of it as a hardware chip but where the architecture of the hardware can be
changed using software so it’s a very
flexible kind of chip the chip itself is
not as powerful, not as fast, as a fixed-architecture chip like a central
processing unit in this laptop but it’s
very flexible and so we can change the
architecture and run and try out lots of
different kinds of algorithms and neural
networks and so on and so in these data
centers as well as the regular
computation we’ve been deploying these
field programmable gate arrays on a very
large scale to the point where a couple
of weeks ago we announced the world’s
first exascale AI supercomputer. So an exascale machine can do an exa-operation per second; that's a billion giga-operations per second, or a million million million operations per second. I'm sure this
won’t end up being the fastest computer
there’s more to come so this is just an
extraordinary growth in processing power
coupled with data coupled with these new
algorithms is driving all of the
excitement around machine learning and
artificial intelligence what’s this
being used for well many many many many
different things one example of course
is personal assistants many
organizations are working on large
companies are working on developing
personal digital assistants; this is Microsoft's, this one is called Cortana,
these technologies are at a very early
stage of development but I’m very
confident the next decade will see the
capabilities of these types of assistant
advanced very very rapidly there are
many many other applications of machine
learning and the technology continues to
advance at a tremendous pace so again
just in the last week or so an
announcement in this case again by
Microsoft of the achievement of human
parity in speech recognition so this is
an automatic speech recognition system
which achieves the same error rate at
the word level as a human transcriber
when humans transcribe speech they make a few errors, and for the machines the error rate is now the same. What else will we be using this for? Are we building killer robots to wipe us all out? Well, there are actually many, many more useful
things we can do with this and I’ll just
briefly tell you about a research
project that we're looking at at Microsoft Research in Cambridge, called Project InnerEye. This is
using these machine learning and
artificial intelligence techniques to
look at the treatment of cancer so what
we see on the left is a cross-section of
an MRI scan of a brain tumor, a very nasty brain tumor, and the radiologist is
using a mouse and looking at this image
and segmenting the tumor that is
defining the boundary of the tumor by
hand, in order that this can be used for
radiation therapy planning firing in
x-rays and radiation to try to kill the
tumor and do the minimum damage to the
surrounding tissue so that’s being done
by hand and is a very time-consuming
process but we can use these machine
learning techniques to speed this up and
also to improve the accuracy and
reproducibility of this so a little bit
of human intervention now to provide
some initial segmentation and after
about 10 seconds or so the segmentation
is complete and it’s more accurate or
more reliable than the human
segmentation this is a nice example of
artificial intelligence being used in
partnership with humans. So it's the case today, despite all the advances I've talked about, and I think it will be the case for quite some time to come, that the capabilities of machines are different from, and complementary to, the capabilities of
humans. So here the radiologist, with the experience of looking at these images, many different images from many different patients over many years, has built up a good qualitative understanding of the tumor and of how it should be treated; what the
computer is good at is this three
dimensional segmentation defining
accurately and reproducibly which of those three-dimensional pixels, or voxels, is tumor and which is normal tissue. I
started my talk with a rather gloomy
outlook of killer robots I think that
belongs firmly in the world of Hollywood
but nevertheless this is a very powerful
technology it’s a very general purpose
technology and as it’s deployed and I’m
sure it will be deployed in many ways
which are of enormous benefit to society
helping us as a species tackle some of
the tremendous problems that we face in
the 21st century. Well, we must of course expect a few bumps on the road, and to help us think about issues around privacy and security, around the implications of this transformation to a world of solutions which are learned from data, we just announced in the last week or two the formation of the Partnership on AI, where some of the leading organizations working on artificial intelligence at large scale have come together to work together to see how artificial intelligence can best be used for the benefit of society. And finally, if you're worried about killer robots, I think we will always remain in control. Thank you very much.



