Reinforcement learning

Reinforcement learning (RL)

– a way of programming agents by reward and punishment without needing to specify how the task is to be achieved.
– The basic RL problem includes states (s), actions (a) and rewards (r). The typical formulation is as follows:
1. Observe state
2. Decide on an action
3. Perform action
4. Observe new state
5. Observe reward
6. Learn from experience
7. Repeat
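
A minimal, runnable sketch of this loop, using a toy one-dimensional environment and a random agent (illustrative only, not from the source; any environment/agent pair fits this shape):

import random

# A toy world: states 0..4, start at 0, reward 1 for reaching state 4.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        done = (self.state == 4)
        return self.state, reward, done

env = ToyEnv()
state = env.reset()                    # 1. observe state
done = False
while not done:
    action = random.choice([-1, 1])    # 2-3. decide on and perform an action
    next_state, reward, done = env.step(action)  # 4-5. observe new state and reward
    # 6. learn from experience (a random agent skips this step)
    state = next_state                 # 7. repeat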

A quick background review
https://github.com/vmayoral/basic_reinforcement_learning/blob/master/BACKGROUND.md

– an area of machine learning inspired by behaviorist psychology
– concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

In machine learning, the environment is typically formulated as a Markov decision process (MDP)

The difference between the classical techniques and RL
– RL methods do not need exact knowledge of the MDP, and they target large MDPs where exact methods become infeasible.
– In RL, correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.
– Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
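
One standard way to strike this balance is epsilon-greedy action selection; a minimal sketch (the q_values table of estimated action values is an assumed structure, not from the source):

import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise
    exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(actions)   # exploration of uncharted territory
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploitation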

Wikipedia

Markov decision process (MDP)

– provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
– useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning.
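
A minimal value-iteration sketch, the classic dynamic-programming solution that requires the full MDP model (the transition table and numbers below are made up for illustration):

# P[s][a] = list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.5)]},
}
gamma = 0.9                      # discount factor
V = {s: 0.0 for s in P}          # state values, initialized to zero
for _ in range(100):             # repeated Bellman backups until (near) convergence
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
print(V)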

Markov Decision Process (MDP) Tutorial

Reference

http://www.cse.unsw.edu.au/%7Ecs9417ml/RL1/index.html

OpenAI Gym Tutorial

From a tutorial:
https://gym.openai.com/docs

Observations

The environment’s step function returns four values:

observation (object):
– an environment-specific object representing your observation of the environment.
e.g.
– pixel data from a camera
– joint angles and joint velocities of a robot
– the board state in a board game.

reward (float):
– amount of reward achieved by the previous action.
– The scale varies between environments, but the goal is always to increase your total reward.

done (boolean):
– whether it’s time to reset the environment again.
– Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated.
e.g.
– perhaps the pole tipped too far, or you lost your last life.

info (dict):
– diagnostic information useful for debugging.
– It can sometimes be useful for learning
e.g. it might contain the raw probabilities behind the environment’s last state change.
– However, official evaluations of your agent are not allowed to use this for learning.

=> agent-environment loop
At each timestep,
– the agent chooses an action
– the environment returns an observation and a reward.
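
The CartPole example from the Gym docs makes this loop concrete; step() returns the four values described above (classic Gym API of this era):

import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        action = env.action_space.sample()  # the agent chooses a (random) action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break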

Spaces

The Discrete space
– allows a fixed range of non-negative numbers,
– in CartPole’s case, valid actions are either 0 or 1.

The Box space
– represents an n-dimensional box,
– for CartPole, valid observations are an array of 4 numbers.

Box and Discrete: the most common Spaces.
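
These spaces can be inspected directly, e.g. for CartPole (classic Gym API):

import gym

env = gym.make('CartPole-v0')
print(env.action_space)             # Discrete(2): valid actions are 0 or 1
print(env.observation_space)        # Box(4,): an array of 4 numbers
print(env.observation_space.high)   # per-dimension upper bounds
print(env.observation_space.low)    # per-dimension lower bounds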

Environments

EnvSpec(environment ID)
– defines parameters for a particular task, including the number of trials to run and the maximum number of steps.

e.g.
EnvSpec(Hopper-v1)
– defines an environment where the goal is to get a 2D simulated robot to hop;
EnvSpec(Go9x9-v0)
– defines a Go game on a 9×9 board.
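
The full list of registered EnvSpecs can be printed from the registry (as in the docs of this era):

from gym import envs
print(envs.registry.all())   # every registered EnvSpec, e.g. EnvSpec(Hopper-v1)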

Recording and uploading results

Use Monitor() to export your algorithm’s performance as a JSON file:
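
e.g. wrapping CartPole (classic Gym API; the output directory is an arbitrary choice):

import gym
from gym import wrappers

env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')  # stats are written here as JSON
for i_episode in range(20):
    observation = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
env.close()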

To upload your results to OpenAI Gym:
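
As documented at the time (the upload API and scoreboard have since been retired; the API key came from your gym.openai.com account):

import gym
gym.upload('/tmp/cartpole-experiment-1', api_key='YOUR_API_KEY')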

OpenAI Gym Setup

https://github.com/openai/gym

Basics

2 basic concepts:
1) the environment: the outside world;
2) the agent: the algorithm you are writing.
– The agent sends actions to the environment, and the environment replies with observations and rewards (a score).

The core gym interface: Env
https://github.com/openai/gym/blob/master/gym/core.py
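
An abridged sketch of that interface (method names and the (observation, reward, done, info) contract follow gym/core.py of this era; bodies elided):

class Env(object):
    action_space = None        # a Space of valid actions
    observation_space = None   # a Space of valid observations

    def reset(self):
        """Start a new episode and return the first observation."""
        raise NotImplementedError

    def step(self, action):
        """Advance one timestep; return (observation, reward, done, info)."""
        raise NotImplementedError

    def render(self, mode='human'):
        raise NotImplementedError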

To confirm Python’s version:
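
$ python --version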

To confirm Python’s location from Terminal:
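
$ which python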

Install the Atari environments:
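
$ pip install 'gym[atari]'   # quotes keep shells like zsh from expanding the brackets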

Prepare Python files like the following examples and execute them from Terminal (a combined sketch follows the list of games).

Cart Pole

Space Invaders

Ms Pacman
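
The three games differ only in the environment ID passed to gym.make(); a combined sketch (classic Gym API; the run_env.py filename is arbitrary, and CartPole-v0, SpaceInvaders-v0, MsPacman-v0 are the IDs of this era):

import sys
import gym

# Usage: python run_env.py CartPole-v0   (or SpaceInvaders-v0, MsPacman-v0)
env_id = sys.argv[1] if len(sys.argv) > 1 else 'CartPole-v0'
env = gym.make(env_id)

observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()   # a random agent, just to see the game run
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()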

GIF Animation

To make a GIF animation.

http://nbviewer.jupyter.org/github/patrickmineault/xcorr-notebooks/blob/master/Render%20OpenAI%20gym%20as%20GIF.ipynb
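
Roughly the approach in the linked notebook: render frames as RGB arrays, then save them with matplotlib’s animation API (assumes ImageMagick is installed for the GIF writer):

import gym
import matplotlib.pyplot as plt
from matplotlib import animation

# Collect one episode of frames; mode='rgb_array' returns pixels instead of opening a window.
env = gym.make('CartPole-v0')
frames = []
observation = env.reset()
done = False
while not done:
    frames.append(env.render(mode='rgb_array'))
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
env.close()

# Replay the frames through matplotlib and write them out as a GIF.
fig = plt.figure()
patch = plt.imshow(frames[0])
plt.axis('off')

def animate(i):
    patch.set_data(frames[i])

anim = animation.FuncAnimation(fig, animate, frames=len(frames), interval=50)
anim.save('cartpole.gif', writer='imagemagick')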

OpenAI

OpenAI

– a non-profit artificial intelligence (AI) research company
– one of its founders is Elon Musk.
– aims to carefully promote and develop friendly AI in such a way as to benefit humanity as a whole.
– The founders are motivated in part by concerns about existential risk from artificial general intelligence.

OpenAI
https://openai.com/

Wikipedia
https://en.wikipedia.org/wiki/OpenAI

Wired
Inside OpenAI, Elon Musk’s Wild Plan to Set Artificial Intelligence Free

OpenAI Gym

– a toolkit for developing and comparing reinforcement learning (RL) algorithms.
– consists of a growing suite of environments (from simulated robots to Atari games), and a site for comparing and reproducing results.
– compatible with algorithms written in any framework, such as TensorFlow and Theano.
– As of May 2017, “OpenAI Gym” can only be used through Python, but more languages are coming soon.

OpenAI Gym
https://gym.openai.com/

Documentation
https://gym.openai.com/docs

GitHub
https://github.com/openai/gym

Universe

– a software platform for measuring and training an AI’s general intelligence across the world’s supply of games, websites and other applications.
– allows an AI agent to use a computer like a human does: by looking at screen pixels and operating a virtual keyboard and mouse.

Universe
https://universe.openai.com/

Blog Post
https://blog.openai.com/universe/

The Complete Jenkins Course For Developers and DevOps

Finished an online program @ Udemy.


The Complete Jenkins Course For Developers and DevOps

Tutorial
https://www.tutorialspoint.com/jenkins/index.htm

Tested Maven Plugin

GitHub
https://github.com/takuoyoneda/maven-project
