Deep RL Arm Manipulation

This Deep Reinforcement Learning Arm Manipulation project has two objectives to achieve using a template project. Whereby to achieve each objective we create a DQN agent and define reward functions to teach a robotic arm.

The template project is based on the Nvidia open source project “jetson-reinforcement” developed by Dustin Franklin.

Robot Arm in Gazebo simulator
Robot Arm in Gazebo simulator

The two primary project objectives are:

  • Have any part of the robot arm touch the object of interest, with at least a 90% accuracy for a minimum of 100 runs.
  • Have only the gripper base of the robot arm touch the object, with at least a 80% accuracy for a minimum of 100 runs.

Written by Nick Hortovanyi

Reward functions

The reward functions are defined in ArmPlugin.cpp. The arm joints were updated using position control (as that was the programs default setting). For each joint there are two actions (either increase or decrease joint position).

REWARD_WIN was set to 0.125 (0.1 2nd objective) with REWARD_LOSS to -0.125 (-0.1 2nd objective).

If the robot gripper hit the ground a REWARD_LOSS * 10 was given and the episode ended.

Interim rewards, within the episode, were issued if there was no ground contact or 100 frames had beed exceeded.

The main interim reward was based on the distance goal delta between the gripper and the cycling prop. If a positive weighted average was derived then a REWARD_WIN was recorded otherwise REWARD_LOSS * distance to goal was issued. Thus the REWARD_LOSS was higher the further away from the goal the arm was.

For the gripper base (2nd) objective an additional REWARD_LOSS was added if the absolute average goal delta was < 0.001 to penalise no movement.

If the robot arm hit the prop, a REWARD_WIN * 10 was used for the first objective otherwise a REWARD_LOSS * 5 for the second objective if the collision was not with the gripper_middle.

However for the second objective a REWARD_WIN * 20 was issued if the collision point was gripper_middle.

Any collision ends the episode.

Hyper Parameters

Image dimensions were set to the same size as the input. Training was performed on a GTX1070 and there was no need to restrict memory usage.

OPTIMIZER "Adam" was chosen as it in general performs better then RMSProp whilst maintaining its advantages.

For objective 1 the LEARNING_RATE was 0.1 with REPLAY_MEMORY at 1000. The value was chosen via trial and error.

For objective 2 the LEARNING_RATE was decreased to 0.01 due to the higher REPLAY_MEMORY set at 20000. The higher REPLAY_MEMORY was used so as to allow for more discrete learning, due to the smaller surface area required to achieve a collision to meet objectives.

For both BATCH_SIZE was set to 512 (again sufficient memory on the GTX 1070).

LSTM was used USE_LSTM true with LSTM_SIZE 256 which was set via trial and error.


Objective 1 – Have any part of the robot arm touch the object of interest, with at least a 90% accuracy for a minimum of 100 runs.

Results objective 1
Results objective 1

The robotic arm quickly learnt how to hit the prop with a degree accuracy in a repeatable fashion. On occasion if the arm trained initially away from the prop, it would take longer to achieve a higher accuracy.

Once a winning path was learnt this configuration consistently had the robotic arm quickly hitting the prop objective.

As can be seen in the above summary output the objective was achieved well within the criteria specified.

Objective 2 – Have only the gripper base of the robot arm touch the object, with at least a 80% accuracy for a minimum of 100 runs.

Results objective 2
Results objective 2

With the finer control required, and alteration to the interim reward system, this configuration would often hesitate before making a move. Whilst it learnt quickly how to get very close to having the gripper_middle hit the prop, it would also often just miss either hitting the ground or the arm itself hitting the prop. There seemed to be a repeatable pattern, of just extending past and swinging down in an arch, that once learnt gave consistent winning results.

Occasionally the middle joint would collide with the ground and this would lead to the objective not being met.

This configuration was not always reproducible, however with the above screen shot it was able to meet the objectives.

Future work

There were clear arcs that once found achieved a win quickly. Such that it would be worthwhile investigating an interim reward system based on not just the distance from the goal but also distance from an ideal arc trajectory as the arm approached.

Further using centre points to calculate distance from goals becomes less accurate the closer to the goal the arm is. Such that other points like the end of the gripper_middle and top of prop cylinder, would be worthwhile experimenting with.


Robotic Inference Project Writeup


The first project in term 2 of the Udacity Robotics Nano Degree program requires students to initiate their own inference project inclusive of data acquisition. The project builds on the initial reference project for digit image recognition inside the supplied Nvidia Digits environment.

The project ideas are the student’s own and must have at least 3 classification categories eg defective item vs normal item with classes (no item, defective item, normal item).


Pedestrian and bicycle lanes are often crowded with many people not aware of or selectively ignoring the signage. It can lead, to an unsafe or hazardous environment, for all that use it with police officers reluctant to enforce the rules via fines.

The concept selected, in this project, was to classify an image as either containing a pedestrian, not-pedestrian or background.

The goal being, that some sought of visual representation via a screen with a smile or a frown, could be given to act, as a robotic traffic controller. Other potential instantiation could include a torso using upper body movement to signal good or bad behaviour.

Background / Formulation

During the initial inference task, on supplied data, GoogLeNet was chosen as it had a good inference rate per image with reasonable accuracy. Using the Adam Optimiser with an initial learning rate of .001, it was able to meet the numerical requirements of inference time below 10 ms with accuracy > 75%. The input used for this reference model in Nvidia’s DIGITS was 256×256 3 channel colour images.

Similar requirements of accuracy would be required for this inference project. It was not necessary to be a 100% accurate as a smiling or frowning face at least makes people think about what they are presently doing. It was not going to be used to issue fines or other enforcement notices. Video cameras would stream image data between 24-30fps which means that an Inception, VGG model and some ResNet models may be too slow for inference in real time.

With the additional perception that colour could also be useful in detecting pedestrian vs not-pedestrian, GoogLeNet was again chosen for this project using the Adam Optimiser with an initial learning rate of .001.

Other experimentation with using a higher initial learning rate of 0.01 with the above configuration over 5 epochs did not improve validation accuracy which remained around 50%. Similarly AlexNet over 5 epochs with the same Adam optimiser and learning rate did not increase accuracy. One experiment was performed using GoogLeNet with RMSProp optimiser and an initial learning rate 0.001 which did not result in improvement of validation accuracy but did have significantly higher training loss so was also not progressed.

Data Acquisition

A GoPro mounted on a tripod was used. It was positioned on the side of a pedestrian esplanade at Surfers Paradise, Gold Coast, Australia. As it was summer holidays a reasonable amount of varying traffic was expected. The background looked over the ocean to have a consistent image where there was not going to be movement (besides cloud) other then what was on the esplanade.

GoPro Data Acquisition
GoPro Data Acquisition

Three angles (facing left, centre and right) were used for capture per the following graphic:

Camera Background Angles
Camera Background Angles

The GoPro was setup in wifi mode for time lapse capture, which was controlled via an iPhone. Initially 2 seconds elapsed was used, which eventually was dropped to 0.5. Using the GoPro time-lapse feature, meant that individual jpeg files were captured as opposed to a MP4 video.

Using the iPhone to control the control, meant that I could visualise what was coming before starting the next capture batch.

After the capture the images were manually placed into a directory for each category.

Image data was captured for the three categories background (322), pedestrian (349) and not-pedestrian (94). It became apparent at this time that not enough not-pedestrian image data had been captured. This was mainly due to the initial 2 seconds elapsed time used. Due to high heat and humidity of the Australian summer, in the following afternoons, it was not practical to capture more data from the same spot.

Skateboarders were placed in the pedestrian category.

An example of a pedestrian and not-pedestrian follows. Background examples are as above.

not-pedestrian example
not-pedestrian example
pedestrian example
pedestrian example

A jupyter notebook was used to create a generator to supplement the data by randomising the image brightness, randomly flipping the images vertically and jittering the images randomly
in the x (by or – 25 pixels),y (by or – 50 pixels) planes to create supplemental image data.

The images were also resized to 256 x 256 and saved as PNGs.

The final generated supplemental data had 2000 not-pedestrian, 1000 pedestrian with 1000 background images.


The initial inference task, on supplied data, GoogLeNet was chosen as it had a good inference rate per image with reasonable accuracy. Using the Adam Optimiser with an initial learning rate of .001, it was able to meet the numerical requirements of inference time below 10 ms (~5 ms actual) with accuracy > 75% (75.40984% actual).

During training of the initial inference task 100% validation was achieved per the following training graph after 5 epochs.

Training Graph
Training Graph

However similar training results were not achieved for this inference project on captured data. The following training graph after 10 epochs follows

Project Training Graph
Project Training Graph

This had a validation accuracy of ~50%.

The following are the results of two randomly selected images per classification category uploaded as original high-res jpeg images from the GoPro.

Inference Background Sample
Inference Background Sample
Inference Not-Pedestrian Sample
Inference Not-Pedestrian Sample
Inference Pedestrian Sample
Inference Pedestrian Sample

The indicative inspection suggests that there was insufficient data to get a result > 75% for this project at this time. It also appears that the model as trained can not distinguish between pedestrian and not-pedestrian but can distinguish a background image.

Inference times were not tested separately as GoogLeNet is known to have a fast inference time which would be sufficient for this project.


The dataset collected did not have enough sample images. This was as a result of using time-lapse with too high a value. In hind site, a combination of 0.5-1 sec time-lapse for slow moving pedestrians with >30 FPS video for higher speed moving non-pedestrians would of allowed for more data.

Of note is when there is a combination of non-pedestrian and pedestrian in the same frame led to the though of potentially using object detection first to find a window to classify. This was not implemented in this version however it would have led to a more accurate ability to classify as the background would be eliminated ie if no objects detected it must be a back ground image.

For this project the duration of a consistent display of say 2 to 3 seconds to the passing pedestrian and non-pedestrian traffic would drive the ultimate inference time required. It would suggest that it needs to be an average classification of a 1 second or two when leading up to where the video is captured.

Of note were skateboarders. There is only the skateboard (which has a low profile in the image) that distinguishes it from a pedestrian as velocity is not taken account with single images.

In addition the depth (away from the camera) of the traffic passing by changes the size of the object that needs to be classified. Pre-filtering and zooming these to a consistent size may improve accuracy.

Future Work

Whilst the project did not achieve a good train validation result, it has laid the foundation for future iterations. There is potential to capture more data and use object detection to refine the training and inference steps of the project.

Providing soft means to monitor and influence peoples decisions regarding signage and the associated rules for safe usage of (non-vehicle) transit paths would potentially be received positively by the community. Police forces lack the budget or people to enforce these rules and they are reluctant to issue minor infringement notices (with potential to destroy good will in the community). Hence a more subtle robotic person could improve the situation in a cost effective means.

The size of this market is unknown. However speed cameras using smiley and sad faces are used around the local area where I live. They are having a positive impact on driver behaviour in the areas deployed. Thus there is a market for a more community friendly and automated means to impact on behaviour.

Clone Driving Behaviour

Clone driving behaviour using Deep Learning

With this behaviour cloning project, we give steering & throttle instruction to a vehicle in a simulator based on receiving a centre camera image and telemetry data. The steering angle data is a prediction for a neural network model trained against data saved from track runs I performed.
simulator screen sot

The training of the neural net model, is achieved with driving behaviour data captured, in training mode, within the simulator itself. Additional preprocessing occurs as part of batch generation of data for the neural net training.

Model Architecture

I decided to as closely as possible use the Nvidia’s End to End Learning for Self-Driving Cars model. I diverged by passing cropped camera images as RGB, and not YUV, with adjusting brightness and by using the steering angle as is. I experimented with using 1/r (inverse turning radius) as input but found the values were too small (I also did not know the steering ratio and wheel base of the vehicle in the simulator).

Additional experimentation occurred with using, Steering angle prediction model but the number of parameters was higher then the nvidia model and it worked off of full sized camera images. As training time was significantly higher, and initial iterations created an interesting off road driving experience in the simulator, I discontinued these endeavours.

The model represented here is my implementation of the nvidia model mentioned previously. It is coded in python using keras (with tensor flow) in and returned from the build_nvidia_model method. The complete project is on github here Udacity Behaviour Cloning Project


The input is 66x200xC with C = 3 RGB color channels.


Layer 0: Normalisation to range -1, 1 (1./127.5 -1)

Layer 1: Convolution with strides=(2,2), valid padding, kernel 5×5 and output shape 31x98x24, with elu activation and dropout

Layer 2: Convolution with strides=(2,2), valid padding, kernel 5×5 and output shape 14x47x36, with elu activation and dropout

Layer 3: Convolution with strides=(2,2), valid padding, kernel 5×5 and output shape 5x22x48, with elu activation and dropout

Layer 4: Convolution with strides=(1,1), valid padding, kernel 3×3 and output shape 3x20x64, with elu activation and dropout

Layer 5: Convolution with strides=(1,1), valid padding, kernel 3×3 and output shape 1x18x64, with elu activation and dropout

flatten 1152 output

Layer 6: Fully Connected with 100 outputs and dropout

Layer 7: Fully Connected with 50 outputs and dropout

Layer 8: Fully Connected with 10 outputs and dropout

dropout was set aggressively on each layer at .25 to avoid overtraining


Layer Fully Connected with 1 output value for the steering angle.


Keras output plot (not the nicest visuals)

Data preprocessing and Augmentation

The simulator captures data into a csv log file which references left, centre and right captured images within a sub directory. Telemetry data for steering, throttle, brake and speed is also contained in the log. Only steering was used in this project.

My initial investigation and analysis was performed in a Jupyter Notebook here.

Before being fed into the model, the images are cropped to 66×200 starting at height 60 with width centered – A sample video of a run cropped.

Cropped left, centre and right camera image
Cropped left, centre and right camera image

As seen in the following histogram a significant proportion of the data is for driving straight and its lopsided to left turns (being a negative steering angle is left) when using data generated following my conservative driving laps.
Steering Angle Histogram

The log file was preprocessed to remove contiguous rows with a history of >5 records, with a 0.0 steering angle. This was the only preprocessing done outside of the batch generators used in training (random rows are augmented/jittered for each batch at model training time).

A left, centre or right camera was selected randomly for each row, with .25 angle ( for left and – for right) applied to the steering.

Jittering was applied per Vivek Yadav’s post to augment data. Images were randomly transformed in the x range by 100 pixels and in the y range by 10 pixels with 0.4 per xpixel adjusted against the steering angle. Brightness via a HSV (V channel) transform (.25 a random number in range 0 to 1) was also performed.
jittered image

During batch generation, to compensate for the left turning, 50% of images were flipped (including reversing steering angle) if the absolute steering angle was > .1.

Finally images are cropped per above before being batched.

Model Training

Data was captured from the simulator. I drove conservatively around the track three times paying particular attention to the sharp right turn. I found connecting a PS3 controller allowed finer control then using the keyboard. At least once I waited till the last moment before taking the turn. This seems to have stopped the car ending up in the lake. Its also helped to overcome a symptom of the bias in the training data towards left turns. To further offset this risk, I validated the training using a test set I’d captured from the second track, which is a lot more windy.

Training sample captured of left, centre and right cameras cropped

Center camera has the steering angle and 1/r values displayed.

Validation sample captured of left, centre and right cameras cropped

Center camera has the steering angle and 1/r values displayed.

The Adam Optimizer was used with a mean squared error loss. A number of hyper-parameters were passed on the command line. The command I used looks such for a batch size of 500, 10 epochs (dropped out early if loss wasn’t improving), dropout at .25 with a training size of 50000 randomly augmented features with adjusted labels and 2000 random features & labels used for validation

python --batch_size=500 --training_log_path=./data --validation_log_path=./datat2 --epochs 10 \
--training_size 50000 --validation_size 2000 --dropout .25

Model Testing

To meet requirements, and hence pass the assignment, the vehicle has to drive around the first track staying on the road and not going up on the curb.

The model trained (which is saved), is used again in testing. The simulator feeds you the centre camera image, along with steering and throttle telemetry. In response you have to return the new steering angle and throttle values. I hard coded the throttle to .35. The image was cropped, the same as for training, then fed into the model for prediction giving the steering angle.

steering_angle = float(model.predict(transformed_image_array, batch_size=1))
throttle = 0.35

Successful run track 1

Successful run track 1

Successful run track 2

Successful run track 2

note: the trained model I used for the track 1 run, is different to the one used to run the simulator in track 2. I found that the data I originally used to train a model to run both tracks, would occasionally meander on track 1 quite wildly. Thus used training data to make it more conservative to meet requirements for the projects.