Autonomous Cyber-Physical Systems: Sensing and Perception Spring 2019. CS 599. Instructor: Jyo Deshmukh USC Viterbi School of Engineering Department of Computer Science Autonomous systems software architecture Software architecture describes key software components and their interrelationships. A very general software architecture will give you insight about all the software that goes into making the autonomous system; this will include: A single real-time operating system or a hypervisor that controls several operating systems Libraries and components for interfacing with specialized components like

GPUs/NPUs, memory, I/O etc. Application-layer software (the interesting part that makes the autonomous system, autonomous) We will focus only on the application layer software architecture USC Viterbi School of Engineering Department of Computer Science 2 Self-driving car architecture Sensing Perception Decision Control Operating System/Hypervisor

USC Viterbi School of Engineering Department of Computer Science 3 Actuation Sensing GPS IMU LiDAR Radar Primary purpose is to know everything about the environment Most of the software modules in the sensing modules deal with: Sensor fusion: (using KFs, EKFs, UKFs, etc.) Data preprocessing: e.g. converting LiDAR data

into point-cloud representation, sampling video signals, compressing data Camera USC Viterbi School of Engineering Department of Computer Science 4 Perception Localization Object Detection Object Tracking Traffic recognition Road topology identification Many things fall under the vague category of

perception, list to the left is not complete Localization: Strongly connected to sensor fusion May use algorithms such as particle filters in addition to Kalman filter Could further sub-divide into road-level localization in a map, or lane-level localization on a road, or localizing within a lane USC Viterbi School of Engineering Department of Computer Science 5 Perception Localization Object Detection Object Tracking Traffic recognition Road topology

identification Object detection Use vision or deep learning algorithms to detect various kinds of objects Pedestrians, Bicyclists, other vehicles, traffic signs, lane markings, obstacles Objects could be static or dynamic, and detection algorithms may vary accordingly Object tracking Tracking trajectories of moving objects Could be based on deep learning or algorithms like optical flow USC Viterbi School of Engineering Department of Computer Science 6 Decision

Prediction Mission planning Behavioral Planning Obstacle avoidance Motion Planning Prediction: Generate most probable trajectories of vehicles, pedestrians, obstacles in the environment Mission/Route planning: Generate very high-level route plan based on way-points, maps Reference planning: Generate trajectories for vehicle + Behavioral planning (traffic-aware): modify according to environment models Motion planning: Synthesizing inputs for low-level

actuators to match a higher-level plan Obstacle avoidance: Stopping or maneuvering around an obstacle USC Viterbi School of Engineering Department of Computer Science 7 Decision Prediction Mission planning Behavioral Planning Obstacle avoidance Decision layer is the most software-intensive layer

Several algorithms have been proposed in the robotics and automotive community based on optimization, search-based planning, discrete decision-making (with state machines). Current trend is to investigate application of AI/control techniques such as reinforcement learning/deep reinforcement learning Motion Planning USC Viterbi School of Engineering Department of Computer Science 8 Control (Low-level control) Steering Control Torque control Lateral stability control

Energy management Emissions control These algorithms have been typically deployed to cars before self-driving cars took off Recent trend is to drive-by-wire, i.e. replace mechanical and hydraulic components by electrical and electronic components Also, researchers are interested in knowing if existing algorithms can be made more efficient with data (models of the environment) Techniques like model-predictive control are gathering momentum USC Viterbi School of Engineering Department of Computer Science

9 Sensors Transducers that convert one physical property into another In our context, a sensor convers a physical quantity into a numeric value Choosing sensors is a hard design problem, as sensors can have many factors that need tuning Accuracy: Error between true value and its measurement (noise values, rejection of external interference) Resolution: Minimum difference between two measurements (usually much smaller number than actual accuracy of the sensor) Sensitivity: Smallest change in value that can be detected USC Viterbi School of Engineering Department of Computer Science 10

Sensor selection Factors affecting sensor selection (continued) Dynamic Range: Minimum and maximum values that can be accurately detected Time-scale/Responsiveness: How often the sensor produces output and frequency bandwidth of the measurement output over time Interface technology: analog, digital or serial or network streams Perspective: Which portion of the environment can the sensor measure (e.g. the field of vision for a camera, orientation for an ultrasonic module) USC Viterbi School of Engineering Department of Computer Science 11 Basics of IMUs

Inertial Measurement Units or IMUs are part of an Inertial Navigation System Use accelerometers and gyroscopes to track position and orientation of an object relative to start position, orientation and velocity Typically: 3 orthogonal rate-gyroscopes measuring angular velocities, and 3 accelerometers measuring linear accelerations (along the 3 axes) Stable-platform IMUs: a platform is used to mount the inertial sensors, and the platform is isolated from external rotational motion Strapdown IMUs: Inertial sensors mounted rigidly (more common due to smaller size) USC Viterbi School of Engineering Department of Computer Science 12 Inertial navigation algorithm (for strapdown IMUs) USC Viterbi

School of Engineering Department of Computer Science 13 IMU equations Relation between body frame and global frame is given by a 3X3 rotation matrix , in which each column is a unit vector along one of the body axes specified in terms of the global axis Rotation matrix is an orthogonal matrix whose determinant is 1. Rotations of about the and axes respectively achieved by following rotation matrices: Body Frame vs. Global Frame USC Viterbi School of Engineering Department of Computer Science

Image from [1] 14 IMU equations continued Note that for a rotation matrix, . Let and be velocities in the global and body frame respectively, then: and We need to track through time Let be the rotation matrix relating body frame at time to body frame at time . Then . If the rotations are small enough, can itself be written as , where is the small angle approximation of the rotation matrix USC Viterbi School of Engineering Department of Computer Science

15 IMU equations So, Also, = , where is the rotational velocity along the axis. So actually evolves according to the linear dynamical system , and the solution for In implementation, we can approximate the updated at each time step by a numerical integrator Acceleration in the body frame is tracked in a similar fashion. Once we have velocity and acceleration, we can compute position, after subtracting the acceleration due to gravity USC Viterbi School of Engineering

Department of Computer Science 16 Basics of GPS Commercial GPS systems provide (GPS) coordinates of a vehicle within 2m accuracy, this is not enough for autonomous driving Differential GPS receivers provide decimeter level accuracy GPS gives information about current time, latitude, longitude, and altitude Often GPS data is used as observations along with an IMU-based inertial navigation system (INS) to localize the vehicle USC Viterbi

School of Engineering Department of Computer Science 17 Kalman Filter/EKF Predict Update Basics of LiDAR LiDAR stands for Light detection and Ranging Typical LiDARs e.g. Velodyne HDL-64E use multibeam light rays Mathematical model by ray-casting: rays are cast at

an angle, and you get the distance from the first obstacle that reflects the light Lidar data consists of rotational angle and distance to obstacle This can be represented in a point cloud form by mapping each obstacle point to coordinates (with respect to the body frame) USC Viterbi School of Engineering Department of Computer Science 18 Basics of Radar Frequency Modulated Continuous Wave Doppler radar is popular

Main idea is to compute the distance to the obstacle in a given direction by comparing transmitted and received signals Allows estimating both distance and relative velocity Radars may require additional signal processing to give precise answers when the environment is dusty, rainy, or foggy. Forward-facing radars estimate relative position/velocity of lead vehicle Surround ultrasonic sensors can help create environment models (Tesla approach) USC Viterbi School of Engineering Department of Computer Science 19 Sensor Fusion

We already learned about Kalman filter that can help do sensor fusion for localization using INS and GPU Sensor fusion for camera and LiDAR data requires new algorithms Centralized algorithms based on conditional random fields, Markov random fields, and decentralized algorithms based on boosting and Gaussian mixture models have been explored Deep learning is also being explored for doing sensor fusion Note: these approaches are exploratory, and there is no standard algorithm accepted by all USC Viterbi School of Engineering Department of Computer Science 20 Bibliography 1. O. J. Woodman, An introduction to inertial navigation - Cambridge Computer Laboratory, https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-696.pdf 2.

Pendleton, Scott Drew, Hans Andersen, Xinxin Du, Xiaotong Shen, Malika Meghjani, You Hong Eng, Daniela Rus, and Marcelo H. Ang. "Perception, planning, control, and coordination for autonomous vehicles." Machines 5, no. 1 (2017): 6. 3. S. Liu, L. Li, J. Tang, S. Wu, Jean-Luc Gaudiot, Creating Autonomous Vehicle Systems, Morgan & Claypool 2018. USC Viterbi School of Engineering Department of Computer Science 21 Perception from LIDAR and vision Data representation Segmentation algorithms from LIDAR data

Object Detection from camera data USC Viterbi School of Engineering Department of Computer Science 22 Data representation Following representations for LIDAR data are most popular: Point-cloud representation in the 3D space Feature representation Representation using grids The choice of representation guides the choice of the algorithms chosen downstream for segmentation/detection Point-cloud based approaches may need filtering algorithms to reduce number of points Voxel-grid filtering : cover the space with tiny boxes, and replace each box

with the centroid of the box USC Viterbi School of Engineering Department of Computer Science 23 Data representation Feature-based approaches Extract specific features from the point cloud such as lines or surfaces Most memory-efficient approach, but accuracy subject to nature of the point cloud Grid-based approaches Discretize space into small grids and represent the point cloud as a spatial data structure Discretization-delta is a heuristic choice, and efficacy depends on the chosen delta

USC Viterbi School of Engineering Department of Computer Science 24 Segmentation algorithms Segmentation: Clustering points into multiple homogenous groups Broadly divided into: Edge-based methods: Good when objects have strong artificial edge features (e.g. road curbs) Region-based methods: Based on region-growing, i.e. pick seed points, and then grow regions based on criteria such as Euclidean distance between points, surface normals etc. Model-based methods: Fit points into pre-defined categories such as planes, spheres, cones etc. Attribute-based methods: first compute attributes for each point, and then cluster based on attributes Graph-based methods: Cast point cloud into graph-based structures Deep-learning based methods

USC Viterbi School of Engineering Department of Computer Science 25 Some popular segmentation algorithms RANSAC (random sample and consensus) Hough Transform Conditional Random Fields, Markov Random Fields (also used for sensor fusion between LIDAR and vision) USC Viterbi School of Engineering Department of Computer Science 26

RANSAC Algorithm for robust fitting of a model in the presence of outliers Given a fitting problem with parameters , estimate optimal values for What is a model? Line, bounding box, etc., i.e. any parametric shape Assumptions: Parameters can be estimated from points There is a total of points USC Viterbi School of Engineering Department of Computer Science 27 RANSAC continued Select points at random 2. Estimate values for the shape fitted to the above points (say the values is ,

and the resultant shape is now 3. Find how many of the points are within some tolerance of . Say this is 4. If is large enough, accept model and exit with success 5. Repeat 1 to 4 some times 6. Fail if you get here Hard part: how to choose 1. USC Viterbi School of Engineering Department of Computer Science 28 Choosing parameters for RANSAC

Pick based on how many points required to find a good fit for the shape Pick based on intuitively how many points would lie in a shape If there are multiple models or structures with an image, remove the points associated with a shape once RANSAC terminates with success, and then redo RANSAC Probability that a selected point is an inlier: Probability that an iteration of RANSAC fails without finding a good fit: Pick = USC Viterbi School of Engineering Department of Computer Science 29 Hough Transform A tool to detect lines, circles and more general shapes One of the tools used for lane marking detection from (pre-processed)

images Operates on sets of points and helps obtain a geometric representation of shapes that points may form We will see how Hough transform works in 2D USC Viterbi School of Engineering Department of Computer Science 30 HT basics (1 , 1 ) (2 , 2 )

(3 , 3) 1 2 3 Lets look at a simple transformation: map space to the space are known! Plot lines in the space: each point maps to a line in the space If lines corresponding to different points intersect: this represents a collection of collinear points with the slope and intercept defined by the intersection USC Viterbi School of Engineering Department of Computer Science

31 HT in polar coordinates Problem with the space is that for vertical lines. So all lines in the space corresponding to points on a vertical line would only intersect at infinity. Resolved by instead considering the space Line in -space: Here, = length of normal to line, angle made by normal with -axis Now point in space maps to a sinusoid in -space To find lines, we let the sinusoids vote: i.e. identify the

points in a suitable grid that accumulate weight USC Viterbi School of Engineering Department of Computer Science 32 HT discussion Pros: Conceptually simple and easy to implement Robust to noise Handles missing and occluded data gracefully Can be adapted to various shapes beyond lines Cons: Computationally complex if there are many shapes to look for Can be fooled by apparent lines Collinear line segments may be hard to separate USC Viterbi School of Engineering Department of Computer Science

33 Detection algorithms for video data Detection of segmented clusters from LIDAR data is done using traditional machine learning algorithms based on SVMs, Gaussian Mixture Models etc. More interesting problem is detection from images Lane line marking detection Drivable path detection On-road object detection USC Viterbi School of Engineering Department of Computer Science 36 Image pre-processing Typically first step before applying detection algorithms

Remove obstacles (e.g. other vehicles) Weaken shadows Normalize images by controlling camera exposure Limiting region of interest USC Viterbi School of Engineering Department of Computer Science 37 Lane marking detection Used as feedback to vehicle control systems (Lane Departure Warning, Lane-Keep Assist, and Lane-Tracking Control) Several decades of work, but still not fully solved because of uncertainties in traffic conditions, and road-specific issues such as shadows, worn-out markings, directional arrows, warning text, pedestrian zebra crossings etc. Four/Five common steps:

Lane line feature extraction Fitting pixels into various models (lines, parabolas, hyperbolas) Estimating vehicle pose based on fitted model (optional fourth step: use of temporal continuity) Image to world coordinates transformation USC Viterbi School of Engineering Department of Computer Science 38 Model fitting and Pose estimation Simple case: lane lines are straight lines (straight highway segments) More complicated case: curvy roads, lane markings may have to be fit with splines, contours etc. Pose estimation is typically done using Kalman filter or particle filter

(typically more reliable) Requires inverse perspective transformation to go from lane coordinates to world coordinates Lane-level localization involves estimating vehicle lateral position and moving orientation (again can use vehicle odometry + Kalman filter) USC Viterbi School of Engineering Department of Computer Science 39 Lane line extraction Based on lane markings having large contrast with road pavement Edge detectors followed by Hough Transform to get the complete lane Basic idea in edge detection: Edges are discontinuities of intensity in images, correspond to local maxima of the image gradient Nave image gradients can be affected by noise in the image, so solution is to take smooth derivatives i.e. first smooth an image by convolving it with a Gaussian

filter, and then take the derivative Edges correspond to zero-crossings of the second derivative of the Gaussian or LOG (Laplacian of the Gaussian) Approach used in Canny edge detector (OpenCV) and Matlab USC Viterbi School of Engineering Department of Computer Science 40 Drivable-path detection Detecting road boundaries where vehicle can drive freely and legally without collisions Several image-processing based algorithms using feature detection or feature learning : generally deemed to be not robust enough for erratic driving conditions Deep learning provides the most popular set of techniques based on CNNs Other algorithms include exploiting GPS data and OpenStreetMap data USC Viterbi

School of Engineering Department of Computer Science 41 On-road object detection Detect other vehicles, pedestrians, bicycles etc. Again, deep learning based methods seem to be clear winners General pipeline for deep learning approaches: Set of proposal bounding boxes generated in the input image Each proposal box is passed through a CNN to obtain a label and fine tune the bounding boxes USC Viterbi School of Engineering Department of Computer Science 42

Basics of Neural networks 1 1 1 1 + 1 1 Compact representation of the layer of a neural network USC Viterbi School of Engineering Department of Computer Science A feedforward neural network with hidden layers is defined as follows: Input to the network

Output of the network Output of the layer And, Where, is known as the activation function, the weight matrix and, the bias term of layer 43 Basics of Training a NN Compute how off you are from the given example is the cost function for a given set of weights , biases and a training example and Add a regularization term (to avoid overfitting). This is a term proportional to the sum of squares of all elements in the weight matrics Use gradient descent to minimize cost update each variable (weights/biases) :

Here, is a learning rate, and the partial derivative is computed by backpropagation (chain rule applied to the function represented by the NN) USC Viterbi School of Engineering Department of Computer Science 44 Convolutional Neural Networks Inspired by visual cortex in animals Learns image filters that were previously hand-engineered Basic intuitions for CNNs: Images are too large to be monolithically processed by a feedforward neural network (1000x1000 image = 106 inputs, which means the weight matrix for the second layer is proportional to at least 106!) Data in an image is spatially correlated CNN divided into several layers with different purposes

USC Viterbi School of Engineering Department of Computer Science 45 Convolutional layerReceptive field First layer is a convolutional layer Convolutional layer contains neurons associated with subregions of the original image Each sub-region is called receptive field Convolves the weights of the convolutional layer with each cell in receptive field to obtain activation map or feature map

USC Viterbi School of Engineering Department of Computer Science Image from [1] 46 Convolution Convolution of a 2-D image I with a 2-D kernel K defined as: Most neural network libraries do not use convolution, but instead implement cross-correlation, i.e. The kernel function usually defines the receptive field

USC Viterbi School of Engineering Department of Computer Science 47 Purpose of convolutional layer Convolutional layer applies a filter to the pixels within its receptive field This allows identifying low-level features (curves, straight lines, etc.) The outputs of the first convolutional layer can be thought of as having a high value when a particular feature is detected and a low value otherwise Second convolutional layer allows learning higher-level features (e.g. semicircles, angles etc.) Second convolutional layer has a bigger receptive field (as it is able to simultaneously correlate over outputs of first layer) By convolving over the feature map, output of second layer tries to connect higher level features USC Viterbi

School of Engineering Department of Computer Science 48 More insights about convolution Convolution operation basically helps implement three ideas: Sparse interactions (between layers) Parameter sharing Equivariant representations Spare interaction: By using a kernel function that is smaller than input, not all outputs of the first layer interact with all inputs This reduces the cost of doing matrix multiplication USC Viterbi School of Engineering Department of Computer Science

49 More insights about convolution Parameter sharing As the kernel function is repeatedly applied to the image, (weight) parameters are shared This reduces storage requirements of the model Equivariant representations Parameter sharing leads to equivariance under translation is equivariant to if I.e. detected features of a linearly translated image will appear linearly translated USC Viterbi School of Engineering Department of Computer Science 50

CNN architecture Next layer Convolutional Layer Pooling stage Detector stage Convolution stage (Affine transform) Picture from [2] CNN architecture Small number of layers, where each layer contains stages Affine transform/convolution stage performs convolution Detector stage uses a nonlinearity such as a rectified linear unit (ReLU), i.e.

Pooling stage performs a suitable pooling operation Input to layer USC Viterbi School of Engineering Department of Computer Science 51 Pooling stage Pooling function replaces the output of a layer at a certain location with a summary statistic of the nearby outputs E.g. max pooling reports maximum output within a rectangular neighborhood Other pooling functions include average, L2 norm, weighted average etc.

Pooling helps representation approximately invariant to small translations By pooling over outputs of different convolutions, features can learn which transformations to become invariant to (e.g. rotation etc.) USC Viterbi School of Engineering Department of Computer Science 52 Fully connected layers CNNs may have some fully connected layers before the final output These layers allows performing higher-level reasoning over different features learned by the previous convolutional layers Various kind of convolution functions, pooling functions and detection functions are possible, giving rise to many different flavors Number of convolutional layers can be varied depending on complexity of features to be learned

USC Viterbi School of Engineering Department of Computer Science 53 R-CNN[3] R-CNN, Fast R-CNN and Faster R-CNN are specific architectures that help with object detection Objective is to obtain from an image: A list of bounding boxes A label assigned to each bounding box A probability for each label and bounding box The key idea in R-CNNs is to use region proposals and region of interest pooling We will briefly discuss the architecture of Faster R-CNN

USC Viterbi School of Engineering Department of Computer Science 54 R-CNN 3 main steps: Scan the input image for possible objects (using an algorithm called Selective Search) generating region proposals (bounding boxes where possible objects may lie) Run a CNN on top of these region proposals Take output of each CNN and feed it into a support vector machine to classify the region USC Viterbi

School of Engineering Department of Computer Science 55 Fast R-CNN Fast R-CNN performed feature extraction before proposing regions Replaced SVM with a softmax layer Fast R-CNN was much faster than RCNN (hence the name!) But the bottleneck was still region proposal using selective search Faster R-CNN changed that USC Viterbi School of Engineering Department of Computer Science

56 Faster R-CNN Replaces slow selective search algorithm with a fast neural net Introduces a region proposal network Uses an intermediate output of the CNN to generate multiple possible regions based on fixed aspect-ratio anchor boxes, and a score for each region reflecting possibility of containing an object Matlab 17b uses Faster R-CNN for object detection: you could use this or RCNN in Homework 3 USC Viterbi School of Engineering Department of Computer Science 57

YOLO algorithm (You Only Look Once) YOLO is one of the fastest real-time detection algorithms (See [4,5,6]) R-CNN etc. are algorithms that leverage classifiers and localizers to perform detection YOLO applies a single neural network to the entire image Network divides image into regions and predicts bounding boxes and probabilities for each region Bounding boxes are weighted by predicted probabilities USC Viterbi School of Engineering Department of Computer Science 58 References [1] Understanding CNNs:

https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/ [2] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press. [3] https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review73930816d8d9 [4] https://towardsdatascience.com/yolo-you-only-look-once-real-time-object-detection-explained-492dc9230006 [5] https://pjreddie.com/darknet/yolo/ [6] YOLO algorithm: https://arxiv.org/abs/1506.02640 USC Viterbi School of Engineering Department of Computer Science 59 Bibliography 1. Pendleton, Scott Drew, Hans Andersen, Xinxin Du, Xiaotong Shen, Malika Meghjani, You Hong Eng, Daniela Rus, and Marcelo H. Ang. "Perception, planning, control, and coordination for autonomous vehicles." Machines 5, no. 1 (2017): 6. 2.

Good introduction to Hough transform and various vision algorithms: http://aishack.in/tutorials/hough-transform-normal/ 3. Hough transform basics: http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/HoughTrans_review.pdf 4. Graph-based clustering: http://vision.stanford.edu/teaching/cs231b_spring1213/slides/segmentation.pdf 5. MRF/CRF fundamentals https://www.cs.umd.edu/~djacobs/CMSC828seg/MRFCRF.pdf 6. Edge detection: https://www.swarthmore.edu/NatSci/mzucker1/e27_s2016/filter-slides.pdf 7. SLAM:

https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/Durrant-Whyte_Bailey_SLAM-tutorial-I.p df USC Viterbi School of Engineering Department of Computer Science 64