A robust robotic manipulation strategy combining Computer Vision and Force Feedback using the Universal Robots UR7e.
In the field of industrial automation, the "Peg-in-Hole" problem is a classic benchmark for manipulation capability. While humans perform this task using intuition and high-fidelity tactile feedback, robots struggle with the tight tolerances involved. Our project focuses specifically on inserting a USB-C connector—a task that requires sub-millimeter precision (±0.5mm) and delicate force control to avoid damaging the small, reversible port.
We developed a hybrid approach combining Eye-in-Hand visual servoing for coarse alignment with a force-feedback spiral search algorithm for fine insertion. The system uses a monocular camera mounted on the robot's end-effector to locate the USB-C port, then transitions to a "blind" tactile search phase that compensates for calibration errors and ensures safe, successful insertion.
By solving this precision manipulation challenge with low-cost sensors (a webcam and the UR7e's built-in force sensor), we demonstrate that sophisticated insertion tasks don't require expensive force/torque sensors or sub-millimeter positioning systems. This approach can be adapted for automated testing, electronics assembly, and any application requiring reliable connector mating in industrial environments.
The primary objective was to develop a "blind-capable" insertion system: a robot that locates the USB-C port visually, then completes the insertion by touch alone, even under positioning and calibration uncertainty.
Standard industrial robots rely on perfect calibration and fixed fixtures. If a part moves by 2mm, a standard robot will crash. Our project solves the Unstructured Environment problem. By combining vision (for "getting close") and force (for "getting it right"), we created a system that works even if the charging port is moved or placed under different lighting conditions.
Our approach to creating a precision USB-C insertion system
To achieve a successful insertion, our system had to meet the following criteria:
Our final design operates in four stages: visual alignment, a Cartesian dive toward the port, a blind hand-eye shift, and a force-guided spiral search.
We originally assumed the transform between the camera and end-effector was fixed, which would let us align the camera over the port and move straight down to insert. However, we eventually realized that even the slightest error in camera mounting causes significant misalignment at the tool tip, so we added camera calibration to reduce error at each step.
To achieve higher accuracy, we revised our scope: rather than picking up a cable, we used a rigid, fixed tool. This removed the complexity of cable detection and pick-up, allowing us to focus on the insertion logic and execution. Ultimately, we pivoted substantially from our initial proposal (after discussion with our assigned TA) to prioritize robust core functionality for the USB-C twist on the traditional peg-insertion problem. We also switched from joint-space planning to Cartesian path planning to ensure the camera remained perpendicular to the workspace.
Our system was designed not just for a lab demo, but to model how robust assembly lines function under uncertainty. We focused on bridging the gap between "computer vision" and "mechanical reality."
Design Choice: We handed off the "last millimeter" of insertion to a tactile force-guided spiral search rather than relying on perfect vision.
Real-World Value: In factory environments, lighting is inconsistent and metallic parts (like USB-C ports) cause specular reflections that blind depth sensors. By decoupling success from perfect visibility, our approach mirrors cost-effective industrial automation where "compliance" (tactile sensing) is prioritized over expensive, infinite-resolution cameras.
Design Choice: We enforced strictly linear Cartesian paths rather than the curved, energy-efficient arcs of joint-space interpolation.
Real-World Value: In collaborative workspaces (Cobots), predictability is safety. Operators can intuitively anticipate a linear path, whereas joint-space arcs can swing unpredictably into human zones. Linearity also ensures sensors remain perfectly aligned with the workspace, eliminating the "perspective drift" common in dynamic environments.
Design Choice: We utilized a custom 18-point hand-eye calibration routine to solve for physical offsets, rather than relying on fixed CAD assumptions.
Real-World Value: This approach drastically lowers the "barrier to entry" for hardware. It demonstrates that high-precision tasks do not require expensive, perfect manufacturing. By using software to mathematically correct for physical imperfections (like loose mounts or 3D printing tolerances), manufacturers can achieve sub-millimeter accuracy using cheaper, off-the-shelf components.
End Effector with USB-C Attachment
Baseplate with Bigger Holes & Tolerance USB-C
To Spec USB-C Baseplate
We opted for a Fixed-Tool, Eye-in-Hand configuration.
The core perception logic relies on a robust OpenCV pipeline encapsulated within the
CalibrationNode. We utilize cv2.findChessboardCorners to identify the
grid pattern in the raw video feed, followed immediately by cv2.cornerSubPix to
refine these detections to sub-pixel accuracy. This refinement step is crucial for high-precision
insertion tasks, as it minimizes quantization error before any geometric calculations occur.
Once the 2D pixel coordinates are locked, cv2.solvePnP matches them against the
known 3D geometry of the chessboard (obj_points).
This algorithm solves the Perspective-n-Point problem, outputting the rotation and translation
vectors that define exactly where the chessboard sits relative to the camera's optical center.
To make this spatial data actionable for the robot, we perform a rigorous coordinate transformation
sequence. The raw output from the vision pipeline is in the camera's optical frame, which is distinct
from the robot's coordinate system. The code converts the rotation vectors into quaternions using
scipy.spatial.transform and leverages the ROS 2 tf2 buffer to transform
the pose from the camera_link frame to the base_link frame.
This ensures that the detection is not just relative to the sensor, but rooted in the robot's fixed
base frame, allowing for consistent motion planning regardless of how the camera is mounted or moved.
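The rotation-vector-to-quaternion step can be sketched with scipy (the frame transform itself is left to tf2 in the actual node; this helper name is ours, for illustration):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rvec_to_quaternion(rvec):
    """Convert an OpenCV Rodrigues rotation vector to an (x, y, z, w) quaternion."""
    return Rotation.from_rotvec(np.asarray(rvec, dtype=float).reshape(3)).as_quat()

# Example: a 180-degree rotation about the camera's optical (z) axis
quat = rvec_to_quaternion([0.0, 0.0, np.pi])
# quat ≈ [0, 0, 1, 0]
```

The resulting quaternion is what gets packed into a `geometry_msgs/PoseStamped` before the tf2 buffer re-expresses it in `base_link`.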
Finally, the node functions as a managed state machine rather than a continuous stream, ensuring
system stability. It utilizes a ROS service (/calibrate_height) to trigger detection
only when the robot is static, preventing motion blur from corrupting the measurement. The integration
of the PinholeCameraModel validates the intrinsics (focal length and principal point),
allowing us to project rays from 2D pixels (u, v) into 3D space.
Visual debugging tools embedded in the code—drawing the principal point and board center—allow
operators to visually verify that the physical camera matches the mathematical model before
calibration data is published.
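The pixel-to-ray projection that PinholeCameraModel provides follows the standard pinhole model. A minimal sketch, with placeholder intrinsics (fx, fy, cx, cy here are illustrative values, not our calibrated ones):

```python
import numpy as np

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project a pixel (u, v) to a unit ray in the camera's optical frame."""
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return ray / np.linalg.norm(ray)

# A pixel at the principal point projects straight along the optical axis
ray = pixel_to_ray(320.0, 240.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
# ray == [0, 0, 1]
```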
def process_frame(image):
    # 1. Detection: find the chessboard grid in the 2D image
    found, corners_2d = cv2.findChessboardCorners(image, board_size)
    if found:
        # 2. Refinement: improve accuracy to sub-pixel level
        refined_corners = cv2.cornerSubPix(image, corners_2d)
        # 3. PnP solver: match 2D pixels to known 3D geometry
        #    Returns position/rotation of the board relative to the camera
        pose_in_camera = cv2.solvePnP(known_3d_points, refined_corners)
        # 4. Transformation: convert from camera_link to base_link
        #    Crucial for motion planning
        pose_in_base = tf_buffer.transform(pose_in_camera,
                                           source_frame="camera_link",
                                           target_frame="base_link")
        publish(pose_in_base)
This node acts as the system's "Brain," using a Finite State Machine to command the UR7e via MoveIt. It orchestrates the transition from visual alignment to tactile insertion.
When alignment is complete, the node calls the /spiral_search service, handing off control to force feedback.

We prioritize Cartesian Path Planning over joint interpolation. By enforcing strictly linear motion, we prevent the arc-induced "perspective drift" common in joint moves. This ensures the camera remains perpendicular to the target throughout the descent, validating our coordinate transforms.
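As a rough sketch of what strictly linear Cartesian motion means here, the descent can be discretized into evenly spaced straight-line waypoints with the orientation held fixed. This is a simplified stand-in for MoveIt's Cartesian path computation; the function name and step size are illustrative:

```python
import numpy as np

def linear_waypoints(start, goal, step=0.005):
    """Evenly spaced waypoints on the straight segment start -> goal (meters).

    Orientation is held constant, so a downward-facing camera stays
    perpendicular to the workspace for the whole move.
    """
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    dist = np.linalg.norm(goal - start)
    n = max(int(np.ceil(dist / step)), 1)
    return [start + (goal - start) * i / n for i in range(n + 1)]

wps = linear_waypoints([0.4, 0.0, 0.30], [0.4, 0.0, 0.18], step=0.005)
# 25 waypoints spaced 5 mm apart along -Z
```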
The SHIFT state handles the critical hand-eye coordination. The code calculates the
real-time inverse transform between camera_link and tool0, determining
exactly how to position the gripper where the camera lens just was. This maneuver relies entirely
on the accuracy of our calibration node, as the target becomes occluded by the robot itself.
state = "ALIGN"
while node_is_running:
    if state == "ALIGN":
        # Closed-loop visual servoing
        error_vector = get_usb_c_position()
        if magnitude(error_vector) < 0.001:  # within 1 mm
            state = "DIVE"
        else:
            # Move linearly to correct the error
            move_cartesian(dx=error_vector.x, dy=error_vector.y)
    elif state == "DIVE":
        # Move straight down 12 cm, maintaining the X/Y lock
        move_cartesian(dz=-0.12)
        state = "SHIFT"
    elif state == "SHIFT":
        # The "blind" hand-eye swap: calculate the physical offset
        # between the camera lens and the gripper tip
        offset = get_transform(source="camera_link", target="tool0")
        move_cartesian(offset)
        state = "SPIRAL"
    elif state == "SPIRAL":
        # Hand off control to the force feedback system
        trigger_service("/spiral_search")
        stop_cartesian_control()
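At its core, the SHIFT maneuver is a composition of homogeneous transforms: the tool tip must move to the pose the camera lens occupied during alignment. A minimal numpy sketch (identity rotations and made-up offsets, purely for illustration):

```python
import numpy as np

def make_tf(translation):
    """4x4 homogeneous transform with identity rotation (enough for this sketch)."""
    T = np.eye(4)
    T[:3, 3] = translation
    return T

# Pose of the camera in the base frame (from visual alignment over the port)
T_base_cam = make_tf([0.40, 0.10, 0.20])
# Fixed lens-to-tool-tip offset recovered by hand-eye calibration
T_cam_tool = make_tf([0.03, -0.05, 0.00])

# Where the tool tip must go so it occupies the spot the lens just observed
T_base_tool = T_base_cam @ T_cam_tool
target_xyz = T_base_tool[:3, 3]
# target_xyz == [0.43, 0.05, 0.20]
```

In the real node the rotations are not identity, which is exactly why the calibrated transform (rather than a CAD assumption) matters.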
While the Calibration Node handles static alignment, the CV node is responsible for
real-time target detection. Because the robot may approach the USB-C port from varying heights
(changing the apparent size of the port), standard template matching is insufficient.
To solve this, we implemented a Multi-Scale Template Matching pipeline. The algorithm
iteratively resizes the template across a defined range (0.1% to 1.0% of the image area) to ensure
scale invariance. We utilize Normalized Cross-Correlation (NCC)
(cv2.TM_CCOEFF_NORMED) as our matching metric, which makes the system robust to
global lighting changes—a critical feature for detecting metallic ports that reflect light unpredictably.
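To see why zero-mean NCC shrugs off global lighting changes, here is a minimal numpy sketch of the metric (independent of OpenCV; the `ncc` helper is our illustration, not part of the pipeline):

```python
import numpy as np

def ncc(patch, template):
    """Zero-mean normalized cross-correlation, as in cv2.TM_CCOEFF_NORMED."""
    p = patch - patch.mean()
    t = template - template.mean()
    return float((p * t).sum() / (np.linalg.norm(p) * np.linalg.norm(t)))

rng = np.random.default_rng(0)
template = rng.random((8, 8))
patch = template.copy()

# A global brightness shift (e.g. glare lifting the whole patch uniformly)
# leaves the score untouched because the means are subtracted out.
assert np.isclose(ncc(patch, template), 1.0)
assert np.isclose(ncc(patch + 50.0, template), 1.0)
```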
Figure C.1: Correlation Heatmap. Red areas indicate high probability matches (NCC > 0.8), while blue areas indicate low correlation.
Finally, to prevent false positives from cluttering the data, we apply Non-Maximum Suppression
(NMS).
This filters out overlapping bounding boxes, ensuring that only the single strongest detection candidate
is published to the /usbc_pose topic.
def find_usbc_port(image, template):
    all_matches = []
    # 1. Multi-scale search: iterate the template across the configured scale range
    for scale in linspace(min_scale, max_scale, num=20):
        # Resize the template to the current scale
        resized_template = cv2.resize(template, None, fx=scale, fy=scale)
        # 2. Normalized cross-correlation: robust to brightness differences
        heatmap = cv2.matchTemplate(image, resized_template, cv2.TM_CCOEFF_NORMED)
        # 3. Thresholding: keep only confident candidates
        for loc in where(heatmap > 0.8):
            all_matches.append((loc, heatmap[loc], scale))
    # 4. Non-maximum suppression (NMS): remove overlapping boxes,
    #    keep only the highest-scoring detection
    best_match = non_max_suppression(all_matches, overlap_thresh=0.3)
    return best_match.center_x, best_match.center_y
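The non-maximum suppression step in the pseudocode above can be sketched as greedy IoU suppression. This is a minimal numpy version for illustration, not the exact code from the node:

```python
import numpy as np

def non_max_suppression(boxes, scores, overlap_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much.

    boxes are [x1, y1, x2, y2]; returns indices of surviving boxes.
    """
    boxes = np.asarray(boxes, float)
    order = np.argsort(scores)[::-1]  # best score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with each remaining candidate
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Keep only candidates that do not overlap the winner too heavily
        order = order[1:][iou <= overlap_thresh]
    return keep

# Two near-duplicate detections and one distinct one -> two survivors
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
print(non_max_suppression(boxes, scores=[0.9, 0.8, 0.7]))  # [0, 2]
```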
Standard ROS topics (and our attempted use of a load cell) have too much latency (50-100ms) for closed-loop force control during insertion. We solved this by injecting URScript directly into the controller to toggle force mode on the UR7e.
# Snippet of our URScript injection
def _perform_spiral_search(self):
    ur_script = """
    force_mode(p[0,0,0,0,0,0], [0,0,1,0,0,0], [0,0,-5.0,0,0,0], 2, ...)
    while step_count < max_steps:
        # Move in an expanding spiral (0.5mm steps)
        # Check Z-height
        if drop > 0.003:  # 3mm drop detected
            textmsg("HOLE FOUND!")
            break
        end
    end
    """
    self.script_publisher.publish(ur_script)
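For intuition, the expanding spiral commanded by the URScript can be previewed offline as an Archimedean spiral. A small Python sketch (parameters are illustrative, not the exact values from our script):

```python
import math

def spiral_waypoints(step_mm=0.5, revolutions=6, points_per_rev=16):
    """XY offsets (in mm) of an Archimedean spiral growing step_mm per revolution.

    The search starts at the estimated hole center and sweeps outward, so every
    point inside the final radius is passed within roughly step_mm.
    """
    pts = []
    for i in range(revolutions * points_per_rev + 1):
        theta = 2 * math.pi * i / points_per_rev
        r = step_mm * theta / (2 * math.pi)  # radius grows linearly with angle
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts

pts = spiral_waypoints()
# starts at the origin and ends 3 mm out after 6 revolutions
```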
We consolidated all of our services, the MoveIt configuration, and camera initialization into launch files for easier management and reproducibility.
Alignment Precision: The visual servoing consistently brought the tool tip within ±1mm of the port center.
Insertion Success: In final trials, the system achieved an 80% success rate with <5 mm localization error pre-insertion. Failures were typically due to calibration drift causing the spiral search to start too far from the hole.
Detail: The "Wiggle" (Spiral Search) in action.
Debug: Computer Vision detecting the board.
We successfully engineered a robotic system capable of sub-millimeter insertion without expensive hardware. The project demonstrated that Hybrid Control (Vision + Force) is far more reliable than Vision alone: the visual system gets us "in the ballpark," and the tactile system "scores the run." Overall, we are content with our work. Within our adjusted scope (focusing only on USB-C insertion), we accomplished our intended outcomes, achieving an 80% success rate with <5mm localization error, while leaving extensive room to take the project further.
Challenges:
Flawed transform logic (live demo code).
Fixed and 100% accurate transform logic.
Future Work:
Role: Computer Vision & Control
Gabriel is a senior in EECS interested in deep learning and computer vision. He developed the OpenCV pipeline for chessboard calibration, port detection, and general movement.
Role: Hardware & Force Mode
Larry is a senior in MechE interested in optimization, optimal control theory, and multi-agent systems. He designed all of the project's hardware, including the 3D-printed end-effector, and pioneered the force-related components of PAIR.
Role: Search & Transforms
Erik is a senior in EECS interested in robotics, particularly in perception and sensing applications. He primarily handled the spiral search injection script as well as the coordinate transforms.
Our team worked collaboratively through the entirety of the project; although each member took the lead on different components, everyone was involved in every aspect of the project as a whole.
Below are links to our source code, launch files, and CAD files.