The SURF algorithm has high computational complexity and requires substantial logic and memory resources. Moreover, descriptor extraction is difficult to parallelize and cannot meet real-time requirements. To address these drawbacks, an optimized SURF algorithm is proposed together with its FPGA implementation. By using a circular feature region and the radial gradient transform, the optimized algorithm is rotation invariant and fully parallel, and it eliminates both main-direction calculation and feature-region rotation. The optimized algorithm is then implemented on an FPGA using a multi-memory, multi-channel parallel pipelined architecture. Experimental comparison shows that the matching performance of the optimized SURF algorithm is comparable to that of the original: compared with the original SURF descriptor, the number of matching points decreases by 5% to 20%, while matching accuracy improves by 5% to 10%. The FPGA implementation of the proposed algorithm meets real-time requirements with a 13.5 MHz clock: for a video stream with a resolution of 720×576, the processing speed reaches 25 fps.
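
As an illustration of why the radial gradient transform removes the need for main-direction calculation, the following minimal sketch projects a gradient onto the radial and tangential unit vectors defined by a sample point and the center of the circular region. It assumes the commonly used formulation of the transform; the exact variant, the fixed-point arithmetic, and the hardware mapping used in this work are not specified here, and all function names are hypothetical.

```python
import math

def radial_gradient_transform(point, center, gradient):
    """Project a gradient onto the local radial/tangential basis.

    Assumed formulation: r = (p - c) / |p - c|, t = r rotated by +90 degrees,
    and the transformed gradient is (g . r, g . t).
    """
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("the radial basis is undefined at the region center")
    rx, ry = dx / norm, dy / norm   # radial unit vector
    tx, ty = -ry, rx                # tangential unit vector
    gx, gy = gradient
    return (gx * rx + gy * ry, gx * tx + gy * ty)

def rotate(vec, angle):
    """Rotate a 2-D vector counterclockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

def rotate_about(point, center, angle):
    """Rotate a point about `center` by `angle` radians."""
    off = rotate((point[0] - center[0], point[1] - center[1]), angle)
    return (center[0] + off[0], center[1] + off[1])

# Illustrative values only; they are not taken from the experiments.
center = (10.0, 10.0)
point = (13.0, 14.0)
gradient = (0.5, -2.0)
angle = math.radians(37.0)

# Transformed gradient in the original patch.
before = radial_gradient_transform(point, center, gradient)

# Rotating the patch about its center moves the sample point and rotates
# the gradient by the same angle; the transformed gradient is unchanged.
after = radial_gradient_transform(
    rotate_about(point, center, angle), center, rotate(gradient, angle)
)
```

Because `before` and `after` agree up to floating-point error, the transformed gradients can be accumulated over a circular region without first estimating a main direction or rotating the region, and each sample can be processed independently of the others.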