Frame Coprocessor
Processing video frames is a common problem for FPGAs and ASICs. Presenting a raw 1080p60 image on a TV screen requires moving roughly 6 MB of data (1920 * 1080 pixels * 3 bytes per pixel) every 60th of a second.
Common video problems include:
- Encoding (reducing picture data to a manageable data rate for transport)
- Decoding (taking encoded data and reproducing the picture)
- Upscaling/Downscaling
- Changing Framerate
- Changing Color Space
- Filtering
- Image Classification
Relevant work experience
At RESI I evaluated numerous GPU-based and some FPGA-based solutions for encoding and decoding.
At Texas Instruments I worked on numerous chip projects, including DaVinci and our cellular chips.
Common approaches and tradeoffs
General Purpose Software Processing
Using a general-purpose datapath is the most flexible way to handle image processing, since you can iterate on software far more easily than on hardware. The main disadvantage is that, especially for large canvases, it is difficult to achieve real-time speeds. That might be fine for
archival purposes, but for live feeds real-time processing is crucial. Generally, canvases at or below 720p can probably be processed in real time, but
higher-resolution images may be a challenge.
GPU Processing with Specific Hardware
GPU processing with specific hardware (NVENC/NVDEC or other encoder/decoder blocks) would be the fastest option for encoding and decoding. The obvious tradeoff is that the hardware can't be iterated in a two-week sprint: innovation may take several multi-year design cycles and involve large developer groups and the huge expense of fabricating an ASIC.
GPU Processing with GPU Cores
Processing frames with GPU cores would be a useful middle ground between specific hardware and general-purpose software: it can still be iterated in software, but it offers many high-frequency dedicated cores that trade some of the classic CPU operations for a simpler datapath. For simpler problems like downscaling, colorspace, or framerate changes this is probably very workable, simply by assigning each GPU core a specific section of the frame.
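As a rough illustration of that partitioning, here is a minimal Python/NumPy sketch that splits a frame into tiles and processes each one independently. A plain loop stands in for the per-core dispatch a real GPU would do, and the tile size and the brightness tweak are arbitrary choices for illustration.

```python
import numpy as np

def split_into_tiles(frame, tile_h, tile_w):
    """Yield (row, col, tile) views of an H x W frame.

    Each tile would map to one GPU core / workgroup; here a plain Python
    loop stands in for that parallel dispatch.
    """
    h, w = frame.shape[:2]
    for r in range(0, h, tile_h):
        for c in range(0, w, tile_w):
            yield r, c, frame[r:r + tile_h, c:c + tile_w]

# Hypothetical 1080p luma plane, processed as independent 240x240 tiles.
frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
for r, c, tile in split_into_tiles(frame, 240, 240):
    # Example per-tile work: a simple brightness bump, done tile-by-tile.
    tile[:] = np.clip(tile.astype(np.int16) + 10, 0, 255)
```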
There are not many GPU-based encoding engines out there. My presumption is that encoding might be practical on a GPU if bitrate were not a concern, since each GPU core could be assigned a separate section of the picture to encode. However, encoding algorithms are usually evaluated on quality against bitrate, which means the GPU would require expensive synchronization operations after the initial encode to determine which sections of the image were most worth spending bits on to keep the bitrate down.
FPGA Processing
An FPGA implementation of a GPU core would likely be wasted effort (except as an ASIC qualification exercise). An FPGA implementation at any given lithography (say, 10 nm) is likely going to be much slower than an ASIC at that same lithography, because the ASIC can be custom-routed. However, an FPGA implementation of an encoder circuit does allow you to iterate on encoding quality. Xilinx and other FPGA IP providers offer encoder IP cores.
Data Formats
Picture data is normally provided in one of two formats:
- RGB (usually a packed set of bytes, with Red, Green, and Blue intensities in sequence).
- YUV (usually a planar format). A YUV 4:2:2 1080p 8-bit frame would be 1920*1080 bytes of Y intensities, followed by (1920*1080)/2 bytes of U and (1920*1080)/2 bytes of V intensities.
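As a sketch of how those offsets fall out in practice, assuming a planar layout with the Y plane followed by U and then V (exact plane order and padding vary by format), the planes of a raw 4:2:2 dump could be sliced out like this:

```python
import numpy as np

W, H = 1920, 1080            # 1080p, 8 bits per sample
y_size = W * H               # full-resolution luma plane
c_size = (W * H) // 2        # each 4:2:2 chroma plane is half that size

buf = np.fromfile("frame_422p.yuv", dtype=np.uint8)   # hypothetical raw dump
assert buf.size == y_size + 2 * c_size                 # 4,147,200 bytes total

y = buf[:y_size].reshape(H, W)                          # shape (1080, 1920)
u = buf[y_size:y_size + c_size].reshape(H, W // 2)      # shape (1080, 960)
v = buf[y_size + c_size:].reshape(H, W // 2)            # shape (1080, 960)
```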
The advantages of the RGB format are:
- It matches the display, which ultimately needs to set a red, green, and blue intensity for each pixel.
- If you are doing a simple transform, you can likely work on sections of the RGB image at a time and not need to store the whole frame.
The advantages of the YUV format are:
- Color subsampling is easier. If you want to reduce color information to save data capacity, you simply throw away some of the U and V intensities.
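To make that subsampling concrete, here is a small sketch (with hypothetical plane sizes for 1080p) that turns 4:2:2 chroma into 4:2:0 simply by dropping every other chroma row; averaging adjacent rows would be the slightly nicer variant.

```python
import numpy as np

# Hypothetical 4:2:2 chroma planes for a 1080p frame: full height, half width.
u_422 = np.random.randint(0, 256, (1080, 960), dtype=np.uint8)
v_422 = np.random.randint(0, 256, (1080, 960), dtype=np.uint8)

# Dropping every other chroma row gives 4:2:0, halving chroma storage again;
# the luma plane is untouched, so perceived sharpness changes very little.
u_420 = u_422[::2, :]        # shape (540, 960)
v_420 = v_422[::2, :]
```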
So, looking at each of the problems above as applied to FPGAs:
Encoding
Encoding would involve receiving an entire frame into pre-set sections of memory and running a DCT or other encoding algorithm on specific sections of the frame to
produce a compact representation of the frame data. The tradeoff is that evaluating more candidate encodings requires more hardware. A good FPGA-based solution might become the basis for an ASIC in future development.
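As a toy illustration of the per-block transform step, the sketch below applies an 8x8 DCT and coarse uniform quantization with SciPy. Prediction, entropy coding, and rate control are all ignored, and the quantization step size is an arbitrary choice.

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(block, qstep=16):
    """2-D DCT of an 8x8 pixel block followed by uniform quantization."""
    coeffs = dctn(block.astype(np.float32) - 128.0, norm="ortho")
    return np.round(coeffs / qstep).astype(np.int16)

def decode_block(qcoeffs, qstep=16):
    """Inverse of encode_block; reconstruction error comes from quantization."""
    rec = idctn(qcoeffs.astype(np.float32) * qstep, norm="ortho") + 128.0
    return np.clip(np.round(rec), 0, 255).astype(np.uint8)

block = np.random.randint(0, 256, (8, 8), dtype=np.uint8)   # one 8x8 luma block
qcoeffs = encode_block(block)        # mostly small integers, cheap to entropy-code
recon = decode_block(qcoeffs)        # lossy reconstruction of the block
```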
Decoding
Decoding is a largely deterministic process which could certainly be done with an FPGA. However, since the problem is simpler, by the time an FPGA image was developed there would likely already be a faster ASIC-based solution available. The biggest reason to do this is if you are evaluating a new encoding algorithm.
Upscaling/Downscaling
Upscaling/Downscaling (as long as it is by an integer factor) is fairly trivial and usually involves either replicating pixels (upscaling) or averaging/maxing pixels (downscaling). A full frame buffer is probably not necessary, but you may need a few lines of the image stored in memory.
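A minimal NumPy sketch of both integer-factor operations (pixel replication up, block averaging down); the frame here is a random stand-in.

```python
import numpy as np

def upscale(img, n):
    """Integer upscale by pixel replication (nearest neighbour)."""
    return np.repeat(np.repeat(img, n, axis=0), n, axis=1)

def downscale(img, n):
    """Integer downscale by averaging each n x n block."""
    h, w = img.shape
    blocks = img[:h - h % n, :w - w % n].reshape(h // n, n, w // n, n)
    return blocks.mean(axis=(1, 3)).astype(img.dtype)

luma = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
half = downscale(luma, 2)       # 540 x 960
double = upscale(luma, 2)       # 2160 x 3840
```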
Changing Framerate
Changing framerate can either be a simple process with poor results or a more complicated process with better results.
For example, if you wanted to change a picture from 23.98 fps to 60 fps there are a few options:
- Use the last fully loaded frame. If a 23.98 fps image is arriving and you need to produce a 60 fps image, you simply allocate two frame buffers' worth of memory: one holding the frame you most recently received in full, and one holding the frame you are in the process of receiving. When it is time to send out a new frame, you simply pick the frame you most recently completed. The disadvantage is that you will likely end up with a choppy or irregular picture.
- You could try to interpolate frame-to-frame changes. For example, if frame X had a pixel with a blue intensity of 250 and frame X+1 had that same pixel at 240, and you needed a frame at time X+0.5, you could average or otherwise linearly interpolate the intermediate pixel values (see the sketch after this list). Obviously this would require a lot more hardware, as well as enough buffered frame data for however many frames you want to interpolate across. It would also add latency, since you likely need to load both input frames X and X+1 before you can output the interpolated frame.
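Here is a minimal sketch of that interpolation as a per-pixel linear blend (no motion estimation, so fast-moving content will ghost rather than move smoothly); the frame contents are stand-ins.

```python
import numpy as np

def blend_frames(frame_a, frame_b, t):
    """Linear interpolation between two frames at fractional time t in [0, 1]."""
    a = frame_a.astype(np.float32)
    b = frame_b.astype(np.float32)
    return np.clip((1.0 - t) * a + t * b, 0, 255).astype(np.uint8)

# Example from the text: frame X has a pixel at 250, frame X+1 at 240;
# the frame at time X+0.5 gets the average, 245.
x  = np.full((1080, 1920), 250, dtype=np.uint8)
x1 = np.full((1080, 1920), 240, dtype=np.uint8)
mid = blend_frames(x, x1, 0.5)   # every pixel is 245
```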
Changing Color Space
The biggest challenge in changing color spaces is that you need either the entire frame loaded (for YUV to RGB conversion) or a place to store it (for RGB to YUV). There is also some (integer * floating point) math involved. However, an (8-bit integer) * (small floating-point coefficient) multiply plus a few adds is probably not a major area or speed challenge.
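As one concrete instance of that math, the sketch below converts YUV to RGB using BT.601 full-range coefficients (one common choice; other standards such as BT.709 use different constants), assuming the chroma planes have already been upsampled to full resolution. In hardware these would typically be fixed-point multiplies rather than floats.

```python
import numpy as np

def yuv_to_rgb(y, u, v):
    """Per-pixel YUV -> RGB with BT.601 full-range coefficients.

    y, u, v are uint8 arrays of the same shape (chroma already upsampled).
    """
    y = y.astype(np.float32)
    u = u.astype(np.float32) - 128.0
    v = v.astype(np.float32) - 128.0
    r = y + 1.402 * v
    g = y - 0.344 * u - 0.714 * v
    b = y + 1.772 * u
    return np.clip(np.dstack([r, g, b]), 0, 255).astype(np.uint8)
```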
Filtering
Simple pixel-wide filters, such as color filters, could easily be done on the fly. More complicated filters, such as edge-detection filters, would require more complicated multi-pixel operations. A major challenge in FPGAs is scarce routing resources, so redundant hardware would likely be dedicated to specific sections of the frame in order to reduce routing and increase performance.
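As a sketch of the edge-detection case, a 3x3 Sobel filter only ever needs a three-line window of the frame, which is what makes a small line buffer per frame section attractive; the kernels below are the standard Sobel pair.

```python
import numpy as np
from scipy.ndimage import convolve

# 3x3 Sobel kernels; each output pixel needs only a 3-line window of the
# frame, which maps well to a small line buffer in an FPGA.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T

def edge_magnitude(luma):
    """Gradient magnitude of a luma plane, clipped back to 8 bits."""
    gx = convolve(luma.astype(np.float32), SOBEL_X)
    gy = convolve(luma.astype(np.float32), SOBEL_Y)
    return np.clip(np.hypot(gx, gy), 0, 255).astype(np.uint8)
```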
Image Classification
Imagine you had a 720p60 video of a highway and you wanted to identify license plates so you could read them and see if any cars on the highway are stolen.
My approach would be:
- Reduce frame count. I likely don't need to process *EVERY* frame of the video, so perhaps one or two frames a second is all that is needed.
- Reduce granularity. 720p is a fairly high resolution; 360p is likely fine.
- Remove colors. License plates are usually made with contrasting colors, so if I had a YUV image I would use only the Y component.
- Develop a grid. One of the common problems in image detection is that an image of, say, a dog might be a picture of 15 dogs far away or one dog very close up. If I know my camera will be mounted, say, 30 feet off the ground, I can approximate how big a license plate will appear.
- Develop a CNN with an edge filter to look for license-plate characters. I would run the image through an edge-detect filter and then bit reductions to resolve it into maybe 16-32 squares. A 360p image is 640x360; if that is broken into 16 grid squares of 160x90, each downscaled to 40x24, then a 960-input neural network with a few hidden layers is probably sufficient (assuming it was pre-trained). A 2-deep, 960-neuron network is a lot of hardware, but its processing could probably be pipelined across the 16 grid squares (a sketch follows this list). If a license plate is detected, a separate network could be used to identify the specific plate characters (or this could be done in software).
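A rough Python sketch of the grid-plus-network idea, using the sizes from the text (a 640x360 Y plane, 16 tiles of 160x90, each resized to 40x24 = 960 inputs). The weights are random placeholders standing in for a pre-trained network, and the edge-filtering step described above is omitted to keep the sketch short.

```python
import numpy as np

TILE_H, TILE_W = 90, 160     # 640x360 Y plane split into a 4x4 grid of tiles
IN_H, IN_W = 24, 40          # each tile resized to 40x24 = 960 network inputs

def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize (handles the non-integer 90 -> 24 step)."""
    rows = np.linspace(0, img.shape[0] - 1, out_h).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, out_w).round().astype(int)
    return img[np.ix_(rows, cols)]

def mlp_forward(x, w1, b1, w2, b2):
    """Tiny 2-layer network: 960 inputs -> 64 hidden -> one 'plate here?' score."""
    h = np.maximum(0.0, x @ w1 + b1)               # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))    # sigmoid output

# Random placeholder weights standing in for a pre-trained network.
rng = np.random.default_rng(0)
w1 = rng.normal(0, 0.05, size=(960, 64)).astype(np.float32)
b1 = np.zeros(64, dtype=np.float32)
w2 = rng.normal(0, 0.05, size=64).astype(np.float32)
b2 = np.float32(0.0)

frame = np.random.randint(0, 256, (360, 640), dtype=np.uint8)   # Y plane only
scores = {}
for r in range(0, 360, TILE_H):
    for c in range(0, 640, TILE_W):
        tile = resize_nn(frame[r:r + TILE_H, c:c + TILE_W], IN_H, IN_W)
        x = tile.astype(np.float32).reshape(-1) / 255.0          # 960 inputs
        scores[(r, c)] = float(mlp_forward(x, w1, b1, w2, b2))   # per-tile score
```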
Key design considerations:
- Frame information storage: An Alveo U200 is a large FPGA and has about 2.3 million registers. Clearly that is far too little for a whole frame. There is plenty of DDR memory and transaction capacity, however, and DDR memory is usually very wide, allowing multiple bytes of a line to be accessed at once.
- Processing time: A 1080p60 signal is about 6 MB per 16.7 ms, roughly 375 MB/s or 3 Gbps. A 100 MHz datapath would need to process about 4 bytes per cycle, which is quite reasonable.
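For reference, the arithmetic behind those numbers, assuming 3 bytes per pixel of raw RGB:

```python
# Back-of-the-envelope bandwidth for raw 1080p60 RGB (3 bytes per pixel).
width, height, bytes_per_pixel, fps = 1920, 1080, 3, 60

frame_bytes = width * height * bytes_per_pixel    # ~6.2 MB per frame
throughput  = frame_bytes * fps                   # ~373 MB/s
bits_per_s  = throughput * 8                      # ~2.99 Gb/s

clock_hz = 100e6                                  # 100 MHz datapath
bytes_per_cycle = throughput / clock_hz           # ~3.7 bytes per cycle

print(frame_bytes, throughput, bits_per_s, bytes_per_cycle)
```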