Frame Generator
Background
Video Frame Generation poses specific challenges for logic designers because:
- A 1080p60 setup requires approximately 3 Gb/s of data movement. Each frame is around 6 MB of data, consisting of 1920x1080 pixels at 3 or more bytes per pixel. Sixty frames of 6 MB each is around 360 MB per second, or roughly 3 Gb/s.
- Frame operations often involve complex color calculations, such as those involved in converting from RGB to YCbCr (a fixed-point sketch of this conversion follows the list below).
- Frame processing mechanisms often must meet latency requirements (downstream tools do not want their frames delayed or out of sync with other media).
Additionally, implementation details make computing frame values more difficult. For instance, a simple RGB24 test frame may be generated by rotating red, green and blue byte values (e.g. Byte0=0xFF, Byte1=0x00, Byte2=0x00 for an all-red screen) for all pixels; a minimal sketch of this appears after the list below. However, complexity is introduced when the following are considered:
- Planar formats: Formats such as planar YCbCr require all the Y components of the pixels to be output first, followed by all the Cb and Cr components. Conversion from RGB to YCbCr therefore requires at least 2/3 of a frame's worth of memory, since the Cb and Cr samples must be buffered while the Y plane is streamed out.
- Color depth options commonly include 8-bit, 10-bit and 12-bit, so depths that are not byte-aligned require bit packing.
- Chroma subsampling (e.g. 4:2:2 or 4:2:0) means the Cb and Cr planes carry only a (usually even) fraction of the luma samples.
- Operations to manipulate frames (adding overlays, etc.) often need to work on anywhere from tens of thousands to millions of pixels. This is one of the reasons GPUs have become so popular: they provide thousands of small, simple cores that can work independently across the frame.
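To make the simple case concrete, here is a minimal SystemVerilog sketch of the rotating-byte RGB24 test pattern described above. It is illustrative only: the module and signal names are my own invention rather than anything from the implementation, and it assumes a bare 1-byte output with no blanking, sync, or handshake logic.

module rgb24_test_pattern (
    input  logic       clk,
    input  logic       rst_n,
    output logic [7:0] data_o,
    output logic       valid_o
);
    // A solid-red frame is just the byte sequence FF 00 00 repeated for every
    // pixel, so a phase counter that wraps at 3 selects the current component.
    logic [1:0] phase_q;   // 0 = R byte, 1 = G byte, 2 = B byte

    localparam logic [7:0] RED_VAL   = 8'hFF;
    localparam logic [7:0] GREEN_VAL = 8'h00;
    localparam logic [7:0] BLUE_VAL  = 8'h00;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            phase_q <= 2'd0;
            valid_o <= 1'b0;
        end else begin
            phase_q <= (phase_q == 2'd2) ? 2'd0 : phase_q + 2'd1;
            valid_o <= 1'b1;
        end
    end

    // Drive the byte for the current color component.
    always_comb begin
        case (phase_q)
            2'd0:    data_o = RED_VAL;
            2'd1:    data_o = GREEN_VAL;
            default: data_o = BLUE_VAL;
        endcase
    end
endmodule

The "complex color calculations" point can be sketched the same way. The module below uses one common integer approximation of the BT.601 (studio-range) RGB-to-YCbCr matrix; it is purely combinational and unpipelined, the names are again hypothetical, and it only converts a single pixel. The planar buffering issue described above is a separate problem it does not address.

module rgb_to_ycbcr_601 (
    input  logic [7:0] r_i, g_i, b_i,
    output logic [7:0] y_o, cb_o, cr_o
);
    // Zero-extend the inputs so all of the arithmetic below is signed;
    // 18-bit intermediates comfortably hold every weighted sum.
    logic signed [17:0] r_s, g_s, b_s;
    logic signed [17:0] y_acc, cb_acc, cr_acc;

    always_comb begin
        r_s = $signed({10'b0, r_i});
        g_s = $signed({10'b0, g_i});
        b_s = $signed({10'b0, b_i});

        // Integer approximation of BT.601: coefficients are the matrix
        // entries scaled by 256, with +128 for rounding before the shift.
        y_acc  = 18'sd66  * r_s + 18'sd129 * g_s + 18'sd25  * b_s + 18'sd128;
        cb_acc = -18'sd38 * r_s - 18'sd74  * g_s + 18'sd112 * b_s + 18'sd128;
        cr_acc = 18'sd112 * r_s - 18'sd94  * g_s - 18'sd18  * b_s + 18'sd128;

        // Shift back down and add the studio-range offsets (16 and 128).
        y_o  = 8'((y_acc  >>> 8) + 18'sd16);
        cb_o = 8'((cb_acc >>> 8) + 18'sd128);
        cr_o = 8'((cr_acc >>> 8) + 18'sd128);
    end
endmodule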
Relationship to work experience
At Resi I worked on several projects that involved populating frames and filling in frame data on proposal systems. I mostly used NVIDIA CUDA, but looked at OpenGL as well.
Implementation
My GitHub repository contains the source needed to build and run the project.
Next Steps:
Add Wider Data Output
I started this process with a 1-byte output. This was simple and allowed me to compute each byte "on the fly" and build the design incrementally, as recommended by Agile. Since I want to be able to support 32- or 64-bit-wide outputs, the best approach is to cache the values I output in an output register.
For instance:
reg [31:0][7:0] CurrentDataPixelGroup_q;  // 32 bytes (256 bits) currently being driven out
reg [31:0][7:0] NextDataPixelGroup_q;     // 32 bytes staged for the following group
would hold the data for the next pixels I plan to output. If my data output bus were 8 bytes wide, I would have 4 cycles (32 bytes / 8 bytes per cycle) in which to compute NextDataPixelGroup_q.
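A minimal sketch of how this double buffering could drain onto a 64-bit bus follows. It is not my actual implementation; the module name, the ports, and the handshake with whatever logic fills next_group_i are assumptions, and the byte-within-bus ordering is a convention left open here.

module pixel_group_output (
    input  logic             clk,
    input  logic             rst_n,
    input  logic [31:0][7:0] next_group_i,   // staged by the pattern-generation logic
    output logic [63:0]      data_o,
    output logic             group_taken_o   // pulses when next_group_i is consumed
);
    logic [31:0][7:0] CurrentDataPixelGroup_q;
    logic [1:0]       slice_sel_q;            // which 8-byte slice is on the bus

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            CurrentDataPixelGroup_q <= '0;
            slice_sel_q             <= 2'd0;
        end else begin
            slice_sel_q <= slice_sel_q + 2'd1;            // wraps 0..3
            if (slice_sel_q == 2'd3)
                CurrentDataPixelGroup_q <= next_group_i;  // swap every 4 cycles
        end
    end

    assign group_taken_o = (slice_sel_q == 2'd3);

    // Indexed part-select pulls 8 consecutive bytes (64 bits) out of the group.
    assign data_o = CurrentDataPixelGroup_q[slice_sel_q * 8 +: 8];
endmodule

group_taken_o is one way to tell the upstream logic that it now has four cycles to stage the next 32 bytes; a real design might use a different handshake.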
The issue here is that 32 is not divisible by 3, so CurrentDataPixelGroup_q[0] might hold a red byte for one group and a green byte for the next. The registers could be widened:
reg [95:0][7:0] CurrentDataPixelGroup_q;  // 96 bytes = 32 whole RGB24 pixels
reg [95:0][7:0] NextDataPixelGroup_q;     // staged copy for the following group
96 is divisible by both 3 and 8. But suppose the system required 10-bit pixel entries. 96 bytes is 96*8 = 768 bits, which is not divisible by 10. In that case, assuming a packed format, one could do:
reg [31:0][9:0] CurrentDataPixelGroup_q;  // 32 ten-bit entries = 320 bits
which would give me 320 bits: 40 bytes' worth of data representing 32 ten-bit pixel entries, or exactly five beats of an 8-byte output bus.
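A parameterized version of the output stage sketched earlier could cover both depths. The sketch below is again hypothetical rather than my implementation: it computes the group width from parameters, checks at simulation start that the group divides evenly into bus beats, and slices the packed group one 64-bit beat at a time (five beats for 32 x 10 bits, twelve beats for 96 x 8 bits).

module packed_pixel_group_output #(
    parameter int PIXELS_PER_GROUP = 32,
    parameter int BITS_PER_PIXEL   = 10,
    parameter int BUS_BITS         = 64
) (
    input  logic                                            clk,
    input  logic                                            rst_n,
    input  logic [PIXELS_PER_GROUP-1:0][BITS_PER_PIXEL-1:0] next_group_i,
    output logic [BUS_BITS-1:0]                             data_o,
    output logic                                            group_taken_o
);
    localparam int GROUP_BITS = PIXELS_PER_GROUP * BITS_PER_PIXEL;
    localparam int BEATS      = GROUP_BITS / BUS_BITS;

    // Sanity check at simulation start: the packed group must divide evenly
    // into bus beats, otherwise pixel entries would straddle group boundaries.
    initial begin
        if (GROUP_BITS % BUS_BITS != 0)
            $error("%0d-bit group does not pack into %0d-bit beats",
                   GROUP_BITS, BUS_BITS);
    end

    logic [GROUP_BITS-1:0]    group_flat_q;   // current group viewed as flat bits
    logic [$clog2(BEATS)-1:0] beat_q;         // 0 .. BEATS-1

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            group_flat_q <= '0;
            beat_q       <= '0;
        end else if (beat_q == BEATS - 1) begin
            group_flat_q <= next_group_i;     // reload after the last beat
            beat_q       <= '0;
        end else begin
            beat_q <= beat_q + 1'b1;
        end
    end

    assign group_taken_o = (beat_q == BEATS - 1);

    // Slice one bus beat out of the flat group per cycle.
    assign data_o = group_flat_q[beat_q * BUS_BITS +: BUS_BITS];
endmodule

Treating the group as a flat bit vector sidesteps the question of how individual 10-bit entries line up with byte boundaries; the packing order itself is a separate convention that would still need to be pinned down.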