This is a reprint of my September 1997 letter to IEEE Computer magazine’s Open Channel. It’s still relevant today, for instance when designing add-in cards or chips that connect over PCI Express.
Preface
Back in the 1990s, for those designing add-in cards for PCs and Apple computers, various busses such as PCI (the precursor to PCI Express) presented an issue: reads of the card’s memory by the processor were abysmally slow, about 5 megabytes per second, while writes were not a problem at all, going at nearly the full bandwidth of the bus.
On today’s computers with a PCIe (PCI Express) point-to-point connection, the same problem seems to still exist. Even when reading 64 bits at a time or using SIMD/SSE instructions, reading is still abysmally slow while writing is fast.
It seems that the 25-year-old suggestion in the following article still applies today. Enjoy!
Article Re-Print
I’ve designed interfaces to several industry standard busses. One thing that becomes immediately clear is that writes are easy and reads are hard. I believe this fact is fundamental to other fields as well: Writing is a deterministic process and you know with a high degree of certainty what it takes. However, reading (that is, sensing) requires more time to make sure. This may, for all I know, be some fundamental law of information theory.
Be that as it may, when designing bus interfaces, writes are easy to handle at full bandwidth using a queue, such as a FIFO. It is more difficult, however, to build an interface circuit that handles reads at full bandwidth. Yes, you can sacrifice the latency of the first access and then prefetch more data items than you may need. But let’s ask a more fundamental question: Are reads really necessary in a protocol of communication?
Splitting the read
If writes are so easy and perform so well, then instead of reading why don’t we simply make a request to be written back to? In other words, split a read into two parts: a request and a reply. This begins to look like a split-transaction bus protocol.
Split-transaction busses have been used in several supercomputers. The first, I believe, was an early Sequent Balance that had 2 to 30 processors sitting on a shared bus, each processor having a 5 percent cache miss rate. With 30 processors, that adds up to 150 percent bus utilization, which the designers needed to ease. They realized that reads wasted time on the bus not transferring data. Thus, they split the request and the reply into separate bus transactions, a very clever idea. A split-transaction protocol prevents the access time during a read from being seen on the communication channel. Apple attempted to bring this concept back with the PSI+ bus, but then caved in to standardization pressure and chose PCI. Intel is bringing it back with AGP, a single device port, not a bus.
I believe it is time to bring back the split-transaction protocol, but with an added twist: a write-only bus. To get a piece of information, an initiator (to speak PCI-ese) would write a request into the “ear” of the target. That is, it politely asks for one or more pieces of information. Then the initiator gets off the bus. The target satisfies the request by writing the requested information into the initiator’s “ear”. To send a piece of information, the initiator writes the request, followed by the information itself.
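The exchange described above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Endpoint` class, the `REQUEST`/`REPLY` message names, and the addresses are all hypothetical and do not correspond to any real bus signalling. The point it shows is that the only bus operation is a write, and a read becomes a request write followed by a reply write.

```python
from collections import deque


class Endpoint:
    """A bus participant that can only be written to (its "ear") and can only write to others."""

    def __init__(self, name, memory=None):
        self.name = name
        self.memory = dict(memory or {})  # address -> value held by this endpoint
        self.ear = deque()                # incoming writes land here
        self.received = {}                # data that was written back to us

    def write(self, target, message):
        """The only bus operation: post a message into the target's ear."""
        target.ear.append(message)

    def request_read(self, target, address, count):
        """A 'read' is a write of a request; the initiator then gets off the bus."""
        self.write(target, ("REQUEST", self, address, count))

    def service(self):
        """Drain the ear: answer requests with a write, store incoming replies."""
        while self.ear:
            kind, *payload = self.ear.popleft()
            if kind == "REQUEST":
                requester, address, count = payload
                data = [self.memory.get(address + i, 0) for i in range(count)]
                self.write(requester, ("REPLY", address, data))
            elif kind == "REPLY":
                address, data = payload
                for i, value in enumerate(data):
                    self.received[address + i] = value


cpu = Endpoint("cpu")
card = Endpoint("card", memory={0x100: 11, 0x101: 22, 0x102: 33})

cpu.request_read(card, 0x100, 3)  # write 1: the request goes into the card's ear
card.service()                    # write 2: the card writes the data back
cpu.service()                     # the CPU picks the reply out of its own ear
print(cpu.received)
```

Neither endpoint ever waits on the bus for the other side's access time; each one only posts writes and drains its own queue.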
This model of communication is very natural: it is very hard to read another person’s mind or body language. Instead, you could simply ask them for the piece of information you require and in turn they will (or will not) reply in due time. From this simple model and the above discussion, we can extract an interesting relationship: writes are equivalent to speaking, and reads are equivalent to speaking followed by listening. Of course, the reply is spoken by someone (hopefully the one that was asked).
This protocol is beginning to look like a DMA channel that the protocol programs automatically. The request programs the DMA channel to return the desired data. Then when the DMA actually gets the data, the DMA writes it back to the requestor. Thus, a read is two writes.
Still Not Enough
Today’s CPUs and motherboard chipsets don’t burst during reads (for example, reads of PCI memory space by the CPU). Even if we add a split-transaction protocol, it would not be a significant enough improvement over PCI’s retry mechanism, but it would significantly reduce the need for retries.
What we need is for programmers to be able to get at this DMA capability as well. Then, instead of using some silly MOVSD loop to read a bunch of data from a PCI device into system memory (a bunch of single-word reads), the programmer would ask the PCI device to write the data into system memory. Thus, a bunch of single-word reads would be replaced by a burst write, which transfers data at the channel’s peak capacity. The destination can also absorb it at that peak rate, since writes are easy.
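A rough cost model makes the difference concrete. The cycle counts below are invented for illustration and are not measurements of PCI or any other bus. The function names are hypothetical as well. The sketch only assumes that every single-word read pays a full round trip, while a burst write pays a one-time setup and then moves one word per cycle.

```python
def read_loop_cycles(words, read_latency):
    """Each single-word read occupies the bus for a full round trip."""
    return words * read_latency


def burst_write_cycles(words, request_cycles, setup_cycles):
    """One request write, one burst setup, then one word per cycle."""
    return request_cycles + setup_cycles + words


WORDS = 1024
READ_LATENCY = 12    # assumed cycles per single-word read
REQUEST_CYCLES = 2   # assumed cycles to write the request to the device
SETUP_CYCLES = 4     # assumed cycles before the burst starts streaming

loop = read_loop_cycles(WORDS, READ_LATENCY)
burst = burst_write_cycles(WORDS, REQUEST_CYCLES, SETUP_CYCLES)
print(f"read loop: {loop} cycles, burst write: {burst} cycles, ratio: {loop / burst:.1f}x")
```

Under these assumptions the fixed overhead of the request is paid once, so the advantage of the burst approaches the per-read latency as the transfer grows.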
We should just admit that technology has reached a level where all entities using a communication channel (such as a bus) can be economically made sophisticated enough to DMA. Programming of the DMA then simply becomes part of the communication protocol and can be made very efficient. By allowing the programmer access to the protocol as well, the entire system is better utilized. The communication protocol can then be simplified by realizing that if all entities are capable of writing, reading is unnecessary. And writes are easy!
The concept of a write-only bus came out of several fun discussions between Eric Shumard, Keith Klingler, and me at Truevision.
Post-Article Ideas
Augmenting communication channels such as PCIe with native support for a write-only protocol would provide a substantial performance boost for sequential reads. In this protocol, reads would be supported as they are now. Support for higher-performance reads (done as writes) would be added: the initiator would write the starting address and the number of bytes to be returned into the register space of the target. The target would then write that number of bytes into the initiator’s response receive FIFO.
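A minimal sketch of this register-based request follows. It is a software model, not a PCIe implementation. The register names `START` and `COUNT`, the choice that writing `COUNT` triggers the reply, and the `fast_read` helper are all assumptions made for the example.

```python
from collections import deque


class Initiator:
    def __init__(self):
        self.response_fifo = deque()  # the target writes reply bytes here

    def fast_read(self, target, start, count):
        """Issue a read as two register writes, then drain the reply from the FIFO."""
        target.write_register("START", start, self)
        target.write_register("COUNT", count, self)  # assumed to trigger the reply
        return bytes(self.response_fifo.popleft() for _ in range(count))


class Target:
    def __init__(self, memory):
        self.memory = bytes(memory)
        self.registers = {"START": 0, "COUNT": 0}  # hypothetical register names

    def write_register(self, name, value, initiator):
        self.registers[name] = value
        if name == "COUNT":
            self._reply(initiator)

    def _reply(self, initiator):
        """Write the requested bytes into the initiator's response receive FIFO."""
        start, count = self.registers["START"], self.registers["COUNT"]
        if start < 0 or count < 0 or start + count > len(self.memory):
            raise ValueError("request falls outside the target's memory")
        initiator.response_fifo.extend(self.memory[start:start + count])


card = Target(range(16))          # device memory holding bytes 0..15
cpu = Initiator()
data = cpu.fast_read(card, 4, 4)  # ask for 4 bytes starting at address 4
print(data)
```

In real hardware the reply would arrive some time after the request, so the initiator would poll or be notified rather than read the FIFO immediately as this synchronous model does.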
This mechanism would be part of the PCIe protocol and would be accessible to the processor, so that software can initiate these types of read transactions. It needs to be thread-safe, to allow multiple cores and multiple threads of execution to share use of such high performance transactions.
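One possible way to share such a mechanism between threads is to tag every request and route each reply back by its tag. The sketch below assumes this tagging scheme; it is not something the article specifies. The `WriteBackChannel` class and its methods are hypothetical, and a background thread stands in for the device.

```python
import itertools
import queue
import threading


class WriteBackChannel:
    """Toy channel that lets several threads issue write-back reads concurrently."""

    def __init__(self, device_memory):
        self._memory = bytes(device_memory)
        self._requests = queue.Queue()   # stands in for the target's request registers
        self._tags = itertools.count()
        self._lock = threading.Lock()    # protects tag allocation and the pending table
        self._pending = {}               # tag -> one-slot queue awaiting the reply
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        """Device side: take tagged requests and write each reply to its requester."""
        while True:
            tag, start, count = self._requests.get()
            reply = self._memory[start:start + count]
            with self._lock:
                slot = self._pending.pop(tag)
            slot.put(reply)

    def read(self, start, count):
        """Initiator side: write a tagged request, then wait for the matching reply."""
        slot = queue.Queue(maxsize=1)
        with self._lock:
            tag = next(self._tags)
            self._pending[tag] = slot
        self._requests.put((tag, start, count))
        return slot.get(timeout=5)


channel = WriteBackChannel(range(64))
results = {}


def worker(i):
    results[i] = channel.read(i * 8, 8)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Because each reply carries its tag back to a private slot, threads never see each other's data even when their requests interleave on the shared channel.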