Article ghostwritten for General
Microsystems for Electronic Engineering Times
Asynchronous real-time multiprocessing
on CompactPCI
CompactPCI has impressive
speed capabilities, offering 64-bit data at 66 MHz and speeds of up to
528 Mbytes per second. But quantum increases
in the size and power of real-time
machine control, industrial automation, data acquisition, military systems,
and telecom applications have created demand for even more bandwidth and
more speed. Customers were
asking us for a CompactPCI card with a high-performance CPU for real-time,
I/O-intensive applications, high throughput for hard disks, "huge" memory,
and powerful graphics. The existing cards just didn't have the horsepower
needed to meet these demands.
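The 528 Mbytes per second figure quoted above is a theoretical peak, and it follows directly from the bus width and clock rate. A minimal sketch of the arithmetic, assuming one full-width transfer per clock cycle (the function name is ours, for illustration only):

```python
def peak_bandwidth_mbytes(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak in Mbytes/sec: one full-width transfer per clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


# 64-bit data path at 66 MHz, the CompactPCI figures cited in the article
print(peak_bandwidth_mbytes(64, 66))
```

Sustained throughput in a real system will be lower, since arbitration, address phases, and wait states all consume bus cycles.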
To meet these requirements on a 6U (about 9 x 6 inches) CompactPCI board was
challenging, to say the least. It would take an array of components that normally go on a
full-size ATX motherboard, and we needed to squeeze it all onto an extremely small form
factor. Compounding the challenge was the fact that we wanted to use two CPUs on the
board. In the targeted environments one CPU was needed to run the system and a secondary
processor was needed to handle real-time tasks--something no one had done before on a
CompactPCI board. We wanted true real-time asymmetric multiprocessing -- two or more
processes running independently of one another on the same memory subsystem.
The model today in Symmetric Multiprocessing (SMP) is that two CPUs on the same board
give you increased processing power of thirty to forty percent (operating-system
dependent) over the speed of a single CPU. With asymmetric multiprocessing the increase in
speed of instruction execution is approximately 98 percent -- just about the equivalent of
two CPU boards.
There were other performance gains realized using this approach. For example, in
CompactPCI telecom applications, the big problem is trying to execute a deterministic
function in a high-performance application with low latency. The ATM controller is very
I/O intensive -- it takes a lot of a processor's bandwidth to perform its functions. So in
a single-board computer application one processor allocates 90-100 percent of its time
just dealing with the data transfers to and from the ATM controller. But once the data
comes in, you no longer have the determinism or computational ability to work with the
data stream on a real-time basis. In an average ATM application the controller is carrying
a lot of calls. The processor must bring this data from the controller, process it, and
put it back onto some other media, whether disk or Ethernet or other I/O channel.
Historically, this could not be done on a single-processor card because of the bandwidth required.
Having a second CPU allows one processor to process data from the ATM controller while the
other processes calls and all other system functions.
A primary objective
was to make the second CPU virtually independent of the primary processor,
carrying the "loosely coupled" concept a step further to what we like
to call "barely coupled." Our idea was to provide the ability to
hot-swap the second CPU's kernels. That is, we wanted the second CPU to be
independent enough that we
could boot the operating system on the main processor, run a microkernel
or application on the second processor, and then swap it out for a different
kernel or application without
having to reboot the system. We knew this would be of supreme value to always-on
applications such as you find in telecom.
We began by choosing the Pentium III for our CPUs, not only because of its 550 MHz
speed, but because it provides an integrated Advanced Programmable Interrupt Controller
(APIC). This controller enables multiple Pentiums to communicate with each other over a
high-speed serial bus, providing minimal latency in parallel with program execution.
The Pentium is also available with a companion chip, the 82093AA IOAPIC, that provides
multiprocessor interrupt management with a dynamic interrupt distribution scheme. In
systems with multiple I/O subsystems, each subsystem can have its own set of interrupts.
Together, the APIC and IOAPIC overcome the communications bottlenecks and lack of
determinism that often plague both SMP and distributed multiprocessing systems.
However, on a CompactPCI board processor height is a big issue since Slot 1 Pentiums
normally are installed vertically. We solved that by using our own right-angle connector,
the Slot Saver, which lays the Pentiums down flat. This solution also allowed us to design
a heat sink that meets the needs of the Pentium processors.
A bigger problem we faced was that of putting more than one CPU board in a CompactPCI
system. Since the primary system controller sets up the memory and I/O allocations
throughout the system, a standard PCI-to-PCI bridge could not be used. If you
add another CPU with its own local resources -- such as Ethernet, SCSI, or a PCI-to-ISA
bridge -- and you put a PCI-to-PCI bridge on the board, the system controller will come
across that bridge and reconfigure memory and I/O of the local device. The local CPU will
also attempt to configure these same devices, causing many system conflicts.
In order
for the local CPU to keep control of its own resources, you have to use
a "blocking" bridge so that neither
CPU can launch configuration cycles on the downstream side. We used the
DEC 21554 Draw Bridge chip, which is a PCI-to-PCI bridge that
will not forward Type 1 configuration cycles. This allows for more than
one CPU board in a system, each one configuring its own local devices
without interfering with any of the
other CPU boards or intelligent peripherals in the system. As far as we
know, this is another first on a CompactPCI board.
We answered
the call for "huge memory" with a gigabyte
of PC-100-compliant synchronous DRAM and a megabyte of L2 cache. Both processors
run out of shared memory.
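The difference between a standard PCI-to-PCI bridge and the blocking bridge described earlier can be modeled in a few lines. This is a simplified sketch, not the 21554's register-level behavior: the class and field names are hypothetical, and the only rule taken from the article is that a blocking bridge does not forward Type 1 configuration cycles.

```python
from dataclasses import dataclass

TYPE0, TYPE1 = 0, 1  # Type 1 cycles target a bus on the far side of a bridge


@dataclass
class ConfigCycle:
    cycle_type: int
    bus: int
    device: int


@dataclass
class Bridge:
    secondary_bus: int       # first bus number behind the bridge
    subordinate_bus: int     # last bus number behind the bridge
    blocking: bool = False   # True models the blocking behavior

    def forwards(self, cycle: ConfigCycle) -> bool:
        """Return True if the cycle would reach devices behind the bridge."""
        if cycle.cycle_type != TYPE1:
            return False  # Type 0 cycles stay on the originating bus
        if self.blocking:
            return False  # blocking bridge: Type 1 cycles never cross
        return self.secondary_bus <= cycle.bus <= self.subordinate_bus


# The system controller probes device 4 on bus 1, behind the bridge.
probe = ConfigCycle(TYPE1, bus=1, device=4)
print(Bridge(1, 1).forwards(probe))                 # standard bridge
print(Bridge(1, 1, blocking=True).forwards(probe))  # blocking bridge
```

With the standard bridge the probe reaches the local device, which is the conflict the article describes; with the blocking bridge it does not, so the local CPU remains the only one configuring that device.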
To achieve high I/O throughput on the hard disks, we chose a 40-Mbytes/sec UltraSCSI
interface and added two autodetecting, autoswitching 100BaseT Ethernet devices for
redundancy or routing applications. We also designed in an Ultra-DMA 33 IDE interface, a
pair of USB ports, dual serial I/O with optional RS422 drivers, and a parallel port.
To meet the demand for powerful graphics, we used the Intel 740 Accelerated Graphics
Port (AGP) chip that gave us over 7 million triangles per second. AGP provides a direct
connection between the display adapter and memory and doesn't use up PCI bus bandwidth.
This would be another element in providing fast I/O.
We chose Intel's Front Side Bus because of its 100-MHz capability, giving us the speed
we needed for our CPUs. Formerly the limit for this type of architecture was 66 MHz. The
processors, cache and memory are linked by this bus, which not only supports current
500-MHz processors, but will support next-generation processors as well.
The finished board, designated the C2P3, runs a variety of desktop and real-time OSes,
including Windows NT, Solaris x86, QNX, and VxWorks, with multiprocessing CompactPCI
drivers that treat the CompactPCI bus as a network and communicate via TCP/IP over the
backplane. This is in conformance with a new CompactPCI standard, currently in draft, that
will use TCP/IP-type packet sending as a method of communication between multiple CPUs
within a system.
When economy is an issue, the board can accommodate two Celeron PPG370 processors that
currently run at 466 MHz, with a future clock speed of 667 MHz on the drawing board. The
Celeron version also includes 256K of on-die cache with one-to-one clocking that
dramatically improves the performance of the cache. This configuration provides Pentium
III performance at a fraction of the cost.
In tandem with designing the C2P3, we needed to write the code to make it jump through
hoops. As mentioned above, one of our primary challenges was to get the secondary
processor configured and up and running on a system where the primary processor is already
running. In order to do this we created RAMP (Real-time Asynchronous Multiprocessor), which
consists of a board-support program that runs on the primary processor under the OS; a
microkernel that runs on the secondary processor; libraries used to compile user-created
kernels or applications; and an API. We configured the microkernel to handle eight real-time tasks
scheduled in its task list. These can be executed independently from the tasks on the
primary processor.
But this wasn't simple. When a CPU comes up under VxWorks it has control over the
system memory, all the hardware ports, and interrupts. To bring up another kernel in that
environment is difficult because you can't touch any of the memory or hardware resources.
It has to be set up so resources are requested from the host processor on an as-needed
basis. The way we solved it was to bring up our kernel running a do-nothing loop, then
when we give it a task it requests needed memory or port allocations from the host
processor.
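The idle-then-request pattern described above can be illustrated with a small model. It is a sketch under stated assumptions, not RAMP's actual code: `HostAllocator`, `SecondaryKernel`, and their methods are hypothetical names, and memory is reduced to a byte count.

```python
class HostAllocator:
    """Stands in for the host OS, which owns all memory and ports."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes
        self.grants = {}  # owner name -> bytes granted so far

    def request(self, owner, nbytes):
        """Grant nbytes to owner if available; return True on success."""
        if nbytes > self.free_bytes:
            return False
        self.free_bytes -= nbytes
        self.grants[owner] = self.grants.get(owner, 0) + nbytes
        return True


class SecondaryKernel:
    """Touches no resources until it has a task and the host grants them."""

    def __init__(self, host):
        self.host = host
        self.pending = []    # tasks handed over by the host
        self.completed = []  # (task name, result) pairs

    def submit(self, name, nbytes, func):
        self.pending.append((name, nbytes, func))

    def step(self):
        """One pass of the loop: idle, or request resources and run a task."""
        if not self.pending:
            return "idle"  # the do-nothing case: nothing is allocated
        name, nbytes, func = self.pending.pop(0)
        if not self.host.request(name, nbytes):
            return "denied"
        self.completed.append((name, func()))
        return "ran"
```

A short usage example: a kernel created with `SecondaryKernel(HostAllocator(1024))` returns `"idle"` from `step()` until a task is submitted, and only then does the host's free pool shrink.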
Finally, we optimized the kernel for high-speed context switching and interrupt
response. We utilized an event-driven, priority-based, multitasking scheduler to make it
capable of supporting up to eight tasks.
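As a rough illustration of an event-driven, priority-based scheduler with an eight-task limit, consider the following sketch. It is not the RAMP microkernel: the names are hypothetical, the model is non-preemptive, and the convention that a lower number means higher priority is our assumption.

```python
import heapq

MAX_TASKS = 8  # the task limit stated in the article


class Scheduler:
    def __init__(self):
        self.tasks = {}  # name -> (priority, handler)
        self.ready = []  # heap of (priority, sequence, name)
        self._seq = 0    # tie-breaker: FIFO among equal priorities

    def add_task(self, name, priority, handler):
        if len(self.tasks) >= MAX_TASKS:
            raise RuntimeError("task list is full")
        self.tasks[name] = (priority, handler)

    def post_event(self, name):
        """An event (for example, an interrupt) makes the named task ready."""
        priority, _ = self.tasks[name]
        heapq.heappush(self.ready, (priority, self._seq, name))
        self._seq += 1

    def run_next(self):
        """Dispatch the highest-priority ready task; lower number wins."""
        if not self.ready:
            return None
        _, _, name = heapq.heappop(self.ready)
        _, handler = self.tasks[name]
        handler()
        return name
```

Tasks run only when an event has made them ready, and the heap ensures the most urgent ready task is always dispatched first, regardless of the order in which events arrived.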
We wrote the API to provide calls for uploading, downloading, starting and stopping
tasks, allocating resources (i.e., memory and hardware resources), establishing protected
memory regions, and facilitating communications between tasks. The programmer can use API
calls from within the main OS to distribute tasks and resources to slave processors and to
coordinate slave processor activity. Programmers can also add their own API calls to
address application-specific requirements. Designers can guarantee worst-case latencies
and take full advantage of available hardware resources by assigning particular tasks to
particular processors.
RAMP is loaded onto the primary processor along with the OS and will support however
many processors are on the board. In the case of the C2P3, that's one secondary processor,
but theoretically it will support as many as 32 processors. Each processor can run code in its
own cache concurrently with the other processors, eliminating a large number of bus accesses.
This system is deterministic, and it's completely configurable by the user as to how much
load each processor will have to handle.
Under RAMP both processors operate out of the same memory and utilize common
interprocessor and interrupt communications protocols and pathways. In
addition, RAMP's
single-processor programming model enables complex multiprocessing programs
to be developed just as they would for a single processor. The master processor
runs a
full-featured RTOS, such as VxWorks, that is primarily responsible for
resource allocations and housekeeping. The slave processors are free to
run independent real-time
tasks. RAMP is royalty free, so designers need only purchase a full-featured RTOS for the
master processor. An efficient way to handle board-to-board communications is with
packages like VxMP, with RAMP used to handle on-board Real-time Asynchronous
Multiprocessing.
The RAMP board support package, including API and microkernel, is provided
free of charge with General Micro Systems single-board computers.