Here are some comments adapted from internal materials that bear on this thread:
Given the end of Moore's Law, the only cost-effective way to increase processing bandwidth in personal computers is parallelism. GPGPU provides cost-effective parallelization in the range of many hundreds to low thousands of cores.
GPGPU is miraculous for applications that are processing bound and can be parallelized. However, improved performance in large tasks is mainly about data access, not processing throughput. GPGPU helps processing throughput but not data access, so it is usually irrelevant in larger tasks unless data access is also improved.
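The point can be made concrete with Amdahl's law. The fractions and speedups below are illustrative assumptions, not measurements from any particular system:

```python
# Amdahl's law applied to a task where data access dominates: if most of the
# wall-clock time is data access, even an enormous GPU speedup of the
# processing portion yields almost no overall gain. All numbers are assumed.

def overall_speedup(compute_fraction, compute_speedup):
    """Amdahl's law: speedup of the whole task when only the compute
    fraction is accelerated."""
    return 1.0 / ((1.0 - compute_fraction) + compute_fraction / compute_speedup)

# A processing-bound task: 95% compute, 100x GPU speedup -> dramatic gain.
print(round(overall_speedup(0.95, 100), 1))   # 16.8

# A data-access-bound task: 10% compute, 100x GPU speedup -> almost no gain.
print(round(overall_speedup(0.10, 100), 2))   # 1.11
```

The second case is the typical large-data situation: until data access is improved, the GPU sits idle waiting for data.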
Because disks are so slow, improving data access is much more difficult than writing fast processing code in GPGPU. Probably only 5% of the code in Manifold's new generation products is GPGPU code. Improving data access and harnessing heterogeneous parallelism to reduce bottlenecks in the non-GPU parts of the system account for most of the rest. Those ratios will probably hold true for any third generation GPGPU application.
In addition to data access and general GPGPU programming, logistics and business issues are also key, such as ensuring that supporting technologies are practical for commercial vendors.
Vendor Choices between CUDA and OpenCL:
The technical differences between the NVIDIA and AMD approaches to GPGPU are less significant than other considerations:
- OpenCL is not significantly different from CUDA in terms of algorithms or code. For an experienced team, writing code that can target either CUDA or OpenCL requires less than 5% more effort. Logistics and other business issues usually determine whether production code targets CUDA or OpenCL.
- In most cases the architectural details of the GPU/RAM/CPU system are not significant compared to the overall effort of writing heterogeneous parallel code that can use CPUs for housekeeping, dispatch and data access while saturating GPU cores for processing throughput. Differences in RAM access, for example, would not be expected to have significant effects in production applications at the user level compared to much greater effects from other system characteristics such as the total amount of RAM, specific GPU configurations, the nature and type of CPU, disk architecture, specific software in use, and so on. This is similar to economic effects seen today: for larger applications it is usually better to buy a slower Core i7 with 24 GB of slower RAM running 64-bit Windows than a faster Core i7 with only 2 GB of faster RAM running 32-bit Windows.
- Data access dominates most GPGPU tasks involving larger data. Because disks are thousands of times slower than RAM or CPU, variations between memory, CPU and GPU architectures in larger-data applications will not be as consequential as variations in data access.
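A back-of-envelope calculation illustrates why. The throughput figures below are rough, assumed orders of magnitude for a mechanical disk and main memory, not benchmarks of any specific hardware:

```python
# Illustrative (assumed) throughput figures: a mechanical disk streams on the
# order of 100 MB/s while RAM delivers tens of GB/s, so on a large dataset
# the disk term dominates total time regardless of GPU speed.

DATASET_GB    = 100     # hypothetical large GIS dataset
DISK_MBPS     = 100     # assumed sequential read rate from disk
RAM_GBPS      = 20      # assumed main-memory bandwidth
GPU_COMPUTE_S = 30      # assumed GPU processing time once data is resident

disk_s = DATASET_GB * 1024 / DISK_MBPS    # seconds to read from disk
ram_s  = DATASET_GB / RAM_GBPS            # seconds to move through RAM

print(f"disk: {disk_s:.0f} s, ram: {ram_s:.0f} s, gpu: {GPU_COMPUTE_S} s")
```

With these assumptions the disk term is roughly a thousand seconds against a few seconds of RAM traffic and GPU compute, so shaving the memory/CPU/GPU path matters far less than improving data access itself.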
The business differences are significant in two key respects:
- CUDA made it practical to distribute commercial products where OpenCL made it difficult. CUDA made it possible to publish a single executable for all NVIDIA GPUs without disclosing the source code of the GPGPU portions. OpenCL required either publishing multiple versions for different GPUs or risking disclosure of proprietary innovations.
- NVIDIA has aggressively supported CUDA through many Windows versions and many GPUs, including legacy GPUs. Any gaps in a GPU vendor's software coverage affect a commercial application vendor's tech support. Releasing for NVIDIA has carried a much lower risk of gaps.
When Deng Xiaoping was asked if economic reforms meant China was becoming capitalist rather than communist, he replied, "It does not matter if the cat is black or white as long as it catches mice." For software vendors it does not matter if the GPU is green or red as long as it is practical to sell and support software for that GPU. Most vendors appear to keep OpenCL work ongoing to provide a backup in case NVIDIA begins charging for CUDA and to provide selected support for specific AMD products as desired.
Third Generation Applications Do Not Recycle Code
GPGPU is taking off, but mainly in small-data, computation-bound applications and not in the large-data interactive applications typical of GIS. Most development teams are still building first and second generation GPGPU applications. They have not yet learned heterogeneous processing or developed the data access innovations that are central to third generation applications aimed at larger data.
Once a team begins third generation work it becomes clear that GPGPU and innovative data access are the scientifically difficult but organizationally interesting part. The more tedious task, from a labor perspective, is creating a large application within a non-parallel OS environment that is safe in every detail for CPU parallelism as well as GPU parallelism, the hallmark of third generation heterogeneous parallelism.
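The heterogeneous division of labor described earlier, with CPUs handling housekeeping, dispatch and data access while the GPU is kept saturated, can be sketched as a producer/consumer pipeline. This is an illustrative toy, not Manifold's code; a plain function stands in for a GPU kernel launch:

```python
import queue
import threading

# Sketch (illustrative only): a CPU thread streams data chunks toward a
# worker that stands in for the GPU, keeping it continuously fed.

work_q = queue.Queue(maxsize=4)   # bounded: the reader cannot outrun the "GPU"
results = []

def reader():
    """CPU-side data access: read chunks and dispatch them to the worker."""
    for chunk_id in range(10):
        chunk = list(range(chunk_id * 100, chunk_id * 100 + 100))
        work_q.put(chunk)
    work_q.put(None)              # sentinel: no more data

def gpu_worker():
    """Stand-in for a GPU kernel: here just a per-chunk reduction."""
    while True:
        chunk = work_q.get()
        if chunk is None:
            break
        results.append(sum(chunk))

t_read = threading.Thread(target=reader)
t_gpu = threading.Thread(target=gpu_worker)
t_read.start(); t_gpu.start()
t_read.join(); t_gpu.join()
print(len(results), "chunks processed")
```

The bounded queue is the essential design choice: it overlaps data access with processing while preventing unbounded buffering when the data source outruns the processor, or vice versa.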
For example, very few parts of standard Windows or Microsoft code can be used, since almost none of that code is thread safe. Code from old applications is not likely to be thread safe either. It also means every connection to any external data source has to be thread safe: Oracle's OCI is thread safe but ODBC is not, and some MySQL native drivers advertise whether they are thread safe, so those can be tested and used when they are. If a connection is not thread safe, special code must be written to access the data source through a dedicated thread. Larger, fully third generation applications must deal with tens of thousands of such issues, potentially involving millions of lines of code. These issues are not as technically challenging as the fundamental science of heterogeneous parallelism, but there are many of them to work through. Even with a large staff it is time consuming to write, test and debug all of the code involved, and that work cannot begin until the fundamental GPGPU and data access technologies have been implemented.
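The dedicated-thread workaround mentioned above can be sketched as follows. The names and the toy "driver" are hypothetical, not any real database API; the pattern is simply that one thread owns the non-thread-safe connection and all other threads submit requests through a queue:

```python
import queue
import threading

# Sketch of the dedicated-thread pattern (illustrative names, not a real
# driver API): a non-thread-safe connection is created and used by exactly
# one thread; callers never touch it directly.

class DedicatedConnection:
    """Serialize all access to a non-thread-safe resource on one thread."""

    def __init__(self, connect):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._loop, args=(connect,),
                                        daemon=True)
        self._thread.start()

    def _loop(self, connect):
        conn = connect()                  # created on the owning thread only
        while True:
            sql, reply = self._requests.get()
            if sql is None:
                break
            reply.put(conn(sql))          # only this thread touches conn

    def execute(self, sql):
        """Safe to call from any thread: hands the request to the owner."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((sql, reply))
        return reply.get()

    def close(self):
        self._requests.put((None, None))
        self._thread.join()

# Usage with a toy "driver": the connection is just a function echoing queries.
db = DedicatedConnection(lambda: (lambda sql: f"rows for {sql!r}"))
print(db.execute("SELECT 1"))
db.close()
```

Multiplied across every non-thread-safe driver, library and OS facility an application touches, this is the kind of small, unglamorous wrapper that accounts for much of the labor described above.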
The solution most third generation vendors may adopt is similar to Manifold's approach: solve the most difficult scientific and algorithmic problems involving heterogeneous parallelism and data access first, and get that code into OEM channels where customers can use it within their own large development organizations.
The vendor can then leverage the OEM experience to validate and improve the engine while steadily working through the thousands of smaller issues involved in implementing a large retail application on that engine. This is a long slog, but it is a reliable process since no new science or algorithmic breakthroughs are required.