IEEE PARALLEL & DISTRIBUTED TECHNOLOGY 1063-6552/96/$4.00 © 1996 IEEE
Vol. 4, No. 2: SUMMER 1996, pp. 63-79

Distributed Shared Memory: Concepts and Systems

Jelica Protić, University of Belgrade

Milo Tomašević, University of Belgrade

Veljko Milutinović, University of Belgrade

In surveying current approaches to distributed shared-memory computing, these authors find that the reduced cost of parallel software development will help make the DSM paradigm a viable solution to large-scale, high-performance computing.


Research and development of systems with multiple processors has shown significant progress recently. These systems, needed to deliver the high computing power necessary for satisfying the ever-increasing demands of today's typical applications, usually fall into two large classifications, according to their memory system organization: shared- and distributed-memory systems.

A shared-memory system1 (often called a tightly coupled multiprocessor) makes a global physical memory equally accessible to all processors. These systems offer a general and convenient programming model that enables simple data sharing through a uniform mechanism of reading and writing shared structures in the common memory. Users can readily emulate other programming models on these systems. The programming ease and portability of these systems cut parallel software development costs. However, shared-memory multiprocessors typically suffer from increased contention and longer latencies in accessing the shared memory, which degrades peak performance and limits scalability compared to distributed systems. Memory system design also tends to be complex.

In contrast, a distributed-memory system (often called a multicomputer) consists of multiple independent processing nodes with local memory modules, connected by a general interconnection network. The scalable nature of distributed-memory systems makes systems with very high computing power possible. However, communication between processes residing on different nodes involves a message-passing model that requires explicit use of send/receive primitives. Most programmers find this model more difficult, because they must take care of data distribution across the system and manage the communication explicitly. Also, process migration imposes problems because of the different address spaces. Therefore, compared to shared-memory systems, hardware problems are easier and software problems more complex in distributed-memory systems.

A relatively new concept, distributed shared memory (DSM),2,3 combines the advantages of the two approaches. A DSM system logically implements the shared-memory model on a physically distributed-memory system. System designers can implement the specific mechanism for achieving the shared-memory abstraction in hardware or software in a variety of ways. The DSM system hides the remote communication mechanism from the application writer, preserving the programming ease and portability typical of shared-memory systems. DSM systems allow relatively easy modification and efficient execution of existing shared-memory applications, which preserves software investments while maximizing the resulting performance. In addition, DSM systems inherit the scalability and cost-effectiveness of the underlying distributed-memory systems. Consequently, DSM systems offer a viable choice for building efficient, large-scale multiprocessors.

The DSM model's ability to provide a transparent interface and convenient programming environment for distributed and parallel applications has made it the focus of numerous research efforts in recent years. Current DSM system research focuses on the development of general approaches that minimize the average access time to shared data, while maintaining data consistency. Some solutions implement a specific software layer on top of existing message-passing systems. Others extend strategies applied in shared-memory multiprocessors with private caches to multilevel memory systems.4-6

This article reviews the increasingly important area of DSM. After first covering general DSM concepts and approaches, it surveys existing DSM systems, developed either as research prototypes or commercial products and standards. Although not exhaustive, this survey tries to provide an extensive, up-to-date overview of several key implementation schemes for maintaining data in DSM systems.


General DSM system structure

A DSM system generally involves a set of nodes or clusters, connected by an interconnection network (Figure 1). A cluster itself can be a uniprocessor or a multiprocessor system, usually organized around a shared bus. Private caches attached to the processors are virtually inevitable for reducing memory latency. Each system cluster contains a physically local memory module, which maps partially or entirely to the DSM global address space. Regardless of the network topology (bus, ring, mesh, or local-area network), a specific interconnection controller in each cluster must connect it into the system.


Figure 1. Structure and organization of a DSM system.

Information about states and current locations of particular data blocks usually resides in a system table or directory. Directory storage and organization are among the most important design decisions, greatly affecting system scalability. Directory organization varies from full-map storage to different dynamic organizations, such as singly or doubly linked lists and trees. No matter the organization, each cluster provides storage either for the entire directory or for just a part of it. In this way, the system directory can distribute across the system as a flat or hierarchical structure. In hierarchical topologies, if clusters on intermediate levels exist, they usually contain only directories and the corresponding interface controllers. Directory organization and the semantics of information kept in directories depend on the applied method for maintaining data consistency.


Classifications of DSM systems

Since the first DSM research efforts in the mid-eighties, interest in this concept has grown significantly, resulting in tens of systems developed predominantly as research prototypes. Designers of the early DSM systems were inspired by the principles of virtual memory, as well as by cache-coherence maintenance in shared-memory multiprocessors.

Networks of workstations, which are becoming increasingly popular and powerful, represent the most suitable platform for many programmers to approach parallel computing. However, communication latency, including operating system overhead and transfer time, remains the main obstacle to matching the performance of high-end machines with such network-based systems. Conversely, designers of shared-memory multiprocessors strive for scalability, achieved through physical distribution of shared memory and sophisticated organization of the overall system through such techniques as clustering and hierarchical layout.

Thus, as the gap between multiprocessors and multicomputers (that early DSM intended to bridge) narrows and the basic ideas and performance of both classes of systems seemingly converge, many more emerging systems fit into the large family of modern DSM. To eliminate misunderstanding, we adopt a general definition for this family, which assumes that all systems providing a shared-memory abstraction on a distributed-memory system belong to the DSM category.

Three key issues arise when accessing data items in the DSM address space while keeping the data consistent: how shared data is distributed and managed (the DSM algorithm), where the DSM mechanism is implemented (in hardware, software, or a combination of the two), and which memory consistency model the system provides. The following sections address each of these issues.

DSM ALGORITHMS

The algorithms for implementing DSM deal with two basic problems: distributing shared data across the system so as to minimize access latency, and preserving a coherent view of shared data while keeping coherence-management overhead low.

Two frequently used strategies for distributing shared data are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories (or caches). It is mainly used to enable simultaneous accesses by different sites to the same data, predominantly when read sharing prevails.

Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use. To decrease coherence-management overhead, users prefer this strategy when sequential patterns of write sharing are prevalent. System designers must choose a DSM algorithm that is well-adapted to the system configuration and characteristics of memory references in typical applications.

Classifications of DSM algorithms and the evaluation of their performance have been extensively discussed.7-10 Our presentation follows a classification of algorithms similar to that of Michael Stumm and Songnian Zhou.8

Single reader/single writer algorithms. This class of algorithms prohibits replication, while permitting but not requiring migration. The simplest DSM management algorithm is the central server algorithm.8 The approach relies on a unique central server that services all access requests from other nodes for shared data, which is physically located on the server node. This algorithm suffers from performance problems because the central server can become a bottleneck in the system. Such an organization implies no physical distribution of shared memory. A possible modification is the static distribution of physical memory and of responsibility for parts of the shared address space onto several different servers. Simple mapping functions, such as hashing, can serve to locate the appropriate server for the corresponding piece of data.
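
To make the static-distribution idea concrete, here is a minimal Python sketch of how a node might hash a shared address to the server statically responsible for it; the block size, server count, and function names are our own illustrative choices, not taken from any particular system.

# Minimal sketch of static server selection in an SRSW scheme (hypothetical names).
# Each block of the shared address space is permanently assigned to one server
# by a simple hash of its block number, so no directory lookup is needed.

BLOCK_SIZE = 4096          # bytes per shared-memory block (illustrative value)
NUM_SERVERS = 8            # nodes that each serve a static part of the address space

def home_server(address: int) -> int:
    """Return the server statically responsible for the block holding 'address'."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_SERVERS          # simple hashing by modulo

if __name__ == "__main__":
    # Every read or write request for an address goes directly to its home server.
    for addr in (0x0000, 0x2F00, 0x13000):
        print(f"address {addr:#08x} -> server {home_server(addr)}")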

More sophisticated SRSW algorithms also permit migration. However, only one copy of the data item can exist at any one time, and this copy can migrate upon demand. This kind of algorithm is called hot potato.10 If an application exhibits high reference locality, the cost of data migration is amortized over multiple accesses, because data moves not as individual items but in fixed-size units (blocks). The algorithm can perform well when a long sequence of accesses from one processor, uninterrupted by accesses from other processors, is likely, and when writes frequently follow reads to the same data. Even so, the performance of this rarely used algorithm is severely limited, because it does not exploit the parallelism available from multiple read-only copies when read sharing prevails.

Multiple reader/single writer algorithms. The main intention of MRSW (or read-replication) algorithms is to reduce the average cost of read operations, on the assumption that read sharing is the prevalent pattern in parallel applications. To this end, they allow simultaneous local execution of read operations at multiple hosts. Only one host at a time can receive permission to update a replicated copy. A write to a writable copy increases the cost of the operation, because the use of stale replicated copies at other sites must be prevented. Therefore, MRSW algorithms are usually invalidation-based. A great many protocols follow this principle.

Algorithms in this class differ in the allocation of DSM management responsibility. Kai Li and Paul Hudak proposed several of them.7 Several terms need defining before we discuss those algorithms. The owner of a data block is the site that most recently had write access to it and therefore holds its valid copy. The manager is the site that keeps track of the block's owner and organizes write access to it. The copy set is the set of sites that currently hold read copies of the block.

The algorithms proposed by Li and Hudak include:

(1) Centralized manager algorithm. All read and write requests go to the manager, which is the only site that keeps the identity of a particular data block's owner. The manager forwards the data request to the owner, and waits for confirmation from the requesting site, indicating that it received the copy of the block from the owner. For a write operation, the manager also sends invalidations to all sites from the copy set (a vector that identifies the current holders of the data block, kept by the manager).
(2) Improved centralized manager algorithm. Unlike the original centralized manager algorithm, the owner, not the manager, keeps the copy set. The copy set goes together with the data to the new owner, which is also responsible for invalidations. Here, decentralized synchronization improves overall performance.
(3) Fixed distributed manager algorithm. In this algorithm, instead of centralizing the management, each site manages a predetermined subset of data blocks. The distribution proceeds according to some default mapping function. Clients can still override it by supplying their own mapping, tailored to the application's expected behavior. When a parallel program exhibits a high rate of requests for data blocks, this algorithm outperforms the centralized solutions.
(4) Broadcast distributed manager algorithm. This algorithm has no manager. Instead, the requesting processor sends a broadcast message to find the data block's true owner. The owner performs all actions just like the manager in the previous algorithms, and keeps the copy set. But in this approach, all processors must process each broadcast, slowing their own computations.
(5) Dynamic distributed manager algorithm. In this algorithm, the identity of the probable owner, not the real owner, is kept for each particular data block. All requests go to the probable owner, which usually is also the real owner. However, if the probable owner is not the real one, it forwards the request to the node that its own table identifies as the probable owner. On every read request, write request, forwarding, and invalidation message, the probable-owner field is updated accordingly, to decrease the number of messages needed to locate the real owner. This algorithm is often called Li's algorithm. For its basic version, where ownership changes on both read and write faults, performance does not deteriorate as more processors are added to the system, but rather degrades logarithmically as more processors contend for the same data block.7
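
To make the probable-owner idea concrete, the following Python sketch (class and function names are ours, and the network is collapsed into direct table accesses) simulates requests chasing the chain of probable-owner hints until the real owner is found, updating the hints along the way.

# Sketch of probable-owner forwarding in the dynamic distributed manager algorithm.
# Each node keeps only a hint (probable owner) per block; requests chase the
# chain of hints until the real owner is found, shortening the chain as they go.

class Node:
    def __init__(self, node_id, num_blocks, initial_owner):
        self.node_id = node_id
        # probable_owner[block] is only a hint; it may be stale.
        self.probable_owner = [initial_owner] * num_blocks

def request_block(nodes, requester, block):
    """Chase probable-owner hints from 'requester' to the real owner; return hop count."""
    hops = 0
    current = requester
    while nodes[current].probable_owner[block] != current:
        next_hop = nodes[current].probable_owner[block]
        # Every node on the path now points at the requester, which becomes
        # the new owner (in the basic version, ownership changes on each fault).
        nodes[current].probable_owner[block] = requester
        current = next_hop
        hops += 1
    nodes[current].probable_owner[block] = requester    # old owner's hint
    nodes[requester].probable_owner[block] = requester  # requester now owns the block
    return hops

if __name__ == "__main__":
    nodes = [Node(i, num_blocks=1, initial_owner=0) for i in range(6)]
    for req in (3, 5, 1):
        print(f"node {req} reached the owner of block 0 in {request_block(nodes, req, 0)} hop(s)")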

A modification of the dynamic distributed manager algorithm suggests a distribution of the copy set that should be organized as a tree, rooted at the owner site. This modification also distributes the responsibility for invalidations.7

Multiple reader/multiple writer algorithms. The MRMW (also called full-replication) algorithm allows replication of data blocks with both read and write permission. To preserve coherence, updates of each copy must distribute to all other copies on remote sites, by multicast or broadcast messages. Because this algorithm tries to minimize the cost of write access, it is appropriate for write sharing and often serves with write-update protocols. This algorithm can produce high coherence traffic, especially when update frequency and the number of replicated copies are high.

Protocols complying with the MRMW algorithm can be complex and demanding. One way to maintain data consistency is to globally sequence the write operations, that is, to implement a reliable multicast. When a processor attempts to write to the shared memory, the intended modification goes to the sequencer. The sequencer assigns the next sequence number to the modification and multicasts the modification with this sequence number to all sites holding a copy. When the modification arrives at a site, the site verifies the sequence number and, if it is not the expected one, requests a retransmission.
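
The following Python sketch illustrates this sequencing scheme under simplifying assumptions of our own (a single in-process sequencer, and delivery modeled as direct method calls): the sequencer stamps every write, and a replica that detects a missing sequence number fetches the missing write before applying the new one.

# Sketch of write sequencing for an MRMW (full-replication) algorithm.
# A single sequencer stamps each write; replicas apply writes in stamp order
# and request a retransmission whenever a sequence number is missing.

class Sequencer:
    def __init__(self):
        self.next_seq = 0
        self.log = {}                      # seq -> (address, value), kept for retransmission

    def submit(self, address, value):
        seq = self.next_seq
        self.next_seq += 1
        self.log[seq] = (address, value)
        return seq, address, value         # in a real system this would be multicast

class Replica:
    def __init__(self, name):
        self.name = name
        self.memory = {}
        self.expected_seq = 0

    def deliver(self, seq, address, value, sequencer):
        # Recover any writes lost or reordered in transit before applying this one.
        while self.expected_seq < seq:
            missing = sequencer.log[self.expected_seq]   # the "retransmission"
            self._apply(missing)
        self._apply((address, value))

    def _apply(self, write):
        address, value = write
        self.memory[address] = value
        self.expected_seq += 1

if __name__ == "__main__":
    seq_node = Sequencer()
    a, b = Replica("A"), Replica("B")
    w0 = seq_node.submit("x", 1)
    w1 = seq_node.submit("x", 2)
    a.deliver(*w0, sequencer=seq_node)
    a.deliver(*w1, sequencer=seq_node)
    b.deliver(*w1, sequencer=seq_node)     # B missed w0 and recovers it first
    print(a.memory, b.memory)              # both replicas agree: {'x': 2}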

A modification of this algorithm distributes the sequencing task.11 In this solution, the server managing the master copy of any particular data structure sequences writes to that data structure. Although the system is not sequentially consistent in this case, each particular data structure is maintained consistently.

Avenues for performance improvement. Researchers have dedicated considerable effort to various modifications of the basic algorithms, to improve their behavior and enhance their performance by reducing the amount of data transferred in the system. Most of those ideas were evaluated by simulation studies, and some were implemented on existing prototype systems.

An enhancement of Li's algorithm called the Shrewd algorithm eliminates unnecessary page transfers by keeping a sequence number with each copy of a page.10 On each write fault at a node with an existing read-only copy, the sequence number accompanies the request. If this number matches the number kept by the owner, the requester can access the page without a transfer. This solution shows remarkable benefits as the read-to-write ratio increases.
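
A minimal sketch of the Shrewd check, with hypothetical names and values, follows: if the requester's copy carries the same sequence number as the owner's, the page content need not travel at all.

# Sketch of the Shrewd check: a write fault at a node that already holds a
# read-only copy sends the copy's sequence number with its request; if the
# owner's number matches, the page itself need not be transferred again.

def handle_write_fault(owner_seq: int, requester_seq: int, page_size: int) -> int:
    """Return the number of bytes that must travel to satisfy the write fault."""
    if requester_seq == owner_seq:
        return 0              # local copy is current: grant write access only
    return page_size          # stale or absent copy: transfer the whole page

if __name__ == "__main__":
    print(handle_write_fault(owner_seq=7, requester_seq=7, page_size=4096))  # 0 bytes
    print(handle_write_fault(owner_seq=8, requester_seq=7, page_size=4096))  # 4096 bytes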

All of Li and Hudak's solutions assume that a page transfer executes on each attempt to access a page that does not reside on the accessing site. One modification employs a competitive algorithm and allows page replication only when the number of accesses to the remote page exceeds the replication cost.9 A similar rule applies to migration, although because only one site can hold the page in this case, the condition to migrate the page is more restrictive and depends on other sites' access patterns to the same page. The performance of these policies is guaranteed to stay within a constant factor of the optimal.
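
The competitive rule can be pictured as a per-page counter compared against the replication cost. The sketch below uses hypothetical cost values and names of our own; it replicates a page only once the accumulated cost of remote accesses has matched the cost of installing a local copy.

# Sketch of a competitive replication decision: replicate a remote page only
# after the accumulated cost of remote accesses exceeds the cost of copying it.

REMOTE_ACCESS_COST = 1        # hypothetical cost units per remote access
REPLICATION_COST = 50         # hypothetical cost of transferring and installing a copy

class PageState:
    def __init__(self):
        self.remote_accesses = 0
        self.replicated = False

    def on_remote_read(self):
        if self.replicated:
            return "local"                      # already have a copy
        self.remote_accesses += 1
        if self.remote_accesses * REMOTE_ACCESS_COST >= REPLICATION_COST:
            self.replicated = True              # remote accesses have paid for the copy
            return "replicate"
        return "remote"

if __name__ == "__main__":
    page = PageState()
    actions = [page.on_remote_read() for _ in range(60)]
    print("replication happens on access", actions.index("replicate"))   # access 49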

The Mirage system applies another restriction to data transfer requests, which reduces thrashing, an adverse effect that occurs when an alternating sequence of accesses to the same page, issued by different sites, makes its migration the predominant activity. The solution to this problem defines a time window Δ in which the site is guaranteed to possess the page uninterruptedly after acquiring it. Users can tune the value of Δ statically or dynamically, depending on the degree of processor locality exhibited by the particular application.
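
A minimal sketch of such a time window, with a hypothetical Δ value and names of our own, follows: requests arriving before the window expires are refused, so the holder is guaranteed a period of uninterrupted access.

# Sketch of a Mirage-style time window: after a site acquires a page, other
# sites' requests for it are refused until the window (delta) has expired,
# which limits thrashing under ping-pong access patterns.

import time

class PageHolder:
    def __init__(self, delta_seconds: float):
        self.delta = delta_seconds
        self.acquired_at = None

    def acquire(self):
        self.acquired_at = time.monotonic()

    def may_release(self) -> bool:
        """A remote request may take the page only after the window expires."""
        return time.monotonic() - self.acquired_at >= self.delta

if __name__ == "__main__":
    holder = PageHolder(delta_seconds=0.05)
    holder.acquire()
    print(holder.may_release())    # False: still inside the window
    time.sleep(0.06)
    print(holder.may_release())    # True: the page may now migrate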

A variety of specific algorithms have been implemented in existing DSM systems or simulated extensively using appropriate workload traces. Early DSM implementations found the main source of possible performance and scalability improvements in various solutions for the organization and storage of system tables, such as copy set, as well as in the distribution of management responsibilities. To improve performance, recent DSM implementations relax memory consistency semantics, which requires considerable modification of the algorithms and organization of directory information. Implementations of critical operations using hardware accelerators and a combination of invalidate and update methods also improve modern DSM system performance.

IMPLEMENTATION LEVEL OF THE DSM MECHANISM

The level where the DSM mechanism is implemented is one of the most important decisions in building a DSM system: it affects both programming and overall system performance and cost.

To achieve ease of programming, cost-effectiveness, and scalability, DSM systems logically implement the shared-memory model on physically distributed memories.12,14 Because the DSM's shared address space distributes across local memories, a lookup must execute on each access to shared data, to determine whether the requested data is in the local memory. If not, the system must bring the data to the local memory. The system must also take an action on write accesses to preserve the coherence of shared data. Both the lookup and the action can execute in software, hardware, or a combination of both. According to this property, systems fall into three groups: software, hardware, and hybrid implementations.

The choice of implementation usually depends on price/performance trade-offs. Although typically superior in performance, hardware implementations require additional complexity, which only high-performance, large-scale machines can afford. Low-end systems, such as networks of personal computers, based on commodity microprocessors, still do not tolerate the cost of additional hardware for DSM, which limits them to software implementation. For mid-range systems, such as clusters of workstations, low-cost additional hardware, typically used in hybrid solutions, seems appropriate.

Software DSM implementations. Until the mid-eighties, distributed systems widely employed message-passing communication. However, this is much less convenient than the shared-memory programming model, because the programmer must be aware of data distribution and explicitly manage data exchange via messages. In addition, such systems make it difficult to pass complex data structures, and process migration across multiple address spaces is aggravated. Therefore, the idea of building a software mechanism that provides the shared-memory paradigm to the programmer on top of message passing emerged in the mid-eighties. Generally, this can be achieved through user-level run-time library routines, the OS, or a programming language.

Some DSM systems combine the elements of these three approaches. Larger grain sizes (on the order of a kilobyte) are typical for software solutions, because DSM management is usually supported through virtual memory. Thus, if the requested data is absent from local memory, a page-fault handler will retrieve the page either from the local memory of another cluster or from disk. Coarse-grain pages are advantageous for applications with high locality of references, and they also reduce the necessary directory storage. However, parallel programs characterized by fine-grain sharing are adversely affected, because of false sharing and thrashing.
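
The following toy model, with names of our own and all coherence actions omitted for brevity, illustrates this virtual-memory-style behavior: an access to a page that is not locally resident triggers a fault handler that fetches the page from another node in the cluster.

# Toy model of page-based software DSM: each node maps page numbers to local
# page frames, and a miss triggers a "page fault" that fetches the page from
# whichever node currently holds it (coherence and disk are omitted for brevity).

PAGE_SIZE = 4096

class DsmNode:
    def __init__(self, name):
        self.name = name
        self.pages = {}                 # page number -> bytearray (locally resident pages)

    def read(self, address, cluster):
        page_no, offset = divmod(address, PAGE_SIZE)
        if page_no not in self.pages:   # analogous to a page fault
            self.pages[page_no] = cluster.fetch_page(page_no, requester=self)
        return self.pages[page_no][offset]

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def fetch_page(self, page_no, requester):
        for node in self.nodes:
            if node is not requester and page_no in node.pages:
                return bytearray(node.pages[page_no])   # copy the page over the "network"
        return bytearray(PAGE_SIZE)                      # first touch: fresh zeroed page

if __name__ == "__main__":
    a, b = DsmNode("A"), DsmNode("B")
    cluster = Cluster([a, b])
    a.pages[3] = bytearray(PAGE_SIZE)
    a.pages[3][10] = 42                                  # node A holds page 3
    print(b.read(3 * PAGE_SIZE + 10, cluster))           # B faults, fetches page 3, reads 42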

Software support for DSM is generally more flexible than hardware support and enables better tailoring of the consistency mechanisms to the application behavior. However, it usually cannot compete with hardware implementations in performance. Apart from introducing hardware accelerators to solve the problem, designers also concentrate on relaxing the consistency model, although this can put an additional burden on the programmer. Because research can rely on widely available programming languages and OSs on the networks of workstations, numerous implementations of software DSM have emerged.

The "Software DSM implementations" sidebar describes some of the better-known representations.

Hardware-level DSM implementations. Hardware-implemented DSM mechanisms ensure automatic replication of shared data in local memories and processor caches, transparently for software layers. This approach efficiently supports fine-grain sharing. The nonstructured, physical unit of replication and coherence is small, typically a cache line. Consequently, hardware DSM mechanisms usually represent an extension of the principles found in cache-coherence schemes of scalable shared-memory architectures. This approach considerably reduces communication requirements, because finer sharing granularities minimize the detrimental effects of false sharing and thrashing. Searching and directory functions implemented in hardware are much faster than with software-level implementations, and memory-access latencies decrease. However, advanced coherence-maintenance and latency-reduction techniques usually complicate design and verification. Therefore, hardware DSM is often used in high-end machines where performance is more important than cost.

See the "Hardware DSM implementations" sidebar for a description of three especially interesting groups of hardware DSM systems.

Hybrid-level DSM implementations. During the evolution of this field, the research community proposed numerous entirely hardware or entirely software implementations of the DSM mechanism. However, even in entirely hardware DSM approaches, there are software-controlled features explicitly visible to the programmer for memory reference optimization (for example, prefetch, update, and deliver in Dash, and prefetch and poststore in KSR1). Many purely software solutions, in turn, require some hardware support, such as virtual memory management hardware in IVY and ECC in Blizzard-E. As is to be expected, neither the entirely hardware nor the entirely software approach has all the advantages. Therefore, it is quite natural to employ hybrid methods, with predominantly or partially combined hardware and software elements, to balance the cost-to-complexity trade-offs.

The "Hybrid-level DSM implementations" sidebar summarizes some of these tradeoffs.

MEMORY CONSISTENCY MODELS

The memory consistency model defines the legal ordering of memory references issued by a processor, as observed by other processors.2,14 Different types of parallel applications inherently require various consistency models. The model's restrictiveness largely influences system performance in executing these applications. Stronger forms of the consistency model typically increase memory access latency and bandwidth requirements, while simplifying programming. Looser constraints in more relaxed models, which allow memory reordering, pipelining, and overlapping, consequently improve performance, at the expense of higher programmer involvement in synchronizing shared data accesses. For optimal behavior, systems with multiple consistency models adaptively applied to appropriate data types have recently emerged.

Stronger memory consistency models that treat synchronization accesses as ordinary read and write operations are sequential and processor consistency. More relaxed models that distinguish between ordinary and synchronization accesses are weak, release, lazy release, and entry consistency.

Sequential consistency mandates that all a system's processors observe the same interleaving of reads and writes issued in sequences by individual processors. A simple implementation of this model is a single-port shared-memory system that enforces serialized access servicing from a single first-in, first-out (FIFO) queue. DSM systems achieve a similar implementation by serializing all requests on a central server node. Neither case allows bypassing of read and write requests from the same processor. Conditions for sequential consistency hold in the majority of bus-based, shared-memory multiprocessors, as well as in early DSM systems, such as IVY and Mirage.
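
A minimal sketch of such a serialization point follows (the class names and message format are ours): all reads and writes enter one FIFO queue and are serviced in arrival order, so every processor observes the same interleaving.

# Sketch of sequential consistency enforced by a single serialization point:
# every read and write, from any processor, is placed in one FIFO queue and
# serviced in arrival order, so all processors observe the same interleaving.

from collections import deque

class CentralMemoryServer:
    def __init__(self):
        self.memory = {}
        self.queue = deque()              # single FIFO of (processor, op, addr, value)
        self.trace = []                   # the one interleaving every processor observes

    def submit(self, processor, op, addr, value=None):
        self.queue.append((processor, op, addr, value))

    def drain(self):
        while self.queue:
            processor, op, addr, value = self.queue.popleft()
            if op == "write":
                self.memory[addr] = value
                self.trace.append(f"P{processor} writes {addr}={value}")
            else:
                self.trace.append(f"P{processor} reads {addr}={self.memory.get(addr)}")

if __name__ == "__main__":
    server = CentralMemoryServer()
    server.submit(1, "write", "x", 1)
    server.submit(2, "read", "x")
    server.submit(1, "write", "x", 2)
    server.drain()
    print("\n".join(server.trace))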

Processor consistency assumes that the order in which different processors see memory operations need not be identical, but all processors must observe the writes issued by each processor in the same order. Unlike sequential consistency, processor consistency implementations allow reads to bypass writes in the queues from which memory requests are serviced. Examples of systems that guarantee processor consistency are VAX 8800, Plus, Merlin, and RMS.

Weak consistency distinguishes between ordinary and synchronization memory accesses. It requires that memory becomes consistent only on synchronization accesses. In this model, requirements for sequential consistency apply only to synchronization accesses. A synchronization access also must wait for all previous accesses to execute, while ordinary reads and writes must wait only for completion of previous synchronization accesses. Sun's Sparc architecture uses a variant of the weak consistency model.

Release consistency further divides synchronization accesses into acquires and releases, so that protected ordinary shared accesses can execute between acquire-release pairs. In this model, an ordinary read or write access can execute only after all previous acquires on the same processor execute. In addition, a release can execute only after all previous ordinary reads and writes on the same processor execute. Finally, acquire and release synchronization accesses must fulfill the requirements that processor consistency puts on ordinary read and write accesses. The Dash and Munin DSM systems exhibit different implementations of release consistency.

Lazy release consistency is an enhancement of release consistency.15 Instead of propagating modifications to the shared address space on each release (as in release consistency, sometimes called eager release consistency), modifications wait until the next relevant acquire. Also, not all modifications must propagate on the acquire, but only those associated with the chain of preceding synchronization operations on that specific lock. This minimizes the amount of data exchanged, while also reducing the number of messages by combining modifications with lock acquires in one message. The TreadMarks DSM system implements lazy release consistency.
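
The difference between eager and lazy propagation can be sketched as follows (the sketch is ours and abstracts away diff creation, vector timestamps, and the actual messaging): an eager release pushes the critical section's writes to every copy holder, whereas a lazy scheme hands them only to the next acquirer of the same lock.

# Sketch contrasting eager and lazy propagation of the writes made inside a
# critical section. Under eager release consistency the releaser pushes its
# modifications to every node holding a copy; under lazy release consistency
# the modifications travel only to the next node that acquires the same lock.

def eager_release(dirty_writes, copy_holders):
    """Releaser sends its writes to every holder of a copy; returns message count."""
    for holder in copy_holders:
        holder.update(dirty_writes)
    return len(copy_holders)

def lazy_acquire(dirty_writes, acquiring_node):
    """Writes are propagated only to the node performing the next acquire."""
    acquiring_node.update(dirty_writes)
    return 1

if __name__ == "__main__":
    writes = {"x": 10, "y": 20}                 # modifications made under the lock
    holders = [dict() for _ in range(4)]        # four other nodes hold copies
    next_acquirer = dict()
    print("eager messages:", eager_release(writes, holders))       # 4
    print("lazy messages :", lazy_acquire(writes, next_acquirer))  # 1
    print(next_acquirer)                        # {'x': 10, 'y': 20}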

Finally, entry consistency also improves on release consistency. This model requires that each ordinary shared variable or object be protected by and associated with a synchronization variable, using language-level annotation. Consequently, propagation of modifications to an ordinary shared variable is deferred until the next acquire of the synchronization variable that guards it. Because only the changes to the subset of shared variables protected by the particular synchronization variable must propagate at that moment, traffic decreases significantly. Latency also falls, because a shared access does not have to wait for the completion of other, unrelated acquires. Performance improves at the expense of higher programmer involvement in specifying synchronization information for each variable. The Midway DSM system first implemented entry consistency.


Important design choices in building DSM systems

In addition to the DSM algorithm, implementation level of DSM mechanism, and memory consistency model, characteristics that strongly affect overall DSM system performance include cluster configuration, interconnection network, structure of shared data, granularity of coherence unit, responsibility for the DSM management, and coherence policy.

Cluster configuration

Varying greatly across different DSM systems, cluster configuration includes one or several, usually off-the-shelf, processors. Because each processor has its own local cache (or even cache hierarchy), cache coherence on the cluster level must be integrated globally with the DSM mechanism. Parts of a local memory module can be configured as private or shared (mapped to the virtual shared address space). In addition to coupling the cluster to the system, the network interface controller sometimes integrates important DSM management responsibilities.

Interconnection networks

Almost all types of interconnection networks found in multiprocessors and distributed systems will work in DSM systems. Most software-oriented DSM systems are network independent, although many are built on top of Ethernet, readily available in most environments. However, topologies such as multilevel buses, ring hierarchies, and meshes have served as platforms for hardware-oriented DSM systems. The interconnection network's topology can enable or restrict the potential for parallel exchange of data related to DSM management. For the same reasons, it also affects scalability. In addition, it determines the possibility and cost of broadcast and multicast transactions, which is very important for implementing DSM algorithms.

Shared data structure

The structure of shared data represents the global layout of shared address space, as well as the organization of data items in it. Hardware solutions always deal with nonstructured data objects, while software implementations tend to use data items that represent logical entities, to take advantage of the locality naturally expressed by the application.

Coherence unit granularity

The granularity of the coherence unit determines the size of the data blocks managed by coherence protocols. This parameter's effect on overall system performance relates closely to the locality of data access typical for the application. In general, hardware-oriented systems use smaller units (typically cache blocks), while software solutions, based on virtual memory mechanisms, organize data in larger physical blocks (pages), counting on coarse-grain sharing. The use of larger blocks saves directory storage, but it also increases the probability that multiple processors will require access to the same block simultaneously even though they actually access unrelated parts of it, a phenomenon called false sharing. This can cause thrashing.
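
The following sketch, using hypothetical addresses and block sizes, illustrates the trade-off: two processors repeatedly write two unrelated variables, and the variables cause coherence traffic only when the chosen block size places them in the same coherence unit.

# Sketch of how coherence-unit size influences false sharing: two processors
# repeatedly write two unrelated variables; if both variables fall into the
# same coherence block, every write by one processor invalidates the other's copy.

def count_invalidations(block_size, addr_a, addr_b, writes_per_cpu):
    block_a, block_b = addr_a // block_size, addr_b // block_size
    if block_a != block_b:
        return 0                       # disjoint blocks: no false sharing
    # Same block: alternating writers bounce the block back and forth.
    return 2 * writes_per_cpu          # one invalidation per write by either CPU

if __name__ == "__main__":
    # Variables 64 bytes apart, 1000 writes from each of the two processors.
    print("32-byte blocks :", count_invalidations(32, 0, 64, 1000))    # 0
    print("4-Kbyte blocks :", count_invalidations(4096, 0, 64, 1000))  # 2000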

DSM management responsibility

The responsibility for DSM management determines which site must handle actions related to the consistency maintenance in the system, and can be centralized or distributed. Centralized management is easier to implement, but the central manager represents a bottleneck. Designers can define the responsibility for distributed management statically or dynamically, eliminating bottlenecks and providing scalability. Distribution of responsibility for DSM management relates closely to the distribution of directory information.

Coherence policy

The coherence policy determines whether the existing copies of a data item being written to at one site will be updated or invalidated on the other sites. The choice of the coherence policy relates to the granularity of shared data. For very fine-grain data items, an update message costs approximately the same as an invalidation message. Therefore, systems with word-based coherence maintenance often use the update policy, but coarse-grain systems largely use invalidation. An invalidation approach's efficiency grows when the read and write access sequences to the same data item by various processors are not highly interleaved. The best performance comes when the coherence policy dynamically adapts to observed reference patterns.
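
As a rough illustration (the names and data structures are ours), the sketch below contrasts the two policies on a single write: invalidation discards the other copies, while update refreshes them, and in both cases the message count grows with the size of the copy set.

# Sketch of the two coherence policies on a write: invalidation removes all
# other copies (later readers must re-fetch), while update pushes the new
# value into every copy (costing traffic proportional to the copy set).

def write_invalidate(copies, writer, address, value):
    """Keep only the writer's copy; return the number of invalidation messages sent."""
    others = [node for node in copies if node is not writer]
    for node in others:
        node.pop(address, None)        # remote copies are discarded
    writer[address] = value
    return len(others)

def write_update(copies, writer, address, value):
    """Push the new value to every copy; return the number of update messages sent."""
    others = [node for node in copies if node is not writer]
    for node in others:
        node[address] = value
    writer[address] = value
    return len(others)

if __name__ == "__main__":
    nodes = [{"x": 1} for _ in range(4)]
    print("invalidations:", write_invalidate(nodes, nodes[0], "x", 2), nodes)
    print("updates      :", write_update(nodes, nodes[0], "x", 3), nodes)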


CONCLUSION

Because they combine the advantages of shared-memory and distributed-memory systems, DSM approaches appear to be a viable solution for large-scale, high-performance computing with a reduced cost of parallel software development. However, efforts to build successful commercial systems that follow the DSM paradigm are still in their infancy, so research prototypes still prevail. Therefore, DSM remains a very active research area. Promising research directions include

From this point of view, further investments in exploring, developing, and implementing DSM systems seem to be quite justified.


ACKNOWLEDGMENTS

This work was partly supported by the National Science Foundation of Serbia and the National Technology Foundation of Serbia. We also want to thank Vojislav Protić for his help in providing up-to-date literature, and Liviu Iftode, who kindly provided some of his most recent papers.

REFERENCES


1. M.J. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett, Boston, 1995.
2. V. Lo, "Operating Systems Enhancements for Distributed Shared Memory," Advances in Computers, Vol. 39, 1994, pp. 191-237.
3. J. Protić, M. Tomašević, and V. Milutinović, "A Survey of Distributed Shared Memory Systems," Proc. 28th Ann. Hawaii Int'l Conf. System Sciences, IEEE Computer Society Press, Los Alamitos, Calif., 1995, pp. 74-84.
4. M. Tomašević and V. Milutinović, "Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors, Part 1 (Basic Issues)," IEEE Micro, Vol. 14, No. 5, Oct. 1994, pp. 52-59.
5. M. Tomašević and V. Milutinović, "Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors, Part 2 (Advanced Issues)," IEEE Micro, Vol. 14, No. 6, Dec. 1994, pp. 61-66.
6. I. Tartalja and V. Milutinović, "A Survey of Software Solutions for Maintenance of Cache Consistency in Shared Memory Multiprocessors," Proc. 28th Ann. Hawaii Int'l Conf. System Sciences, CS Press, 1995, pp. 272-282.
7. K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM Trans. Computer Systems, Vol. 7, No. 4, Nov. 1989, pp. 321-359.
8. M. Stumm and S. Zhou, "Algorithms Implementing Distributed Shared Memory," Computer, Vol. 23, No. 5, May 1990, pp. 54-64.
9. D.L. Black, A. Gupta, and W. Weber, "Competitive Management of Distributed Shared Memory," Compcon '89, CS Press, 1989, pp. 184-190.
10. R.E. Kessler and M. Livny, "An Analysis of Distributed Shared Memory Algorithms," Proc. Ninth Int'l Conf. Distributed Computing Systems, CS Press, 1989, pp. 498-505.
11. R. Bisiani and A. Forin, "Multilanguage Parallel Programming of Heterogeneous Machines," IEEE Trans. Computers, Vol. 37, No. 8, Aug. 1988, pp. 930-945.
12. J. Protić, M. Tomašević, and V. Milutinović, "A Survey of Distributed Shared Memory: Concepts and Systems," Tech. Report ETF-TR-95-157, Dept. of Computer Engineering, Univ. of Belgrade, Belgrade, Yugoslavia, 1995.
13. J. Protić, M. Tomašević, and V. Milutinović, Tutorial on Distributed Shared Memory: Concepts and Systems, CS Press, to be published in 1996.
14. K. Gharachorloo et al., "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proc. 17th Ann. Int'l Symp. Computer Architecture, CS Press, 1990, pp. 15-26.
15. P. Keleher, A.L. Cox, and W. Zwaenepoel, "Lazy Release Consistency for Software Distributed Shared Memory," Proc. 19th Ann. Int'l Symp. Computer Architecture, CS Press, 1992, pp. 13-21.

Jelica Protić is on the faculty of the School of Electrical Engineering, University of Belgrade, where she received her BSc and MSc in computer engineering. She is currently working toward her PhD in the field of DSM. Her research interests are in computer architecture, distributed systems, and performance analysis. She can be reached at jeca@ubbg.etf.bg.ac.yu.

Milo Tomašević is on the faculty of the School of Electrical Engineering, University of Belgrade, where he received his BSc in electrical engineering and MSc and PhD in computer engineering. His research interests are computer architectures, multiprocessor systems, and distributed shared-memory systems. He can be reached at etomasev@ubbg.etf.bg.ac.yu.

Veljko Milutinović