The complexity of a globally distributed delivery network brings about a unique set of challenges in architecture, operation, and management—particularly in an environment as heterogeneous and unpredictable as the Internet. For example, network management and data collection needs to be scalable and fast across thousands of server clusters, many of which are located in unmanned, third-party data centers, and any number of which might be offline or experiencing bad connectivity at any given time. Configuration changes and software updates need to be rolled out across the network in a safe, quick, and consistent manner, without disrupting service. Enterprises also must be able to maintain visibility and fine-grained control over their content across the distributed platform.

To guide our design choices, we begin with the assumption that a significant number of failures (whether at the machine, rack, cluster, connectivity, or network level) is expected to be occurring at all times in the network. Indeed, while not standard in system design, this assumption seems natural in the context of the Internet. We have seen many reasons that Internet failures can occur in Section 3, and have observed this to be true empirically within our own network.

What this means is that we have designed our delivery networks with the philosophy that failures are normal and the delivery network must operate seamlessly despite them. Much effort is invested in designing recovery from all types of faults, including multiple concurrent faults.

This philosophy guides every level of design decision—down to the choice of which types of servers to buy: the use of robust commodity servers makes more sense in this context than more expensive servers with significant hardware redundancy.
While it is still important to be able to immediately identify failing hardware (e.g., via ECC memory and disk integrity checks that enable servers to automatically take themselves out of service), there are diminishing returns from building redundancy into hardware (e.g., dual power supplies) rather than software. Deeper implications of this philosophy are discussed at length in [1].

We now mention a few key principles that pervade our platform system design:

Design for reliability. Because of the nature of our business, the goal is to attain extremely close to 100% end-to-end availability. This requires significant effort given our fundamental assumption that components will fail frequently and in unpredictable ways. We must ensure full redundancy of components (no single points of failure), build in multiple levels of fault tolerance, and use protocols such as PAXOS [26] and decentralized leader election to accommodate the possibility of failed system components.

Design for scalability. With more than 60,000 machines (and growing) across the globe, all platform components must be highly scalable. At a basic level, scaling means
handling more traffic, content, and customers. This also
translates into handling increasingly large volumes of
resulting data that must be collected and analyzed, as well as
building communications, control, and mapping systems that
must support an ever-increasing number of distributed
machines.
Limit the necessity for human management. To a very
large extent, we design the system to be autonomic. This is a
corollary to the philosophy that failures are commonplace
and that the system must be designed to operate in spite of
them. Moreover, it is necessary in order to scale, else the
human operational expense becomes too high. As such, the
system must be able to respond to faults, handle shifts in load
and capacity, self-tune for performance, and safely deploy
software and configuration updates with minimal human
intervention. (To manage its 60,000-plus machines, the
Akamai network operation centers currently employ around
60 people, distributed to work 24x7x365.)
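The autonomic behavior described above can be illustrated with a minimal sketch of a server's self-check loop: a machine that detects its own hardware or integrity faults withdraws itself from service rather than waiting for an operator. The check names and the ServerState class below are illustrative assumptions, not a description of Akamai's actual software.

```python
# Hypothetical sketch of autonomic self-monitoring: each probe is a
# zero-argument health check, and any failure takes the server out of
# service so that no further traffic is directed to it.

class ServerState:
    def __init__(self, checks):
        self.checks = checks      # name -> zero-arg health probe
        self.in_service = True

    def run_self_checks(self):
        """Run all probes; on any failure, mark this server as out of
        service. Returns the names of the failed checks."""
        failures = [name for name, probe in self.checks.items()
                    if not probe()]
        if failures:
            self.in_service = False
        return failures

# Example: a disk-integrity probe fails, so the server withdraws itself.
server = ServerState({
    "ecc_memory_ok": lambda: True,
    "disk_integrity_ok": lambda: False,
})
print(server.run_self_checks())  # -> ['disk_integrity_ok']
print(server.in_service)         # -> False
```

In a real deployment such checks would run continuously and feed a global view of capacity, so that load can be shifted away from a withdrawn machine automatically.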
Design for performance. There is continual work being
done to improve the performance of the system's critical
paths, not only from the perspective of improving end user
response times but for many different metrics across the
platform, such as cache hit rates and network resource
utilization. An added benefit to some of this work is energy
efficiency; for example, kernel and other software
optimizations enable greater capacity and more traffic served
with fewer machines.
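The reliability principle above mentions decentralized leader election as a way to tolerate failed components. As a rough sketch of the idea (not Akamai's actual protocol; the peer IDs, the liveness probe, and the "highest live ID wins" rule are illustrative assumptions):

```python
# Hypothetical sketch of decentralized leader election: every node
# applies the same deterministic rule to its view of which peers are
# alive, so nodes sharing that view agree on a leader without any
# central coordinator. If the leader fails, the next round simply
# skips it.

def elect_leader(peer_ids, is_alive):
    """Return the highest-ID peer that currently responds as alive."""
    live = [p for p in peer_ids if is_alive(p)]
    if not live:
        raise RuntimeError("no live peers; cannot elect a leader")
    return max(live)

# Example: node 7 has failed, so leadership falls to node 5.
peers = [1, 3, 5, 7]
failed = {7}
print(elect_leader(peers, lambda p: p not in failed))  # -> 5
```

Production systems must additionally handle the case where nodes disagree about liveness, which is precisely what consensus protocols such as PAXOS [26] address.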
We will explore these principles further as we examine each of
the Akamai delivery networks in greater detail in the next
sections. In Section 5 and Section 6 we outline specific challenges
and solutions in the design of content, streaming media, and
application delivery networks, and look at the characteristics of
the transport systems, which differ for each of the delivery
networks. In Section 7, we provide details on the generic system
components that are shared among the Akamai delivery networks,
such as the edge server platform, the mapping system, the
communications and control system, and the data collection and
analysis system.