Value of the Cloud – CPU Performance

Abstract

This post compares CPU performance and value for 18 compute instance types
from 5 cloud compute platforms – AWS EC2,
Google Compute Engine,
Windows Azure,
HP Cloud and
Rackspace Cloud. The most interesting content is the data and resulting
analysis; if you’re in a rush, scroll down to go straight to it.


Overview

In the escalating cloud arms race, performance is a frequent topic of
conversation. Often, overly simplistic test models and fuzzy logic are used to
substantiate sweeping claims. In a general sense, computing performance is
relative to, and dependent on, workload type. There is no single metric or
measurement that encapsulates performance as a whole.

In the context of cloud, performance is also subject to variability due to
nondeterministic factors such as multitenancy and hardware abstraction. These
factors combined increase the complexity of cloud performance analysis because
they reduce one’s ability to dependably
repeat and reproduce
such analysis. This is not to say that cloud performance cannot be measured,
rather that doing so is not a precise science, and differs somewhat from
traditional hardware performance analysis where such factors are not present.

Performance is workload dependent. Cloud performance is hard to measure
consistently because of variability from multitenancy and hardware
abstraction.

Motivation

My goal in starting CloudHarmony in 2010 was to provide a credible source for
objective and reliable performance analysis about cloud services. Since then,
cloud has grown extensively and become an even more confusing place. The intent
of this post is to present techniques and a visual tool we’re using to help
assess and compare performance and value of cloud services. The focus of
this post is cloud compute CPU performance and value. In the coming weeks,
follow up posts will be published covering other performance topics including
block storage, network, and object storage. As is our general policy, we have
not been paid or otherwise influenced in the testing or analysis presented in
this post.

The focus of this post is compute CPU performance and value. Follow up posts
will cover other performance topics. We were not paid to write this post.

Testing Methods

To test performance of compute services we run a suite of about 100 benchmarks
on each type of compute instance offered. These benchmarks measure various
performance properties including CPU, memory and disk IO. Each test
iteration takes 1 to 2 days to complete. When multiple configuration
options are offered, we usually run additional test iterations for each such
option (e.g. compute services often offer multiple block storage options).
Linux CentOS 6.* is our operating system of choice because of its nearly
ubiquitous availability across services.

CPU Performance

Although our test suite includes many CPU benchmarks, our preferred method for
compute CPU performance analysis is based on metrics provided by the
CPU2006 benchmark suites. CPU2006 is
an industry standard benchmark created by the
Open Systems Group of the non-profit
Standard Performance Evaluation Corporation (SPEC). CPU2006 consists of
2 benchmark suites that measure Integer and Floating Point CPU performance. The
Integer suite contains 12 benchmarks, and Floating Point 17. According to the
CPU2006 website “SPEC designed
CPU2006 to provide a comparative measure of compute-intensive performance
across the widest practical range of hardware using workloads developed from
real user applications.”
Thorough documentation about CPU2006, including descriptions of the individual
benchmarks, is available on the CPU2006 website.
CloudHarmony is a SPEC CPU2006 licensee.

The results table below contains CPU2006 SPECint (Integer) and
SPECfp (Floating Point) metrics for each compute instance type
included in this post. Each score is linked to a PDF report generated by the
CPU2006 runtime for that specific test run.
CPU2006 run and reporting rules
require disclosure of settings and parameters used when compiling and running
the CPU2006 test suites and this data is included in the reports. To
summarize, our runs are based on the following settings:

Compiler: Intel C++ and Fortran Compilers version 12.1.5
Compilation Guidelines: Base
Run Type: Rate
Rate Copies: 1 copy per CPU core or per 1GB memory, lesser of the two (see the sketch below)
SSE Compiler Option: SSE4.2 or SSE4.1 (if supported by the compute instance)
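
As a simple illustration of the rate copy rule, the following sketch computes the copy count from an instance's resources (the helper is hypothetical, not part of the CPU2006 tooling):

```python
def cpu2006_rate_copies(cpu_cores, memory_gb):
    """One CPU2006 rate copy per CPU core or per 1GB of memory,
    whichever yields fewer copies."""
    return min(cpu_cores, int(memory_gb))

# Example: 8 cores with 7GB of memory runs 7 copies; 4 cores with 15GB runs 4
print(cpu2006_rate_copies(8, 7))   # 7
print(cpu2006_rate_copies(4, 15))  # 4
```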

Our preferred method for compute CPU performance analysis is based on metrics
provided by the SPEC CPU2006
benchmark suites

CPU2006 Test Results

To be considered official, CPU2006 results must adhere to specific run and
reporting guidelines. One such guideline states that results should be
reproducible.
While this is important in the context of hardware testing, it is impractical
for cloud due to performance variability resulting from multitenancy and
hardware abstraction. However, CPU2006 guidelines allow for reporting of
estimated
results in cases where not all guidelines can be adhered to. In such cases
results must be clearly designated as estimates. It is for this reason
that results in the table below are designated as such.

| Compute Service | Instance Type | CPU Type | Cores | Price² | SPECint¹ | SPECfp¹ |
|-----------------|---------------|----------|-------|--------|----------|---------|
| AWS EC2 | cc2.8xlarge | Intel E5-2670 2.60GHz | 32 | 2.40 | 441.511194 | 357.602046 |
| HP Cloud | double-extra-large | Intel T7700 2.40GHz | 8 | 1.12 | 168.55417 | 132.3234 |
| AWS EC2 | m3.2xlarge | Intel E5-2670 2.60GHz | 8 | 1.00 | 150.30509 | 128.159625 |
| Google Compute | n1-standard-8 | Intel 2.60GHz | 8 | 1.06 | 149.354133 | 143.1015 |
| HP Cloud | extra-large | Intel T7700 2.40GHz | 4 | 0.56 | 98.430955 | 85.24574 |
| Rackspace Cloud | 30gb | AMD Opteron 4170 | 8 | 1.00 | 95.43979 | 83.89602 |
| Windows Azure | A4 | AMD Opteron 4171 | 8 | 0.48 | 91.33294 | 77.93744 |
| AWS EC2 | m3.xlarge | Intel E5-2670 2.60GHz | 4 | 0.50 | 80.180578 | 71.753345 |
| Google Compute | n1-standard-4 | Intel 2.60GHz | 4 | 0.53 | 66.945866 | 66.84303 |
| Rackspace Cloud | 8gb | AMD Opteron 4170 | 4 | 0.32 | 51.709779 | 47.562079 |
| Windows Azure | A3 | AMD Opteron 4171 | 4 | 0.24 | 51.58953 | 46.9475 |
| HP Cloud | medium | Intel T7700 2.40GHz | 2 | 0.14 | 48.825275 | 44.085027 |
| Google Compute | n1-standard-2 | Intel 2.60GHz | 2 | 0.265 | 39.469478 | 39.094813 |
| AWS EC2 | m1.large | Intel E5645 2.40GHz | 2 | 0.24 | 39.023586 | 34.7884 |
| AWS EC2 | m1.large | Intel E5-2650 2.00GHz | 2 | 0.24 | 38.816635 | 37.10992 |
| AWS EC2 | m1.large | Intel E5430 2.66GHz | 2 | 0.24 | 29.534628 | 23.805172 |
| Windows Azure | A2 | AMD Opteron 4171 | 2 | 0.18 | 27.38071 | 25.92939 |
| Rackspace Cloud | 4gb | AMD Opteron 4170 | 2 | 0.16 | 25.854861 | 24.25972 |

¹ Base/Rate – Estimate
² Hourly, USD – On Demand

Simplifying the Results

In order to provide simple and concise analysis derived from multiple relevant
performance properties, it is helpful to reduce metrics from multiple
related benchmarks to a single comparable value. The CPU2006 benchmark suites
produce two metrics, SPECint for Integer, and SPECfp for
Floating Point performance. A naive approach might be to combine them using a
mean or sum of their values. However, doing so would be inaccurate because they
are dissimilar values. Although they are calculated using the same algorithms,
SPECint and SPECfp are produced from different benchmarks and thus measure
different things – as the idiom goes, this would be an apples to oranges
comparison. A more familiar analogy might be attempting to average 1 gallon of
milk with 2 dozen eggs – the resulting value:

$$(1+2)/2=1.5$$

is meaningless because they are dissimilar values to begin with.

To merge dissimilar values like metrics from different benchmarks, the values
must first be normalized to a common notional scale. One method for doing so is
ratio conversion using factors from a common scale. The resulting ratios
represent relationships between the original metrics and the common scale.
Because the values share the same scale, they may then be operated on together
using mathematical functions like mean and median. Using the same milk and eggs
analogy, and assuming a common scale of groceries needed for the week,
defined as 2 gallons of milk and 3 dozen eggs, grocery deficiency ratios may
then be calculated as follows:

$$text"Milk deficiency" = text"2 gallons needed" / text"1 gallon on hand" = 2$$
$$text"Eggs deficiency" = text"3 dozen needed" / text"2 dozen on hand" = 1.5$$

The resulting ratios, 2 and 1.5, may then be reduced to a single ratio
representing the average grocery deficiency for both milk and eggs:

$$text"Average grocery deficiency" = (2+1.5)/2 = 1.75$$

In other words, in order to stock up on groceries for the week, we’ll need to
buy 1.75 times the milk and eggs currently on hand. Take note, however, that
this ratio is only relevant in the context of milk and eggs as a whole, not
separately, nor does it apply to other types of groceries.
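
To make the normalization concrete, here is a minimal Python sketch of the ratio conversion (the function name is ours for illustration, not part of any benchmarking tool):

```python
def mean_of_ratios(values, scale):
    """Normalize each value against its counterpart on the common scale,
    then average the resulting ratios."""
    ratios = [v / s for v, s in zip(values, scale)]
    return sum(ratios) / len(ratios)

# Grocery deficiency: 2 gallons of milk and 3 dozen eggs needed,
# versus 1 gallon and 2 dozen on hand
print(mean_of_ratios([2, 3], [1, 2]))  # 1.75
```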

The benefit of reducing dissimilar benchmark values to a single representative
metric is to simplify the expression and comparison of related performance
properties. It allows us to present cloud performance more generally, and at a
level more fitting to the interests and time of cloud users. As much as we’d
like users to become well versed in the intricacies of benchmarking and
performance analysis, this is simply not feasible for most, and is a primary
reason for our existence. Our goal is to provide users with a simple starting
point to help narrow the scope from hundreds of possible cloud services.

In order to more generally and simply present cloud performance information we
generate a single value derived from multiple related benchmarks

CPU Performance Metric

The CPU performance metric displayed in the graph below was calculated using
both SPECint and SPECfp metrics and the common scale ratio
normalization technique described above. The common scale was the mid 80th
percentile mean of all CloudHarmony SPECint and SPECfp test
results from the prior year. These results included many different compute
services and compute instance types, not just those included in this post.
This calculation results in the following common normalization factors:

SPECint Factor: 64.056
SPECfp Factor: 55.995

To shorten resulting long decimal values, ratios were multiplied by 100.
The meaning of the metric can thus be interpreted as CPU performance relative
to the mean of compute instances from many different cloud services. A value
of 100 represents performance comparable to the mean, 200 twice the mean, and
50 1/2 of the mean. For example, the HP double-extra-large compute
instance produced scores of
168.55417
for SPECint, and
132.3234
for SPECfp. The resulting CPU performance metric of 249.72
was then calculated using the following formula:

$$text”CPU Performance” = (((168.55417/64.056) + (132.3234/55.995))/2)*100 → (4.99448532/2)*100 → 249.724266$$

The value 249.72 signifies this instance type performed at about 2.5 times
the mean.
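
For clarity, here is a minimal Python sketch of this calculation (the constant and function names are ours for illustration):

```python
SPECINT_FACTOR = 64.056  # mid 80th percentile mean of prior-year SPECint results
SPECFP_FACTOR = 55.995   # mid 80th percentile mean of prior-year SPECfp results

def cpu_performance_metric(specint, specfp):
    """Average the SPECint and SPECfp ratios against the common scale and
    multiply by 100, so that 100 represents performance comparable to the mean."""
    return ((specint / SPECINT_FACTOR) + (specfp / SPECFP_FACTOR)) / 2 * 100

# HP Cloud double-extra-large: SPECint 168.55417, SPECfp 132.3234
print(round(cpu_performance_metric(168.55417, 132.3234), 2))  # ~249.72
```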

The CPU performance metric used below represents SPECint and SPECfp scores
relative to compute instances from many cloud services. A higher value is
better

Value Calculation

Cloud compute pricing is usually tied to CPU and memory allocation, with larger
instance types offering more (or faster) CPU cores and memory. The CPU2006
benchmark suites are designed to take advantage of multicore systems when
compiled and run correctly. Given the same hardware type, our test results
generally show a near linear correlation between CPU allocation and CPU2006
scores. Because of these factors, the CPU performance metric derived from
CPU2006 is well-suited for estimating value of compute instance types. To do
so, we calculate value by dividing the metric by the hourly USD instance cost.
For example, the HP extra-large compute instance costs 0.56 USD per
hour and has a performance metric of 152.96. The resulting value metric 273.14
is calculated using the following formula:

$$text”Fixed Value” = 152.96/0.56 → 273.142857$$
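
Expressed as a sketch (hypothetical function name, same arithmetic as the formula above):

```python
def fixed_value(cpu_performance, hourly_price_usd):
    """Fixed value: the CPU performance metric divided by the hourly cost in USD."""
    return cpu_performance / hourly_price_usd

# HP Cloud extra-large: performance metric 152.96 at 0.56 USD/hr
print(round(fixed_value(152.96, 0.56), 2))  # ~273.14
```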

Tiered Value

The graph below allows selection of either Tiered or Fixed
Value options. Tiered Value is Fixed Value with an adjustment applied to
instances ranked in the top or bottom 20 percent. The table below lists the
exact adjustments used. The concept behind tiered values is based loosely on
CPU pricing models where the top end processors generally command premium per
GHz pricing, while the low end is often discounted. The
HP double-extra-large compute instance costs 1.12 USD per hour and has a
performance metric of 249.72. It is also ranked in the 91st percentile which
receives a +10% value adjustment. The resulting tiered value metric 245.256 is
calculated using the following formula:

$$text”Tiered Value” = (249.72/1.12)*1.1 → 222.96*1.1 → 245.256$$

Tiered Value Ranking Adjustments

| Ranking Percentile | Value Adjustment |
|--------------------|------------------|
| Top 5% | +20% |
| Top 10% | +10% |
| Top 20% | +5% |
| Mid 60% | None |
| Bottom 20% | -5% |
| Bottom 10% | -10% |
| Bottom 5% | -20% |
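
Applying these adjustments might look like the following sketch (we assume the most extreme applicable tier wins; the function names are ours for illustration):

```python
def tier_adjustment(percentile):
    """Value multiplier for a performance-rank percentile (0-100, higher is better)."""
    if percentile >= 95: return 1.20
    if percentile >= 90: return 1.10
    if percentile >= 80: return 1.05
    if percentile <= 5:  return 0.80
    if percentile <= 10: return 0.90
    if percentile <= 20: return 0.95
    return 1.00

def tiered_value(cpu_performance, hourly_price_usd, percentile):
    """Tiered value: fixed value with the percentile-based adjustment applied."""
    return cpu_performance / hourly_price_usd * tier_adjustment(percentile)

# HP Cloud double-extra-large: metric 249.72 at 1.12 USD/hr, ranked in the 91st percentile
print(round(tiered_value(249.72, 1.12, 91), 2))  # ~245.26
```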

Cloud compute pricing is usually tied to CPU and memory allocation. Value
metrics in the graph below are derived by dividing CPU performance by the
hourly cost

Price Normalization

Most cloud providers, including all those covered in this post, offer on demand
hourly pricing for compute instances. In addition, some providers offer commit
based pricing and volume discounts. AWS EC2, for example, offers six
reserve/commit based pricing tiers with 1 and 3 year terms. These pricing
tiers exchange lower hourly
rates for a setup fee paid in advance, and in the case of heavy reserve,
commitment to run the compute instance 24x7x365 for the duration of the
term (light and medium reserve tiers do not have this requirement). In order to
represent these pricing tiers in the graph below, the total cost was normalized
to an hourly rate by amortizing the setup fee into the hourly rate. For example,
the m3.xlarge instance type is offered under a 1 year heavy reserve
tier for a 1489 USD setup fee and 0.123 USD per hour. For this instance type and
pricing model, the hourly rate used in the graph and for value metrics was 0.293/hr
calculated using the following formula:

$$text”Normalized Hourly Rate” = ((1489/365)/24) + 0.123 → 0.17 + 0.123 → 0.293$$
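
As a sketch (hypothetical helper; it assumes 24x7 usage over the full term, as heavy reserve requires):

```python
HOURS_PER_YEAR = 365 * 24

def normalized_hourly_rate(setup_fee_usd, hourly_rate_usd, term_years=1):
    """Amortize a reserved-pricing setup fee over the term and add it to the
    discounted hourly rate."""
    return setup_fee_usd / (term_years * HOURS_PER_YEAR) + hourly_rate_usd

# EC2 m3.xlarge, 1 year heavy reserve: 1489 USD setup fee + 0.123 USD/hr
print(round(normalized_hourly_rate(1489, 0.123), 3))  # ~0.293
```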

AWS EC2 is also available under a bid based pricing model called Spot
pricing. Although spot pricing is typically priced substantially below standard
rates, it is highly volatile and subject to transient spikes that may result in
unexpected termination of instances without notice. Due to this, spot pricing
is generally not recommended for long term usage. The spot pricing included in
the graph below is based on a snapshot taken in early June 2013 and may not
represent current rates.

Volume discounts and membership based pricing, such as Windows Azure MSDN
pricing, were not included in the graph and value analysis because they are not
as straightforward, and they often require substantial monthly spend commitments
at which point users would likely be able to negotiate similar discounts with
any vendor.

The graph provides a drop down list allowing selection of different pricing
models. When changed, the graph and table below will automatically update.

The AWS EC2 reserve hourly pricing in the graph below is based on a normalized
hourly value calculated by amortizing the setup fee into the hourly rate

Visualizing Value & Performance

On our current website and in prior posts we’ve often used traditional bar
charts to represent data visually. While this is a typical approach to
presenting comparative analysis, it often resulted in lengthy displays and
did not lend itself well to large multivariate data sets. In the search for a more
efficient and intuitive way to visualize such data, we discovered the
D3 visualization library, which provides
excellent tools and examples for creating data visualizations. It is based
on this that we designed the graph below. The goal of this graph is to
present large multivariate data sets in a concise, intuitive and
interactive format. In a relatively small space, this graph allows users to
observe many different characteristics of cloud services including:

Performance
The size or diameter of the circle represents proportional CPU
performance of each compute instance. A larger circle represents more
performant systems.
Price & Value
The fill color of each circle represents either the value or the price of
each compute instance (defaults to value). Users can toggle between
price, fixed value and tiered value fill options. Blue represents better
value/lower price, while red represents lower value/higher price. A grey
color is used for the midrange.
Vertical Scalability
Not all workloads lend themselves well to horizontal scaling models (where load
is spread across many compute nodes). Legacy database servers, for example, often do
not (easily) support multi-node clusters. By observing variation in
circle sizes from small to large, users may better understand the
vertical scaling range and limits of each cloud service.
Instance Type Variability
Results are grouped by instance type and CPU architecture. In the case of
EC2, this allowed display of multiple records for a single instance type.
The m1.large, for example, deployed to 3 different host types
during our testing, each of which demonstrated slightly different
performance characteristics.
Multiple Pricing Models
Users may view pricing and value based on different service pricing
models. In the case of EC2, this allows toggling between on demand,
reserve and spot pricing. Results in the graph and details table are
updated instantly when the pricing model selection is changed.

Below the graph a sortable table displays details for each service and
compute instance displayed in the graph. This table updates dynamically
when fill color or pricing model selections are changed. Details for
specific compute instances can also be viewed by hovering over a circle.
In addition, users may zoom into a particular service by clicking on the
container for that service. The graph can also be displayed in a larger
popup view by clicking on the blue zoom icon displayed in the upper right
corner when hovering over it.

The interactive graph below displays multiple characteristics of compute
services and instance types including performance, price, value and
vertical scalability. EC2 price and value can be toggled between on
demand and reserve pricing tiers

[Interactive diagram: Compute CPU Performance & Value. Circle diameter represents CPU performance (larger is better); fill color represents price or value (blue is lower price/better value, red is higher price/lower value). Fill metric and pricing model options update the graph and details table.]

[Interactive diagram: Results Summary. Each service and compute instance type is represented as a slice of a circle, with color shaded segments representing the result of each benchmark (blue is better, red is worse). Results may be grouped by service.]

Comments and Observations

As is our general policy, we don’t recommend any one service over another.
However, we’d like to point out some observations about each compute service
included in this post.

AWS EC2

  • On demand pricing provides similar value to other compute services.
    However, EC2 value increases substantially for reserve pricing models
  • EC2 provides a broad performance range, topping out in this post with the
    16 core (32 threads with Hyper-Threading) cc2.8xlarge instance type
  • CPU architecture varies between instance types, with higher end types
    generally running on newer and faster hardware
  • Older instance types like m1.large may deploy to different
    hardware platforms, and thus demonstrate variable performance. For
    example, there was a notable difference in performance between
    Intel E5430 and Intel E5-2650 based m1.large
    instances
  • The cc2.8xlarge provides good value for multithreaded workloads
    with high CPU demand

Google Compute Engine (GCE)

  • Performance increased near linearly from small to large instance types
  • The n1-standard-4 performed roughly 10% slower than we expected
    (112 actual CPU performance versus 120-125 expected)
  • The GCE hypervisor does not pass through full CPU identifiers – but in
    GCE documentation
    Google has stated processors are based on the Intel Sandy Bridge
    (E5-2670) platform
  • n1-standard-4 and n1-standard-8 instance types
    performed very similarly to the comparable EC2 instance types
    m3.xlarge and m3.2xlarge. All are based on the same
    Intel Sandy Bridge platform, and on demand pricing is also nearly the
    same (GCE is just a few cents higher)

Windows Azure

  • The A3 and particularly A4 instance types are priced notably lower than
    instance types from other services with comparable CPU cores. This factor
    contributed to the higher value rankings associated with those instance
    types despite their generally lower performance
  • Vertical scalability is limited, with the largest A4 VM (in terms
    of CPU cores) having the lowest performance ranking of all 8 core
    instance types – however, at half the cost, the value is still good.
    Exclusive use of AMD 4171 2.1GHz processors (released in 2010)
    is also a limiting factor. The forthcoming release of Intel Sandy Bridge
    based Azure Big Compute instance types may address this deficiency

HP Cloud

  • HP compute instances provided marginally higher performance rankings than
    other services in each of the 2, 4 and 8 core instance type groups
  • For on demand pricing, the medium instance type provided the
    highest value ranking in the graph
  • Performance increased 2X from medium (2 core) to
    extra-large (4 core) instance types, but the price difference is
    4X. The 4 core large instance type between them was not tested

Rackspace Cloud

  • Rackspace and Windows Azure performed nearly the same. Both are based on
    the AMD 4100 processor platform. However, Azure value is much
    higher for the 8 core A4 instance type (versus the Rackspace 8
    core 30GB) because the cost is less than half (0.48/hr versus
    1.00/hr – 14GB memory Azure versus 30GB Rackspace). The same
    applied to a lesser extent for the 2 and 4 core instance types (Azure
    A2/3.5GB and A3/7GB versus Rackspace 4GB and 8GB)
  • The 30GB compute instance had the lowest value of all instance
    types included in this post
  • Like Windows Azure, vertical scalability may be limited due to
    observed exclusive use of AMD 4170 2.1GHz processors (released
    in 2010). Rackspace does offer an upgrade path through its dedicated
    hosting offerings, however.

Next Up – Storage IO

CPU and storage IO are generally the two most important performance
characteristics for compute services. Depending on workload, one might be more
important than the other. Compute services often offer multiple storage
options. Many storage options are networked and thus subject to higher
variability than CPU and memory. Many workloads are sensitive to IO variations
and may perform poorly in such environments. In the next post, we’ll present
IO performance and consistency analysis for the same providers covered in this
post. Storage options covered will include:

AWS EC2: Ephemeral, EBS, EBS Provisioned IOPS, EBS Optimized
Google Compute Engine: Local/Scratch, Persistent Storage
HP Cloud: Local, Block/External Storage
Azure: Local Replicated, Geo Replicated
Rackspace: Local, SATA and SSD Block/External Storage

Following storage IO, we will also release posts covering network performance
(inter-region, intra-region and external) and object storage IO.

Originally posted: CloudHarmony
