Deep Value runs Hadoop at scale on EC2, but we find that running our own cluster is significantly cheaper
We have been using Amazon’s EC2 with Hadoop for a number of years to run simulations of various stock trading algorithms. We have found EC2 to be quite useful for spinning up large clusters of machines on short notice and for deploying Hadoop clusters in general.
The monthly bills, however, became more and more eye-popping ($70,000/month and growing), and some rough back-of-the-envelope calculations led me to believe that what we were paying for storage and compute was excessive.
The long and the short of it is that Amazon’s EC2 service costs about 3.8 times as much as running our own hardware. Of course EC2 can be provisioned on demand, but such a large multiple certainly makes having an internal cluster a key part of our ongoing Hadoop strategy. Read on for our story…
The back-of-the-envelope calculation is this: Tiger Direct will sell you a Seagate 3 terabyte drive for $154. For the same storage on S3 for 2 years, I would pay (1,000 * 0.125 + 2,000 * 0.11) * 12 months * 2 years = $8,280 at the standard rates. Buying our own drive was about 2% of the cost of using S3, so this certainly seemed worth investigating.
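The drive-versus-S3 comparison can be checked with a few lines of Python. This is only a sketch of the arithmetic in the paragraph above: the per-GB rates and the drive price are the ones quoted there, and 1 TB is treated as 1,000 GB as the formula does.

```python
# Per-GB monthly S3 rates and the drive price as quoted in the text.
FIRST_TIER_RATE = 0.125   # $/GB-month for the first 1,000 GB
NEXT_TIER_RATE = 0.11     # $/GB-month for the next 2,000 GB
DRIVE_PRICE = 154.0       # $ for one 3 TB drive
MONTHS = 24               # 2 years


def s3_cost_3tb(months=MONTHS):
    """Cost of keeping 3,000 GB on S3 for the given number of months."""
    monthly = 1_000 * FIRST_TIER_RATE + 2_000 * NEXT_TIER_RATE
    return monthly * months


if __name__ == "__main__":
    total = s3_cost_3tb()
    print(f"S3 over {MONTHS} months: ${total:,.0f}")
    print(f"Drive price as a share of that: {DRIVE_PRICE / total:.1%}")
```

The comparison ignores replication, power and hosting on the drive side, which is why the fuller analysis below is needed.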
To investigate, we deployed a Hadoop cluster at Telx in Clifton, NJ. Telx offers competitive rates and great access to the US exchanges. Power, hosting and bandwidth costs come to roughly $185 per server per month.
We purchased 20 Linux servers running CentOS from Silicon Mechanics for $7,418 per server. These are fairly powerful commodity servers sporting 2 Intel E5-2650 processors, 64GB of RAM and 8 x 3TB hard drives. The 20 machines span 2 racks with space to spare. Each rack needed a switch (~$3,000). The total cost is thus (7,418 * 20 + 3,000 * 2) = $154,360. Amortized over a 2-year period at 5% interest, this works out to roughly $7,700 per month for the hardware.
Of course, there are also hosting costs. With various minimums in place (you can’t really rent half a rack), for this analysis we will ballpark the hosting cost for these 20 servers at $5,000 per month, which is a little high but usable.
The total cost is thus $12,700 per month.
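As a quick sanity check, the cost of our own cluster can be tallied in Python. This is a sketch using only figures from the text; the $7,700 amortized hardware payment is taken as quoted rather than re-derived from the interest rate.

```python
# All inputs are figures quoted in the text.
SERVERS = 20
SERVER_PRICE = 7_418        # $ per server
RACKS = 2
SWITCH_PRICE = 3_000        # $ per rack switch (approximate)
HARDWARE_PER_MONTH = 7_700  # $ amortized payment, taken as quoted
HOSTING_PER_MONTH = 5_000   # $ ballpark hosting cost for the 20 servers


def hardware_total():
    """Up-front purchase price of servers plus switches."""
    return SERVERS * SERVER_PRICE + RACKS * SWITCH_PRICE


def monthly_total():
    """Monthly cost of the cluster: amortized hardware plus hosting."""
    return HARDWARE_PER_MONTH + HOSTING_PER_MONTH


if __name__ == "__main__":
    print(f"Hardware purchase: ${hardware_total():,}")
    print(f"Monthly cost:      ${monthly_total():,}")
```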
We also run simulations that require hundreds of machines. These simulations load trade data for past dates and then run analysis to understand how a trading strategy would have performed. The Map tasks are fairly compute intensive, with perhaps 5% of each Map task spent in I/O and 95% in calculation. Because of this we have been using c1.xlarge instances (High-CPU Extra Large) in clusters of up to 900 machines, reading from S3, to run our analysis. These runs typically take 1-2 hours.
Our unit of parallelization is a stock-day (i.e. analysing one stock for one day). An analysis for September 2012 has us running map tasks for 18,050 stock-days. Running this on a 100-machine EC2 cluster reading from S3 takes 39 minutes, or about 4.5 stock-days per minute per machine. Running it on the 20-machine Deep Value cluster takes 35 minutes, for a speed of about 26 stock-days per minute per machine.
Thus we can say that one of our dual-CPU servers is equivalent to approximately 5.75 EC2 c1.xlarge instances. Our cluster of 20 off-the-shelf machines is roughly equivalent to a 115-machine EC2 cluster of c1.xlarge instances.
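The per-machine throughput comparison can be sketched as follows. The job size, machine counts and run times are the ones quoted above; the function name is just an illustrative helper. Note that the figures in the text are rounded, so exact division gives slightly different values and the multiple should be read as approximate.

```python
def stock_days_per_machine_minute(stock_days, machines, minutes):
    """Throughput of one machine, in stock-days processed per minute."""
    return stock_days / (machines * minutes)


STOCK_DAYS = 18_050  # September 2012 analysis, as quoted in the text

ec2_rate = stock_days_per_machine_minute(STOCK_DAYS, machines=100, minutes=39)
own_rate = stock_days_per_machine_minute(STOCK_DAYS, machines=20, minutes=35)

# How many c1.xlarge instances one of our servers is worth on this job.
speedup = own_rate / ec2_rate

if __name__ == "__main__":
    print(f"EC2 c1.xlarge: {ec2_rate:.2f} stock-days/min/machine")
    print(f"Own server:    {own_rate:.2f} stock-days/min/machine")
    print(f"One server is worth about {speedup:.2f} EC2 instances")
```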
This cluster has 480 TB of raw storage (20 servers x 8 drives x 3 TB). With the Hadoop replication factor set to 3, this gives an effective storage capacity of 160 TB. On S3, these 160 TB would cost us $15,965 per month, or $383,160 over 2 years.
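The storage figures can be reproduced with a small tiered-pricing sketch. The first two per-GB rates are the ones quoted earlier in the post; the tier boundaries (1 TB, then the next 49 TB) and the third tier at $0.095 per GB are assumptions on my part, chosen because they match the monthly figure above, and 1 TB is treated as 1,000 GB.

```python
# (tier size in GB, $ per GB-month). Only the first two rates appear in
# the text; the tier sizes and the third rate are assumptions.
S3_TIERS = [
    (1_000, 0.125),
    (49_000, 0.110),
    (450_000, 0.095),
]


def usable_tb(servers, drives_per_server, tb_per_drive, replication):
    """Effective HDFS capacity after replication."""
    return servers * drives_per_server * tb_per_drive / replication


def tiered_monthly_cost(total_gb, tiers=S3_TIERS):
    """Monthly cost when each tier is billed at its own per-GB rate."""
    remaining = total_gb
    cost = 0.0
    for size_gb, rate in tiers:
        used = min(remaining, size_gb)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("usage exceeds the tiers provided")
    return cost


if __name__ == "__main__":
    capacity = usable_tb(20, 8, 3, replication=3)
    monthly = tiered_monthly_cost(capacity * 1_000)
    print(f"Usable capacity: {capacity:.0f} TB")
    print(f"S3 per month:    ${monthly:,.0f}")
    print(f"S3 over 2 years: ${monthly * 24:,.0f}")
```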
The calculated cost of running a 115-machine c1.xlarge cluster on EC2 at the on-demand rate of $0.66 per instance-hour would be $55,407 per month. Even with reserved instances, at $2,000 upfront plus $0.16 per hour per instance, it is $32,599 per month.
If we compare just on what we are getting in terms of compute and storage, our cluster is costing us $12,700 per month versus $48,564 (32,599 + 15,965) for EC2.
EC2 is thus costing us over 3.8 times as much per month.
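The final comparison can be sketched in Python as well. The hourly rates, upfront fee, instance count and monthly totals come from the text; the 730-hour month and the spreading of the upfront reservation fee over 12 months are my assumptions, chosen because they reproduce the $55,407 on-demand figure and the $48,564 combined total.

```python
INSTANCES = 115
HOURS_PER_MONTH = 730         # assumption: average hours in a month
ON_DEMAND_RATE = 0.66         # $ per instance-hour, from the text
RESERVED_UPFRONT = 2_000      # $ per instance, from the text
RESERVED_RATE = 0.16          # $ per instance-hour, from the text
RESERVED_TERM_MONTHS = 12     # assumption: upfront fee spread over 1 year
S3_PER_MONTH = 15_965         # $ for 160 TB, from the text
OWN_CLUSTER_PER_MONTH = 12_700


def on_demand_monthly():
    """Monthly compute cost at the on-demand rate."""
    return INSTANCES * ON_DEMAND_RATE * HOURS_PER_MONTH


def reserved_monthly():
    """Monthly compute cost with reserved instances."""
    per_instance = (RESERVED_UPFRONT / RESERVED_TERM_MONTHS
                    + RESERVED_RATE * HOURS_PER_MONTH)
    return INSTANCES * per_instance


def cost_ratio():
    """EC2 (reserved compute plus S3) relative to our own cluster."""
    return (reserved_monthly() + S3_PER_MONTH) / OWN_CLUSTER_PER_MONTH


if __name__ == "__main__":
    print(f"On-demand compute: ${on_demand_monthly():,.0f}/month")
    print(f"Reserved compute:  ${reserved_monthly():,.0f}/month")
    print(f"EC2 vs own cluster: {cost_ratio():.2f}x")
```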
Whichever way we slice this, either by storage cost or by compute, it seems clear that running our own data center rather than using EC2 makes sense for us. For one-off peaks EC2 makes sense, but given the ongoing nature of our simulated analysis, moving to our own data center is a very clear winner.