Deep Value runs Hadoop at scale on EC2, but we find that running our own cluster is significantly cheaper
We have been using Amazon's EC2 service with Hadoop for a number of years to run simulations of various stock trading algorithms. We have found EC2 to be quite useful for spinning up large clusters of machines on short notice and generally deploying Hadoop clusters.
The monthly bills, however, became more and more eye-popping ($70,000/month and growing), and some rough back-of-the-envelope calculations led me to believe that what we were paying for storage and compute was excessive.
The long and the short of it is that Amazon's EC2 service costs roughly 3.8 times as much as running our own hardware. Of course EC2 can be provisioned on demand, but such a large multiple certainly makes having an internal cluster a key part of our ongoing Hadoop strategy. Read on for our story…
The back of the envelope
The back-of-the-envelope calculation is this: Tiger Direct will sell you a Seagate 3 terabyte drive for $154. For the same storage on S3 for 2 years, I would pay (1,000 × $0.125 + 2,000 × $0.11) × 12 months × 2 years[1] = $8,280 at the standard rates. Buying our own drive was roughly 2% of the cost of using S3, so this certainly seemed worth investigating.
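The arithmetic behind this comparison can be sketched in a few lines (using the 2012 S3 rates quoted above; the tier split follows the formula in footnote 1):

```python
# Back-of-envelope: 3 TB on S3 for 2 years vs. buying a 3 TB drive outright.
# Rates from the text: $0.125/GB-month for the first 1 TB, $0.11/GB-month for the next 2 TB.
drive_cost = 154  # Seagate 3 TB drive from Tiger Direct

s3_monthly = 1_000 * 0.125 + 2_000 * 0.11  # $/month for 3 TB on S3
s3_two_years = s3_monthly * 12 * 2

print(f"S3 monthly:   ${s3_monthly:,.0f}")                 # $345
print(f"S3 2 years:   ${s3_two_years:,.0f}")               # $8,280
print(f"Drive vs S3:  {drive_cost / s3_two_years:.1%}")    # ~1.9%, i.e. about 2%
```

The drive price here ignores servers, power and replication, of course; that is what the fuller comparison below accounts for.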
Deep Value's Cluster
To do the investigation we deployed a Hadoop cluster at Telx in Clifton, NJ. Telx offers competitive rates and great access to the US exchanges. Power, hosting and bandwidth come to roughly $185 per server per month.
We purchased 20 Linux servers running CentOS for $7,418 per server from Silicon Mechanics – these are fairly powerful commodity servers sporting 2 Intel E5-2650 processors, 64GB of RAM and 8 × 3TB hard drives. These 20 machines spanned 2 racks with space to spare. Each rack needed a switch (~$3,000). The total cost is thus (7,418 × 20 + 3,000 × 2) = $154,360. Amortized over a 2-year period at 5% interest, this works out to roughly $7,700 per month for the hardware.
Of course there are hosting costs. With various minimums in place (you can't really rent half a rack), for this analysis we will ballpark the hosting cost for these 20 servers at $5,000 per month – a little high, but usable.
The total cost is thus $12,700 per month.
EC2 Server Comparison
We also run simulations that require hundreds of machines. These simulations load trade data from past dates and then run analysis to understand how the trading strategy would have performed. The Map tasks are fairly compute-intensive, with perhaps 5% of a Map task spent in I/O and 95% in calculation. Because of this we have been using c1.xlarge instances (High-CPU Extra Large) on clusters of up to 900 machines reading from S3 to run our analysis. These typically take 1–2 hours to run.
Our unit of parallelization is a stock-day (i.e. analysing one stock for one day). An analysis for September 2012 has us running map tasks for 18,050 stock-days. Running this on a 100-machine EC2 cluster reading from S3 takes 39 minutes, or about 4.5 stock-days per minute per machine. Running it on the 20-machine Deep Value cluster takes 35 minutes, for a speed of 26 stock-days per minute per machine.
Thus we can say that one of our dual-CPU servers is equivalent to approximately 5.75 EC2 c1.xlarge instances. Our cluster of 20 off-the-shelf machines is roughly equivalent to a 115-machine high-performance EC2 cluster.
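The throughput comparison can be reproduced directly from the run times quoted above (the text rounds the per-machine rates slightly before taking the ratio):

```python
# Stock-days per minute per machine for the September 2012 analysis run.
stock_days = 18_050

ec2_rate = stock_days / (39 * 100)   # 100-machine EC2 cluster, 39 minutes
own_rate = stock_days / (35 * 20)    # 20-machine Deep Value cluster, 35 minutes

print(round(ec2_rate, 1))            # 4.6 stock-days/min/machine on EC2
print(round(own_rate, 1))            # 25.8 stock-days/min/machine on our cluster
print(round(own_rate / ec2_rate, 2)) # 5.57 c1.xlarge instances per server
```

Note that taking the exact ratio of run rates gives roughly 5.6 EC2 instances per server; the article's 5.75 (and hence the 115-machine equivalence) comes from rounding the per-machine rates first.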
Amazon EC2 Costs
This cluster has 480 TB of storage. If we set the replication factor in Hadoop to 3, this means an effective storage capacity of 160 TB. On S3, these 160 TB would cost us $15,965[2] per month, or $383,160 over 2 years.
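Footnote 2's tiered calculation can be written out explicitly (tier sizes per the 2012 S3 price list the footnote references):

```python
# Monthly S3 cost for 160 TB at tiered 2012 rates:
# first 1 TB at $0.125/GB, next 49 TB at $0.11/GB, next 110 TB at $0.095/GB.
GB_PER_TB = 1_000  # the article's footnote uses decimal terabytes

monthly = (1 * 0.125 + 49 * 0.11 + 110 * 0.095) * GB_PER_TB
two_years = monthly * 12 * 2

print(f"${monthly:,.0f}/month")            # $15,965
print(f"${two_years:,.0f} over 2 years")   # $383,160
```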
A c1.xlarge instance costs $0.66 per hour on demand, so running a 115-machine cluster on EC2 would cost $55,407 per month with basic instances. Even with reserved instances ($2,000 upfront plus $0.16 per hour each), it is $32,599[3] per month.
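Both EC2 figures follow from the per-instance rates quoted above and in footnote 3:

```python
# Monthly cost of a 115-machine c1.xlarge cluster at 2012 EC2 prices.
machines = 115
hours_per_month = 24 * 365 / 12  # 730 hours

on_demand = machines * 0.66 * hours_per_month
# 1-year heavy-utilization reserved: $2,000 upfront + $0.16/hr, spread over 12 months
reserved = machines * (2_000 + 0.16 * 24 * 365) / 12

print(f"On demand: ${on_demand:,.0f}/month")  # $55,407
print(f"Reserved:  ${reserved:,.0f}/month")   # $32,599
```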
3.8x more expensive
If we compare just what we are getting in terms of compute and storage, our cluster is costing us $12,700 per month versus $48,564 ($32,599 + $15,965) for EC2.
EC2 is thus costing us over 3.8 times as much per month.
Whichever way we slice this, either by storage cost or by compute, it seems clear that using our own data center rather than EC2 makes sense for us. For one-off peaks EC2 makes sense, but given the ongoing nature of our simulation analysis, moving to our own data center is a very clear winner.
By Paul Haefele, Managing Director – Technology
1. See storage pricing at http://aws.amazon.com/s3/pricing/.
2. ((1 × 0.125 + 49 × 0.11 + 110 × 0.095) × 1,000 × 12 × 2), per http://aws.amazon.com/s3/pricing/.
3. 115 machines × ($2,000 upfront + $0.16 × 24 × 365) / 12, per http://aws.amazon.com/ec2/pricing/ for 1-year heavy-utilization reserved instances.
Deep Value began from its start as a distributed organization. From the very beginning the 3 principals were in 3 separate cities – New York, Toronto and Chennai. Today, the primary development, research and operations center is in Chennai, with Toronto and Chicago serving as management, sales and support centers.
To make this work, process and distributed tools are buried deep in our DNA, rather than tacked on after we've reached a size where process is needed to move forward. In addition, we needed to focus on our core competencies, so we looked for tools to enhance our productivity rather than trying to build them ourselves. As such, we were (and continue to be) on the lookout for great, inexpensive productivity tools.
With that in mind, I wanted to share some of the open source and inexpensive tools we've used to help organize ourselves at Deep Value.
1. Atlassian JIRA and Greenhopper (http://www.atlassian.com)
Getting work done and tracking what you're doing is key. We used Bugzilla for some time, but given the critical nature of work tracking, we decided that a more robust solution was required. We experimented with several solutions (MS Project (waterfall – arrggg) and TeamWork (www.twproject.com)), but in the end an agile development methodology is really the best way to build systems in a complex, fluid environment. We wanted a centralized system for issues and user stories – JIRA with GreenHopper works well for this.
2. Codesion (http://codesion.com/)
We needed source control, and having someone manage it securely and safely for a minimal fee made sense. As we've grown we have been looking at doing this on-site, but the cost is low and the service level high.
3. Google Apps (http://www.google.com/apps/intl/en/business/)
We have a sizable systems team, but we want them to focus on managing our various data centers, not setting up calendars. After trying a few open source solutions for calendars and email, we went with Google. We use it for email, calendars and shared documents. The shared documents have been especially useful in collaborating with clients – providing us with a centralized "whiteboard" that multiple parties can view. We are now using webcams and Google Hangouts to build a more cohesive team feeling.
4. Aretta (http://www.aretta.com) now CBeyond
After using Skype for a while, we went with a more full-featured VOIP provider. We evaluated several VOIP providers multiple times, and Aretta won out each time. They had some minor reliability issues when they migrated data centers a year or so ago (hence searching twice), but they are the best of the inexpensive variety.
5. Workforce Growth (http://www.workforcegrowth.com)
Doing reviews is essential. Tracking all the questions and doing 360s without a tool is not for the faint-hearted. WorkForceGrowth is a great tool and has improved our review processes, although nothing replaces being a good listener.
6. Asana (http://asana.com/)
For managing multi-team projects with many small tasks and lots of co-ordination, we found JIRA to be too heavyweight. Google documents are too unstructured and don't prompt action. Asana fits well for client integrations and cross-team project management. A great tool that we've recently started using more and more.
7. FollowUpThen (www.followupthen.com)
One of our issues was the "dropped email thread" problem leading to dropped work. If you have an email that you need to follow up to completion, adding a cc: to followupthen.com (e.g. firstname.lastname@example.org or email@example.com) will get FollowUpThen to remind you if you receive no reply. This lets you send-and-forget emails, knowing FollowUpThen will prompt you if no answer arrives. No more dropped email threads.
8. RecruiterBox (www.recruiterbox.com)
Managing your recruitment pipeline and job postings is a real problem. Recruiterbox has helped us track candidates as they move through our recruitment pipeline. This ensures we have suitable statistics to measure recruitment performance, as well as a centralized repository for all the information relating to a candidate. Recruiterbox can also push job postings out to our website and other social media (LinkedIn, Facebook).
By Paul Haefele, Managing Director – Technology
Anti-gaming logic that is dynamic and embedded
Anti-gaming logic is at the core of what we do. We deploy both dynamic strategies and embedded intelligence to ensure best execution you can trust.
Deep Value uses advanced visualization tools and statistical analysis packages to assess gaming risk. We invest approximately 4,000 hours annually reviewing our NYSE trading algorithms. Our algorithms emit their internal state at a rate of approximately 39 emissions per algorithm-minute. These emissions capture the detailed internal driving variables of the algorithm, including market data, as it moves through its various internal states over time. Our research and automated validation teams analyze these emissions to measure how our algorithms are performing, including potential market impact and leakage.
Deep Value embeds complex logic into all of its algorithms as a means of eliminating gaming risk at the core. Our algorithms deploy a host of sophisticated techniques designed to reveal as little information as possible, while still executing the trader intention expressed in the choice of algorithm and its settings. Rather than focusing on detection, Deep Value algorithms are shielded by (a) intelligent algorithmic logic that reduces information leakage and (b) constant tuning of our algorithms for measurable performance improvements which, apart from delivering better prices, also introduces variations in automated behavior and market signatures that make their presence harder to detect.
These techniques include the following:
- Deep Value algorithms are volume-sensitive and do not overreact to outsized volume, or volume that prints too far away from the inside market; rather, they exclude these volumes from consideration, so that our algorithms don't mindlessly "catch up" and reveal their hand when large volumes print.
- Deep Value algorithms display sizes that are sensitive to the ambient liquidity at the inside market, measured real-time, so that our participation is, to the extent allowed by the algorithmic intent, statistically “like” the liquidity already visible to market participants.
- Deep Value algorithms are rebate- and price-performance sensitive, and as a rule try to fill as much of an order passively as possible. This has two benefits: first, the economic benefit of making Deep Value a leader in terms of price performance and rebate performance; and second, protecting the customer from mechanically impacting the stock price and paying too much by crossing wide spreads.
- Deep Value algorithms don’t overreact by chasing adverse price moves (unless validated broadly in the market).
- When serving aggressive orders, Deep Value algorithms typically don't just reach across to take volume up to the limits allowed; rather, they wait to give contras an opportunity to replenish sizes at better prices after takes (as algorithmic orders typically do) before resuming aggressive action.
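As an illustration of the first technique, a volume-sensitive filter might look something like the following. This is a hypothetical sketch: the names, thresholds and data structures are our own illustration of the idea, not Deep Value's production logic.

```python
from dataclasses import dataclass

@dataclass
class Print:
    """A single trade print from the tape."""
    price: float
    size: int

def tracked_volume(prints, bid, ask, max_size=10_000, max_away=0.01):
    """Sum the volume an algorithm tracks for participation, excluding
    outsized prints and prints too far from the inside market."""
    mid = (bid + ask) / 2
    total = 0
    for p in prints:
        if p.size > max_size:
            continue  # outsized print: don't mindlessly "catch up" to it
        if abs(p.price - mid) / mid > max_away:
            continue  # printed too far from the inside market: ignore
        total += p.size
    return total

# Example: a 50,000-share block at $9.50 is excluded when the inside is 10.00/10.02,
# so the algorithm does not chase it and reveal its hand.
prints = [Print(10.01, 500), Print(9.50, 50_000), Print(10.02, 300)]
print(tracked_volume(prints, bid=10.00, ask=10.02))  # 800
```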
NOTE: Supporting performance analysis with regard to arrival price is contained within NYSE Executive Summary POV Performance Report (August 2010 – June 2011).
An algorithmic trading saga unfolds
Testing our high frequency trading platform has always been a challenge. The amount of trading, and the complexity of that trading, have been increasing rapidly. This has led us to deploy more machinery to ensure we are performing as we expect.