Hadoop And Big Data Enterprise Challenges

Hadoop Highlights

  • The Apache HTTP Server is used on 65% of all active websites
  • Hadoop is a software library allowing for the distributed processing of large data sets across clusters of computers
  • Public Cloud suppliers have the widest expertise and largest production systems today
  • Cloudera, IBM, Red Hat and Microsoft are among a number of suppliers addressing the commercial market
  • We predict a rise in the number of large organisations using Hadoop for OLAP and other workloads
  • Big Data isn’t about scary numbers

Hadoop Ecosystem
A number of vendors are getting more involved in bringing OLAP applications to enterprises by using Hadoop, which hitherto has typically been used by large public Cloud providers and academia. You’ll be interested to learn more about the massive scale of some of the current deployments as well as to look at some of the opportunities going forward.
The Apache HTTP Server is of course the web server software that drove the phenomenal growth of the World Wide Web. It is Open Source software and, according to Netcraft’s research, was in use on 65% of all active websites in January 2013.
Apache Hadoop is a software library – ‘a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model’, according to the Apache Hadoop project. It has been widely adopted, with 166 references currently provided on its site. There is some interesting information published there from networking/Public Cloud suppliers. In particular:

  • eBay is running 8 x 532-node clusters with 5.3PB of associated storage, using it for search optimisation and research; it also reports heavy usage of Java MapReduce[1], Pig[2], Hive[3] and HBase[4]
  • Facebook is running it on a 1.1k-machine cluster (8.8k cores) with 12PB storage and a 300-machine cluster (2.4k cores) with 12TB storage; each ‘commodity’ node has 8 cores and 12TB storage; it reports heavy use of streaming and Java APIs, has built a higher-level data warehousing framework known as Apache Hive and a FUSE implementation over HDFS (Hadoop Distributed File System)
  • LinkedIn has multiple grids ‘divided up based upon purpose’; according to our calculations, in total it runs Hadoop on around 4,100 machines with a total of 46k cores, 110TB RAM and 60PB storage; it uses RHEL 6.3 Linux, Sun JDK and Pig ‘heavily customised’, Hive, Avro, Kafka and ‘other bits and pieces’
  • Spotify uses Hadoop for ‘content generation, data aggregation, reporting and analysis’ on a 120-node cluster with 2,880 cores, 2TB RAM and 2.4PB storage
  • Twitter uses Hadoop to store and process Tweets, log files and other data generated across Twitter; it uses Cloudera’s CDH distribution of Hadoop and uses Scala and Java to access its MapReduce APIs
  • Yahoo! has 40k computers with 100k CPUs running Hadoop; its largest cluster has 4.5k nodes, which it uses to support research into Ad Systems and Web Search and for scaling tests to support Hadoop on larger clusters; it claims that 60% of Hadoop developers within the company work on Pig

The massive scale of some of these clusters is testament to the value of Open Source development and a continuation of the work done in the HPC market many years ago. Their use of Hadoop also goes beyond analytics to data warehousing, Web search and optimisation.
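To make that ‘simple programming model’ concrete, below is a minimal sketch of the classic word-count job written against Hadoop’s org.apache.hadoop.mapreduce Java API (the same MapReduce API the eBay and Twitter deployments above report using). It assumes a reasonably recent Hadoop release; the class name and the input/output paths are purely illustrative.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Pig and Hive, mentioned repeatedly above, let teams express the same kind of job at a higher level (as a data-flow script or a SQL-like query respectively) without writing Java at all.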

Supplier Developments

We’ve looked at a number of Hadoop developments over the last year or so. In particular:

  • Cloudera’s CDH is the best-known Hadoop distribution. Cloudera also works with vendors such as Cisco, Oracle, HP, Dell, IBM and NetApp on their commercial implementations.
  • IBM has its own Hadoop distribution called BigInsights. It is in the process of bringing OLAP workloads to medium- and large-sized organisations through a number of software, service and hardware offerings – most notably in its PureData System for Analytics and PureData System for Operational Analytics systems introduced in January.
  • HP has launched its AppSystem – a turn-key, factory-assembled Hadoop platform using HP ProLiant servers and network switches, RHEL, Cloudera, HP Insight CMU and Vertica Community Edition software
  • Red Hat is integrating its Gluster acquisition into RHEV. It uses an ‘elastic hashing algorithm’ which does away with the discovery nodes used in traditional storage, simplifying the addition of extra capacity. Hadoop applications can be run across a number of nodes, avoiding the need to build a separate storage cluster. It claims that integrating Hadoop with Gluster gives users the ability to extract real-time knowledge from active data.

In addition Microsoft has integrated the ability to run Hadoop on its Azure and Windows Server software. There are distributions available from Amazon, Hortonworks, EMC, IBM, Intel, MapR and many others. We also know of another important development to talk to you about shortly.

Some Conclusions – Eyeing Up The Enterprise

We hate the scary train-set numbers used by almost all Big Data vendors in setting the scene for their sales presentations. Our view is that the world gets as much storage as the disk and flash memory suppliers can make at a reasonable price. However we’re impressed with the success of the public Cloud vendors in implementing large-scale Hadoop clusters, albeit for relatively simple – but admittedly massively scaled – workloads. We’re also pleased to see that there is still a part of our world which uses distributed rather than centralised computing.
Not everyone will need Hadoop or other difficult and/or expensive solutions to Big Data problems. In fact the data often proves to be smaller than the scary numbers suggest.
For those enterprises that do, they’ll gain more insight today from talking to the public Cloud providers and the developer community than to system vendors, although they’ll need commercial hardware and software to be modified to work well with Hadoop in future.
If you’re a large organisation increasing ‘systems of access’ workloads and lacking Open Source developers, you’ll be pleased to know that there are vendors out there that can simplify the adoption of OLAP workloads and Hadoop if appropriate.
We’ll do our best to keep you up to date with the big releases.


[1] a programming model and software framework for processing large data sets in parallel across a cluster
[2] a high-level data-flow language and execution framework for parallel computation
[3] an open-source data warehouse system for querying and analysing large datasets stored in Hadoop files
[4] Hadoop distributed database supporting structured data storage for large tables
