We’re now entering what I call the “Industrial Revolution of Data,” where the majority of data will be stamped out by machines: software logs, cameras, microphones, RFID readers, wireless sensor networks and so on. These machines generate data a lot faster than people can, and their production rates will grow exponentially with Moore’s Law. Storing this data is cheap, and it can be mined for valuable information.
In this context, there is some good news for parallel programming. Data analysis software parallelizes fairly naturally. In fact, software written in SQL has been running in parallel for more than 20 years. But with “Big Data” now becoming a reality, more programmers are interested in building programs on the parallel model — and they often find SQL an unfamiliar and restrictive way to wrangle data and write code. The biggest game-changer to come along is MapReduce, the parallel programming framework that has gained prominence thanks to its use at web search companies.
To understand where we’re headed with parallel software, let’s look at what the computer industry has already accomplished. The branch of parallel research that has had the most success in the field is parallel databases. Rather than requiring the programmer to unravel an algorithm into separate threads to be run on separate cores, parallel databases let them chop up the input data tables into pieces, and pump each piece through the same single-machine program on each processor. This “parallel dataflow” model makes programming a parallel machine as easy as programming a single machine. And it works on “shared-nothing” clusters of computers in a data center: The machines involved can communicate via simple streams of data messages, without a need for an expensive shared RAM or disk infrastructure.
The MapReduce programming model has turned a new page in the parallelism story. In the late 1990s, pioneering web search companies built new parallel software infrastructure to manage web crawls and indexes. As part of this effort, they were forced to reinvent parallel databases –- in large part because the commercial database products at the time did not handle their workload well. Like SQL, the MapReduce framework is a parallel dataflow system that works by partitioning data across machines, each of which runs the same single-node logic.
SQL provides a higher-level language that is more flexible and optimizable, but less familiar to many programmers. MapReduce largely asks programmers to write traditional code, in languages like C, Java, Python and Perl. In addition to its familiar syntax, MapReduce allows programs to be written to and read from traditional files in a filesystem, rather than requiring database schema definitions. MapReduce is such a compelling entryway into parallel programming that it is being used to nurture a new generation of parallel programmers. Every Berkeley computer science undergraduate now learns MapReduce, and other schools have undertaken similar programs. Industry is eagerly supporting these efforts.
Technically speaking, SQL has some advantages over MapReduce, including natural combinations of multiple data sets, and the opportunity for deep code analysis and just-in-time query optimizations. In that context, one of the most exciting developments on the scene is the emergence of platforms that provide both SQL and MapReduce interfaces within a single runtime environment. These are especially useful when they support parallel access to both database tables and filesystem files from either language. Examples of these frameworks include the commercial Greenplum system (which provides all of the above), the commercial Aster Data system (which provides SQL and MapReduce over database tables), and the open-source Hive framework from Facebook (which provides a SQL-like language over files, layered on the open-source Hadoop MapReduce engine.)
MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data. This is critical: a revolution in parallel software development can only be achieved by a broad base of enthusiastic, productive programmers. The new combined platforms for data parallelism expand the options for these programmers and should foster synergies between the SQL and MapReduce communities. Longer term, these Big Data approaches to parallelism might provide the key to keeping other sectors of the software industry on track with Moore’s Law.
Joe Hellerstein is a professor of Computer Science at the University of California Berkeley and has written a white paper with more detail on this topic.
Slide courtesy of Green Plum
Great post Joe!
One thing I would recommend looking into (you and anyone interested in next-generation software development tools) is Microsoft’s latest forays into parallel processing and subsequently, parallel programming.
Assuming you don’t have any kind of bias against Microsoft (like so many people seem to), they’re doing some pretty innovative things in regards to the exact problem you’re talking about – specifically in regards to “big data” and multicore systems. Their innovations really allow the developer to cover more ground (ground as in slicing and dicing large amounts of data) with less code.
I know personally (I’m not speaking for everybody here) but the constructs they provide have completely changed the way I code. I’m able to do more, with less code.
1) LINQ and PLINQ. LINQ stands for Langugage INtegrated Queries, which lets the developer write SQL-like queries in their code, and these queries are compile-time checked. As developers, we are constantly having deal with collections of “stuff”. Collections of numbers, collections of database rows, collections of nodes, collections of objects = collections of “things”. Sometimes those collections are in memory; sometimes they’re in a document; sometimes they’re on a separate server; sometimes they’re smeared across multiple servers; heck, sometimes they’re ‘in the cloud’. BUT when you’re writing code – you don’t care where they’re at, all you know is you want the entities that match criteria X, Y and Z, ordered by W and aggregated by J.
So LINQ allows you to get at those entities regardless of where they’re persisted (when I say “regardless”, I mean assuming there’s a LINQ Provider, custom or otherwise, that gives you access to those entities wherever they may live).
LINQ allows the engineer to focus on the *what* and removes the *how*. PLINQ is just like LINQ, but the P stands for Parallel. As in execute this query, not in serial manner, but in whatever it takes it get it done – serial or otherwise.
2) CCR. CCR stands for Concurrency and Coordination Runtime, which is currently being flown under the guise of the Robotics Extensions , but they obviously have bigger plans for it. Essentially, it help different services and the depencies those services may have on one another. From a robotics standpoint, you can see where this would come into play (the depth sensor detects a cliff, halt the wheels) but higher up, you can see where a stack like this would be important for any type of “metaservice”.
More info: http://msdn.microsoft.com/en-us/library/bb648752.aspx
3) Windows Workflow. Workflows allow the developer to define services and the “flow” (lame description, I know, but this is getting longer than I anticipated) between all of them. It’s another example of telling the “system” (whatever that may be in your case) *what* you want done and not *how* to accomplish it.
http://en.wikipedia.org/wiki/Windows_Workflow_Foundation
For what it’s worth, I do not work for Microsoft and I have no affiliation to them besides using their tools. I’m pretty proficient in Unix and it’s associated tool sets (Java, Python, Apache, MySQL). I have no corporate or organizational allegiances, I’m allegiant to the tools that make me a better and lazier engineer – regardless of commercial or “for profit” status.
-josh jonte
WOW! i am sure glad i didn’t decide to go to UCB now… making my decision to be around hot girls and beaches instead of frumpy chubby girls in sweat shirts was definitely the right call now that i’ve read this article.
So just a word of advice Mr. CS prof. Google isn’t going to hire you for slobbing their knob on a gigaom article.
last time i checked the techniques used in mapreduce were used at a lot of other places decades before google ever called their divide and conquer schema mapreduce… DID WE FORGET THE ENTIRE SUPER COMPUTING INDUSTRY IN THIS ARTICLE??? But hey i guess we can chalk up the discovery of divide and conquer algorithms to LARRY PAGE CIRCA 1990 like who the fck is Von Neumonn and Gauss? unmentionable morons who did nothing for parallelism.
or the fact that a slow down in cpu innovation causes a need in parallelism which will diminish as soon as innovation in cpu technology is created? you didn’t even touch on that
i’m a freshman and i know you’re a dumbass so is someone going to be sending me a cracker jack UCB PH.D in the mail now?
OMG.
Now we got to the point that normalization, automated or done manually, is the future of parallel computing. No wonder CS is in such a bad shape.
@Freshman, don’t be so hard on him.
Dear Prof.
Have you looked at a Neuron lately ?
Ever wondered why they look so different from a logic gate( input output are not exactly uni connections are they, let alone uni directional) ?
Ever wondered why they a so stochastic (In firing and creating connections) ?
Ever thought about why GLIA absorb and create none uniform neural transmitters (at least in math models) ?
IFF you want to understand how to create a really complex massive parallel system which can handle lockfree updates and can do some other fancy stuff, like learning. I suggest you rethink the map reduce hype. I know it’s not Googly, but it’s really estonnishing what mother Nature has done.
A couple thoughts:
* SQL isn’t inherently connected to megadata processing, of course. It would be better to speak of specialized data warehousing servers, of which Greenplum and Aster are two (Teradata, the recently-acquired ParAccel, and Vertica being some others). Curt Monash covers these pretty extensively at http://www.dbms2.com
* None of the above systems, as far as I am aware, handle the data volumes currently being routinely processed by M/R systems (at Google, Yahoo, and several other places). Which isn’t to say that they’re not extremely useful; it’s just that the scope is different. They might get better, but I’d question whether they’re more optimizable in general.http://s2.wordpress.com/wp-content/themes/vip/gigaom3.5/images/buttons/comment.gif
* In my experience, SQL (and similar query languages) are more familiar to the vast majority of commercial programmers (although they may not have an in-depth grasp of either normalization or dimensional modeling) than MapReduce-style systems, although perhaps this is changing. People with functional programming seem to ramp up quicker with the model, for obvious reasons.
Finally, I think the title’s a bit misleading: while you can certainly credit MapReduce with attacking the problems of parallel *data processing*, there are plenty of endeavours involving parallel systems where it doesn’t apply. The resurgence of interest in actor systems is probably helping more widely when it comes to tackling concurrency in general.
I have found it funny that people concentrate on Googles Map Reduce when its other product is something people could kill for. I am talking about BigTable obviously.
Map Reduce is good but its use (until now) has been restricted to indexing operations search and data mining being examples. These applications have a unique advantage of NOT requiring realtime answers.
Most webapps like flickr / youtube / amazon / blogs /twitter have found that the main problems that they face scaling is caused by the database. SQL while being easy to learn is also opaque in the sense that it is really not known what will happen behind the scenes, Query plans change leaving massive website slowdowns etc etc. Also parallel / distributed databases are not really out there – mysql or oracle are not really parallel on commodity hardware.
Thus Facebook / Amazon / Google are producing a simple parallel databases for webapps, I am talking of Cassandra / Dynamo and BigTable obviously. There is also CouchDB etc
Why is this important. You assume that the MapReduce will interact with parallel database through SQL. The problem is
1) Mapreduce is not built for interactive apps. RDBMS are built for these classes of apps.
2) There are no good (meaning industry standard) distributed RDBMSs out there to power the parallel future you envision.
SQL or no SQL is a smaller problem. Any language used here could be SQL like to take advantage of the ease of use and the number of people out there who know SQL. However the open question is what will the parallel db look like is still open.
Some additional thoughts in response to comments:
1) When it first appeared, this post had a different title, which had an unpleasant whiff of MapReduce marketing. I honestly don’t know how that title got there, but it wasn’t what I put down when I submitted the text. Hopefully the fix to the title now gets the flavor a little closer to where it should be. (It also makes R. McLoy’s post a little confusing, since it refers to that earlier title which is now gone…)
2) For me, the most interesting thing about MapReduce (by far!) is that people are so interested in it. We’re only going to make progress in parallel computing with paradigms that programmers embrace. SQL and MapReduce are two successful examples (arguably the only two) in the parallel programming space. MapReduce is the new kid on that block, relatively speaking. So we should learn from that, channel the energy to more general directions, and make more progress. You have to bring people along with you on this stuff.
3) On other languages/models: Josh is right: the MS work on LINQ and co. — and the MSR work on C-Omega — are some of the most interesting things in the space of embedding declarative nuggets into traditional code, and moving toward a more parallelizable world. Definitely worth watching. In the comments on the first post, somebody mentioned Erlang, which is another very interesting and somewhat different design point in the parallel computing space. Also well worth looking at — I recommend the Erlang book, it’s a pleasure to work through, and eye-opening. Finally, I’m pretty fired up about our research work on Overlog, and the work we’re beginning on Lincoln. These languages push the data-centric style of SQL and MapReduce into a much richer programming model. The Overlog work is documented to some degree at https://trac.declarativity.net. The work on Lincoln is still in the pipe, and has an eye on attracting programmers, not just researchers. Watch that space.
4) Yuvamani: Yep, Google’s MapReduce is a batch system, because that’s what Google wanted at the time. Hadoop copied that. But it’s all relatively easy to change, with little impact on the programming model. BigTable and company are indeed very interesting. I expect an integration of low-consistency storage with other query processing languages in the next year or so. We’ll see which of the DBMS vendors gets there first, or if it happens by gluing a query processor over a system like that. Facebook’s Hive is one open-source step in that direction (though it’s got a pretty thin query processor at this stage). The question of how much query optimization to do under the covers is kind of an eternal tradeoff between complexity and control. We’ll see how it plays out in the emerging application domains.
5) Ronald: I have to agree. Mother Nature has been amazing people for millenia. But it’s been hard for us to systematize computational insights from that — so far. There are some pretty intriguing tidbits in the research about DNA computing and the like, but it will be quite some time before you can, say, implement Spore that way :-). In general, we need to place bets on both short-term and long-term research, and keep a clear eye on the potential of each.
Hi Joe,
It’s true that Map-Reduce does wonders for *offline* processing of data generated as a byproduct of activities on the web, phone etc. But it needs to be pointed out that social network data models create *online* data management problems that are not parallelizable. This is simply because user-friend-shareditem relationships generate networks of relationships that can’t be partitioned as easily as the hierarchical trees we were accustomed to tackling in the business world. There is a lot of talk of sharding/partitioning etc. Yet it remains true that the underlying data models in social networks combined with the ~100million users and their data represent a whole different class of *online* data management problem not addressed by map-reduce, column stores … and all the *offline* data management technologies that are correctly getting a lot of attention.
In the meanwhile, Twitter, Facebook, etc. have to face massive *online* data management problems not addressed by the vendors or the research community.
There are surprising implications when one does look deeper at the problem. Simple back of the envelope calculations re: scaling data items belonging to ~100million network-meshed users suggests that only natural way to partition the data is for users to have their own client side data stores and run queries locally when those queries mostly involve their small subset of data. Also the ratio of (CPU+disk ) to data quantity is much more favorable. Offloading subsetted queries is the only way for *online* social network data management to scale.
This creates the next generation of client-server architecture where ~100 million client-side databases (possibly on SQLite) each of ~100MB-10G need to be synchronized in real time with say the Facebook backend. Realistic scalable solutions in this area, i.e. web driven online data management for ~100mill users, are being completely ignored by the Oracles/MySQLs of the world and also by the database research community and yet these problems can be very lucrative …. Of course there is a lot of money in managing offline Big Data but who will help solve the pain of exploding online data management – what I’ve called Data 2.0 since 2005 ?
The current advances in memory with 2TB RAM in 2U box for ~30K may provide temporary relief but the underlying meshed-network data models need serious research and engineering and for now the
Nitin:
You make a couple interesting points. Google’s MapReduce definition results in a batch (“offline”) processing system as you say — because they defined Reduce to produce a sorted list of Reduce groups. Hadoop followed suit. Note though that if you change the assumptions a bit, you could allow pipelined (“online”) Reduce outputs. We did a bunch of research on Online Aggregation http://control.cs.berkeley.edu back in the 90′s in the SQL context, where you could get “early returns” from aggregation (reduce) tasks. That work has been extended by Jermaine, Dobra and their students in recent years. I agree that it’s time to apply it to MapReduce, to bring that programming model online. Another direction to pursue there is the extension of the MapReduce programming API to handle continuous data streams, akin to the work on TelegraphCQ http://telegraph.cs.berkeley.edu and related research projects.
Your other point is about the fact that some workloads don’t partition well, and you need to do replication. You’re right — that’s a trickier nut to crack. It arises in scientific computation a lot. I’m not will convinced that social nets are clearly in that bin. I worked with LinkedIn to parallelize a number of their analytic jobs on top of Greenplum using both SQL and MapReduce, and they partitioned quite smoothly.
There’s historically been a tendency to think in the following terms: “I need to parallelize Algorithm X, and it has a lot of intrinsic data sharing, so I need an innovative/expensive parallel architecture since data-parallelism won’t work”. In many cases, it can be more fruitful to think in these terms: “I have a cost-effective data-parallel infrastructure, so I need an innovative Algorithm X’ that approximates Algorithm X.” This mindset is taking hold in a number of quarters, and there’s quite a lot of sophisticated things you can do cheaply if you embrace that approach.
I think that the final winner between anti-RDBMS and parallel RDBMS will a hybrid system which aims at integrating MapReduce into RDBMS.
Actually, GreenPlum and HadoopDB are doing that.
Both of them are RDBMS advocates since they have lots of knowledge and experience through RDBMS research.
Especially, the hybrid system need SQL-like query analysis & optimization to manipulate distributed DBMSs with MapReduce.
In this point, I think that RDBMS advocates cannot help defeating anti-RDBMS advocates, unfortunately.
Nevertheless, they do not lead IT industry & market due to somewhat high cost.
To reduce cost, most of people want to take advantage of open sources.
Currently, I am a creator of open source “coord”(http://www.coordguru.com), which provides C++ MapReduce framework and distributed key-value store.
In the near future, I believe that such a plan will be achieved on coord project.