Thinking about Big Data — Part Two

In Part One of this series of columns we learned about data and how computers can be used for finding meaning in large data sets. We even saw a hint of what we might call Big Data at Amazon.com in the mid-1990s, as that company stretched technology to observe and record in real time everything its tens of thousands of simultaneous users were doing. Pretty impressive, but not really Big Data, more like Bigish Data. The real Big Data of that era was already being gathered by outfits like the U.S. National Security Agency (NSA) and the UK Government Communications Headquarters (GCHQ), spy operations that were recording digital communications even though they had no easy way to decode them and find meaning in them. Government tape libraries were being filled to overflowing with meaningless gibberish.

What Amazon.com had done was easier. The customer experience at Amazon, even if it involved tens of thousands of products and millions of customers, could be easily defined. There are only so many things a customer can do in a store, whether it is real or virtual. They can see what’s available, ask for more information, compare products, prepare to buy, buy, or walk away. That was all within the capability of relational databases where the relations between all those activities could be pre-defined. It had to be predefined, which is the problem with relational databases — they aren’t easily extended.

Needing to know your database structure upfront is like having to make a list of all of your unborn child’s potential friends… forever. This list must even include future friends as yet unborn, because once the child’s friends list is built, adding to it requires major surgery.

Finding knowledge — meaning — in data required more flexible technology.

The two great technical challenges of the 1990s Internet were dealing with unstructured data, which is to say the data flowing around us every day that isn’t typically even thought of as being within a database of any kind, and processing that data very cheaply because there was so much of it and the yield of knowledge was likely to be very low.

If you are going to listen to a million telephone conversations hoping to hear one instance of the words al-Qaeda, that means either a huge computing budget or a new, very cheap way to process all that data.

The commercial Internet was facing two very similar challenges, which were finding anything at all on the World Wide Web and paying through advertising for finding it.

The search challenge. By 1998 the total number of web sites had reached 30 million (it is more than two billion today). That was 30 million places, each containing many web pages. Pbs.org, for example, is a single site containing more than 30,000 pages. Each page then contains hundreds or thousands of words, images, and data points. Finding information on the web required indexing the entire Internet. Now that’s Big Data!

To index the web you first need to read it — all of it — 30 million hosts in 1998 or two billion today. This was done through the use of what are called spiders — computer programs that search the Internet methodically looking for new web pages, reading them, then copying and dragging back the content of those pages to be included in the index. All search engines use spiders and spiders have to run continuously, updating the index to keep it current as web pages come and go or change. Most search engines keep not only an index of the web today, they tend to keep all old versions of the index, too, so by searching earlier versions it is possible to look back in time.
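The spider loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" here is a hypothetical in-memory dictionary called TOY_WEB, and the function names crawl and fetch are invented for the example. A real spider would fetch pages over the network, respect site rules, and run continuously.

```python
from collections import deque

def crawl(fetch, seeds):
    """Breadth-first crawl: read a page, keep its text, queue its unseen links."""
    seen = set(seeds)
    queue = deque(seeds)
    pages = {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            # Dead link: the page has gone away, so skip it.
            continue
        text, links = page
        pages[url] = text  # "drag back" the content for the indexer
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Hypothetical toy web: url -> (page text, outgoing links).
TOY_WEB = {
    "home": ("welcome", ["about", "news"]),
    "about": ("about us", ["home"]),
    "news": ("latest news", ["about", "missing"]),
}

pages = crawl(TOY_WEB.get, ["home"])
print(list(pages))
```

The seen set is what keeps the spider from reading the same page forever when sites link back to each other, which on the real web they constantly do.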

Indexing means recording all the metadata (the data about the data): words, images, links and other types of data like video or audio embedded in a page. Now multiply that by zillions. We do it because an index occupies approximately one percent of the storage of the servers it represents, or 300,000 pages worth of data out of 30 million circa 1998. But indexing isn’t finding information, just recording the metadata. Finding usable information from the index is even harder.

There were dozens of search engines in the first decade of the Internet but four were the most important and each took a different technical approach to finding meaning from all those pages. Alta-Vista was the first real search engine. It came from the Palo Alto lab of Digital Equipment Corporation which was really the Computer Science Lab of XEROX PARC moved almost in its entirety two miles east by Bob Taylor, who built both places and hired many of the same people.

Alta-Vista used a linguistic tool for searching its web index. Alta-Vista indexed all the words in a document, say a web page. You could search on “finding gold doubloons” and Alta-Vista would search its index for documents that contained the words “finding,” “gold,” and “doubloons,” then present a list of pages ordered by the number of times each search term appeared in the document.

But even back then there was a lot of crap on the Internet which means Alta-Vista indexed a lot of crap and had no way of telling crap from non-crap. They were, after all, only words. Pointedly bad documents often rose to the top of search results and the system was easy to game by inserting hidden words to tilt the meter. Alta-Vista couldn’t tell the difference between real words and bogus hidden words.
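Word-count ranking of the kind described above, and the ease of gaming it, can be sketched like this. It is a toy under stated assumptions, not Alta-Vista's actual code: the document names and the functions tokenize, build_index and search are all invented for illustration.

```python
from collections import Counter
import re

def tokenize(text):
    """Lower-case the text and split it into plain words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each document name to a count of every word it contains."""
    return {name: Counter(tokenize(body)) for name, body in docs.items()}

def search(index, query):
    """Return documents containing every query term, most term hits first."""
    terms = tokenize(query)
    scored = []
    for name, counts in index.items():
        if all(counts[t] > 0 for t in terms):
            scored.append((sum(counts[t] for t in terms), name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]

# Hypothetical pages: one honest, one stuffed with repeated keywords.
docs = {
    "treasure.html": "Finding gold doubloons: gold, gold and more gold doubloons.",
    "spam.html": "gold gold gold gold gold gold finding doubloons gold gold",
    "cooking.html": "Finding the best brownie recipe.",
}

results = search(build_index(docs), "finding gold doubloons")
print(results)
```

The stuffed page scores 10 term hits against 7 for the honest one, so it comes out on top. Words are only words, and a counter cannot tell which ones were put there to tilt the meter.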

Where Alta-Vista leveraged DEC’s big computers (that was the major point, since DEC was primarily a builder of computer hardware), Yahoo leveraged people. The company actually hired workers to spend all day reading web pages, indexing them by hand (and not very thoroughly) then making a note of the ones most interesting on each topic. If you had a thousand human indexers and each could index 100 pages per day then Yahoo could index 100,000 pages per day or about 30 million pages per year — the entire Internet universe circa 1998. It worked a charm on the World Wide Web until the web got too big for Yahoo to keep up. Yahoo’s early human-powered system did not scale.

So along came Excite, which was based on a linguistic trick: finding a way to give searchers not more of what they said they wanted but more of what they actually needed but might be unable to articulate. Again, this challenge was posed in an environment of computational scarcity (this is key).

Excite used the same index as Alta-Vista, but instead of just counting the number of times the word “gold” or “doubloons” appeared, the six Excite boys took a vector geometric approach, each query being defined as a vector composed of search terms and their frequencies. A vector is just an arrow in space with a point of origin, a direction, and a length. In the Excite universe the point of origin was total zeroness on the chosen search terms (zero “finding,” zero “gold,” and zero “doubloons”). An actual search vector would begin at zero-zero-zero on those three terms then extend, say, two units of “finding” because that’s how many times the word “finding” appeared in the target document, thirteen units of “gold” and maybe five units of “doubloons.” This was a new way to index the index and a better way to characterize the underlying data because it would occasionally lead to search results that didn’t use any of the actual search terms — something Alta-Vista could never do.

The Excite web index was not just a list of words and their frequencies but a multi-dimensional vector space that considered searches as directions. Each search was a single quill in a hedgehog of data and the genius of Excite (Graham Spencer’s genius) was to grab not just that quill, but all the quills right around it, too. By grabbing not just the document that matched absolutely the query terms (as Alta-Vista did) but also all the terms near it in the multidimensional vector space, Excite was a more useful search tool. It worked on an index, it had the refinement of vector math and — here’s the important part — it required almost no calculation to produce a result since that calculation was already done as part of the indexing. Excite produced better results faster using primitive hardware.
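The vector idea can be made concrete with a short sketch. It uses the column's own example document (two "finding," thirteen "gold," five "doubloons") and measures how closely a query points in the same direction using cosine similarity. Cosine similarity is an assumption chosen here for illustration. The column does not say which measure of nearness Excite used, and the names TERMS, to_vector and cosine are invented.

```python
import math

# The three dimensions of this tiny vector space, one per search term.
TERMS = ("finding", "gold", "doubloons")

def to_vector(text):
    """Count how often each chosen term appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in TERMS]

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_vector = [2, 13, 5]                              # the column's example document
query_vector = to_vector("finding gold doubloons")   # one unit along each axis

similarity = cosine(query_vector, doc_vector)
print(round(similarity, 2))  # roughly 0.82
```

Because every document vector can be computed once at indexing time, answering a query comes down to comparing directions, which is the "almost no calculation" advantage described above. This sketch only covers documents that share terms with the query. Surfacing pages that use none of the search terms would need a richer space than three word counts.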

But Google was even better.

Google brought two refinements to search — PageRank and cheap hardware.

Excite’s clever vector approach produced search results that were most like what searchers intended to find, but even Excite search results were often useless. So Larry Page of Google came up with a way to measure usefulness, which was by finding a proxy for accuracy. Google started with search results using linguistic methods like Alta-Vista, but then added an extra filter in PageRank (named for Larry Page, get it?), which looked at the top-tier results and ordered them by how many other pages linked to them. The idea being that the more page authors who bothered to link to a given page, the more useful (or at least interesting if even in a bad way) the page would be. And they were right. The other approaches all fell into decline and Google quickly dominated on the basis of their PageRank patent.
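The link-counting filter described above can be sketched as a re-ranking step. This follows only the simplified description in this column, counting inbound links. The published PageRank algorithm goes further and weights each link by the rank of the page it comes from. The link graph and the function names inbound_link_counts and rerank are hypothetical.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count how many distinct other pages link to each page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # repeated links from one page count once
            if target != source:      # a page cannot vote for itself
                counts[target] += 1
    return counts

def rerank(candidates, links):
    """Order keyword-search results by inbound links, most linked first."""
    counts = inbound_link_counts(links)
    return sorted(candidates, key=lambda page: (-counts[page], page))

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.html": ["treasure.html", "spam.html"],
    "b.html": ["treasure.html"],
    "c.html": ["treasure.html", "treasure.html"],
    "spam.html": ["spam.html"],
}

ranked = rerank(["spam.html", "treasure.html"], links)
print(ranked)
```

Here the page that three other authors bothered to link to beats the one with a single inbound link, no matter how many times either repeats the search terms.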

But there was another thing that Google did differently from the other guys. Alta-Vista came from Digital Equipment and ran on huge clusters of DEC’s Alpha servers. Excite ran on equally big UNIX hardware from Sun Microsystems. Google, however, ran using free Open Source software on computers that were little more than personal computers. They were, in fact, less than PCs, because Google’s homemade computers had no cases and no power supplies (they ran, literally, on car batteries charged by automotive battery chargers). The first ones were bolted to walls while later ones were shoved into racks like so many pans of brownies, fresh from a commercial oven.

Amazon created the business case for Big Data and developed a clunky way to do it on pre-Big Data hardware and software. The search companies broadly expanded the size of practical data sets, while mastering indexing. But true Big Data couldn’t run on an index; it had to run on the actual data itself, which meant either really big and expensive computers like they used at Amazon, or a way to make cheap PCs look like one huge computer, as at Google.

The Dot-Com Bubble. Let’s consider for a moment the euphoria and zaniness of the Internet as an industry in the late 1990s during what came to be called the dot-com bubble. It was clear to everyone from Bill Gates down that the Internet was the future of personal computing and possibly the future of business. So venture capitalists invested billions of dollars in Internet startup companies with little regard to how those companies would actually make money.

The Internet was seen as a huge land grab where it was important to make companies as big as they could be as fast as they could be to grab and maintain market share whether the companies were profitable or not. For the first time companies were going public without having made a dime of profit in their entire histories. But that was seen as okay — profits would eventually come.

The result of all this irrational exuberance was a renaissance of ideas, most of which couldn’t possibly work at the time. Broadcast.com, for example, purported to send TV over dial-up Internet connections to huge audiences. It didn’t actually work, yet Yahoo still bought Broadcast.com for $5.7 billion in 1999 making Mark Cuban the billionaire he is today.

We tend to think of Silicon Valley being built on Moore’s Law making computers continually cheaper and more powerful, but the dot-com era only pretended to be built on Moore’s Law. It was actually built on hype.

Hype and Moore’s Law. In order for many of these 1990s Internet schemes to succeed, the cost of computing had to be brought down to a level that was cheaper even than could be made possible at that time by Moore’s Law. This was because the default business model of most dot-com startups was to make their money from advertising and there was a strict limit on how much advertisers were willing to pay.

For a while it didn’t matter because venture capitalists and then Wall Street investors were willing to make up the difference, but it eventually became obvious that an Alta-Vista with its huge data centers couldn’t make a profit from Internet search alone. Nor could Excite or any of the other early search outfits.

The dot-com meltdown of 2001 happened because the startups ran out of gullible investors to fund their Super Bowl commercials. When the last dollar of the last yokel had been spent on the last Herman Miller office chair, the VCs had, for the most part, already sold their holdings and were gone. Thousands of companies folded, some of them overnight. And the ones that did survive — including Amazon and Google and a few others — did so because they’d figured out how to actually make money on the Internet.

Amazon.com was different because Jeff Bezos’ business was e-commerce. Amazon was a new kind of store meant to replace bricks and mortar with electrons. For Amazon the savings in real estate and salaries actually made sense since the company’s profit could be measured in dollars per transaction. But for Internet search — the first use of Big Data and the real enabler of the Internet — the advertising market would pay less than a penny per transaction. The only way to make that work was to find a way to break Moore’s Law and drive the cost of computing even lower while at the same time trying to find a better way to link search and advertising, thus increasing sales. Google did both.

Time for Big Data Miracle #2, which entirely explains why Google is today worth $479 billion and most of those other search companies are long dead.

GFS, MapReduce and BigTable. Because Page and Brin were the first to realize that making their own super-cheap servers was the key to survival as a company, Google had to build a new data processing infrastructure to solve the problem of how to make thousands of cheap PCs look and operate like a single supercomputer.

Where other companies of the era seemed content to lose money hoping that Moore’s Law would eventually catch up and make them profitable, Google found a way to make search profitable in late 1990s dollars. This involved inventing new hardware, software, and advertising technologies. Google’s work in these areas led directly to the world of Big Data we see emerging today.

Let’s first get an idea of the scale involved at Google today. When you do a Google search you are first interacting with three million web servers in hundreds of data centers all over the world. All those servers do is send page images to your computer screen, an average of 12 billion pages per day. The web index is held in another two million servers and a further three million servers contain the actual documents stored in the system. That’s eight million servers so far and that doesn’t include YouTube.

The three key components in Google’s el cheapo architecture are the Google File System (GFS), MapReduce, and BigTable. GFS lets all those millions of servers look at what they think is the same memory. It isn’t the same, of course (there are subdivided copies of the memory, called chunks, all over the place), but the issue here is coherency. If you change a file it has to be changed for all servers at the same time, even those thousands of miles apart.

One huge issue for Google, then, is the speed of light.

MapReduce distributes a big problem across hundreds or thousands of servers. It maps the problem out to the servers then reduces their many answers back into one.
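The map-then-reduce pattern can be sketched with the classic word-count example. Everything here runs in one process, so it only illustrates the shape of the computation and is not Google's implementation. In a real system each map_phase call would run on a different server against its own chunk of data, and the function names are invented for this example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(mapped_outputs):
    """Group every value emitted for the same key, across all mappers."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's many values into one answer."""
    return {key: sum(values) for key, values in grouped.items()}

def map_reduce(chunks):
    # Each map_phase call is independent, so each could run on its own server.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))

# Three hypothetical chunks of a much larger data set.
chunks = ["gold doubloons gold", "finding gold", "finding doubloons"]
totals = map_reduce(chunks)
print(totals)
```

Because no map call depends on any other, adding more servers simply means handing out more chunks.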

BigTable is Google’s database that holds all the data. It isn’t relational because relational doesn’t work at this scale. It’s an old-fashioned flat database that, like GFS, has to be coherent.

Before these technologies were developed computers worked like people, on one thing at a time using limited amounts of information. Finding a way for thousands of computers to work together on huge amounts of data was a profound breakthrough.

But it still wasn’t enough for Google to reach its profit targets.

Big Brother started as an ad man. In Google’s attempt to match the profit margins of Amazon.com, only so much could be accomplished by just making computing cheaper. The rest of that gap from a penny to a dollar per transaction had to be covered by finding a way to sell Internet ads for more money. Google did this by turning its data center tools against its very users, effectively indexing us in just the way the company already indexed the web.

By studying our behavior and anticipating our needs as consumers, Google could serve us ads we were 10 or 100 times more likely to click on, increasing Google’s pay-per-click revenue by 10 or 100 times, too.

Now we’re finally talking Big Data.

Whether Google technology looked inward or outward it worked the same. And unlike, say, the SABRE system, these were general purpose tools — they can be used for almost any kind of problem on almost any kind of data.

GFS and MapReduce meant that — for the first time ever — there were no limits to database size or search scalability. All that was required was more commodity hardware eventually reaching millions of cheap servers sharing the task. Google is constantly adding servers to its network. But it goes beyond that, because unless Google shuts down an entire data center, it never replaces servers as they fail. That’s too hard. The servers are just left, dead in their racks, while MapReduce computes around them.

Google published a paper on GFS in 2003 and on MapReduce in 2004. One of the wonders of this business is that Google didn’t even try to keep secrets, though it is likely others would have come up with similar answers eventually.

Yahoo, Facebook and others quickly reverse-engineered an Open Source variant of MapReduce they called Hadoop (named after a stuffed toy elephant; elephants never forget), and what followed was what today we call Cloud Computing, which is simply offering as a commercial service the ability to map your problem across a dozen or a hundred servers, rented sometimes for only seconds, then reduce the many answers to some coherent conclusion.

Big Data made Cloud Computing necessary. Today it is hard to differentiate between them.

Not only Big Data but also Social Networking were enabled by MapReduce and Hadoop as for the first time it became cost effective to give a billion Facebook users their own dynamic web pages for free and make a profit solely from ads.

Even Amazon switched to Hadoop and there are today virtually no limits to how big their network can grow.

Amazon, Facebook, Google, and the NSA couldn’t function today without MapReduce or Hadoop, which by the way destroyed forever the need to index at all. Modern searches are done not against an index but against the raw data as it changes minute-by-minute. Or more properly perhaps the index is updated minute-by-minute. It may not matter which.

Thanks to these tools, cloud computing services are available from Amazon and others. Armed only with a credit card clever programmers can in a few moments harness the power of one or a thousand or ten thousand computers and apply them to some task. That’s why Internet startups no longer even buy servers. If you want, for a short time, to have more computing power than Russia, you can put it on your plastic.

If Russia wants to have more computing power than Russia, they can put it on their plastic, too.

The great unanswered question here is why did Google share its secrets with competitors by publishing those research papers? Was it foolish arrogance on the part of the Google founders who at that time still listed themselves as being on leave from their Stanford PhD program? Not at all. Google shared its secret to build the industry overall. It needed competitors to look like there was no search monopoly. But even more importantly, by letting a thousand flowers bloom Google encouraged the Internet industry to be that much larger making Google’s 30 or 40 percent of all revenue that much bigger, too.

By sharing its secrets Google got a smaller piece of a very much bigger pie.

That’s, in a nutshell, the genesis of Big Data. Google is tracking your every mouse click and that of a billion or more other people. So is Facebook. So is Amazon when you are on their site or when you are on any site using Amazon Web Services, which probably encompasses a third of all Internet computing anywhere.

Think for a moment what this means for society. Where businesses in the past used market research to guess what consumers wanted and how to sell it to them, they can now use Big Data to know what you want and how to sell it to you, which is why I kept seeing online ads for expensive espresso makers. And when I finally bought one, those espresso machine ads almost instantly stopped because the system knew. It moved on to trying to sell me coffee beans and, for some reason, adult diapers.

Google will eventually have a server for every Internet user. They and other companies will gather more types of data from us and better predict our behavior. Where this is going depends on who is using that data. It can make us perfect consumers or busted terrorists (or even better terrorists — that’s another argument).

There is no problem too big to attempt.

And for the first time, thanks to Google, the NSA and GCHQ finally have the tools to look through that stored intelligence data to find all the bad guys. Or, I suppose, enslave us forever.

(Want more? Here’s Part Three.)

By Robert X. Cringely|July 7th, 2016|2016, Big Data, cloud computing, history, Industry, Software, Technology|43 Comments

43 Comments

Anibal July 7, 2016 at 7:25 am

Bob IS BACK !
Excellent work !
Thank you !
Robert Pitera July 7, 2016 at 7:33 am

I’ve read you for more years than I care to admit, Bob; back in the dead tree years I would always turn to your columns before reading anything else. I’ve even made investments (successfully so) based on what I’ve read in your works.

But this two parter was classic Cringely in the purest form. Excellent article; informative, a memory jogger and entertaining as well. It’s stuff like this that makes me tell people, “You have to read this guy, I’ve been following him for decades.”

Thanks! PBS should package this as a special; I’d enjoy watching you host and narrate it.
- Steve S July 7, 2016 at 10:20 am
  
  +1 This!
- MKKBY July 9, 2016 at 2:45 am
  
  Why are you a fan boy for this fossilized corpse? He’s written about technology that is 20 years old. You are only learning something because you’ve been too lazy to go to your public library and read what everyone in the tech business has long forgotten.
  –
  
  I come here every month or 2 for the entertainment value. What I get is a boring repetition of what elderly people did in the 50s thru 70s. Always disappointed that there is nothing current or forward looking.
  –
  
  Sad.
  - JJones July 14, 2016 at 12:02 am
    
    If you work on your attitude I’m sure your mom will let you get internet at home again and save you that trip to the library.
Eric July 7, 2016 at 7:42 am

Your ads went away when you bought something? How do I make that happen? It takes weeks after I buy something for the ads to go away.

And thanks for finally explaining what Hadoop is. I kept reading about it and never understood until now.
- Joe Shelby July 7, 2016 at 10:40 am
  
  Indeed. I just posted on FB in sharing this article: Not perfect though. I was still getting ads for UPS systems for a month after having bought one, but there we are. We also have the problem of lack of sync between what I search for and what my wife pays for on her account (and vice versa) – the systems still take a little time to connect those dots, so I’m still seeing (no, I’m not a dirty little prick) ads for kids’ underwear, a week after she bought a set for the constantly growing kid.
- Bob Loy July 7, 2016 at 7:06 pm
  
  Yeah, they aren’t QUITE ready for prime time! I live in Florida, and I have a good friend in Northern Indiana. He’s a fan of the Gary RailCats minor league baseball team. I visited the RailCats’ site a few weeks ago, and since then I’ve been seeing ads for tickets. Since I don’t own a Gulfstream or Learjet, I’m not sure exactly how they want me to respond to these ads …
Plyskeen July 7, 2016 at 8:06 am

…and yet the underlying problem remains the same – it doesn’t really matter how many ads you throw at them or how well fitting those ads are, customers won’t spend more than they can afford to (which is why we are seeing recently so many (non-IT, but some IT as well) companies’ revenues and profits decreasing instead of increasing, and people aren’t getting any richer (except for a few sectors, such as IT, and we will see for how much longer)). Which, again, is a problem that is not solvable with Big Data (neither with your definition of it nor with mine).

“For the first time companies were going public without having made a dime of profit in their entire histories. But that was seen as okay — profits would eventually come.” …and how is this different from lately?

…and apologies for being such a gadfly, by the way. But -as Iñigo Montoya would put it-, you keep using that word, and I am not sure that it means what you think it means.
- Joe Shelby July 7, 2016 at 10:46 am
  
  Actually, relative to the history of modern investment (if you go back to the early stock markets of Amsterdam and London as the origin point), it IS a recent idea that you can go public without a profit, or even a business model that should ensure it. Railroads had to be profitable, then would sell stock for more money to expand. Same with steel mills, car companies, energy companies, even banks themselves. You were too risky an investment for the public otherwise, so you would continue to get private investment in the interim.
  
  The boom *started* that way, but the problem became internal stock valuation, a system which, just like the striped security funds ratings that created the 2008 recession, could be gamed if the ones doing the ratings are in on the deal in some way.
  
  Even the crash of the 20s was still based on investments in previously profitable companies, not companies with no record to show.
  
  So no, it isn’t an “always like that” situation. It really is very new and the result of deregulation and the idea of a “public” company selling to the public, rather than to investors who know (theoretically) what they’re doing.
  
  That investors don’t know what they’re doing, today or in 1928, is a different story, but there we are.
- Johnny July 8, 2016 at 4:21 pm
  
  “…it doesn’t really matter how many ads you throw at them or how well fitting those ads are, customers won’t spend more than they can afford to…”
  
  I think we all know this isn’t true, even if that may not disprove your overall point.
Fazal Majid July 7, 2016 at 8:32 am

There was nothing revolutionary about Map-Reduce, and by the time they published the paper, Google itself was well on its way to its next-generation Spanner architecture optimized for real-time processing, so getting rivals distracted with its already obsolete batch-oriented architecture could even be construed as misdirection, even if it was probably more of a recruitment tool than anything else.
- Robert X. Cringely July 7, 2016 at 9:21 am
  
  You make a very interesting comment, but whatever the technology Google is using this afternoon you’ve missed the point that the company essentially applies a tax to the entire Internet so growing traffic — even competitors’ traffic — helps Google. The tech may have changed but the function of it didn’t. What remains constant here is Google’s hegemony. So if you look at Google’s revenue growth and Internet traffic growth, they run in parallel. That strongly implies that Google has no upward bound on its growth potential, making it — at least for now — the safest of all technology investments. When Google’s market cap went past Apple’s that is why. Apple is saturating the smart phone market which can only slow growth while Google dominates a market that likely will NEVER slow.
Kirkwood July 7, 2016 at 10:26 am

Bob,

FYI, minor edit needed : in the bolded para on Google you have the text “When you do a Google search you are FIRST interacting FIRST with three million..” (my caps)
Lee July 7, 2016 at 10:59 am

Bob,

Best article set I’ve read in YEARS! Thank you. It makes a great follow-on to Steven Levy’s 1985 tome “Hackers: Heroes of the Computer Revolution”. Mr. Levy explains the rise of the Personal Computer era through events beginning with “The Model Railroad Club” at MIT in the ’50’s and shows a level of unprecedented detail of the creation of this era through the early to mid ’80’s. Now we have your summary of how the Internet Shared Data cum Information a.k.a. Big Data grew into the commodity that it is.

I truly liked how you explained the mechanism known as “the Cloud”, that I thought until today was a fad similar to CB Radios, is truly at the heart of the matter. It is easier now to forecast the future of our society because of the advances in psychological profiling and the computing power and algorithms to process the data that we leave behind.

The Government and other not-so-responsible agencies ingest ALL communication data and store it for later processing. That well of Big Data may not be showing us much today, comparatively speaking, but with continued advances in technology, mathematic algorithms, psychological understanding, and a fervent desire to articulate the unarticulatable, it will not be too much longer before the ability to predict our actions, likes, dislikes, and aberrations will soon be known and sold at a price.

I truly dread that day — not for the benefits that it will bring, but of the misuse that will occur as well.

Be well, my friend.
TxIBMer July 7, 2016 at 1:11 pm

This article, in my opinion, was MUCH better than the first. I had started to read about MapReduce and Hadoop prior to this article, really more lost in the technology, but did not really understand the significance until I read this article. That is good writing when you bring this much understanding in a single article.

It also explains why Google did not do well in the Electronic Medical Records business.

All of my kids have used Google Docs for their schoolwork for about 8 years now. I assume it’s a Microsoft-like seeding strategy that will have all the young MBAs clamoring for Google Docs at the office in 5 more years. Yet I assume that these documents written in the cloud are MapReduced and Hadooped as well, making them part of the giant public “search”. I noticed a lot of big companies have resisted the temptation to move to it.

It’s also ironic that Google Searches return better results on company sites than their own search engines.

I guess Microsoft will probably MapReduce and Hadoop everything in Windows 10 and then there’s no privacy anywhere anymore. I did “opt out” for my installation, but we’ll see if they really follow the rules or not.

Some additional questions: without an email tag or some personal identifier, how does Google separate people in a household? Especially if they share computers, use the same router, etc? We have five different people in our house with five very different points of view on life, so it must produce some really insane stuff for us (half doing “isidewith” Bernie Sanders while the other half “isidewith” Ted Cruz or Gary Johnson). My daughter searches for antique Volkswagen Beetles while I am looking at Porsche Turbos.

Finally, back to the big final point. How much of this search dries up when the money people discover that “no one can afford” what they are searching for? Does the filter finally access your credit status and return you an image of a 1999 Nissan Sentra and tell you “this is what you can afford”? Is there some social responsibility in only returning the options that would be realistically available to them? Part of the social unrest in society today is because people can finally see what the rich have (or can have) and they want it for themselves without having the bank account to get it. Those filters existed in pre-Internet society, but have pretty much been blown away since the late 1990’s.

Finally, I think it’s ironic that with Google we probably have built an “Oracle” that will answer people’s questions like the “Dr. Know” game at the end of AI, the movie. But we can’t call it Oracle…. Could “Dr. Know” have inspired the Google creators?
David July 7, 2016 at 2:21 pm

Yet again a clear article on a complex, often confusing and overwhelming topic. Thank you.
speter July 7, 2016 at 5:58 pm

Map – Reduce: A large dataset is mapped into many queues based on a common attribute for each queue. The queues are then reduced using a statistical process (count, let’s say). A pattern is found using a huge, unstructured dataset. This workload is very nicely spread across multiple processors. Cringely has the cart before the horse IMHO, but one could argue whether it was the approach (cart), or the millions of low-cost servers handling the relatively easy reduce queue tasks (horse). Cheers, -Speter
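The map and reduce steps described above can be sketched in a few lines of Python. This is only a single-process illustration, not how Google's MapReduce or Hadoop is actually implemented; the function names `map_phase` and `reduce_phase` and the sample search log are assumptions made up for the example.

```python
from collections import defaultdict


def map_phase(records, key_fn):
    """Map: route each record into a queue keyed by a common attribute."""
    queues = defaultdict(list)
    for record in records:
        queues[key_fn(record)].append(record)
    return queues


def reduce_phase(queues, reducer=len):
    """Reduce: collapse each queue to one statistic (a count by default)."""
    return {key: reducer(items) for key, items in queues.items()}


# Hypothetical search-log records: (user, query term).
log = [
    ("ann", "porsche"),
    ("bob", "volkswagen"),
    ("ann", "porsche"),
    ("cy", "volkswagen"),
    ("bob", "porsche"),
]

# Group by query term, then count each queue.
queues = map_phase(log, key_fn=lambda rec: rec[1])
counts = reduce_phase(queues)
print(counts)  # {'porsche': 3, 'volkswagen': 2}
```

Because each queue is reduced independently of the others, the reduce calls could in principle be handed to separate processors or servers, which is the property the comment points to.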
Chris July 7, 2016 at 6:19 pm

I’m interested in this comment: “So is Amazon when you are on their site or when you are on any site using Amazon Web Services, which probably encompasses a third of all Internet computing anywhere.”

Are you saying Amazon is tracking every mouse click on sites that are hosted on AWS? I’m pretty certain this is not correct. The AWS Customer Agreement states: “We will not access or use Your Content except as necessary to maintain or provide the Service Offerings, or as necessary to comply with the law or a binding order of a governmental body.”
- JJones July 14, 2016 at 12:23 am
  
  I would like some clarification on this as well. IMHO the clause you quote doesn’t say anything that would preclude Amazon from using the metadata (i.e. accessing of ) vs the content itself.
  I suspect the collection of this metadata is also important to CloudFlare’s business model.
John July 7, 2016 at 8:49 pm

Bob thanks. Excellent work.
.
A slight clarification on the “cloud.” When you install 10,000’s to 1,000,000’s of servers to run an operation like Amazon or Google, you develop automated tools to install the operating systems, set up and manage the servers. Once you develop this ability you can install other applications on all those servers. One of the things you can install is a hypervisor and create a virtual server farm of spectacular size, easily and cheaply. When you buy a service from AWS or Azure, that is what you are getting — a virtual machine running on a hypervisor running on one of any number of servers. The same model used to create super cheap computing for big data applications can be applied to provide cheap virtual machines that customers can rent. With the big data storage technology cloud providers can provide low cost storage for all those rented virtual machines.
.
Big Data and Cloud share the same technology roots and many of the same tools. However, cloud providers are not (or should not be) touching or indexing the data on their customers’ rented virtual machines. There are still many issues with the cloud and business data security that need to be worked out. Suffice it to say Amazon’s AWS and commerce services are separate and independent.
Haritha July 7, 2016 at 11:57 pm

Very clear view of article. I keep on reading your blog post.. This was still amazing. Thanks a lot for sharing this unique informative post with us..
ThatGuy July 8, 2016 at 3:04 am

Great read… two typos..
.
“It worked LIKE a charm on the World Wide Web”
.
“across a dozen or a hundred SERVERS sometimes for only seconds”
.
.
Not really directed at your article but it’s always annoyed me that we call it “Moore’s Law.” As if somehow it’s an immutable fact that CPUs will double in power every two years. They doubled in power because of competition and Intel kept evolving the technology. If they stop working on it, CPU power stops doubling. If I’d been in charge it would have been named Moore’s Observation.
- TxIBMer July 8, 2016 at 12:41 pm
  
  Moore’s Law was effectively broken about 5-7 years ago. CPUs are no longer doubling in speed and circuits every 18 months. Now the focus has moved to power consumption and reducing the energy used, and the optimization is moving to the bus and out to the network. First we created “bus speed” SSD drives, and now we’re looking at 10 and even 20 Gb/sec as normal data center speeds, with bundled fibre doing 200 Gb/sec. We recently downgraded our “Gigapower” at home because our fastest i7 processor could only do 948 Mb/sec, so we couldn’t even use the full 1 Gb/sec bandwidth. So we downgraded to 300 Mb/sec because most of the servers we access can’t even do 100 Mb/sec. So, there you have it. The hardware problems are largely solved. The real problems now are software and process problems, with wide area network latency and “old systems” probably being the real bottlenecks.
- JJones July 14, 2016 at 12:38 am
  
  Just for fun, consider calling it “Moore’s Business Model”. CPUs doubled in power because that’s what would entice enough upgrades to keep enough engineers employed to do it again, and again.
- Ronc July 14, 2016 at 5:21 pm
  
  I guess we could call “If they stop working on it, CPU power stops doubling” ThatGuy’s Law. 🙂
JD July 8, 2016 at 10:09 pm

Does anyone know how Bing works? I got mad at Google about 2 years ago when it arrogantly screwed up the compose feature of gmail to the extent that it was unworkable (on replies couldn’t edit subject and typing window was made very small). On its forums, Google ignored the 99% complaint rate of its users. Decided to avoid the arrogant Google as much as possible.
….
In any event, I have been reasonably happy with Bing and glad to have left the Google search engine. Am wondering how Bing works.
- BillShepp July 9, 2016 at 4:53 pm
  
  I’d rather just not know than use Bing!
Week 25 | import digest July 9, 2016 at 3:13 am

[…] 1 covers mainly structured data from Stonehenge and the Domesday book up to dotcom era Amazon. Part 2 covers unstructured data, the rise of Google and advent of cloud […]
Dave July 9, 2016 at 5:06 am

Bob,

I really enjoyed reading your articles on Big Data. I think the true significance of Big Data is the meaning we can derive from the data, and how that will change our lives. But I’d like to emphasize that Big Data can do a lot more for us than optimizing advertising, or listening in on phone conversations. I expect to see advances in medical treatments based on analysis and synthesis of huge data sets (DNA included), advances in weather prediction and advances in transportation and communications.

As you walked us through history, there are three common themes that needed to mature in order to extract meaning from data: storage, compute and network. Storage started with writing, books and eventually digital storage. Compute started in the mind, then tabulation and eventually computation/processor power. Tying it all together is the communications medium: the Internet. What I’ve just described can also be referred to as “the Cloud.”

And here’s another buzzword for you: IoT (Internet of Things.) It, too, will be responsible for generating huge quantities of data, that will remain useless unless Big Data analytics can make sense of them (as an example, think wearable sensors for sports.)

Big Data, the Cloud or IoT, what’s fundamental is that it will require storage, compute and connectivity.
Hamerfan July 9, 2016 at 2:05 pm

Big Data = Google = Evil.
My $00.02

Good columns, though, Bob. Thanks for them.
Mary Biggs July 10, 2016 at 2:19 am

Quote: “Needing to know your database structure upfront is like having to make a list of all of your unborn child’s potential friends… forever. This list must even include future friends as yet unborn, because once the child’s friends list is built, adding to it requires major surgery.”

Bob — not quite right. You build two tables, one called PERSON, and one called RELATION.

PERSON: ID, name, address, Internet_name, etc. etc…..
RELATION: person_id1, person_id2, relation_type, etc. etc…..

…….then you can add your children when they are born, and your children’s friends when they turn up. You can also add your enemies, your children’s enemies, your wife, your ex-wife (this would be an edit to the relation_type from WIFE to EX-WIFE), and on and on and on……..

…….all without changing the original database structure.
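A minimal sketch of the two-table idea above, using Python's built-in `sqlite3` module with an in-memory database. The column list is trimmed to a few of the fields named in the comment, and the helper functions `add_person` and `relate` and the sample names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE relation (
        person_id1    INTEGER NOT NULL REFERENCES person(id),
        person_id2    INTEGER NOT NULL REFERENCES person(id),
        relation_type TEXT NOT NULL
    );
""")


def add_person(name):
    """Insert a person row and return its generated id."""
    cur = conn.execute("INSERT INTO person (name) VALUES (?)", (name,))
    return cur.lastrowid


def relate(id1, id2, relation_type):
    """Record that id2 stands in relation_type to id1."""
    conn.execute(
        "INSERT INTO relation VALUES (?, ?, ?)", (id1, id2, relation_type)
    )


parent = add_person("Parent")
child = add_person("Child")
relate(parent, child, "CHILD")

# Years later a new friend turns up: more rows, no schema change.
friend = add_person("Friend")
relate(child, friend, "FRIEND")

friends = conn.execute(
    "SELECT p.name FROM relation r "
    "JOIN person p ON p.id = r.person_id2 "
    "WHERE r.person_id1 = ? AND r.relation_type = 'FRIEND'",
    (child,),
).fetchall()
print(friends)  # [('Friend',)]
```

New people and new kinds of relationship arrive as rows rather than columns, so the table definitions never need to be altered.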
- Ronc July 10, 2016 at 5:17 pm
  
  Perhaps Bob should have said “Needing to know your database structure upfront is hard. Finding knowledge — meaning — in data requires more flexible technology.” It sounds like he’s trying to distinguish between structured and unstructured data, but used a poor example of structured data anticipating future input. https://en.wikipedia.org/wiki/Unstructured_data
  https://en.wikipedia.org/wiki/Data_model
- JaneZj July 28, 2016 at 6:28 am
  
  In RELATION you need created-at and valid-until, and you only update the second when the relation is no longer valid. Otherwise you are going to lose all the historic data 🙂
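The created-at / valid-until suggestion might look roughly like the sketch below, again using Python's `sqlite3`. The column names `created_at` and `valid_until`, the helpers `start_relation` and `end_relation`, and the dates are all hypothetical; a NULL `valid_until` is assumed here to mean "still valid".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE relation (
        person_id1    INTEGER NOT NULL,
        person_id2    INTEGER NOT NULL,
        relation_type TEXT NOT NULL,
        created_at    TEXT NOT NULL,
        valid_until   TEXT            -- NULL means still valid
    )
""")


def start_relation(id1, id2, relation_type, when):
    """Open a new relation row that is valid from `when` onward."""
    conn.execute(
        "INSERT INTO relation VALUES (?, ?, ?, ?, NULL)",
        (id1, id2, relation_type, when),
    )


def end_relation(id1, id2, relation_type, when):
    """Close the open row instead of overwriting or deleting it."""
    conn.execute(
        "UPDATE relation SET valid_until = ? "
        "WHERE person_id1 = ? AND person_id2 = ? "
        "AND relation_type = ? AND valid_until IS NULL",
        (when, id1, id2, relation_type),
    )


# WIFE becomes EX-WIFE without losing the record of the marriage.
start_relation(1, 2, "WIFE", "2001-06-01")
end_relation(1, 2, "WIFE", "2015-03-10")
start_relation(1, 2, "EX-WIFE", "2015-03-10")

history = conn.execute(
    "SELECT relation_type, created_at, valid_until "
    "FROM relation ORDER BY created_at"
).fetchall()
current = conn.execute(
    "SELECT relation_type FROM relation WHERE valid_until IS NULL"
).fetchall()
print(history)
print(current)  # [('EX-WIFE',)]
```

Ending a relation only fills in `valid_until`, so the old row survives as history while queries for the current state filter on `valid_until IS NULL`.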
The Man Who Rented Cars……You and Big Data ….Does This Man Have the Future “Wired?”….Econ Recon: The Four Ways to Create Wealth July 10, 2016 at 4:40 pm

[…] are three articles in this series “Thinking About Big Data,” Part 1 and Part 2 were posted this week. They are not short, but are well written, generally avoiding “geek […]
Worth Reading: thinking about big data — part two - 'net work July 15, 2016 at 8:53 am

[…] In Part One of this series of columns we learned about data and how computers can be used for findin… […]
This is always True – nullrend blogs July 19, 2016 at 8:00 am

[…] This is always True […]
Seth G July 31, 2016 at 3:50 pm

“In Google’s attempt to match the profit margins of Amazon.com, only so much could be accomplished by just making computing cheaper.”

Aren’t Amazon’s profit margins similar to grocery stores (i.e., 1-2%)?
Arzu October 8, 2016 at 1:47 pm

Hi Bob! What a great article again, as ever. Every time I read these articles I feel like a part of the history of information technology. I have read all of the articles, and they gave me another opportunity to rethink everything I’ve learnt in this field.
I’d also like to know your thoughts about search engines that were once common and popular, like HotBot, Lycos, Ask Jeeves, Northern Light and so on.
Bob, I often think about which companies and organizations made the biggest contributions to IT. For example, we all know that Apple built the first PC (I know what is a PC and what is a Mac by platform), known as the Mac; IBM made the PC common to us; Xerox built the concept of how the PC looks nowadays; Netscape gave us its browser; Sun created so many protocols that we still use in Internet communications; Oracle built the very first DB; Microsoft with its OS made IT so popular; Dennis Ritchie and co. created UNIX and C++; Linux opened a new horizon; and so on.
I’d like to hear your thoughts, and those of all friends here: which companies have made the greatest contributions to the development of computers and the Internet?
For example, we may say that Xerox PARC invented the form factor and functionality of the PC as we have seen it for many decades, but it never mass-produced its invention, while Dell is known as a primary producer and supplier of PCs and related hardware. So which of these two may be referred to as a computer company?
Some related histories may be Compaq’s and HP’s roles before and after the PC evolution era.
Can we say that companies like HP, IBM, Hitachi, Fujitsu, NEC and Cray are the biggest contributors to the computing world? Do they deserve more credit than others?
Most PCs and servers work thanks to Intel’s genius CPUs, and even the birth of the PC is attributed to Intel’s activities. But why does Intel not figure in many accounts? Everyone says that Intel is a semiconductor company, while I think it’s also a true PC hardware company and, more than that, the brain of it.
I know, and I’ve read in your earlier articles, that as you wrote there, ‘if somebody had said in those times that in ten years Intel and Microsoft would be the greatest players in the IT field, nobody would have believed you’ (I typed what I remember and not your exact words).
Thanks for reading my comment, and believe me that if you answer I will be very happy, because I always think about it.
Blesses!
Atul Mandal November 25, 2016 at 11:21 pm

Very Very innovative. Thanks for sharing.
