Posts Tagged Databases

NoSQL’s great, but bring your A game

Posted by on Saturday, 17 December, 2011

MongoDB might be a popular choice among NoSQL databases, but it’s not perfect, at least out of the box. At last week’s MongoSV conference in Santa Clara, Calif., a number of users, including engineers from Disney, Foursquare and Wordnik, shared their experiences with the product. The common theme: NoSQL is necessary for a lot of use cases, but it’s not for companies afraid of hard work.

If you’re in the cloud, avoid the disk

According to Wordnik technical co-founder and vice president of engineering Tony Tam, unless you’re willing to spend beaucoup dollars on buying and operating physical infrastructure, cloud computing is probably necessary to match the scalability of NoSQL databases.

Wordnik actually launched on Amazon Web Services using MySQL, Tam explained, but the database hit a wall at around a billion records. So Wordnik switched to MongoDB, which solved the scaling problem but introduced disk I/O problems that caused a major performance slowdown. Wordnik then ported everything onto some big physical servers, which drastically improved performance.

And then came the scalability problem again, only this time it was in terms of infrastructure. So, it was back to the cloud. But this time, Wordnik got smart and tuned the application to account for the strengths and weaknesses of MongoDB (“Your app should be smarter than your database,” he says), and MongoDB to account for the strengths and weaknesses of the cloud.

Among his observations was that in the cloud, virtual disks have virtual performance, “meaning it’s not really there.” Luckily, he said, you can design to take advantage of virtual RAM. It will fill up fast if you let it, though, and there’s trouble brewing if requests start hitting the disk. “If you hit indexes on disk,” he warned, “mute your pager.”
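Tam's warning about indexes hitting disk can be turned into a rough capacity check. The sketch below is only an illustration: the collection names, index sizes and the 25 percent headroom are made-up assumptions, not Wordnik's figures. In practice the sizes could come from something like the `totalIndexSize` field that MongoDB's `collStats` command reports.

```python
GIB = 1024 ** 3


def fits_in_ram(index_bytes, ram_bytes, headroom=0.25):
    """Return True if all indexes fit in RAM while leaving `headroom` free.

    index_bytes: mapping of collection name -> total index size in bytes.
    headroom: fraction of RAM reserved for everything that is not an index
              (an assumed figure, tune it for your own workload).
    """
    budget = ram_bytes * (1.0 - headroom)
    return sum(index_bytes.values()) <= budget


# Hypothetical index sizes, for illustration only.
index_bytes = {
    "words": 6 * GIB,
    "definitions": 9 * GIB,
    "examples": 4 * GIB,
}

print(fits_in_ram(index_bytes, ram_bytes=32 * GIB))  # 19 GiB of indexes vs. a 24 GiB budget
print(fits_in_ram(index_bytes, ram_bytes=16 * GIB))  # 19 GiB of indexes vs. a 12 GiB budget
```

A check like this says nothing about the working set of documents, only about indexes, so it is a lower bound on the memory an instance needs.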

Foursquare’s Cooper Bethea echoed much of Tam’s sentiment, noting that “for us, paging the disk is really bad.” Because Foursquare works its servers so hard, he said, high latency and error counts start occurring as soon as the disk is invoked. Foursquare does use disk in the form of Amazon Elastic Block Store (EBS), but only for backup.

EBS also brings along issues of its own. At least once a day, Bethea said, queued reads and writes to EBS start backing up excessively, and the only solution is to “kill it with fire.” What that means changes depending on the problem, but it generally means stopping the MongoDB process and rebuilding the affected replica set from scratch.

Monitor everything

Curt Stevens of the Disney Interactive Media Group explained how his team monitors the large MongoDB deployment that underpins Disney’s online games. MongoDB has its own tool, the MongoDB Monitoring Service (MMS), that Stevens said he swears by, but it isn’t always enough. It shows traffic and performance patterns over time, which is helpful but only a starting point.

Once a problem is discovered, “it’s like CSI on your data” to figure out what the underlying problem is. Sometimes, an instance just needs to be sharded, he explained. Other times, the code could be buggy. One time, Stevens added, they found out a poor-performing app didn’t have database issues at all, but was actually split across two data centers that were experiencing WAN issues.

Oh, and just monitoring everything isn’t enough when you’re talking about a large-scale system, Stevens said. You have to have alerts in place to tell you when something’s wrong, and you have to monitor the monitors. If MMS or any other monitoring tools go down, you might think everything is just fine while the kids trying to have a magical Disney experience online are paying the price.
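One simple way to "monitor the monitors" is a heartbeat check: every monitoring tool reports in periodically, and a separate watchdog flags any tool that has gone quiet. The sketch below is a minimal illustration of that idea, not a description of Disney's setup; the class, the monitor names and the 120-second threshold are all hypothetical.

```python
import time


class HeartbeatWatchdog:
    """Tracks when each monitor last reported and flags the silent ones."""

    def __init__(self, max_silence_seconds):
        self.max_silence = max_silence_seconds
        self.last_seen = {}  # monitor name -> timestamp of last heartbeat

    def beat(self, monitor_name, now=None):
        """Record a heartbeat; `now` can be injected to make testing easy."""
        self.last_seen[monitor_name] = time.time() if now is None else now

    def stale(self, now=None):
        """Return the sorted names of monitors silent for longer than allowed."""
        now = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )


# Hypothetical monitors and timestamps, for illustration only.
watchdog = HeartbeatWatchdog(max_silence_seconds=120)
watchdog.beat("mms", now=1000.0)
watchdog.beat("latency-probe", now=1100.0)
print(watchdog.stale(now=1150.0))  # "mms" has been silent for 150 seconds
```

The watchdog itself should run somewhere other than the systems it watches, otherwise one outage can silence both.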

By the numbers

If you’re wondering what kind of performance and scalability requirements forced these companies to MongoDB, and then to customize it so heavily, here are some statistics:

  • Foursquare: 15 million users; 8 production MongoDB clusters; 8 shards of user data; 12 shards of check-in data; ~250 updates per second on user database, with maximum output of 46 MBps; ~80 check-ins per second on check-in database, with maximum output of 45 MBps; up to 2,500 HTTP queries per second.
  • Wordnik: Tens of billions of documents with more always being added; more than 20 million REST API calls per day; mapping layer supports 35,000 records per second.
  • Disney: More than 1,400 MongoDB instances (although “your eyes start watering after 30,” Stevens said); adding new instances every day, via a custom-built self-service portal, to test, stage and host new games.
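To put the per-second and per-day figures above on the same footing, here is a small back-of-the-envelope conversion. It assumes load is spread evenly across the day, which real traffic never is, so the results are averages rather than peaks; the function names are just for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_to_per_second(count_per_day):
    """Average rate per second for a daily total, assuming uniform load."""
    return count_per_day / SECONDS_PER_DAY


def per_second_to_daily(rate_per_second):
    """Daily total implied by a sustained per-second rate."""
    return rate_per_second * SECONDS_PER_DAY


# Wordnik: 20 million REST API calls per day is roughly 231 calls per second.
print(round(daily_to_per_second(20_000_000)))

# Foursquare: ~250 user updates per second is about 21.6 million per day,
# and ~80 check-ins per second is about 6.9 million per day.
print(per_second_to_daily(250))
print(per_second_to_daily(80))
```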

For more-technical details about their trials and tribulations with MongoDB, all three presentations are available online, along with the rest of the conference’s talks.

Feature image courtesy of Tony Tam, Wordnik.

Related research and analysis from GigaOM Pro:

  • Putting Big Data to Work: Opportunities for Enterprises
  • Migrating media applications to the private cloud: best practices for businesses
  • Infrastructure Q3: OpenStack and flash step into the spotlight





GigaOM


Should there be a Pulitzer Prize for Twitter reporting?

Posted by on Wednesday, 30 November, 2011

The Pulitzer board, which administers the journalism awards named after newspaper publisher Joseph Pulitzer, has changed the terms of submission to require digital entries only, and has also changed the description of the “breaking news” category to stress that real-time reporting will be the main criterion for that award. That has led some, including the Nieman Journalism Lab at Harvard, to speculate that we could see a Pulitzer Prize for live-tweeting of the news. Whether or not that ever comes to pass, there is a case to be made that Twitter is the best tool for breaking news that the world of journalism has seen in a long time.

Although the Pulitzer board didn’t say why it made the changes to the criteria for the breaking-news award, it may have done so in part because the award went without a winner in 2011 — although there were three finalists nominated by the judges, including the Chicago Tribune and the Miami Herald, none were chosen.

Reporting that captures news “as quickly as possible”

The old version of the criteria said that the award would be given “with special emphasis on the speed and accuracy of the initial coverage, using any available journalistic tool, including text reporting, videos, databases, multimedia or interactive presentations or any combination of those formats, in print or online or both.” The new version of the criteria is substantially shorter, and doesn’t mention anything about the format that it is supposed to appear in or what kinds of tools the reporting should use. It simply refers to:

[R]eporting of breaking news that, as quickly as possible, captures events accurately as they occur, and, as time passes, illuminates, provides context and expands upon the initial coverage.

In a news release about the changes, the Pulitzer board also said that it was moving away from looking at print submissions for the category because “it would be disappointing if an event occurred at 8 a.m. and the first item in an entry was drawn from the next day’s newspaper.” Although recent winners have included online elements, most have focused on news packages that appeared in print.

If reporting that occurs “as quickly as possible” is the main criterion, then I think Twitter definitely fits the bill, or is at least a leading contender. Videos uploaded to YouTube or streamed from a news event like the “Occupy Wall Street” protests (as my colleague Janko described in his recent post on videographers becoming citizen journalists) are also clearly real-time, but nothing matches the speed that is possible with 140-character text messages and links on Twitter, and videos and photos often spread this way as well.

An obvious candidate: NPR’s Andy Carvin

As for who has demonstrated the kind of reporting prowess on Twitter that might justify a Pulitzer, many of those who have followed the events of the “Arab Spring” through the Twitter stream of National Public Radio editor Andy Carvin would probably nominate him; he has turned the network into a kind of real-time newswire. Although many criticize Twitter for broadcasting unverified information, Carvin has shown that a rigorous approach to fact-checking and a knowledge of the players involved can make it a reporting tool as good as, if not better than, any other we have known.

There are other good examples as well, including New York Times reporter Brian Stelter’s use of Twitter (and his Tumblr blog) to cover the tornado in Missouri earlier this year, which also gave readers a look behind the scenes at his reaction to the events he was witnessing — another thing Twitter excels at. And other reporters have also made use of the network while on the ground in Tahrir Square in Egypt and elsewhere in the Arab World, as well as during the earthquake and subsequent tsunami in Japan earlier this year.

One of the things that might make Carvin and some of these other examples ineligible for a breaking-news Pulitzer is that the criteria specifically mention “local reporting,” which means posting to Twitter about events in Egypt might not fit the bill, although some of the reporting that news organizations have done using Storify and other tools during the Occupy protests in Los Angeles and New York might qualify.

If nothing else, the Pulitzer board seems to have upped the ante for newspapers and other traditional media outlets who want to compete for the breaking-news award: if you are planning to just publish something the next day — or even post traditional stories to your website — and you’re not thinking about video or Twitter or Storify or some combination of those tools to cover the event, you can kiss your Pulitzer goodbye.

Post and thumbnail photos courtesy of Flickr users Rosaura Ochoa and

Related research and analysis from GigaOM Pro:

  • Content Farms: The Players, The Benefits, The Risks
  • Facebook and the future of our online lives
  • NewNet Q1: Content Farms and Niche Networks on the Rise





Yes, VMware has a big data play

Posted by on Wednesday, 28 September, 2011

Almost everyone has some product or strategy to capitalize on “the cloud.” And now momentum is gathering around the next big technology trend to drive buzzwords: big data. VMware is no exception, as I found out during an on-camera interview with Steve Herrod, CTO at VMware, in which he explained what VMware is doing to help databases and data analytics run on virtualized machines.

Herrod didn’t have any products to launch, but he said the company is working on ways to run a Hadoop cluster on virtualized machines, even machines that might be running other workloads. The idea of an elastic Hadoop cluster that can be used on demand, or perhaps in the evening for big analytics jobs, is compelling, and Herrod expects to share more information on this next year, maybe even at our Structure Data conference in March 2012. Check out the video below and see for yourself.



Watch this video for free on GigaOM

Related research and analysis from GigaOM Pro:

  • Infrastructure Q1: IaaS Comes Down to Earth; Big Data Takes Flight
  • Infrastructure Overview, Q2 2010
  • Big Data Marketplaces Put a Price on Finding Patterns





That was fast! Fusion-io launches IO Turbine product

Posted by on Tuesday, 30 August, 2011

Fusion-io, a newly public company that makes flash memory components for servers, on Monday released the first product to come out of its acquisition of IO Turbine, a mere three weeks after it acquired the startup. The combination of the two companies has resulted in a product called ioCache that will compete in the newly hot space of boosting the performance of virtualized apps.

IO Turbine’s software helps virtualized servers manage their access to Flash memory, enabling companies to put workloads that require fast access to memory, such as databases, in the cloud. FlashSoft and others are also hoping to compete in this space as solid state drives and Flash memory become more common inside corporate data centers.

It made sense that Fusion-io, which provides the hardware to build out distributed Flash memory, would be interested in picking up a software layer that could make its hardware more functional. Companies such as Facebook are deploying servers with Fusion-io cards inside them, creating a network of distributed storage that has to be managed. IO Turbine’s software can do that. It works with any solid state drives that use Flash (or even hard drives), and Fusion-io will offer it bundled with its components or on a standalone basis.

David Flynn, CEO of Fusion-io, said the two companies had been working together for the last year or so, and once Fusion-io went public it had the resources to buy IO Turbine. “We didn’t try to go public to get bought, but to do the buying,” he said, referring to a story I had written in March when Fusion-io filed to go public, in which I suggested the move was a feint to drive up interest so someone would acquire the company.

Instead it was part of a strategy to get people interested in Fusion-io as more than just a provider for webscale companies trying to improve performance on a small subset of Internet apps that have to access a lot of data really quickly. Flynn wants to convince enterprise buyers that performance Flash belongs in their servers. With IO Turbine’s software running on the Fusion hardware, enterprises can continue consolidating their IT resources by putting more virtual machines on a server without slowing down the VMs’ ability to access stored data.

Once companies realize that the transition to Flash Memory (while more expensive on a per gigabyte basis than hard drives) can save them in terms of server costs because they can run more applications on a server, or even host more applications in the cloud, Flynn believes that Fusion-io can break into the enterprise market. This new product is the first big step.
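Flynn's argument is easier to follow with a toy cost model. Every number below is invented purely for illustration (VM counts, densities and prices are not from Fusion-io or anyone else), and the outcome flips if the assumed consolidation ratio is less favorable. The sketch only shows the shape of the trade-off: pricier storage per server, but fewer servers overall.

```python
import math


def servers_needed(vm_count, vms_per_server):
    """Number of servers required to host vm_count VMs at a given density."""
    return math.ceil(vm_count / vms_per_server)


def fleet_cost(vm_count, vms_per_server, server_cost, storage_cost_per_server):
    """Total hardware cost: each server carries its own storage cost."""
    count = servers_needed(vm_count, vms_per_server)
    return count * (server_cost + storage_cost_per_server)


# Hypothetical scenario: 400 VMs, with Flash allowing a higher VM density.
disk_fleet = fleet_cost(
    vm_count=400, vms_per_server=10, server_cost=8000, storage_cost_per_server=1000
)
flash_fleet = fleet_cost(
    vm_count=400, vms_per_server=25, server_cost=8000, storage_cost_per_server=6000
)

print(disk_fleet)   # 40 servers at 9,000 each
print(flash_fleet)  # 16 servers at 14,000 each
```

Under these assumed figures the Flash fleet costs less in total even though each server's storage costs six times as much, which is the gist of the pitch to enterprise buyers.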

Related research and analysis from GigaOM Pro:

  • Infrastructure Q2: Big data and PaaS gain more momentum
  • The Structure 50: The Top 50 Cloud Innovators
  • Infrastructure Q1: IaaS Comes Down to Earth; Big Data Takes Flight





GigaOM — Tech News, Analysis and Trends


Verify the Scalability of Your Hotel Management Software

Posted by on Saturday, 6 August, 2011


One thing you will want to examine when evaluating hotel management software is the scalability of the system. Most owners want to expand their hotel business without having to purchase a brand-new system each time. Software that can expand and adapt to your needs is important because it will save you money and time.

Expanding and enhancing your hotel PMS

A hotel property management system (PMS) that is very limited could be disastrous if you are thinking about expanding your business. That expansion can be anything from adding more rooms to buying another hotel to form a chain. If your system has a limit on the number of rooms it can handle, you will hit a major obstacle when you try to expand. For this reason it is best to look for property management software that is modular and expandable. Modular systems let you add features later as your business grows. You should also check that the system can handle multiple properties. Some software packages are designed only for operators of a single hotel, which can hurt you if you later need to grow to more than one.

The database your hotel management software uses

The kind of database your hotel management software uses is very important. Some of these programs use older databases such as Access, and these are the ones to avoid: an older database is harder to upgrade and grow than a newer one. A newer option worth considering is a SQL-based database. This type of database is also often easier to work with than older database types, because the SQL language was designed to help people manage their data more effectively. One SQL database you could use is Microsoft SQL Server.

A hotel management system that can support multiple databases

Support for multiple databases is another thing your hotel management system should offer. If you have a large hotel business with over a thousand rooms, an Oracle database is often recommended. This of course does not mean that you cannot use a Microsoft database if you have a large chain or hotel. Your hotel management system should be able to work with a variety of databases, since everyone will want to use what they are comfortable with. Moreover, if your hotel business grows, you may want to switch databases from Microsoft to Oracle without having to buy new software. This is why you should always check what the software you plan to purchase supports.
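Software that supports several databases usually does so by hiding storage behind a common interface, so the rest of the application never talks to a specific database directly. The sketch below illustrates that idea only; the `RoomStore` interface and both classes are hypothetical, and SQLite stands in for a production database such as SQL Server or Oracle so the example can run anywhere.

```python
import sqlite3
from typing import Iterable, Protocol, Tuple


class RoomStore(Protocol):
    """The only storage operations the application is allowed to rely on."""

    def add_room(self, number: str, rate: float) -> None: ...

    def room_count(self) -> int: ...


class InMemoryRoomStore:
    """Trivial backend, handy for tests and demos."""

    def __init__(self) -> None:
        self._rooms = {}

    def add_room(self, number: str, rate: float) -> None:
        self._rooms[number] = rate

    def room_count(self) -> int:
        return len(self._rooms)


class SqliteRoomStore:
    """SQL backend; a different database would get its own class like this."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS rooms (number TEXT PRIMARY KEY, rate REAL)"
        )

    def add_room(self, number: str, rate: float) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO rooms (number, rate) VALUES (?, ?)", (number, rate)
            )

    def room_count(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM rooms").fetchone()[0]


def onboard_rooms(store: RoomStore, rooms: Iterable[Tuple[str, float]]) -> int:
    """Application code: works with any backend that satisfies RoomStore."""
    for number, rate in rooms:
        store.add_room(number, rate)
    return store.room_count()


new_rooms = [("101", 120.0), ("102", 135.0)]
print(onboard_rooms(InMemoryRoomStore(), new_rooms))
print(onboard_rooms(SqliteRoomStore(), new_rooms))
```

When evaluating a product, the practical question is whether the vendor has done this kind of separation, because that is what makes a later move from one database to another a configuration change rather than a new purchase.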

Scalability of hotel management software is very important if you think you may want to expand your business in the future. Software that can change with you will save time and money, since you will not need to purchase new software and have it installed.

 

About the Author

Jacob has been writing articles online for nearly nine years, focusing on computers and technology. His newest website covers how to convert VOB to AVI and helps people find a VOB to AVI converter.


Twitter to open source Hadoop-like tool

Posted by on Friday, 5 August, 2011

Attention webscale aficionados, Twitter says it is planning to open source Storm, its Hadoop-like real-time data processing tool. In a blog post Thursday, the microblogging network said it plans to release the Storm code on Sept. 19 at the Strange Loop event in St. Louis, Mo.

The question is: does the world need another real-time data processing tool? After all, there are already many tools, such as HStreaming (which uses Hadoop), the open source S4 and StreamBase, and the overall analytics market (if you can call it a market) is already fragmented. The Storm code comes from Twitter’s acquisition of BackType last month and seems to be an effort to get folks comfortable parsing data on Twitter.

The post does an excellent job laying out use cases for Storm and hints at more to come. While the code can deal with distributed nodes and huge amounts of data a la Hadoop or MapReduce, Storm handles jobs that are “infinite.” It’s not meant for a data processing job with an end point; it’s suited to streams of data and continual processing. From the post by Nathan Marz:

Here’s a recap of the three broad use cases for Storm:

  • Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  • Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  • Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
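To make the “continuous computation” idea concrete, here is a single-process sketch using plain Python generators. It is not Storm's API and has none of Storm's distribution or fault tolerance; the stage names simply borrow Storm's source-and-processor vocabulary, and the messages are made up. What it does show is the key difference from a batch job: the pipeline emits an updated answer after every item and never needs the stream to end.

```python
from collections import Counter, deque


def hashtag_spout(messages):
    """Source stage: emit hashtags one at a time from a stream of messages."""
    for message in messages:
        for word in message.split():
            if word.startswith("#"):
                yield word.lower()


def trending_bolt(tags, window=5, top_n=2):
    """Processing stage: after each tag, emit the top tags in the last `window` tags."""
    recent = deque(maxlen=window)
    counts = Counter()
    for tag in tags:
        if len(recent) == window:
            # The oldest tag is about to fall out of the window; uncount it.
            evicted = recent[0]
            counts[evicted] -= 1
            if counts[evicted] == 0:
                del counts[evicted]
        recent.append(tag)
        counts[tag] += 1
        yield counts.most_common(top_n)


# Hypothetical messages; in a real deployment this would be an endless feed.
messages = [
    "#storm is out",
    "trying #Storm with #hadoop",
    "#hadoop batch job",
    "#storm again",
]

snapshots = list(trending_bolt(hashtag_spout(messages), window=3, top_n=1))
for snapshot in snapshots:
    print(snapshot)
```

Because both stages are generators, nothing is computed until a consumer asks for the next snapshot, and an endless input works the same way as the four-message list here.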

But wait! There’s more! At the end of the post we are assured that there’s more to Storm than the blog post has even defined, which we can learn more about next month at the Strange Loop event. From the post:

I’ve only scratched the surface on Storm. The “stream” concept at the core of Storm can be taken so much further than what I’ve shown here — I didn’t talk about things like multi-streams, implicit streams, or direct groupings. I showed two of Storm’s main abstractions, spouts and bolts, but I didn’t talk about Storm’s third, and possibly most powerful abstraction, the “state spout”. I didn’t show how you do distributed RPC over Storm, and I didn’t discuss Storm’s awesome automated deploy that lets you create a Storm cluster on EC2 with just the click of a button.

So for those anxious to test out a new method of crunching terabytes of real-time data on the fly, get thee to GitHub! And wait.

Related research and analysis from GigaOM Pro:

  • Defining Hadoop: the Players, Technologies and Challenges of 2011
  • Infrastructure Overview, Q2 2010
  • Big Data Marketplaces Put a Price on Finding Patterns


