by Rakesh Sharma
Published on 16 January 2012
One of the more interesting lessons I learned at J-school was the importance of data. "Data provides context and understanding to a story," intoned a professor. "And, that is why you must know how to program or query databases."
Of course, he was right.
Data gives meaning to the trends and occurrences around us. What's more, the proliferation of public databases has made data accessible. However, as my professor rightly guessed, not all of us can program or query databases.
How, then, do we access data contained in such troves of information?
This week's review offers a solution. Grepsr is a managed data extraction service that scrapes websites for data. We will look at Grepsr's features and functionality, and see how it can be of use to you.
I had several questions related to the service.
To begin with, what does the service's name mean?
According to Amit Chowdhury, one of the founders, Grepsr is an action or a tool to "grep" (for grab -> extract -> process) data.
My second question was about the meaning of the term "managed" data extraction.
"The main difference is when we say 'data extraction,' we mean the data can be from anywhere (not only websites)," says Amit. "Managed data extraction is like managed hosting, where the customer need not worry about how servers are run. Our concept is pretty similar in that our customers do not need to worry about how many resources are being consumed for their extraction." In other words, Grepsr hides technical details and delivers data. "It is also managed in the sense that the user can schedule extractions and have new data synced directly to various services," adds Amit.
The idea for such a service seemed simple, common, and yet intriguing. So, I asked Amit about the germ of this idea. "The idea came up after we saw some data extraction services and found their process was not streamlined or truly SaaS," says Amit. "Communication happened through emails and the whole process seemed like it could do with a simplified system for writing requirements, data extraction, and delivery."
And, Grepsr was born.
GETTING TO GRIPS WITH GREPSR
Grepsr really is easy. Extracting data is a matter of explaining requirements. You can do this through text, by uploading a file (or snapshot), or by using a browser add-on. The user is encouraged to add as much detail as possible while specifying requirements. The Grepsr staff gets to work immediately after the request is submitted and promises to service your request within a single day.
You can also use the service for sites without public APIs. "We extract publicly available information which Google or any other search engine can always crawl without legal issues," says Amit.
The final data is provided in four main raw formats: CSV, HTML, PDF and XML feeds. "With available formats, users can manage the presentation themselves," says Amit. He adds that they plan to add data presentation tools in the future.
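As a rough illustration of what managing the presentation yourself could look like, here is a minimal Python sketch that reads a CSV export and prints it as an aligned text table. The column names (name, price) and the sample rows are made up for the example; they are not taken from an actual Grepsr export.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def format_table(rows):
    """Render rows as a plain-text table with padded columns."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    # Each column is as wide as its longest value (or its header).
    widths = {h: max(len(h), *(len(r[h]) for r in rows)) for h in headers}
    lines = ["  ".join(h.ljust(widths[h]) for h in headers)]
    for r in rows:
        lines.append("  ".join(r[h].ljust(widths[h]) for h in headers))
    return "\n".join(lines)


# Hypothetical sample data standing in for a downloaded export.
SAMPLE = "name,price\nWidget,9.99\nGadget,24.50\n"

if __name__ == "__main__":
    print(format_table(load_rows(SAMPLE)))
```

For a real export, the same two functions would be fed the contents of the downloaded file instead of SAMPLE.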
While similar services charge per record or for blocks of records, Grepsr charges flat fees. "While researching this service, we found that pricing was expensive and cumbersome," says Amit. This means that, in addition to paying per-record prices, customers had to get a quote from individual providers. That said, the service has a strict control mechanism on how often customers can run customized crawlers. "How often crawlers are run depends on how much load the crawler might induce on the target website," says Amit. As an illustration, he says that if the data set is small, say only a couple of hundred records, the crawler could run as frequently as every 30 minutes to an hour. "However, most of our customers just want the data extracted once every week."
THE BASICS: WHAT DOES IT LOOK LIKE?
In keeping with the theme of simplicity, the interface is clean and simple, with each tab corresponding to a particular system view including your projects and extractors.
The interface also allows users to visually mark sections on screenshots to further specify their extraction requirements. Another interesting feature is that when data is extracted as per schedule, it can be easily synced to FTP, Dropbox, Google Docs or even pinged to the user's application as an HTTP POST.
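To give a sense of what sits on the receiving end of such an HTTP POST, here is a minimal sketch using only Python's standard library. The article does not describe the payload Grepsr sends, so the JSON body assumed here, along with the names parse_records, ExtractionHandler and serve, is purely illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # in-memory store; a real application would persist the data


def parse_records(body):
    """Decode a JSON request body into a list of records.

    Assumption: the body is JSON, either one object or a list of objects.
    """
    data = json.loads(body.decode("utf-8"))
    return data if isinstance(data, list) else [data]


class ExtractionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            records = parse_records(self.rfile.read(length))
        except ValueError:  # covers malformed JSON and bad UTF-8
            self.send_response(400)
            self.end_headers()
            return
        RECEIVED.extend(records)
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def serve(port=8000):
    """Listen on localhost and collect posted records until interrupted."""
    HTTPServer(("127.0.0.1", port), ExtractionHandler).serve_forever()
```

Calling serve() starts the listener; each scheduled extraction that posts to it would then append its records to RECEIVED.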
The folks at Grepsr promise a turnaround time of one day to service your requests. "Typically, we ask for a turnaround time of one day," says Amit. "However, we usually manage it in much less (usually a few hours)." He explains that this is because the Grepsr team has designed its backend in such a way that each extraction is "literally fill in the blanks." In addition, the load for processing requests is distributed across multiple servers. Of course, tweaks are required to customize data formats and sources. However, with the added buffer time, quick results are almost guaranteed.
IS IT FOR YOU?
If you are looking for a simple data extraction tool and process, then this tool is definitely for you. It simplifies and unlocks the power of data for you.