Basic tools for scraping data
Data scraping development is a technical art that starts with big, blunt instruments (to write, capture, parse, and store data) and later leads to surgical slicing and dicing.
Below are a few handy tools and techniques I've gathered. This is a working document that will change over time.
Narrow Google Searches
I use this trick all the time for my blog. Let's say you are looking for an article about Ruby Rake tasks on my blog. To narrow the search to my site, type this into Google:
rake task site:chrisjmendez.com
This returns only articles from https://www.chrisjmendez.com/ that match the search terms "rake task".
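If you run this kind of search often, you can build the query in a script. Here is a minimal Python sketch. It assumes the standard https://www.google.com/search?q= URL format, and the function name site_search_url is made up for illustration.

```python
from urllib.parse import urlencode


def site_search_url(terms: str, site: str) -> str:
    """Build a Google search URL restricted to a single site."""
    # The site: operator limits results to the given domain.
    query = f"{terms} site:{site}"
    # urlencode escapes spaces and the colon so the URL is safe to paste or open.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("rake task", "chrisjmendez.com"))
# https://www.google.com/search?q=rake+task+site%3Achrisjmendez.com
```

Paste the printed URL into a browser to run the same narrowed search.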
Explicit Google Searches
This example is really useful while scraping Twitter. Suppose your social media manager uses bit.ly to shorten links on Twitter. Let's say that during a Downtown LA campaign, she and her team posted a handful of Tweets with the hashtag "#dtla2017". Weeks later, you're trying to audit the tweets and don't want to bug everyone with your inquiry. Here's how to search within Twitter for anything with a bit.ly URL and a reference to the keyword "#dtla2017".
Step 1
Suppose you want to track a Twitter contest promoted by a radio station that used bit.ly or goo.gl shortened links. You can run either of these queries in Google Search:
Example 1
site:twitter.com intext:bit.ly "#dtla2017 *"
Example 2
site:twitter.com intext:bit.ly "classicalkusc *"
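The two examples above follow the same pattern, so they are easy to generate. Here is a rough Python sketch that assembles the query for any shortener and keyword, then URL-encodes it. The helper names are hypothetical, and the search URL format is an assumption. Encoding matters here because a raw "#" in a URL would be read as a fragment and the hashtag would never reach Google.

```python
from urllib.parse import quote_plus


def twitter_link_query(shortener: str, keyword: str) -> str:
    """Assemble a query for tweets containing a shortened link and a keyword."""
    # Mirrors the examples above: site:, intext:, and a quoted phrase with a wildcard.
    return f'site:twitter.com intext:{shortener} "{keyword} *"'


def search_url(query: str) -> str:
    """Turn a query into a Google search URL (URL format assumed)."""
    # quote_plus escapes "#", quotes, and spaces so the whole query survives.
    return "https://www.google.com/search?q=" + quote_plus(query)


for shortener in ("bit.ly", "goo.gl"):
    query = twitter_link_query(shortener, "#dtla2017")
    print(query)
    print(search_url(query))
```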
Step 2
Once you have something working, the next step is to create a Google Custom Search that converts your search results into an RSS feed.
Step 3
You can time-box your search (and feed) by adjusting the date parameter dateRestrict=.
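To make the date parameter concrete, here is a hedged Python sketch that builds a request URL with dateRestrict set. It assumes you are calling the Custom Search JSON API at https://www.googleapis.com/customsearch/v1, which returns JSON rather than RSS, so it may differ from the setup described above. The API key and search engine ID are placeholders, and custom_search_url is a made-up helper name. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def custom_search_url(query: str, api_key: str, engine_id: str,
                      date_restrict: str = "w1") -> str:
    """Build a Custom Search request URL limited to a recent time window."""
    params = {
        "key": api_key,        # placeholder: your own API key
        "cx": engine_id,       # placeholder: your custom search engine ID
        "q": query,
        # dateRestrict takes d, w, m, or y plus a number, e.g. d7 = past 7 days.
        "dateRestrict": date_restrict,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(custom_search_url('site:twitter.com intext:bit.ly "#dtla2017 *"',
                        api_key="YOUR_API_KEY",
                        engine_id="YOUR_ENGINE_ID",
                        date_restrict="m1"))
```

Changing date_restrict from "m1" to "d7" would narrow the window from the past month to the past week.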
Google Alerts
Google Alerts is still a great way to get notifications based on keywords you specify. This is especially useful when monitoring a business competitor's moves or tracking your name online.
LiveAPI
LiveAPI is pretty new, but it shows a lot of promise. It's a tool designed to help turn any public data into an API. Read this article by the brilliant @melissjs.
Yahoo Pipes Clones
Yahoo Pipes was an incredible piece of software, and although it's no longer in production, there are a few clones worth looking into.
- Pipes Digital is promising.
- Superfeedr is another good alternative.
IFTTT
You can use If This Then That to scrape eBay and other sources.
- Data scrape eBay
- Data scrape Craigslist
- Data scrape Twitter
- Data scrape SongKick
- Data scrape stock quotes
- Data scrape the Scoop.it feed focused on Artist Opportunities and publish it to Pocket through IFTTT.
Feedity
Feedity provides a service to scrape web pages into feeds.
Google API RSS
The Google API RSS tool helps you create RSS feeds for Google search results.
Google Spreadsheets
You can go one step beyond the Google API and start screen scraping using Google Spreadsheets.
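The usual way to scrape from a spreadsheet is an import formula typed into a cell. The post does not name a specific function, so treat IMPORTXML here as an assumption. This small Python sketch only generates the formula text for you to paste into a cell. The helper name and the "//h2" XPath are illustrative.

```python
def importxml_formula(url: str, xpath: str) -> str:
    """Return the text of an IMPORTXML formula for a spreadsheet cell."""
    def quote(text: str) -> str:
        # Spreadsheet string literals escape a double quote by doubling it.
        return '"' + text.replace('"', '""') + '"'

    # Some locales use ";" instead of "," as the argument separator.
    return f"=IMPORTXML({quote(url)}, {quote(xpath)})"


print(importxml_formula("https://www.chrisjmendez.com/", "//h2"))
# =IMPORTXML("https://www.chrisjmendez.com/", "//h2")
```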