Hello, world
It’s finally time to start a Scrapinghub blog! In the upcoming months we expect to open our private beta to new customers, launch new services, add many new features and continue to contribute to open...
Autoscraping casts a wider net
We have recently started letting more users into the private beta for our Autoscraping service. We’re receiving a lot of applications following the shutdown of Needlebase and we’re increasing our...
Finding Similar Items
This post describes an approach to the problem of finding near duplicates among crawled items and how this was implemented at Scrapinghub. Near duplicate content is everywhere on the web and needs to...
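The post itself details Scrapinghub's implementation; as a minimal illustration of the underlying idea (not their actual code), near-duplicate items can be compared by the Jaccard similarity of their word shingles:

```python
def shingles(text, k=3):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two texts: |A ∩ B| / |A ∪ B| over shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Two scraped product pages differing by one word score well above 0.5,
# while unrelated text scores near zero.
page1 = "Buy the Acme Widget 3000 today for only 19.99 with free shipping"
page2 = "Buy the Acme Widget 3000 now for only 19.99 with free shipping"
print(jaccard(page1, page2))
```

In practice, exact pairwise comparison does not scale to millions of crawled items, which is why sketching techniques such as MinHash are typically layered on top.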
Why MongoDB is a bad choice for storing our scraped data
MongoDB was used early on at Scrapinghub to store scraped data because it’s convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known...
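The kind of schema-free nested record the excerpt alludes to might look like the following sketch (field names are illustrative, not Scrapinghub's actual item format):

```python
import json

# A hypothetical scraped item: nested, and with fields that vary
# from record to record because the schema is not known up front.
item = {
    "url": "http://example.com/product/42",
    "name": "Acme Widget",
    "prices": [{"currency": "USD", "amount": 19.99}],
    "specs": {"weight_kg": 1.2},  # present on some pages, absent on others
}

# Such records serialize to JSON and round-trip without loss,
# which is what made a document store like MongoDB convenient at first.
serialized = json.dumps(item)
restored = json.loads(serialized)
print(restored == item)
```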
Introducing Dash
We’re excited to introduce Dash, a major update to our scraping platform. This release is the final step in migrating to our new storage back end and contains improvements to almost every part of our...
Looking back at 2013
This time last year Pablo and I were chatting about the previous year and what to expect in 2013. I noticed that our team had almost doubled in size in the previous year and we wondered whether that...
Announcing Portia, the open source visual web scraper!
We’re proud to announce the developer release of Portia, our new open source visual scraping tool based on Scrapy. Check out this video: As you can see, Portia allows you to visually configure what’s...
Scrapinghub Crawls the Deep Web
“The easiest way to think about Memex is: How can I make the unseen seen?” — Dan Kaufman, director of the innovation office at DARPA Scrapinghub is participating in Memex, an ambitious DARPA project...