Channel: The Scrapinghub Blog » Shane Evans
Browsing all 8 articles

Hello, world

It’s finally time to start a Scrapinghub blog! In the upcoming months we expect to open our private beta to new customers, launch new services, add many new features and continue to contribute to open...

Autoscraping casts a wider net

We have recently started letting more users into the private beta for our Autoscraping service. We’re receiving a lot of applications following the shutdown of Needlebase and we’re increasing our...

Finding Similar Items

This post describes an approach to the problem of finding near duplicates among crawled items and how this was implemented at Scrapinghub. Near duplicate content is everywhere on the web and needs to...

Why MongoDB is a bad choice for storing our scraped data

MongoDB was used early on at Scrapinghub to store scraped data because it’s convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known...

Introducing Dash

We’re excited to introduce Dash, a major update to our scraping platform. This release is the final step in migrating to our new storage back end and contains improvements to almost every part of our...

Looking back at 2013

This time last year Pablo and I were chatting about the previous year and what to expect in 2013. I noticed that our team had almost doubled in size in the previous year and we wondered could that...

Announcing Portia, the open source visual web scraper!

We’re proud to announce the developer release of Portia, our new open source visual scraping tool based on Scrapy. Check out this video: As you can see, Portia allows you to visually configure what’s...

Scrapinghub Crawls the Deep Web

“The easiest way to think about Memex is: How can I make the unseen seen?” (Dan Kaufman, director of the innovation office at DARPA)

Scrapinghub is participating in Memex, an ambitious DARPA project...
