How we created uCrawler 1.0 — the AI-based automatic news scraper

The Beginning

In November 2017, we launched the News Robot news aggregator website.

News Robot is an automated system that gathers and analyzes news stories. It scrapes data from online media outlets, isolates the top stories in various categories, and groups them by topic. The robot is fully autonomous. Neither we nor anyone else intervenes in its daily operations: it does everything by itself, including picking the top stories of the day or the breaking news of the last few hours. It is much more complex than a regular aggregator, with machine learning and neural networks at work under the hood.

We approached the Robot as a personal pet project, in part intended to test out new technologies. We implemented it entirely on our own, from front to back end, from microservices to machine learning algorithms, from cloud computing solutions to social network integration.

The most interesting part of this project for us, in a technical sense, was the semantic analyzer, which uses the fastText library by Facebook. FastText allowed us to build semantic vectors for individual words and entire phrases, then compare them and draw conclusions about how similar they are. For example, at some point News Robot started to realize that in certain contexts, "impeachment" and "resignation" are very close to each other.
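As a rough illustration of what comparing semantic vectors means, here is a minimal sketch in plain Python. It is not the News Robot code and does not call fastText: the three-dimensional vectors are invented for the example, and averaging word vectors is just one common way to get a phrase vector, assumed here for simplicity.

```python
import math

# Toy vectors invented for illustration only. Real word vectors are learned
# from large text corpora and have far more dimensions.
TOY_VECTORS = {
    "impeachment": [0.9, 0.1, 0.3],
    "resignation": [0.8, 0.2, 0.4],
    "football": [0.1, 0.9, 0.2],
}


def phrase_vector(words, vectors):
    """Average the vectors of the known words in a phrase."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words in phrase")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


close = cosine_similarity(TOY_VECTORS["impeachment"], TOY_VECTORS["resignation"])
far = cosine_similarity(TOY_VECTORS["impeachment"], TOY_VECTORS["football"])
print(f"impeachment vs resignation: {close:.2f}")
print(f"impeachment vs football:    {far:.2f}")
```

With these toy numbers, the first pair scores much higher than the second, which is the kind of signal a semantic analyzer can use to decide that two stories are about the same thing.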

Another prominent feature of the project is how modest its resource demands are. To achieve this, we had to completely rewrite several libraries (including fastText itself) and forgo some very handy cloud services. The results were worth it, though: the entire system runs on one virtual machine with 1 CPU and 1 GB of RAM.

One last notable characteristic of the News Robot is that it is not tied to any specific language. It works equally well in English, French, Russian, Arabic, and other languages.

A New Idea

When browsing through various articles and comment threads, we noticed that many people have tried building aggregator systems like ours. Sure, there were differences in subject matter or language, but overall, our goals were very similar. So we decided to turn the News Robot into a tool that could help them achieve these goals. Basically, we would develop the existing system into a platform that could be deployed and set up for any arbitrary set of sources (any topic, any language). Best of all, it really seemed doable.

Firstly, as we've mentioned, the robot doesn't need any serious computing power to run. This meant that we could rent an affordable, dedicated virtual machine in the cloud for every new aggregator.

Secondly, the robot is not tailored to any single language or set of topics, and can easily be set up to scan, for example, cryptocurrency-themed news websites in French. It would be completely pointless to try to build dozens of such platforms on our own. It isn't enough to simply put a service out there; you also have to support it, which is an entirely different matter. What we could do, though, is help people gather the data they need, put it through our algorithms, and pass the results back to them.

To set up a platform like this, we had to seriously rework News Robot's existing management interface. Basically, we had to make it accessible enough to be used by someone other than us. We also significantly streamlined the process of adding new sources. Before, you had to specify at least 5 selectors in the XPath query language (you can't always expect an RSS feed, and clustering needs the full text of each article).

Today, our crawler has algorithms that let it determine the layout of the page content automatically. It can identify the main title, illustrations, and full text of the article, and it can remove extraneous text such as "follow us on social networks". In half of the cases, there is no need to specify XPath queries at all! In the rest, adding one or two queries is all it takes for the data to be scraped correctly. As of now, we're able to add 50 to 100 sources in one evening.
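To give a feel for how a page layout can be guessed without hand-written selectors, here is a deliberately simple heuristic sketch using only the Python standard library. It is an assumption-laden toy, not the crawler's actual algorithm: it takes the first-level heading as the title and treats the container element holding the most paragraph text as the article body. The class name, the sample page, and the list of container tags are all made up for the example, and the markup is assumed to be reasonably well formed.

```python
from html.parser import HTMLParser

# Tags treated as candidate "blocks" of the page (an arbitrary choice).
CONTAINERS = {"div", "article", "section", "main"}


class MainTextFinder(HTMLParser):
    """Collects <p> text per innermost container and the text of <h1>."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open containers (innermost last)
        self.blocks = []       # one list of paragraph strings per container
        self.current_p = None  # text pieces of the paragraph being read
        self.in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.blocks.append([])
            self.stack.append(self.blocks[-1])
        elif tag == "p":
            self.current_p = []
        elif tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "p" and self.current_p is not None:
            text = " ".join("".join(self.current_p).split())
            if text and self.stack:
                self.stack[-1].append(text)  # credit the innermost container
            self.current_p = None
        elif tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.title += data
        elif self.current_p is not None:
            self.current_p.append(data)


def extract_article(html):
    """Return (title, paragraphs) for the container with the most text."""
    finder = MainTextFinder()
    finder.feed(html)
    finder.close()
    best = max(finder.blocks, key=lambda ps: sum(len(p) for p in ps), default=[])
    return finder.title.strip(), best


SAMPLE = """
<html><body>
<div class="header"><h1>Parliament votes on new budget</h1></div>
<div class="story">
  <p>The parliament approved the budget after a long debate on Tuesday.</p>
  <p>Opposition parties said they would challenge several provisions.</p>
</div>
<div class="footer"><p>Follow us on social networks</p></div>
</body></html>
"""

title, paragraphs = extract_article(SAMPLE)
print(title)
for paragraph in paragraphs:
    print("-", paragraph)
```

On the sample page, the footer line is dropped simply because it lives in a different, much shorter block. Real pages are messier, which is why a fallback such as one or two manual XPath queries remains useful.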

A small note: the crawler is very considerate when working with sources! It doesn't run too often, respects robots.txt, makes long pauses between page requests, caches the responses, and doesn't fetch the same data again. Probably because of that, it has never been banned. All of this also helps with speed: the crawler is very fast, even with some "sluggish" websites in its list.
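The polite behaviour described above can be sketched in a few lines of Python. This is a hypothetical illustration, not uCrawler's code: the `PoliteFetcher` class, the `example-bot` user agent, and the 5-second default pause are assumptions. The robots.txt check uses the standard `urllib.robotparser` module, and the actual download function is passed in, so the example runs without network access.

```python
import time
import urllib.robotparser


class PoliteFetcher:
    """Checks robots.txt, pauses between requests, and caches responses."""

    def __init__(self, robots_txt, fetch, user_agent="example-bot",
                 delay_seconds=5.0, sleep=time.sleep):
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        self.fetch = fetch              # callable: url -> response body
        self.user_agent = user_agent
        self.delay_seconds = delay_seconds
        self.sleep = sleep
        self.cache = {}                 # url -> body already downloaded
        self.last_request = None        # monotonic time of the last download

    def get(self, url):
        if url in self.cache:
            return self.cache[url]      # never fetch the same page twice
        if not self.robots.can_fetch(self.user_agent, url):
            return None                 # disallowed by robots.txt
        if self.last_request is not None:
            wait = self.delay_seconds - (time.monotonic() - self.last_request)
            if wait > 0:
                self.sleep(wait)        # pause between page requests
        body = self.fetch(url)
        self.last_request = time.monotonic()
        self.cache[url] = body
        return body


ROBOTS = """User-agent: *
Disallow: /private/
"""

calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    calls.append(url)
    return "<html>page for " + url + "</html>"


fetcher = PoliteFetcher(ROBOTS, fake_fetch, delay_seconds=0.0)
first = fetcher.get("https://example.com/news/1")
again = fetcher.get("https://example.com/news/1")      # served from the cache
blocked = fetcher.get("https://example.com/private/x")  # refused by robots.txt
print(len(calls), blocked)
```

The cache is also what keeps such a crawler quick: a page that was already downloaded costs nothing on the next run, no matter how slow the website is.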

Besides the management interface, there were a lot of other issues we had to solve, DevOps for example. We had to quickly learn how to deploy a fully configured, ready-to-work VM in the cloud. This process is now automated. Still, we only start it after a specific e-mail request.

We will leave out the rest of the problems and challenges that came up during development, and move straight on to the final result.

uCrawler 1.0

We named our new platform uCrawler. It was developed relatively quickly, but is still being continuously improved. We dedicate what time we can spare to work on the project.

uCrawler is offered on a SaaS basis. At the outset, the client is given access to a free, week-long demo. Adding a source is a one-click affair: you pick it from a pre-existing list. If a source is missing from the list, we promptly add it at the client's request.

As soon as the crawler starts its work, the client can begin retrieving the results. There are several options for this: you can use an API to download the data in JSON or XML format, set up an RSS feed, or simply browse the results on an auto-generated static webpage.

What do the results look like, exactly? They take the form of a clustered, ranked newsfeed built from the list of sources, containing all the data collected by the crawler, including full-text articles and images.
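To make "clustered and ranked" more concrete, here is a toy sketch. It is not the algorithm uCrawler uses: instead of semantic vectors it compares the word overlap of titles (Jaccard similarity), groups articles greedily, and ranks the groups by how many articles they contain. The field names, the sample articles, and the 0.3 threshold are assumptions made for the example.

```python
def tokens(title):
    """Lower-cased words longer than 3 characters, without punctuation."""
    return {w.strip(".,!?:;\"'").lower() for w in title.split() if len(w) > 3}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_feed(articles, threshold=0.3):
    """Greedily group articles with similar titles; biggest groups first."""
    clusters = []  # each entry: {"tokens": set of words, "articles": list}
    for article in articles:
        words = tokens(article["title"])
        for cluster in clusters:
            if jaccard(words, cluster["tokens"]) >= threshold:
                cluster["articles"].append(article)
                cluster["tokens"] |= words
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append({"tokens": set(words), "articles": [article]})
    clusters.sort(key=lambda c: len(c["articles"]), reverse=True)
    return [c["articles"] for c in clusters]


ARTICLES = [
    {"source": "a.example", "title": "Central bank raises interest rates again"},
    {"source": "b.example", "title": "Local team wins championship final"},
    {"source": "c.example", "title": "Interest rates raised by central bank"},
]

feed = cluster_feed(ARTICLES)
for rank, group in enumerate(feed, start=1):
    print(rank, [a["source"] for a in group])
```

Here the story covered by two sources comes first, and the story covered by one source comes second. A production system would also weigh freshness and the reputation of each source, which this sketch ignores.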

We take full responsibility for keeping the crawler working and stable. We realize that many clients want to output uCrawler's data directly to their websites, or publish it immediately in their News Widgets or Telegram channels. That's why we monitor all working processes on every virtual machine under our care and identify all "troublesome" sources. Recently, Cisco devices came under attack, which had an immediate impact on the accessibility of a number of websites. We learned about the issue as soon as it surfaced; every instance of uCrawler continued operating normally, despite losing some of its sources.

As soon as we started deploying the first demo VMs, we began receiving lots of suggestions for improving our feature set. Our resources are limited, though, so we can't realistically implement every good idea that comes up. Nevertheless, we started picking out the most frequent suggestions, to at least try to implement those. For instance, we added an option to filter the newsfeed by keywords, a feature that turned out to be very popular.

uCrawler lets you create an unlimited number of filters. Each filter can be tweaked independently and has its own set of keywords. The output of a filter is clustered in the same way as the main newsfeed, and is available through the same channels: you can download it via the API as JSON or XML, set up an RSS feed, or open a generated static webpage. The keyword search is implemented using Elasticsearch, which we install into uCrawler if a client needs the filtering feature. This approach forces us to use more expensive virtual machines (Elasticsearch is written in Java and needs considerably more memory), but it gives us nice results.
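For readers unfamiliar with Elasticsearch, a keyword filter of this kind could be expressed as a search request body along the following lines. This is only a sketch under assumptions: the function name and the `title` and `text` field names are hypothetical, and nothing here reflects uCrawler's actual index layout. The `bool`, `should`, `minimum_should_match`, and `multi_match` constructs are standard parts of the Elasticsearch query DSL.

```python
import json


def keyword_filter_query(keywords, fields=("title", "text"), size=50):
    """Build a search body that matches articles mentioning any keyword."""
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "bool": {
                # One clause per keyword; each looks in all given fields.
                "should": [
                    {"multi_match": {"query": kw, "fields": list(fields)}}
                    for kw in keywords
                ],
                # At least one keyword clause has to match.
                "minimum_should_match": 1,
            }
        },
    }


body = keyword_filter_query(["bitcoin", "ethereum"])
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request against the index that holds the articles. Since every filter is just a different keyword list, creating many independent filters costs nothing extra.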

We decided to build on our success with Elasticsearch. uCrawler only keeps data for the last few days and cannot be used for long-term storage, but it can output the clustered data into Elasticsearch. In turn, Elasticsearch is able to do some interesting things with aggregation. We capitalized on this and started a test run of our own little media analytics department. We taught uCrawler to fetch data from various counters on social networks (e.g. comments, reposts, likes) and compare it against the data from news websites. We also set up a dedicated website that lets you generate reports for any query on the fly. At the moment, this feature is undergoing testing while we put together the minimum necessary data set.

We have some other interesting developments in store, too, such as "on-the-fly" text translation, automated rewriting, and primary source identification… You will probably come up with something else. In that case, be sure to contact us! We always love hearing new ideas.