We named our platform uCrawler. It was developed relatively quickly, but is still being continuously improved: we dedicate whatever time we can spare to working on the project.
uCrawler is offered on a SaaS basis. At the outset, the client is given access to a free, week-long demo. Adding sources is a one-click affair: you simply pick them from a pre-existing list. If a source is missing from the list, we promptly add it at the client's request.
As soon as the crawler starts its work, the client can begin pulling the results for their own use. There are several options for this: you can download the data through an API in JSON or XML format, set up an RSS feed, or simply browse the results on an auto-generated static webpage.
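As a rough illustration, fetching the clustered feed over the API might look like the sketch below. The endpoint URL, the `token` parameter, and the response fields are all hypothetical stand-ins, not the real uCrawler API:

```python
import requests

# Hypothetical uCrawler API endpoint and access token; the real
# endpoint, parameters, and response layout may differ.
API_URL = "https://demo.ucrawler.example/api/news"
API_TOKEN = "your-access-token"

def fetch_feed(fmt="json"):
    """Download the clustered newsfeed in JSON (or XML) format."""
    resp = requests.get(
        API_URL,
        params={"token": API_TOKEN, "format": fmt},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json() if fmt == "json" else resp.text

# Each cluster groups several articles covering the same story.
for cluster in fetch_feed().get("clusters", []):
    print(cluster["title"], "-", len(cluster["articles"]), "articles")
```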
What do the results look like, exactly? Like a clustered, ranked newsfeed built from the list of sources, containing all the data collated by the crawler, including full article texts and images. Here are some results from aggregating IT news:
We take full responsibility for keeping the crawler working and stable. We realize that many clients want to pipe uCrawler's data directly to their website, or publish it immediately on their Telegram channel (using IFTTT). That's why we monitor all working processes on every virtual machine under our care and identify all "troublesome" sources. Recently, Cisco devices came under attack, which had an immediate impact on the accessibility of a number of websites. We learned about the issue as soon as it surfaced; every instance of uCrawler continued operating normally, despite losing some of its sources.
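To give a feel for what spotting "troublesome" sources involves, here is a simplified sketch of a periodic availability check. The source list, thresholds, and reporting are made up for illustration; our actual monitoring is more involved:

```python
import requests

# Hypothetical list of configured sources; in practice this would come
# from the instance's own configuration.
SOURCES = [
    "https://news.example.com/rss",
    "https://tech.example.org/feed",
]

def check_sources(sources, timeout=10):
    """Flag sources that fail to respond so they can be reviewed."""
    troublesome = []
    for url in sources:
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                troublesome.append((url, f"HTTP {resp.status_code}"))
        except requests.RequestException as exc:
            troublesome.append((url, str(exc)))
    return troublesome

for url, reason in check_sources(SOURCES):
    print(f"source unreachable: {url} ({reason})")
```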
As soon as we started deploying the first demo VMs, we began receiving lots of suggestions for improving our feature set. Our resources are limited, though, so we can't realistically implement every good idea that comes up. Nevertheless, we selected the most frequent suggestions and tried to implement at least those. For instance, we added an option to filter the newsfeed by keywords, a feature that turned out to be very popular.
uCrawler lets you create an unlimited number of filters. Each filter can be tweaked independently, with its own set of keywords, and its output is clustered in the same way as the main newsfeed. As with the main feed, you can download the data through an API as JSON or XML, set up an RSS feed, or browse a generated static webpage. The keyword search is implemented using Elasticsearch, which we install into a uCrawler instance if the client needs the filtering feature. This approach forces us to use more expensive virtual machines (since Elasticsearch is written in Java), but the quality of the results justifies it.
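For the curious, a keyword filter boils down to an ordinary full-text query against the index. A minimal sketch, assuming a local Elasticsearch node and a hypothetical `news` index with a `text` field (the real mapping inside uCrawler may differ):

```python
import requests

# Hypothetical index and field names.
ES_URL = "http://localhost:9200/news/_search"

def filter_feed(keywords):
    """Return documents matching any of the filter's keywords."""
    query = {
        "query": {
            "bool": {
                # One full-text match clause per keyword; a document
                # qualifies if at least one keyword matches.
                "should": [{"match": {"text": kw}} for kw in keywords],
                "minimum_should_match": 1,
            }
        },
        "size": 20,
    }
    resp = requests.post(ES_URL, json=query, timeout=30)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

for doc in filter_feed(["Cisco", "vulnerability"]):
    print(doc.get("title"))
```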
We decided to build further on this success with Elasticsearch. uCrawler itself only keeps data for the last few days and cannot be used for long-term storage, but it can output the clustered data into Elasticsearch, which in turn can do some interesting things with aggregations. We capitalized on this and started a test run of our own little media analytics department. We taught uCrawler to fetch data from various counters on social networks (e.g. comments, reposts, likes) and compare them against the data from news websites. We also set up a dedicated website that generates reports for any query on the fly. At the moment, this feature is undergoing testing while we put together the minimum necessary data set.
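To give a flavor of what such a report might involve, an Elasticsearch aggregation can, for example, sum up social engagement per day. The index and field names below are again hypothetical, and the query assumes Elasticsearch 7 or later:

```python
import requests

# Hypothetical index holding articles with social counters attached
# by the crawler; real field names inside uCrawler may differ.
ES_URL = "http://localhost:9200/news/_search"

query = {
    "size": 0,  # we only want the aggregation buckets, not the documents
    "aggs": {
        "per_day": {
            "date_histogram": {
                "field": "published_at",
                "calendar_interval": "day",
            },
            "aggs": {
                "total_likes": {"sum": {"field": "social.likes"}},
                "total_reposts": {"sum": {"field": "social.reposts"}},
            },
        }
    },
}

resp = requests.post(ES_URL, json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"],
          "likes:", bucket["total_likes"]["value"],
          "reposts:", bucket["total_reposts"]["value"])
```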
We have some other interesting developments in store, too, like "on-the-fly" text translation, automated rewriting, and primary source identification… You may well come up with something else; in that case, be sure to contact us. We always love hearing new ideas.