Spidercrab
Spidercrab -- news fetching and extraction technology
The designed management scheme is the result of research performed on different technologies. For more information, see section 2, Research.
Before you start using Spidercrab, you must set up its config inside Ocean Don Corleone. Open Corleone's config.json and add the following two entries to the "node_responsibilities" list:
{
"node_responsibilities": [
["spidercrab_master", {
"local": false,
"update_interval_s": 900,
"graph_worker_id": "oceanic_crab"
}],
["spidercrab_slave", {
"local": true,
"graph_worker_id": "oceanic_crab"
}]
]
}
Of course you can set the values as you like, but these should work well for now. Use your own "graph_worker_id" if you are sharing a database with someone else (more details about the config below).
Spidercrab's config and options mechanisms (two different things) are somewhat complicated, but they can be described simply.
Communication is built on a master-slave mechanism, which means the work of our graph workers is divided between master and slave agents. Importantly, Spidercrab operates as a loosely coordinated workflow, which means:
- A master Spidercrab worker is a registered unit inside the database and stores its own config there.
- A slave Spidercrab worker is assigned to a selected master and does its work based on the master's config.
- That means slaves do what the master says, because their config is merged with the master's config.
- Merging means overwriting only those variables that are set in the master's config! (See the sketch after this list.)
- This in turn means that a slave is able to set values for variables that are not defined in the master's config! (They are left to the slave's choice.)
- While the master does its work organizing tasks inside the database, slaves can perform those tasks anytime and from anywhere, as long as the connection allows.
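To make the merge rule concrete, here is a minimal Python sketch (a hypothetical helper, not the actual Spidercrab code) using config keys mentioned in this README:

def merge_configs(master_config, slave_config):
    # Start from the slave's own choices, then let every key defined
    # by the master overwrite the slave's value.
    merged = dict(slave_config)
    merged.update(master_config)
    return merged

master = {'update_interval_s': 900, 'graph_worker_id': 'oceanic_crab'}
slave = {'graph_worker_id': 'my_own_crab', 'do_not_fetch': 1}
print(merge_configs(master, slave))
# {'graph_worker_id': 'oceanic_crab', 'do_not_fetch': 1, 'update_interval_s': 900}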
Load some ContentSources. You can do this, for example, with the example data scripts from the scripts/ directory.
Once you have some data, you can run the master:
cd graph_workers
./spidercrab_master.py
The master will register itself inside the database and start organizing tasks.
Whether or not the master has finished its work, you can run slaves:
cd graph_workers
./spidercrab_slave.py -n 5
In the above case we run 5 slaves.
NOTE: This option (-n NUMBER, --number=NUMBER) can also be set inside the Ocean Don Corleone config under the "number" key.
More useful use cases.
As said in 1.1.2, slaves are able to use their own values for variables that are not defined by the corresponding master. You can set them up in a separate config file and pass it like this:
./spidercrab_slave.py -c path/to/config/file
The config file structure is shown in the spidercrab.json.template file.
The most important option of spidercrab_slave.py is exporting a file of ContentSource nodes as property dictionaries (in short: ContentSource nodes). Slaves do this while picking up tasks from the database and updating the properties of ContentSource nodes (by fetching them from the net, of course).
You can export them simply by running the script this way:
./spidercrab_slave.py -e ../data/my_own_export_file
The generated file can later be used in the database populating process, as described in section 2.1 of Graph database management.
If you need your export quickly (and/or do not need to fetch news), it is useful to turn off news fetching by setting "do_not_fetch": 1 in a config (separate or Corleone), where 1 in this case means "true".
NOTE: This option (-e EXPORT_FILE_NAME, --export-cs-to=EXPORT_FILE_NAME) can also be set inside the Ocean Don Corleone config under the "export_cs_to" key.
If you want to add brand new ContentSources that are not yet present in the database, you should prepare a raw text file of content source URLs (one link per line), as sketched below, or get one from the Ocean Don Corleone server.
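For example, such a sources URLs file could look like this (hypothetical links, one per line):

http://example.com/rss/world-news.xml
http://example.org/feeds/technology.rss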
Adding them to the database and flagging them as pending to be updated can be done as follows:
cd graph_workers
./spidercrab_master.py -s ../data/my_sources_urls
Remember that those nodes have no data yet. Simply run as many slaves as you can to update them.
NOTE: This option (-s SOURCES_URLS_FILE, --sources-urls-file=SOURCES_URLS_FILE) cannot be set inside the Ocean Don Corleone config.
Also consider exporting them to a ContentSource nodes file, which could be useful for later tasks:
cd graph_workers
./spidercrab_slave.py -e ../data/export_of_my_new_sources
This section contains information gathered during research on the news fetching and extraction technologies under consideration.
import.io gives us the ability to extract data from websites (as they advertise: without coding!).
import.io is available as a point-and-click tool, but also as an API for programmers. However, the API in question is a server API, which provides an authenticated client-server connection.
That means we are able to send various queries to the import.io server API, and all processing will be performed on their side. This, I think, needs a discussion as to whether:
- we want to offload our servers by using the import.io server API
- we want to keep everything on our servers by using other software
The first option sounds good, but the question is: what will happen when our system reaches the point of gathering and processing big data? Personally, I am afraid that import.io has its limitations for free users and will request some sort of membership (or partnership?) from us. We should discuss this issue.
import.io is currently in public beta, so time will tell what they will do with their software.
From the technical side, import.io is very user-friendly for both the programmer and the GUI user. (Knowledge base)
The boilerpipe library provides algorithms for detecting and extracting the main textual content of a web page.
Extraction is based on the concepts of Boilerplate Detection using Shallow Text Features, which is a quite effective data mining approach. This can be verified by testing it directly through their web API.
Boilerpipe is licensed under Apache License 2.0 and therefore can be included in our GPLv3 project.
We will use the Python interface to Boilerpipe to extract the important content from HTML pages.
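A minimal sketch of how the Python interface (the boilerpipe package) can be used, assuming a hypothetical article URL:

from boilerpipe.extract import Extractor

# ArticleExtractor implements the shallow-text-features approach mentioned above.
extractor = Extractor(extractor='ArticleExtractor', url='http://example.com/some-article.html')
print(extractor.getText())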
Beautiful Soup is a Python library that provides lower-level HTML/XML extraction tools (than, for example, Boilerpipe). We will use it in more complicated and explicitly HTML/XML-related cases.
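A minimal sketch (assumed usage, not Spidercrab code) of the kind of lower-level extraction Beautiful Soup is good at, here pulling link targets out of an HTML snippet:

from bs4 import BeautifulSoup

html = '<html><body><a href="http://example.com/a">A</a><a href="http://example.com/b">B</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print([a['href'] for a in soup.find_all('a')])
# ['http://example.com/a', 'http://example.com/b']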
Universal Feed Parser is a Python package that comes with high-level tools for parsing various feed standards. It will be used for feed content extraction.
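A minimal sketch (assumed usage, hypothetical feed URL) of parsing a feed with Universal Feed Parser:

import feedparser

# Parse the feed and print its title plus the title/link of every entry.
feed = feedparser.parse('http://example.com/rss.xml')
print(feed.feed.get('title'))
for entry in feed.entries:
    print(entry.get('title'), entry.get('link'))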
PyRSS2Gen is a Python library that produces RSS 2.0 feeds. It is worth considering when we develop the destination feed that we will provide to the Android app.
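A minimal sketch (assumed usage, hypothetical feed contents) of producing an RSS 2.0 feed with PyRSS2Gen:

import datetime
import PyRSS2Gen

rss = PyRSS2Gen.RSS2(
    title='Example feed',                      # hypothetical feed metadata
    link='http://example.com/feed',
    description='Aggregated news items',
    lastBuildDate=datetime.datetime.utcnow(),
    items=[
        PyRSS2Gen.RSSItem(
            title='Example item',
            link='http://example.com/item-1',
            description='A single news entry',
            pubDate=datetime.datetime.utcnow(),
        ),
    ],
)
rss.write_xml(open('feed.xml', 'w'))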
Spycyroll is a Python library that aggregates RSS feeds, hence it will be used for RSS aggregation (merging many RSS feeds into one), if needed.
The Scrapy framework will be used to re-implement a new version of the web crawler.
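A minimal sketch (assumed usage, not the actual crawler) of a Scrapy spider that yields page titles and follows links from a hypothetical start page:

import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com/news']  # hypothetical start page

    def parse(self, response):
        # Yield something for the current page, then follow every link found.
        yield {'url': response.url, 'title': response.css('title::text').get()}
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Such a spider can be run with, for example, scrapy runspider news_spider.py -o items.json.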
