Spidercrab
Spidercrab -- news fetching and extraction technology
The designed management scheme is the result of research performed on different technologies. For more information, see section 2, Research.
Before you start using Spidercrab, you must set up its config inside Ocean Don Corleone. Open Corleone's config.json and add the following two entries to the "node_responsibilities" list:
{
"node_responsibilities": [
["spidercrab_master", {
"local": false,
"update_interval_s": 900,
"graph_worker_id": "oceanic_crab"
}],
["spidercrab_slave", {
"local": true,
"graph_worker_id": "oceanic_crab"
}]
]
}
Of course you can set the values as you like, but these should work well for now. Use your own "graph_worker_id" if you are sharing a database with someone else (more details about the config below).
Spidercrab's config and options mechanisms (two different things) are somewhat complicated, but they can be described simply.
Communication is built on a master-slave mechanism, which means the work of our graph workers is divided between master and slave agents. Importantly, Spidercrab operates as a loosely coordinated workflow, which means:
- A master Spidercrab worker is a registered unit inside the database and stores its own config there.
- A slave Spidercrab worker is assigned to a selected master and does its work based on the master's config.
- That means slaves do what the master says, because their config is merged with the master's config.
- Merging means overwriting only those variables that are set in the master's config! (See the sketch after this list.)
- This in turn means that a slave is able to set values for variables that are not defined in the master's config! (They are left to the slave's choice.)
- While the master does its work organizing tasks inside the database, slaves can perform those tasks anytime and from anywhere, as long as the connection allows.
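To make the merge rule concrete, here is a minimal Python sketch (a hypothetical helper, not the actual Spidercrab code) using config keys mentioned in this README:

def merge_configs(master_config, slave_config):
    # Start from the slave's own choices, then let every key defined
    # by the master overwrite the slave's value.
    merged = dict(slave_config)
    merged.update(master_config)
    return merged

master = {'update_interval_s': 900, 'graph_worker_id': 'oceanic_crab'}
slave = {'graph_worker_id': 'my_own_crab', 'do_not_fetch': 1}
print(merge_configs(master, slave))
# {'graph_worker_id': 'oceanic_crab', 'do_not_fetch': 1, 'update_interval_s': 900}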
Load some ContentSources. You can do this, for example, with the example data scripts from the scripts/ directory.
Once you have some data, you can run the master:
cd graph_workers
./spidercrab_master.py
The master will register itself inside the database and start organizing tasks.
Whether or not the master has finished its work, you can run slaves:
cd graph_workers
./spidercrab_slave.py -n 5
In the above case we run 5 slaves.
NOTE: This option (-n NUMBER, --number=NUMBER) can also be set inside the Ocean Don Corleone config under the "number" key.
More useful use cases.
As said in 1.1.2, slaves are able to use their own values for variables that are not defined by the corresponding master. You can set them up in a separate config file and pass it like this:
./spidercrab_slave.py -c path/to/config/file
The config file structure is shown in the spidercrab.json.template file.
The most important option of spidercrab_slave.py is exporting a file of ContentSource nodes as property dictionaries (in short: ContentSource nodes). Slaves do this while picking up tasks from the database and updating the properties of ContentSource nodes (by fetching them from the net, of course).
You can export them simply by running the script this way:
./spidercrab_slave.py -e ../data/my_own_export_file
The generated file can later be used in the database populating process, as described in section 2.1 of Graph database management.
If you need your export quickly (and/or do not need to fetch news), it is useful to turn off news fetching by setting "do_not_fetch": 1 in a config (separate or Corleone), where 1 in this case means "true".
NOTE: This option (-e EXPORT_FILE_NAME, --export-cs-to=EXPORT_FILE_NAME) can also be set inside the Ocean Don Corleone config under the "export_cs_to" key.
If you want to add brand new ContentSources that are not yet present in the database, you should prepare a raw text file of content source URLs (one link per line), as sketched below, or get one from the Ocean Don Corleone server.
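For example, such a sources URLs file could look like this (hypothetical links, one per line):

http://example.com/rss/world-news.xml
http://example.org/feeds/technology.rss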
Adding them to the database and flagging them as pending to be updated can be done as follows:
cd graph_workers
./spidercrab_master.py -s ../data/my_sources_urls
Remember that those nodes have no data yet. Simply run as many slaves as you can to update them.
NOTE: This option (-s SOURCES_URLS_FILE, --sources-urls-file=SOURCES_URLS_FILE) cannot be set inside the Ocean Don Corleone config.
Also consider exporting them to a ContentSource nodes file, which could be useful for later tasks:
cd graph_workers
./spidercrab_slave.py -e ../data/export_of_my_new_sources
This section contains information gathered during research on the news fetching and extraction technologies under consideration.
import.io gives us the ability to extract data from websites (as they advertise: without coding!).
import.io is available as a point-and-click tool, but also as an API for programmers. However, the API in question is a server API, which provides an authenticated client-server connection.
That means we are able to send various queries to the import.io server API, and all processing will be performed on their side. This, I think, needs a discussion as to whether:
- we want to offload our servers by using the import.io server API
- we want to keep everything on our servers by using other software
The first option sounds good, but the question is: what will happen when our system reaches the point of gathering and processing big data? Personally, I am afraid that import.io has its limitations for free users and will request some sort of membership (or partnership?) from us. We should discuss this issue.
import.io is currently in public beta, so time will tell what they will do with their software.
From the technical side, import.io is very user-friendly for both the programmer and the GUI user. (Knowledge base)
The boilerpipe library provides algorithms for detecting and extracting the main textual content of a web page.
Extraction is based on the concepts of Boilerplate Detection using Shallow Text Features, which is a quite effective data mining approach. This can be verified by testing it directly through their web API.
Boilerpipe is licensed under Apache License 2.0 and therefore can be included in our GPLv3 project.
We will use the Python interface to Boilerpipe to extract the important content from HTML pages.
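A minimal sketch of how the Python interface (the boilerpipe package) can be used, assuming a hypothetical article URL:

from boilerpipe.extract import Extractor

# ArticleExtractor implements the shallow-text-features approach mentioned above.
extractor = Extractor(extractor='ArticleExtractor', url='http://example.com/some-article.html')
print(extractor.getText())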
Beautiful Soup is a Python library that provides lower-level HTML/XML extraction tools (than, for example, Boilerpipe). We will use it in more complicated and explicitly HTML/XML-related cases.
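A minimal sketch (assumed usage, not Spidercrab code) of the kind of lower-level extraction Beautiful Soup is good at, here pulling link targets out of an HTML snippet:

from bs4 import BeautifulSoup

html = '<html><body><a href="http://example.com/a">A</a><a href="http://example.com/b">B</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print([a['href'] for a in soup.find_all('a')])
# ['http://example.com/a', 'http://example.com/b']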
Universal Feed Parser is a Python package that comes with high-level tools for parsing various feed standards. It will be used for feed content extraction.
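A minimal sketch (assumed usage, hypothetical feed URL) of parsing a feed with Universal Feed Parser:

import feedparser

# Parse the feed and print its title plus the title/link of every entry.
feed = feedparser.parse('http://example.com/rss.xml')
print(feed.feed.get('title'))
for entry in feed.entries:
    print(entry.get('title'), entry.get('link'))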
PyRSS2Gen is a Python library that produces RSS 2.0 feeds. It is worth considering when we develop the destination feed that we will provide to the Android app.
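A minimal sketch (assumed usage, hypothetical feed contents) of producing an RSS 2.0 feed with PyRSS2Gen:

import datetime
import PyRSS2Gen

rss = PyRSS2Gen.RSS2(
    title='Example feed',                      # hypothetical feed metadata
    link='http://example.com/feed',
    description='Aggregated news items',
    lastBuildDate=datetime.datetime.utcnow(),
    items=[
        PyRSS2Gen.RSSItem(
            title='Example item',
            link='http://example.com/item-1',
            description='A single news entry',
            pubDate=datetime.datetime.utcnow(),
        ),
    ],
)
rss.write_xml(open('feed.xml', 'w'))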
Spycyroll is a Python library that aggregates RSS feeds, hence it will be used for RSS aggregation (merging many RSS feeds into one), if needed.
The Scrapy framework will be used to re-implement a new version of the web crawler.
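A minimal sketch (assumed usage, not the actual crawler) of a Scrapy spider that yields page titles and follows links from a hypothetical start page:

import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com/news']  # hypothetical start page

    def parse(self, response):
        # Yield something for the current page, then follow every link found.
        yield {'url': response.url, 'title': response.css('title::text').get()}
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Such a spider can be run with, for example, scrapy runspider news_spider.py -o items.json.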
