How to work with network documents (web data extraction)
A network document is a document that is located on the Web and accessible over HTTP. Such a document is usually a website, and its pages are the site’s web pages.
To start extracting data from a network document, choose Add a document in the Documents menu. Enter the document’s name and tick the Network document checkbox. Additional fields appear, in which you specify the document’s URL. If the document is reachable at several URLs, list them in the Aliases field. For example, if the site is available both with and without the WWW prefix, specify both URLs. When a network document is added, its home page is added automatically. Enter the home page’s path in the Path field, without the domain.
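The split between URL, Aliases, and Path can be illustrated with a short Python sketch. It is not part of SmokeDoc: the function name `describe_network_document` and the dictionary keys are hypothetical, and the exact format SmokeDoc expects in each field (for example, whether the URL field includes the scheme) is an assumption.

```python
from urllib.parse import urlsplit


def describe_network_document(url: str) -> dict:
    """Split a full home page URL into hypothetical form values.

    Assumption: the URL and Aliases fields take scheme + host,
    and the Path field takes only the path, without the domain.
    """
    parts = urlsplit(url)
    host = parts.netloc
    # Derive the counterpart host: drop "www." if present, add it if absent.
    if host.startswith("www."):
        alias_host = host[len("www."):]
    else:
        alias_host = "www." + host
    return {
        "url": f"{parts.scheme}://{host}",
        "aliases": [f"{parts.scheme}://{alias_host}"],
        "path": parts.path or "/",  # an empty path means the site root
    }


fields = describe_network_document("http://www.example.com/catalog/index.html")
print(fields["url"])      # http://www.example.com
print(fields["aliases"])  # ['http://example.com']
print(fields["path"])     # /catalog/index.html
```

The sketch only covers the WWW case mentioned above; any other alias domains would have to be listed by hand.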
Some website data is available only to authorized users. To make this data available to SmokeDoc, tick the Use authorization checkbox when adding the document. Additional fields appear, in which you specify the authorization condition, the URL of the authorization page, and the POST data (login, password, etc.). The authorization condition is a regular expression: if a page’s content matches this expression, authorization is carried out. To invert the condition, put the “!” symbol at the beginning of the expression.
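The following Python sketch shows one way the authorization condition, including the “!” prefix, could be evaluated. It is an illustration, not SmokeDoc’s implementation: the function name `needs_authorization` is hypothetical, the sketch assumes that “matches” means the expression is found anywhere in the page content, and SmokeDoc’s regular expression dialect may differ from Python’s.

```python
import re


def needs_authorization(condition: str, page_content: str) -> bool:
    """Return True if authorization should be carried out for this page.

    Assumption: a leading "!" inverts the result, and the rest of the
    condition is searched for anywhere in the page content.
    """
    negate = condition.startswith("!")
    pattern = condition[1:] if negate else condition
    matched = re.search(pattern, page_content) is not None
    return not matched if negate else matched


# Authorize when the page shows a login prompt.
print(needs_authorization("Please log in", "<h1>Please log in</h1>"))  # True

# Inverted condition: authorize when the page does NOT contain "Log out".
print(needs_authorization("!Log out", "<a>Log out</a>"))  # False
print(needs_authorization("!Log out", "<a>Sign in</a>"))  # True
```

An inverted condition is convenient when the logged-in state is easier to recognize than the logged-out one, for example by the presence of a logout link.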
To add pages to a document, find the document in the documents list and click the Plus icon in the Pages column. In the field that appears, enter the list of web page URLs. The added pages must be on the same domain as the document or on one of its alias domains.
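Before pasting a long list of URLs, it can help to check which of them satisfy the domain rule. The Python sketch below is a hypothetical pre-check, not a SmokeDoc feature; the function name `is_allowed_page` is invented here, and it assumes that “same domain” means an exact host name match against the document URL or one of its aliases.

```python
from urllib.parse import urlsplit


def is_allowed_page(page_url: str, document_url: str, aliases: list) -> bool:
    """Check that a page is on the document's domain or an alias domain.

    Assumption: domains are compared as exact host names
    (urlsplit lower-cases the host name for us).
    """
    allowed_hosts = {urlsplit(u).hostname for u in [document_url, *aliases]}
    page_host = urlsplit(page_url).hostname
    # A URL without a scheme has no parsed host name and is rejected.
    return page_host is not None and page_host in allowed_hosts


document_url = "http://example.com"
aliases = ["http://www.example.com"]
candidates = [
    "http://example.com/news/1.html",
    "http://www.example.com/news/2.html",
    "http://other.example.org/page.html",
]
accepted = [u for u in candidates if is_allowed_page(u, document_url, aliases)]
print(accepted)
```

In this example the first two URLs are accepted and the third is rejected, because its host is neither the document’s domain nor an alias.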
Pages can also be added to a network document with the EnqueueUrls directive while the document is being processed by scripts. Note that a page cannot be added to a document more than once.
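The “no more than once” rule can be modelled in a few lines of Python. This is only a model of the behaviour described above, not the syntax of the EnqueueUrls directive: the helper `enqueue_once` is hypothetical, and it assumes that pages are compared by their exact URL string, since this guide does not say how SmokeDoc compares them.

```python
def enqueue_once(known_pages: set, urls: list) -> list:
    """Add each URL to the document at most once.

    known_pages holds the URLs already in the document and is updated
    in place. The return value lists the URLs that were actually added.
    """
    added = []
    for url in urls:
        if url not in known_pages:
            known_pages.add(url)
            added.append(url)
    return added


known = {"http://example.com/"}  # the home page is already in the document
added = enqueue_once(known, [
    "http://example.com/a.html",
    "http://example.com/",        # already present, skipped
    "http://example.com/a.html",  # repeated in the same call, skipped
])
print(added)  # ['http://example.com/a.html']
```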
