Knowlesys Web Data Miner Studio
As a huge resource trove, the web now contains over 80 billion pages, and the number is still growing at an astonishing speed. The web holds a large amount of valuable information, e.g. lists and contact details of prospective clients, price lists of competing products, up-to-date financial news, supply and demand information, and paper abstracts.
However, because key information is usually scattered across a large number of HTML pages as semi-structured or free text, it is hard to use directly.
The primary goal of Knowlesys is to overcome the difficulties of Web Data Mining. We have been researching this area for 9 years and have spent the same period providing Web Data Mining services to numerous domestic and foreign clients. The Knowlesys Web Data Miner System developed on this basis is now the leading solution in the world (it beat rival bidders from the US for an international project) and has not been surpassed by any domestic product.
I. Main Functions
The main functions of Knowlesys Web Data Miner Studio are: extracting semi-structured and unstructured data from target web pages on the Internet accurately and at scale according to users' own task configurations, converting the data into structured records and saving them in local databases for internal use or external release, and thereby giving rapid access to external information, as illustrated below:

Fig. 1 Concept map of Knowlesys Information Acquisition System
Apart from remote web pages, Knowlesys Web Data Miner System can also process local web pages and remote or local text files.
The system is mostly applied in the following areas: opinion monitoring, brand monitoring, price monitoring, news acquisition from portals, industrial information acquisition, competitive intelligence gathering, business data integration, market research, database marketing, etc.
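The conversion of semi-structured HTML into structured records can be sketched as follows. This is only a minimal illustration using Python's standard library, not the way Knowlesys Web Data Miner Studio itself works; the class name `PriceTableParser`, the function `extract_records` and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class PriceTableParser(HTMLParser):
    """Collects the cell text of each <tr> in an HTML table as one row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Turn a header row plus data rows into a list of dicts."""
    parser = PriceTableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page standing in for a downloaded product list.
SAMPLE_PAGE = """
<table>
  <tr><th>title</th><th>price</th></tr>
  <tr><td>Widget A</td><td>19.90</td></tr>
  <tr><td>Widget B</td><td>24.50</td></tr>
</table>
"""

records = extract_records(SAMPLE_PAGE)
print(records)
```

Each record is a plain dictionary keyed by the column headers, which is a convenient shape for writing into a database table afterwards.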
II. System Features
The greatest highlights of the system are the flexibility of the acquisition method and the accuracy of the data acquired.
Flexibility: any complicated search and page layout can be processed.
Accuracy: the results are 99% to 100% accurate.
♦ Information is captured automatically from target websites, covering all sorts of data on HTML pages, e.g. text, URLs, numbers, dates and images.
♦ Users can define the sources and classify all information.
♦ Images and all sorts of files can be downloaded.
♦ Automatic login with user ID and password is supported.
♦ Command-line operation is supported. With the Windows Task Scheduler, target websites can be extracted periodically.
♦ A unique index on records is supported to avoid storing identical information twice.
♦ The smart replacement function removes all irrelevant parts embedded in the content, such as advertisements.
♦ Content of multi-page articles can be automatically extracted and combined.
♦ Automatic browsing to the next page is supported.
♦ Forms can be submitted directly.
♦ Form submission simulation is supported.
♦ Action scripts are supported.
♦ Multiple data sheets from one page can be extracted.
♦ A number of data post-processing methods are supported.
♦ Because data is written directly to databases rather than files, the system is not coupled to the website programs or desktop programs that use the data.
♦ Database table structures are fully customizable, so existing systems can be used to the full.
♦ The information acquisition from multiple sections can be processed under the same configuration.
♦ Information integrity and accuracy are ensured, and garbled characters never appear.
♦ All mainstream databases are supported: MS SQL Server, Oracle, DB2, MySQL, Sybase, Interbase, MS Access, etc.
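The unique-index feature in the list above can be illustrated with a short sketch. It assumes SQLite (used here only because it ships with Python; it is not one of the databases listed), and the table `articles`, its columns and the function `save_records` are hypothetical names, not part of the product.

```python
import sqlite3


def save_records(conn, records):
    """Insert records, skipping any whose URL is already stored.

    Returns the number of rows actually added.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT NOT NULL, title TEXT NOT NULL)"
    )
    # The unique index is what rejects a second copy of the same URL.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_articles_url ON articles (url)"
    )
    before = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    # OR IGNORE silently drops rows that would violate the unique index.
    conn.executemany(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (:url, :title)",
        records,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")

# Two extraction runs over the same site; the second run sees one old item.
first_run = [
    {"url": "https://example.com/a", "title": "Article A"},
    {"url": "https://example.com/b", "title": "Article B"},
]
second_run = [
    {"url": "https://example.com/b", "title": "Article B"},
    {"url": "https://example.com/c", "title": "Article C"},
]

added_first = save_records(conn, first_run)
added_second = save_records(conn, second_run)
print(added_first, added_second)
```

Choosing the URL as the unique key is itself an assumption; for sources where the same item appears under several URLs, a key built from title and date may suit better.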
III. Operating Environment
Operating systems: Windows XP/NT/2000/2003/2008/7/8
CPU: dual-core or quad-core, 2.0 GHz or above
Memory: minimum 64 MB, 2 GB or above preferred (e.g. 8 GB/16 GB/32 GB/64 GB)
Hard disk: minimum 50 MB free disk space, 300 GB or 500 GB preferred
IV. Industrial Application
Knowlesys Web Data Miner System finds wide application in industries that focus on external information access.
Portals
It can:
automatically collect the latest contents of specified websites (from hundreds to a thousand) every day (it can collect more than 10,000 pieces of news from more than 1,000 web media a day);
automatically collect price information from specified shopping websites every day (product title, description, price, image, etc.).
Benefits:
significantly save the time and effort people spend collecting information from the Internet and let them concentrate more on business issues;
integrate industrial information effortlessly;
quickly increase the information volume and views of a website while raising its ranking on Google and Alexa;
implement the front-end collection sub-system of a price comparison system with ease.
News Media
It can:
automatically collect news contents from specified websites as scheduled every day to enlarge sources and quantity of contents;
easily integrate news from different regions and industries to form topics;
collect and integrate special articles and BBS posts within an industry.
Benefits:
significantly save editors' time so that they can devote more effort to more important work;
quickly increase the information volume and views of a website;
enter massive amounts of information with ease.
Enterprises
It can:
accurately collect local and foreign news, industrial news and technical articles in real time;
accurately collect news, personnel, product and price information of competitors and suppliers in real time;
accurately collect business intelligence from public sources in real time (industrial product prices, feedback from competitors' users and industrial news);
accurately collect information from major industrial BBS in real time to understand consumers' requirements and feedback and identify market trends and business opportunities;
accurately collect product sales leads and prospective clients' information from online public information;
accurately collect product information (descriptions, prices, etc.), images and technical documents of over 10,000 products in an industry from online public information.
Benefits:
quickly access large amounts of target business information to improve a company's marketing capability;
quickly integrate enterprise applications (ERP, CRM, etc.) and portals with contents on the Internet;
quickly establish a high-capacity expertise database and improve a company's knowledge management;
save internal staff the time spent searching for news across websites.
Governments & Armies
It can:
track and collect domestic, foreign and local news, policies and regulations, economic and industrial information in relation to the work of governments;
meet the Internet information needs of important departments that are isolated from the Internet;
handle information acquisition and integration for government master websites and local sub-sites at all levels.
Benefits:
meet internal staff's requirements for integrating real-time information from the Internet;
promptly resolve insufficient and out-of-date information on government intranets and external networks;
improve user satisfaction of government websites by increasing the amount of information (e.g. news, demand and supply information);
significantly save the time and effort in collecting information on the Internet for the staff.
Advertising & Market Research
It can:
quickly capture large numbers of business directories from public information;
quickly capture large amounts of original information (e.g. information from blogs and BBS) from target websites and put it into databases.
Benefits:
quickly create highly reliable business directory databases for specific groups;
quickly create basic user-feedback databases for statistical analysis and research;
monitor relevant information on blogs and BBS for branded clients.
Scientific & Technical Research
It can:
track and collect S&T information and news within and outside China;
integrate research data on the web pages of all websites, e.g. gene-related data published by NCBI NIH (US);
extract local text data.
Benefits:
fully meet research staff's requirements for integrating and viewing S&T information in real time;
effortlessly acquire related research data from reliable public sources on the Internet;
save precious time and effort for research staff.
V. Functions of Different Editions
| Function | Standard edition | Professional edition | Enterprise edition |
| --- | --- | --- | --- |
| Microblog website extraction | ![]() | ![]() | ![]() |
| BBS extraction | ![]() | ![]() | ![]() |
| Blog website extraction | ![]() | ![]() | ![]() |
| News website extraction | ![]() | ![]() | ![]() |
| Text file extraction | ![]() | ![]() | ![]() |
| RSS/XML extraction | ![]() | ![]() | ![]() |
| Image website extraction | ![]() | ![]() | ![]() |
| Video website extraction | ![]() | ![]() | ![]() |
| Scheduled execution | ![]() | ![]() | ![]() |
| Static URL list extraction | ![]() | ![]() | ![]() |
| Dynamic URL list extraction | ![]() | ![]() | ![]() |
| Web page screenshot | ![]() | ![]() | |
| Direct POST search and extraction | ![]() | ![]() | |
| Online database website extraction | ![]() | | |
| Ordinary Windows window program extraction | ![]() | | |
| Simulated form completion for query and extraction | ![]() | | |
| Advanced data processing | ![]() | | |
| Multi-language information extraction | ![]() | | |
| Maximum number of tables | 10 | 10 | unlimited |
| Maximum number of fields | 60 | 100 | unlimited |
| Maximum lines of data transform script | 100 | 200 | unlimited |
| Maximum records extracted successively | 100,000 | 500,000 | unlimited |
| Times of use | unlimited | unlimited | unlimited |
| Number of websites | unlimited | unlimited | unlimited |
| Number of website sections configured for free | 2 | 4 | 4 |
VI. Demo & download
View extraction results online. For more information or solutions, please submit a request or contact us now.


