Knowlesys

Knowlesys Web Data Miner Studio

The web is a huge trove of resources: it now contains over 80 billion pages, and that number is still growing at an astonishing speed. Much of this information is valuable to you, e.g. lists and contact details of prospective clients, price lists of competing products, up-to-date financial news, supply and demand information, and paper abstracts.

However, because key information is usually scattered across a large number of HTML pages as semi-structured or free text, it is hard to use directly.

The primary goal of Knowlesys is to overcome the difficulties of web data mining. We have been researching this area for 9 years and, over the same period, have provided web data mining services to numerous domestic and foreign clients. The Knowlesys Web Data Miner System developed on that basis is now the leading solution in the world (it beat rival bidders from the US for an international project) and has not been surpassed by any domestic one.

I. Main Functions

The main functions of Knowlesys Web Data Miner Studio are: extracting semi-structured and unstructured data from target web pages on the Internet, accurately and in bulk, according to the user's own task configuration; converting the data into structured records; and saving those records in local databases for internal use or external release, giving users rapid access to external information, as illustrated below:


Fig. 1 Concept map of Knowlesys Information Acquisition System

Apart from remote web pages, Knowlesys Web Data Miner System can also process local web pages and remote or local text files.
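The conversion of semi-structured HTML into structured records described above can be illustrated with a minimal sketch. It is not the Knowlesys implementation: the class `PriceTableParser`, the function `extract_records` and the sample page are hypothetical names and data chosen for illustration, and the sketch assumes the target data sits in a simple HTML table whose first row is a header. It uses only the Python standard library.

```python
from html.parser import HTMLParser


class PriceTableParser(HTMLParser):
    """Collects the cell text of each <tr> in an HTML table as one row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = PriceTableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page standing in for a downloaded price list.
SAMPLE_PAGE = """
<table>
  <tr><th>title</th><th>price</th></tr>
  <tr><td>Widget A</td><td>9.50</td></tr>
  <tr><td>Widget B</td><td>12.00</td></tr>
</table>
"""

records = extract_records(SAMPLE_PAGE)
```

Each entry in `records` is a dictionary such as `{"title": "Widget A", "price": "9.50"}`, which is a convenient shape for writing into a database table. Real pages are rarely this regular, which is why a configurable extraction tool is needed in practice.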

The system is mostly applied in the following areas: opinion monitoring, brand monitoring, price monitoring, news acquisition from portals, industry information acquisition, competitive intelligence gathering, business data integration, market research, database marketing, etc.

II. System Features

The greatest highlights of the system are the flexibility of its acquisition method and the accuracy of its data acquisition. Flexibility: any complicated search and page layout can be processed. Accuracy: results are 99% to 100% accurate.

♦ Information is captured automatically from target websites, including all sorts of data on HTML pages, e.g. text, URLs, numbers, dates and images.
♦ Users can define the sources and classify all information.
♦ Images and all sorts of files can be downloaded.
♦ Automatic login with user ID and password is supported.
♦ Command-line operation is supported. With the Windows Task Scheduler, target websites can be extracted periodically.
♦ A unique index on records is supported to avoid re-entering identical information.
♦ The smart replacement function removes all irrelevant parts embedded in the content, such as advertisements.
♦ Content of multi-page articles can be automatically extracted and combined.
♦ Automatic browsing to the next page is supported.
♦ Forms can be submitted directly.
♦ Form submission simulation is supported.
♦ Action scripts are supported.
♦ Multiple data sheets can be extracted from one page.
♦ A number of data post-processing methods are supported.
♦ As data are written directly to databases instead of files, the website or desktop programs that use the data are not coupled to the extraction system.
♦ Database table structures are fully customizable, so existing systems can be used to the full.
♦ Information from multiple website sections can be acquired under the same configuration.
♦ Information integrity and accuracy are ensured, and garbled characters never appear.
♦ All mainstream databases are supported: MS SQL Server, Oracle, DB2, MySQL, Sybase, Interbase, MS Access, etc.
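
The unique-index feature listed above can be sketched as follows. This is only an illustration of the general technique, not the product's own code: the table name `articles`, the index name `idx_articles_url`, the function `save_records` and the choice of the page URL as the unique key are all assumptions, and SQLite is used here simply because it ships with Python (it is not one of the databases named in the list).

```python
import sqlite3


def save_records(conn, records):
    """Insert records, skipping any whose URL is already stored.

    Returns the number of rows actually added.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        " url TEXT NOT NULL,"
        " title TEXT NOT NULL)"
    )
    # The unique index is what rejects a second copy of the same URL.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_articles_url ON articles (url)"
    )
    before = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    # OR IGNORE silently drops rows that would violate the unique index.
    conn.executemany(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (:url, :title)",
        records,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    return after - before


# Two extraction runs whose results overlap on one URL (hypothetical data).
conn = sqlite3.connect(":memory:")
first_run = save_records(conn, [
    {"url": "http://example.com/a", "title": "A"},
    {"url": "http://example.com/b", "title": "B"},
])
second_run = save_records(conn, [
    {"url": "http://example.com/b", "title": "B again"},
    {"url": "http://example.com/c", "title": "C"},
])
```

Because the second run repeats one URL, only its new record is stored, so a periodically scheduled extraction can be re-run without creating duplicate rows.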

III. Operating Environment

Operating systems: Windows XP/NT/2000/2003/2008/7/8
CPU: dual-core or quad-core CPU, 2.0 GHz or above
Memory: minimum 64 MB, 2 GB or above preferred (e.g. 8 GB/16 GB/32 GB/64 GB)
Hard disk: minimum 50 MB free disk space, 300 GB or 500 GB preferred

IV. Industry Applications

Knowlesys Web Data Miner System is widely used in industries that depend on access to external information.

Portals

It can:
automatically collect the latest content from specified websites (from hundreds to a thousand) every day (more than 10,000 news items from more than 1,000 web media outlets a day);
automatically collect price information from specified shopping websites every day (product title, description, price, image, etc.).

Benefits:
significantly save people the time and effort of collecting information from the Internet so that they can concentrate on business issues;
integrate industry information effortlessly;
quickly increase a website's information volume and page views while raising its ranking on Google and Alexa;
easily implement the front-end collection subsystem of a price comparison system.

News Media

It can:
automatically collect news content from specified websites on a daily schedule to broaden the sources and quantity of content;
easily integrate news from different regions and industries to form topics;
collect and integrate feature articles and BBS posts within an industry.

Benefits:
significantly save editors' time so that they can put more effort into more important things;
quickly increase a website's information volume and page views;
input massive amounts of information with ease.

Enterprises

It can:
accurately collect local and foreign news, industry news and technical articles in real time;
accurately collect news, personnel, product and price information about competitors and suppliers in real time;
accurately collect business intelligence from public sources in real time (industrial product prices, feedback from competitors' users and industry news);
accurately collect information from major industry BBS in real time to understand consumers' requirements and feedback and to identify market trends and business opportunities;
accurately collect product sales leads and prospective clients' information from public online information;
accurately collect product information (descriptions, prices, etc.), images and technical documents for over 10,000 products in an industry from public online information.

Benefits:
quickly access large amounts of targeted business information to improve a company's marketing capability;
quickly integrate enterprise applications (ERP, CRM, etc.) and portals with content from the Internet;
quickly establish a high-capacity expertise database and improve a company's knowledge management;
save internal staff the time spent searching for news across websites.

Government & Military

It can:
track and collect domestic, foreign and local news, policies and regulations, economic and industrial information in relation to the work of governments;
meet the Internet information needs of important departments that are isolated from the Internet;
handle information acquisition and integration across governments' main websites and local sub-sites at all levels.

Benefits:

meet internal staff's requirements for integrating real-time information from the Internet;
promptly solve the problems of insufficient and outdated information on governments' intranets and external networks;
improve user satisfaction with government websites by increasing the amount of information (e.g. news, supply and demand information);
significantly save staff the time and effort of collecting information on the Internet.

Advertising & Market Research

It can:

quickly capture large numbers of business directories from public information;
quickly capture large amounts of original information (e.g. from blogs and BBS) from target websites and put it into databases.

Benefits:
quickly create highly reliable business directory databases for specific groups;
quickly create basic user feedback databases for statistical analysis and research;
monitor relevant information on blogs and BBS for branded clients.

Scientific & Technical Research

It can:
track and collect S&T information and news within and outside China;
integrate research data from web pages across websites, e.g. gene-related data published by NCBI, NIH (US);
extract local text data.

Benefits:

fully meet research staff's requirements for integrating and viewing S&T information in real time;
effortlessly acquire related research data from reliable public sources on the Internet;
save precious time and effort for research staff.

V. Functions of Different Editions

Function                                            Standard edition  Professional edition  Enterprise edition
Microblog website extraction
BBS extraction
Blog website extraction
News website extraction
Text file extraction
RSS/XML extraction
Image website extraction
Video website extraction
Scheduled execution
Static URL list extraction
Dynamic URL list extraction
Web page screenshot
Direct POST search and extraction
Online database website extraction
Ordinary Windows window program extraction
Simulate form completion for query and extraction
Advanced data processing
Multi-language information extraction
Maximum number of tables                            10                10                    unlimited
Maximum number of fields                            60                100                   unlimited
Maximum lines of data transform script              100               200                   unlimited
Maximum records extracted successively              100,000           500,000               unlimited
Times of use                                        unlimited         unlimited             unlimited
Number of websites                                  unlimited         unlimited             unlimited
Number of website sections configured for free      2                 4                     4

VI. Demo & download

View extraction results online. For more information or solutions, please submit a request or contact us now.