The Word & Web Vector Tool

The Word & Web Vector Tool (WVTool) is a flexible Java library for statistical language modeling and for integrating Web and Webservice based data sources. It supports the creation of word vector representations of text documents in the vector space model, which is the point of departure for many text processing applications (e.g. text classification or information retrieval). Furthermore, it offers convenient interactive methods to extract data from structured sources, such as HTML or XML files. Finally, it allows external data to be integrated through Webservice APIs in a mashup-like way (e.g. for geo-mapping).
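
As a rough illustration of what a word vector in the vector space model looks like, the following sketch computes TF-IDF weighted vectors in plain Java. It does not use the WVTool API: the class and method names (`WordVectorSketch`, `tokenize`, `vocabulary`, `tfidf`) are hypothetical, and both the simple tokenizer and the TF-IDF weighting are assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not part of the WVTool API.
public class WordVectorSketch {

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Collect the alphabetically sorted set of terms over all documents.
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    // One row per document, one column per vocabulary term.
    // Weight = (term count / document length) * ln(N / document frequency).
    static double[][] tfidf(List<String> docs, List<String> vocab) {
        int n = docs.size();
        double[][] vectors = new double[n][vocab.size()];
        int[] df = new int[vocab.size()];
        for (int i = 0; i < n; i++) {
            List<String> tokens = tokenize(docs.get(i));
            if (tokens.isEmpty()) {
                continue; // an empty document keeps an all-zero vector
            }
            for (String t : tokens) {
                int j = vocab.indexOf(t);
                if (j >= 0) {
                    vectors[i][j] += 1.0;
                }
            }
            for (int j = 0; j < vocab.size(); j++) {
                if (vectors[i][j] > 0) {
                    df[j]++;
                }
                vectors[i][j] /= tokens.size();
            }
        }
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < vocab.size(); j++) {
                // Terms that occur in no document keep weight 0.
                vectors[i][j] = df[j] > 0 ? vectors[i][j] * Math.log((double) n / df[j]) : 0.0;
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The cat sat", "The dog sat down");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (double[] v : tfidf(docs, vocab)) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```

Terms that occur in every document (here "the" and "sat") receive weight zero, which is why such a weighting is often combined with stop word removal.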

The aim of the WVTool is to provide a simple-to-use, simple-to-extend, pure Java library for text and web mining. It is tightly integrated with the RapidMiner data mining suite (formerly known as Yale), so that data mining methods can be applied to text and web data in a convenient way. The WVTool bridges the gap between highly sophisticated linguistic packages on the one side and proprietary or specialized partial solutions on the other.

The key features are

  • 100% Java implementation
  • Simple API
  • Very easy to configure and extend
  • Flexible choice of processing steps
  • Integrates many preprocessing components
    (e.g. multi-lingual stemming and stop word lists)
  • Integrates data from various sources (files, Websites, Webservice APIs)
  • Supports different file formats (XML, HTML, PDF, ...)
  • Integrates directly with RapidMiner
  • Wordnet support
  • Regular Expression based feature extraction
  • XPath based feature extraction from HTML and XML
  • Webservice integration
  • Basic web crawling
  • Regular expression based dictionaries
  • Preview function for interactive query deployment
  • Term N-Gram Support (phrases)
  • Basic Web Log Analysis
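
To make two of the listed preprocessing steps more concrete, the following sketch shows stop word filtering followed by term n-gram generation. It is a minimal plain-Java illustration and does not use the WVTool API; the class name `NGramSketch`, its methods, the tiny stop word list and the underscore separator are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of the WVTool API.
public class NGramSketch {

    // A deliberately tiny English stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "is"));

    // Drop every token that appears in the stop word list.
    static List<String> filterStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Join each run of n consecutive tokens into one phrase term.
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join("_", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "word", "vector", "tool");
        System.out.println(ngrams(filterStopWords(tokens), 2));
    }
}
```

The resulting phrase terms can be treated like ordinary terms when building word vectors.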

Download

You can use the WVTool either...

...as a plugin to RapidMiner, offering interactive configuration and visualization.

...as a standalone Java library (only word vector creation).


Documentation

The following documentation is available:

  • There is a PDF manual (pdf), covering the use of the WVTool in RapidMiner and as a standalone library.
  • A good starting point for using the WVTool with RapidMiner is the set of examples provided with the plugin on the RapidMiner download page.
  • Several examples of how to use the WVTool as a Java library are bundled with the WVTool Java library release.
  • You may also browse the javadoc documentation for the WVTool Java library (html) and for the Text Plugin (html) online.
  • There are several tutorials that cover the use of the WVTool in RapidMiner.

Source Code and License

The WVTool is provided under the GNU General Public License (GPL). If you need another licensing scheme, please contact us. The source code of the WVTool is split between the main WVTool Java library project and the WVTool RapidMiner plugin. The former is part of the WVTool library release and can be obtained from the WVTool SourceForge page. The latter is provided as a separate download and CVS module on the RapidMiner download page.

Screenshots

screenshot1

The WVTool can easily be configured and applied from RapidMiner. Texts for vectorization are provided by simply associating all files in a directory with a label. The transformations used at individual steps are chosen interactively.

screenshot2

An even easier way to configure an example is to use the configuration wizard, which guides the user through the steps of specifying the input texts and the parameters for text processing.

screenshot3

A very useful way to check interactively whether the WVTool is correctly configured is the built-in previewer. This function is especially helpful when deploying structured queries, e.g. based on XPath.
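
To give an idea of what an XPath based structured query extracts, the following sketch evaluates an XPath expression against a well-formed XML or XHTML string using the standard `javax.xml.xpath` API from the JDK. It is not the WVTool previewer; the class name `XPathPreviewSketch`, the `extract` method and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not part of the WVTool API.
public class XPathPreviewSketch {

    // Return the trimmed text content of every node matched by the expression.
    static List<String> extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Malformed input or an invalid expression ends up here.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>WVTool</h1>"
                + "<p class=\"feature\">Stemming</p>"
                + "<p class=\"feature\">N-grams</p>"
                + "<p>Other</p></body></html>";
        // Print the matches so the query can be checked by eye.
        for (String match : extract(page, "//p[@class='feature']")) {
            System.out.println(match);
        }
    }
}
```

Note that this sketch requires well-formed markup; real-world HTML usually has to be cleaned up before a standard XML parser accepts it.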

screenshot4

Word vectors can be used for any kind of data mining. In the given example, texts are visualized by mapping them to a 2D area. The WVTool enables the user to access the original text files by selecting individual data points.


Contact

The WVTool was created and is maintained by Michael Wurst. If you have any questions or suggestions, don't hesitate to contact me.
