I've been looking around for open-source libraries (preferably in Java, but not required) for extracting data and metadata from common file formats and Web formats. One project that looks very promising is Aperture. Do you know of any others that are ready or almost ready for prime-time use? Please let me know in the comments! Thanks.
For extracting data from Web sites:
http://simile.mit.edu/wiki/Piggy_Bank
Posted by: Prateek | October 09, 2007 at 01:21 PM
SnapLogic (http://www.snaplogic.org) is an Open Source data integration framework implemented in Python.
We combine data access and data transformation using a pipeline approach.
Databases, files, and RSS read/write are currently available; other sources and targets can easily be added.
Posted by: Mike Pittaro | September 12, 2007 at 04:19 PM
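To illustrate the pipeline idea mentioned in the comment above (read from a source, transform each record, write to a target), here is a minimal sketch in Java. It is not SnapLogic code, which is written in Python; `PipelineSketch` and `run` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of a read -> transform -> write pipeline.
// None of these names come from SnapLogic.
class PipelineSketch {

    // Runs every record from the source through the transform
    // and collects the results into an in-memory target list.
    static <A, B> List<B> run(Iterable<A> source, Function<A, B> transform) {
        List<B> target = new ArrayList<>();
        for (A record : source) {
            target.add(transform.apply(record));
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(" alice ", "Bob");
        // Two transformation steps chained into a single stage.
        Function<String, String> trim = String::trim;
        Function<String, String> clean = trim.andThen(s -> s.toUpperCase(Locale.ROOT));
        System.out.println(run(rows, clean)); // prints [ALICE, BOB]
    }
}
```

In a real framework the source and target would be databases, files, or feeds rather than in-memory lists, but the shape stays the same.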
Take a look at http://hul.harvard.edu/jhove/
Mainly designed for file format validation, but also does a fair bit of metadata extraction.
It's licensed under the LGPL.
Posted by: bill | September 12, 2007 at 11:52 AM
An excellent list of RDFizers is at:
http://simile.mit.edu/wiki/RDFizers
Another list (with some repeats):
http://esw.w3.org/topic/ConverterToRdf
Aperture has a number of these built in but it is easy to add additional types.
I have also looked for other frameworks but have found none other than Aperture. It looks to be the only major one.
Posted by: David Peterson | September 11, 2007 at 05:25 PM
Nova,
The Sponger [1] component of the Open Source edition of Virtuoso is a built-in middleware layer for extracting metadata and re-purposing it as RDF instance data for appropriately matched ontologies and schemas. This is how we make almost any Web Information Resource a bona fide RDF Data Source (on the fly). The architecture of the Sponger is such that you can slot in third-party extractors. Thus, in our case, we are able to integrate the likes of Aperture.
To see all of this in action simply look at:
1. http://www.openlinksw.com/blog/~kidehen/?id=1172
2. http://demo.openlinksw.com/DAV/JS/rdfbrowser/index.html (use this blog post as the Data Source URI for instance)
3. http://demo.openlinksw.com/isparql - visual SPARQL Query Tools (just put this blog post as the Data Source URI as per item 1)
The examples above all interact with the Sponger's REST interface.
Posted by: Kingsley Idehen | September 11, 2007 at 09:49 AM
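The "slot in third-party extractors" architecture described in the comment above can be pictured roughly as a registry that dispatches on content type. This is only a sketch under that assumption: `ExtractorRegistrySketch`, `MetadataExtractor`, and the MIME-type keys are hypothetical and are not the API of Virtuoso, the Sponger, or Aperture.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a pluggable extractor registry.
// These names are illustrative, not Virtuoso's or Aperture's API.
class ExtractorRegistrySketch {

    // An extractor turns raw content into simple key/value metadata.
    interface MetadataExtractor {
        Map<String, String> extract(String content);
    }

    private final Map<String, MetadataExtractor> byMimeType = new LinkedHashMap<>();

    // Third-party extractors are slotted in by registering them for a MIME type.
    void register(String mimeType, MetadataExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    // Dispatches to the matching extractor; unknown types yield no metadata.
    Map<String, String> extract(String mimeType, String content) {
        MetadataExtractor extractor = byMimeType.get(mimeType);
        return extractor == null ? Map.of() : extractor.extract(content);
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        // A toy plain-text extractor that reports only the character count.
        registry.register("text/plain",
                content -> Map.of("length", String.valueOf(content.length())));
        System.out.println(registry.extract("text/plain", "hello")); // prints {length=5}
    }
}
```

Keeping the extractor behind a one-method interface is what makes it cheap to plug in a library such as Aperture: only a thin adapter needs to be written.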