Distributed Information Systems Laboratory LSIR

Extracting data from template-based websites

Project Details

Laboratory : LSIR Master Completed

Description:

Many websites are dynamically generated from a database and a template. Examples include e-commerce sites such as Amazon, classified-ad sites such as craigslist.com, and flight schedules such as swiss.com.

This project consists of implementing an algorithm that takes a set of template-generated pages from a given website, automatically learns the template, and extracts the data from the pages. The starting point is the publication “Extracting Structured Data from Web Pages” by Arasu and Garcia-Molina (Stanford, SIGMOD 2003).
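
As a rough illustration of what "learning the template" can mean, the Python sketch below groups tokens by how often they occur on each page and treats tokens that occur exactly once on every page as template text; whatever lies between them is taken as data. This is a heavily simplified toy and not the algorithm from the cited publication. The function names (tokenize, occurrence_classes, template_tokens, extract_fields), the whitespace tokenizer, and the sample pages are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def tokenize(page):
    # Assumption: naive whitespace tokenizer. A real system would
    # tokenize HTML tags and text properly.
    return page.split()


def occurrence_classes(pages):
    # Map each occurrence vector (count of the token on page 1, page 2, ...)
    # to the set of tokens that share that vector.
    counts = [Counter(tokenize(p)) for p in pages]
    vocabulary = set().union(*counts)
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        classes[vector].add(token)
    return dict(classes)


def template_tokens(pages):
    # Toy heuristic: tokens occurring exactly once on every page
    # are assumed to belong to the template.
    return occurrence_classes(pages).get((1,) * len(pages), set())


def extract_fields(page, template):
    # Collect the runs of non-template tokens between template tokens.
    fields, current = [], []
    for token in tokenize(page):
        if token in template:
            if current:
                fields.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        fields.append(" ".join(current))
    return fields


# Hypothetical sample pages generated from the same template.
pages = [
    "<b> Title: </b> Dune <i> Price: </i> 9.99",
    "<b> Title: </b> Emma <i> Price: </i> 4.50",
]
template = template_tokens(pages)
print(extract_fields(pages[0], template))  # ['Dune', '9.99']
print(extract_fields(pages[1], template))  # ['Emma', '4.50']
```

A known weakness of this toy heuristic is that a data value appearing exactly once on every page (for instance, the same price everywhere) would be mistaken for template text.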


Tasks:

  • Implement the algorithm proposed in the cited publication
  • Run it on a set of given websites and analyze the success rate
  • Propose improvements
  • Implement a crawler suited for the task
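
The project description does not define "success rate". One plausible way to measure it, sketched below in Python, is to compare the fields extracted from each page against hand-labelled expected fields and to average precision and recall over the pages of a site. The exact-match criterion, the function names (precision_recall, site_success_rate), and the sample data are assumptions made for the example.

```python
def precision_recall(extracted, expected):
    # Exact-match comparison of extracted values against expected values.
    # Assumption: duplicates are ignored by comparing sets.
    extracted_set, expected_set = set(extracted), set(expected)
    hits = len(extracted_set & expected_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall


def site_success_rate(results):
    # results: one (extracted, expected) pair per page of a site.
    # Returns the mean precision and mean recall over all pages.
    scores = [precision_recall(ext, exp) for ext, exp in results]
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    mean_precision = sum(p for p, _ in scores) / n
    mean_recall = sum(r for _, r in scores) / n
    return mean_precision, mean_recall


# Hypothetical results for two pages of one site.
results = [
    (["Dune", "9.99"], ["Dune", "9.99"]),    # everything correct
    (["Emma", "Price:"], ["Emma", "4.50"]),  # one wrong, one missed
]
print(site_success_rate(results))  # (0.75, 0.75)
```

Reporting precision and recall separately helps distinguish an extractor that misses data from one that mistakes template text for data.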

Requirements:

  • Expertise in Java or Python
  • Previous work on unsupervised learning methods

This project will be jointly supervised by David Portabella (at http://db4all.com/) and Zoltan Miklos.


Contact: Zoltan Miklos