Extracting data from template-based websites

Many websites are dynamically generated from a database and a template. Examples include e-commerce sites such as Amazon, classified-ad sites such as craiglist.com, and flight-schedule sites such as swiss.com.


This project consists of implementing an algorithm that takes a set of template-generated pages from a single website, automatically learns the underlying template, and uses it to extract the data from the pages. The starting point is the publication “Extracting Structured Data from Web Pages” by Arasu (Stanford).
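To make the idea concrete, here is a deliberately naive sketch in Python. It is not the algorithm from the cited publication: it simply assumes that tokens shared, in order, by all sample pages belong to the template, and that everything between them is data. The function names (`tokenize`, `learn_template`, `extract`) and the sample pages are hypothetical, and the alignment relies on the heuristic matching of the standard `difflib` module.

```python
import re
from difflib import SequenceMatcher
from typing import List


def tokenize(page: str) -> List[str]:
    """Split a page into HTML tags and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def common_tokens(a: List[str], b: List[str]) -> List[str]:
    """Return the tokens that difflib aligns between a and b, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared: List[str] = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def learn_template(pages: List[str]) -> List[str]:
    """Naive template: the tokens common to every sample page."""
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = common_tokens(template, tokens)
    return template


def extract(template: List[str], page: str) -> List[str]:
    """Return the runs of page tokens that fall between template tokens."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    fields: List[str] = []
    prev_end = 0
    for block in matcher.get_matching_blocks():
        if block.b > prev_end:  # unmatched page tokens are treated as data
            fields.append(" ".join(tokens[prev_end:block.b]))
        prev_end = block.b + block.size
    return fields


if __name__ == "__main__":
    # Hypothetical sample pages, invented for illustration only.
    pages = [
        "<html><h1>Red Shoes</h1><span>Price: 30 USD</span></html>",
        "<html><h1>Blue Hat</h1><span>Price: 12 USD</span></html>",
        "<html><h1>Green Scarf</h1><span>Price: 8 USD</span></html>",
    ]
    template = learn_template(pages)
    for page in pages:
        print(extract(template, page))
```

Note the limitation visible even in this toy input: a value that happens to be identical on every page (here the currency) is absorbed into the template. Handling such cases, as well as optional and repeated fields, is part of what a real template-induction algorithm has to solve.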





Tasks:

  • Implement the algorithm proposed in the cited publication
  • Run it on a given set of websites and analyze its success rate
  • Propose improvements
  • Implement a crawler suited for the task
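The project description does not define how the success rate is measured. One possible way, shown here purely as an assumption, is to compare the extracted values with a hand-labeled ground truth for each page and report precision and recall. The function name `precision_recall` and the sample values are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str],
                     expected: Iterable[str]) -> Tuple[float, float]:
    """Compare extracted values with hand-labeled expected values.

    Assumption: values are compared as exact strings and duplicates
    are ignored (both inputs are turned into sets).
    """
    extracted_set = set(extracted)
    expected_set = set(expected)
    correct = len(extracted_set & expected_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(expected_set) if expected_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical output of an extractor and the matching ground truth.
    extracted = ["Red Shoes", "30", "USD"]
    expected = ["Red Shoes", "30"]
    p, r = precision_recall(extracted, expected)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these two numbers over all pages of a site, and then over all sites, would give one comparable figure per website; other definitions of success are equally possible.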






Requirements:

  • Expertise in Java or Python
  • Previous work on unsupervised learning methods


This project will be jointly supervised by David Portabella (http://db4all.com/) and Zoltan Miklos.

Contact: Zoltan Miklos