White List Methodology
The QUATRO Plus Consortium developed a methodology for creating White Lists of high-quality web resources, based on information retrieval and machine learning technologies developed by NCSR and ISIEVE. The methodology can be applied as is to any domain. The overall process is described in the following figure:

The process begins with a domain expert defining a keyword set. A specialized crawler engine uses this set to obtain the first set of results, which can number in the millions (depending on the domain).
In the second phase, the set of results is analyzed in order to discard resources that are not actually relevant to the topics of interest. These include expired or modified resources, resources that contain the keywords only in advertisements, image tags, etc. rather than in their core content, and so on. The clean set of resources will generally contain thousands of URIs. A random sample of these is given to the expert for evaluation. The expert simply marks each one as approved or disapproved, without having to provide any further details or remarks. This classification is used to retrain the crawling module so that it returns a set of web resources with a higher probability of being approved. After the retraining, the results number in the hundreds or low thousands. The cycle of random sampling, expert classification, retraining and refined searching can be repeated as many times as necessary to achieve the success ratio required by the interested party.
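The cycle of sampling, expert classification, retraining and refined searching can be sketched as a loop. This is only a minimal illustration under stated assumptions, not the Consortium's implementation: the function names (`refine_white_list`, `expert_approves`, `retrain_and_search`), the `example` hostnames, and the toy retraining rule (drop every URI whose host was disapproved in the sample) are all hypothetical stand-ins for the actual crawler and machine learning components, which the text does not specify.

```python
import random
from typing import Callable, List, Sequence, Tuple
from urllib.parse import urlparse


def refine_white_list(
    candidates: Sequence[str],
    expert_approves: Callable[[str], bool],
    retrain_and_search: Callable[[List[str], List[str], List[str]], List[str]],
    sample_size: int,
    target_ratio: float,
    max_rounds: int = 10,
    seed: int = 0,
) -> Tuple[List[str], List[float]]:
    """Repeat sample -> expert review -> retrain until the target ratio is met.

    Returns the final candidate set and the approval ratio seen in each round.
    """
    rng = random.Random(seed)
    current = list(candidates)
    ratios: List[float] = []
    for _ in range(max_rounds):
        if not current:
            break
        # Random sample handed to the expert for a yes/no decision.
        sample = rng.sample(current, min(sample_size, len(current)))
        approved: List[str] = []
        rejected: List[str] = []
        for uri in sample:
            (approved if expert_approves(uri) else rejected).append(uri)
        ratio = len(approved) / len(sample)
        ratios.append(ratio)
        if ratio >= target_ratio:
            break
        # Hypothetical retraining step: returns a refined candidate set.
        current = retrain_and_search(current, approved, rejected)
    return current, ratios


def toy_retrain(current: List[str], approved: List[str], rejected: List[str]) -> List[str]:
    """Toy stand-in for retraining: drop every URI whose host was rejected."""
    bad_hosts = {urlparse(u).netloc for u in rejected}
    return [u for u in current if urlparse(u).netloc not in bad_hosts]


# Toy data: 40 relevant pages and 20 irrelevant ones (hypothetical hosts).
pool = [f"http://good.example/page/{i}" for i in range(40)]
pool += [f"http://spam.example/page/{i}" for i in range(20)]

final, ratios = refine_white_list(
    pool,
    expert_approves=lambda uri: urlparse(uri).netloc == "good.example",
    retrain_and_search=toy_retrain,
    sample_size=10,
    target_ratio=1.0,
)
print(len(final), [round(r, 2) for r in ratios])
```

The expert's only input is a boolean per sampled URI, which mirrors the approved/disapproved decision described above; everything about how the decisions feed back into the search is isolated in the retraining callback.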
Documented Results
In addition to experimental testing, the methodology was tested on a fully implemented use case in order to examine its efficiency and accuracy. For this purpose, we collaborated with the Greek Adolescent Health Unit, a non-profit organization that aims to provide guidance and help to children, adolescents and their parents for problems frequently encountered in these age groups. The desired white list was to contain resources that provide valuable and accurate information about eating disorders and nutritional problems encountered in minors and teenagers. The initial set of keywords was as generic as possible and contained the terms children, adolescent, teenager, anorexia and obesity. Each step of the procedure gave the following results:
- Initial search: 1,250,000 URIs
- Content analysis: 2,000 URIs
- Set of randomly selected resources given to the expert for evaluation: 200 URIs
- Set of resources approved by the expert: 142 URIs (71% approved)
- Retraining and refined search: 220 URIs
- Second random sample of resources given to the expert: 50 URIs
- Approved resources: 41 (82%)
- Second retraining and refined search: 70 resources
- Third classification by the expert: 65 of 70 resources approved (~93%)
According to our results, each re-run of the core procedure improved the approval rate by more than 10 percentage points over the previous run (71%, 82%, ~93%), along with a significant decrease in the set that must be examined by the human expert. The use case indicated that the desired accuracy can be obtained in a relatively short time and with little human effort.
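The approval rates and the round-to-round improvement can be recomputed from the counts listed above. The short check below is only an arithmetic illustration; it reads the third round as 65 approved out of the 70 resources returned by the second refined search, which is an interpretation consistent with the stated ~93%.

```python
# (label, approved by the expert, size of the set shown to the expert)
rounds = [
    ("first sample", 142, 200),
    ("second sample", 41, 50),
    ("third classification", 65, 70),  # assumed: 65 approved of 70
]

# Approval rate per round, in percent.
rates = [100 * approved / shown for _, approved, shown in rounds]

# Gain of each round over the previous one, in percentage points.
gains = [later - earlier for earlier, later in zip(rates, rates[1:])]

for (label, _, _), rate in zip(rounds, rates):
    print(f"{label}: {rate:.1f}%")
print("gains (percentage points):", [round(g, 1) for g in gains])
```

Both gains come out slightly above 10 percentage points (71% to 82%, then 82% to about 92.9%).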
The results from the pilot implementation of the methodology were accepted by the Adolescent Health Unit.
