Work Package 6: Software Platform and Tools

WP Lead: UDE

Description

The general objective of this work package has been to provide a technical platform that integrates and implements the results of the work packages related to the basic methodology (WP2, WP3, WP4 and WP5), and thus facilitates the analyses conducted in the three case study work packages (WP7, WP8 and WP9).

WP6 relations


The specific objectives of this work package have included:

  • Selection and evaluation of mining strategies
  • Specification of the crawling approach and integration of crawlers
  • Specification and configuration of a software platform
  • Preparation / transformation of data for SNA
  • Specification and modeling of roles and constellations (SNA)
  • Data analyses and evaluation
  • Model revision and software adaptation

Results

The results of this general objective are, on the one hand, a web-based analysis workbench and, on the other, a web-based crawling and data extraction tool, which can be used stand-alone or in combination with the workbench. The workbench integrates the tools needed for analyses such as those carried out in the case studies. It allows setting up and executing complete workflows, starting from data import and preparation, going through several filtering and analysis steps of different kinds, and leading to results presented in the form of raw data (e.g. data tables or graph files) or as visualizations.

Strategies for crawling and data mining have been discussed in deliverable D6.1 and have been implemented in the crawling and data extraction tool, which is described in D6.4 and in the Data Extraction section of this web site.

The specification and configuration of the software platform have been discussed in all deliverables of this work package: the first specification is defined in deliverable D6.1, deliverables D6.2 and D6.3 show details of specific aspects, and a summary of the final state of the platform is given in deliverable D6.4. This web site also contains an overview of the technical system and details on the workbench.

Preparation and transformation of data for SNA were addressed conceptually in deliverables D6.1 and D6.2 and have been implemented in the form of the data extraction tool, which allows extracting structured information from semi-structured and unstructured data, and in the form of several tools integrated into the workbench, such as simple data transformation tools or tools for extracting network information from texts.


The developed tools have been revised throughout the project in cooperation with all involved partners, as well as through an evaluation workshop. Evaluation reports are given in deliverable D6.3 for the workbench and in D6.4 for the data extraction tool.

A report on the final state of the tools and an exemplary description of how they have been used in the project are given in deliverable D6.4.

SiSOB Workbench

The SiSOB Analysis Workbench is the main analysis tool developed during the SiSOB project. As such it is one of the results of work package 6. The workbench combines a web-based user interface for configuring analysis processes with server-side processing of the analyses. In terms of available analysis functionalities, the main focus during the SiSOB project has been on the implementation of network analysis techniques and on statistical analysis capabilities, but additional techniques can easily be integrated.

The User Interface

SiSOB interface

Figure 1

Figure 1 shows a screenshot of the workbench user interface. The user interface offers a menu bar at the top, an overview of the available analysis modules on the left hand side, a notification area on the right hand side, and the main workspace in the center. Workflows are configured using a visual pipes-and-filters representation. Each available analysis component is represented as one module or filter in the user interface. The user can add filters to the workspace by dragging them from the module overview on the left hand side. To connect filters, a pipe can be drawn between them, again using drag-and-drop, from an output terminal of one filter (at the bottom of the filter) to an input terminal of another filter (at the top of the filter).

 

The available filters are categorized by their function. Categories include, for example, input, data transformation, analysis, and visualization of graphs or statistical information. Each module contains a short self-description and an extended description, shown on demand via the "show details" button on the right of the filter. Once the workflow is constructed, it can be executed by pressing the "execute" button in the menu bar at the top of the page. The filter color gives the user feedback about the state of the analysis process: filters are colored blue by default and keep that color until they start processing. When processing starts, filters turn yellow; after processing they turn green (if the filter finishes successfully) or red (if it stops with an error). After successful execution, the results are displayed on the right hand side of the workbench user interface. If an error occurs, a description of the problem is displayed instead of the results and the workflow is stopped.
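To make the execution semantics concrete, the following minimal Python sketch models a linear chain of filters with the state colors described above. The class and function names are illustrative assumptions and do not correspond to the actual SiSOB implementation.

    from enum import Enum

    class FilterState(Enum):
        PENDING = "blue"    # default color before processing starts
        RUNNING = "yellow"  # the filter is currently processing
        DONE = "green"      # finished successfully
        ERROR = "red"       # stopped with an error

    class Filter:
        def __init__(self, name, func):
            self.name = name
            self.func = func                 # the analysis step this filter performs
            self.state = FilterState.PENDING

        def run(self, data):
            self.state = FilterState.RUNNING
            try:
                result = self.func(data)
                self.state = FilterState.DONE
                return result
            except Exception as exc:
                self.state = FilterState.ERROR
                raise RuntimeError(f"{self.name} failed: {exc}") from exc

    def execute_workflow(filters, data):
        """Run a linear chain of filters; execution stops at the first error."""
        for f in filters:
            data = f.run(data)
        return data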

The workbench also supports saving and loading workflows and retrieving previous results. Results are always linked to the workflow with which they were created, so it is always possible to load not only previously created results but also the corresponding workflow. This allows, on the one hand, reconstructing how the results were created and, on the other hand, rerunning the workflow with or without modifications.

The presented example workflow shows a community detection analysis on a co-authorship network. The input selected in the "Data Uploader" is a network of authors and their publications. In the second filter the input data is converted into the graph format usually used for exchanging network data in the workbench. The third filter converts the given 2-mode author-publication network into a 1-mode co-authorship network. That network is then analyzed using first the "Centrality" filter, which calculates the betweenness centrality values of the nodes, and then the "Clique Percolation Method" filter, which detects communities of authors working together. The results of this analysis process are then visualized using the "Force Directed Clustering" technique. The visualization created in this example is depicted in Figure 2.
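For readers who want to reproduce such an analysis outside the workbench, the following sketch performs the same conceptual steps in Python with the networkx library; the toy author-publication data is invented for illustration and is not part of the SiSOB case studies.

    import networkx as nx
    from networkx.algorithms import bipartite
    from networkx.algorithms.community import k_clique_communities

    # Toy 2-mode author-publication network
    B = nx.Graph()
    authors = ["A", "B", "C", "D", "E"]
    papers = ["p1", "p2"]
    B.add_nodes_from(authors, bipartite=0)
    B.add_nodes_from(papers, bipartite=1)
    B.add_edges_from([("A", "p1"), ("B", "p1"), ("C", "p1"),
                      ("C", "p2"), ("D", "p2"), ("E", "p2")])

    # Convert the 2-mode network into a 1-mode co-authorship network
    G = bipartite.projected_graph(B, authors)

    # Betweenness centrality of the authors
    centrality = nx.betweenness_centrality(G)

    # Clique percolation (k-clique communities) to detect groups of co-authors
    communities = [set(c) for c in k_clique_communities(G, 3)]

    print(centrality)   # e.g. "C" has the highest betweenness
    print(communities)  # e.g. [{"A", "B", "C"}, {"C", "D", "E"}]

    # A force-directed layout comparable to Figure 2 could be drawn with
    # nx.spring_layout(G) and matplotlib.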


SiSOB visualization example

Figure 2

Data Extractor

The data extractor is the stand-alone component of the SiSOB architecture in charge of retrieving information, mainly about researchers' careers, from the following types of data sources (see Figure 3):

  • (a) Structured data, that is, sources where the data is well structured, e.g. stored in XML files or in a relational database. These websites usually provide services (SOAP, REST, JSON, etc.) for information retrieval, so the information format is known in advance. For instance, the DBLP Computer Science Bibliography provides information on researchers through XML files, as do other bibliographic data sources such as Scopus, CiteSeer or Web of Knowledge (a minimal retrieval sketch is given after this list).
  • (b) Semi-structured data, i.e. information coming from web pages where the data is well structured but the structure is unknown to the data extractor. This type of extraction is based on recognizing the tags or fields within the data source.
  • (c) Non-structured data, which requires Natural Language Processing (NLP) strategies to obtain information. The data is unformatted, so this is the hardest and least efficient case. This category includes sources such as web pages containing information about scientific production, e.g. researchers' personal web pages, which contain relevant information about their careers and achievements.
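As a concrete illustration of case (a), the sketch below retrieves structured XML for an author query. It assumes DBLP's public XML search endpoint and response layout; both should be verified against the current DBLP API documentation, and the queried name is a placeholder.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    def dblp_publications(author_name, max_hits=10):
        """Query the (assumed) DBLP publication search API and yield (year, title)."""
        query = urllib.parse.urlencode({"q": author_name, "format": "xml", "h": max_hits})
        url = "https://dblp.org/search/publ/api?" + query
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        # Each hit is assumed to carry an <info> block with <title>, <year>, etc.
        for info in tree.getroot().iter("info"):
            yield info.findtext("year"), info.findtext("title")

    for year, title in dblp_publications("Jane Doe"):
        print(year, title)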

Figure 3. The three types of input data for the SiSOB Data Extractor.

The main goal of the extractor is therefore to provide researchers with a tool for retrieving and processing academic data. Using a set of input data that can be either web pages of institutions or directly the CVs of a sample of researchers, the system is able to extract useful information, discarding and filtering the data that is not relevant for the user. The extractor has been built as a web-based environment and is thus easily accessible, with a simple and intuitive interface. In addition, this component can be connected to other systems via a RESTful API. An example of such an integration is the presence of this tool inside the SiSOB workbench as a component providing input information to the data analysts.
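The following sketch illustrates how another system could call the extractor over a RESTful API. The endpoint path, request fields and response format are hypothetical placeholders, not the documented SiSOB interface.

    import json
    import urllib.request

    def submit_extraction_job(base_url, researchers):
        """POST a list of researchers to a hypothetical extraction endpoint."""
        payload = json.dumps({"researchers": researchers}).encode("utf-8")
        request = urllib.request.Request(
            base_url + "/api/extract",   # hypothetical route
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)   # e.g. a job id or the extracted items

    result = submit_extraction_job(
        "http://localhost:8080",         # placeholder host
        [{"name": "Jane Doe", "field": "biology", "university": "Example University"}],
    )
    print(result)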

Figure 4. Architecture of SiSOB Data Extractor.

Figure 4 shows the multi-layered architecture of the Data Extractor. As explained before, the system can be used either directly, through its web interface, by researchers who want to find curricular information about a set of individuals, or connected to other systems, such as the SiSOB workbench, via a REST-based service API. Below the external interfaces sits the data crawling and processing layer. This layer contains the following modular components, which can be run separately through the interfaces (a sketch of how they chain together is given after the list):

  • Crawler: Starting from an input dataset with basic information about a set of individuals (e.g. full name, field of expertise and, optionally, university), this module searches the web for information about these researchers using search engines such as DuckDuckGo, locating the web locations where that information can be found and generating a CSV file with all these data. Each location found is reported with a score, derived from a set of checks, which warns the user about possible false positives.
  • Email extractor: This module shares most functionality with the previous one but can be seen as a specialized crawler, since it focuses on searching for the individuals' email addresses. As in the crawler, each result includes a score indicating its reliability.
  • CV extractor: Taking the output of the crawler as input, this module downloads all the web pages found and the PDFs containing each researcher's curricular information. It also performs filtering and cleaning, mainly to provide the most suitable input to the text analyzer.
  • Text Analyzer: This module is in charge of extracting the curricular items that provide specific information about some aspect of a researcher's career. It is built on the GATE framework, which supplies NLP techniques; GATE has been customized in this module with a set of dictionaries and is used in combination with a set of heuristics.
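The sketch below shows how the modules conceptually chain together from crawling to text analysis. All function names and returned values are hypothetical placeholders; the real modules are separate components exposed through the web interface and the REST API.

    def crawl(people):
        """Crawler: search the web for each person and score the locations found."""
        return [{"name": p["name"],
                 "url": "https://example.org/" + p["name"].replace(" ", "-"),
                 "score": 0.9}                      # placeholder result and score
                for p in people]

    def extract_cvs(locations):
        """CV extractor: download and clean the pages/PDFs found by the crawler."""
        return [{"name": loc["name"], "text": "...cleaned CV text..."}
                for loc in locations if loc["score"] > 0.5]

    def analyze_text(cvs):
        """Text analyzer: pull curricular items from the cleaned text (GATE + heuristics)."""
        return [{"name": cv["name"], "items": ["PhD, 2005", "Postdoc, 2006-2008"]}
                for cv in cvs]

    people = [{"name": "Jane Doe", "field": "biology"}]
    curricula = analyze_text(extract_cvs(crawl(people)))
    print(curricula)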

Both the CV extractor and the Text Analyzer use a set of heuristics that are compiled in the heuristics layer. Furthermore, all the previous components use third-party open source APIs or frameworks, as can be seen in the corresponding layer. Finally, the SiSOB Data Extractor would not make sense without a set of data sources; accordingly, the system is fed by bibliographic sources (as explained before), and the Crawler module additionally uses the entire web through search engines.

The SiSOB Data Extractor has been intensively improved and tested during the project, providing input information to all the case studies, i.e. Mobility (see deliverables 7.1, 7.2 and 7.3), Knowledge Sharing (deliverables 8.1 and 8.3) and Peer Review (deliverables 9.1, 9.2 and 9.3). For instance, in the mobility case study, curricular information and emails of researchers from datasets of the National Institutes of Health (10,000 individuals) and of the Biotechnology and Biological Sciences Research Council (3,500 individuals) were used as input for the crawler, as reported in the deliverables of WP7.

Implications and future work

Due to the modular architecture of the workbench system, it will be possible to enhance the analysis capabilities of the workbench easily after the finalization of the general SiSOB system and even after the end of the SiSOB project. Publishing the SiSOB system as an open source project, which will be done by the end of the project, will also open up the system to researchers and developers who have not been involved in the project itself.

Some of the external experts involved in the evaluation workshop have already expressed interest in using the SiSOB system in different contexts, in extending it, and in introducing the tool in their respective communities. Two ongoing research projects are also working on integrating the workbench, namely the Go-Lab project, co-funded by the European Union in the 7th Framework Programme, and the KOLEGEA project, funded by the German Federal Ministry of Education and Research and by the European Social Fund.

The SiSOB data extractor has generated interest among the community of researchers studying innovation and research, since they see it as a means of obtaining public information for their studies. In this vein, some researchers who did not participate in the project have tested the tool and expressed their intention to keep using it.

Deliverables

Deliverable D6.1 Mining strategy and requirements specification for the software platform

Deliverable D6.2 First version of structural definitions

Deliverable D6.3 Configuration, test of the platform and first evaluation report

Deliverable D6.4 Final report and system

Milestones

MS 2   SiSOB System Prototype

MS 3   Final SiSOB System