Last modified: 2021-10-11 16:57:18
Compiled: Mon Oct 11 16:57:20 2021

1 Introduction

In the context of the Semantic Web, structured data and structured knowledge should be made more machine-readable. Such structured data are often described using the RDF data model.

The agGraphSearch package provides a tool set for handling such structured data. Its main functions map a lexical list of domain terms to the data and then extract a target subset of the class-related conceptual hierarchy based on common entities. The package aims to support the construction of an initial model of a domain-specific ontology from Linked Open Data (LOD).

This tutorial walks through the procedure for obtaining structured data from Wikidata as a real case study.

Figure 1: Overview of the domain ontology construction

This package provides a methodology for extracting target domain concepts from a large-scale LOD system. The proposed method obtains the class-related hierarchy of the domain concepts from the occurrences of common upper-level entities and the chains of path relationships between them. An overview of the method is shown in Figure 1.

Figure 2: Overview of the upper-level concept graph and analysis algorithm
The numbers in the nodes indicate the number of search entities that exist in the subordinate concepts.

As an example of class hierarchy extraction from LOD, this short tutorial provides a workflow to obtain and visualize conceptual hierarchies related to leukemia from Wikidata, using a few of its entity labels.

An overview of the workflow of the proposed method is shown in Figure 3.

Figure 3: Overview of the workflow of the proposed method

The result is similar to the network graph obtained with the Wikidata Graph Builder.

2 Getting started with agGraphSearch

agGraphSearch can be installed from GitHub, if it is not already available, and then loaded with the following commands.

#install
if(!require("agGraphSearch")){
  install.packages( "devtools" )
  devtools::install_github( "kumeS/agGraphSearch" )
}

#load
library("agGraphSearch")

3 Workflow for searching structures for leukemia

3.1 Vocabularies related to leukemia.

This tutorial uses the following three terms related to leukemia.

Acute lymphocytic leukemia (wd:Q180664)
Chronic eosinophilic leukemia (wd:Q5113976)
Philadelphia-positive myelogenous leukemia (wd:Q55790812)

The appropriate number of search terms is still under investigation. For a local extraction of a domain hierarchy, at least three search terms should be used.

terms <- c("acute lymphocytic leukemia",
           "Chronic eosinophilic leukemia",
           "philadelphia-positive myelogenous leukemia")
terms

#create a new folder
if(!dir.exists("01_Short_Out")){dir.create("01_Short_Out")}
saveRDS(terms, file="./01_Short_Out/SearchTerm.Rds")

3.2 SPARQL query (1) counting labels and class relations

Figure 4: Data model for the Wikidata class hierarchy

This tutorial focuses mainly on the data model for class hierarchies in Wikidata, shown in Figure 4. The class hierarchy of Wikidata is represented using the properties subClassOf (wdt:P279) and instanceOf (wdt:P31) as conceptual relationships between entities. Wikidata entities are identified by IDs called QIDs. In addition to QIDs, this tutorial uses the property relations for the representative name (rdfs:label) and the alias (skos:altLabel), which link QIDs to their label information.
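To make the data model concrete, the following is a minimal sketch of a SPARQL query that touches the four relations named above. The helper `build_class_query` is hypothetical and is not part of agGraphSearch; the queries actually issued by the package can be inspected with the check functions in the next section and may differ.

```r
# Hypothetical helper (not an agGraphSearch function): build a SPARQL query
# that asks for the labels and the direct upper-level classes of one entity.
build_class_query <- function(qid) {
  paste(
    "PREFIX wd: <http://www.wikidata.org/entity/>",
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
    "SELECT ?label ?altLabel ?parent ?class WHERE {",
    sprintf("  OPTIONAL { %s rdfs:label ?label . }", qid),       # representative name
    sprintf("  OPTIONAL { %s skos:altLabel ?altLabel . }", qid), # alias
    sprintf("  OPTIONAL { %s wdt:P279 ?parent . }", qid),        # subClassOf
    sprintf("  OPTIONAL { %s wdt:P31 ?class . }", qid),          # instanceOf
    "}",
    sep = "\n"
  )
}

# Only builds and prints the query text; nothing is sent to an endpoint
query <- build_class_query("wd:Q180664")
cat(query)
```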

3.2.1 Check SPARQL query

ter00 <- terms[1]

#check Query
CkeckQuery_agCount_Label_Num_Wikidata_P279_P31(Entity_Name = ter00)

#Endpoint
agGraphSearch::KzLabEndPoint_Wikidata$EndPoint
#Graph id
agGraphSearch::KzLabEndPoint_Wikidata$FROM

#run SPARQL
#library(agGraphSearch)
#library(SPARQL)
res <- agCount_Label_Num_Wikidata_P279_P31(Entity_Name = ter00, 
                                           Dir="02_Short_Out")
res

#View table
#agTableDT(res, Width = "100px", Transpose = TRUE, AutoWidth=FALSE)

3.2.2 Counting labels and class relations with a for-loop

The following code runs the SPARQL query in a for-loop, once for each of the three input terms.

#create an empty variable
m <- c()

#Run
for(n in 1:length(terms)){
#message(n)
m[[n]] <-agCount_Label_Num_Wikidata_P279_P31(Entity_Name = terms[n],
                                             Dir="02_Short_Out")
}

#convert list to data.frame
fm <- ListDF2DF(m)

# Extract only results with label and upper-level class
fm1 <- fm[c(fm$Hit_Label > 0),]
fm2 <- fm1[c(fm1$Hit_ALL > 0),]

#View the data
agTableDT(fm2, Width = "100px", Transpose = TRUE, AutoWidth=FALSE)

#dim(fm); dim(fm1); dim(fm2)
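The two filtering steps above can be tried without an endpoint. This is a small sketch on a toy table: the column names `LABEL`, `Hit_Label` and `Hit_ALL` are taken from the code above, but the rows and counts are made up for illustration.

```r
# Toy result table; labels and counts are invented for illustration only
fm_toy <- data.frame(
  LABEL     = c("term A", "term B", "term C"),
  Hit_Label = c(1, 0, 1),
  Hit_ALL   = c(2, 0, 0),
  stringsAsFactors = FALSE
)

# Keep rows that have a matching label AND a positive Hit_ALL count,
# which is equivalent to applying the two filters one after the other
fm_toy2 <- fm_toy[fm_toy$Hit_Label > 0 & fm_toy$Hit_ALL > 0, ]
fm_toy2$LABEL
```

Combining both conditions in one logical index gives the same rows as the two-step version and avoids the intermediate variable.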

3.2.3 Assigning Label information to QID

Lab01 <- fm2$LABEL

#Check Query
CkeckQuery_agWD_Alt_Wikidata(Lab01[1])

#create an empty variable
Lab01_res <- c()

#run agWD_Alt_Wikidata
for(n in 1:length(Lab01)){
Lab01_res[[n]] <- agWD_Alt_Wikidata(Lab01[n])
}

#assign results to a new variable
QID <- as.character(unlist(Lab01_res))

#create a new folder
if(!dir.exists("02_Short_Out")){dir.create("02_Short_Out")}
saveRDS(QID, file="./02_Short_Out/SearchEntities.Rds")

3.2.4 Retry SPARQL by QID

#View query
CkeckQuery_agCount_ID_Num_Wikidata_QID_P279_P31(QID[1])

#create an empty variable
QID_res <- c()

#Try SPARQL with QID
#iterate over QID itself; its length can differ from that of Lab01
for(n in seq_along(QID)){
QID_res[[n]] <- agCount_ID_Num_Wikidata_QID_P279_P31(QID[n])
}

#convert list to data frame
QID_res2 <- ListDF2DF(QID_res)

#check results
head(QID_res2)
dim(QID_res2)
colnames(QID_res2)

#All
table(QID_res2$Hit_All)
table(QID_res2$Hit_All > 0)
table(QID_res2$Hit_All_Parent > 0)
table(QID_res2$Hit_All_Child > 0)

#View the results
#agTableDT(QID_res2, Width = "100px", Transpose = TRUE, AutoWidth=FALSE)

3.3 SPARQL query (2) Excluding the particular relations

This step searches for neighboring entities and properties and records their presence or absence. If a particular entity or property is found among the neighbors, the search entity is excluded, as shown in Figure 5.

Examples of neighboring entities:
Family name (wd:Q101352)
movie (wd:Q11424)

Examples of neighboring properties:
sex or gender (wdt:P21)
located in the administrative territorial entity (wdt:P131)

Figure 5: Exclusion of non-applicable entities by relationships with the adjacent entity and the property

#For neighboring entities
#Check query
CkeckQuery_agCount_ID_Prop_Obj_Wikidata_vP( Entity_ID=QID[1], Object="wd:Q101352" )

#create an exclusion QID list without "wd:"
ExcluQ <- c("Q101352", "Q11424")
NumQ <- length(ExcluQ)
QIDdf <- data.frame(QID=QID)

#run SPARQL
for(m in seq_len(NumQ)){
#print(ExcluQ[m])

res <- list()
for(n in seq_along(QID)){
res[[n]] <- agCount_ID_Prop_Obj_Wikidata_vP(Entity_ID=QID[n], 
                                            Object=paste0("wd:", ExcluQ[m]))
}
res1 <- ListDF2DF(res)
#add a logical column named after the excluded QID
QIDdf[[ExcluQ[m]]] <- as.numeric(unlist(res1)) > 0
}

#View the result
agTableKB(QIDdf)

#For neighboring properties
#Check query
CkeckQuery_agCount_ID_Prop_Obj_Wikidata_vO( Entity_ID=QID[1], Property="wdt:P21")

#create an exclusion list without "wdt:"
ExcluP <- c("P21", "P131")
NumP <- length(ExcluP)

#run SPARQL
for(m in seq_len(NumP)){
print(ExcluP[m])

res <- list()
for(n in seq_along(QID)){
res[[n]] <- agCount_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], 
                                            Property=paste0("wdt:", ExcluP[m]))
}
res1 <- ListDF2DF(res)
#add a logical column named after the excluded property
QIDdf[[ExcluP[m]]] <- as.numeric(unlist(res1)) > 0
}

#view the result
agTableKB(QIDdf)
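The table built above only flags the entities; the exclusion itself is not shown. One possible way to apply it is sketched below on a toy copy of the table. The IDs and flag values are invented, and the rule "drop an entity if any flag is TRUE" is an assumption based on the description of this step.

```r
# Toy version of QIDdf; IDs and flag values are invented for illustration
QIDdf_toy <- data.frame(
  QID     = c("Q_A", "Q_B", "Q_C"),
  Q101352 = c(FALSE, TRUE,  FALSE),
  Q11424  = c(FALSE, FALSE, FALSE),
  P21     = c(FALSE, FALSE, TRUE),
  P131    = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Every column except QID is an exclusion flag
flag_cols <- setdiff(colnames(QIDdf_toy), "QID")

# Assumed rule: an entity is excluded if at least one flag is TRUE
excluded <- rowSums(QIDdf_toy[, flag_cols, drop = FALSE]) > 0
QID_kept <- QIDdf_toy$QID[!excluded]
QID_kept
```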

3.4 SPARQL query (3) Examining the upper-level class relations

3.4.1 instanceOf

# instanceOf (wdt:P31)
CkeckQuery_agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[1], Property="wdt:P31")

#create an empty variable
res3 <- c()

#run SPARQL
for(n in seq_len(length(QID))){
res3[[n]] <- agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], Property="wdt:P31")
}

3.4.2 subClassOf

# subClassOf (wdt:P279)
CkeckQuery_agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[1], Property="wdt:P279")

#create an empty variable
res4 <- c()

#run SPARQL
for(n in seq_len(length(QID))){
res4[[n]] <- agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], Property="wdt:P279")
}

#convert list to data.frame
res3b <- ListDF2DF(res3)
res4b <- ListDF2DF(res4)
res <- rbind(res3b, res4b)

#remove rows with NA on "o" col
(res.na <- res[!is.na(res$o),])

#View the result
#agTableDT(res.na, Width = "100px", Transpose = FALSE, AutoWidth=FALSE)

3.5 SPARQL query (4) Searching for the upper-level concepts

3.5.1 Obtaining the upper-level concepts from the input terms

#create a new folder
if(!dir.exists("03_Short_Out")){dir.create("03_Short_Out")}

#create an empty variable
res5 <- c()

#run SPARQL; search the upper-level classes
for(n in 1:length(QID)){
  message(n)
  res5[[n]] <- PropertyPath_GraphUp_Wikidata(Entity_ID = QID[n], 
                                             Depth = 30)  
}

#check results
head(res5[[1]])
agTableDT(res5[[1]])

#Count rows
checkNrow_af(res5)

#Detect loop
checkLoop_af(res5)

#Save
saveRDS(res5,
        file="./03_Short_Out/Individual_upGraph.Rds",
        compress = TRUE)
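The upward search with a depth limit can be illustrated locally. The sketch below walks a toy edge list from a start node toward its upper-level classes, stopping after `depth` steps or when no new parents are found. The function `walk_up` and the toy IDs are hypothetical, and this is not how PropertyPath_GraphUp_Wikidata is implemented, since that function queries the endpoint; only the column names `subject` and `parentClass` follow the result tables used in this tutorial.

```r
# Toy edge list (hypothetical IDs): each row reads "subject -> parentClass"
edges <- data.frame(
  subject     = c("A", "B", "C", "B"),
  parentClass = c("B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Collect all edges reachable upward from `start` within `depth` steps
walk_up <- function(edges, start, depth = 30) {
  frontier <- start           # nodes to expand in the current step
  visited  <- character(0)    # nodes already expanded (guards against loops)
  out      <- edges[0, ]      # empty data frame with the same columns
  for (i in seq_len(depth)) {
    hit <- edges[edges$subject %in% frontier, ]
    if (nrow(hit) == 0) break
    out      <- rbind(out, hit)
    visited  <- union(visited, frontier)
    frontier <- setdiff(unique(hit$parentClass), visited)
    if (length(frontier) == 0) break
  }
  unique(out)
}

up_all  <- walk_up(edges, "A")            # full upward graph from A
up_one  <- walk_up(edges, "A", depth = 1) # direct parents only
nrow(up_all); nrow(up_one)
```

Tracking visited nodes keeps the walk finite even if the edge list contains a cycle, which is the situation the loop check above is meant to detect.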

Alternatively, the same search can be run with purrr::map instead of a for-loop.

#run SPARQL with purrr::map function
res5m <- purrr::map(QID, 
                    PropertyPath_GraphUp_Wikidata, 
                    Depth = 30)

#check results
#Count rows
checkNrow_af(res5m)

#Detect loop
checkLoop_af(res5m)

3.5.2 Individual network diagrams

#create a new folder
if(!dir.exists("03_Short_Out_vis")){dir.create("03_Short_Out_vis")}

#create networks
for(n in 1:length(res5)){
#n <- 1
a <- agIDtoLabel_Wikidata(Entity_ID = QID[n])
if(is.na(a[,2])){a[,2] <- a[,3]}

Lab00 <- paste(a[,c(2, 1)], collapse = ".")
FileName <- paste0("agVisNetwork_", Lab00,"_", format(Sys.time(), "%y%m%d"),".html")

#run the network creation
agVisNetwork(Graph=res5[[n]], 
             Selected=Lab00, 
             Browse=FALSE, 
             Output=TRUE,
             FilePath=FileName)
Sys.sleep(1)

filesstrings::file.move(files=FileName,
                        destinations="./03_Short_Out_vis",
                        overwrite = TRUE)

Name <- paste0("./agVisNetwork_", 
               formatC(n, flag="0", width=4), 
               "_", Lab00, "_files")
if(dir.exists(Name)){unlink(Name, recursive = TRUE)}
}

#View the results
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern=".html")[1]))
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern=".html")[2]))
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern=".html")[3]))

3.5.3 Merged network diagrams

#Merge their graphs to one graph
res6 <- ListDF2DF(res5)

#check NAs
table(is.na(res6))

#Delete duplicates
res6d <- Exclude_Graph_duplicates(input=res6)

#check dim
dim(res6); dim(res6d)

#Save
saveRDS(res6d,
        file="./03_Short_Out/Merged_upGraph.Rds",
        compress = TRUE)

#run the network creation
if(TRUE){
FileName <- paste0("agVisNetwork_Merged", "_", 
                   format(Sys.time(), "%y%m%d"),".html")
agVisNetwork(Graph=res6d,
             Browse=FALSE,
             Output=TRUE,
             FilePath=FileName)
filesstrings::file.move(files=FileName,
                        destinations="./03_Short_Out_vis",
                        overwrite = TRUE)
}

#View the results
#browseURL(paste0("./03_Short_Out_vis/", FileName))

Figure 6: Merged network diagrams for search terms related to leukemia
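The merge step combines the individual graphs into one edge list and drops repeated edges. The internals of ListDF2DF and Exclude_Graph_duplicates are not shown in this tutorial, so the following base-R sketch on toy data only illustrates the general idea; the `key` column is a hypothetical helper column.

```r
# Two toy individual graphs (hypothetical IDs) that share the edge B -> C
g1 <- data.frame(subject = c("A", "B"), parentClass = c("B", "C"),
                 stringsAsFactors = FALSE)
g2 <- data.frame(subject = c("X", "B"), parentClass = c("B", "C"),
                 stringsAsFactors = FALSE)

# Stack the individual graphs into one edge list
merged <- do.call(rbind, list(g1, g2))

# Build a key per edge and keep the first occurrence of each key
merged$key <- paste(merged$subject, merged$parentClass, sep = ".")
merged_d   <- merged[!duplicated(merged$key), ]

# Compare the number of rows before and after removing duplicates
nrow(merged); nrow(merged_d)
```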

3.5.4 Identification of the common upper-level entities using individual networks

The common upper-level concept is defined based on the edge list of triples obtained above.

##Graph data without the duplicates
#Number of entities
(E01 <- length(unique(c(res6d$subject, res6d$parentClass))))
#Number of labels
(E02 <- length(unique(c(res6d$subjectLabel, res6d$parentClassLabel))))
#Number of Triples
(E03 <- length(unique(res6d$triples)))

#Gathering the parent concepts
upEntity <- unlist(purrr::map(res5, function(x){unique(x$parentClass)}))

#calculate the frequency of common entities
Count_upEntity_DF <- countCommonEntities(upEntity)

#Count and view table
agTableDT(Count_upEntity_DF, Transpose = F, AutoWidth = FALSE)

#Count Freq
table(Count_upEntity_DF$Freq)

#extract parentClass & parentClassLabel from the merged dataset
Dat <- data.frame(res6d[,c(colnames(res6d) == "parentClass" | 
                          colnames(res6d) == "parentClassLabel")], 
                  stringsAsFactors = FALSE)
head(Dat)

#Delete the duplicates
Dat0 <- Exclude_duplicates(Dat, 1)
head(Dat0)
dim(Dat); dim(Dat0)

#define the common upper-level entities
dim(Count_upEntity_DF); dim(Dat0)
head(Count_upEntity_DF); head(Dat0)
Count_upEntity_DF2 <- Cutoff_FreqNum(input1=Count_upEntity_DF, 
                                     input2=Dat0, 
                                     By="parentClass", 
                                     Sort="Freq", 
                                     FreqNum=2)

#check the results
head(Count_upEntity_DF2, n=10)
table(Count_upEntity_DF2$Freq)

#save
saveRDS(Count_upEntity_DF2,
        file = "./03_Short_Out/Count_upEntity_DF2.Rds", compress = TRUE)
readr::write_excel_csv(Count_upEntity_DF2,
                       file="./03_Short_Out/Count_upEntity_DF2.csv")
#Count_upEntity_DF2 <- readRDS(file = "./03_Short_Out/Count_upEntity_DF2.Rds")

#Calculation of inclusion rate
QID <- QIDdf$QID

##QID
#combine both columns with c(); the second argument of unique() is not a data vector
qid <- unique(c(res6d$subject, res6d$parentClass))
b <- setdiff(QID, qid)
b; length(b)

##rdfsLabel
#RdfsLabel <- unique(c(res6d$subjectLabel, res6d$parentClassLabel))
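The counting of common upper-level entities can be reproduced on toy data with base R. This sketch counts in how many individual graphs each parent class occurs and keeps those that reach a cutoff of 2. The toy IDs are invented, and reading FreqNum=2 as "shared by two or more graphs" is an assumption; countCommonEntities and Cutoff_FreqNum may behave differently in detail.

```r
# Three toy individual graphs (hypothetical IDs)
graphs <- list(
  data.frame(subject = c("A", "B"), parentClass = c("B", "C"),
             stringsAsFactors = FALSE),
  data.frame(subject = c("X", "B"), parentClass = c("B", "C"),
             stringsAsFactors = FALSE),
  data.frame(subject = "Y", parentClass = "C", stringsAsFactors = FALSE)
)

# Gather the parent classes, counting each class once per graph
up <- unlist(lapply(graphs, function(x) unique(x$parentClass)))

# Frequency table: number of graphs in which each parent class occurs
freq <- as.data.frame(table(parentClass = up), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq), ]

# Assumed cutoff: keep classes shared by two or more graphs
common <- freq[freq$Freq >= 2, ]
common
```

Taking `unique()` within each graph before pooling means the frequency counts graphs rather than edges, which matches the way `upEntity` is built above.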

3.5.5 Results for the common upper-level entities

FileName <- paste0("./FrequencyGraph_", format(Sys.time(), "%y%m%d_%H%M"),".html")

pc_plot(Count_upEntity_DF2, 
        SaveFolder="03_Short_Out_vis", 
        FileName=FileName, 
        IDnum=3)

#View the results
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern="FrequencyGraph_")[2]))
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern="FrequencyGraph_")[1]))

3.6 Extraction of class hierarchies based on common entities

3.6.1 Set-up parameters

#Individual graphs
eachGraph <- readRDS("./03_Short_Out/Individual_upGraph.Rds")
head(eachGraph[[1]])
sapply(eachGraph, dim)

#Search entities
list1a <- readRDS("./02_Short_Out/SearchEntities.Rds")

head(list1a)
any(list1a == "wd:Q35120")

#Common entities
list2a <- readRDS("./03_Short_Out/Count_upEntity_DF2.Rds")

head(list2a)
dim(list2a)

list2b <- unique(list2a$parentClass)
head(list2b)
any(list2b == "wd:Q35120")

#Remove Q35120 from the common list.
list2b <- list2b[list2b != "wd:Q35120"]

#Inclusion of list1a and list2b
table(list1a %in% list2b)
table(list2b %in% list1a)

3.6.2 Calculation for expanded common upper-level entities and number of expanded steps

system.time(
  SearchNum <- agGraphAnalysis(eachGraph, 
                               list1a, 
                               list2b, 
                               LowerSearch=TRUE)
  )

head(SearchNum)
table(SearchNum$Levels)
sum(table(SearchNum$Levels))
table(!is.na(SearchNum[,2]))

Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] agGraphSearch_0.99.3 SPARQL_1.16          RCurl_1.98-1.5      
## [4] XML_3.99-0.8         EBImage_4.34.0       BiocStyle_2.20.2    
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2                  sass_0.4.0                 
##  [3] tidyr_1.1.3                 jsonlite_1.7.2             
##  [5] viridisLite_0.4.0           bslib_0.3.0                
##  [7] franc_1.1.3                 assertthat_0.2.1           
##  [9] BiocManager_1.30.16         highr_0.9                  
## [11] tiff_0.1-8                  yaml_2.2.1                 
## [13] pillar_1.6.2                lattice_0.20-44            
## [15] glue_1.4.2                  WikidataQueryServiceR_1.0.0
## [17] digest_0.6.27               colorspace_2.0-2           
## [19] htmltools_0.5.2             pkgconfig_2.0.3            
## [21] magick_2.7.3                bookdown_0.24              
## [23] purrr_0.3.4                 ratelimitr_0.4.1           
## [25] fftwtools_0.9-11            scales_1.1.1               
## [27] stringdist_0.9.8            jpeg_0.1-9                 
## [29] tibble_3.1.4                generics_0.1.0             
## [31] ggplot2_3.3.5               ellipsis_0.3.2             
## [33] DT_0.19                     BiocGenerics_0.38.0        
## [35] lazyeval_0.2.2              magrittr_2.0.1             
## [37] crayon_1.4.1.9000           filesstrings_3.2.2         
## [39] strex_1.4.2                 evaluate_0.14              
## [41] fansi_0.5.0                 tools_4.1.0                
## [43] data.table_1.14.0           lifecycle_1.0.0            
## [45] stringr_1.4.0               plotly_4.9.4.1             
## [47] munsell_0.5.0               locfit_1.5-9.4             
## [49] jsTree_1.2                  formattable_0.2.1          
## [51] networkD3_0.4               compiler_4.1.0             
## [53] jquerylib_0.1.4             rlang_0.4.11               
## [55] grid_4.1.0                  visNetwork_2.0.9           
## [57] htmlwidgets_1.5.4           igraph_1.2.6               
## [59] bitops_1.0-7                rmarkdown_2.11             
## [61] gtable_0.3.0                abind_1.4-5                
## [63] DBI_1.1.1                   R6_2.5.1                   
## [65] knitr_1.34                  dplyr_1.0.7                
## [67] fastmap_1.1.0               utf8_1.2.2                 
## [69] stringi_1.7.4               parallel_4.1.0             
## [71] Rcpp_1.0.7                  vctrs_0.3.8                
## [73] png_0.1-7                   tidyselect_1.1.1           
## [75] xfun_0.26

Short tutorial: a workflow to use agGraphSearch for leukemia terms

2021-10-11
