agGraphSearch, knitr, SPARQL, EBImage
Last modified: 2021-10-11 16:57:18
Compiled: Mon Oct 11 16:57:20 2021
In the context of the semantic web, structured data or structured knowledge should be suitable toward more machine-readable. These structured data is often described by an RDF data model.
The agGraphSearch package provides a tool-set to handle such structured data. The main functions of this package are to map a lexical list of domain terms to the data and then extract a target subset of class-related conceptual hierarchy in a common entity-based manner. This aimed to support the construction of an initial model of domain-specific ontology from Linked Open Data (LOD)
This tutorial will provide the procedure to obtain structured data from Wikidata as a real case study.
This package provides a methodology for extracting target domain concepts from a large-scale LOD system. In the proposed method, the class-related hierarchy of the domain concept by the occurrences of common upper-level entities and the chain of those path relationships is obtained. The proposed method was described in Figure 1.
As an example of class hierarchy extraction from LOD, this short tutorial provides a workflow to obtain and visualize conceptual hierarchies related to leukemia from wikidata using its some entity labels.
Overview of the workflow of the proposed method was described in Figure 3.
This result is similar to the network graph obtained with wikidata graph builder.
Once agGraphSearch is installed, it can be loaded by the following command.
#install
if(!require("agGraphSearch")){
install.packages( "devtools" )
devtools::install_github( "kumeS/agGraphSearch" )
}
#load
library("agGraphSearch")
In this tutorial, the data model for class hierarchies in Wikidata will be mainly focused. It is shown in Figure 3. The class hierarchy of Wikidata is represented using the properties of subClassOf (wdt:P279) and instanceOf (wdt:P31) as a conceptual relationship between entities. In addition, the Wikidata entities are represented by IDs called QIDs. In this tutorial, in addition to QIDs, we used the property relations of representative name (rdfs:label) and alias (skos:altLabel), which represent links to label information of QIDs.
ter00 <- terms[1]
#check Query
CkeckQuery_agCount_Label_Num_Wikidata_P279_P31(Entity_Name = ter00)
#Endpoint
agGraphSearch::KzLabEndPoint_Wikidata$EndPoint
#Graph id
agGraphSearch::KzLabEndPoint_Wikidata$FROM
#run SPARQL
#library(agGraphSearch)
#library(SPARQL)
res <- agCount_Label_Num_Wikidata_P279_P31(Entity_Name = ter00,
Dir="02_Short_Out")
res
#View table
#agTableDT(res, Width = "100px", Transpose = TRUE, AutoWidth=FALSE)
This program executes SPARQL with a for-loop.
Inputs are 3 terms.
#create an empty variable
m <- c()
#Run
for(n in 1:length(terms)){
#message(n)
m[[n]] <-agCount_Label_Num_Wikidata_P279_P31(Entity_Name = terms[n],
Dir="02_Short_Out")
}
#convert list to data.frame
fm <- ListDF2DF(m)
# Extract only results with label and upper-level class
fm1 <- fm[c(fm$Hit_Label > 0),]
fm2 <- fm1[c(fm1$Hit_ALL > 0),]
#View the data
agTableDT(fm2, Width = "100px", Transpose = TRUE, AutoWidth=FALSE)
#dim(fm); dim(fm1); dim(fm2)
Lab01 <- fm2$LABEL
#Check Query
CkeckQuery_agWD_Alt_Wikidata(Lab01[1])
#create an empty variable
Lab01_res <- c()
#run agWD_Alt_Wikidata
for(n in 1:length(Lab01)){
Lab01_res[[n]] <- agWD_Alt_Wikidata(Lab01[n])
}
#assign results to a new variable
QID <- as.character(unlist(Lab01_res))
#create a new folder
if(!dir.exists("02_Short_Out")){dir.create("02_Short_Out")}
saveRDS(QID, file="./02_Short_Out/SearchEntities.Rds")
#View query
CkeckQuery_agCount_ID_Num_Wikidata_QID_P279_P31(QID[1])
#create an empty variable
QID_res <- c()
#Try SPARQL with QID
for(n in 1:length(Lab01)){
QID_res[[n]] <- agCount_ID_Num_Wikidata_QID_P279_P31(QID[n])
}
#convert list to data frame
QID_res2 <- ListDF2DF(QID_res)
#check results
head(QID_res2)
dim(QID_res2)
colnames(QID_res2)
#All
table(QID_res2$Hit_All)
table(QID_res2$Hit_All > 0)
table(QID_res2$Hit_All_Parent > 0)
table(QID_res2$Hit_All_Child > 0)
#View the results
#agTableDT(QID_res2, Width = "100px", Transpose = TRUE, AutoWidth=FALSE)
This step search for neighboring entities and properties, and then count their presence or absence. If the particular entity exists in the neighbor, the search entity is excluded. It is shown in Figure 4.
Ex. examples of neighboring entities - Family name (wd:Q101352) - movie (wd:Q11424)
Ex. examples of neighboring properties - sex or gender (wdt:P21) - located in the administrative territorial entity (wdt:P131)
#For neighboring entities
#Check query
CkeckQuery_agCount_ID_Prop_Obj_Wikidata_vP( Entity_ID=QID[1], Object="wd:Q101352" )
#create an exclusion QID list without "wd:"
ExcluQ <- c("Q101352", "Q11424")
NumQ <- length(ExcluQ)
QIDdf <- data.frame(QID=QID)
#run SPARQL
for(m in seq_len(NumQ)){
#print(ExcluQ[m])
res <- c()
for(n in seq_len(length(QID))){
res[[n]] <- agCount_ID_Prop_Obj_Wikidata_vP(Entity_ID=QID[n],
Object=paste0("wd:", ExcluQ[m]))
}
res1 <- ListDF2DF(res)
eval(parse(text=paste0("QIDdf$", ExcluQ[m], " <- c(as.numeric(unlist(res1)) > 0)")))
}
#View the result
agTableKB(QIDdf)
#For neighboring properties
#Check query
CkeckQuery_agCount_ID_Prop_Obj_Wikidata_vO( Entity_ID=QID[1], Property="wdt:P21")
#create an exclusion list without "wdt:"
ExcluP <- c("P21", "P131")
NumP <- length(ExcluP)
#run SPARQL
for(m in seq_len(NumP)){
print(ExcluP[m])
res <- c()
for(n in seq_len(length(QID))){
res[[n]] <- agCount_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n],
Property=paste0("wdt:", ExcluP[m]))
}
res1 <- ListDF2DF(res)
eval(parse(text=paste0("QIDdf$", ExcluP[m], " <- c(as.numeric(unlist(res1)) > 0)")))
}
#view the result
agTableKB(QIDdf)
# instanceOf (wdt:P31)
CkeckQuery_agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], Property="wdt:P31")
#create an empty variable
res3 <- c()
#run SPARQL
for(n in seq_len(length(QID))){
res3[[n]] <- agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], Property="wdt:P31")
}
# subClassOf (wdt:P279)
CkeckQuery_agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], Property="wdt:P279")
#create an empty variable
res4 <- c()
#run SPARQL
for(n in seq_len(length(QID))){
res4[[n]] <- agWD_ID_Prop_Obj_Wikidata_vO(Entity_ID=QID[n], Property="wdt:P279")
}
#convert list to data.frame
res3b <- ListDF2DF(res3)
res4b <- ListDF2DF(res4)
res <- rbind(res3b, res4b)
#remove rows with NA on "o" col
(res.na <- res[!is.na(res$o),])
#View the result
#agTableDT(res.na, Width = "100px", Transpose = FALSE, AutoWidth=FALSE)
#create a new folder
if(!dir.exists("03_Short_Out")){dir.create("03_Short_Out")}
#create an empty variable
res5 <- c()
#run SPARQL; search the upper-level classes
for(n in 1:length(QID)){
message(n)
res5[[n]] <- PropertyPath_GraphUp_Wikidata(Entity_ID = QID[n],
Depth = 30)
}
#check results
head(res5[[1]])
agTableDT(res5[[1]])
#Count rows
checkNrow_af(res5)
#Detect loop
checkLoop_af(res5)
#Save
saveRDS(res5,
file="./03_Short_Out/Individual_upGraph.Rds",
compress = TRUE)
An alternative way,
#run SPARQL with purrr::map function
res5m <- purrr::map(QID,
PropertyPath_GraphUp_Wikidata,
Depth = 30)
#check results
#Count rows
checkNrow_af(res5m)
#Detect loop
checkLoop_af(res5m)
#create a new folder
if(!dir.exists("03_Short_Out_vis")){dir.create("03_Short_Out_vis")}
#create networks
for(n in 1:length(res5)){
#n <- 1
a <- agIDtoLabel_Wikidata(Entity_ID = QID[n])
if(is.na(a[,2])){a[,2] <- a[,3]}
Lab00 <- paste(a[,c(2, 1)], collapse = ".")
FileName <- paste0("agVisNetwork_", Lab00,"_", format(Sys.time(), "%y%m%d"),".html")
#run the network creation
agVisNetwork(Graph=res5[[n]],
Selected=Lab00,
Browse=FALSE,
Output=TRUE,
FilePath=FileName)
Sys.sleep(1)
filesstrings::file.move(files=FileName,
destinations="./03_Short_Out_vis",
overwrite = TRUE)
Name <- paste0("./agVisNetwork_",
formatC(n, flag="0", width=4),
"_", Lab00, "_files")
if(dir.exists(Name)){file.remove(Name)}
}
#View the results
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern=".html")[1]))
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern=".html")[2]))
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern=".html")[3]))
#Merge their graphs to one graph
res6 <- ListDF2DF(res5)
#check NAs
table(is.na(res6))
#Delete deplicates
res6d <- Exclude_Graph_duplicates(input=res6)
#check dim
dim(res6); dim(res6d)
#Save
saveRDS(res6d,
file="./03_Short_Out/Merged_upGraph.Rds",
compress = TRUE)
#run the network creation
if(TRUE){
FileName <- paste0("agVisNetwork_Merged", "_",
format(Sys.time(), "%y%m%d"),".html")
agVisNetwork(Graph=res6d,
Browse=FALSE,
Output=TRUE,
FilePath=FileName)
filesstrings::file.move(files=FileName,
destinations="./03_Short_Out_vis",
overwrite = TRUE)
}
#View the results
#browseURL(paste0("./03_Short_Out_vis/", FileName))
The common upper-level concept is defined based on the edge list of triples obtained above.
##Graph data without the duplicates
#Number of entities
(E01 <- length(unique(c(res6d$subject, res6d$parentClass))))
#Number of labels
(E02 <- length(unique(c(res6d$subjectLabel, res6d$parentClassLabel))))
#Number of Triples
(E03 <- length(unique(res6d$triples)))
#Gathering the parent concepts
upEntity <- unlist(purrr::map(res5, function(x){unique(x$parentClass)}))
#calculate the frequency of common entities
Count_upEntity_DF <- countCommonEntities(upEntity)
#Count and view table
agTableDT(Count_upEntity_DF, Transpose = F, AutoWidth = FALSE)
#Count Freq
table(Count_upEntity_DF$Freq)
#extarct parentClass & parentClassLabel from the merged dataset
Dat <- data.frame(res6d[,c(colnames(res6d) == "parentClass" |
colnames(res6d) == "parentClassLabel")],
stringsAsFactors = F)
head(Dat)
#Delete the deplicates
Dat0 <- Exclude_duplicates(Dat, 1)
head(Dat0)
dim(Dat); dim(Dat0)
#define the common upper-level entities
dim(Count_upEntity_DF); dim(Dat0)
head(Count_upEntity_DF); head(Dat0)
Count_upEntity_DF2 <- Cutoff_FreqNum(input1=Count_upEntity_DF,
input2=Dat0,
By="parentClass",
Sort="Freq",
FreqNum=2)
#check the results
head(Count_upEntity_DF2, n=10)
table(Count_upEntity_DF2$Freq)
#save
saveRDS(Count_upEntity_DF2,
file = "./03_Short_Out/Count_upEntity_DF2.Rds", compress = TRUE)
readr::write_excel_csv(Count_upEntity_DF2,
file="./03_Short_Out/Count_upEntity_DF2.csv")
#Count_upEntity_DF2 <- readRDS(file = "./03_Short_Out/Count_upEntity_DF2.Rds")
#Calculation of inclusion rate
QID <- QIDdf$QID
##QID
qid <- unique(res6d$subject, res6d$parentClass)
b <- setdiff(QID, qid)
b; length(b)
##rdfsLabel
#RdfsLabel <- unique(res6d$subjectLabel, res6d$parentClassLabel)
FileName <- paste0("./FrequencyGraph_", format(Sys.time(), "%y%m%d_%H%M"),".html")
pc_plot(Count_upEntity_DF2,
SaveFolder="03_Short_Out_vis",
FileName=FileName,
IDnum=3)
#View the results
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern="FrequencyGraph_")[2]))
#browseURL(paste0("./03_Short_Out_vis/", dir("03_Short_Out_vis", pattern="FrequencyGraph_")[1]))
#Individual graphes
eachGraph <- readRDS("./03_Short_Out/Individual_upGraph.Rds")
head(eachGraph[[1]])
sapply(eachGraph, dim)
#Search entities
list1a <- readRDS("./02_Short_Out/SearchEntities.Rds")
head(list1a)
any(list1a == "wd:Q35120")
#Common entities
list2a <- readRDS("./03_Short_Out/Count_upEntity_DF2.Rds")
head(list2a)
dim(list2a)
list2b <- unique(list2a$parentClass)
head(list2b)
any(list2b == "wd:Q35120")
#Remove Q35120 from the common list.
list2b <- list2b[list2b != "wd:Q35120"]
#Inclusion of list1a and list2b
table(list1a %in% list2b)
table(list2b %in% list1a)
system.time(
SearchNum <- agGraphAnalysis(eachGraph,
list1a,
list2b,
LowerSearch=TRUE)
)
head(SearchNum)
table(SearchNum$Levels)
sum(table(SearchNum$Levels))
table(SearchNum$Levels)
table(!is.na(SearchNum[,2]))
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] agGraphSearch_0.99.3 SPARQL_1.16 RCurl_1.98-1.5
## [4] XML_3.99-0.8 EBImage_4.34.0 BiocStyle_2.20.2
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.0
## [3] tidyr_1.1.3 jsonlite_1.7.2
## [5] viridisLite_0.4.0 bslib_0.3.0
## [7] franc_1.1.3 assertthat_0.2.1
## [9] BiocManager_1.30.16 highr_0.9
## [11] tiff_0.1-8 yaml_2.2.1
## [13] pillar_1.6.2 lattice_0.20-44
## [15] glue_1.4.2 WikidataQueryServiceR_1.0.0
## [17] digest_0.6.27 colorspace_2.0-2
## [19] htmltools_0.5.2 pkgconfig_2.0.3
## [21] magick_2.7.3 bookdown_0.24
## [23] purrr_0.3.4 ratelimitr_0.4.1
## [25] fftwtools_0.9-11 scales_1.1.1
## [27] stringdist_0.9.8 jpeg_0.1-9
## [29] tibble_3.1.4 generics_0.1.0
## [31] ggplot2_3.3.5 ellipsis_0.3.2
## [33] DT_0.19 BiocGenerics_0.38.0
## [35] lazyeval_0.2.2 magrittr_2.0.1
## [37] crayon_1.4.1.9000 filesstrings_3.2.2
## [39] strex_1.4.2 evaluate_0.14
## [41] fansi_0.5.0 tools_4.1.0
## [43] data.table_1.14.0 lifecycle_1.0.0
## [45] stringr_1.4.0 plotly_4.9.4.1
## [47] munsell_0.5.0 locfit_1.5-9.4
## [49] jsTree_1.2 formattable_0.2.1
## [51] networkD3_0.4 compiler_4.1.0
## [53] jquerylib_0.1.4 rlang_0.4.11
## [55] grid_4.1.0 visNetwork_2.0.9
## [57] htmlwidgets_1.5.4 igraph_1.2.6
## [59] bitops_1.0-7 rmarkdown_2.11
## [61] gtable_0.3.0 abind_1.4-5
## [63] DBI_1.1.1 R6_2.5.1
## [65] knitr_1.34 dplyr_1.0.7
## [67] fastmap_1.1.0 utf8_1.2.2
## [69] stringi_1.7.4 parallel_4.1.0
## [71] Rcpp_1.0.7 vctrs_0.3.8
## [73] png_0.1-7 tidyselect_1.1.1
## [75] xfun_0.26