
Commit de8c732

Merge branch 'dev' into 'main'

New release of docs. See merge request hmc/hmc-public/unhide/documentation!7

2 parents: ac056d2 + 81fb709

12 files changed: +770 −697 lines

docs/_toc.yml (+1 −1)

@@ -9,7 +9,7 @@ parts:
     numbered: False
     chapters:
     - file: introduction/about.md
-      title: "About"
+      title: "About & Mission"
    - file: introduction/implementation.md
      title: "Implementation overview"
    - file: introduction/data_sources.md

docs/dev_guide/architecture/07_deployment_view.md (+1 −1)

@@ -99,7 +99,7 @@ Mapping of Building Blocks to Infrastructure
 : *\<description of the mapping>*
 :::
 
-## Infrastructure Level 1 {#_infrastructure_level_1}
+## Infrastructure Level 1
 
 
 UnHIDE is deployed on [HDF-cloud](https://www.fz-juelich.de/en/ias/jsc/systems/scientific-clouds/hdf-cloud)

docs/diagrams/make_svgs.ipynb (+364 −345)

Large diffs are not rendered by default.

docs/diagrams/unhide_deployment_overview.d2 (+10 −10)

@@ -1,12 +1,12 @@
-title: UnHIDE deployment {
+title: Helmholtz Knowledge graph deployment {
   shape: text
   near: top-center
   style: {
     font-size: 75
   }
 }
 
-hdfcloud: HDF-Cloud{
+hdfcloud: JSC-Cloud{
   style: {
     font-size: 55
   }
@@ -55,16 +55,16 @@ hdfcloud: HDF-Cloud{
   }
 }
 
-  jena: Apache Jena {
+  virtuoso: OpenLink Virtuoso {
     style: {
       font-size: 55
     }
     icon: https://icons.terrastruct.com/dev%2Fdocker.svg
     graph: UnHIDE Graph {
      icon: https://icons.terrastruct.com/azure%2FManagement%20and%20Governance%20Service%20Color%2FResource%20Graph%20Explorer.svg
    }
-    sparql: Fuseki SPARQL API {
-      icon: ./sparql.svg
+    sparql: OpenLink SPARQL API {
+      icon: ./virtuoso_logo.png
    }
  }
 
@@ -94,11 +94,11 @@ hdfcloud: HDF-Cloud{
 
 store -> indexer: reads from
 pipe -> store: stores data
-jena <-> store: store & retrieve graph
-solr <-> store: stores & retrieve index
+jena <-> store.UnHIDE Graph files: store & retrieve graph
+solr <-> store.SOLR Index: stores & retrieve index
 solr <- api: queries
-Jena.graph <- jena.sparql: queries
-jena.sparql <-> nginx: routes
+Virtuoso.graph <- jena.sparql: queries
+virtuoso.sparql <-> nginx: routes
 letsencrypt <-> nginx: encrypts
 web -> api: requests
 web <-> nginx: routes
@@ -126,4 +126,4 @@ Internet {
   domain3: sparql.unhide.helmholtz-metadaten.de
 }
 
-hdfcloud.cloud.nginx <-> Internet: handles requests
+hdfcloud.cloud.nginx <-> Internet: handles requests
(Binary file, 947 KB: not shown.)

docs/diagrams/unhide_deployment_overview.svg (+331 −329)

docs/diagrams/virtuoso_logo.png (20.8 KB)

docs/images/hzb-logo-a4-rgb.png (41.3 KB)

docs/intro.md (+6 −3)

@@ -39,19 +39,22 @@ With the implementation of the Helmholtz-KG, unHIDE will create substantial addi
 
 ## Contributors and Partners
 
+% [<img src="./images/hzb-logo-a4-rgb.png" alt="HZB" width=40% height=40%>](https://www.helmholtz-berlin.de/)
 
-[<img style="vertical-align: middle;" alt="FZJ" src='https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/FZJ/FZJ.png' width=20% height=20%>](https://fz-juelich.de)
+[<img style="vertical-align: left;" alt="FZJ" src='https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/FZJ/FZJ.png' width=60% height=60%>](https://fz-juelich.de)
+![HZB](./images/hzb-logo-a4-rgb.png)
 
 
-## Acknowledgements
 
+## Acknowledgements
 
-[<img style="vertical-align: middle;" alt="HMC Logo" src='https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/HMC/HMC_Logo_M.png' width=50% height=50%>](https://helmholtz-metadaten.de)
 
 This project was developed and funded by the Helmholtz Metadata Collaboration
 (HMC), an incubator-platform of the Helmholtz Association within the framework of the
 Information and Data Science strategic initiative.
 
+[<img style="vertical-align: middle;" alt="HMC Logo" src='https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/HMC/HMC_Logo_M.png' width=50% height=50%>](https://helmholtz-metadaten.de)
+
 
 ## References
 - [1] https://5stardata.info/en/

docs/introduction/about.md (+12 −2)

@@ -1,3 +1,13 @@
-# About UnHIDE
+# About UnHIDE and its mission
 
-![unhide_overview](../images/unhide_overview.png)
+## Mission
+
+The unHIDE initiative is one part of the Helmholtz Metadata Collaboration's (HMC) efforts to improve the quality, knowledge management, and preservation of the research output of the Helmholtz Association through metadata. This is accomplished by making research output `FAIR` through better metadata, or, put differently, by building a form of semantic web encompassing Helmholtz research.
+
+With the unHIDE initiative, our goal is to improve metadata at the source and to make data providers as well as scientists more aware of what metadata they put out on the web, how, and with what quality.
+To this end, we create and expose the Helmholtz knowledge graph, which contains open, high-level metadata exposed by different Helmholtz infrastructures. Such a graph also enables services that address the needs of specific stakeholder groups and empower their work in different ways.
+
+Beyond the knowledge graph, unHIDE communicates and works together with Helmholtz infrastructures to improve metadata, or to make it available in the first place, through consulting, hands-on help, and by fostering networking between the infrastructures and the respective experts.
+
+
+![unhide_overview](../images/unhide_overview.png)

docs/tech/datapipe.md (+16 −4)

@@ -1,8 +1,8 @@
 # Data pipeline
 
-In UnHIDE data is harvested from connected providers and partners.
-Then data is 'uplifted', i.e semantically enriched and or completed,
-where possible from aggregated data or schema.org semantics.
+In UnHIDE, metadata about research outputs is harvested from connected providers and partners.
+The original metadata is then 'uplifted', i.e. semantically enriched and/or completed,
+where possible, from aggregated data or schema.org semantics.
 
 ## Overview
 
@@ -36,4 +36,16 @@ The second direction is there to provide full text search on the data to end use
 For this an index of each uplifted data record is constructed and uploaded into a single SOLR index,
 which is exposed to a certain extend via a custom fastAPI. A web front end using the javascript library
 React provides a user interface for the full text search and supports special use cases as a service
-to certain stakeholder groups.
+to certain stakeholder groups.
+
+The technical implementation is currently a minimal running version: each component and
+functionality is exposed through the command line interface `hmc-unhide`, and cron jobs
+run them from time to time. On the deployment instance this can be run monthly or weekly.
+In the longer term, the pipeline orchestration itself should become more sophisticated.
+For this, one could deploy a workflow manager with provenance tracking, like AiiDA, or one
+with less overhead, depending on the needs; likewise if one wants to move to a more
+event-based system with more fault tolerance for errors in individual records or data sources.
+Currently, in the minimal implementation, there is the risk that an uncaught failure in a
+subtask fails a larger part of the pipeline, which is then only logged and has to be
+resolved manually.
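The cron-based orchestration added to datapipe.md above can be sketched as a crontab fragment. This is a hypothetical illustration only: the `hmc-unhide` subcommand names (`harvest`, `uplift`, `index`) and log paths are assumed placeholders, not the documented CLI.

```shell
# Hypothetical crontab sketch of the cron-based orchestration described above.
# Subcommand names and paths are assumptions, not the actual hmc-unhide CLI.

# Harvest all sources monthly, at 03:00 on the 1st:
0 3 1 * * hmc-unhide harvest --all >> /var/log/unhide/harvest.log 2>&1

# Uplift and re-index weekly, Sundays at 04:00 and 04:30:
0 4 * * 0 hmc-unhide uplift >> /var/log/unhide/uplift.log 2>&1
30 4 * * 0 hmc-unhide index >> /var/log/unhide/index.log 2>&1
```

Redirecting output to log files matches the current behavior described above, where failures are only logged and must be resolved manually.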
docs/tech/harvesting.md (+29 −2)

@@ -1,3 +1,30 @@
-# Data harvesting
+# Data harvesting: extracting metadata from the web
 
-How does UnHIDE harvested data?
+How does UnHIDE harvest data?
+
+Data harvesting and mining for the knowledge graph is done by `Harvester` classes.
+For each interface, a specific harvester class should be implemented.
+All harvester classes should inherit from existing harvesters or from the [`BaseHarvester`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/base_harvester.py?ref_type=heads), which currently specifies that each harvester:
+
+1. needs a `run` method,
+2. can read from the [`config.yml`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/configs/config.yaml?ref_type=heads), and
+3. reads from a `<harvesterclass>.last_run` file the time the harvester was last run.
+
+Implemented harvester classes include:
+
+| Name (CLI) | Class Name | Interface | Comment |
+|------------|------------|-----------|---------|
+| sitemap | SitemapHarvester | sitemaps | Selecting record links from the sitemap requires expression matching. Relies on the advertools lib. |
+| oai | OAIHarvester | OAI-PMH | Relies on the oai lib. For the library providers, Dublin Core is converted to schema.org. |
+| git | GitHarvester | Git, Gitlab/Github API | Relies on codemetapy and codemeta-harvester as well as the Gitlab/Github APIs. |
+| datacite | DataciteHarvester | REST API & GraphQL endpoint | schema.org metadata is extracted through content negotiation. |
+| feed | FeedHarvester | RSS & Atom feeds | Relies on the atoma library; works only if schema.org metadata can be extracted from the landing pages. Can only get recent data; useful for event metadata. |
+| indico | IndicoHarvester | Indico REST API | Directly extracts schema.org metadata through the API; requires an access token. |
+
+JSON-LD metadata from the landing pages of records is extracted via the `extruct` library if it cannot be retrieved directly through a standardized interface.
+
+All harvesters are exposed on the `hmc-unhide` command line interface.
+By default, they store the extracted metadata in the internal data model [`LinkedDataObject`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/data_model.py?ref_type=heads), whose serialization carries some provenance information, the original source data, and the uplifted data, and which provides methods for validation.
+
+A single central YAML configuration file, [`config.yml`](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/data_harvesting/configs/config.yaml?ref_type=heads), specifies for each harvester class the sources to harvest as well as harvester- or source-specific configuration.
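The harvester contract added to harvesting.md above (a `run` method, configuration from `config.yml`, and a `<harvesterclass>.last_run` state file) can be sketched in Python. This is a self-contained toy, not the real `data_harvesting.base_harvester.BaseHarvester`: the constructor signature, config layout, and the `SitemapHarvester` body are assumptions for illustration.

```python
# Hypothetical sketch of the BaseHarvester contract described above.
# The real class lives in data_harvesting/base_harvester.py; names and
# signatures here are assumptions, not the actual API.
from datetime import datetime, timezone
from pathlib import Path
import tempfile


class BaseHarvester:
    """Minimal stand-in mirroring the three documented requirements."""

    def __init__(self, config, state_dir="."):
        state_dir = Path(state_dir)
        # 2. Sources come from the central config.yml; a parsed dict is
        #    passed in here instead of reading the real YAML file.
        self.sources = config.get(type(self).__name__, {}).get("sources", [])
        # 3. The time of the last run is kept in <harvesterclass>.last_run.
        self.last_run_file = state_dir / f"{type(self).__name__.lower()}.last_run"

    @property
    def last_run(self):
        """Return the last-run time, or None if never run."""
        if self.last_run_file.exists():
            return datetime.fromisoformat(self.last_run_file.read_text().strip())
        return None

    def run(self):
        # 1. Every harvester needs a `run` method.
        raise NotImplementedError

    def mark_run(self):
        self.last_run_file.write_text(datetime.now(timezone.utc).isoformat())


class SitemapHarvester(BaseHarvester):
    """Toy harvester: a real one would fetch sitemaps and extract record links."""

    def run(self):
        harvested = [f"would harvest {url}" for url in self.sources]
        self.mark_run()
        return harvested


config = {"SitemapHarvester": {"sources": ["https://example.org/sitemap.xml"]}}
with tempfile.TemporaryDirectory() as tmp:
    harvester = SitemapHarvester(config, state_dir=tmp)
    records = harvester.run()
    last = harvester.last_run
```

In this sketch the last-run timestamp would let a subsequent invocation harvest only records changed since the previous run, which is what the `<harvesterclass>.last_run` file described above appears to be for.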
