Dataset

DeepData-META

Text  Ver 1  Administrator  2022.04.19

Overview
  • Data for extracting metadata from domestic papers in PDF format.
  • The data consists of the layout boxes extracted from each PDF paper, each labeled with the metadata field type it corresponds to.
  • Each layout box records a unique code, the text, the box coordinates (x0, y0, x1, y1), the box width, the box height, and the font size.
  • The file “train.txt” was constructed through a fully automatic inspection process. It contains 5,241,746 labeled layout boxes from 295,306 papers in 503 journals and was used as the training set.
  • The file “valid.txt” was built through manual inspection by several annotators. It contains 155,629 labeled layout boxes from 9,895 papers in 503 journals and serves as the validation set.
  • The file “test.txt” was built through manual inspection. It contains 159,925 labeled layout boxes from 10,119 papers in 503 journals and was used as the test set.

 

Data statistics

Data        File name    #Journal    #Paper     #Layout Box
Train set   train.txt    503         295,306    5,241,746
Valid set   valid.txt    503         9,895      155,629
Test set    test.txt     503         10,119     159,925
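The published counts can be checked against a downloaded copy of each file. The following is a minimal sketch, assuming the file layout described in the format section (one layout box per non-empty line, papers separated by a blank line). The names `EXPECTED`, `count_papers_and_boxes`, and `matches_published` are hypothetical helpers introduced for illustration; they are not part of the dataset.

```python
from typing import Iterable, Tuple

# Published (#Paper, #Layout Box) per file, copied from the statistics table.
EXPECTED = {
    "train.txt": (295_306, 5_241_746),
    "valid.txt": (9_895, 155_629),
    "test.txt": (10_119, 159_925),
}


def count_papers_and_boxes(lines: Iterable[str]) -> Tuple[int, int]:
    """Count papers and layout boxes in an iterable of lines.

    Assumption: every non-empty line is one layout box, and a blank
    line ends the current paper.
    """
    papers = 0
    boxes = 0
    in_paper = False
    for line in lines:
        if line.strip():
            boxes += 1
            if not in_paper:
                # First box after a blank line starts a new paper.
                papers += 1
                in_paper = True
        else:
            in_paper = False
    return papers, boxes


def matches_published(file_name: str, lines: Iterable[str]) -> bool:
    """Return True if the counts agree with the published table."""
    return count_papers_and_boxes(lines) == EXPECTED[file_name]
```

A file can be streamed without loading it into memory, for example `with open("valid.txt", encoding="utf-8") as f: matches_published("valid.txt", f)`. The UTF-8 encoding is an assumption; the page does not state one.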

DOI  10.23057/48
Format  TXT
  • In each file, layout boxes are separated by a single newline, and papers are separated by two newlines (one blank line).
  • Each layout box has the following structure, where (\t) denotes a tab and (\s) a space:

      "Unique code"(\t)"Metadata label"(\t)"Text"(\t)"x0 value"(\s)"y0 value"(\s)"x1 value"(\s)"y1 value"(\s)"width value"(\s)"height value"(\s)"font size"
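The record layout above can be read with a short parser. This is a minimal sketch, not an official loader: `LayoutBox`, `parse_box`, and `iter_papers` are hypothetical names, and the code assumes that the quotation marks in the structure description are notation only (fields are not literally quoted), that the seven geometry values are numeric, and that the last tab on a line precedes the geometry values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class LayoutBox:
    code: str
    label: str
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    width: float
    height: float
    font_size: float


def parse_box(line: str) -> LayoutBox:
    """Parse one layout-box line: code, label, text, then 7 numbers."""
    # Strip only the line terminator so an empty text field survives.
    code, label, rest = line.rstrip("\r\n").split("\t", 2)
    # Split from the right so a stray tab inside the text is tolerated.
    text, geometry = rest.rsplit("\t", 1)
    values = [float(v) for v in geometry.split()]
    if len(values) != 7:
        raise ValueError(f"expected 7 geometry values, got {len(values)}")
    return LayoutBox(code, label, text, *values)


def iter_papers(lines: Iterable[str]) -> Iterator[List[LayoutBox]]:
    """Yield one list of layout boxes per paper (blank line = boundary)."""
    paper: List[LayoutBox] = []
    for line in lines:
        if line.strip():
            paper.append(parse_box(line))
        elif paper:
            yield paper
            paper = []
    if paper:
        # The last paper may not be followed by a blank line.
        yield paper
```

Because `iter_papers` accepts any iterable of lines, an open file handle can be passed directly, which avoids holding the 5,241,746 training boxes in memory at once.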

 

Metadata Labels

No   Metadata Field                     Label
1    Title (in Korean)                  title_ko
2    Title (in English)                 title_en
3    Author Name (in Korean)            author_name_ko
4    Author Name (in English)           author_name_en
5    Author Affiliation (in Korean)     ko_org
6    Author Affiliation (in English)    en_org
7    Abstract (in Korean)               abstract_ko
8    Abstract (in English)              abstract_en
9    Keywords (in Korean)               kwds_ko
10   Keywords (in English)              kwds_en
11   DOI                                doi
12   Journal name                       journal
13   Out of Boundary                    O
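The 13 labels can be used to validate a file and to inspect its label distribution. The sketch below assumes that the label is the second tab-separated field of each non-empty line, as described in the format section; `LABELS` and `count_labels` are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Iterable

# The 13 labels from the table above, in table order.
LABELS = (
    "title_ko", "title_en",
    "author_name_ko", "author_name_en",
    "ko_org", "en_org",
    "abstract_ko", "abstract_en",
    "kwds_ko", "kwds_en",
    "doi", "journal",
    "O",
)


def count_labels(lines: Iterable[str]) -> Counter:
    """Count layout boxes per label, rejecting labels not in LABELS."""
    counts: Counter = Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines only separate papers
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {number}: missing label field")
        label = fields[1]
        if label not in LABELS:
            raise ValueError(f"line {number}: unknown label {label!r}")
        counts[label] += 1
    return counts
```

Raising on an unknown label, rather than silently counting it, makes a mismatch between a file and the label table visible immediately.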

Data information

Producer  Korea Institute of Science and Technology Information (KISTI)
Provider  Korea Institute of Science and Technology Information (KISTI)
License  Attribution-NonCommercial (data use agreement)
NTIS project ID  1711149483
Copyright  Korea Institute of Science and Technology Information (KISTI)
Cite as
Korea Institute of Science and Technology Information (KISTI) (2022): DeepData-META. Version 1.0. Korea Institute of Science and Technology Information (KISTI). https://doi.org/10.23057/48.

Data history

Version 1 2022-03-04, 10.23057/48