Hungarian National Corpus
The new, expanded version of the Hungarian National Corpus,
with new features, has been launched.
Please use the new interface from now on.
Old HNC registrations are still valid.
Please refer to our LREC 2014 publication presenting this new version.
This is the site of the old version of the Hungarian National Corpus.
Work on the Hungarian National Corpus (HNC) started
in 1998 at the Department of Corpus Linguistics of the Research Institute for
Linguistics of the Hungarian Academy of Sciences (HAS) under the
supervision of Tamás Váradi. The objective was to create a
100-million-word balanced reference corpus of present-day Hungarian.
In 2002 a new effort began to
extend data collection to
Hungarian language use across the whole Carpathian Basin,
within the Hungarian Language Corpus of the Carpathian Basin project.
The aim was to create a 15-million-word corpus of
Hungarian as used beyond the borders of Hungary.
The truly national Hungarian National Corpus,
which also contains language variants from
Slovakia, Subcarpathia, Transylvania and Vojvodina,
was introduced in November 2005.
This first Hungarian corpus to also cover language variants
from beyond the borders of Hungary was completed as the result of
joint work
by the Hungarian Language Offices and the Department of Corpus Linguistics.
What is a corpus?
A corpus is a collection of written or spoken linguistic data. The texts are selected and classified according to certain criteria. A corpus does not necessarily contain whole texts, and it is more than a repository of texts: it also records their bibliographical data and marks their structural units (paragraphs, sentences).
The HNC aims to be a representative general-purpose corpus of present-day standard Hungarian.
Automatic analysis
A key characteristic of the HNC
is its detailed morphosyntactic annotation.
Every word form is annotated with its
stem, part of speech and inflectional information.
This analysis is produced by automatic methods with an
overall precision of about 97.5%, i.e. about 2.5% of all word forms have an erroneous
analysis. Higher precision could only have been achieved by manual annotation,
which was not feasible for such a large amount of data.
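To put the 2.5% figure in absolute terms, here is a minimal sketch of the arithmetic. It assumes the quoted precision applies uniformly to the whole corpus, which the text does not state, and it uses the 187.6-million-word size given elsewhere on this page. The function name `expected_errors` is purely illustrative.

```python
def expected_errors(word_count: int, precision: float) -> int:
    """Estimate how many word forms carry an erroneous analysis.

    Assumes the precision holds uniformly over all word forms.
    """
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be between 0 and 1")
    return round(word_count * (1.0 - precision))


# 187.6 million words analysed at about 97.5% precision
print(expected_errors(187_600_000, 0.975))
```

Under these assumptions, roughly 4.7 million word forms would have an erroneous analysis, which is worth keeping in mind when interpreting low-frequency search results.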
How is it built up?
The HNC currently contains 187.6 million words.
It is divided into five subcorpora by regional language variant,
and also into five subcorpora by text genre.
Any combination of these can be chosen as the subcorpus to study.
This makes the HNC an appropriate tool for studying
differences not only between text genres but also between language variants.
The HNC consists of the following subcorpora
(sizes given in millions of words, rounded to the nearest 100,000 words):
| Hungary | Slovakia | Subcarpathia | Transylvania | Vojvodina | total |
press | 71.0 | 5.7 | 0.7 | 5.5 | 1.5 | 84.5 |
literature | 35.5 | 1.4 | 0.4 | 0.8 | 0.2 | 38.2 |
science | 20.5 | 2.3 | 0.7 | 1.6 | 0.3 | 25.5 |
official | 19.9 | 0.2 | 0.3 | 0.6 | 0.1 | 20.9 |
personal | 17.8 | - | 0.4 | 0.4 | 0.1 | 18.6 |
total | 164.7 | 9.5 | 2.5 | 8.9 | 2.0 | 187.6 |
press: Texts from the news media make up almost half of the corpus, presenting a broad range of dialects, both vertically and horizontally.
literature: Material of the Digital Literary Academy (Digitális Irodalmi Akadémia) was fully incorporated in the autumn of 2005. This material constitutes the literature subcorpus for Hungary.
science: The source of science texts for Hungary is the Hungarian Electronic Library (Magyar Elektronikus Könyvtár).
official: Regulations, laws, by-laws and parliamentary debates.
personal: This subcorpus contains discussions from internet forums (those of index.hu, the biggest and oldest Hungarian internet portal, and several forums from Subcarpathia). This language variant is particularly interesting because it stands closest to spontaneous linguistic communication. In certain cases it is very similar to spoken communication.
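As an illustration of choosing a subcorpus by any combination of genre and region, the sketch below adds up the corresponding cells of the size table. It is not the HNC query interface: the names `REGIONS`, `SIZES` and `subcorpus_size` are hypothetical, and the only data used are the rounded figures printed in the table.

```python
REGIONS = ("Hungary", "Slovakia", "Subcarpathia", "Transylvania", "Vojvodina")

# Sizes in millions of words, copied from the table above.
# None marks the one cell for which no figure is given.
SIZES = {
    "press":      (71.0, 5.7, 0.7, 5.5, 1.5),
    "literature": (35.5, 1.4, 0.4, 0.8, 0.2),
    "science":    (20.5, 2.3, 0.7, 1.6, 0.3),
    "official":   (19.9, 0.2, 0.3, 0.6, 0.1),
    "personal":   (17.8, None, 0.4, 0.4, 0.1),
}


def subcorpus_size(genres, regions):
    """Sum the table cells for every selected genre/region pair."""
    total = 0.0
    for genre in genres:
        row = SIZES[genre]
        for region in regions:
            cell = row[REGIONS.index(region)]
            if cell is not None:
                total += cell
    return round(total, 1)


# Science and official texts from Transylvania: 1.6 + 0.6
print(subcorpus_size(["science", "official"], ["Transylvania"]))
```

Because every cell is rounded independently, sums computed this way can differ from the printed totals by a few tenths. The press cells, for example, add up to 84.4 against a printed total of 84.5.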
Who can use this corpus?
Anyone who fills out the
registration form and
agrees to the conditions laid down there
can use the Hungarian National Corpus.
Frequency data
rank | stem | POS | count | count / 1000 words | | rank | stem | POS | count | count / 1000 words | | rank | stem | POS | count | count / 1000 words | |
1. | a | Det | 11128421 | 72.40 | | 34. | ki | Pre | 305480 | 1.99 | | 67. | között | NU | 159583 | 1.04 | |
2. | az | Det | 3716414 | 24.18 | | 35. | ami | Pro | 287999 | 1.87 | | 68. | első | Num | 158569 | 1.03 | |
3. | és | Con | 2544751 | 16.56 | | 36. | nagy | A | 281134 | 1.83 | | 69. | nap | N | 157310 | 1.02 | |
4. | hogy | Con | 2166004 | 14.09 | | 37. | mond | V | 276868 | 1.80 | | 70. | ad | V | 154537 | 1.01 | |
5. | A | Det | 2103970 | 13.69 | | 38. | mi | Pro | 275076 | 1.79 | | 71. | 99 | DIG | 154526 | 1.01 | |
6. | az | Pro | 1803814 | 11.74 | | 39. | maga | Pro | 263983 | 1.72 | | 72. | azonban | Con | 154150 | 1.00 | |
7. | nem | Adv | 1693748 | 11.02 | | 40. | mert | Con | 258962 | 1.68 | | 73. | sok | Num | 152907 | 0.99 | |
8. | is | Con | 1677108 | 10.91 | | 41. | én | Pro | 245386 | 1.60 | | 74. | ők | Pro | 151718 | 0.99 | |
9. | van | V | 1418113 | 9.23 | | 42. | -e | Clit | 237612 | 1.55 | | 75. | más | Pro | 151698 | 0.99 | |
10. | ez | Pro | 1204269 | 7.84 | | 43. | olyan | Pro | 232947 | 1.52 | | 76. | kérdés | N | 151477 | 0.99 | |
11. | egy | Num | 899832 | 5.85 | | 44. | jó | A | 232826 | 1.51 | | 77. | hanem | Con | 150702 | 0.98 | |
12. | Az | Det | 730287 | 4.75 | | 45. | több | Num | 232803 | 1.51 | | 78. | Ha | Con | 147117 | 0.96 | |
13. | meg | Pre | 592986 | 3.86 | | 46. | magyar | A | 229934 | 1.50 | | 79. | eset | N | 146803 | 0.96 | |
14. | kell | V | 499659 | 3.25 | | 47. | minden | Pro | 225130 | 1.46 | | 80. | elnök | N | 146500 | 0.95 | |
15. | csak | Adv | 477956 | 3.11 | | 48. | úgy | Adv | 221524 | 1.44 | | 81. | forint | N | 144629 | 0.94 | |
16. | lesz | V | 469189 | 3.05 | | 49. | pedig | Con | 216513 | 1.41 | | 82. | egyik | Pro | 143627 | 0.93 | |
17. | de | Con | 462508 | 3.01 | | 50. | új | A | 215765 | 1.40 | | 83. | kormány | N | 139493 | 0.91 | |
18. | már | Adv | 452814 | 2.95 | | 51. | tesz | V | 211798 | 1.38 | | 84. | akar | V | 138696 | 0.90 | |
19. | Ez | Pro | 447310 | 2.91 | | 52. | két | Num | 211077 | 1.37 | | 85. | ország | N | 137225 | 0.89 | |
20. | amely | Pro | 417945 | 2.72 | | 53. | 00 | DIG | 205993 | 1.34 | | 86. | kerül | V | 135554 | 0.88 | |
21. | ha | Con | 402593 | 2.62 | | 54. | ember | N | 198039 | 1.29 | | 87. | De | Con | 135062 | 0.88 | |
22. | még | Adv | 396207 | 2.58 | | 55. | Az | Pro | 194263 | 1.26 | | 88. | százalék | N | 132780 | 0.86 | |
23. | vagy | Con | 381098 | 2.48 | | 56. | után | NU | 190805 | 1.24 | | 89. | lát | V | 131866 | 0.86 | |
24. | mint | Con | 370507 | 2.41 | | 57. | Nem | Adv | 185338 | 1.21 | | 90. | törvény | N | 129485 | 0.84 | |
25. | szerint | NU | 369481 | 2.40 | | 58. | idő | N | 178374 | 1.16 | | 91. | 98 | DIG | 128540 | 0.84 | |
26. | el | Pre | 362004 | 2.36 | | 59. | majd | Adv | 177497 | 1.15 | | 92. | sor | N | 128311 | 0.83 | |
27. | tud | V | 356833 | 2.32 | | 60. | be | Pre | 175615 | 1.14 | | 93. | kap | V | 127841 | 0.83 | |
28. | s | Con | 356453 | 2.32 | | 61. | tart | V | 173048 | 1.13 | | 94. | fog | V | 127768 | 0.83 | |
29. | aki | Pro | 350819 | 2.28 | | 62. | rész | N | 170894 | 1.11 | | 95. | alap | N | 127632 | 0.83 | |
30. | év | N | 338213 | 2.20 | | 63. | most | Adv | 168334 | 1.10 | | 96. | 2 | DIG | 127461 | 0.83 | |
31. | sem | Adv | 329570 | 2.14 | | 64. | fel | Pre | 164467 | 1.07 | | 97. | itt | Adv | 127399 | 0.83 | |
32. | lehet | V | 310500 | 2.02 | | 65. | szó | N | 162929 | 1.06 | | 98. | hely | N | 124262 | 0.81 | |
33. | ő | Pro | 306621 | 1.99 | | 66. | 1 | DIG | 162486 | 1.06 | | 99. | vesz | V | 123583 | 0.80 | |
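The "count / 1000 words" column is a relative frequency: the raw count divided by the total number of words, multiplied by 1000. The sketch below shows that calculation. The total used here is an assumption: the page does not say which corpus size the column was computed against, and the printed values are consistent with a total of roughly 153.7 million words (inferred from the first rows), not the 187.6 million quoted for the full corpus. The function name `per_thousand` is illustrative.

```python
def per_thousand(count: int, total_words: int) -> float:
    """Relative frequency of an item per 1000 words, to two decimals."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return round(count / total_words * 1000, 2)


# Assumed total, inferred from the table rather than stated in the source.
ASSUMED_TOTAL = 153_700_000

print(per_thousand(11128421, ASSUMED_TOTAL))  # row 1: stem "a", Det
print(per_thousand(3716414, ASSUMED_TOTAL))   # row 2: stem "az", Det
```

Normalising per 1000 words in this way makes counts comparable across subcorpora of different sizes, provided each count is divided by the size of the subcorpus it was taken from.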
Partners
Morphological analysis is performed by Humor from MorphoLogic Ltd.;
disambiguation is based on Thorsten Brants'
TnT tagger;
the corpus processing tool is the
IMS Corpus Workbench.
Supporters
Corpus creation: tender T 026091 of OTKA.
Browsable version: tender SZT-IS-7 of IHM.
Hungarian Language Corpus of the Carpathian Basin project: tender NKFP/044/2002.
If you
have any comments, please let us
know.
Research Institute for Linguistics,
HAS 1998-2006.