Repository for Semantic Spaces

Since I am leaving the University of Tübingen, I will no longer be able to maintain this website.
In the future, I will be providing this Semantic Space Repository at my new website:
https://sites.google.com/site/fritzgntr/
Please contact me if you

  • want to work with semantic spaces in a language not provided here
  • have your own text corpus which you want to create a semantic space from
  • have any additional questions

fritzgntr [at] gmail.com

All the semantic spaces provided here are available in the .rda format for R and are designed to be used with the LSAfun package for R.
Load them into the R workspace with:

load("NAMEOFSPACE.rda")



For further information, see:

Günther, F., Dudschig, C., & Kaup, B. (2015). LSAfun - An R package for computations based on Latent Semantic Analysis. Behavior Research Methods, 47, 930-944.

LSA-type and HAL-type semantic spaces in more languages (Spanish, Italian) will be provided in the future.

  • dewak100k.rda as a large German HAL-type space is already there!
    (May 22nd, 2015)
  • EN_100k.rda as a large English HAL-type space is already there!
    (May 22nd, 2015)
  • dewak100k_lsa.rda as a large German LSA-type space now available!
    (June 5th, 2015)
  • EN_100k_lsa.rda (finally) arrived! English LSA-type space, derived from a huge corpus (July 15th, 2015)
  • frwak100k, a French HAL-type space, is now online (September 22nd, 2015)


If you are planning to use one of these semantic spaces, please read its description beforehand. It is quite important to know what you are working with if you want to rely on the results you get.
If you have any questions (for example if you want to have a look at the actual corpora, want to know about the exact parameter settings for the LSA algorithm, or really just anything else), feel free to contact me anytime (see the box above for contact info).


Click on the name of a semantic space to start the download

TASA

English LSA space, 300 dimensions

This LSA space was built from the TASA (Touchstone Applied Science Associates, Inc.) corpus, which contains texts on a broad variety of topics: novels, newspaper articles, and other materials, and which was used to develop The Educator's Word Frequency Guide.
I am very thankful to the TASA folks for providing this corpus to the people at Boulder, Colorado, and to Morgen Bernstein, Donna Caccamise, Peter Foltz and the people from the NLP and LSA Research Labs in Boulder for passing it on to me.
---------------------------------------------------------------------------------------------------
IMPORTANT: Calculations on this LSA space will NOT give the same results as the LSA homepage, due to different parameter settings in the creation of the LSA space.
See
Günther, F., Dudschig, C., & Kaup, B. (2015). LSAfun - An R package for computations based on Latent Semantic Analysis. Behavior Research Methods, 47, 930-944.
for more information on this topic.
---------------------------------------------------------------------------------------------------

This LSA space was built from 37,651 different documents and contains 92,393 different terms.

EN_100k

English HAL-type space, 300 dimensions

---------------------------------------------------------------------------------------------------
Recommended for computations in English, since the corpus is several hundred times bigger than the other English corpora used here (this also applies to the TASA space, which is based on a comparatively small corpus).
---------------------------------------------------------------------------------------------------
Created from a ~2 billion word corpus, built by concatenating the British National Corpus (BNC), the ukWaC corpus and a 2009 Wikipedia dump (see here and here).
This space was built using a HAL-like moving window model, with a window size of 5 (2 to the left, 2 to the right), with the 100k most frequent words in the corpus as row words as well as content (column) words for the cooccurrence matrix. A Positive Pointwise Mutual Information weighting scheme was applied, as well as a Singular Value Decomposition to reduce the space from 100k to 300 dimensions.

This space therefore contains vectors for 100,000 different words.
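
As an illustration of this kind of pipeline (window-based co-occurrence counting, PPMI weighting, SVD reduction), here is a minimal R sketch on a toy corpus. It is not the script actually used to build EN_100k; the toy corpus and all object names are made up for illustration:

# toy corpus; the real space uses ~2 billion tokens and the 100k most frequent words
tokens <- c("the", "dog", "chased", "the", "cat", "and", "the", "cat", "ran")
vocab  <- unique(tokens)

# co-occurrence counts within a symmetric window (2 words to each side)
window <- 2
cooc <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (i in seq_along(tokens)) {
  context <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
  for (j in context) cooc[tokens[i], tokens[j]] <- cooc[tokens[i], tokens[j]] + 1
}

# Positive Pointwise Mutual Information weighting
p_xy <- cooc / sum(cooc)
pmi  <- log2(p_xy / outer(rowSums(p_xy), colSums(p_xy)))
pmi[!is.finite(pmi)] <- 0   # zero counts give -Inf; treat them as 0
ppmi <- pmax(pmi, 0)        # keep only positive PMI values

# SVD reduction (300 dimensions in the real space; 2 on this toy scale)
k <- 2
s <- svd(ppmi)
space <- s$u[, 1:k] %*% diag(s$d[1:k])   # one row vector per word
rownames(space) <- vocab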

EN_100k_lsa

English LSA-type space, 300 dimensions

---------------------------------------------------------------------------------------------------
Recommended for computations in English, since the corpus is several hundred times bigger than the other English corpora used here (this also applies to the TASA space, which is based on a comparatively small corpus).
---------------------------------------------------------------------------------------------------
Created from a ~2 billion word corpus, built by concatenating the British National Corpus (BNC), the ukWaC corpus and a 2009 Wikipedia dump (see here and here). This corpus is divided into 5,386,653 individual documents.
This space was created from a term-document matrix with the 100k most frequent words in the corpus as rows and the ~5.4 million documents the corpus consists of as columns (as in LSA). Unlike in standard LSA, a Positive Pointwise Mutual Information weighting scheme was applied instead of the standard log-entropy weighting (this should, however, not have a large influence on the results). As in standard LSA, an SVD was applied to reduce the space from ~5.4 million dimensions to 300 dimensions.

This space therefore contains vectors for 100,000 different words.
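
The LSA-type construction can be sketched in the same spirit: a term-document count matrix followed by a truncated SVD, with a weighting scheme (PPMI here, log-entropy in classical LSA) applied to the matrix before the SVD, exactly as in the sketch above. Again, this is a toy illustration, not the script actually used to build EN_100k_lsa:

# toy corpus of three "documents"; the real space uses ~5.4 million documents
docs <- list(c("dog", "chased", "cat"),
             c("cat", "sat", "on", "mat"),
             c("dog", "barked"))
vocab <- sort(unique(unlist(docs)))

# term-document count matrix: one row per term, one column per document
tdm <- sapply(docs, function(d) table(factor(d, levels = vocab)))
colnames(tdm) <- paste0("doc", seq_along(docs))

# truncated SVD (300 dimensions in the real space; 2 on this toy scale);
# a weighting scheme would be applied to tdm before this step
k <- 2
s <- svd(tdm)
space <- s$u[, 1:k] %*% diag(s$d[1:k])   # one row vector per term
rownames(space) <- vocab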

frwak100k

French HAL-type space, 300 dimensions

Created from the ~1.6 billion word frWaC corpus (see here).
This space was built using a HAL-like moving window model, with a window size of 5 (2 to the left, 2 to the right), with the 100k most frequent words in the corpus as row words as well as content (column) words for the cooccurrence matrix. Accents were not removed (so, for example, déjà is stored as déjà, not deja). Words from this corpus were not lemmatized. A Positive Pointwise Mutual Information weighting scheme was applied, as well as a Singular Value Decomposition to reduce the space from 100k to 300 dimensions.
This space contains vectors for 100,000 different words.

dewak100k

German HAL-type space, 300 dimensions

---------------------------------------------------------------------------------------------------
Recommended for computations in German, since the corpus is about 200 times bigger than the other German corpora used here.
---------------------------------------------------------------------------------------------------
Created from the ~800 million word sdeWaC corpus (see here).
This space was built using a HAL-like moving window model, with a window size of 5 (2 to the left, 2 to the right), with the 100k most frequent words in the corpus as row words as well as content (column) words for the cooccurrence matrix. A Positive Pointwise Mutual Information weighting scheme was applied, as well as a Singular Value Decomposition to reduce the space from 100k to 300 dimensions.
This space contains vectors for 100,000 different words.

dewak100k_lsa

German LSA-type space, 300 dimensions

---------------------------------------------------------------------------------------------------
Recommended for computations in German, since the corpus is about 200 times bigger than the other German corpora used here.
---------------------------------------------------------------------------------------------------
Created from the ~1.5 million documents of the ~800 million word sdeWaC corpus mentioned above (see here).
This space was created from a term-document matrix with the 100k most frequent words in the corpus as rows and the ~1.5 million documents the corpus consists of as columns (as in LSA). Unlike in standard LSA, a Positive Pointwise Mutual Information weighting scheme was applied instead of the standard log-entropy weighting (this should, however, not have a large influence on the results). As in standard LSA, an SVD was applied to reduce the space from ~1.5 million dimensions to 300 dimensions.

This space contains vectors for 100,000 different words.

blogs

German LSA space, 300 dimensions

Created from 50,318 blog entries, extracted from the German Corpus December 2011, downloaded from HC Corpora. These entries were randomly selected from all the entries available in this corpus. Very long or very short entries were excluded. Every selected entry was used as a document for the LSA algorithm. Terms that appeared less than two times were excluded from further computations. Umlauts and the sharp s were replaced by their non-umlaut equivalents (e.g., ä by ae).

The LSA space created from this corpus contains 97,838 different terms.
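
Since the space stores these replaced forms, the same replacements should be applied to input words before looking them up. A minimal sketch, assuming the usual ae/oe/ue substitutions and ss for the sharp s (the exact replacement table used for this space is an assumption here), and assuming both query words are contained in the space:

library(LSAfun)
load("blogs.rda")

# replace umlauts and sharp s in query words (assumed mapping: ä->ae, ö->oe,
# ü->ue, Ä->Ae, Ö->Oe, Ü->Ue, ß->ss)
replace_umlauts <- function(x) {
  x <- gsub("\u00e4", "ae", x); x <- gsub("\u00f6", "oe", x); x <- gsub("\u00fc", "ue", x)
  x <- gsub("\u00c4", "Ae", x); x <- gsub("\u00d6", "Oe", x); x <- gsub("\u00dc", "Ue", x)
  gsub("\u00df", "ss", x)
}

Cosine(replace_umlauts("B\u00e4r"), replace_umlauts("L\u00f6we"), tvectors = blogs)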

blogs_cro

Croatian LSA space, 300 dimensions

Created from 49,317 blog entries, extracted from the Croatian Corpus August 2012, downloaded from HC Corpora. These entries were randomly selected from all the entries available in this corpus. Very long or very short entries were excluded. Every selected entry was used as a document for the LSA algorithm. Terms that appeared less than two times were excluded from further computations.

The LSA space created from this corpus contains 132,932 different terms.

blogs_en

English LSA space, 300 dimensions

Created from 47,757 blog entries, extracted from English Corpus December 2012, downloaded from HC Corpora. These entries were randomly selected from all the entries available in this corpus. Very long or very short entries were excluded. Every selected entry was used as a document for the LSA algorithm. Terms that appeared less than two times were excluded from further computations.

The LSA space created from this corpus contains 28,433 different terms.

blogs_nl

Dutch LSA space, 300 dimensions

Created from 50,039 blog entries, extracted from the Dutch Corpus, downloaded from HC Corpora. These entries were randomly selected from all the entries available in this corpus. Very long or very short entries were excluded. Every selected entry was used as a document for the LSA algorithm.

The LSA space created from this corpus contains 166,745 different terms.

blogs_ser

Serbian LSA space, 300 dimensions

Created from 49,283 blog entries, extracted from Serbian Corpus June 2013, downloaded from HC Corpora. These entries were randomly selected from all the entries available in this corpus. Very long or very short entries were excluded. Every entry was used as a document for the LSA algorithm. Terms that appeared less than two times were excluded from further computations.
Special characters were replaced as follows:

  • Č, Ć >> C
  • č, ć >> c
  • Š >> S
  • š >> s
  • Ž >> Z
  • ž >> z
  • Đ >> Dj
  • đ >> dj
  • Dž >> Dz
  • dž >> dz


The LSA space created from this corpus contains 134,229 different terms.
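
As with the German blogs space, the same replacements should be applied to query words before looking them up in this space. A minimal, illustrative sketch (the replacement function and the example word are made up for illustration; the word is assumed to be contained in the space):

library(LSAfun)
load("blogs_ser.rda")

# apply the replacements listed above to a query word
replace_serbian <- function(x) {
  x <- gsub("[\u010c\u0106]", "C", x)  # Č, Ć -> C
  x <- gsub("[\u010d\u0107]", "c", x)  # č, ć -> c
  x <- gsub("\u0160", "S", x)          # Š -> S
  x <- gsub("\u0161", "s", x)          # š -> s
  x <- gsub("\u017d", "Z", x)          # Ž -> Z
  x <- gsub("\u017e", "z", x)          # ž -> z (this also turns Dž into Dz and dž into dz)
  x <- gsub("\u0110", "Dj", x)         # Đ -> Dj
  gsub("\u0111", "dj", x)              # đ -> dj
}

neighbors(replace_serbian("\u010dovek"), n = 10, tvectors = blogs_ser)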

blogs_stem

German LSA space, 300 dimensions

A version of the blogs.rda space where a word stemming procedure was applied before creating the semantic space. The wordStem() command of the R-package SnowballC was used for stemming, which uses the Porter stemming algorithm. The stemming algorithm for the German language was applied.

This LSA space contains 72,085 different terms.
To use this LSA space with the LSAfun package, it is recommended to use the same stemming algorithm for the input.

Example:

Cosine(wordStem("Elefanten",language="german"), wordStem("Trompeten",language="german"), tvectors=blogs_stem)
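
For completeness, the example above assumes the following setup (file and object names as provided here):

library(LSAfun)        # provides Cosine()
library(SnowballC)     # provides wordStem()
load("blogs_stem.rda")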

literature

German LSA space, 300 dimensions

Created from German books and letters, downloaded from the Project Gutenberg homepage. These are mainly books and letters written in the 18th, 19th and 20th centuries, so their copyrights have expired. The books and letters were automatically split into paragraphs, resulting in 50,800 documents from which the LSA space was created.

The LSA space created from this corpus contains 74,636 different terms.

newspapers

German LSA space, 300 dimensions

Created from 54,959 newspaper articles, extracted from the German Corpus December 2011, downloaded from HC Corpora. It mainly contains articles from abendblatt.de, derwesten.de, faz.net, fazfinance.net, focus.de, handelsblatt.com, haz.de, muensterschezeitung.de, ruhrnachrichten.de, spiegel.de, stern.de, sueddeutsche.de, tagesspiegel.de, welt.de and zeit.de. These articles were randomly selected from all the articles available in this corpus. Very long or very short articles were excluded. Every article was used as a document for the LSA algorithm. Terms that appeared less than two times were excluded from further computations.

The LSA space created from this corpus contains 109,480 different terms.

blogs_beagle

German BEAGLE space, 1024 dimensions

Created from 50,318 blog entries, extracted from German Corpus December 2011, downloaded from HC Corpora. These entries were randomly selected from all the entries available in this corpus. Very long or very short entries were excluded. Every entry was used as a document for the algorithm. A Composite Model was used to create the BEAGLE Space (Context Information and Order Information).
For further information on BEAGLE, see:

Jones, M.N., & Mewhort, D.J.K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1-37.

The BEAGLE space created from this corpus contains 103,599 different terms.

sdewac_hafu

German HAL-type space, 300 dimensions


Created from the ~800 million word sdeWaC corpus (see here).
This space was built using a HAL-like moving window model, with a window size of 5 (2 to the left, 2 to the right), with the 20k most frequent words in the corpus plus some hand/foot-related words as row words, and the 20k most frequent words as content (column) words for the cooccurrence matrix. A Positive Pointwise Mutual Information weighting scheme was applied, as well as a Singular Value Decomposition to reduce the space from 20k to 300 dimensions.
This space contains vectors for 20,064 different words.