LINGUIST List 2.821

Sat 23 Nov 1991

FYI: Brown and LOB Corpora


Directory

  • Henry Kucera, Re: 2.809 Queries: Brown Corpus, Circassian, Croat, Socio
  • Steve Fligelstone, Re: 2.809 Queries: Brown Corpus, Circassian, Croat, Socio

    Message 1: Re: 2.809 Queries: Brown Corpus, Circassian, Croat, Socio

    Date: Thu, 21 Nov 91 09:49:23 EST
    From: Henry Kucera <HENRYbrownvm.brown.edu>
    Subject: Re: 2.809 Queries: Brown Corpus, Circassian, Croat, Socio
    This concerns the query re the Brown and LOB corpora.

    The Brown corpus (American English) is available to non-profit organizations (such as universities), essentially in two formats.

    The text-only (so-called "untagged") version is available on tape or diskettes from our friends at the Norwegian Centre for Humanistic Research, P.O. Box 54, University of Bergen, Bergen, Norway. The cost varies depending on format and the dollar exchange rate; it is in the range of $100-$200. E-mail (for Bitnet) is: FAFSRVNOBERGEN. However, you would have to sign a written agreement (no copying, no commercial use, etc.). The size varies depending on format, but the untagged, uncompressed Brown corpus (without grammatical designators) is about 8 MB.

    The "tagged" version of the corpus (which includes an annotation of every word by an expanded grammatical class, 82 classes in all) is available from Text Research, 196 Bowen Street, Providence, RI 02906. Because of its size, it comes on mag. tape only (1600 or 6250 bpi, ASCII or EBCDIC), and its cost to academic institutions is $1,000. The reason for the price difference is that the tagged corpus provides much more information and carries a separate copyright. There are also some restrictions: no copying, no commercial use, etc. A written agreement must be signed by a responsible official of the Department or University Administration. Text Research has no connection with Brown University and has no e-mail address. However, you can either send e-mail to me for transmission or send a fax to Text Research at 401-751-8958. The tagged database is quite large, about 53 MB. However, it can be fairly easily compressed by a skilled programmer. A large manual, giving a detailed description of the tags, etc., is included.

    Incidentally, there are no discounts available for either the tagged or the untagged version. These are fixed prices. Non-academic use is possible only by obtaining a license from Text Research.

    As for the LOB corpus (British English): both untagged and tagged versions are available, but only to non-profit institutions, from the address in Bergen given above. There are fairly severe restrictions on its use, as far as I remember (because of British copyright laws). I can't cite the prices right now, but the Bergen people are pretty good at answering e-mail.

    Hope this helps.

    Henry Kucera

    Message 2: Re: 2.809 Queries: Brown Corpus, Circassian, Croat, Socio

    Date: Thu, 21 Nov 91 16:48:41 GMT
    From: Steve Fligelstone <eia002cent1.lancs.ac.uk>
    Subject: Re: 2.809 Queries: Brown Corpus, Circassian, Croat, Socio
    Mark Sanderson asks about availability of tagged versions of the Brown and LOB (Lancaster/Oslo-Bergen) Corpora.

    The tagged LOB Corpus, along with several other widely used corpora, can be obtained by writing to ICAME (International Computer Archive of Modern English) at this address:

        Knut Hofland, ICAME
        Norwegian Computing Centre for the Humanities
        Harald Harfagresgt. 31
        Postboks 53 Universitetet
        N-5027 Bergen
        NORWAY
        email (earn/bitnet): fafkhnobergen

    The Brown Corpus is also available from this source, but not in tagged format. However, I understand that the tagged version may be obtained from TEXT RESEARCH, 186 Bowen St., Providence RI 02906, U.S.A.

    There is furthermore a grammatically analysed (parsed, as opposed to merely part-of-speech tagged) version of part of the Brown Corpus. This is referred to as the Gothenburg Corpus. For details contact:

        Gudrun Magnusdottir
        Sprakdata
        Goteborgs Universitet
        S-412 98 Goteborg
        Sweden

    Finally, here at Lancaster work is nearing completion (honestly!) on a parsed version of part of the LOB Corpus. Write to me if you want to be kept informed of its progress and availability.

        Steve Fligelstone
        UCREL
        Linguistics Department
        Bowland College
        Lancaster University
        GB-Lancaster LA1 4XZ
        email: eia002uk.ac.lancaster