SSP 1995 project summary:


Thesaurus Linguae Graecae (TLG) Parallel Searching

The TLG is a collection of electronic texts available on CD-ROM. The texts run to some 60 million words and take up about 550 megabytes. They comprise most of Ancient Greek literature, covering many authors. The texts are of interest to classicists, historians, experts on literature and language, and those studying biblical Greek and the Greek Church Fathers. The proposer's main interest is in the New Testament, but the project would assist all the other fields as well.

At present the texts can be searched either on a specialist microcomputer (the Ibycus system) or on a 486 using special software (Lbase, TLG Workplace, Musaios, etc.). These searches work well for a small group of authors, but searching the whole corpus takes at least 40 minutes, and that is the best case, using the Ibycus. Parallel searching should give the researcher results much faster. The main aim of searching is to match a set of words or part-words. It is also sometimes useful to search via the index, especially where rarer words are involved.

Various search strategies could be used. The special software on the Ibycus, a non-standard PC, gets good results and runs faster than the software available on a 486. Its search routines use the Boolean operators AND and OR and allow either whole words or part-words to be used. The texts contain accents and breathings. These need to be ignored when specifying the search pattern and when searching, but displayed when the results are shown and written to the result file. A better wildcard facility would improve searching, as would the inclusion of a NOT operator. The Ibycus allows the span of a search to be set between 3 and 200 characters. Searches need to run across line, sentence and some other divisions of the text.
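The matching rules described above (ignore accents and breathings when matching, show them in the results, combine part-words with AND or OR inside a character span) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Ibycus routine or any TLG package: it assumes the text is held as a Unicode string, whereas the encoding used on the CD-ROM is not described here, and the names `strip_marks`, `find_all`, `original_slice` and `search` are hypothetical. The span is counted in characters of the accent-stripped text.

```python
import unicodedata


def strip_marks(text):
    """Fold `text` for matching.

    Returns (plain, index_map): `plain` is `text` case-folded with combining
    marks (accents, breathings, iota subscript) removed; index_map[k] is the
    index in `text` of the character that produced plain[k].
    """
    plain, index_map = [], []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(d):
                continue  # drop the diacritic, keep the base letter
            for f in d.casefold():  # casefold also maps final sigma to sigma
                plain.append(f)
                index_map.append(i)
    return "".join(plain), index_map


def find_all(plain, term):
    """Start offsets of every (possibly overlapping) occurrence of `term`."""
    hits, pos = [], plain.find(term)
    while pos != -1:
        hits.append(pos)
        pos = plain.find(term, pos + 1)
    return hits


def original_slice(text, index_map, lo, hi):
    """Slice of the accented `text` that corresponds to plain[lo:hi]."""
    start, end = index_map[lo], index_map[hi - 1] + 1
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1  # keep trailing marks if the text is stored decomposed
    return text[start:end]


def search(text, terms, mode="and", span=200):
    """Return accented snippets matching `terms` (whole words or part-words).

    mode="or":  one snippet per occurrence of any term.
    mode="and": one snippet per occurrence of the first term that has every
                other term starting within `span` characters of it.
    """
    plain, index_map = strip_marks(text)
    folded = [f for f in (strip_marks(t)[0] for t in terms) if f]
    if not folded:
        return []
    hits = [find_all(plain, t) for t in folded]
    if mode == "or":
        found = sorted((p, len(t)) for t, ps in zip(folded, hits) for p in ps)
        return [original_slice(text, index_map, p, p + n) for p, n in found]
    results = []
    for p in hits[0]:
        lo, hi = p, p + len(folded[0])
        for t, ps in zip(folded[1:], hits[1:]):
            near = [q for q in ps if abs(q - p) <= span]
            if not near:
                break  # this term is missing from the span: no AND match
            q = min(near, key=lambda q: abs(q - p))
            lo, hi = min(lo, q), max(hi, q + len(t))
        else:
            results.append(original_slice(text, index_map, lo, hi))
    return results


# Pattern typed without accents; the snippet comes back with them.
text = "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν"
print(search(text, ["αρχη", "λογ"], mode="and", span=20))
```

Keeping an index map from the stripped text back to the original is what lets the pattern be typed without diacritics while the result file still carries the accented text. A NOT operator or wildcards could be layered on the same folded string.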

In essence, the task is word searching over a very large text. Existing non-parallel search tools, such as grep and the commands in ecce, might provide some ideas on the sort of routines to include. It would need to be explored whether any of the providers of software for the TLG would collaborate, and whether the project would wish to be tied to any such software group.

The corpus of texts could be split into segments in some way, so that the segments can be searched in parallel and the whole process speeded up very considerably. That would be the new factor here. The solution should preferably not depend on one piece of hardware which might have a limited life or be available at only a few sites.
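One way to picture the segment idea is the sketch below, which cuts a text into roughly equal pieces, searches each piece independently, and merges the offsets. It is a minimal illustration, not the project's design: the function names are hypothetical, the corpus is assumed to fit in memory as one string, and a thread pool from the Python standard library stands in for whatever parallel machine is actually used (threads show the structure but would not by themselves give a real speed-up for this kind of work). Each segment is extended by a small overlap so that a match straddling a segment boundary is not lost.

```python
from concurrent.futures import ThreadPoolExecutor


def split_with_overlap(text, n_segments, overlap):
    """Return (start_offset, chunk) pairs covering `text`.

    Each chunk runs `overlap` characters past its nominal end so that a
    match beginning near the end of one segment is still seen in full.
    """
    size = max(1, -(-len(text) // n_segments))  # ceiling division
    return [(start, text[start:start + size + overlap])
            for start in range(0, len(text), size)]


def search_segment(job):
    """Find every occurrence of the pattern in one chunk (global offsets)."""
    start, chunk, pattern = job
    hits, pos = [], chunk.find(pattern)
    while pos != -1:
        hits.append(start + pos)
        pos = chunk.find(pattern, pos + 1)
    return hits


def parallel_search(text, pattern, n_segments=4, executor_cls=ThreadPoolExecutor):
    """Search all segments concurrently and merge the sorted offsets."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1  # enough to cover any boundary-crossing match
    jobs = [(start, chunk, pattern)
            for start, chunk in split_with_overlap(text, n_segments, overlap)]
    with executor_cls(max_workers=n_segments) as pool:
        per_segment = pool.map(search_segment, jobs)
        # The set guards against a hit being reported by two segments.
        return sorted({h for hits in per_segment for h in hits})
```

Because each job carries only its chunk and the pattern, the same split could be handed to separate processes or machines without changing the search routine, which fits the aim of not tying the solution to one piece of hardware.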

The search results would need to be passed to a system (on a 486?) which could display them and allow them to be included in documents in Word for Windows 6 format, using special fonts for Greek. Some alternative to this might be needed, but this is the current preference.

Irene Moulitsas worked on this project.

Compressed PostScript of the project's final report is available here (47219 bytes).

Webpage maintained by mario@epcc.ed.ac.uk