
Automatic Specialized vs. Non-Specialized Sentence Differentiation

Iria da Cunha β,γ,α, M. Teresa Cabré α, Juan Manuel Torres-Moreno β,γ,δ, Eric SanJuan γ, Jorge Vivaldi α, and Gerardo Sierra β

α Institut Universitari de Lingüística Aplicada - UPF, Roc Boronat, 138, E-08018 Barcelona (España).

β Grupo de Ingeniería Lingüística - Instituto de Ingeniería UNAM, Torre de Ingeniería, Basamento, Ciudad Universitaria, México, D.F. 04510, México.

γ Laboratoire Informatique d'Avignon - UAPV, 339 chemin des Meinajaries, BP 91228, 84911 Avignon Cedex 9, France.

δ École Polytechnique de Montréal - Département de génie informatique, CP 6079, Succ. Centre-Ville, H3C 3A7 Montréal (Québec), Canada.

http://www.lia.univ-avignon.fr, http://www.iula.upf.edu, http://www.iling.unam.mx

Abstract. The compilation of Languages for Specific Purposes (LSP) corpora is a task fraught with difficulties (mainly time and human effort), because it is not easy to discern between specialized and non-specialized text. The aim of this work is to study automatic specialized vs. non-specialized sentence differentiation. The experiments are carried out on two corpora of sentences extracted from specialized and non-specialized texts: one in economics (academic publications and news from newspapers), the other about sexuality (academic publications and texts from forums and blogs). First we show the feasibility of the task using a statistical n-gram classifier. Then we show that grammatical features can also be used to classify sentences from the first corpus. For that purpose we use association rule mining.

Key words: Specialized Text, General Text, Corpus, Languages for Specific Purposes, Statistical Methods, Association Rules, Grammatical Features

1 Introduction

The compilation of Languages for Specific Purposes (LSP) corpora, that is, corpora including specialized texts, is necessary to carry out several tasks, such as terminology extraction and the compilation of specialized dictionaries, lexicons or ontologies. This corpus compilation is costly in human time and effort: until now, professionals or specialists have had to decide whether a text is specialized or not.

But what is a specialized text? [1] mentions some features to be considered in order to answer this question: the text's author, the potential reader, the structural organization and the selection of lexical units. There are two types of variability in specialized texts: horizontal, determined by the subject, and vertical, determined by the specialization level. With regard to the latter, as shown in [2], three specialization levels can be considered: high (specialized writer and specialized receiver), medium (specialized writer and semi-specialized receiver, e.g. students) and low (specialized writer and non-specialized receiver, i.e. the general public).

Newspaper articles, for example, should be considered of low specialization level: they may deal with technical subjects, such as economics, medicine or law, but they do not share the "conceptual and lexical control" of the domain.

There are several theoretical works about the differences between general and specialized texts. Most of them consider the lexicon to be the most discriminative factor (besides being the most visible) for carrying out this differentiation. It is well known that terms (units of the lexicon with a precise meaning in a particular domain [3]) convey the specialized content of a subject; therefore, they appear in texts of their domain. But there are other features of specialized texts (such as grammatical features, both morphological and syntactic) that can be considered specific to these texts. Features such as verbal inflection related to grammatical person, tense or mood have been highlighted in some works [4]. Some authors, using small corpora, have established grammatical phenomena that may differentiate specialized texts. In some cases, they have considered only a very limited number of features of a single category; in other cases, a small number of texts has been analysed manually. [5] analyses the frequency of nouns and verbs in a general corpus and a specialized corpus. Some authors have studied verbs in specialized French corpora [6-9]. The works of [10, 11] are the first in which this subject is studied using a larger corpus (two million words). They conclude that certain grammatical features, besides the lexicon, have a strong potential to differentiate specialized texts from non-specialized texts.

The aim of this work is to study automatic specialized vs. non-specialized sentence differentiation. The experiments are carried out on two corpora of sentences extracted from specialized and non-specialized texts: one in economics (academic publications and news from newspapers), the other about sexuality (academic publications, and texts from forums and blogs).

This paper is organized as follows. In Section 2 we explain the methodology of our work. First, in Section 3, we show the feasibility of the task using a statistical n-gram classifier. Then, in Section 4, we show that grammatical features can also be used to classify sentences from the economics corpus. Finally, Section 5 presents the conclusions of the paper and future work.

2 Methodology

We have compiled two corpora: a corpus including economics texts and a corpus including texts from the sexuality domain. Each one was divided into two subcorpora: specialized vs. non-specialized (or general).

The economics corpus was divided as follows:

1. A sub-corpus including texts from the specialized domain of economics, mainly scientific papers, books, theses, etc. (with 292,804 tokens included in 9,243 sentences).

2. A sub-corpus with non-specialized texts from the economics subsection of Spanish newspapers (with 1,232,512 tokens corresponding to 36,236 sentences).

These texts have been extracted from the Technical Corpus of the Institute for Applied Linguistics¹ (IULA-CT) of the Universitat Pompeu Fabra in Barcelona. It consists of documents in Catalan, Spanish, English, German and French, although the search through bwanaNet is at the moment restricted to the first three of these languages. It contains texts from several domains (economics, law, computing, medicine, genome and environment) as well as texts from newspapers. All the texts are POS-tagged. This corpus is accessible online via http://bwananet.iula.upf.edu/. Further details on these resources are given in [12].

The sexuality corpus was divided as follows:

1. A sub-corpus including texts from the specialized domain of sexuality, mainly scientific papers, books, theses, etc. (with 127,903 tokens included in 6,368 sentences).

2. A generic sub-corpus with texts from HTML pages, blogs and forums about sexuality (with 384,659 tokens corresponding to 31,475 sentences).

These texts have been extracted from the Sexuality Corpus of the Grupo de Ingeniería Lingüística (GIL)² at the Universidad Nacional Autónoma de México (UNAM) [13]. In this corpus, texts are divided into five levels:

1. Level 1: Texts from Google Scholar.
2. Level 2: Texts from sexuality associations.
3. Level 3: PDF texts.
4. Level 4: Word and HTML texts.
5. Level 5: Texts from blogs and forums.

For our experiments we have used texts from level 1 (specialized) and texts from levels 4 and 5 (non-specialized). All the texts were POS-tagged.

Both corpora (economics and sexuality) contain specialized and non-specialized texts. However, there is an important difference between them. The first one includes academic or journalistic texts, so all the texts are well written with a defined style, since their authors are journalists or specialists from the domain. The second one, the sexuality corpus, also includes (mainly) academic texts in its specialized sub-corpus, but its non-specialized part contains texts from blogs and forums about sexuality, where the sentences are not always well written and are sometimes incomplete. This is a more "ambiguous" corpus, more difficult to characterize, which makes it interesting for our experiments as well.

¹ http://www.iula.upf.edu
² http://www.iling.unam.mx/

Finally, we are interested in working at the sentence level instead of with entire documents. Documents can be classified using contextual information about their structure or statistical information about their specific vocabulary; at the sentence level, neither kind of information is available. Clearly, we target an application that can look for technical/non-technical statements inside any document type. We first show that this is possible, at least using a statistical n-gram approach; then we study how grammatical information can be used to generate intuitive decision rules.

3 Sentence Classification Based on n-grams

We have developed an algorithm based on a ranking of n-grams. Two language models (LM) are constructed: one, LMspe, over the specialized corpus and another, LMgen, over the non-specialized corpus.

3.1 Algorithm

The n-gram distance algorithm is simple. It is inspired by the methods used in DEFT [15]. A language model is generated using a sliding window of n characters, with n = 1, ..., 15. This produces the two language models LMspe and LMgen. In the same way, we also consider the language model LMX generated by an unknown sentence X. To classify X we compute a rank-based distance (the absolute differences between n-gram ranks) between LMX and each of LMspe and LMgen, and we choose the category whose model is closer to X.
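As an illustration, here is a minimal sketch of such a classifier, under the assumption that the rank distance is the classic "out-of-place" measure (the sum of absolute rank differences between the sentence's n-grams and the model's); the paper leaves the exact formula underspecified, and all names below are ours. The default n_max = 6 matches the economics experiments; up to 15 was used for the sexuality corpus.

from collections import Counter

def ngram_ranks(text, n_max=6, top_k=20000):
    """Rank table of the top_k most frequent character n-grams, n = 1..n_max."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {g: r for r, (g, _) in enumerate(counts.most_common(top_k))}

def out_of_place(sent_ranks, model_ranks, penalty):
    """Sum of absolute rank differences; unseen n-grams pay the maximal penalty."""
    return sum(abs(r - model_ranks.get(g, penalty)) for g, r in sent_ranks.items())

def classify(sentence, lm_spe, lm_gen, top_k=20000):
    """Assign the sentence to the category whose language model is closer."""
    s = ngram_ranks(sentence, top_k=top_k)
    d_spe = out_of_place(s, lm_spe, top_k)
    d_gen = out_of_place(s, lm_gen, top_k)
    return "SPE" if d_spe < d_gen else "GEN"

# lm_spe = ngram_ranks(open("spe_train.txt").read())  # specialized sub-corpus
# lm_gen = ngram_ranks(open("gen_train.txt").read())  # non-specialized sub-corpus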

3.2 Results

From the economics corpus we have randomly selected 9,000 sentences from each category (specialized and non-specialized). From the sexuality corpus we have randomly selected 544 sentences from the non-specialized category and 635 sentences from the specialized one; that experiment has therefore been carried out on a set of 1,179 sentences. We have used 90% of both corpora for training and 10% for testing, replicating this split 30 times at random.
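For concreteness, the split protocol can be sketched as follows (the structure and names are ours, not the authors' exact setup):

import random

def random_splits(sentences, n_rep=30, test_frac=0.10, seed=0):
    """Yield n_rep independent random 90/10 train/test splits of a labelled list."""
    rng = random.Random(seed)
    data = list(sentences)
    for _ in range(n_rep):
        rng.shuffle(data)
        cut = int(len(data) * (1 - test_frac))
        yield data[:cut], data[cut:]  # (train, test)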

Table 1 includes the results obtained by the n-gram algorithm over the economics corpus. Performances using the first 20,000 and 30,000 n-grams are shown. Table 2 contains the results obtained over the sexuality corpus; in this case, n = 15 and the number of n-grams is 500,000. These results show that the use of a higher n and a larger quantity of n-grams has a positive influence on the results. These results (average F-score of 0.8385 over the economics corpus and 0.8257 over the sexuality corpus) are interesting, because they mean that a simple n-gram distance strategy can distinguish specialized and non-specialized texts correctly.

Table 3 shows a sample of n-grams from both language models, ordered by rank and with their number of occurrences. The smaller the rank, the less discriminant the corresponding n-gram.

Table 1. Results of the n-gram classifier over the economics corpus.

                  20K 6-grams                      30K 6-grams
          Precision  Recall   F-Score      Precision  Recall   F-Score
GEN         0.6341   0.8312   0.7194         0.6744   0.8475   0.7511
SPE         0.9532   0.8776   0.9138         0.9583   0.8955   0.9259
Average     0.7937   0.8544   0.8166         0.8164   0.8715   0.8385

Table 2. Results of the n-gram classifier over the sexuality corpus.

                 400K 13-grams                    500K 15-grams
          Precision  Recall   F-Score      Precision  Recall   F-Score
GEN         0.7999   0.8121   0.8058         0.8102   0.8156   0.8128
SPE         0.8370   0.8257   0.8312         0.8412   0.8361   0.8385
Average     0.8184   0.8189   0.8185         0.8257   0.8258   0.8257

Table 3. Sample n-grams from the specialized (SPE) and non-specialized (GEN) language models, with rank and number of occurrences.

Rank      n-gram (SPE)      Occurrences      n-gram (GEN)      Occurrences
1         "e"               73472            "e"               57000
...       ...               ...              ...               ...
254       "sexual"          1549             "a de"            1140
...       ...               ...              ...               ...
272       "e s"             1444             "sexual"          1062
...       ...               ...              ...               ...
1890      "porno"           247              "de es"           187
...       ...               ...              ...               ...
2652      "s pe"            182              "porno"           142
...       ...               ...              ...               ...
4351      "a a l"           123              "orgasmo"         92
...       ...               ...              ...               ...
6767      "el condon"       86               "iolencia"        54
...       ...               ...              ...               ...
7757      "orgasmo"         76               "nfecci"          55
...       ...               ...              ...               ...
499999    "uinaria porn"    2                "de una put"      2

With regard to the sexuality corpus, the n-gram strategy maintains its performance, obtaining an average F-score of 0.8257, that is, 0.0128 lower than over the economics corpus.

4 Grammatical Features for Specialized vs. Non-Specialized Sentence Differentiation

We have selected some linguistic features that may be characteristic of specialized texts and non-specialized texts.

4.1 Feature description

We have used the features detected by [10] and [11]; Table 4 shows them. The full meaning of these POS tags can be seen at the following URL: http://www.iula.upf.edu/corpus/etqfrmes.htm. Some POS tags are produced by underspecification of the full tag (e.g. "A" is an underspecification of "AMS", "AMP", etc.). The machine learning approach that we have used is based on association rules, one of the best-known methods for detecting relations among variables in large symbolic (i.e. non-numerical) data [14].

Table 4. Linguistic features used in our work.

POS tag   Meaning
A         Determiner
C         Conjunction
D         Adverb
E         Specifier
JQ        Qualifier adjective
J         Adjective
N4        Proper noun
N5        Common noun
P         Preposition
R         Pronoun
T         Date
VC        Verb (participle)
V1P       Verb (first person, plural)
V1S       Verb (first person, singular)
V2        Verb (second person)
V         Verb
X         Number

Table 5 shows an example of plain text and its corresponding generated test-corpus entry. In bold we have marked the category GEN, which indicates that this sentence is classified as part of a non-specialized text. Observe that the "Plain text" section includes the sentence as found in the general corpus, while the "Attributes generated from text" section includes just the list of the lemmas/tags found in that sentence.

Table 5. Example of an economics plain text and the attributes generated from it.

Plain text
Tras el acuerdo con los pilotos, la dirección de Alitalia concluyó ayer de madrugada la negociación con los sindicatos del personal de tierra, que aceptaron 2.500 despidos (la propuesta inicial era de 3.500), la congelación de los salarios durante dos años y el bloqueo del fondo de previsión social durante el mismo periodo, para evitar la quiebra de la compañía.

Attributes generated from text
GEN ser congelación despido previsión tierra dos dirección el tras para quiebra periodo negociación mismo piloto bloqueo = salario A Alitalia C D de N4 N5 personal compañía fondo P R que JQ V propuesta num X social con ayer aceptar madrugada sindicato concluir año inicial durante acuerdo y evitar
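The transformation from the tagged sentence to this attribute list can be sketched as below, under the (hypothetical) assumption that the tagged corpus provides (word, lemma, POS) triples for each sentence:

def sentence_attributes(tagged_sentence, label):
    """Collapse a POS-tagged sentence into the unordered set of its lemmas
    and tags, plus its category label (GEN or SPE)."""
    attrs = {label}
    for word, lemma, tag in tagged_sentence:
        attrs.add(lemma)  # lexical attribute
        attrs.add(tag)    # grammatical attribute
    return attrs

# Toy fragment of the Table 5 sentence:
triples = [("Tras", "tras", "P"), ("el", "el", "A"), ("acuerdo", "acuerdo", "N5")]
print(sentence_attributes(triples, "GEN"))  # {'GEN', 'tras', 'P', 'el', 'A', 'acuerdo'}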

4.2 Association rules

We consider association rules of the form X ⇒ D, where X is a set of at most 5 lemmas and/or tags and D is the decision: SPE for specialized or GEN for general. For a rule to be valid, X has to be included in more than 0.5% of the sentences (this is called the support of the rule) and more than 90% of the sentences that include X have to be in category D (this is called the confidence of the rule). Since the right part of the rule is restricted to a small number of categories, we shall refer to these rules as decision rules. Rules of this kind can be computed using "Apriori", a standard GPL package by Christian Borgelt (http://www.borgelt.net/apriori.html); a sketch of this mining step is given after the two observations below. Our experiments over the economics corpus show that this strategy yields 46,148 decision rules. It appears that:

– 60% of the rules induce category SPE, which means that there are more implicit decision rules among specialized texts than among non-specialized ones.

– 78% of the rules include at least one grammatical tag, which shows that this information is significant for distinguishing between these two categories.
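The mining step can be reproduced approximately with any Apriori implementation. Below is a minimal sketch using the mlxtend library in place of Borgelt's command-line tool; the thresholds (0.5% support, 90% confidence, at most 5 lemmas/tags per antecedent) follow the text, while the toy transactions and everything else are ours.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions: each sentence as its set of lemmas/tags plus its label.
sentences = [
    ["europea", "N4", "JQ", "N5", "SPE"],
    ["millones", "X", "JQ", "P", "SPE"],
    ["anunciar", "N4", "P", "GEN"],
    ["ayer", "uno", "R", "N4", "GEN"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(sentences), columns=te.columns_)

# Itemsets occurring in > 0.5% of sentences; at most 5 items + 1 decision label.
frequent = apriori(df, min_support=0.005, use_colnames=True, max_len=6)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

# Keep only decision rules: those whose consequent is exactly a category label.
labels = [frozenset({"SPE"}), frozenset({"GEN"})]
decision = rules[rules["consequents"].isin(labels)]
print(decision[["antecedents", "consequents", "support", "confidence"]])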

Here is a sample set of 10 rules randomly extracted from the total list of decision rules for the economics corpus. Rules are given in Prolog format: the decision is on the left, and the two figures on the right give respectively the support and the confidence of the rule.

SPE ← europea N4 JQ N5 (50, 100.0)
SPE ← millones X JQ P (70, 100.0)
GEN ← anunciar N4 P = (80, 98.3)
GEN ← ayer uno R N4 (10, 100.0)
SPE ← función C JQ D (12, 93.1)
GEN ← Gobierno haber VC V (60, 100.0)
GEN ← España que P = (100, 100.0)
SPE ← embargo sin de N5 (70, 100.0)
SPE ← internacional a R N5 (12, 90.8)
GEN ← presidente en R JQ (80, 93.0)

Each rule therefore indicates that if a given set of lemmas and tags is included in a sentence, there is a specific probability of classifying the sentence as general (GEN) or specialized (SPE). As an example, the first rule may be read as follows: if the sentence under analysis includes the lemma "europea" and words with the POS tags "N4", "JQ" and "N5", then that sentence may be classified as specialized (SPE). The support of this rule is 50 sentences, with a confidence of 100%.

4.3 Classifiers based on decision rules

Once this set of rules is available, it is possible to build a classifier that, given a sentence, looks for the set of rules that match the sentence and chooses the rule with the highest confidence. One important feature of this type of classifier is that it indicates when it cannot take a decision.

As a variant of this basic classifier (Classifier 1), we have developed a classifier that only takes into account those rules including at least one POS tag (Classifier 2). In this way it is possible to evaluate the actual impact of using POS tags as classification attributes. A sketch of both classifiers follows.
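A minimal sketch of both classifiers, assuming each mined rule is stored as an (antecedent, decision, confidence) triple; the representation and the tags_only switch are ours.

from typing import Optional

# POS tags from Table 4, used to tell grammatical tags apart from lemmas.
POS_TAGS = {"A", "C", "D", "E", "JQ", "J", "N4", "N5", "P", "R", "T",
            "VC", "V1P", "V1S", "V2", "V", "X"}

def classify(sentence_items, rules, tags_only=False) -> Optional[str]:
    """Classifier 1 (tags_only=False) or Classifier 2 (tags_only=True):
    apply the highest-confidence matching rule; return None (abstain)
    if no rule matches the sentence."""
    candidates = [
        (confidence, decision)
        for antecedent, decision, confidence in rules
        if antecedent <= sentence_items
        and (not tags_only or antecedent & POS_TAGS)  # Classifier 2 filter
    ]
    return max(candidates)[1] if candidates else None

# Two rules from the sample in Section 4.2:
rules = [(frozenset({"europea", "N4", "JQ", "N5"}), "SPE", 100.0),
         (frozenset({"anunciar", "N4", "P"}), "GEN", 98.3)]
print(classify(frozenset({"europea", "N4", "JQ", "N5", "ser"}), rules))  # SPE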

4.4 Results

To evaluate the results of both algorithms we have used classical precision, recalland F-Score measures.
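For reference, the standard definitions, computed per category with TP, FP and FN denoting true positives, false positives and false negatives:

P = TP / (TP + FP),   R = TP / (TP + FN),   F = 2 * P * R / (P + R).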

The results of Classifier 1 over the economics corpus are shown in Table 6.

Table 6. Results of Classifier 1 over the economics corpus.

          Precision  Recall   F-Score
GEN         0.7602   0.8671   0.8137
SPE         0.8875   0.7239   0.8057
Average     0.8190   0.7890   0.8040

We have carried out another experiment over the economics corpus, using for the classifier (Classifier 2) only the association rules that include at least one grammatical feature (POS tag). This is a subset of 36,217 rules (78%). The results obtained by Classifier 2 over the economics corpus are shown in Table 7.

This evaluation shows that eliminating the rules based exclusively on lemmas does not significantly degrade classifier performance. In fact, it seems to slightly improve the average F-score (from 0.8040 to 0.8051).

Table 8 includes the results obtained by Classifier 1 over the sexuality corpus, using very short association rules (with 3 tokens and 1 token).

Table 7. Results of Classifier 2 over the economics corpus.

          Precision  Recall   F-Score
GEN         0.7582   0.8959   0.8213
SPE         0.8749   0.7182   0.7889
Average     0.8166   0.8071   0.8051

The results show that Classifier 1 performance is better over the economics corpus than over the sexuality corpus (with average F-scores of 0.8040 and 0.7465, respectively). This suggests that the grammatical (POS tag) and lexical (token) features found in the specialized economics texts are quite different (that is, more discriminant) from those found in the non-specialized texts. Although these features still allow Classifier 1 to discriminate between specialized and non-specialized texts in the sexuality domain, they are less representative of each of those sub-corpora.

Table 8. Results of Classifier 1 over the sexuality corpus.

                 3-word rules                     1-word rules
          Precision  Recall   F-Score      Precision  Recall   F-Score
GEN         0.7573   0.6944   0.7245         0.7455   0.7371   0.7412
SPE         0.7258   0.7843   0.7539         0.7478   0.7559   0.7518
Average     0.7416   0.7393   0.7392         0.7466   0.7465   0.7465

Our results show that both strategies (association rules and n-gram distances) work better over the economics corpus than over the sexuality corpus. This is due to the fact that the economics corpus is a "real" specialized/non-specialized corpus: all its sentences are well written, with a well-defined style and a correct ordering of grammatical tags. This is expected, since the authors of these texts were specialists in the domain (in the case of academic texts) or journalists (in the case of newspaper texts).

The results obtained with both strategies over the economics corpus are good, although the results with n-gram distances are somewhat better than those obtained with association rules (average F-scores of 0.8385 vs. 0.8051). Nevertheless, the association-rule strategy has one advantage: the generated rules are humanly understandable and interpretable. The n-gram strategy offers only character n-grams, that is, unintelligible short textual passages (as the information in Table 3 shows).

However, the association-rule strategy over the sexuality corpus does not obtain results as good as over the economics corpus (average F-scores of 0.7465 vs. 0.8051, respectively; that is, 0.0586 lower). This is due to the fact that the specialized and non-specialized sexuality corpora contain texts extracted from very different sources (academic vs. forums and blogs), but the vocabulary they contain is very similar. This situation makes the differentiation task more difficult.

5 Conclusion and Future Work

The results we have obtained so far show that both strategies used in this work (n-gram distances and association rules based on lexical and grammatical features) are suitable for differentiating sentences from specialized and non-specialized texts. The results of the first experiment, employing a simple n-gram distance algorithm (generating language models for both corpora), show that the performance of this strategy is high. The results of the second experiment, using lexical and grammatical features, show that grammatical features are discriminant enough for this task. We have shown that both approaches are useful to classify texts as specialized/non-specialized. The F-scores obtained by the two methods are similar on the economics corpus, but the classifier based on n-gram distances is clearly better when applied to the sexuality corpus. Such results seem to show that linguistic information is not as useful as foreseen, but the specific characteristics of the texts included in the sexuality corpus may be the origin of this behaviour. These texts come from sources quite different from the texts in economics, since they come mainly from blogs, forums, associations, etc., which produce non-structured texts (incomplete or even non-grammatical sentences, or wrong words). This requires additional experimentation in other domains, as well as with texts coming from equivalent sources.

We also plan to develop an automatic tool able to detect sentences from specific domains (e.g. medicine, economics, law, biology or physics), giving the user the option to choose between specialized and non-specialized texts.

We consider that our results open an innovative perspective for research in domains related to terminology, specialized discourse and computational linguistics, such as the automatic compilation of LSP corpora or the optimization of search engines.

References

1. M. T. Cabré. Textos especializados y unidades de conocimiento: metodología y tipologización. In García Palacios, J.; Fuentes, M. T. (eds.), Texto, terminología y traducción, pages 15-36. Salamanca: Ediciones Almar, 2002.

2. J. Pearson. Terms in Context. Amsterdam: John Benjamins, 1998.

3. M. T. Cabré. La terminología. Representación y comunicación. Barcelona: IULA-UPF, 1999.

4. R. Kocourek. La langue française de la technique et de la science. Vers une linguistique de la langue savante. Wiesbaden: Oscar Branstetter, 1991.

5. L. Hoffmann. Kommunikationsmittel Fachsprache - Eine Einführung. Berlin: Sammlung Akademie Verlag, 1976.

6. R. Coulon. French as it is written by French sociologists. Bulletin pédagogique des IUT, (18):11-25, 1972.

7. H. Cajolet-Laganière and N. Maillet. Caractérisation des textes techniques québécois. Présence francophone, (47):113-147, 1995.

8. M. C. L'Homme. Contribution à l'analyse grammaticale de la langue de spécialité : le mode, le temps et la personne du verbe dans quelques textes scientifiques écrits à vocation pédagogique. Québec: Université Laval, 1993.

9. M. C. L'Homme. Formes verbales de temps et texte scientifique. Le langage et l'homme, 2-3(31):107-123, 1995.

10. M. T. Cabré, C. Bach, I. da Cunha, A. Morales, and J. Vivaldi. Comparación de algunas características lingüísticas del discurso especializado frente al discurso general: el caso del discurso económico. In XXVII Congreso Internacional de AESLA: Modos y formas de la comunicación humana (AESLA 2009). Ciudad Real: Universidad de Castilla-La Mancha, 2010.

11. M. T. Cabré. Constituir un corpus de textos de especialidad: condiciones y posibilidades. In Ballard, M.; Pineira-Tresmontant, C. (eds.), pages 89-106. Arras: Artois Presses Université, 2005.

12. J. Vivaldi. Corpus and exploitation tool: IULACT and bwanaNet. In Cantos Gómez, Pascual; Sánchez Pérez, Aquilino (eds.), A survey on corpus-based research, I International Conference on Corpus Linguistics (CICL-09), pages 224-239. Universidad de Murcia, 2009.

13. A. Medina and G. Sierra. Criteria for the Construction of a Corpus for a Mexican Spanish Dictionary of Sexuality. In 11th EURALEX International Congress, Vol. 2. Université de Bretagne-Sud, Lorient, France, 2004.

14. A. Amir, Y. Aumann, R. Feldman, and M. Fresko. Maximal Association Rules: A Tool for Mining Associations in Text. Journal of Intelligent Information Systems, 25(3):333-345, 2005.

15. O. Stanislas, R. Mickael, C. Nathalie, R. Kessler, F. Lefèvre, and J.-M. Torres-Moreno. Système du LIA pour la campagne DEFT'10 : datation et localisation d'articles de presse francophones. In DEFT'10, Montréal, 2010.

16. R. Kocourek. La langue française de la technique et de la science. Wiesbaden: Oscar Branstetter, 1982 (2nd ed., 1991).

