Is Japanese gendered language used on Twitter? A large-scale study

This study analyzes the usage of Japanese gendered language on Twitter. It starts from a collection of 408 million Japanese tweets posted from 2015 to 2019 and an additional sample of 2,355 Twitter accounts whose timelines were manually classified by gender and category (politicians, musicians, etc.). A large-scale textual analysis is performed on this corpus to identify and examine the sentence-final particles (SFPs) and first-person pronouns appearing in the texts. It turns out that gendered language is in fact also used on Twitter, in about 6% of the tweets, and that the prescriptive classification into "male" and "female" language does not always match expectations, with remarkable exceptions. Further, SFPs and pronouns show increasing or decreasing trends, indicating an evolution of the language used on Twitter.


Introduction
Language and gender research has described Japanese as a language with particularly marked gender-differentiated forms (Okamoto and Shibamoto, 2016). Although Japanese speaking practices and Japanese "women's language" (defined as joseigo) have been the object of decades of research (Ide 1979, 1997; Shibamoto 1985; Ide, Hori, Kawasaki, Ikuta and Haga 1986; Reynolds and Akiba, 1993), our understanding of Japanese women and men as speaking subjects is still largely influenced by a normative depiction of how they should or are expected to talk. These stereotypical features include different characteristics of speech style, such as polite, unassertive, empathetic speech, as well as more specific features such as honorifics, self-reference terminologies, sentence-final particles (SFPs), indirect speech acts, exclamatory expressions and voice pitch level (Shibamoto Smith 2003). Although these features are assumed to be used only in spoken Japanese, the present study focuses on the analysis of sentence-final particles (SFPs) and personal pronouns as used on Twitter. Unlike other authors, who use machine learning to estimate the gender of a Twitter account (see, e.g., Ciot et al., 2013; Sako and Hara, 2013), this study aims to compare the prescriptive use of SFPs and personal pronouns with their actual use on Twitter. It first focuses on the relative usage of these tokens in a dataset of about 400 million tweets collected from late 2015 to early 2019. Regardless of the gender of who actually wrote the tweets, the variation over time is observed and some clear trends emerge. In order to test whether the prescriptive assumption also holds for written Japanese on social media, a subsample of about 2,500 accounts was manually analyzed to attribute a gender and a category (e.g., politician, musician, actor) to each account by scrutinizing its complete timeline.
It emerged that the relative usage of these tokens differs between the male and female subgroups, and that the difference is statistically significant, confirming the assumption. Further, interesting switching patterns, i.e., "male" markers used by females and vice versa, are also observed for some categories. To the best of the authors' knowledge, these analyses have never been conducted before, so we think that this paper can pave the way for the construction of new text-based Twitter gender classifiers for the Japanese language, which so far remains a challenging and unsolved task (see Ciot et al., 2013). This study also analyzes an unprecedented number of tweets and uses by far the most extensive human-coded data set of Twitter accounts. The paper is organized as follows. The next section introduces the reader to the gender markers of spoken Japanese. Then, the data and the results of the historical analysis are presented. The study of the relative usage of gender markers for testing the prescriptive assumption follows, and finally a discussion closes the paper.
gender-neutral particles like "ね" ne or "よ" yo have been characterized as mostly neutral. Table 1 summarizes the SFPs as presented by different authors, who have classified them into either three (M, N, F) or five categories: strongly masculine (SM), moderately masculine (MM), neutral (N), moderately feminine (MF) and strongly feminine (SF). The classification used here is a simplified version in which no distinction is made between strongly and moderately feminine/masculine particles.
The particles listed in Table 1 are based on several previous studies on SFPs and on grammar texts (Okamoto, 1992, 1995; Shibamoto Smith, 2003; Okamoto and Shibamoto Smith, 2004; Sturtz Sreetharan 2004; Hiramoto, 2010; Kawasaki and McDougall 2003). Yet the classification is much more complex and diversified than what is presented in Table 1 (see Kawaguchi, 1987; Okamoto and Sato, 1992, for an in-depth discussion). It should be noted that Table 1 shows a stereotypical gender categorization that is ideological and prescriptive in nature. The SFPs "やん" yan, "やない" yanai and "やんけ" yanke are generally found in the Hanshinskai dialect spoken by people from the Kansai area, while all the others are part of Standard Japanese.
In addition to SFPs, self-reference features also appear to be used in a gender-differentiated way: the repertoires of first-person pronouns of men and women differ, as shown in Table 2, and display 1) a tendency for women to use more formal pronouns than men when referring to themselves, and 2) the near absence of deprecatory/informal first-person expressions for women (Ide, 1990). Examples of gender-exclusive personal pronouns, as prescriptively classified by several authors, are the first-person pronouns "あたし" atashi (feminine), "俺"/"おれ" ore and "僕"/"ぼく" boku (masculine), and the second-person pronouns "君"/"きみ" kimi and "おまえ" omae (masculine) (Ide, 1997; Ide and Yoshida, 1999; Shibamoto Smith, 2004).
Table 2. Classification according to different authors. "*" = not used in the analysis; "M" = masculine; "F" = feminine; "M & F" = both genders.
Women's language is a space of discourse that reduces Japanese women to a knowable and unified group, objectifying them through their language use (Inoue, 2006). However, this does not mean that Japanese speakers express themselves through binary linguistic forms (Nakamura, 2014b). While most studies of Japanese language and gender have tended to focus on these normative usages without examining their implications for the real language practices of real speakers, others have shown that Japanese women and men do not necessarily conform to what they consider to be linguistic gender norms (Okamoto and Shibamoto, 2004; Okamoto and Shibamoto, 2016). It will be interesting to see, as the analysis in the following sections unfolds, how these gender markers are used in practice by both genders.
While an exhaustive explanation of SFPs from a semantic and pragmatic point of view would require more space, a few examples are given in Table 3.
Table 3, example h: Iku yo ne (N), go-Pres, "I am going".
According to the above-mentioned normative categorization (Table 2), women are supposed to use a higher level of formality than men when using first-person pronouns in formal contexts (e.g., "わたくし" watakushi for women vs "わたし" watashi for men), and to avoid deprecatory words/expressions when referring to themselves or to the second person (Ide, 1990). These categorical differences in the repertoire of first-person pronouns may lead one to think that women use automatic expressions of deference and demeanor and are always polite in their speech (Ide, 1990). Since Japanese SFPs are elements that mark attitude and/or emotion and therefore occur mostly in speech, with a greater variety found in informal than in formal speech (Narahara, 2005: 151), it may be argued that the current analysis of SFPs in short online written texts (i.e., tweets) is not compatible with their original use in oral communication. However, we argue that the distinction between written and conversational discourse, defined by Reynolds (1985: 17) as "highly contrastive in respect to the potential of the addressee's immediate response", has to be reconsidered in the context of Social Network Sites (SNSs). Reynolds explains that in written discourse the writer and the reader are spatially separated, coding and decoding are not simultaneous, and the writer has no specific reader in mind. At the same time, the reader cannot be expected to participate in the speech act which the writer is performing. In conversation, on the other hand, the speaker is aware that the listener is located within the speaker's immediate space and of the potential of the listener's immediate reaction.
Most SNSs are now characterized by features that make them look closer to oral communication than written discourse. Nobody can argue today that SNSs do not allow immediate reactions by the potential listeners/readers. Immediacy, on SNSs, is one of the most appreciated features. Twitter, in our case, allows the "speaker" to talk to the universe as well as to an imaginary/ideal "hearer" or, in some cases, even a "real" one.
The aim of the present study is to focus on Japanese as it is used on Twitter, the most popular SNS in Japan, with as many as 25.6 million monthly active users in 2019. The main purpose is to investigate how gendered language is used online and how the anonymity that Twitter allows may affect the way users express themselves. This will be done by analysing the use of SFPs and first-person pronouns by a very large number of Twitter users, as described in the next section. We will use the term token to refer to either the sentence-final particles (SFPs) or the pronouns listed in Tables 1 and 2 when there is no need to distinguish between the two categories.

DATA COLLECTION AND RESEARCH METHODOLOGY
The data used in this work come from two different repositories. The first and larger repository, consisting of historical Twitter data used to monitor the usage of each token over time, is part of a collection of 408 million Japanese tweets gathered through the Twitter search API in the period 24 August 2015 to 22 February 2019. This repository was collected for a different purpose within the iGenki project, a study intended to measure expressed well-being on social networks, using only the filters language = "Japanese" and country = "Japan". The Twitter search API only provides a 10% sample of all tweets, and Twitter does not disclose any information about how representative this sample is of the whole universe of tweets. Nevertheless, according to our experience, also confirmed in Hino and Fahey (2019), the coverage of topics and keywords is quite accurate, so we can consider our repository a sufficiently large and representative sample of the Japanese language "spoken" on Twitter. From this repository we extracted the tweets containing sentences ending with one of the 29 SFPs listed in Table 1 or containing one of the 8 pronouns listed in Table 2. We ended up with 26,737,077 tweets (about 6% of the total) written by 1,469,232 unique Twitter accounts.
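The extraction step described above can be illustrated with a minimal sketch. The paper does not specify how sentence boundaries were detected or how overlapping particles were resolved, so the delimiter set, the longest-match rule and the function name `sentence_final_particles` below are all assumptions made for illustration; the particle list is only a small subset of the tokens named in the text, not the full content of Table 1.

```python
import re

# Illustrative subset of SFPs mentioned in the text (the full list is in Table 1).
SFPS = ["かしら", "わよね", "よね", "じゃん", "わ", "ぞ", "ぜ"]

# Assumed sentence delimiters; the paper does not state which ones were used.
_SENTENCE_END = re.compile(r"[。．.!?！？\n]+")


def sentence_final_particles(tweet, particles=SFPS):
    """Return, for each sentence of the tweet, the particle it ends with (if any).

    Longer particles are tried first so that "わよね" is not counted as "よね".
    """
    ordered = sorted(particles, key=len, reverse=True)
    found = []
    for sentence in _SENTENCE_END.split(tweet):
        sentence = sentence.strip()
        for particle in ordered:
            if sentence.endswith(particle):
                found.append(particle)
                break
    return found


# Hypothetical example tweet with two sentences.
print(sentence_final_particles("行くわよね。そうだぞ！"))
```

Plain suffix matching of this kind will also match word endings that merely coincide with a particle, so a morphological analyzer would probably be needed for a careful replication.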
As we are interested in mapping tokens to gender, and gender information is not available on Twitter most of the time, we extracted a sample of 4,000 accounts and asked human coders to classify them according to the following characteristics: gender (male / female / unknown = "cannot guess" / none = "no gender", e.g., news outlets) and type (individual / group / advertising / brand / media / none of the previous). The human coders had to go through the full Twitter timeline to discern these two characteristics. In some cases, profiles had been restricted or closed and thus no information could be retrieved. It should be taken into account that in Japanese the subject is generally omitted and first-person pronouns are used infrequently, so pronouns should be expected to occur less frequently than SFPs, as can be seen in Figure 2.

RESULTS ON HISTORICAL DATA
In this section we focus on the first set of 26,737,077 historical tweets, for which the gender is not known a priori. Figure 1 shows the relative usage over time of each SFP and pronoun, while Figure 2 shows the absolute frequencies. For example, Figure 1 shows that the SFPs "ぞ" zo, "ぜ" ze, "さ" sa and "な" na (row 3, columns 2, 3, 4, 5) follow a decreasing trend, as does "わ" wa (top-left), which was most frequent in 2015 (100%) and then evidently declined, to as low as 40% by the end of the observation period. The first-person pronouns "僕" boku (row 5, column 3) and "わたくし" watakushi instead display an increasing pattern. Other tokens, e.g., "わし" washi (row 2, column 7), have a more stable pattern.
Figure 2. Absolute frequency of each token. The right panel is a zoomed plot for tokens that appear fewer than 100,000 times. The lowest frequency is 540 for "わよね" wa yo ne.
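The percentages quoted for Figure 1 (100% at the peak, about 40% at the lowest point for "わ" wa) suggest that each token's series is rescaled by its own maximum. The paper does not state the normalization explicitly, so the following is only a sketch under that assumption; the function name and the counts are hypothetical.

```python
def relative_usage(counts_by_period):
    """Rescale a token's counts so that its peak period equals 100%.

    Assumes per-token normalization by the maximum count, which is an
    interpretation of Figure 1 and not a procedure stated in the paper.
    """
    if not counts_by_period:
        return {}
    peak = max(counts_by_period.values())
    if peak == 0:
        return {period: 0.0 for period in counts_by_period}
    return {period: 100.0 * count / peak
            for period, count in counts_by_period.items()}


# Hypothetical yearly counts for a single token.
print(relative_usage({"2015": 500, "2017": 350, "2019": 200}))
```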

TESTING THE PRESCRIPTIVE ASSUMPTION
We now focus on the selected sample of 2,560,596 tweets from 2,355 accounts for which human coding identified gender without ambiguity.
To measure the relative importance of each token in the texts we apply the concept of "text keyness" (see, e.g., Bondi and Scott, 2010), which is essentially a signed Pearson's chi-squared test statistic that measures the discrepancy between the relative frequencies of each token in the two groups of "males" and "females". The exact value of the test statistic is not in itself important for this analysis, so in the figures we use the terms "mostly female" (large positive value of the chi-squared statistic), "almost equal" and "mostly male" (large negative value of the chi-squared statistic) to describe relative usage. Figure 3 shows the keyness of each token by gender over time. Lines in the upper part represent tokens mostly used by females, and lines in the lower part tokens mostly used by males. For example, the first-person pronouns "わたし" watashi and "あたし" atashi are more frequent among females, which confirms what previous research has claimed (they are used, respectively, in formal/plain female speech and in plain/informal female speech), but the second shows a decreasing trend, that is, "あたし" atashi is used less and less by females. On the other hand, "俺" ore and "僕" boku are mostly used by males. The SFP "じゃん" jan is used equally by the two groups, i.e., it is not a "key" word. By contrast, although previous studies (see the references in the second paragraph of this paper) consider "かしら" kashira a strongly feminine SFP, the results shown in Figure 3 do not display a clear-cut usage: in the analysed tweet dataset, "かしら" kashira is, surprisingly, not used more by women than by men. It is interesting to notice that there seems to be some sort of convergence from below and from above towards the centre, that is, a tendency for females and males to use more neutral forms. Signs of neutralization in the Japanese language had previously been found also by Nakajima (1997) and Takasaki (1997).
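To make the keyness measure concrete, the sketch below computes a signed Pearson chi-squared statistic for one token from a 2x2 table (token vs. all other tokens, female vs. male accounts). It follows the sign convention described above (positive = relatively more frequent among females). Whether the authors applied a continuity correction is not stated, so none is used here; the function name and the counts in the example are hypothetical.

```python
import math


def signed_chi2_keyness(count_f, total_f, count_m, total_m):
    """Signed Pearson chi-squared keyness of a token (1 degree of freedom).

    count_f / count_m: occurrences of the token in female / male texts.
    total_f / total_m: total number of tokens in female / male texts.
    Returns (keyness, p_value); keyness > 0 means relatively more frequent
    among females. No continuity correction is applied (an assumption).
    """
    a, b = count_f, count_m
    c, d = total_f - count_f, total_m - count_m
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Compare relative frequencies a/total_f and b/total_m without dividing.
    sign = 1.0 if a * total_m >= b * total_f else -1.0
    return sign * chi2, p_value


# Hypothetical counts: 30 of 100 female tokens vs. 10 of 100 male tokens.
keyness, p = signed_chi2_keyness(30, 100, 10, 100)
print(keyness, p)
```

With real data the totals would be the sizes of the female and male sub-corpora, and the computation would be repeated for every token in Tables 1 and 2.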
Figure 4 shows the overall relative keyness of each token among male and female Twitter accounts. It is clear that "俺" ore is used almost only by males and "わたし" watashi mostly by females, although less exclusively: the bar for "わたし" watashi is almost half as long as the one for "俺" ore. We then analyzed the same tokens within subgroups of the accounts whose gender had been manually identified. These groups are: mangaka (cartoonists, Figure 5), musicians (including singers, Figure 6), actors/actresses (and performers in general, Figure 7), politicians (Figure 8), athletes (different types of sport, Figure 9), TV show-related people (anchors, talents, TV guests, etc., Figure 10), and YouTubers/bloggers (Figure 11). Table 4 summarizes the findings using the following labels: "F➔N" ("M➔N") = slightly more among females (males) but not statistically significant (p-value > 0.05); "F" / "M" = mostly among females (males), with a chi-squared p-value between 0.01 and 0.05; "FF" / "MM" = almost only among females (males), with a chi-squared p-value less than 0.01 (usually much less). From these results, it is evident that the prescriptive assumption about gender markers is not always satisfied overall, but only for some subgroups (last column of Table 4).
Table 4. Summary results of relative usage by gender of the different SFPs and personal pronouns according to different subgroups of accounts and "overall". "F➔N" ("M➔N") = slightly more among females (males) but not statistically significant (p-value > 0.05). "F" / "M" = mostly among females (males), with a chi-squared p-value between 0.01 and 0.05. "FF" / "MM" = almost only among females (males), with a chi-squared p-value less than 0.01 (usually much less). In the column "Assumption": green = "assumption verified", red = "reverse assumption verified" for the overall sample of 2,355 accounts.
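The labelling scheme of Table 4 can be written as a small decision rule on the sign of the keyness and on the chi-squared p-value. The thresholds are those given in the legend; the function name, the treatment of values exactly on a threshold, and the assignment of a keyness of exactly zero to the male side are assumptions of this sketch.

```python
def table4_label(keyness, p_value):
    """Map a signed keyness and its p-value to the labels used in Table 4.

    Positive keyness is read as "more among females". Boundary handling
    (p exactly 0.01 or 0.05, keyness exactly 0) is an arbitrary choice here.
    """
    side = "F" if keyness > 0 else "M"
    if p_value > 0.05:
        return side + "➔N"   # slight difference, not statistically significant
    if p_value >= 0.01:
        return side           # mostly among one gender
    return side * 2           # almost only among one gender


# Hypothetical values.
print(table4_label(12.5, 0.0004), table4_label(-4.0, 0.03), table4_label(1.2, 0.27))
```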
Table 4 shows results of the relative usage (by gender) of the different SFPs and first-person pronouns, as they have been used on Twitter by different accounts and in the "overall" column.
The labels show whether the tokens were mostly used by male (M or MM) or female (F or FF) accounts. The last column displays the gender with which each token has been associated by previous research on Japanese gendered language, while the color confirms (green) or contradicts (red) that association according to what was found in the real conversations/utterances on Twitter. In the case of "よ" yo, "よね" yone, "かな" kana and "わ" wa, the data show a generally larger use by females than by males, thus proving the expectations wrong. This means that, for example, "かな" kana, usually associated with men, was found to be used more by females in the analyzed tweets. Something similar (but opposite) happens with "さ" sa and "な" na. However, what is remarkable are those cases in which only one or two subgroups, unlike all the others in the same row, show usage by the opposite sex. Clear examples are: 1) "さ" sa, usually associated with male usage and confirmed in most columns, was found to be frequently used also by female musicians, bloggers and TV show-related persons; 2) "な" na, usually associated with mostly neutral and male usage and confirmed in most columns, turns out to be employed also by actresses and female musicians; 3) "け" ke and "ぞ" zo, normatively associated with males, appear to be substantially used also by the opposite sex, but only in the case of politicians; 4) "ぜ" ze, normatively associated with males, appears to be considerably used by female athletes; 5) "やん" yan and "かい" kai, two SFPs consistently associated with masculine speech, are found to be strongly used, respectively, by female mangaka (manga artists) and by female bloggers and YouTubers.
As for first-person pronouns, while the usage of "俺" ore (including "おれ", its hiragana version) and "僕" boku by male users is consistent with what previous research has presented, the following ones show an interesting variation in use in "online speech" on Twitter:
A. "ぼく" boku (the hiragana version of "僕" boku), originally associated with male speakers, has been found in tweets by female mangaka and actresses.
B. Conversely, "わたし" watashi, associated by previous studies with both male and female speech, is found here only among females.
Smith (1992) and Reynolds (1993) write that women in positions of power appear to experience linguistic conflict and that they solve it either by 'defeminizing' their language or by creating new strategies to cope with it. These observations are also reported by Takasaki (1996) on the different speech styles used by females in television interviews. By mixing 'women's language', 'men's language' (danseigo) and neutral forms, they enrich their speech and add more expression and colour to their account (Takasaki, 1996). Our analysis suggests that women politicians may want to use "け" ke and "ぞ" zo in order to reject the dominant gender and sexual norms, empowering their speech while negotiating their identity and power. Something similar has been found in the use of first-person pronouns by female junior high-school students, who negotiated their power through the use of the normatively masculine "僕" boku (Miyazaki, 2004). As for the unexpected use of "やん" yan and "かい" kai, and of "ぼく" boku by mangaka, the reasons may be more complex and may also be related to the use of "role language" (yakuwarigo), which the current research does not cover. Such complex, contradictory, and unexpected processes can be understood only by following the naturally occurring linguistic practices of a specific community, as has been done in this study on Twitter.
However, only a close examination of language use in real contexts would enable us to fully understand the complicated and dynamic relationship between specific linguistic choices and their social meaning. The present study is mostly descriptive, as it takes into account only the presence of SFPs and first-person pronouns. A more extensive semantic analysis will be the object of further investigation.

CONCLUDING REMARKS
This work presented the first large-scale study of Japanese gendered language on Twitter, based on both historical data (408 million tweets from 2015 to 2019) and a large sample of 2,355 manually scrutinized Twitter accounts. This study has shown that the real use of gendered language on an SNS such as Twitter does not always meet the expectations set by ideological linguistic norms, thus reinforcing what is stated by Okamoto and Shibamoto Smith (2016: 22): "while Japanese linguistic gender norms may appear to be firmly established in society, closer examination reveal highly diverse views about and uses of gendered speech, suggesting that normative meanings are contested and that the indexical fields of gendered speech forms are potentially broader and variable".
The analysis also revealed that the use of gendered language changes over time and across different subgroups of Twitter accounts, indicating either a strategic (politicians) or a further stereotyped (mangaka) usage. The use of gendered language on SNSs is a relatively unexplored area in the field of sociolinguistics, and there are few studies on the subject. The authors hope that this work will help raise interest and inspire further studies on the subject. This study also paves the way for new machine-learning classifiers of Twitter accounts.