There’s much food for thought in this post from Mike Wesch about how affordable 3D printing might influence the way we construct our identities. The video “Why I Love My 3D printer” is also pure gold.
About a month ago I introduced my new Gawk script metrify.awk, which generates a wide range of Twitter metrics for a given Twapperkeeper/yourTwapperkeeper hashtag or keyword archive. Even as I was writing those posts, though – and certainly while playing with the language metrics I discussed in my last post – I started to find a few areas where metrify could provide even more information on the dataset. So, the time has come for a first service release which upgrades metrify.awk to add some more functionality (and fix a few inconsistencies along the way). This is a revision rather than a full rewrite of the script, so let’s call it metrify 1.2; it’s now available for download here, where it replaces the older version.
As before, the new version of metrify.awk is called as follows:
gawk -F , -f metrify.awk time="[year|month|day|hour|minute]" [divisions=x,y,z,…] [skipusers=1] input.csv >metrics.csv
(divisions defaults to ‘90,99’ – i.e. a 90%/9%/1% split of the userbase – if it is not specified).
In this post, I won’t go from scratch through the entire range of metrics that metrify.awk generates; my original four-part post is still sufficient for that purpose. Rather, I’ll focus only on the major changes in this new revision, which relate mainly to part two of that series (and I’ve noted the updates in those posts as well, to avoid confusion): the metrics over time.
Changes to Metrics over Time
The first table generated by metrify shows the metrics over the chosen timeframe (e.g. day or hour), but it now contains a number of additional data points. The changes only concern the columns which contain metrics for the various user percentiles which are defined with the ‘divisions’ argument. Rather than providing information only on the number of users from each percentile which are actively participating during each timeframe (expressed as a percentage of the total number of currently active users), as metrify 1.0 did, revision 1.2 provides a number of further metrics:
Here’s a comparison of the relevant output columns between versions 1.0 and 1.2:
Each metrify.awk 1.0 column is listed first, with the metrify.awk 1.2 columns that replace it indented below:

lowest x% users (<= u tweets)
    number of current users from least active x% (< u tweets)
    % of current users from least active x% (< u tweets)
    number of tweets from least active x% (< u tweets)
    % of tweets from least active x% (< u tweets)

users > x% (> u tweets; a of n users)
    number of current users from > x% group (> u-1 tweets; a of n users)
    % of current users from > x% group (> u-1 tweets; a of n users)
    tweets from > x% group (> u-1 tweets; a of n users)
    % of tweets from > x% group (> u-1 tweets; a of n users)

users > y% (> v tweets; b of n users)
    number of current users from > y% group (> v tweets; b of n users)
    % of current users from > y% group (> v tweets; b of n users)
    tweets from > y% group (> v tweets; b of n users)
    % of tweets from > y% group (> v tweets; b of n users)
(with the default settings, x% would be 90% and y% would be 99%; a, b, u, v, and n would depend on the dataset).
So, it now becomes possible not only to track what percentage of the total number of currently active users are from each of the percentiles we have defined, but also what percentage of the total volume of tweets during each period is contributed by each of the user percentiles. By way of example, here’s a comparison of those metrics for the #egypt dataset during February 2011:
Active users in the 90/9/1 user percentiles as percentage of total active userbase
Tweets by users in the 90/9/1 user percentiles as percentage of total current tweet volume
Unsurprisingly, the two charts move together – the greater the presence of a specific user group in the total active userbase, the greater their contribution to the current tweet volume – but only the second chart also tells the story of just how dominant the most active one per cent of users really is. Towards the end, they still only constitute slightly less than 20% of the total userbase participating during the final days of February – but more than half of all tweets posted at that time originate from them.
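To make the difference between the two kinds of percentages concrete, here's a minimal Python sketch of the calculation. This is not the metrify.awk code: the function names, the (period, username) input format, and the way ties between equally active users are broken are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def assign_groups(tweets, divisions=(90, 99)):
    """Map each user to a group index (0 = least active group).

    tweets: list of (period, username) pairs, one per tweet.
    Assumption: users are ranked by total tweet count (ties broken by
    name) and cut by rank position; metrify.awk may handle ties differently.
    """
    counts = Counter(user for _period, user in tweets)
    ranked = sorted(counts, key=lambda u: (counts[u], u))  # least active first
    groups = {}
    for position, user in enumerate(ranked):
        percentile = 100.0 * position / len(ranked)
        # Group index = number of division boundaries at or below this rank.
        groups[user] = sum(percentile >= d for d in divisions)
    return groups


def shares_per_period(tweets, groups, n_groups):
    """For each period, return (% of active users, % of tweets) per group."""
    users = defaultdict(lambda: [set() for _ in range(n_groups)])
    volume = defaultdict(lambda: [0] * n_groups)
    for period, user in tweets:
        g = groups[user]
        users[period][g].add(user)
        volume[period][g] += 1
    result = {}
    for period in sorted(volume):
        active = sum(len(s) for s in users[period])
        total = sum(volume[period])
        result[period] = (
            [100.0 * len(s) / active for s in users[period]],
            [100.0 * v / total for v in volume[period]],
        )
    return result
```

The first list in each result corresponds to the first chart (share of the currently active userbase), the second to the second chart (share of the current tweet volume); a small group of very active users can score low on the former and high on the latter.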
(At a later stage, I may also add functionality to track the use of different tweet types over time, by the different percentiles – but that’s a feature for metrify 1.5 or so.)
Other Changes
The only other notable change in this new revision is that the third of the tables generated by metrify.awk, which describes the participating users themselves, has gained a further column, ‘percentile’. This contains a simple descriptor of which of the various percentiles a user has been placed in, and thereby allows for an easier filtering of the list (using Excel’s data filter functions). For the standard 90/9/1 division of the userbase, fields in the column would contain one of the following four options for each user:
Additionally, and less obviously, I’ve also rewired how users are tracked through the dataset. In principle, this should be a very simple process: each user has both a unique numerical Twitter user ID, and a unique alphanumeric username. However, for some esoteric reason the user IDs returned by the Twitter search and streaming APIs, which Twapperkeeper uses to retrieve its datasets, do not always match, especially for older archives (or perhaps for older accounts?); the same user may have two completely different user IDs (thanks to John O’Brien for the details on this). This means that using the user IDs to track user activities in the dataset is unreliable. Usernames, however, may also be changed by the user at any point – @KRuddMP could become @KRuddPM when you least expect it. (Sorry, couldn’t resist!)
Still, as this doesn’t happen all too often, and given the unreliability of the numerical user IDs, metrify does use (lowercase) usernames as its internal tracking ID. The final output itself shows usernames in their properly capitalised form as we’ve first encountered it in tweets by the users themselves (they may also have chosen to change that capitalisation at a later date, though; we’re not checking for that), wherever possible; for users who are only mentioned, but don’t themselves tweet actively, we use the capitalisation which we first encounter.
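The tracking rule described here can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Gawk implementation: the function name track_users and the (author, mentions) record format are hypothetical.

```python
def track_users(records):
    """Build a lowercase-key -> display-name map.

    records: iterable of (author, mentioned_names) per tweet, in archive
    order (a hypothetical input format for this sketch).
    """
    display = {}          # lowercase tracking ID -> capitalisation to output
    from_author = set()   # IDs whose display form came from the user's own tweet
    for author, mentions in records:
        key = author.lower()
        if key not in from_author:
            # The first form seen in the user's own tweets takes precedence,
            # even over a form seen earlier in somebody else's mention.
            display[key] = author
            from_author.add(key)
        for name in mentions:
            # Users who are only mentioned keep the first form encountered.
            display.setdefault(name.lower(), name)
    return display
```

Later changes of capitalisation by the same user are deliberately ignored, matching the behaviour described above.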
Finally, one caveat remains: as before, metrify will take quite some time to process a large dataset, and is likely to run out of memory if it’s trying to generate full user metrics for such datasets. (There doesn’t seem to be any way to allocate more memory to Gawk – or to the shell it runs in – so there’s little I can do to fix this.) Where full, detailed per-user metrics aren’t required, use the skipusers=1 command-line argument, and Gawk will only output the number of tweets contributed by each user, and the percentile they’ve been allocated to on that basis. And it will take a lot less time to do so.
So much, then, for this service update of metrify.awk. In a follow-up post in a few days, I’ll show how metrify metrics can also be imported into Gephi to turbo-charge our network visualisations of Twitter @reply and retweet networks…
Another brief announcement: along with our CCI colleague Larissa Hjorth, Axel and I are looking forward to editing a special issue of the Journal of Broadcasting & Electronic Media (JOBEM) on the theme “Emerging Methods for Digital Media Research”, due for publication in March 2013. If you work in a related area, please consider submitting an abstract by the March deadline. Details follow below.
Emerging Methods for Digital Media Research
Special Themed Issue of the Journal of Broadcasting & Electronic Media (JOBEM), March 2013.
Guest Editors:
Jean Burgess (QUT)
Axel Bruns (QUT)
Larissa Hjorth (RMIT)
ARC Centre of Excellence for Creative Industries & Innovation (http://cci.edu.au/)
Editor: Zizi Papacharissi
With the rise of ‘big data’, locative media, and smartphones, existing media and communication studies methods are being recombined, reconfigured and replaced alongside their objects of study. This special issue of JOBEM seeks to expose new research methods for understanding the changing nature of the content industries, the impact of digital media on the practices of creative workers, and the experiences and practices of everyday users of digital media technologies.
We welcome papers based in the humanities and social sciences that reflect on, discuss or critique current methodological trends in digital media research, shedding light on the following questions:
1. Where are the emerging methodological gaps – are there pressing research problems that require the development of new methods, techniques and tools?
2. Where are there needs for new combinations of methods, within or across disciplines?
3. What are the implications for future pedagogical models in internet, media and communication studies, including doctoral education and other forms of research training?
We especially welcome papers grounded in the experience of conducting empirical digital media research. However, we will give preference to papers that contextualise, historicise, and reflect on current methodological trends, rather than simply reporting on the applications or results of new methods.
Abstracts of 250 words are due by 31 March, 2012. Depending on the number of abstracts received, we may shortlist submissions at this stage. Please email your abstract and a list of 3 or 4 suggested peer reviewers to: jobem.edm@gmail.com.
Full articles of no more than 7000 words should be submitted on or before 1 August, 2012 at: http://mc.manuscriptcentral.com/hbem (select “Special Issue: Emerging Digital Methods” as a manuscript type). Manuscripts should conform to the guidelines of the Journal of Broadcasting & Electronic Media.
OK, this may be a somewhat esoteric subject for researchers who mainly work with Twitter data from specific countries and cultures, but over the past few weeks I’ve been working on a paper that analyses Twitter activities in the #egypt and #libya hashtags – and as part of that work, I’ve been interested in exploring the interactions between users tweeting in Arabic and users tweeting in other languages (mainly in English). Unfortunately, there’s no reliable means of identifying the language of specific tweets, or of the users who post them; while the Twitter API provides an ISO language code (e.g. ‘en’ for English, ‘no’ for Norwegian, etc.) for each tweet, this is drawn simply from the overall language setting of the user’s account, and not specific to each individual tweet itself. For users who alternate between languages in their tweeting, all tweets will be tagged with their chosen language code; for users who haven’t bothered to change their Twitter profile settings away from the default English, all their tweets will be tagged ‘en’, regardless of their actual language.
So far, so unhelpful. Further, short of running every tweet through some form of automatic language recognition tool (using Google Translate or a similar mechanism, for example) – which would be extremely time-consuming for Twitter archives upwards of a few thousand tweets – it is prohibitively difficult to identify the exact language of each tweet, not least also because of the 140 character limit of tweets. In theory, if we had word corpora for all major languages, we could cross-check each tweet against those corpora to see what words from what language occur most frequently – but again, that process would be extremely time-consuming, and would probably have serious difficulties with the abbreviations and contractions which Twitter users commonly employ to stay within that limit.
A much simpler approach – which does generate somewhat less conclusive results, though – works by examining the character sets used in tweets. This is able to make only relatively broad distinctions, but it’s good enough for what I’m trying to achieve with my #egypt/#libya datasets: here, a quick qualitative look at the data suggests that the major division is between Arabic tweets and tweets in English (and to some extent in other European languages) – so the main challenge is to distinguish between Latin and Arabic character sets. This we can do, even just with a basic Gawk script.
Twitter datasets as they are generated by our standard hashtag tracking solution, yourTwapperkeeper, are available in UTF-8 encoding, leaving virtually all characters and character sets intact. Each character is assigned a specific character code, and for historical reasons, the basic characters of the Latin script (unaccented letters, standard punctuation marks, etc.) retain their traditional ASCII codes, with values below 128; beyond that range, we’re moving into accented letters, more unusual punctuation marks, and non-Latin character sets. Sadly, our preferred tool for processing yourTwapperkeeper datasets, Gawk, doesn’t cope all that well with advanced UTF-8 characters – it copes fine with single-byte character codes (i.e. below 256), but not with multi-byte character codes (above 255; it reads these as multiple single-byte characters). At least on a Windows PC, there doesn’t seem to be any way to change that behaviour, either.
However, that’s still good enough for our immediate purpose of distinguishing between Latin and non-Latin (i.e. mainly English and Arabic) tweets. As it turns out, Gawk consistently sees Arabic characters as a sequence of two codes: of either 216 (Ø) or 217 (Ù), followed by another character with a code above 127. So, for a basic distinction between tweets using Latin and tweets using non-Latin scripts, we simply need to count the number of high-ASCII characters (with a code above 127) which Gawk sees in each tweet, and to set a threshold below which a tweet is still classified as ‘Latin’ (to allow tweets that use accented characters or ‘fancy’ quotation marks to be classed as Latin). Through trial and error, I’ve found that a threshold of 20 (i.e. ten Arabic or other non-Latin characters) seems to work reasonably well: few tweets in languages using the Latin alphabet will be miscounted as ‘non-Latin’, even if they contain a number of umlauts or accented characters, while tweets in Arabic, Hebrew, Greek, Chinese, Korean, and other non-Latin alphabets are reliably recognised.
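For readers more comfortable with Python than Gawk, the same byte-counting idea can be sketched as follows. This is an illustrative equivalent rather than the script itself: the function names are made up, and the default tolerance of 20 simply follows the trial-and-error value mentioned above (the Gawk script's own default is 0).

```python
def high_byte_count(text):
    """Count the bytes above 127 in the UTF-8 encoding of a tweet.

    This mirrors what a byte-oriented Gawk sees: each Arabic letter
    contributes two such bytes, an accented Latin letter also two.
    """
    return sum(byte > 127 for byte in text.encode("utf-8"))


def is_non_latin(text, tolerance=20):
    """Classify a tweet as 'non-Latin' if it exceeds the tolerance.

    tolerance=20 is the threshold suggested in the text, not a fixed rule.
    """
    return high_byte_count(text) > tolerance
```

A short tweet in German with a handful of umlauts stays well under the threshold, while three or four Arabic words are already enough to push a tweet over it.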
We could use this to mark up the language of every line in a yourTwapperkeeper archive – but that’s not necessarily very useful or interesting. Instead, the script below operates on a user-by-user basis: for each user, it counts the number of their tweets which were above the ‘non-Latin’ threshold, and also calculates a language_ratio value: the percentage of their tweets which used non-Latin characters. The script accepts an optional ‘tolerance’ parameter, to set the ‘non-Latin’ threshold: a typical way to use it would be
gawk -F , -f userlanguage.awk tolerance=20 input.csv >output.csv
(tolerance defaults to zero if it isn’t set).
# userlanguage.awk - Extract stats on the language use of each user, as metrics for network visualisation in Gephi
#
# this script takes a Twapperkeeper CSV/TSV archive of tweets, and calculates for each user a ratio
# indicating how many of their tweets were in non-Latin charactersets
#
# output is in a format ready to be imported as a node list into the Gephi Data Laboratory
# on import, note that new data columns must be imported as 'float' type
#
# the script skips the first line, expecting that it contains header information
#
# script expects an optional numerical "tolerance" parameter, to set how many high-ASCII (non-Latin) characters a tweet may contain while still counted as Latin script
# set tolerance to ~20 to treat most accented European languages as Latin (note that Gawk will count some UTF-8 characters as two or more high-ASCII characters)
# default value for tolerance is 0
#
# expected data format:
# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
#
# output format:
# nodes,id,label,user_tweets,user_highASCII_tweets,language_ratio
# (language_ratio is a value between 1 = no Latin tweets and 0 = 100% Latin tweets)
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
    getline
    if(!tolerance) tolerance = 0;    # highASCII tolerance level: default 0

    for(char = 0; char < 256; char++) {
        charnum[sprintf("%c", char)] = char
    }

    print "Nodes" FS "Id" FS "Label" FS "user_tweets" FS "user_highASCII_tweets" FS "language_ratio"
}

{
    nodename[tolower($3)] = $3
    node[tolower($3),"tweets"]++

    highASCII = 0
    for(char = 1; char<=length($1); char++) {
        if(charnum[substr($1, char, 1)] > 127) highASCII++    # count number of high ASCII (>127) characters in tweet; note: some UTF-8 characters count as multiples
    }

    if(highASCII > tolerance) node[tolower($3),"highASCII"]++
}

END {
    for(name in nodename) {
        print name FS name FS nodename[name] FS node[name,"tweets"] FS node[name,"highASCII"] FS node[name,"highASCII"] / node[name,"tweets"]
    }
}
The resulting data can be used in a number of ways. For one, we might divide the total userbase into three groups: users who mainly used Latin characters (with a language_ratio below 0.33); users who mainly used non-Latin characters (language_ratio > 0.66); and users posting in a mix of languages (language_ratio between 0.33 and 0.66). If we further combine this grouping with the distinctions between lead users, highly active users, and less active users which the metrify.awk script makes possible, we now have the ability to examine the prevalence of different languages across these different groups – for #egypt during February 2011, this is what results, for example:
An interesting result: while ‘Latin’ (in this case, mainly English-speaking) users dominate overall, they’re mainly found amongst the less engaged 90% of users – they’re making or retweeting a small number of hashtagged comments about the situation in Egypt during February. The most engaged one per cent of users contain a much larger percentage of Arabic (i.e. non-Latin) speakers, as well as a sizeable proportion of users tweeting in a mix of languages and character sets.
(Note: of course, speakers of languages such as Chinese, Korean, Japanese, Greek, Hebrew, Russian, etc. will be included in the ‘non-Latin’ group here, and speakers of many European languages other than English will be counted amongst the ‘Latin’ group. In many cases, this will be a problem, and our approach here doesn’t allow for easy distinctions between, say, English and French, or Arabic and Hebrew. For our present purposes, however, that’s a negligible problem – few ‘non-Latin’ languages other than Arabic, and few ‘Latin’ languages other than English, are present in the #egypt dataset to any significant extent.)
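The three-way grouping by language_ratio, cross-tabulated against activity groups, could be sketched like this in Python. The 0.33 and 0.66 cut-offs come from the description above; the function names and the (activity_group, language_ratio) input format are assumptions for illustration only.

```python
from collections import Counter


def language_group(ratio):
    """Bucket a user's language_ratio into one of three groups."""
    if ratio < 0.33:
        return "Latin"
    if ratio > 0.66:
        return "non-Latin"
    return "mixed"


def crosstab(users):
    """Count users per (activity group, language group) combination.

    users: iterable of (activity_group, language_ratio) pairs, where
    activity_group is a label such as a percentile descriptor
    (hypothetical input format).
    """
    return Counter((activity, language_group(ratio)) for activity, ratio in users)
```

Dividing each count by the size of its activity group then gives the kind of per-group language breakdown discussed above.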
Additionally, the output of userlanguage.awk is also designed to be easily imported into Gephi as an additional source of data on the users in the network. Assuming we’ve already created a network (for example showing @replies and retweets) for our dataset, using the Twitter usernames (normalised to lower case) as node IDs, we can now use the Data Laboratory to import the language data into the nodes table, as additional columns. Here, it’s important to make sure the numerical metrics generated by userlanguage.awk (user_tweets, user_highASCII_tweets, language_ratio) are imported as columns of the ‘Float’ type, in order to be able to use them effectively in Gephi.
(I’ll say much more about importing Twitter metrics data into Gephi in a future blog post – stay tuned.)
Once imported, these metrics are now available to be used for various purposes: as a means of sizing or colouring nodes in the network, or as criteria for filtering it. To finish off for now, here’s a simple example, which shows @replies and retweets in the #egypt hashtag during February 2011. I’ve used the language_ratio value as the guide for the colour scale here: blue indicates a language_ratio close to zero (predominantly tweeting in Latin characters); green a language_ratio close to one (predominantly tweeting in non-Latin characters); with a gradient of colours between them. Connections between users are coloured according to the language ratio of the sender. (Full graph here – PNG, 9 MB.)
There’s an obvious language divide here – English- and Arabic-speaking users are mainly tweeting amongst themselves. But there are also a good number of connections across the divide – and for these, given the graph above, the most active #egypt participants are disproportionately responsible: mixed-language users are much more likely to be found in that group than in any of the others.
And that’s it for now – more on my language analysis of #egypt and #libya when the paper gets published, and more on using Twitter metrics in Gephi in a future post!
Tania Lewis, a Chief Investigator on the ARC discovery project ‘The role of lifestyle television in transforming culture, citizenship and selfhood: China, Taiwan, Singapore and India’ recently returned from Mumbai, the home of Bollywood and a major entertainment TV hub, where she has been conducting household interviews with research associate Kiran Mulenhalli, extending on previous interviews she has done with lifestyle and reality TV producers in Delhi and Mumbai. In November and December 2011, Kiran and Tania conducted 18 audience interviews, primarily with households but also with individuals, from a broad range of economic, occupational, ethnic, religious and class backgrounds (though mainly ‘middle class’), and including a wide range of ages (family interviews often included children and also grandparents).
Alok's family
In the household interviews we often spent two or three hours in the homes of the participating families, watching and talking about television together, sharing a meal with the families and discussing their lifestyles and consumer practices. We were interested in seeing how television viewing (particularly of reality and lifestyle formats) fitted in with, reflected and influenced their broader lifestyles and their values. Most families watched a range of reality and lifestyle advice shows from Big Boss (the Indian version of Big Brother) to spiritual advice shows (e.g. yoga gurus on morning television). Some of the more popular shows currently on in India (and that we discussed with the families) include the Indian version of Who Wants to Be a Millionaire, which was widely praised by informants for its educational dimensions and for its positive portrayal of ‘the poor’ (Sushil Kumar, a recent winner on the show, for instance, was from a poor family from Bihar and has since gone on to be an ambassador for a Government programme, the Mahatma Gandhi National Rural Employment Guarantee Act, which aims to support poor rural families in the north). We also, somewhat ironically, watched MasterChef Australia on the large digital flat screen TV of one wealthy Muslim family who described it as one of their favourite shows, pointing to the growing popularity of cosmopolitan tastes around food and fashion more broadly among the Indian middle class.
Sushila's family
Such trends might suggest a globalization and homogenization of Indian lifestyles and consumption, and certainly many of the families spoke of the rapidly growing consumer culture that had engulfed India over the past decade and the growing role of western/global brands in people’s everyday lives, from the ubiquitous (and very cheap) Maggi noodles, consumed it seems by everyone, to Nivea face whitening products, used widely by both men and women. While these kinds of consumption practices can, on the surface, be seen as ‘western’, they are also Indianised in all kinds of ways, and our informants often discussed their lifestyles and consumption as being influenced by a mix of Indian and ‘western’ traditions. And indeed the settings in which we conducted the interviews often reflected this, with television sets placed next to religious shrines and furnishings displaying a hybrid mix of global and local styles. And while reality TV is very popular in Mumbai, one of the most popular shows, which many families mentioned (and which we watched with a few families in situ), was a down-home variety-style/advice show starring a Gujarati actor/TV personality, probably little known outside of north-west India, reflecting the ongoing potency of highly localized televisual and lifestyle cultures even in cosmopolitan Mumbai.
Twelve months ago Brisbane, and the South East Queensland region, were just about to begin the long process of recovery from the major floods which affected Toowoomba, the Lockyer Valley, Ipswich, and Brisbane itself. One of the more positive stories to emerge from the crisis, though, was how social media were used as a tool for sharing news and information about the disaster, and for assisting locals with organising the (significantly volunteer-driven) relief and recovery effort.
To document these uses – especially of Twitter, though Facebook was also important – we’ve now released a major research report through the ARC Centre of Excellence for Creative Industries and Innovation, as an outcome from our overall efforts in researching the uses of Twitter and developing tools and methods for such research, which we’re sharing over on the Mapping Online Publics site. The report is available here.
We’re also about to embark on a new three-year ARC Linkage project in partnership with the Queensland Department of Community Safety and Eidos Institute, to further develop social media emergency communication strategies and tools for sourcing valuable situational information from social media streams. More on this as it develops over the next few months and years…
In my new role as Deputy Director of the ARC Centre of Excellence for Creative Industries & Innovation (CCI for short), I’m excited to be leading the team that’s organising our most ambitious PhD and Early Career Researcher activity to date – the CCI Winter School, to be held in balmy Brisbane in late June this year. It’s a selective but free event (you or your institution only need to cover your travel), involving a fairly small group of promising PhD students and early career researchers from around the world. If you’re in the northern hemisphere and looking for a 2012 summer research school, why not consider being adventurous and coming down under instead? Axel and I will both be on hand as mentors, along with a bunch of other fabulous people.
Applications close on 31 January – don’t miss out!
CCI’s 2012 Winter School (coinciding with summer in the northern hemisphere) offers selected doctoral students and early career researchers a week-long program of interdisciplinary study, collaboration and social interaction in the broad area of creative industries and innovation research, drawing on the Centre’s expertise in media, cultural and communication studies, economics, education, policy and law, in relation to the creative economy.
We welcome applications from emerging scholars working on related topics including, but not limited to:
Participants will work with leading researchers, engage in intensive workshop activities and receive direct feedback and individual mentoring on their own work. Social activities will provide additional opportunities for participants to get to know each other and form collaborative relationships that will last for years to come.
For all the info, lists of mentors, an indicative program and the online application form, visit the CCI Winter School website.
In my new role as Deputy Director of the ARC Centre of Excellence for Creative Industries & Innovation (CCI for short), I’m excited to be leading the team that’s organising our most ambitious PhD and Early Career Researcher activity to date – the CCI Winter School, to be held in balmy Brisbane in late June this year. It’s a selective but free event (you or your institution only need to cover your travel), involving a fairly small group of promising PhD students and early career researchers from around the world. Applications close on 7 February (extended from 31 January) – don’t miss out!
This is a guest post by Joy Danjing Zhang
Zhang Yimou is China’s most internationally renowned film director; he is best known for the epic wuxia films Hero and House of Flying Daggers. Zhang’s cinematic output includes the acclaimed art house productions Raise the Red Lantern, Red Sorghum, Judou, and To Live as well as social documentary genres like The Story of Qiu Ju and Not One Less. Zhang Yimou is also credited with mounting operatic performances in Beijing’s Forbidden City and more recently choreographing the opening and closing ceremonies of the 2008 Beijing Olympics.
Another less acknowledged string to Zhang’s bow is his directorial work in staging outdoor theatrical spectacles. This was the topic of my Master’s thesis, Impressions of China: Zhang Yimou’s outdoor theme productions, completed in 2011 (supervised by Professor Michael Keane at QUT).
In recent years outdoor theatrical spectacles have become modular formats in which local cultural resources are mixed. Zhang Yimou is among a number of successful film directors sought after by local governments who wish to ‘add creativity’ to existing traditional cultural resources. Others to follow Zhang’s lead include Chen Kaige and Feng Xiaogang. Indeed, such cultural displays highlight the new branding of Chinese national identity, reflected in the slogan ‘soft power’ (ruan shili).
Impression West Lake is an ambitious production under Zhang Yimou’s branding. An outdoor performance, it combines traditional music, dance, pop culture and visual displays. Some would say that it symbolizes a renewal of Chinese culture in regional tourism. It is one of the Impression Series, a series of similar but different cultural formats. Impression West Lake is performed at West Lake in Hangzhou, China.
Entertainment performances in tourism locations are not new in China. What is newer is the involvement of film directors and celebrities. The first production, Impression, Liu Sanjie, was first performed in 2003 in an outdoor scenic setting on the Li River, against a background of mountains, in Yangshuo County of Guilin City in South China. In 2006, Zhang Yimou mounted Impression Lijiang, set at the foot of Jade Dragon Snow Mountain in Lijiang, Yunnan province. It is still in production, performing twice a day on most days of the year. The first performance created the format: a blend of local culture, folk stories and spectacular scenery. The transfer of the series to different tourism destinations within China illustrates the formatting of the Impression model into local tourism development strategies.
Impression West Lake is a unique metropolitan outdoor performance on a natural stage setting located in one of the most famous tourism destinations in China. Impression West Lake was the third instalment. Emphasising Han folk stories, innovative and technical stage effects and the beauty of the Hangzhou urban landscape, the atmosphere is further enhanced by special effects and A-list celebrities. The musical score, available as a CD, was created by the Japanese new age artist Kitaro while the theme song was performed by Jane Zhang (Zhang Liangying), a finalist in the massively popular TV talent show, Super Girls.
The narrative is itself fairly minimalist, echoing Zhang Yimou’s film style. It portrays the legend of the White Lady Snake and Xu Xian, a love story well known in Chinese history. This story is associated with Hangzhou local culture. The tragic Chinese story of the so-called Butterfly Lovers is recounted in five episodes: Making Acquaintance, Falling in Love, Parting, Memory and Impression.
The term ‘impression’ aptly captures the characteristics of these productions. The Oxford English Dictionary (2007) provides four definitions of ‘impression’. Firstly, it is ‘the image or feeling a person or thing gives to someone’s mind, as regards its strength or quality’ (OED p. 1341). The live theme performance is an ‘experience’ that draws emotion from audiences. Secondly, impression means ‘a mark left by pressure.’ This corresponds to an experience which leaves a memory such that a person might seek to buy a souvenir or memento. Thirdly, impression is interpreted as a model or a mould that can be replicated. Moreover, Zhang Yimou is an ‘impresario’, a person who arranges performances in theatres. The Impression Series reflect Zhang’s cinematic legacy. Zhang Yimou and his production teams have built a reputation for quality outdoor spectacles. And that reputation has continued; it has attracted international collaboration in content, financial investments from the private sectors and favoured policies from regional governments.
In my Master’s thesis I argued that personal celebrity endorsement is replacing political propaganda heroes in promoting an alternative image of China. Zhang Yimou and Impression West Lake function as a dual branding mechanism that combines ‘people marketing’ and ‘place marketing’ for the development of a ‘created in China’ cultural commodity as well as for the generation of positive economic outcomes.
My study identified how natural resources linked with a local tourism industry are articulated into cultural products and how this is differentially experienced by local and international visitors. ‘Cultural experience’ strategies such as Impression combine global marketing and local cultural forces. In the case of the Impression series we see how a creative entrepreneur like Zhang Yimou offers a better model to promote ‘soft power’ than governmental propaganda strategies.
Impression West Lake encapsulates the rise of the model creative entrepreneur, assisted by local government authorities. Even though government policy-makers facilitate cultural infrastructure and provide incentives, they ultimately rely on the entrepreneur’s creative vision and understanding of the market.
Of course there are downsides. My thesis considered critical issues; not only questions of Chinese cultural identity and issues of authenticity, but also intellectual property and copyright, the working conditions of cultural workers, fundamental tensions between high art and commercial exploitation, the tension between ‘public’ culture and ‘private’ profit, and ultimately the sustainability of cultural and creative products in the context of formatting.
The study provides insights into the future direction of China’s cultural exports – represented by the slogan ‘Created in China’. Impression West Lake has achieved global reach. Furthermore there are new additions to the series: Impression Hainan, and Impression Da Hong Pao. By exploiting the format business model, the idea has become central to local tourism development plans with support from regional government and private investors. Feng Xiaogang is set to launch a new extravaganza Dreamy Beibuwan (Menghuan Beibuwan) in October 2011. Feng’s outdoor mega show embraces rich history in a stunningly picturesque backdrop. Menghuan Beibuwan will be set among the outdoor landscape of Beibuwan, recreating the ancient Maritime Silk Road and the voyage. Menghuan Beibuwan is a collaborative effort between Feng and the Fangchenggang local government. Another celebrated director, Chen Kaige, is preparing to open Xi Yi in Dali next year to commemorate the voyages of navigator Zheng He to the Western Ocean (now called the Indian Ocean).
Zhang Yimou’s creative production model illustrates how traditional Chinese cultural values are being constantly re-converted, reformatted and adapted. The question that remains to consider is: Is such cultural formatting the best model for success? What is an appropriate balance between replication and adaption? Is there a point where creativity disappears in the rush for market success or is creativity in essence just adaptation?
Bio:
Joy Danjing Zhang is from Shanghai and will be starting her PhD at QUT in 2012.
It’s difficult to believe that one year ago, significant parts of Brisbane were inundated by floodwaters; thankfully, there has been no repeat of the flood crisis this year. One of the few good news stories to emerge from the disaster was the – overall, very successful – way in which social media such as Twitter and Facebook were used during the event, both by key emergency authorities and by everyday users, from directly affected local residents to onlookers further afield.
Particular kudos in this must go to the Queensland Police Service Media Unit, which – not quite from a standing start, but certainly without much time to prepare a comprehensive strategy for its social media crisis communication approaches – delivered timely, informative, and level-headed updates on the flood crisis as it unfolded. Its Facebook followers grew, literally overnight, by a factor of ten, and @QPSMedia also became the single most visible account participating in the #qldfloods Twitter hashtag.
We’ve presented some analyses of the use of Twitter during the crisis in various contexts during 2011 – including the Eidos Institute symposium at the Queensland State Library in April, and various conference presentations later in the year. In time for the first anniversary of the floods, we are now releasing a major report on #qldfloods and @QPSMedia through the ARC Centre of Excellence for Creative Industries and Innovation, where we are based.
Co-authored by Axel Bruns, Jean Burgess, Kate Crawford, and Frances Shaw, the report takes a comprehensive look at overall patterns of Twitter activity in #qldfloods, as well as analysing in much greater detail the contents both of the #qldfloods update stream itself and of the conversation specifically surrounding @QPSMedia. (We are especially indebted for this to our colleague Frances Shaw, who carried out the tedious task of coding those tweets.)
The report is available for download here. More information is also available from the CCI Website, which has the full press release, too.
We’re hoping that this report will make a useful contribution to the further development of social media crisis communication strategies in emergency services and media organisations. It’s also a useful starting-point for our ARC Linkage project in partnership with the Eidos Institute and the Queensland Department of Community Safety (DCS), which will further investigate the use of social media in crisis communication and work with the DCS to develop its social media activities.
(Report cover image by Angus Veitch on Flickr. Used under a Creative Commons BY-NC licence.)
The ARC Centre of Excellence is about to mount its first Winter School. For all of you people in the northern hemisphere this is winter down under in Brisbane. It’s not cold and it doesn’t rain. It’s a fantastic opportunity to engage in some of the best one-on-one with leading mentors in creative industries and new media. The Winter School offers selected doctoral students and early career researchers a week-long program of interdisciplinary study, collaboration and social interaction in the broad area of creative industries and innovation research, drawing on the Centre’s expertise in media, cultural and communication studies, economics, education, policy and law, in relation to the creative economy.
Visit www.cciwinterschool.org to apply! (Deadline: 31 Jan 2012)
We’ve got a few busy years ahead of us, it seems. In addition to the ARC Linkage project on social media and crisis communication which was awarded to us (the QUT Mapping Online Publics team along with our CCI colleague Kate Crawford, the Queensland Department of Community Safety, and the Eidos Institute), which we’ll carry out during 2012-14, we’ve also had word in December that another project application has been successful.
Titled “The Impact of Social Media on Agenda-Setting in Election Campaigns: Cross-Media and Cross-National Comparisons”, that project will study the use of social media in a series of election campaigns which are coming up over the next few years (2012-15) – including the Queensland state election and the US presidential election this year (and I’m tempted to throw in the French presidential election as well, just for fun), and elections in Sweden, Norway, and Australia which are coming up in 2013 and 2014.
The project is led by Gunn Enli at the University of Oslo, and also involves Eli Skogerbø at Oslo, Hallvard Moe from the University of Bergen (currently visiting the CCI), Christian Christensen at the University of Uppsala, and Kevin Wallsten at California State University. It’s funded by the Norwegian Research Council, who have awarded us the impressive sum of 9.9m NOK (a still impressive 1.5m in Australian Dollars). Here’s the project overview:
The Impact of Social Media on Agenda-Setting in Election Campaigns: Cross-Media and Cross-National Comparisons
The project has as its primary objective to establish new and unique knowledge on the interaction and inter-media agenda-setting between social media and mainstream media in different cultural and political settings. The findings of the project will provide empirical insights into the development of hybrid public spheres, and contribute to refining and revising theories on political communication in cross-national environments.
The project will establish a high quality international research network, involving some of the leading scholars on social media, internationally as well as in Norway, Sweden, USA and Australia. The publications from the project will contribute to the ongoing international scholarly debate on the role of social media in public communication across the world.
Social media not only serve as arenas for debate and discussion, they are also increasingly integrated in inter-media agenda-setting, as they serve as input to the mainstream media. Political actors as well as citizens use them in order to draw attention to issues and manage their public images. The increasing cross-mediality between the social media and the mainstream media can be described in terms of creating "hybrid public spheres" in which the social and mainstream media overlap and interact. The project takes a cross-media and cross-national approach, by researching political communication in election campaigns in Australia, Norway, Sweden and USA.
The project has one overall and three sub-RQs:
Those of you who have followed our adventures in Twitter research for some time now will know that we’ve relied to a significant extent on Joe John O’Brien III’s excellent Twapperkeeper as a tool for capturing tweets. Twapperkeeper (as a stand-alone, free Web-based service) no longer exists in its original form, however – though some of its functionality for creating Twitter archives appears to have been subsumed into the for-pay services available as premium offerings from Hootsuite – and so we’ve been getting the occasional inquiry about what to do now.
Some months ago, I published a quick post to outline how we’ve transitioned from Twapperkeeper(.com) to the open-source solution yourTwapperkeeper, which offers comparable functionality as a Web package which users are able to install on their own servers, and the start of a new year seems like a good point to reiterate this, as well as to add a few further pointers. So:
Hope this helps. Happy Twapperkeeping!
Update: revision 1.2 of metrify.awk is now available (still at the link below), and introduces some further functionality, which is outlined here.
This is the final instalment of my four-part introduction to the metrify.awk script for generating detailed metrics for specific Twapperkeeper/yourTwapperkeeper hashtag archives. Over the last couple of posts, we’ve mainly dealt with overall stats for the hashtag, as well as for specific, definable percentiles of more or less active users. Finally, now, it’s time to look more closely at patterns within the overall userbase.
User Metrics
For this, we’re using the final (and by far the largest) data table which metrify.awk generates. To produce a full table, by the way, the skipusers=1 command-line argument must not be specified this time around – otherwise the only per-user data which metrify.awk will output is each user’s number of tweets. With skipusers off, on the other hand, we get a great deal more – but a word of warning: for large datasets, processing times can also increase quite considerably. For each user, metrify.awk tracks which user percentile they’ve been assigned to, how many tweets they’ve sent and received (in the form of public @replies or retweets – note that this does not include any non-hashtagged tweets, which would not be included in the original dataset, of course), as well as how these sent and received tweets break down into our by now familiar categories:
as well as
(with these metrics again provided both as a total number, and as a percentage of all tweets sent or @replies received, respectively). Again, with the exception of URLs, these will add up to the total:
as well as
But wait, there’s more – we can also calculate the ratio between these incoming @replies and the tweets sent by the user, to get a sense of the resonance of their activities:
Some Results
So, let’s see what these data tell us. In the first place, let’s look more closely at that small group of highly active users: here’s a graph for the top 150 most active participants (i.e. slightly more than the top 1%):
We see immediately that even amongst this top group, there’s a very pronounced long tail distribution: just two #auspol users (you know who you are) contributed more than 10,000 tweets each, and a total of eight contributed more than 5,000 tweets each. Beyond those hyper-active few, we’re quickly dropping down towards the just over 500 tweets achieved by each of the users at the end of that top 150 (and further as we move into the second and third percentile groups). Additionally, the graph above also shows a breakdown of those tweets into original tweets, genuine @replies, and retweets – and remarkably, the lead user here achieved their position mainly by sending copious amounts of @replies…
The total activity distribution across all 14,133 active #auspol participants, by the way, looks like this:
An extreme activity distribution if ever I’ve seen one!
But of course, tweeting a lot is only one side of the coin on Twitter: if nobody is reading (and responding), the user’s influence may still not be particularly great. So, instead of tweets sent, we can also examine the @replies received (showing the top 150 users again):
This gives us a much better idea of who’s central to the conversation, I think: these are the users receiving the largest amount of @replies and retweets (and in this case, mostly genuine @replies, which is remarkable in its own right). It should be noted that – since we’re looking at @replies received here – this list may also include users who are only mentioned, but never actively participated in the hashtag; in the case of #auspol, this includes accounts like @JuliaGillard and @TonyAbbottMHR, for example, both of whom are present in the top 50 @reply recipients.
For those users, of course, it’s impossible to calculate the ratio of @replies received to tweets sent (since they didn’t send any) – but for the rest, that ratio may also be valuable, as an indication of what we might call resonance. A user receiving a great number of @replies (whether genuine @replies or retweets) for a comparatively small number of tweets could be said to have substantial resonance; a user tweeting a great deal, but receiving few @replies in return for their efforts, has relatively little resonance.
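To make that ratio concrete, here's a minimal Python sketch of the calculation. It illustrates the arithmetic only and is not taken from metrify.awk; the function name and the sample figures are invented for the example.

```python
from typing import Optional


def resonance(replies_received: int, tweets_sent: int) -> Optional[float]:
    """@replies received per tweet sent, or None for accounts that never tweeted."""
    if tweets_sent == 0:
        # Mentioned-only accounts have no tweets to divide by.
        return None
    return replies_received / tweets_sent


# Hypothetical users, for illustration only (not figures from the #auspol data).
users = {
    "busy_user": {"sent": 2000, "received": 1000},
    "resonant_user": {"sent": 100, "received": 250},
    "mentioned_only": {"sent": 0, "received": 400},
}

for name, stats in users.items():
    print(name, resonance(stats["received"], stats["sent"]))
```

Returning None (rather than zero or infinity) for accounts which are only ever mentioned keeps them from being mistaken for users with very low or very high resonance.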
There are plenty of different ways to examine such resonance, using the different metrics which metrify.awk provides us with; as one example, here I’ve plotted the ratio of genuine @replies (i.e. non-retweets) received per sent tweet against the total number of tweets for the fifty most active users:
For the very lead users, then, their resonance rating isn’t actually all that great: the top user receives a genuine @reply roughly only for every second tweet they’ve sent, and for the next most active users this gradually increases to a 1:1 ratio. A handful of others, on the other hand, break through the parity barrier, receiving (on average) more than one genuine @reply for each tweet they’ve sent. Remarkably, though, one user in the top fifty even received an average of more than two genuine @replies for each of the over 2000 tweets they contributed to #auspol!
(Again, I should stress here that we’re only counting those @replies which are contained in our dataset – which in this example means @replies which were themselves tagged with the #auspol hashtag. In the absence of comprehensive data on non-hashtagged Twitter traffic we have no way of knowing how much non-hashtagged follow-on communication may also have occurred – our measures of tweet resonance, therefore, only measure resonance within the hashtagged conversation.)
Phew – well, with these posts at least we’ve started to scratch the surface of the Twitter metrics which metrify.awk can generate for a given dataset. Exactly how any of these metrics may be used in any specific case depends on the research questions to be examined, of course. Go experiment – and let me know if there are other metrics which we could add to the script as well!
Update: revision 1.2 of metrify.awk is now available (still at the link below), and introduces some further functionality, which is outlined here.
Over the past couple of posts, I’ve introduced our new metrify.awk Twitter metrics script, and looked at the first of the three metrics tables produced by the script. Let’s move on now to the second table, where I’ll use a snapshot of Australian political discussion on Twitter under the #auspol hashtag between February and August 2011, instead of #qldfloods – the overall metrics for the different user percentiles in the #qldfloods dataset turn out not to be particularly interesting… As before, we’re dividing the total userbase according to the 1/9/90 rule into the 1% of most active users, the next 9% of moderately active users, and the final 90% of least active users. (In the case of #auspol, that first percentile contains 142, the second percentile contains 1291, and the final percentile contains 12700 of a total of 14133 users.)
Percentile Metrics
The second table generated by metrify.awk provides us with detailed metrics on these three percentiles, on an overall basis rather than per specific time period.
This table contains the following columns:
Again, too, these figures will add up to the total:
and
(with tweets containing URLs again constituting a separate category, since any type of tweet may also contain URLs).
Applying this to our #auspol dataset, here’s what the activities of the different user percentiles look like:
We’re clearly seeing some very significant differences between the various percentile groups here. Interaction amongst the top 1% of most active users is especially discursive, with more than 55% of all of their tweets constituting genuine @replies: these people are very actively talking to (or at) one another.
The next lower group of active users, by contrast, doesn’t engage as much: only one third of their tweets are genuine @replies, but nearly 39% are original tweets. They’re more active at posting their own views and comments, rather than responding to others – or at least (and this is important to keep in mind with any such metrics), they’re less in the habit of also marking their @replies with the #auspol hashtag. By contrast, the top group are much more overtly performing their conversations, making them visible to all followers of #auspol; the second group may well send their own @replies, but if those @replies don’t contain the hashtag #auspol, they’re less visible to others and not included in our hashtag dataset.
Finally, too, the least active 90% of users are participating differently again: some 52% of their tweets are retweets, so (given that they’re not posting to #auspol that often in the first place) they’re probably more likely to be present here simply as ‘drive-by’ retweeters who occasionally pass along an interesting #auspol-tagged message that shows up in their Twitter feeds, but don’t deliberately follow the continuing #auspol conversation itself.
There are two more useful statistics to examine for #auspol, and I’ve combined them in the graph above: first, the percentage of the total volume of #auspol tweets that each group is responsible for (shown here in blue): the one percent of most active users – a total of 142 Twitter users, for the period we’re looking at – accounts for a staggering 62% of all #auspol tweets. In other words, Australian political discussion on Twitter, under the #auspol banner, is dominated by a vanishingly small group of users whose output is massively disproportional to the size of the group. Compare this with the least active 90%: those more than 12,000 users contribute less than 9% of all #auspol posts. Quite a difference – #auspol shows a very strong long-tail distribution amongst its active participants, then. (This is very different for many of the crisis-related hashtags we’ve looked at, by the way: the top 1% of most active users in #qldfloods, for example, are responsible for less than 17% of all tweets; the least active 90% of #qldfloods users for nearly 57%.)
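As a rough illustration of how such shares are worked out, here's a small Python sketch. The user counts per group are the ones given above for #auspol; the tweet counts per group are invented (chosen only so that the shares resemble the percentages reported here), and the function name is my own.

```python
def shares(counts_by_group):
    """Convert absolute counts per group into percentages of the overall total."""
    total = sum(counts_by_group.values())
    return {group: 100.0 * n / total for group, n in counts_by_group.items()}


# Group sizes as reported for #auspol in this post.
users_by_group = {"top 1%": 142, "next 9%": 1291, "bottom 90%": 12700}

# Hypothetical tweet volumes per group, for illustration only.
tweets_by_group = {"top 1%": 6200, "next 9%": 2950, "bottom 90%": 850}

user_share = shares(users_by_group)
tweet_share = shares(tweets_by_group)

for group in users_by_group:
    print(f"{group}: {user_share[group]:.1f}% of users, "
          f"{tweet_share[group]:.1f}% of tweets")
```

Putting the two percentages side by side for each group is what makes the long-tail pattern visible: a group's share of tweets can be many times larger (or smaller) than its share of users.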
Second, the distribution of tweets containing URLs is also interesting here. We already know that the lowest 90% are more likely to retweet than post their own commentary or @replies – and it looks like many of those retweets are of posts containing URLs: some 37% of all tweets by the bottom 90% include links. By contrast, the discursive few at the top of the activity scale include URLs in only 18% of their tweets.
Percentile Metrics, Compared
But beyond these metrics for the various user percentiles in individual hashtags, we can also compare these findings across different hashtag datasets – and that’s where things get really interesting. There are very many possible comparisons here: how do the individual percentiles of users compare across the different hashtags (something I’ve already hinted at above, comparing the relative contribution of the top 1% in #auspol and #qldfloods, for example), which hashtags contain more @replies, retweets, URLs, etc.?
We’ve only scratched the surface on these broader comparisons, but one very interesting pattern which has already emerged is shown in the graph below (which remains preliminary; one of my plans for the next month or so is to develop this further):
Here, we’re comparing the total metrics (for all users, rather than for specific percentiles) across a range of different hashtags: #qldfloods, #eqnz, the Japanese #tsunami, #libya, the #londonriots, #ukriots, and #riotcleanup, the #royalwedding, election nights in Australia and Ireland (#ausvotes and #ge11), the Tour de France (#tdf), #eurovision, and #wikileaks. The size of each point on the graph shows the total size of the userbase for each hashtag – so, the #royalwedding and the #tsunami attracted a vastly larger Twitter userbase (of around half a million unique users each) than the Irish election or Queensland floods, for example.
But what the graph shows is that independent of the size of the userbase, there are some very obvious patterns here. All of the crisis events are characterised by a large number of both (unedited) retweets and tweets sharing links; people are actively finding and disseminating information. All of the widely televised events, on the other hand, have very few URLs, and only marginally more retweets: Twitter may be used as a backchannel for the television, in a shared experience of audiencing, but there’s not much additional information sharing going on here. #wikileaks, in turn, is a different story altogether – but perhaps we’ll come across more hashtags with similar metrics, and it’s the first sign of a third major category.
I’m reluctant to read too much more into these patterns as yet – first, I’ll need to do some more work cleaning up the datasets which the graph above is based on (working out which exact periods of time to use for each hashtag, and trying comparisons of a few more different combinations of metrics). I do think there’s a first sign in this of much more fundamental patterns in how Twitter hashtags are used for specific purposes. But that’s a longer discussion for another time.
And we haven’t yet exhausted all the possibilities which metrify.awk itself offers. In addition to the time- and/or percentile-based metrics which we’ve discussed over these last couple of posts, it also calculates metrics for each individual user in the dataset. And that’s what we’ll look at in the final instalment in this series.
Update: I’ve clarified/corrected some of the details relating to the percentile metrics contained in the first table which metrify.awk generates.
Update 2: revision 1.2 of metrify.awk adds further functionality in addition to what is described below. These changes are detailed here.
In the previous post, I’ve introduced metrify.awk, our new multi-purpose tool for generating Twitter metrics. Over the next instalments in this series of posts, I’ll take you through the results it produces. And seeing as we’re coming up to the anniversary of the January 2011 south-east Queensland floods, and as I needed to generate those metrics anyway, for a report on social media in the floods which we’re publishing soon, I’ll be using an archive of #qldfloods tweets between 10 and 17 January 2011 as an example here.
I’m running metrify.awk as follows for this:
gawk -F , -f metrify.awk divisions=90,99 time=day qldfloods.csv >qldfloods-metrics.csv
In other words, we’re using a 1/9/90 division of users, and we’re tracking activities per day; the skipusers switch is not set, so full stats for all users will be generated.
Metrics over Time
The output file from this, qldfloods-metrics.csv, contains three separate data tables in the same spreadsheet, which I’m now loading into Excel. The first of these contains the following information:
Some more side notes are required here: first, as you already know, Twapperkeeper / yourTwapperkeeper does not capture ‘button’ retweets – so all we can examine in the retweet department are ‘manual’ retweets. We count tweets as retweets if they follow any of the four formats listed above (RT = retweet, “@user = quoted tweet, MT = manual retweet, via @user); between them, these formats capture the overwhelming majority of retweets, but some very unusual retweeting formats will slip through the cracks. We also distinguish between edited and unedited retweets simply by checking whether the tweet in question starts with these retweet indicators, or not; that’s the only reliable way of checking without entering vastly more complicated territory. Again, this will miss retweets where the retweeting user added comments at the end of the retweet; these will be (incorrectly) counted as unedited retweets.
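The position-based rule described here can be sketched in a few lines of Python. This is an approximation for illustration, not the code used in metrify.awk: the regular expression is my own guess at the four markers (RT, MT, via, and a quoted @user), and the function name is hypothetical.

```python
import re

# Approximate patterns for the four manual-retweet markers named above.
# These are assumptions for illustration; metrify.awk's own expressions may differ.
RETWEET_MARKER = re.compile(r'\b(?:RT|MT|via)\s*@\w+|["“]@\w+', re.IGNORECASE)


def classify(text: str) -> str:
    """Label a tweet as an unedited retweet, an edited retweet, or not a retweet."""
    match = RETWEET_MARKER.search(text.lstrip())
    if match is None:
        return "not a retweet"
    # Marker at the very start: treated as unedited. Marker later on: the user
    # wrote something in front of it, so the retweet counts as edited.
    return "unedited retweet" if match.start() == 0 else "edited retweet"


examples = [
    "RT @abc: Roads closed in Ipswich #qldfloods",
    "Stay safe! RT @abc: Roads closed in Ipswich #qldfloods",
    "RT @abc: Roads closed in Ipswich #qldfloods < stay safe!",
    "@abc thanks for the update #qldfloods",
]
for tweet in examples:
    print(classify(tweet), "|", tweet)
```

The third example shows the limitation mentioned above: a comment added after the retweeted text goes unnoticed, so the tweet is still labelled as an unedited retweet.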
These different tweet types will always add up to the total:
and
and
(The odd ones left out from this are the stats on URLs, since URLs may be contained in original tweets as much as in @replies or retweets.)
Second, you see there the stats for our three (in my case) user percentiles make their first appearance. In my example, the following three column headings appear in the table:
This already provides us with some information about how the percentiles ended up being defined in this case (more detailed information appears in the second table generated by metrify.awk – more on that later). First, the activity cutoffs: the least active 90% of users were defined as users who contributed 4 tweets or less to the total dataset; the middle group contributed more than four and up to 18 tweets; the most active 1% of users contributed more than 18 tweets over the entire duration covered by the dataset.
Additionally, we also see the numbers of users included in each group: 177 users posted more than 18 tweets; another 1670 users posted more than 4 and up to 18 tweets, and the rest (15581 – 1670 – 177 = 13734) posted 4 tweets or less. This also exemplifies the slight size creep which I’ve mentioned before: the 177 users in the top group are actually 1.14% of the total group (rather than 1%), the 1670 in the next lot are 10.72% (rather than 9%). If the creep gets too big for your liking, you could adjust the division cutoffs slightly (I could have used divisions=91,99 as a parameter to try to make the middle group smaller, for example).
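To see where this size creep comes from, here's a small Python sketch of one plausible cutoff rule. It's an assumption on my part that is consistent with the numbers above, not a description of metrify.awk's actual code: users who share the tweet count found at a percentile boundary are all moved into the more active group. The function name and the toy dataset are invented.

```python
def assign_groups(tweet_counts, divisions=(90, 99)):
    """Return (thresholds, group_sizes) for a list of per-user tweet counts.

    thresholds[i] is the tweet count of the first user above the i-th
    percentile mark; a user joins a more active group if their count is at
    least that threshold, so ties at a boundary always move upwards.
    """
    ordered = sorted(tweet_counts)
    n = len(ordered)
    thresholds = [ordered[min(n - 1, n * d // 100)] for d in divisions]
    sizes = [0] * (len(divisions) + 1)
    for count in tweet_counts:
        group = sum(count >= t for t in thresholds)  # 0 = least active group
        sizes[group] += 1
    return thresholds, sizes


# Toy dataset of 100 users (invented): many users tied on low tweet counts.
counts = [1] * 88 + [2] * 7 + [3] * 3 + [10] * 2
thresholds, sizes = assign_groups(counts)
print(thresholds, sizes)

# The group sizes reported above for #qldfloods, as percentages of 15581 users.
print(round(100 * 177 / 15581, 2), round(100 * 1670 / 15581, 2))
```

In the toy dataset the split comes out as 88/10/2 rather than 90/9/1, because all users tied on two tweets land in the middle group together; the same effect, on a larger scale, produces groups slightly bigger than 1% and 9%.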
At any rate, what the data in these columns track is the percentage of the total number of unique users during each period which belong to each of the percentile groups – in other words, the extent to which any of these groups dominate the hashtag feed at any one point. Note that which users get to be in which percentile is determined once, for the entire dataset, rather than on a per-time period basis: what these columns indicate, therefore, is how present the overall lead (and other) user groups are in each time period, rather than how much a changing current group of most active users have contributed in each time period.
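A minimal Python sketch of that per-period calculation might look as follows. It assumes that each user has already been assigned to a group once, for the whole dataset; the function name, group labels and sample data are hypothetical, and this is not metrify.awk's own code.

```python
from collections import defaultdict


def group_presence(tweets, group_of):
    """Percentage of each period's unique users that belongs to each group.

    tweets:   iterable of (period, user) pairs, one per tweet
    group_of: dict mapping each user to a group fixed for the whole dataset
    """
    users_by_period = defaultdict(set)
    for period, user in tweets:
        users_by_period[period].add(user)  # each user counts once per period

    presence = {}
    for period, users in users_by_period.items():
        counts = defaultdict(int)
        for user in users:
            counts[group_of[user]] += 1
        presence[period] = {g: 100.0 * c / len(users) for g, c in counts.items()}
    return presence


# Invented sample data: user "a" tweets twice on the first day.
tweets = [
    ("2011-01-11", "a"), ("2011-01-11", "a"), ("2011-01-11", "b"),
    ("2011-01-11", "c"), ("2011-01-11", "d"),
    ("2011-01-12", "a"), ("2011-01-12", "b"),
]
group_of = {"a": "top 1%", "b": "next 9%", "c": "bottom 90%", "d": "bottom 90%"}

presence = group_presence(tweets, group_of)
print(presence)
```

Because the sketch counts unique users rather than tweets, a lead user who posts many times in one period still counts only once towards their group's presence in that period.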
(Again, please note that further stats for those user percentiles were introduced in metrify 1.2 – details are here.)
Some Results
Time for some first results from this table, then. What these data allow us to do is already quite useful, and I’ll only provide a handful of examples here; you can experiment further on your own. Using my #qldfloods data, and selecting just this first table of metrics from the metrify.awk output, I’ll create a pivot table in Excel, which enables me to plot various metrics over time, for example:
This first table simply shows that the number of unique participating users, and the volume of tweets posted under the hashtag #qldfloods, move together over time; for most hashtags, that’s what you’d expect to see, I think.
Next, we see how different types of tweets contribute to the overall volume of tweets. Retweets (which I haven’t divided into edited and unedited retweets here) are quite prominent at the start of the crisis – as everyone is looking to share what little information is already available – and gradually drop down towards the end (as more information is available, and retweeting isn’t as important any more; there’s a big tick up on the last day, but the overall volume of tweets is very low then, so this may be an outlier); @replies gradually rise, on the other hand (perhaps because there’s a shift from simply sharing news and information to discussing how best to organise the recovery effort). URLs also rise gradually – possibly a sign of more and better information becoming available.
Finally, a look at our user percentiles: what we see here is that the ‘lead’ users aren’t actually that prominent, especially during the busiest days for the hashtag (11-13 January): on those days, even the top two user percentiles combined don’t account for more than 20% of all unique users. This shouldn’t be misunderstood to mean that these top users were being drowned out by the hoi polloi, though: rather – given what we’ve already found out about retweeting rates in the previous graph – much of what the least active 90% of users were doing during these days was to retweet the messages of those lead users. (From all we’ve seen so far, this is a pattern common to crisis-related hashtags; it may be very different for a non-crisis case.)
We’ll see more evidence of this, in fact, when we turn to the next metrics table produced by metrify.awk – in the next post in this series…
So, 2011 is finally over – and what a year it’s been. While the confluence of natural disasters, political crises, and other major events has also provided us with the basis for a new research programme in crisis communication, let’s hope that 2012 is a little less intense, please…
To start the new year on a positive note, I’m finally getting around to sharing some more information about the new approach to generating Twitter metrics which we’ve developed over the past few months – this actually started during the research workshops we had with Stefan Stieglitz’s group at the University of Münster in August, so it’s taken some time to gestate into its present form. What it’s now turned into is quite a powerful tool for generating detailed information about a specific Twitter dataset – intended mainly for the study of hashtags, but with applications well beyond this as well. Amongst other things, it enables us to distinguish more effectively between different groups of participating users (from highly active lead users to much less active casual participants), and to track different types of participation, in total or by these specific groups, over time.
Introducing Metrify.awk
The Gawk script we’re using for this is called metrify.awk, and it’s available here (ZIP file; you’ll need to unpack it). Metrify.awk is unusual in that it generates three different results tables within the one output CSV/TSV file; the output file is designed to be opened in Excel or another spreadsheet software, for further interrogation or charting. Metrify.awk takes standard Twapperkeeper or (with our modification) yourTwapperkeeper archives as input, and it’s run from the command line as follows:
gawk -F , -f metrify.awk divisions=[list of percentiles] time=[time period] [skipusers=1] tweets.csv >metrics.csv
I’ll explain the parameters as we go through the different results tables which metrify.awk generates. Obviously, use \t instead of , if you’re dealing with tab- rather than comma-separated datasets.
Distinguishing Different User Groups
The first aim of metrify.awk is to develop better distinctions between different groups of users. In any Twitter hashtag dataset, for example, there will usually be a long tail-style distribution of activity: from a handful of highly active ‘lead users’ through to a larger number of hangers-on who may only be present in the dataset because they happened to retweet a hashtagged message, without even paying much attention to the existence of the hashtag in the first place. In many circumstances, we might want to focus only on that central group of most active users, or examine how they act differently from the more marginal members of the hashtag community (to the extent that hashtags can be considered as communities in the first place, obviously).
One standard tool for making such distinctions is the division of the total userbase into different percentiles of more or less active users (where ‘active’ is measured simply by how many tweets they’ve contributed to the hashtag). Common such divisions include the so-called 10/90 or 1/9/90 rules, which distinguish variously between the top 10% of users and the rest of the group, or between the top 1% of lead users, the next 9% of engaged but less active users, and the rest. In other circumstances, a more even division of the userbase into two halves (i.e. 50/50) or four quarters (25/25/25/25) might also be useful.
Metrify.awk supports any such divisions through the divisions command-line parameter. This parameter specifies where the divisions should be made, through a comma-separated list of cutoff points, counting from the least active users to the top: “90”, for example, creates a 10/90 division, “90,99” creates a 1/9/90 division, and “25,50,75” divides the userbase into four quarters of the same size. “10”, by contrast, would put the bottom 10% of least active users into one group, and the rest of the userbase into another. If divisions is not specified, metrify.awk defaults to “90,99” – i.e. dividing the userbase according to the 1/9/90 rule.
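To make the cutoff notation concrete, here is a minimal Python sketch (not part of metrify.awk, which is a Gawk script) that turns a divisions string into the group sizes it implies. The function name division_shares is made up for this illustration, and the sketch assumes whole-number cutoffs listed in ascending order.

```python
def division_shares(divisions="90,99"):
    """Turn a cutoff list such as "90,99" into group sizes in percent.

    Groups are listed from the least active users to the most active,
    matching the direction in which the cutoffs are counted.
    Hypothetical helper for illustration only; not taken from metrify.awk.
    """
    cutoffs = [int(c) for c in divisions.split(",")]
    bounds = [0] + cutoffs + [100]
    # Each group spans the gap between two neighbouring boundaries.
    return [upper - lower for lower, upper in zip(bounds, bounds[1:])]


for spec in ("90", "90,99", "25,50,75", "10"):
    print(spec, "->", division_shares(spec))
```

Read from the bottom up, “90,99” gives groups of 90%, 9% and 1% of the userbase, which is the 1/9/90 rule read in reverse.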
One note of caution on these divisions: imagine you’re dealing with a group of ten users, for which you’d like to apply the 10/90 rule – that is, the top user is counted in a different category from the rest. What if two or more of these ten users share the top spot – i.e., they have contributed the same number of tweets? Which of them should be counted as the top user, which of them should be counted with the rest? Such a small group of only ten users is an extreme example, of course – most of the hashtag datasets to which we’re applying metrify.awk will include thousands or tens of thousands of unique users. But still, the problem can occur here, too.
In such cases, metrify.awk takes an inclusive approach, counting from the top on down: if users on either side of the boundary between the first and second percentile group have the same number of tweets, those in the lower group are also moved to the higher group. In our example above, all the users sharing the top number of tweets would be counted in the top percentile, even if this means that the division ends up 20/80 or 30/70 instead. For larger datasets, the effects are usually far less extreme: a 10/90 division might blow out to 11/89 or 12/88 instead, but I think that’s preferable to making an arbitrary choice between equally active users.
Where things can get more problematic is with more even divisions (e.g. 25/25/25/25) and strong long-tail distributions of activity. Here, it’s quite possible that both the lowest 25% and the next higher 25% consist of users who have contributed only one tweet to the total dataset. In such cases, all those users will end up in a combined percentile which actually covers all of the bottom 50%, with the fourth quarter percentile remaining empty. If you see this in your own results, choose different division points (e.g. 25/25/50, i.e. “50,75” as the command-line parameter).
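The inclusive rule, and the way it can leave a percentile group empty, can be illustrated with a short Python sketch. This is my own reading of the behaviour described above, not the actual Gawk code in metrify.awk: the function names, the rounding of nominal group sizes, and the sample tweet counts are all assumptions made for the example.

```python
def assign_groups(tweet_counts, divisions=(90, 99)):
    """Return {user: group index}, where 0 is the most active group.

    tweet_counts maps each user to their number of tweets; divisions are
    cutoffs counted from the least active users upwards. Illustrative
    sketch only; the rounding of nominal group sizes is an assumption.
    """
    ranked = sorted(tweet_counts, key=tweet_counts.get, reverse=True)
    n = len(ranked)
    # Nominal number of users above each boundary, working from the top down.
    boundaries = [round(n * (100 - c) / 100) for c in sorted(divisions, reverse=True)]
    groups = {}
    start = 0
    for index, end in enumerate(boundaries + [n]):
        end = max(end, start)
        # Inclusive rule: users tied across the boundary move to the higher group.
        while 0 < end < n and tweet_counts[ranked[end]] == tweet_counts[ranked[end - 1]]:
            end += 1
        for user in ranked[start:end]:
            groups[user] = index
        start = end
    return groups


def group_sizes(tweet_counts, divisions):
    """Count the users in each group, most active group first."""
    groups = assign_groups(tweet_counts, divisions)
    return [list(groups.values()).count(i) for i in range(len(divisions) + 1)]


# Ten users, two of them tied for the top spot: 10/90 becomes 20/80.
ten_users = dict(zip("abcdefghij", [5, 5, 3, 2, 2, 1, 1, 1, 1, 1]))
print(group_sizes(ten_users, (90,)))

# Strong long tail with quarters: the single-tweet users all end up together.
long_tail = dict(zip("abcdefgh", [9, 5, 1, 1, 1, 1, 1, 1]))
print(group_sizes(long_tail, (25, 50, 75)))
```

In the second case, every user with a single tweet is pulled up into the same group, so the groups below it stay empty. That is the situation in which choosing different division points makes sense.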
Tracking Metrics over Time
The second metrify.awk parameter makes it possible not just to generate overall metrics for these different groups of users, but also to track their participation over time. Here, we’re able to specify the time period which we’re interested in tracking: the options are “minute”, “hour”, “day”, “month”, or “year”, which should cover all eventualities. What time period is appropriate in each case depends on the nature of the hashtag dataset, of course: for a day-long event like the #royalwedding last year, “minute” may be useful; to understand longer-term developments like #egypt or #libya, “day” or even “month” might be better.
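What choosing a time period means in practice can be shown with a small Python sketch: each tweet’s timestamp is truncated to the chosen period, and tweets are then counted per period. This illustrates the idea rather than how metrify.awk does it internally; it assumes Unix timestamps interpreted as UTC, and the names PERIOD_FORMATS, period_key and tweets_per_period are invented for the example.

```python
from collections import Counter
from datetime import datetime, timezone

# One label format per period; coarser periods simply drop the finer fields.
PERIOD_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H:00",
    "minute": "%Y-%m-%d %H:%M",
}


def period_key(unix_time, period):
    """Truncate a Unix timestamp (assumed to be UTC) to the chosen period."""
    moment = datetime.fromtimestamp(unix_time, tz=timezone.utc)
    return moment.strftime(PERIOD_FORMATS[period])


def tweets_per_period(unix_times, period):
    """Count how many timestamps fall into each period."""
    return Counter(period_key(t, period) for t in unix_times)


# Four made-up tweet timestamps: three on 11 January 2011, one on 12 January.
times = [1294704000, 1294704060, 1294707600, 1294790400]
print(tweets_per_period(times, "day"))
print(tweets_per_period(times, "hour"))
```

The same four tweets fall into two buckets by day but three by hour, which is why the right period depends on how long the event runs.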
Note: metrify.awk expects the input file to contain tweets in chronological order, from the earliest to the latest. Before exporting your tweet archives from yourTwapperkeeper, make sure you select ‘ascending’ as the display order, or use a spreadsheet software to reorder the exported data in ascending order by the timestamp field – otherwise, the result may not be what you expected.
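If a spreadsheet isn’t to hand, the reordering can also be scripted. The Python sketch below sorts CSV data into ascending order by a numeric timestamp column. The column name “time” and the sample rows are assumptions made for the example, so check the header of your own export before relying on it.

```python
import csv
import io


def sort_csv_by_time(csv_text, time_field="time"):
    """Return the CSV text with its rows in ascending timestamp order.

    Assumes a header row and a numeric (Unix) timestamp column; the
    column name "time" is an assumption, so adjust it to your export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: int(row[time_field]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample export with the newest tweet listed first.
sample = "text,from_user,time\nsecond tweet,user2,200\nfirst tweet,user1,100\n"
print(sort_csv_by_time(sample))
```

Using the csv module rather than splitting lines on commas keeps tweets that contain commas or quotes intact.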
Extracting Individual User Stats
Finally, in addition to these aggregate stats, metrify.awk also generates per-user metrics, which I’ll discuss in more detail below. With large datasets, this is the most time-intensive aspect of what metrify does, though – and if you’re only interested in the overall metrics, and not in the details of how individual users fared, generating these additional statistics is overkill. So, metrify includes a command-line switch which turns off metrics generation for individual users almost completely (other than simply counting the number of tweets they’ve contributed to the dataset): skipusers=1. Including this on the command line will considerably speed up processing.
OK, with these preliminaries out of the way, it’s time to take metrify.awk for a spin, and to see what it produces. We’ll do so in the next post in this series…
The book Key Concepts in the Creative Industries will be published by Sage in 2012. It is co-authored by John Hartley, Stuart Cunningham, Jason Potts, Michael Keane, John Banks and myself, and has 42 entries that range from art and aesthetics to representation and technology. It will become an important contribution in ongoing debates about the creative industries, and hopefully be widely used in courses.
To give a taster of the entries that will be in, I have attached the draft entry on Cultural Policy. Other entries will be made available in the near future.
My paper “Creative suburbia: Rethinking urban cultural policy – the Australian case” will be published in the International Journal of Cultural Studies in 2012, but a pre-publication copy can be accessed here. The abstract for the paper is below:
This article considers the question of whether creative workers demonstrate a preference for inner cities or suburbs, drawing upon research findings from the ‘Creative Suburbia’ project undertaken by a team of Australian researchers over 2008–2010 in selected suburban areas of Brisbane and Melbourne. Locating this question in wider debates about the relationship of the suburbs to the city, as well as the development of new suburban forms such as master-planned communities, the article finds that the number of creative industries workers located in the suburbs is significant, and those creative workforce members living and working in suburban areas are generally happy with this experience, locating in the suburbs out of personal choice rather than economic necessity. It is noted that this runs counter to received wisdom on creative cities, which emphasize cultural amenity in inner city areas as a primary driver of location decisions for the ‘creative class’. The article draws out some implications of the findings for urban cultural policy, arguing that the focus on developing inner urban cultural amenity has been overplayed, and that more attention should be given to how to better enable distributed knowledge systems through high-speed broadband infrastructure.
The Introduction to the special issue, co-authored with Mark Gibson, Christy Collis and Emma Felton, can also be accessed here.