Text Analysis for Monitoring Personal Information Leakage on Twitter
Dongjin Choi (Chosun University, Republic of Korea)
Jeongin Kim (Chosun University, Republic of Korea)
Xeufeng Piao (School of Computer Science and Technology, China)
Pankoo Kim (Chosun University, Republic of Korea)
Abstract: Social networking services (SNSs) such as Twitter and Facebook can be considered as new forms of media. Information spreads much faster through social media than any other forms of traditional news media because people can upload information with no time and location constraints. For this reason, people have embraced SNSs and allowed them to become an integral part of their everyday lives. People express their emotional status to let others know how they feel about certain information or events. However, they are likely not only to share information with others but also to unintentionally expose personal information such as their place of residence, phone number, and date of birth. If such information is provided to users with inappropriate intentions, there may be serious consequences such as online and offline stalking. To prevent information leakages and detect spam, many researchers have monitored e-mail systems and web blogs. This paper considers text messages on Twitter, which is one of the most popular SNSs in the world, to reveal various hidden patterns by using several coefficient approaches. This paper focuses on users who exchange Tweets and examines the types of information that they reciprocate other's Tweets by monitoring samples of 50 million Tweets which were collected by Stanford University in November 2009. We chose an active Twitter user based on "happy birthday" rule and detecting their information related to place to live and personal names by using proposed coefficient method and compared with other coefficient approaches. As a result of this research, we can conclude that the proposed coefficient method is able to detect and recommend the standard English words for non-standard words in few conditions. Eventually, we detected 88,882 (24.287%) more name included Tweets and 14,054 (3.84%) location related Tweets compared by using only standard word matching method.
Keywords: Personal Identifiable Information, Twitter, personal information leakage, social network services, text analysis
Categories: H.3.1, H.3.2, H.3.3, H.3.7, H.5.1