
Data cleanup trickery

Today I came across another one of those little gems that sometimes come out of the incredible code foundry that is Google. The project is called Google Refine. Admittedly this post looks rather dull, but there is plenty to get excited about! So read on…

Despite the many critics happy to jump on any of the big players for any reason, I fundamentally admire Google’s work philosophy. The fact that they embrace creativity and personal interests in a somewhat anarchic way gives interesting projects a real chance to develop and take shape independently of the general roadmap of official applications.

In the past few years I have been working with a number of large datasets extracted from a variety of databases or merged from different sources. The thing they have always had in common is inconsistency. Some inconsistencies are caused by people, some by the systems. For example, people make input mistakes, moving data from one system to another creates spurious errors, and sometimes careless programming leads to odd computations. The result is always a messy and lengthy process which involves cleaning data or attempting to augment it from external sources.

There are many ways of approaching the problem, but the simplest is to open the data in a spreadsheet and use grouping, sorting and formatting to try to bring up unusual or odd patterns.
Applying data exploration techniques is the next step: plotting distribution graphs and building summary tables is useful for spotting things, but a proper statistical package is often easier to use for certain summaries. (That’s a second application, so the data might need to be moved again, as spreadsheets only tend to work well with relatively small datasets!)
Spreadsheets are limited in the number of data points they can handle: for example, MS Excel was limited to about 65,000 rows until the Office 2007 version, which is nowhere near enough for large datasets… Statistical packages are expensive, especially if you are not keen on tinkering with code in packages like R. Moreover, the user interface limits what you can actually do, and even popular choices like SPSS (or PASW in its latest incarnation) and Stata still rely on syntax (i.e. manual coding) for more complex procedures.
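
To make the grouping-and-sorting idea a bit more concrete, here is a minimal sketch in Python of what I mean by bringing up odd patterns. It is not how any particular tool does it: the function names and the sample column are made up for illustration, and the "normalisation" is deliberately naive (it only ignores case and extra spaces).

```python
from collections import Counter, defaultdict


def normalise(value):
    """Lower-case the value and collapse all runs of whitespace."""
    return " ".join(value.split()).lower()


def group_variants(values):
    """Group raw values by their normalised form, counting each spelling."""
    groups = defaultdict(Counter)
    for value in values:
        groups[normalise(value)][value] += 1
    return groups


def inconsistent_groups(values):
    """Keep only the groups that were entered with more than one spelling."""
    return {
        key: dict(counts)
        for key, counts in group_variants(values).items()
        if len(counts) > 1
    }


# Hypothetical sample column with typical data-entry inconsistencies.
cities = ["London", "london ", "LONDON", "Paris", "Paris", "New  York", "New York"]
result = inconsistent_groups(cities)

for key in sorted(result):
    print(key, result[key])
```

With the sample column above, "Paris" is left out because it is always spelled the same way, while the London and New York variants are reported together so they can be reviewed and merged by hand.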

It is no surprise that I was very intrigued when I came across the Google Refine project. As well as being intuitive to use, the package (Java-based, but running on the desktop) allows powerful transformations with very simple operations. The video screencasts are an excellent showcase of its potential, so I simply had to try it!

In the next few weeks I will play with it and report back. In the meantime, it is a project to watch.


Sociality of the web

I was unsure whether the title ‘4 geeks and a pot of money’ would be too critical for this post, but I cannot deny that I thought about it!

I came across this article in the NY Times and ended up with a number of odd thoughts:

1- how did 4 geeks (I guess it is better than nerds?) with an idea manage to pull together over $150,000?

2- this is the revenge of the common man

3- knowledge is power (and knowing how to code is the new scholastica, relegating the real common man to the role of the ignorant and powerless man of the Middle Ages)

4- is it going to work? (or is it just another open source project destined for failure?)

Let’s go in order. The first thought is easily addressed: theirs is a ‘call to arms’ against Facebook, and it is not surprising at all that they received such support. It was not long ago that Zuckerberg made changes to Facebook which resulted in a user revolt. To get a grasp of the conflict, it is worth looking at his online biographies on Wikipedia and Dickipedia.

Personally, I’m not sure about all this fuss regarding privacy and the tyrannical rule of corporations. But this might be down to the fact that my use of social networks is not so extensive. I’m actually quite bad at keeping in touch by phone already…

However, I also double-checked what is public on my profile after the latest changes with Open Graph (you can see what Facebook publishes about you and your friends with this openGraph app).

The fact is that when I participate in any public activity my identity is exposed in one way or another, and snapshots of me are going to be all over the place, with or without my consent. So what’s the problem? I believe the fact that information is becoming more and more aggregated is a side effect of the internet. My view, however, is that the information is out there anyway.

Let’s use a simple example. In public I’m a teacher in a higher education institution, a football referee and a member of a number of clubs and organizations. Each holds an information silo about me which is semi-public (i.e. they might officially keep it private). Nevertheless, because of the type of activities, my face could also be published in some form which I might not necessarily have agreed to when I became a member (e.g. a journalist taking a picture of me in a cup final and publishing it in a newspaper, or academic references in places which I didn’t think I would be associated with).

The concern, however, goes deeper, I believe: those who are used to public scrutiny have a certain awareness of the thin line between public and private, and even though the manifestation of identity varies across domains, I can’t help thinking that people’s privacy is a facade for paranoia.

There is a voyeurism these days that leads to a worldwide exposure similar to that of celebrities. Is this a craving for attention channeled to the world? My question is not only why I would want to share a picture in which I make a fool of myself during a night out, but why I would want to take the picture in the first place. Flipping the coin, why would I want to take a potentially embarrassing snapshot of a friend who will most probably regret the moment anyway? Anyway, I’m going off topic now… There is an argument for ease of use though, and I like the idea of data portability (the Gigya login on this site tells you only half of the story?!)

The second thought is more optimistic: this shows that anyone with a good idea has a chance of succeeding. This is a great ideal and I fully embrace it. Revisiting the last thought, however, it is quite unlikely that, without the exposure, this project would have had any chance of getting out of the cafe in which it was conceived.

The third thought is a reflection on the true meaning of ‘common man’. These kids are actually not so common. They are bright individuals in pursuit of an idea, and they are equipped with the knowledge to pursue it. It is actually foolish to think that anyone could do it. In fact, most people, no matter how you present web 2.0 stuff, are still mere users. For example, although it was fashionable a few years back to keep a blog count (see for example Duncan’s post on the blog herald), how many blogs are actually active? My view is that things are very dynamic, but a lot of people out there don’t blog yet read blogs, and don’t tweet or broadcast themselves on YouTube yet like that others do. Some don’t want to and some don’t need to, but a fundamental obstacle might be that they have no clue how to do it. So knowledge is power (yet again), and ignorant users who can’t even change their privacy settings in Facebook will most definitely not try to host their own node.

What are the actual consequences of this model? How private can it really be? The distributed networks of Napster, eMule or BitTorrent didn’t seem to deliver privacy, especially when your network provider might be held responsible for your traffic (many countries seem to be producing laws to this effect). So what’s the real difference between having this information stored on the Facebook servers and on my home node?

Well, there are a lot of questions to be asked, but I am a supporter of open source projects, and although I have no clue whether this is going to work, for now I will be following the Diaspora project and look forward to seeing how these guys do.

However, I share Jeff Sayre’s views about the usefulness of more streams and recommend his detailed article on social networking.
