People, places, events: our browsers (and other machines) should be able to recognize these things as easily as you and I can. That’s the promise of the semantic web. But users are not the ones who will make the semantic web work. Microformats extend existing XHTML tags to put human-readable information into machine-parsable form (@Argent Hotel, San Francisco, CA@). But as simple a solution as that is, it will never have mainstream appeal.
Unlike text-formatting XHTML tags & classes like <strong> or <blockquote>, data-formatting XHTML tags have no immediate visible effect on their contents. Can you tell the difference between:
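One way to see this is to strip the tags from a plain and a marked-up version of the same text and compare what’s left. The Python sketch below assumes hCard-style class names (@vcard@, @fn@, @org@, @adr@, @locality@, @region@) for the marked-up version; the helper names are illustrative only.

```python
from html.parser import HTMLParser

# The same venue twice: once as plain text in a span, once with
# hCard-style class names (assumed here for illustration).
PLAIN = "<span>Argent Hotel, San Francisco, CA</span>"
MARKED_UP = (
    '<span class="vcard">'
    '<span class="fn org">Argent Hotel</span>, '
    '<span class="adr">'
    '<span class="locality">San Francisco</span>, '
    '<span class="region">CA</span>'
    "</span></span>"
)


class VisibleText(HTMLParser):
    """Collects only the text a reader would see, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML fragment with tags removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Both versions read identically to a person.
print(visible_text(PLAIN) == visible_text(MARKED_UP))
```

To a person the two versions are the same string; only a parser that looks at the class names gets anything extra out of the second one.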
Because of this, users have no reason to ever remember to format their information with machine-readable tags. Even a rich-text-style editing interface isn’t much help, because users lack a reason to *want* to make their data machine-readable. Only techies and smart business people care about that; for everyone else, human-readable is good enough. *People write for people, not machines.* So services will have to write for services.
We already have our information put into semantic form on a regular basis. Every time we type our name into a textfield created for that purpose, the machines of at least one service (Facebook, Yahoo! Mail, Wachovia) store that not just as a bit of text, but as text identified specifically as our name. Not just our names, but often our addresses, our interests, our friends, our job titles, and many other kinds of information. Data silos have existed for ages, of course, but *giving the semantic web at large access to data silos is a key step forward in making semantic tools useful*: just as key as, if not more so than, collecting data from the distributed web.
Not only is that data all in a central location for easy access, it’s often the kind of data we don’t explicitly state in our natural usage of the rest of the web. How often do you write on your blog or in an IM, “I live in Boston, MA. My interests are coffee, technology, startups, and communications. My relationship status is single, my height is 5′8″, and my eye color varies between blue and green”? (Which reminds me, I really need to increase the female demographic of my readership.)
Finally, access to data silos is as important as access to distributed data because data silos are where the majority of people are (and will be for the foreseeable future). There are “only 165,700,000 sites in existence”:http://news.netcraft.com/archives/web_server_survey.html, whereas Facebook alone has over 70 million active users (and Facebook gets a quarter of the traffic Yahoo! does). The majority of people’s online presences will be through a centralized service, rather than their own site.
But that doesn’t mean the data available on the distributed web has no value. After all, even just a few sites can put out a lot of content. For the reasons outlined above, though, most of these sites won’t be putting their own content into a semantic format. That’s where semantic creators come in.
Data scraping is as simple as understanding the common format a specific type of data is usually written in (email@example.com or 555-555-5555), then having a web crawler find that information, extract it, and put it into the desired markup or database fields. As natural language processing advances, we’ll start seeing more services that recognize the kind of information I expressed explicitly above (interests, relationship status, etc.), even when it’s only expressed implicitly.
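As a rough illustration of the pattern-matching half of that process, here is a minimal Python sketch. It assumes the page text has already been fetched by a crawler, and the two patterns are deliberately simple: they only cover the email and phone shapes shown above, not every format found in the wild. The function name @scrape_contacts@ is made up for this example.

```python
import re

# Deliberately simple patterns (assumptions for this sketch);
# real-world email and phone formats vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def scrape_contacts(text):
    """Pull emails and phone numbers out of a blob of page text
    and return them as labeled, machine-readable fields."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


page = "Questions? Write to email@example.com or call 555-555-5555."
print(scrape_contacts(page))
```

The interesting part is the return value: the same characters that were just prose on the page come back labeled as an email and a phone number, which is all a downstream service needs.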
Those kinds of services will operate either constantly or on demand. They will be the middlemen between semantic tools and the rest of the web, collecting all of the information we put out, and putting it into a machine-readable format services can use (and changing it from the format one service uses to the format another service prefers, until semantic markup becomes more standardized).
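The format-translation step can be as plain as a field-renaming table. The Python sketch below is only an illustration: both sets of field names are hypothetical and don’t correspond to any real service’s schema, and the sample profile is made up.

```python
# Hypothetical mapping from one service's field names to another's.
# Neither schema belongs to a real service.
FIELD_MAP = {
    "full_name": "name",
    "hometown": "location",
    "likes": "interests",
}


def convert_profile(record, field_map=FIELD_MAP):
    """Rename the fields of one service's record to the names another
    service expects, dropping any field the mapping doesn't cover."""
    return {
        field_map[key]: value
        for key, value in record.items()
        if key in field_map
    }


source_profile = {
    "full_name": "Alex Example",
    "hometown": "Boston, MA",
    "likes": ["coffee", "technology"],
    "session_id": "abc123",  # service-specific detail, not carried over
}
print(convert_profile(source_profile))
```

A real middleman would also have to reconcile value formats (dates, addresses, units), which is exactly the work that more standardized semantic markup would eventually remove.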
So we have all that information in machine-readable format; great! Now what can we do with it? We can:
* *Find it.* Search, bookmarking, related-items, context, area-specific…
* *Combine it.* “Mashups – filter, visualize, correlate, advertise”:http://jayneely.com/2007/06/02/more-signal-less-noise-the-power-of-rss-mashups, etc.
* *Integrate it.* One-click additions of events, contact info, preferences, licensing information, etc. into the applications you use every day.
* *Aggregate it.* See all of the events for a certain category happening within your city. Discover the most-talked-about TV show amongst your friends. Find all Creative Commons licensed podcasts available for remixing.
The possibilities for users to enhance their web experience with semantic data are endless. But it’s up to developers to create the services that give them that chance.