
A Social Search Engine Proposal.

Overview:

[Image: Nutch robots, via Wikipedia]

IMHO, the current state of search is depressing. This is not a new realization for me. It is seven or eight years ago now that I first imagined a social search engine which would not rely solely on algorithms to determine the relative importance of search results but would consider both machine and end-user feedback. This was in the early days of Nutch, and I began researching the possibility of utilizing Nutch as the underlying core engine for such an endeavor, rounded up some short-term investment capital, and so on. Unfortunately, this was also at the peak of my struggle with Obsessive Compulsive Disorder (OCD), and my efforts eventually fell through.

Over the years I have watched as promising engine after promising engine has come along and, each in its turn, failed to take the lead or even maintain its momentum. Years have passed, and at each step of the way I have said, “It must be just around the corner… this is ages in technology time.” Even Google came out with SearchWiki; while not a perfect implementation, it was a huge step in the right direction. For the last year or two I’ve been using Zakta, and I’ve spent time on almost every other social search engine currently (or previously) available – yet I find that in the long run they have all failed me.

So here I am, so many years later, longing for just such an engine. I’ve written on this blog about the topic before, but I will write again. In this post I will specifically propose the formation of an endeavor to create a social search engine, and I hope it will foster some interest in the community. I am neither ready nor able to undertake such an endeavor myself – but I am interested in being part of one.

Open Source: Ensuring Continuity

It is worth noting at this juncture that I’d intend for this project to be open source. Too many times I have lost the social search data I have accumulated because a specific engine has folded. My hope would be that the resultant project would be open source with commercial implementations and would provide a significant amount of data portability between engines, in case one engine should fold. We’ll talk more about the open source and portability aspects of the project later in this proposal.

What is Social Search?

Before we jump into a discussion about how to build a social search engine, it is necessary first to define what is meant by social search. Unfortunately, the term is used for several concepts which are very different from one another.

There are the real-time search engines, which focus on aggregating information from various social media networks – and sometimes prioritizing links based on their popularity within a network. Examples include Topsy (also defunct), the no-longer-real-time OneRiot, and the now-defunct Scoopler.

There are the engines which are focused on finding humans – i.e. allowing one to garner information about a person. Wink eventually became this sort of engine; Spokeo would be another example. They are essentially white pages on steroids.

Finally, there is what I mean by social search – and I would use another term, but there is no other term I am aware of which is so widely used for this type of engine (and I want to ensure the widest possible audience). It is sometimes called a “human-powered search engine.”[1] Google and Wikia may have come closest by terming it a “wiki” (SearchWiki and Wikia Search), but it seems to me that there is a need for an entirely new term that better and more precisely defines the idea… perhaps one result of this proposal and its aftermath will be just such a term.[2]

Core Parameters

In this section I will delineate what I believe are the core required features for a social search engine. An engine which included these features would, I believe, constitute a 1.0 release. There is certainly room for numerous improvements, but this would define a baseline by which to measure the proposal’s progress. I am not infallible, and I am sure there are aspects of the baseline which should be edited, removed, or replaced – I am open to suggestions.

  • Web Crawler – The engine must include a robust web crawler which can index the web, not just a subset of sites (e.g. Nutch).
  • Interpretive Ability – The engine must be able to interpret a wide variety of file formats, minimizing the invisible web (e.g. Tika).
  • Engine – The engine must be able to quickly query the aggregated web index and return results in an efficient manner (e.g. Nutch).
  • Search Interface – The engine must include a powerful search interface for quickly and accurately returning relevant results (e.g. Solr).
  • Scalability – The engine must be scalable to sustain worldwide utilization (e.g. Hadoop).
  • Algorithms – In addition to the standard automated algorithms for page relevance the system must integrate human-based feedback including:
    • Positive and negative votes on a per page basis.
    • The ability to add and remove pages from query results.
    • Influence of votes based on a calculation of user trustworthiness (merit).
    • Promotion of results by administrative users.
  • Custom Results – The results must be customized for the user. While the aggregate influence of users affects the results, the individual user is also able to customize results. One should see a search page which reflects the results one has chosen and not the results one has removed.
    • Ability to annotate individual entries.
  • Portability – The engine should define a standard format for user data which can be exported and imported between engines. This should include customized query results, annotations, votes, removed and added pages, etc. This will be available to the user for export/import at any time. While additional data may be maintained by individual engines, the basic customizations should be portable.
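To illustrate the portability requirement above, here is a minimal sketch of what an engine-neutral export bundle might look like, assuming a simple JSON format. All class and field names here are hypothetical; no such standard currently exists.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class QueryCustomization:
    """One user's customizations for a single query."""
    query: str
    votes: dict = field(default_factory=dict)          # url -> +1 or -1
    added_pages: list = field(default_factory=list)    # urls pinned into results
    removed_pages: list = field(default_factory=list)  # urls hidden from results
    annotations: dict = field(default_factory=dict)    # url -> free-text note

@dataclass
class PortableUserData:
    """Engine-neutral bundle a user could export from one engine and import into another."""
    schema_version: str
    user_id: str
    customizations: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, so the whole bundle serializes cleanly.
        return json.dumps(asdict(self), indent=2)

# Example export: one customized query with a vote and an annotation.
bundle = PortableUserData(
    schema_version="1.0",
    user_id="example-user",
    customizations=[QueryCustomization(
        query="vicksburg battle",
        votes={"https://example.org/vicksburg": 1},
        annotations={"https://example.org/vicksburg": "Best overview I have found."},
    )],
)
exported = bundle.to_json()   # what the user carries between engines
restored = json.loads(exported)
```

An importing engine would only need to honor the fields it understands, keeping the basic customizations portable even if engines maintain additional private data.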

I’m sure I’m missing some essentials – please share with me whatever essentials I have forgotten that come to your mind.

Starting from Zero?

It is not necessary for this project to begin from nothing; significant portions of the work toward an open source search engine have already been undertaken – largely by Apache’s Nutch project. The available code should be utilized and, with customization, could integrate social search features. This would allow some of the most significant aspects of the project to be offloaded to already existing projects.

Additionally, it might be hoped that companies and individuals who have previously created endeavors in this direction would open source their code. For example, Wikia Search was built on Nutch, and the code – including the distributed crawler (GRUB) and the UI – was released into the open source world.[3]

What We Need

Now the question becomes, “What do we need?” and more importantly, “Whom do we need?”

First off, we could use donated hosting. Perhaps one of the larger cloud-based hosting companies would consider offering us space for a period of time? I’m thinking here of someone like Rackspace, Amazon Web Services, or GoGrid.

Secondly, we’d need developers. I’m not a Java developer…though I’ve downloaded the code and am preparing to jump in. I also don’t have a ton of time – so depending on me to get the development done…well, it could take a while.

Thirdly, we’d need content curators… and I think this is key (it is also one of the areas I love the most). We’d need people to edit the content and make the results awesome. These individuals would be “power users” whose influence on results would be more significant than that of a new user. With time, individuals could increase their reputation, but this would seed us with a trusted core of individuals[4] who would ensure that the results returned would be high quality right from the get-go for new users.[5]

Finally, we’d need some designers. I’m all for simplicity in search – but goodness knows most of us developers have very limited design abilities and an aesthetic touch here and there would be a huge boon to the endeavor.

Next Steps

At this juncture it’s all about gathering interest: finding projects that have already begun the process, looking for old hidden open source code that may be of use, etc. Leave a comment if you’d like to be part of the discussion.

Appendixes

Current Open Source Search Engines

  • DataparkSearch – GNU GPL, diverged from mnoGoSearch in 2003, coded in C and CGI.
  • Egothor – Open source, written in Java, currently undergoing a complete from-scratch rewrite for version 3.
  • Grubng – Open source, distributed crawler.
  • BeeSeek – Open source, P2P, focuses on user anonymity.
  • Yioop! (SearchQuarry) – GNU GPLv3, documentation is very informative.
  • Heritrix – Open source, by Archive.org for their web archives.
  • Seeks Project – AGPLv3, P2P, fairly impressive project which attempts to take social search into consideration.
  • OpenWebSpider – Open source, written in .NET, appears to be abandoned.
  • Ex-Crawler – Open source, Java, impressive, last release in 2010.
  • Jumper Search – Open source, social search, website appears to be down, currently linking to SF.
  • Open Search Server – Open source.
[1] Or a human search engine, though this term becomes sadly entangled with the engines meant for finding humans referenced previously.
[2] A few other terms which might be appropriate are collaborative search engine, though this would have to be prefaced with “active” to distinguish it from passive user feedback aggregation (e.g. how long a user stayed at a site); curation search engine (giving the idea of content curation, though this is sometimes thought of in terms of archival); or crowd-sourced search engine (though this centers too much on democracy, whereas such engines would probably benefit from a meritocracy).
[3] Unfortunately, I have been unable to find a copy of the Wikia UI code.
[4] Taking a page from early Ask Jeeves history.
[5] Obviously not necessarily in the long tail, but in the general topics.

Ramblings on Search.

A Better Search Engine.

X= (C*15) + (E*40) + ((A1*40) + (A2*30) + (A3*20) + (A4*10) + (A5*5)) + (M*20)

X = The ranking value for any given entry for a given search query.
C = The natural ranking given by automated search engine analysis (such as provided by Google, Yahoo, or Bing).
E = The ranking value given by individuals recognized as experts in the field. For example, an individual recognized as an expert in American Civil War history voting SiteA as of high importance to queries on “Vicksburg battle” would give SiteA a ranking boost of E*40, where E is the ranking given by the expert, inverted so that rank 1 (the first result) = 1 and rank 10,000 (the ten-thousandth result) = 0.0001.
A(1,2,3,4,5) = Aggregates of user rankings, grouped by “trust” tier. A user gains trust as they demonstrate reliability over time. This would be determined by a sub-algorithm considering factors such as (a) how often the user agreed with experts, (b) how often the user agreed with the crawler, and (c) how often the user agreed with other high-trust users.
M = Whether the user has verified their identity and linked a valid credit/banking account to their account. Fines would be imposed on individuals abusing the system using this linked information. Linking to valid monetary funds would not be required but would be an optional means of increasing trust.
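The formula above can be sketched directly in code. This is a minimal illustration: the weights are those given in the formula, E follows the reciprocal-rank convention described for it, and all function and parameter names are hypothetical.

```python
def ranking_score(c, e, a, m):
    """Combined ranking value X for one result on one query, following
    X = (C*15) + (E*40) + ((A1*40) + (A2*30) + (A3*20) + (A4*10) + (A5*5)) + (M*20).

    c: natural (algorithmic) ranking value
    e: expert ranking value (reciprocal rank, so first place -> 1.0)
    a: five aggregate user-vote values, one per trust tier (A1 = most trusted)
    m: 1 if the user has verified identity / linked funds, else 0
    """
    tier_weights = [40, 30, 20, 10, 5]  # A1..A5 weights from the formula
    user_component = sum(w * v for w, v in zip(tier_weights, a))
    return (c * 15) + (e * 40) + user_component + (m * 20)

# Illustrative values only: a mid-ranked natural result, a top expert vote,
# some support from the two highest trust tiers, and a verified user.
score = ranking_score(c=0.5, e=1.0, a=[0.8, 0.2, 0.0, 0.0, 0.0], m=1)
# 0.5*15 + 1.0*40 + (0.8*40 + 0.2*30) + 1*20 = 7.5 + 40 + 38 + 20 = 105.5
```

Note that the expert vote dominates here by design – a single trusted expert outweighs the natural ranking, which matches the spirit of the proposal.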

An Introduction.

Since at least 2003 I have believed there was a better way to do search – and that way is social. I nearly launched a business at that juncture to create such a search engine. I’ve waited eagerly over the years for someone to implement what seems so common-sense to me – only to be repeatedly disappointed.
Google has now killed off SearchWiki. While far from what I envision, it was still a move in the right direction for the company that has always insisted that machines can do it better. I had hoped it was the beginning of a change for Google – but the reversion to stars is devastating. So I wanted to whine… and here it is.
I’m not going to spend all night on it at this juncture…but I may add to this article on occasion. That’s all.

What Would It Take?

Some would suggest that this project would be nigh impossible to complete – certainly impossible against a behemoth like Google. I don’t think so.

  • There is a robust free/open source web search engine currently available called Nutch. It’s actually been around for years (I was looking at it back around 2003 as well).
  • There is also the option of using one of the many discarded web search engines – or getting a larger partner on board like Yahoo! or Microsoft. Wink had something going for a while; Eurekster also looked like it had potential.
  • Matt Wells has demonstrated what can be done on a low budget with web search for ten years now with Gigablast. Think <$10,000 to start for hardware.
  • Hiring “experts” isn’t that hard. For initial seeding one could use educated non-experts (e.g. college students) who are willing to work for a low hourly rate ($10/hr.) but can make intelligent choices between web results.
  • Wikipedia has demonstrated that it is possible to create an open eco-system which remains fairly spam free.

Incentives.

For both businesses and individuals there would be an incentive to play fair, to contribute content, etc.:

  • Businesses would receive “cred” for good submissions/votes, which they could then use to promote their own valid content (they’d lose it quickly if they abused it).
  • Individuals could do the same for their own websites.
  • Pride of ownership and accomplishment would also play a significant role, as seen in Wikipedia, YouTube, and DMOZ.
  • It’d make sense to me to implement revenue sharing (at the “higher” levels of user trust).[1]

For Each userx
    myportion = myrelevantresults + (mytrustlevel * mytrustpoints)
Next

We’d then divide the 25% set aside ($250) by the sum of all users’ myportion values (sumx) and give each user (250 / sumx) * myportion. For example: $250 / 3000 ≈ $0.083 per portion point, so a user with myportion = 60 would receive about $5. Not a lot of cash – but that is a rough guesstimate from a single search query!
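The revenue-sharing split described above can be sketched as follows. This is a rough illustration using made-up numbers chosen to match the worked example (a $250 pool and total portions of 3000); all names are hypothetical.

```python
def revenue_shares(pool, users):
    """Split a revenue pool among contributing users in proportion to
    myportion = myrelevantresults + (mytrustlevel * mytrustpoints),
    per the pseudocode above.

    pool: dollars set aside for user compensation
    users: user id -> (relevant_results, trust_level, trust_points)
    """
    portions = {uid: rel + (lvl * pts) for uid, (rel, lvl, pts) in users.items()}
    total = sum(portions.values())
    # Each user's share is their fraction of the total portion, times the pool.
    return {uid: pool * p / total for uid, p in portions.items()}

# Worked example matching the text: a $250 pool, total portions of 3000,
# and one user whose portion works out to 60 -> 250/3000 * 60 = $5.
shares = revenue_shares(250.0, {
    "alice": (40, 10, 2),    # portion = 40 + 10*2 = 60
    "others": (2940, 0, 0),  # aggregate placeholder for everyone else
})
# shares["alice"] == 5.0
```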

Why Don’t You Do It?

I’m sure some will wonder why I didn’t do it in the past and why I haven’t done it now. Ahh – that is the question. There are numerous contributing factors, both past and present, but the essence comes down to this: I like ideas more than implementation (who doesn’t?)… and, more importantly, I find myself more the aggregator of knowledge than the creator of methods. In other words, to some small extent, I’m a walking search engine – and I would love to input my knowledge into an engine like this… but I am not a skilled developer. I mean – I program, but I’m no Scott Guthrie (or… more in my realm, Corey Palmer, Ash Khan, or Kevin Clough).

Feedback Requested.

I’m open to feedback from anywhere…so speak up. A few individuals who come to mind are Steve Marder, Grant Ryan, and Jason Calacanis.

[1] Consider: we have, say, 500 individuals with level A1, A2, or A3 trust who have voted on at least one result on a query. That query result generates $1,000 in revenue over a month’s time, and 25% is set aside for user compensation. We’d do something like the pseudocode shown under “Incentives” above.