
A Social Search Engine Proposal.

Overview:


IMHO, the current state of search is depressing. This is not a new realization for me. It was seven or eight years ago that I first imagined a social search engine which would not rely solely on algorithms to determine the relative importance of search results but would consider both machine and end-user feedback. This was in the early days of Nutch, and I began researching the possibility of using Nutch as the underlying core engine for such an endeavor, rounded up a small amount of investment capital, and so on. Unfortunately, this was also at the peak of my struggle with Obsessive Compulsive Disorder (OCD), and my efforts eventually fell through.

Over the years I have watched as promising engine after promising engine has come along and, in turn, failed to take the lead or even maintain its momentum. Years have passed, and at each step of the way I have said, “It must be just around the corner…this is ages in technology time.” Even Google came out with SearchWiki; while not a perfect implementation, it was a huge step in the right direction. For the last year or two I’ve been using Zakta, and I’ve spent time on almost every other social search engine currently (or previously) available – yet I find that in the long run they have all failed me.

So here I am, so many years later, still longing for just such an engine. I’ve written about the topic on this blog before, but I will write again. In this post I will specifically propose the formation of an endeavor to create a social search engine, and I hope it will foster some interest in the community. I am neither ready nor able to undertake such an endeavor myself – but I am interested in being part of one.

Open Source: Ensuring Continuity

It is worth noting at this juncture that I intend for this project to be open source. Too many times I have lost the social search data I had accumulated because a specific engine folded. My hope is that the resultant project would be open source, with commercial implementations, and would provide a significant amount of data portability between engines in case one engine should fold. We’ll talk more about the open source and portability aspects of the project later in this proposal.

What is Social Search?

Before we jump into a discussion about how to build a social search engine, it is necessary first to define what is meant by social search. Unfortunately, the term is used to label several concepts which are very different from one another.

There are the real-time search engines which focus on aggregating information from various social media networks – and sometimes prioritizing links based on their popularity within a network. Examples include the now-defunct Topsy, the no-longer-real-time OneRiot, and the also-defunct Scoopler.

There are the engines focused on finding humans – i.e., engines that allow one to gather information about a person. Wink eventually became this sort of engine; Spokeo would be another example. They are essentially white pages on steroids.

Finally, there is what I mean by social search – and I would use another term, but there is no other term I am aware of which is so widely used for this type of engine (and I want to ensure the widest possible audience). It is sometimes called a “human-powered search engine.”[1] Google and Wikia may have come closest by terming it a “wiki” (SearchWiki and Wikia Search), but it seems to me that there is a need for an entirely new term that better and more precisely defines the idea…perhaps one result of this proposal and its aftermath will be just such a term.[2]

Core Parameters

In this section I will delineate what I believe are the core required features for a social search engine. An engine which included these features would, I believe, qualify as a 1.0 release. There is certainly room for numerous improvements, but this defines a baseline by which to measure the proposal’s progress. I am not infallible, and I am sure there are aspects of the baseline which should be edited, removed, or replaced – I am open to suggestions.

  • Web Crawler – The engine must include a robust web crawler which can index the web, not just a subset of sites (e.g. Nutch).
  • Interpretive Ability – The engine must be able to interpret a wide variety of file formats, minimizing the invisible web (e.g. Tika).
  • Engine – The engine must be able to quickly query the aggregated web index and return results in an efficient manner (e.g. Nutch).
  • Search Interface – The engine must include a powerful search interface for quickly and accurately returning relevant results (e.g. Solr).
  • Scalability – The engine must be scalable to sustain worldwide utilization (e.g. Hadoop).
  • Algorithms – In addition to the standard automated algorithms for page relevance the system must integrate human-based feedback including:
    • Positive and negative votes on a per page basis.
    • The ability to add and remove pages from query results.
    • Influence of votes based on a calculation of user trustworthiness (merit).
    • Promotion of results by administrative users.
  • Custom Results – The results must be customized for the user. While the aggregate influence of users affects the results, the individual user is also able to customize results. One should see a search page which reflects the results one has chosen and not the results one has removed.
    • Ability to annotate individual entries.
  • Portability – The engine should define a standard format for user data which can be exported and imported between engines. This should include customized query results, annotations, votes, removed and added pages, etc., and should be available to the user for export/import at any time. While additional data may be maintained by individual engines, the basic customizations should be portable (a rough sketch of what such a format might contain follows this list).
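
To give a sense of what such a portable format might carry, here is a minimal sketch in Java. Every class, record, and field name below is a hypothetical assumption of mine rather than an existing standard; the point is simply the kind of data an export bundle would need to hold.

import java.util.List;

/**
 * A minimal sketch of the user data a portable export format might carry
 * between social search engines. All names and fields are hypothetical.
 */
public class PortableSearchProfile {

    /** A single vote or annotation a user attached to a result for a query. */
    public record ResultCustomization(
            String query,        // the search query the customization applies to
            String url,          // the result page affected
            int vote,            // +1, -1, or 0 (no vote)
            boolean removed,     // user removed this page from the query's results
            boolean added,       // user manually added this page to the query's results
            String annotation) { // free-form note; may be null
    }

    /** The full export: whose data it is, where it came from, and the customizations. */
    public record ExportBundle(
            String userId,
            String sourceEngine,
            List<ResultCustomization> customizations) {
    }

    public static void main(String[] args) {
        ExportBundle bundle = new ExportBundle(
                "user-123",
                "example-engine",
                List.of(new ResultCustomization(
                        "nutch crawler tutorial",
                        "https://nutch.apache.org/",
                        +1, false, false,
                        "Official site; best starting point")));
        // Any JSON or XML serializer could turn this structure into a file the
        // user downloads from one engine and later imports into another.
        System.out.println(bundle);
    }
}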

I’m sure I’m missing some essentials – please share with me whatever essentials I have forgotten that come to your mind.

Starting from Zero?

It is not necessary for this project to begin from nothing; significant portions of the work required to create an open source search engine have already been undertaken – largely by Apache’s Nutch project. The available code should be utilized and, with customization, could integrate social search features. This would allow some of the most significant aspects of the project to be offloaded to already existing projects.

Additionally, it might be hoped that companies and individuals who have previously created endeavors in this direction would open source their code. For example, Wikia Search was built on Nutch, and the code – including the distributed crawler (Grub) and the UI – was released into the open source world.[3]

What We Need

Now the question becomes, “What do we need?” and more importantly, “Whom do we need?”

First off, we could use donated hosting. Perhaps one of the larger cloud-based hosting companies would consider offering us space for a period of time? I’m thinking here of someone like Rackspace, Amazon Web Services, or GoGrid.

Secondly, we’d need developers. I’m not a Java developer…though I’ve downloaded the code and am preparing to jump in. I also don’t have a ton of time – so depending on me to get the development done…well, it could take a while.

Thirdly, we’d need content curators…and I think this is key (and also one of the areas I love the most). We’d need people to edit the content and make the results awesome. These individuals would be “power users” whose influence on results would be more significant than that of a new user. Over time other individuals could increase their reputation, but this would seed us with a trusted core of contributors[4] who would ensure that the results returned are high quality right from the get-go for new users.[5]

Finally, we’d need some designers. I’m all for simplicity in search – but goodness knows most of us developers have very limited design abilities and an aesthetic touch here and there would be a huge boon to the endeavor.

Next Steps

At this juncture it’s all about gathering interest: finding projects that have already begun the process, looking for old hidden open source code that may be of use, etc. Leave a comment if you’d like to be part of the discussion.

Appendixes

Current Open Source Search Engines

  • DataparkSearch – GNU GPL, diverged from mnoGoSearch in 2003, coded in C and CGI.
  • Egothor – Open source, written in Java, currently undergoing a complete from-scratch rewrite for version 3.
  • Grubng – Open source, distributed crawler.
  • BeeSeek – Open source, P2P, focuses on user anonymity.
  • Yioop! (SeekQuarry) – GNU GPLv3, documentation is very informative.
  • Heritrix – Open source, by Archive.org for their web archives.
  • Seeks Project – AGPLv3, P2P, a fairly impressive project which attempts to take social search into consideration.
  • OpenWebSpider – Open source, written in .NET, appears to be abandoned.
  • Ex-Crawler – Open source, Java, impressive, last release in 2010.
  • Jumper Search – Open source, social search, website appears to be down, currently linking to SourceForge.
  • Open Search Server – Open source.
  1. [1]Or a human search engine, a term which sadly becomes entangled with the engines meant for finding humans referenced previously.
  2. [2]A few other terms which might be appropriate are collaborative search engine, though this would have to be prefaced with “active” to distinguish it from passive user feedback aggregation (e.g. how long a user stayed at a site); curation search engine (giving the idea of content curation, but this is sometimes thought of in terms of archival); or crowd-sourced search engine (though this centers too much on democracy, whereas such engines would probably benefit from a meritocracy).
  3. [3]Unfortunately, I have been unable to find a copy of the Wikia UI code.
  4. [4]Taking a page from early Ask Jeeves history.
  5. [5]Obviously not necessarily in the long tail, but in the general topics.

Zakta Search – Take Two.


Back in June I wrote that I had abandoned Google for Zakta. With the announcement of Facebook and Bing‘s new partnership I figured I ought to recap my enthrallment with Zakta.

Most people have never heard of Zakta – but for the last several months they have been my engine of choice. What makes Zakta so special that it could drag me away from Google? Social Search. I mean real social search, not this weak stuff Facebook/Bing are offering.

Zakta offers a number of really awesome features, but the one I really care about is the ability to edit my results. See, I spend all day every day working on computers – it’s my job. I am constantly searching for extremely arcane information. On more than one occasion I’ve managed to make even Google’s huge indexes turn up empty. Unfortunately, it doesn’t all revolve around the same topic – so the arcane information I may need today won’t be needed again for another five months or a year. This is where Zakta comes in handy. I can do a search for “rsFieldReference report item expressions can only refer to fields within the current dataset” or “You can only import binary registry files from within the registry editor” – delete the results that aren’t relevant, thumbs-up and comment on the results that provide the answer – and then months later perform the same or a similar query and immediately know which links are going to be most helpful!

Zakta has something big in the wings – according to the site. I’m hoping they do, because I think it is only a matter of time before one of the “big guys” comes out with a social search engine – and then, if Zakta hasn’t already established itself well in the marketplace, it will be all over.

I have a bunch of ideas for how Zakta could “break out,” but my biggest / core idea centers on increasing the role trusted editors play in search result rankings for everyone – and, in order to get good trusted editors, giving us some stock. 😉 I’d love to be an editor…and help out in other ways…

Okay, that’s the end of my Zakta spiel for now. Expect to see more once Zakta releases their upcoming “big upgrade.”

I’ve Abandoned Google Search for…Zakta!


I’ve been using Zakta for several months now as my primary search engine – and it is about time for me to spread the word through this blog. Google is now my secondary search engine – yes, I still use it, but 95% of my queries are resolved by Zakta.

Why utilize Zakta? Social Search. Google created SearchWiki which could have knocked the social search competition out of the market – but then they abandoned this important project. Zakta is a better, older, more mature engine with features similar to and exceeding SearchWiki.

Here are the killer points for using Zakta as your primary engine:

  • The results displayed to you are sorted not only by algorithmic methods but also by what others like you think about these results. Thus, results which might be on the tenth page of Google but really are the best result will show up on the first page of Zakta.
  • You can remove and reorganize results and when you come back to that query they remain in the same order. This allows you to create a customized resource showing you the pages you like best for a specific topic.
  • Zakta oftentimes offers a “reference” site for a query. This is an encyclopedic source (e.g. Wikipedia) that is considered to provide generally reliable and objective information on a topic.
  • Zakta oftentimes offers a best bet for a search result – this is when they feel confident that your query’s answer will be found on a specific site – saving you from sorting through the results yourself.

Here are the big areas I think Zakta needs to improve in:

  • They sometimes include foreign-language results, especially for more refined queries. They need to return primary-language results only unless an option is enabled to allow foreign-language results. I’m sorry, I can’t read Chinese – that link won’t help me…and I’m sure my Chinese- (Spanish-, French-, Amharic-, etc.) speaking counterparts suffer the same frustration when results appear in English (if they don’t know English – I am always amazed and impressed by the many people who speak two to four different languages!).
  • If you eliminate a result from a search query it disappears, but there is no way to get it back. There needs to be a trash bin for restoring removed results.
  • Sometimes, even if you use quotation marks, Zakta seems to include results where the words aren’t together – or where it is entirely unapparent why the result is in any way related to the query. I should note that, as someone who works in IT and a wannabe polymath, I somewhat regularly manage to get “no results” pages out of Google – so this is not something most users will experience; I seem to always need or want to find arcane information.
  • Zakta allows you to search the other major search engines right through their site – unfortunately, they don’t provide a way to say “include this result in my Zakta results” or “allow me to edit this result.” It needs to be a seamless Zakta interface no matter what search engine I am using. This will likely require a browser add-on – they have one currently, but it doesn’t have this feature.

Google Gadgets Pornographic – Beware!

Google describes its Gadgets as “miniature objects made by Google users like you that offer cool and dynamic content that can be placed on any page on the web.”[1] Common gadgets include to-do lists, currency converters, calculators, and apps that pull in feeds from various sites.

See No Evil

These Gadgets can be used anywhere on the web, in Google Desktop, in Google Docs, and on one’s iGoogle pages. There is a huge gallery of Google Gadgets.[2] I browsed over there tonight looking for some functionality to add to a spreadsheet (I didn’t find it). Imagine my surprise when, browsing through the directory, I selected New and was greeted by a score of pornographic images.

Had I accidentally clicked on another directory? Nope, it was New. Was it possible that someone had just spammed the first page? Nope, the second and third pages revealed similarly disturbing images.

Some may think I’m simply overreacting – that I’ve seen a Sports Illustrated cover and am screaming bloody murder. This is not the case. I’ve become accustomed to flipping past trashy apps in the Google Android Market[3]…this is much worse. I’m talking overtly sexually explicit imagery.

I’m sorry, Google, for airing this in a blog post – but I know that the fastest way to get a reaction on this topic is in this manner. Google, “Do No Evil” – a company which has set high standards in many arenas – needs to step up in this one. I don’t understand how an oversight like this has occurred, but it must be stopped immediately. I know Google can do it – I believe in them. 😉 Join me in calling upon Google to act by sharing this information (or a link to this post) with friends.

Whatever your position on pornography, I hope you can support the need for guarding our children against exposure to explicit images. Google SafeSearch does a fairly good job protecting our children against accidentally stumbling across pornographic materials – why is Google Gadgets not undergoing the same sort of filtering? Is it not as readily available to children? In fact, one does not have to search for the wrong words to turn up explicit materials – simply click “New” and you’ll be overwhelmed with the images.

Struggling with Pornography?

I’m concerned that some who read this post may be struggling with pornography or other sexual addictions and that this post could serve as a stumbling block for them (or you). If this is the case, I want to point you towards a couple resources that may help:

  • XXXChurch – “XXXchurch is designed to bring awareness, openness and accountability to those affected by pornography. We are an online community that tours the world speaking at colleges, churches and community centers. XXXchurch.com exists to help those who are in over their heads with pornography, both consumers and those in the industry.”
  • Porn Again Christian – A free eBook on pornography and masturbation by Mark Driscoll, pastor of Mars Hill Church.

Continue to fight the good fight.

  1. [1]http://www.google.com/webmasters/gadgets/
  2. [2]Working with teenagers I know that the compulsion to look at pornographic images is oftentimes irresistibly strong. Knowing that some teenagers might click through if a link was present, I have decided not to post the link. Anyone who knows how to perform a web search can easily get to the Google Gadgets page – I just don’t want to facilitate such action by those who are already struggling to resist temptation.
  3. [3]Not that the apps in the Google Android Market are acceptable either – especially because of their accessibility to young children.

Ramblings on Search.

A Better Search Engine.

X = (C*15) + (E*40) + ((A1*40) + (A2*30) + (A3*20) + (A4*10) + (A5*5)) + (M*20)

X = The ranking value for any given entry for a given search query.
C = The natural ranking given by automated (algorithmic) search engine analysis (such as that provided by Google, Yahoo, or Bing).
E = The ranking value given by individuals recognized as experts in the field. For example, an individual “recognized” as an expert in American Civil War history voting SiteA as of high importance for queries on “Vicksburg battle” would give SiteA a ranking-value boost of E*40, where E is the ranking given by the expert, inverted so that rank 1 = 1 (first result) and rank 10,000 = 0.0001 (ten-thousandth result).
A(1,2,3,4,5) = Aggregates of user rankings, grouped by trust tier. A user earns trust as they demonstrate reliability over time, as determined by a sub-algorithm that considers factors such as (a) how often the user agreed with experts, (b) how often the user agreed with the crawler, and (c) how often the user agreed with other high-level users.
M = Whether the user has verified their identity and linked a valid credit/banking account to their account. Fines would be imposed on individuals abusing the system using this linked information. Linking to valid monetary funds would not be required but would be an optional means of increasing trust.
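
To make the weighting concrete, here is a minimal sketch of the formula above in Java. The class, method, and parameter names are my own; only the weights and the shape of the calculation come from the formula itself.

/**
 * A minimal sketch of the ranking formula above. Names are mine;
 * the weights (15, 40, 40/30/20/10/5, 20) come from the formula.
 */
public class SocialRank {

    /**
     * @param c natural (algorithmic) ranking value for the page
     * @param e inverted ranking given by recognized experts (1 = top result)
     * @param a aggregated user rankings by trust tier, a[0] = A1 (most trusted) ... a[4] = A5
     * @param m 1.0 if the contributing user has verified identity and linked a valid account
     *          (per the definition of M above), else 0.0
     * @return the combined ranking value X for this entry on this query
     */
    public static double score(double c, double e, double[] a, double m) {
        double[] tierWeights = {40, 30, 20, 10, 5}; // weights for A1..A5
        double aggregate = 0;
        for (int i = 0; i < tierWeights.length; i++) {
            aggregate += a[i] * tierWeights[i];
        }
        return (c * 15) + (e * 40) + aggregate + (m * 20);
    }

    public static void main(String[] args) {
        // Example: a strong expert endorsement outweighs a mediocre crawler rank.
        double x = score(0.2, 1.0, new double[] {0.8, 0.5, 0.1, 0.0, 0.0}, 1.0);
        System.out.println("X = " + x); // 3 + 40 + 49 + 20 = 112
    }
}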

An Introduction.

Since at least 2003 I have believed there was a better way to do search – and that way is social. I nearly launched a business at that juncture to create such a search engine. I’ve waited eagerly over the years for someone to implement what seems like common sense to me – only to be repeatedly disappointed.
Google has now killed off SearchWiki. While far from what I envision, it was still closer – a move in the right direction for the company that has always insisted that machines can do it better. I had hoped it was the beginning of a change for Google – but the reversion to stars is devastating. So, I wanted to whine…and here it is.
I’m not going to spend all night on it at this juncture…but I may add to this article on occasion. That’s all.

What Would It Take?

Some would suggest that this project would be nigh impossible to complete – certainly impossible against a behemoth like Google. I don’t think so.

  • There is a robust, free and open source web search engine currently available called Nutch. It’s actually been around for years (I was looking at it back around 2003 as well).
  • There is also the option of using one of the many discarded web search engines – or getting a larger partner on board like Yahoo! or Microsoft. Wink had something going for a while; Eurekster also looked like it had potential.
  • Matt Wells has demonstrated what can be done on a low budget with web search for ten years now with Gigablast. Think <$10,000 to start for hardware.
  • Hiring “experts” isn’t that hard. For initial seeding one could use educated non-experts (e.g. college students) who are willing to work for a low hourly rate ($10/hr.) but can make intelligent choices between web results.
  • Wikipedia has demonstrated that it is possible to create an open eco-system which remains fairly spam free.

Incentives.

For both businesses and individuals there would be an incentive to play fair, to contribute content, etc.:

  • Businesses would receive “cred” for good submissions/votes which they could then use to promote their own valid content (they’d lose cred quickly if they abused their cred).
  • Individuals could do the same for their own websites.
  • Pride of ownership and accomplishment would also play a significant role, as seen in Wikipedia, YouTube, and DMOZ.
  • It’d make sense to me to implement revenue sharing (at the “higher” levels of user trust).[1]

Why Don’t You Do It?

I’m sure some will wonder why I didn’t do it in the past and why I haven’t done it now. Ahh – that is the question. There are numerous contributing factors, both past and present, but the essence comes down to this: I like ideas more than implementation (who doesn’t?)…and, more importantly, I find myself more the aggregator of knowledge than the creator of methods. In other words, to some small extent, I’m a walking search engine – and I would love to input my knowledge into an engine like this…but I am not a skilled developer. I mean – I program, but I’m no Scott Guthrie (or…more in my realm, Corey Palmer, Ash Khan, or Kevin Clough).

Feedback Requested.

I’m open to feedback from anywhere…so speak up. A few individuals who come to mind are Steve Marder, Grant Ryan, and Jason Calacanis.

  1. [1]Consider: we have, say, 500 individuals with A1, A2, or A3 trust who have voted on at least one result for a given query, and that query generates $1,000 in revenue over a month, with 25% set aside for user compensation. For each user we’d compute myportion = myrelevantresults + (mytrustlevel * mytrustpoints), then divide the $250 pool by the sum of everyone’s myportion (sumx) and pay each user myportion * ($250 / sumx) – e.g. $250 / 3,000 ≈ $0.083 per point, so a user with a myportion of 60 earns about $5. Not a lot of cash – but that is a rough guesstimate for a single search query!
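
For illustration, here is a rough sketch of that payout calculation in Java. The record and field names are my own assumptions; only the 25% pool and the myportion formula come from the footnote above.

import java.util.List;

/**
 * A rough sketch of the revenue-sharing idea in the footnote above.
 * Names are hypothetical; the pool share and portion formula are the author's.
 */
public class RevenueShare {

    /** One contributing user's activity on a given query's results. */
    public record Contributor(String userId, double relevantResults,
                              double trustLevel, double trustPoints) {
        double portion() {
            // myportion = myrelevantresults + (mytrustlevel * mytrustpoints)
            return relevantResults + (trustLevel * trustPoints);
        }
    }

    /** Split the query's revenue pool proportionally to each contributor's portion. */
    public static void payout(double queryRevenue, double poolShare, List<Contributor> users) {
        double pool = queryRevenue * poolShare; // e.g. $1,000 * 0.25 = $250
        double sum = users.stream().mapToDouble(Contributor::portion).sum();
        for (Contributor u : users) {
            double payment = pool * (u.portion() / sum); // dollars per portion-point * points
            System.out.printf("%s earns $%.2f%n", u.userId(), payment);
        }
    }

    public static void main(String[] args) {
        payout(1000.0, 0.25, List.of(
                new Contributor("alice", 20, 2, 20),   // portion = 60
                new Contributor("bob",   10, 1, 30))); // portion = 40
    }
}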