Tag Archives: Search

Diigo – A Great Tool For Readers.

Diigo is a free service that offers higher tiers of features for those who need them. I’ve been paying for the Basic service ($20/year) for the last two or three years – and have no intention of canceling. Someday I might consider upgrading to Premium ($40/year). When I pay for something online it means I really like it (duhh) – and that is the case with Diigo.

Diigo is a “knowledge management” tool. It allows you to highlight text and create sticky notes on webpages; create a library of links, pages, notes, and pictures; archive webpages online for future reference; collaborate with family, friends, or strangers; and so on.

I use Diigo primarily for highlighting. As I’m reading, if I see something memorable I highlight it with Diigo – it is saved to my Diigo account, and I can easily search Diigo to find the reference later when I need it.

I also use Diigo for bookmark management. I moved from Firefox to Chrome as my primary web browser and the one feature I really miss from Firefox is its full-featured bookmark management. Diigo helps ease this pain a bit.

Diigo also makes it easy to categorize and tag one’s links and highlights – a huge help in managing a large amount of information – and it doesn’t forget to provide users with intuitive security controls that let you decide what others can see. For example, there may be some topics you are researching (especially for personal reasons) that you don’t want everyone else to be aware of (e.g. “how to deal with my annoying mother-in-law” [mine isn’t annoying, btw]).

There are other tools out there. Zotero is one I’ve used in the past. It is pretty powerful and more aimed at academics. I’d suggest that these are the two frontrunners currently available. Anyone have any other suggestions? Or a preference between Diigo and Zotero?

 If you want to see my publicly shared highlights, notes, and links you can see them here: https://www.diigo.com/user/davemackey.

A Social Search Engine Proposal.

Overview:

IMHO, the current state of search is depressing. This is not a new realization for me. It was seven or eight years ago that I first imagined a social search engine which would not rely solely on algorithms to determine the relative importance of search results but would consider both machine and end-user feedback. This was in the early days of Nutch, and I began researching the possibility of utilizing Nutch as the underlying core engine for such an endeavor, rounded up some small-scale investment capital, and so on. Unfortunately, this was also at the height of my struggle with Obsessive Compulsive Disorder (OCD), and my efforts eventually fell through.

Over the years I have watched as promising engine after promising engine has come along and, in its turn, failed to take the lead or even maintain its momentum. Years have passed, and at each step of the way I have said, “It must be just around the corner…this is ages in technology time.” Even Google came out with SearchWiki; while not a perfect implementation, it was a huge step in the right direction. For the last year or two I’ve been using Zakta, and I’ve spent time on almost every other social search engine currently (or previously) available – yet I find that, in the long run, they have all failed me.

So here I am, so many years later, longing for just such an engine. I’ve written on this blog about the topic before, but I will write again. In this post I will specifically propose the formation of an endeavor to create a social search engine, and I hope it will foster some interest in the community. I am neither ready nor able to undertake such an endeavor myself – but I am interested in being part of one.

Open Source: Ensuring Continuity

It is worth noting at this juncture that I intend for this project to be open source. Too many times I have lost the social search data I had accumulated because a specific engine folded. My hope is that the resulting project would be open source with commercial implementations and would provide a significant amount of data portability between engines, in case any one engine should fold. We’ll talk more about the open source and portability aspects of the project later in this proposal.

What is Social Search?

Before we jump into a discussion about how to build a social search engine, it is necessary first to define what is meant by social search. Unfortunately, the term social search is used to describe several concepts which are very different from one another.

There are the real-time search engines, which focus on aggregating information from various social media networks – and sometimes on prioritizing links based on their popularity within a network. Examples include Topsy and Scoopler (both now defunct) and the no-longer-real-time OneRiot.

There are also the engines focused on finding humans – e.g. allowing one to garner information about a person. Wink eventually became this sort of engine; Spokeo is another example. They are essentially white pages on steroids.

Finally, there is what I mean by social search – and I would use another term, but there is no other term I am aware of which is so widely used to describe this type of engine (and I want to ensure the widest possible audience). It is sometimes called a “human-powered search engine.”[1] Google and Wikia may have come closest by terming it a “wiki” (SearchWiki and Wikia Search), but it seems to me that there is a need for an entirely new term that better and more precisely defines the idea…perhaps one result of this proposal and its aftermath will be just such a term.[2]

Core Parameters

In this section I will delineate what I believe are the core required features for a social search engine. An engine which included these features would, I believe, constitute a 1.0 release. There is certainly room for numerous improvements beyond that, but this would define a baseline by which to measure the proposal’s progress. I am not infallible, and I am sure there are aspects of the baseline which should be edited, removed, or replaced – I am open to suggestions.

  • Web Crawler – The engine must include a robust web crawler which can index the web, not just a subset of sites (e.g. Nutch).
  • Interpretive Ability – The engine must be able to interpret a wide variety of file formats, minimizing the invisible web (e.g. Tika).
  • Engine – The engine must be able to quickly query the aggregated web index and return results in an efficient manner (e.g. Nutch).
  • Search Interface – The engine must include a powerful search interface for quickly and accurately returning relevant results (e.g. Solr).
  • Scalability – The engine must be scalable to sustain worldwide utilization (e.g. Hadoop).
  • Algorithms – In addition to the standard automated algorithms for page relevance, the system must integrate human-based feedback, including:
    • Positive and negative votes on a per-page basis.
    • The ability to add and remove pages from query results.
    • Influence of votes weighted by a calculation of user trustworthiness (merit) – see the sketch following this list.
    • Promotion of results by administrative users.
  • Custom Results – The results must be customized for the user. While the aggregate influence of users affects the results, the individual user is also able to customize results. One should see a search page which reflects the results one has chosen and not the results one has removed.
    • Ability to annotate individual entries.
  • Portability – The engine should define a standard format for user data which can be exported and imported between engines. This should include customized query results, annotations, votes, removed and added pages, etc. This will be available to the user for export/import at any time. While additional data may be maintained by individual engines, the basic customizations should be portable.
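
To make the merit idea concrete, here is a minimal sketch (my own illustration, not anything defined by Nutch, Solr, or an existing engine) of how merit-weighted votes might be blended with the engine’s own relevance score. The class name, weights, and merit scale are all assumptions made up for the example; a real implementation would also have to handle per-user customization, promoted results, and spam resistance.

    // A hypothetical sketch of merit-weighted result scoring. None of these names
    // come from Nutch or Solr; they exist only to illustrate the idea.
    import java.util.Map;

    public class SocialRanker {

        // How much of the final score comes from the engine's own relevance
        // calculation versus aggregated human feedback (illustrative values).
        private static final double ENGINE_WEIGHT = 0.8;
        private static final double FEEDBACK_WEIGHT = 0.2;

        // Per-user merit: 0.0 for an untrusted newcomer up to 1.0 for a trusted curator.
        private final Map<String, Double> meritByUser;

        public SocialRanker(Map<String, Double> meritByUser) {
            this.meritByUser = meritByUser;
        }

        // Blend the engine's relevance score with merit-weighted votes.
        // votes maps a user id to +1 (promote) or -1 (demote) for one page.
        public double score(double engineScore, Map<String, Integer> votes) {
            double weightedVotes = 0.0;
            double totalMerit = 0.0;
            for (Map.Entry<String, Integer> vote : votes.entrySet()) {
                double merit = meritByUser.getOrDefault(vote.getKey(), 0.1);
                weightedVotes += merit * vote.getValue();
                totalMerit += merit;
            }
            // Normalize feedback to [-1, 1] so a pile of low-merit votes
            // cannot drown out the algorithmic score.
            double feedback = (totalMerit == 0.0) ? 0.0 : weightedVotes / totalMerit;
            return ENGINE_WEIGHT * engineScore + FEEDBACK_WEIGHT * feedback;
        }
    }

In practice this kind of calculation would probably run as a re-ranking pass over the engine’s top results rather than inside the index itself, and the weights would need tuning – but it shows how a vote from a trusted curator can count for more than a vote from a brand-new account without ignoring either.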

I’m sure I’m missing some essentials – if any come to your mind that I have forgotten, please share them with me.

Starting from Zero?

It is not necessary for this project to begin from nothing; significant portions of the work toward creating an open source search engine have already been undertaken – largely by Apache’s Nutch project. The available code should be utilized and, with customization, could integrate social search features. This would allow some of the most significant aspects of the project to be offloaded to already existing projects.
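
As a rough illustration of where that customization could live (again, my own sketch rather than existing Nutch code), the base engine could keep returning its normal ranked list, with a thin re-ranking pass applying the community’s stored, merit-weighted feedback just before results are displayed:

    // Hypothetical re-ranking pass over results returned by the underlying engine.
    // The Result class and the feedback map are assumptions for the example.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    public class ReRankDemo {

        // A bare-bones search result as it might come back from the base engine.
        static final class Result {
            final String url;
            final double engineScore;
            Result(String url, double engineScore) {
                this.url = url;
                this.engineScore = engineScore;
            }
        }

        public static void main(String[] args) {
            // Pretend output from the crawler/index for some query.
            List<Result> fromEngine = new ArrayList<>(List.of(
                    new Result("http://example.com/a", 0.91),
                    new Result("http://example.com/b", 0.88),
                    new Result("http://example.com/c", 0.75)));

            // Stored community feedback, already merit-weighted, in the range [-1, 1].
            Map<String, Double> feedback = Map.of(
                    "http://example.com/b", 0.9,   // heavily promoted by trusted users
                    "http://example.com/a", -0.4); // demoted

            // Blend and re-sort: mostly engine relevance, partly community feedback.
            fromEngine.sort(Comparator.comparingDouble(
                    (Result r) -> 0.8 * r.engineScore
                            + 0.2 * feedback.getOrDefault(r.url, 0.0)).reversed());

            fromEngine.forEach(r -> System.out.println(r.url));
        }
    }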

Additionally, it might be hoped that companies and individuals who have previously created endeavors in this direction would open source their code. For example, Wikia was built on Nutch, and the code – including the distributed crawler (GRUB) and the UI (Wikia) – was released into the open source world.[3]

What We Need

Now the question becomes, “What do we need?” and more importantly, “Whom do we need?”

First off, we could use donated hosting. Perhaps one of the larger cloud-based hosting companies would consider offering us space for a period of time? I’m thinking here of someone like Rackspace, Amazon Web Services, or GoGrid.

Secondly, we’d need developers. I’m not a Java developer…though I’ve downloaded the code and am preparing to jump in. I also don’t have a ton of time – so depending on me to get the development done…well, it could take a while.

Thirdly, we’d need content curators…and I think this is key (and also one of the areas I love the most). We’d need people to edit the content and make the results awesome. These individuals would be “power users” whose influence on results would be more significant than that of a new user. With time, individuals could increase their reputation, but this would seed us with a trusted core of individuals[4] who would ensure that the results returned are high quality right from the get-go for new users.[5]

Finally, we’d need some designers. I’m all for simplicity in search – but goodness knows most of us developers have very limited design abilities and an aesthetic touch here and there would be a huge boon to the endeavor.

Next Steps

At this juncture it’s all about gathering interest: finding projects that have already begun the process, looking for old hidden open source code that may be of use, etc. Leave a comment if you’d like to be part of the discussion.

Appendixes

Current Open Source Search Engines

  • DataparkSearch – GNU GPL, diverged from mnoGoSearch in 2003, coded in C and CGI.
  • Egothor – Open source, written in Java, currently undergoing a complete from-scratch rewrite for version 3.
  • Grubng – Open source, distributed crawler.
  • BeeSeek – Open source, P2P, focuses on user anonymity.
  • Yioop! (SeekQuarry) – GNU GPLv3, documentation is very informative.
  • Heritrix – Open source, by Archive.org for their web archives.
  • Seeks Project – AGPLv3, P2P, fairly impressive project which attempts to take social search into consideration.
  • OpenWebSpider – Open source, written in .NET, appears to be abandoned.
  • Ex-Crawler – Open source, Java, impressive, last release in 2010.
  • Jumper Search – Open source, social search, website appears to be down, currently linking to SF.
  • Open Search Server – Open source.
[1] Or a human search engine, which sadly becomes entangled with the engines meant for finding humans referenced previously.
[2] A few other terms which might be appropriate: collaborative search engine (though this would have to be prefaced with “active” to distinguish it from passive aggregation of user feedback, e.g. how long a user stayed at a site); curation search engine (giving the idea of content curation, though curation is sometimes thought of in terms of archival); or crowd-sourced search engine (though this centers too much on democracy, whereas such engines would probably benefit from a meritocracy).
[3] Unfortunately, I have been unable to find a copy of the Wikia UI code.
[4] Taking a page from early Ask Jeeves history.
[5] Obviously not necessarily in the long tail, but in the general topics.

Zakta Search – Take Two.

Back in June I wrote that I had abandoned Google for Zakta. With the announcement of Facebook and Bing‘s new partnership I figured I ought to recap my enthrallment with Zakta.

Most people have never heard of Zakta – but for the last several months it has been my engine of choice. What makes Zakta so special that it could drag me away from Google? Social search. I mean real social search, not the weak stuff Facebook/Bing are offering.

Zakta offers a number of really awesome features, but the one I really care about is the ability to edit my results. See, I spend all day every day working on computers – it’s my job. I am constantly searching for extremely arcane information. On more than one occasion I’ve driven even Google’s huge indexes to turn up empty. Unfortunately, it doesn’t all revolve around the same topic – so the arcane information I may need today won’t be utilized again for another five months or a year. This is where Zakta comes in handy. I can do a search for “rsFieldReference report item expressions can only refer to fields within the current dataset” or “You can only import binary registry files from within the registry editor” – delete the results that aren’t relevant, thumbs-up and comment on the results that provide the answer – and then months later perform the same or a similar query and immediately know which links are going to be most helpful!

Zakta has something big in the wings – according to the site. I hope they do, because I think it is only a matter of time before one of the “big guys” comes out with a social search engine – and if Zakta hasn’t already established itself well in the marketplace by then, it will be all over.

I have a bunch of ideas for how Zakta could “break out,” but my biggest/core idea centers on increasing the role trusted editors play in search result rankings for everyone – and, in order to get good trusted editors, giving us some stock. 😉 I’d love to be an editor…and help out in other ways…

Okay, that’s the end of my Zakta spiel for now; expect to see more once Zakta releases their upcoming “big upgrade.”